MTH 448/563 Data-Oriented Computing

Fall 2019

Day 14, Monday, October 14

NHTSA Complaints Report 3 reviews

Movie ratings database

Your movie reviews

reviewer table except change first Z to K

movie table except change first Z to r

rating table except change first Z to q

To download, copy, paste, and run this code:

sheets = {
'f19_reviewer.xlsx':'1ZJpIeML2f2t6UxxGFzOTOQPM0xtyUgJVVBrW3I9nyjU', # except change first Z to K
'f19_movie.xlsx'   :'1rZAUcQxNTGcwZTnUEI5kZJvDGjLE7V1LvB5Rey5rY1w', # except change first Z to r
'f19_rating.xlsx'  :'108Z2iAlSiq8AQUJCOVa5uj8lkjIpBIPLIr83C41amYw'} # except change first Z to q

import requests
for sheet in sheets:
                 url = 'https://docs.google.com/spreadsheets/d/' + sheets[sheet] + '/export?format=xlsx'
                 r = requests.get(url)
                 with open(sheet,'wb') as f:
                         f.write(r.content)
                 print('Downloaded '+sheet)

Note: Can do a.merge(b) as well as pandas.merge(a,b)

Discussion: Why are databases factored into multiple tables like this?

20191014_155738.jpg

Your movie questions

Quiz: "movie question". A question about the movie reviews you'd like answered. (Please write your question on a single line.)

Whiteboard for today: https://docs.google.com/spreadsheets/d/12lZirnXrP8uHQuJkk8xLNcrNnraHSahi8A3JLljfQsk/edit?usp=sharing except change first Z to U

Enter your team personnel and invent a name for your team.

Coming next ...

SQL

Relational DataBase Management Systems (RDBMSs)

A relation is set of tuples with elements drawn respectively from a list of specified sets. For example, if we have sets R, S, T, then any collection of tuples {(ri, si, ti)}, where each ri ∈ R, si ∈ S, and ti ∈ T, is a relation on those sets.

In databases, we allow tuples to be repeated, so we're generalizing to "multisets" (or "bags") of tuples.

SQL: a "backronym" for "Simple Query Language" or "Structured Query Language" (the latter being actually a whopping misrepresentation).

SQLite3 and some basic SQL queries

A standalone program: install it.

Re-download the 3 tables, but modify filenames and export format to csv.

Then run sqlite3:

sqlite3 mydb.sqlite
.mode csv
.import 448f18_reviewer.csv reviewer
.tables
.import 448f18_rating.csv rating
.import 448f18_movie.csv movie
.tables

joins

inner join

select * from A join B on A.foo = B.foo;

Remember the default inner join omits rows where data absent in other table.

natural join - inner join on common column(s)

select * from tableA natural join tableB;

outer left join

select * from A outer left join B  on A.foo = B.foo;

SQLite does not support right joins, but if you need one, just swap the left and right tables:

select * from B outer left join A  on A.foo = B.foo;

where

select * from A outer left join B  on A.foo = B.foo where A.blah = 'Hey!';

select * from A outer left join B  on A.foo = B.foo where A.blah in (2018,2017,2016);

movie queries revisited in SQL

select * from rating natural join movie where stars>4.5 order by year desc;

Can assign short aliases for tables:

select * from rating r outer left join movie m on m.movie_id = r.movie_id where stars>4.5;

aggregation

To aggregate (e.g. sum, average, count, min, max) by groups, use "GROUP BY"

SELECT a,b,c AGG(d) FROM table GROUP BY a,b;

where AGG stands for one of SUM, COUNT, MAX, AVG, etc.

Returns a table with (here) 4 columns. Groups the rows by distinct (a,b), and returns 1 row per group.

If a requested column is neither aggregated nor grouped by, you get a seemingly random choice of the values occurring in the group.

creating VIEWs

CREATE VIEW foo AS SELECT ...

This construct can be used in the way we set variables in most other programming languages.

Unfortunately to re-run code, you'll have to first drop an existing VIEW:

DROP VIEW IF EXISTS foo;

SQLite3 from Python

I find writing complex SQL queries somewhat aggravating because of the lack of constructs in the language that would allow you "divide and conquer" problems.

This can be substantially resolved by using Python to assist in construction of the queries ...

import sqlite3

con = sqlite3.connect('foo.sqlite')
cur = con.cursor()

q = 'select * from movie'
cur.execute( q )

for row in cur.fetchall():
        print( row )

Using Python to organize construction of compound queries: You can give names to queries and compose them by string concatenation, resulting in nested queries that are actually readable!

q0 = 'select foo from blah where thatthing = blah'
q1 = 'select otherthing from (' + q0 + ')'

column names?

SELECT sql FROM sqlite_master WHERE name='movie';

Notes:

df.columns = df.columns.get_level_values(0)