MTH 448/563 Data-Oriented Computing

Fall 2019

Day 15, Wednesday, October 16

class.groupby( teamname from last time ).agg( discuss and work together )

NHTSA Complaints Report 3 reviews

Movie ratings database

Your movie reviews

reviewer table except change first Z to K

movie table except change first Z to r

rating table except change first Z to q

Your movie questions

Whiteboard from last time: https://docs.google.com/spreadsheets/d/12lZirnXrP8uHQuJkk8xLNcrNnraHSahi8A3JLljfQsk/edit?usp=sharing except change first Z to U

SQL

Relational DataBase Management Systems (RDBMSs)

A relation is set of tuples with elements drawn respectively from a list of specified sets. For example, if we have sets R, S, T, then any collection of tuples {(ri, si, ti)}, where each ri ∈ R, si ∈ S, and ti ∈ T, is a relation on those sets.

In databases, we allow tuples to be repeated, so we're generalizing to "multisets" (or "bags") of tuples.

SQL: a "backronym" for "Structured Query Language" ("Simple Query Language"?) (the former being actually a whopping misrepresentation).

SQLite3 and some basic SQL queries

A standalone program: install it.

Re-download the 3 tables, but modify filenames and export format to csv.

sheets = {
'f19_reviewer.xlsx':'1ZJpIeML2f2t6UxxGFzOTOQPM0xtyUgJVVBrW3I9nyjU', # except change first Z to K
'f19_movie.xlsx'   :'1rZAUcQxNTGcwZTnUEI5kZJvDGjLE7V1LvB5Rey5rY1w', # except change first Z to r
'f19_rating.xlsx'  :'108Z2iAlSiq8AQUJCOVa5uj8lkjIpBIPLIr83C41amYw'} # except change first Z to q

import requests
for sheet in sheets:
        url = 'https://docs.google.com/spreadsheets/d/' + sheets[sheet] + '/export?format=xlsx'
        r = requests.get(url)
        if r.status_code == 200: # success
            with open(sheet,'wb') as f:
                f.write(r.content)
            print('Downloaded '+sheet)
        else:
            print('Failed to get '+sheet)

Then run sqlite3:

sqlite3 mydb.sqlite
.mode csv
.import f19_reviewer.csv reviewer
.tables
.import f19_rating.csv rating
.import f19_movie.csv movie
.tables

basic select

select * from reviewer;

nicer formatting in sqlite3:

.header on
.mode column

elaborations of basic select: as, order by, limit

joins

inner join

select * from A join B on A.foo = B.foo;

Remember the default inner join omits rows where data absent in other table.

natural join - inner join on common column(s)

select * from tableA natural join tableB;

outer left join

select * from A outer left join B on A.foo = B.foo;

SQLite does not support right joins, but if you need one, just swap the left and right tables:

select * from B outer left join A  on A.foo = B.foo;

where

select * from A outer left join B  on A.foo = B.foo where A.blah = 'Hey!';

select * from A outer left join B  on A.foo = B.foo where A.blah in (2018,2017,2016);

movie queries revisited in SQL

select * from rating natural join movie where stars>4.5 order by year desc;

Can assign short aliases for tables:

select * from rating r outer left join movie m on m.movie_id = r.movie_id where stars>4.5;

aggregation

To aggregate (e.g. sum, average, count, min, max) by groups, use "GROUP BY"

SELECT a,b,c AGG(d) FROM table GROUP BY a,b;

where AGG stands for one of SUM, COUNT, MAX, AVG, STDEV, etc.

Returns a table with (here) 4 columns. Groups the rows by distinct (a,b), and returns 1 row per group.

If a requested column is neither aggregated nor grouped by, you get a seemingly random choice of the values occurring in the group.

creating VIEWs

CREATE VIEW foo AS SELECT ...

This construct can be used in the way we set variables in most other programming languages.

Unfortunately to re-run code, you'll have to first drop an existing VIEW:

DROP VIEW IF EXISTS foo;

SQLite3 from Python

I find writing complex SQL queries somewhat aggravating because of the lack of constructs in the language that would allow you "divide and conquer" problems.

This can be substantially resolved by using Python to assist in construction of the queries ...

import sqlite3

con = sqlite3.connect('foo.sqlite')
cur = con.cursor()

q = 'select * from movie'
cur.execute( q )

for row in cur.fetchall():
        print( row )

Note: fetchall() "uses up" the results: you can't run it twice.

Using Python to organize construction of compound queries: You can give names to queries and compose them by string concatenation, resulting in nested queries that are actually readable!

q0 = 'select foo from blah where thatthing = blah'
q1 = 'select otherthing from (' + q0 + ')'

column names?

SELECT sql FROM sqlite_master WHERE name='movie';

of query result in python sqlite3?

[c[0] for c in cur.description]

Exercise: Let us redo our movie database queries in SQL.

Notes:

df.columns = df.columns.get_level_values(0)