MTH 448/563 Data-Oriented Computing

Fall 2018

Day 15, Tuesday, Oct 16

Movie review database

In case we need to re-download:

Exercise: Download the three tables to your computer using this code:

sheets = {
'448f18_reviewer.xlsx':'1qFVivZwo43u5xfLss5u0T_Xs3agUZl5jR1Gkmnaov-Y',
'448f18_movie.xlsx'   :'1MnXZfPCX9CHAJIP0Y5VgnF97gXg1f7m-3HHaRZWkhTM',
'448f18_rating.xlsx'  :'1vs4VUHHJ0UxAwCXl1FzBMblVzZKK6tVt2Ym0HI0DOQI'}

for sheet in sheets:
         url = 'https://docs.google.com/spreadsheets/d/' + sheets[sheet] + '/export?format=xlsx'
         r = requests.get(url)
         with open(sheet,'wb') as f:
             f.write(r.content)
         print('Downloaded '+sheet)

Questions asked and answered

Exercise: Let's here some pandas queries you've come up with to answer the questions you posted on the share board except change the 8 to a 7.

SQLite3

sheets = {
'448f18_reviewer.csv':'1qFVivZwo43u5xfLss5u0T_Xs3agUZl5jR1Gkmnaov-Y',
'448f18_movie.csv'   :'1MnXZfPCX9CHAJIP0Y5VgnF97gXg1f7m-3HHaRZWkhTM',
'448f18_rating.csv'  :'1vs4VUHHJ0UxAwCXl1FzBMblVzZKK6tVt2Ym0HI0DOQI'}

for sheet in sheets:
         url = 'https://docs.google.com/spreadsheets/d/' + sheets[sheet] + '/export?format=csv'
         r = requests.get(url)
         with open(sheet,'wb') as f:
             f.write(r.content)
         print('Downloaded '+sheet)
.mode csv
.import 448f18_reviewer.csv reviewer
.tables
.import 448f18_rating.csv rating
.import 448f18_movie.csv movie
.tables

joins

inner join

select * from A join B on A.foo = B.foo;

Careful! Inner join omits rows where data absent in other table.

natural join - inner join on common column(s)

select * from tableA natural join tableB;

outer left join

select * from A outer left join B  on A.foo = B.foo;

SQLite does not support right joins, but if you need one, just swap the left and right tables:

select * from B outer left join A  on A.foo = B.foo;

where

select * from A outer left join B  on A.foo = B.foo where A.blah = 'Hey!';

select * from A outer left join B  on A.foo = B.foo where A.blah in (2018,2017,2016);

movie queries revisited in SQL

select * from rating natural join movie where stars>4.5 order by year desc;

Can assign short aliases for tables:

select * from rating r outer left join movie m on m.movie_id = r.movie_id where stars>4.5;

aggregation

To aggregate (e.g. sum, average, count, min, max) by groups, use "GROUP BY"

SELECT a,b,c AGG(d) FROM table GROUP BY a,b;

where AGG stands for one of SUM, COUNT, MAX, AVG, etc.

Returns a table with (here) 4 columns. Groups the rows by distinct (a,b), and returns 1 row per group.

If a requested column is neither aggregated nor grouped by, you get a seemingly random choice of the values occurring in the group.

creating VIEWs

CREATE VIEW foo AS SELECT ...

This construct can be used in the way we set variables in most other programming languages.

Unfortunately to re-run code, you'll have to first drop an existing VIEW:

DROP VIEW IF EXISTS foo;

SQLite3 from Python

I find writing complex SQL queries somewhat aggravating because of the lack of constructs in the language that would allow you "divide and conquer" problems.

This can be substantially resolved by using Python to assist in construction of the queries ...

import sqlite3

con = sqlite3.connect('foo.sqlite')
cur = con.cursor()

q = 'select * from movie'
cur.execute( q )

for row in cur.fetchall():
        print( row )

Using Python to organize construction of compound queries: You can give names to queries and compose them by string concatenation, resulting in nested queries that are actually readable!

q0 = 'select foo from blah where thatthing = blah'
q1 = 'select otherthing from (' + q0 + ')'

column names?

SELECT sql FROM sqlite_master WHERE name='movie';