SQL Quiz Solutions

First a little prep for submitting queries to SQlite3 and displaying results

In [116]:
import sqlite3
import pandas
pandas.set_option('display.max_rows', 200)

def run(q):
    cur.execute(q)
    return pandas.DataFrame(cur.fetchall(),columns=[d[0] for d in cur.description])  

# just using Pandas for nice tabular display of results, nothing more

from IPython.display import Image

Part 1: single-table queries

Here is a SQLite database containing preliminary 2016 presidential election results by county: pres.db. (Surprisingly I was unable to find final official totals by county, which were somewhat larger.)

In [86]:
con = sqlite3.connect('pres.db')
cur = con.cursor()

1. What is wrong with data for the state of Alaska (AK)?

Well let's take a look at the table and see ...

In [87]:
q = 'select * from pres limit 10'
run(q)
Out[87]:
unused votes_dem votes_gop total_votes per_dem per_gop diff per_point_diff state_abbr county_name combined_fips
0 0 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2013
1 1 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2016
2 2 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2020
3 3 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2050
4 4 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2060
5 5 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2068
6 6 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2070
7 7 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2090
8 8 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2100
9 9 93003 130413 246588 0.37715947248 0.528870018006 37,410 15.17% AK Alaska 2105

It appears that the same result is replicated for every FIPS code in Alaska. Ignoring this will therefore result in overcounting Alaska's votes.

2. What was the total number of votes cast for Clinton (dem) and for Trump (gop)?

In [92]:
q = 'select sum(votes_dem) as Clinton from pres where state_abbr != "AK"'
run(q)
Out[92]:
Clinton
0 62428732
In [91]:
q = 'select sum(votes_gop) as Trump from pres where state_abbr != "AK"'
run(q)
Out[91]:
Trump
0 61064843

So, at this stage of the count, Clinton had 62.4 million and Trump about 61.1 million.

3. How many counties did each candidate win?¶

In [93]:
q = 'select count(combined_FIPS) as Clinton_counties from pres where votes_dem > votes_gop and state_abbr != "AK"'
run(q)
Out[93]:
Clinton_counties
0 487
In [95]:
q = 'select count(combined_FIPS) as Trump_counties from pres where votes_dem < votes_gop and state_abbr != "AK"'
run(q)
Out[95]:
Trump_counties
0 2625

So Trump won the vast majority of counties. Clinton nevertheless winning the popular vote is possible due to Trump winning low-population (rural) counties and Clinton winning high-population (urban) ones.

4. (a) In which county or counties did Trump (gop) have the greatest percentage victory, and how many votes did each candidate get in those counties?

Be careful that in SQLite 9/10 = 0. You can convert an int to a float by multiplying by 1.0.

In [113]:
maxperc = 'select max((votes_gop-votes_dem)*1.0/total_votes) as maxperc from pres where state_abbr != "AK"'
run(maxperc)
Out[113]:
maxperc
0 0.916364
In [114]:
q = 'select county_name, state_abbr, votes_gop, votes_dem, (votes_gop-votes_dem)*1.0/total_votes as gop_perc_margin'
q += ' from pres where gop_perc_margin=(' + maxperc + ') and state_abbr != "AK"'
print(q)
run(q)
select county_name, state_abbr, votes_gop, votes_dem, (votes_gop-votes_dem)*1.0/total_votes as gop_perc_margin from pres where gop_perc_margin=(select max((votes_gop-votes_dem)*1.0/total_votes) as maxperc from pres where state_abbr != "AK") and state_abbr != "AK"
Out[114]:
county_name state_abbr votes_gop votes_dem gop_perc_margin
0 Roberts County TX 524 20 0.916364

Answer: In (only) Roberts County, TX, where he got 524 votes to Clinton's 20.

In [118]:
Image('roberts_county_tx_sv2.png')
Out[118]:

(b) In which county or counties did Clinton (dem) have the greatest percentage victory, and how many votes did each candidate get in those counties?

In [111]:
maxperc = 'select max((votes_dem-votes_gop)*1.0/total_votes) as maxperc from pres where state_abbr != "AK"'
q  = 'select county_name, state_abbr, votes_gop, votes_dem, (votes_dem-votes_gop)*1.0/total_votes as dem_perc_margin'
q += ' from pres where dem_perc_margin=(' + maxperc + ')'
print(q)
run(q)
select county_name, state_abbr, votes_gop, votes_dem, (votes_dem-votes_gop)*1.0/total_votes as dem_perc_margin from pres where dem_perc_margin=(select max((votes_dem-votes_gop)*1.0/total_votes) as maxperc from pres where state_abbr != "AK")
Out[111]:
county_name state_abbr votes_gop votes_dem dem_perc_margin
0 District of Columbia DC 11553 260223 0.887245

Answer: In (only) Washington DC, Clinton got 260,223 votes to Trump's 11,553, that being about 96% of the votes for the two major parties.

(c) Which county had the largest number of voters and how many votes did each candidate get in that county?

In [11]:
maxvotes = 'select max(total_votes) as maxvotes from pres'
q = 'select county_name, state_abbr, votes_gop, votes_dem, total_votes from pres where total_votes=(' + maxvotes + ')'
print(q)
run(q)
select county_name, state_abbr, votes_gop, votes_dem, total_votes from pres where total_votes=(select max(total_votes) as maxvotes from pres)
Out[11]:
county_name state_abbr votes_gop votes_dem total_votes
0 Los Angeles County CA 620285 1893770 2652072

Answer: Los Angeles county CA had over 2.6 million voters. Clinton got 1,893,770 to Trump's 620,285.

Part 2: multiple-table data

The following questions pertain to this movie review database moviedb_quiz.db.

In [123]:
con = sqlite3.connect('moviedb_quiz.db')
cur = con.cursor()

First check the tables so we're sure what we're dealing with.

In [41]:
q = 'select * from sqlite_master where type="table"'
run(q)
Out[41]:
type name tbl_name rootpage sql
0 table Reviewer Reviewer 2 CREATE TABLE Reviewer(rID int UNIQUE, name text)
1 table Movie Movie 4 CREATE TABLE Movie(mID int UNIQUE, title text,...
2 table Rating Rating 6 CREATE TABLE Rating(rID int, mID int, stars in...
In [37]:
q = 'select * from movie limit 3'
run(q)
Out[37]:
mID title year director
0 56 Captain America-The First Avenger 2011 Joe Johnson
1 1 Birdman 2014 Alejandro González Iñárritu
2 2 Pulp Fiction 1994 Quentin Tarantino
In [38]:
q = 'select * from rating limit 3'
run(q)
Out[38]:
rID mID stars ratingDate
0 17 56 4.8 2017-03-07
1 100 1 5.0 2017-03-07
2 666 3003 5.0 2017-03-07
In [39]:
q = 'select * from reviewer limit 3'
run(q)
Out[39]:
rID name
0 17 Adhish
1 184 Sanjeevani Choudhery
2 100 John Ringland

5. Make a list of movies (titles) with their average ratings, ordered from best to worst, and showing only the top 20.

In [44]:
averages  = 'select movie.title as title, avg(stars) as avstars from rating,movie'
averages += ' on rating.mid=movie.mid group by movie.mid order by avstars desc, title limit 20'
print(averages)
run(averages)
select movie.title as title, avg(stars) as avstars from rating,movie on rating.mid=movie.mid group by movie.mid order by avstars desc, title limit 20
Out[44]:
title avstars
0 Batman: Mask of the Phantasm 5.000000
1 Birdman 5.000000
2 Fight Club 5.000000
3 Frozen 5.000000
4 Gattaca 5.000000
5 Get Out 5.000000
6 Hidden Figures 5.000000
7 Jurassic World 5.000000
8 Lion 5.000000
9 Red Cliff 5.000000
10 Slumdog Millionaire 5.000000
11 Spotlight 5.000000
12 Teenage Mutant Ninja Turtles 5.000000
13 The Village 5.000000
14 The legend of 1900 5.000000
15 Twilight 5.000000
16 Bee Movie 4.950000
17 Pulp Fiction 4.666667
18 Harry Potter and Sorcerers Stone 4.600000
19 The Shining 4.500000

6. Make a list of all of people in the reviewer table along with the number of reviews they did.

Briefly say why it is important to use an outer join for this, rather than the default inner join.

Preliminary query that avoids any subtleties of joins:

In [54]:
counts = 'select rid, count(mid) as n from rating group by rid order by rid'
run(counts)
Out[54]:
rID n
0 2 4
1 3 2
2 12 1
3 13 3
4 17 2
5 20 2
6 24 1
7 67 4
8 97 2
9 100 10
10 101 1
11 115 1
12 120 1
13 164 1
14 199 1
15 410 2
16 419 3
17 425 2
18 444 1
19 466 1
20 496 1
21 536 1
22 542 1
23 555 7
24 666 7
25 999 5
In [ ]:
So we have 26 people who reviewed at least one movie.
In [55]:
q = 'select rid from reviewer order by rid'
run(q)
Out[55]:
rID
0 2
1 3
2 13
3 17
4 20
5 24
6 29
7 67
8 97
9 100
10 101
11 115
12 120
13 132
14 155
15 164
16 184
17 199
18 410
19 419
20 425
21 444
22 466
23 496
24 536
25 542
26 555
27 666
28 999

and 29 people listed in the reviewer table.

In [61]:
revrids = 'select distinct(rid) as revrid from reviewer'
ratrids = 'select distinct(rid) as ratrid from rating'
q = ratrids + ' where ratrid not in (' + revrids + ')'
print(q)
run(q)
select distinct(rid) as ratrid from rating where ratrid not in (select distinct(rid) as revrid from reviewer)
Out[61]:
ratrid
0 12
In [62]:
revrids = 'select distinct(rid) as revrid from reviewer'
ratrids = 'select distinct(rid) as ratrid from rating'
q = revrids + ' where revrid not in (' + ratrids + ')'
print(q)
run(q)
select distinct(rid) as revrid from reviewer where revrid not in (select distinct(rid) as ratrid from rating)
Out[62]:
revrid
0 29
1 132
2 155
3 184

So it's a bit complicated. We have 4 people listed in reviewer who did not review any movie. And we have 1 rid in rating that does not correspond to a row in reviewer.

In any case, here's my answer:

In [127]:
q = 'select name, count(stars) as "number of reviews" from reviewer rev outer left join rating on rev.rid=rating.rid'
q += ' group by rev.rid order by "number of reviews" desc, name'
print(q)
run(q)
select name, count(stars) as "number of reviews" from reviewer rev outer left join rating on rev.rid=rating.rid group by rev.rid order by "number of reviews" desc, name
Out[127]:
name number of reviews
0 John Ringland 10
1 Andrew Forthman 7
2 Jonathan Boatrite 7
3 Jonathan Lottes 5
4 Aishani Bhalla 4
5 Anthony Taboni 4
6 Meg Arnold 3
7 Samuel Siegart 3
8 Adhish 2
9 Anna Kwietniewski 2
10 Anonymous Chupacabera 2
11 Daniel Seaberg 2
12 Maggie 2
13 Robert Kazmierczak 2
14 Adi 1
15 Dwayne The Rock Johnson 1
16 Edward Crow 1
17 Hui Duan 1
18 Kasparov 1
19 Krithika Krishnan 1
20 Megan Tuskes 1
21 Mia Sette 1
22 Prashant Shekhar 1
23 Xuli Shen 1
24 Yiqing Zhong 1
25 Hedy Chen 0
26 Sakar Sawarkar 0
27 Sanjeevani Choudhery 0
28 batman897 0

If we were to use the default inner join, we miss the 4 people who did not review any movie:

In [130]:
### INCORRECT USE OF INNER JOIN ###
q = 'select name, count(stars) as "number of reviews" from reviewer rev,rating on rev.rid=rating.rid'
q += ' group by rev.rid order by "number of reviews" desc, name'
print(q)
run(q)
select name, count(stars) as "number of reviews" from reviewer rev,rating on rev.rid=rating.rid group by rev.rid order by "number of reviews" desc, name
Out[130]:
name number of reviews
0 John Ringland 10
1 Andrew Forthman 7
2 Jonathan Boatrite 7
3 Jonathan Lottes 5
4 Aishani Bhalla 4
5 Anthony Taboni 4
6 Meg Arnold 3
7 Samuel Siegart 3
8 Adhish 2
9 Anna Kwietniewski 2
10 Anonymous Chupacabera 2
11 Daniel Seaberg 2
12 Maggie 2
13 Robert Kazmierczak 2
14 Adi 1
15 Dwayne The Rock Johnson 1
16 Edward Crow 1
17 Hui Duan 1
18 Kasparov 1
19 Krithika Krishnan 1
20 Megan Tuskes 1
21 Mia Sette 1
22 Prashant Shekhar 1
23 Xuli Shen 1
24 Yiqing Zhong 1

Alternative solution provided by a student

In [136]:
q = 'select r1.name, (select count(*) from rating ra where ra.rID=r1.rID) as num_revs from reviewer r1 order by num_revs desc,r1.name'
run(q)
Out[136]:
name num_revs
0 John Ringland 10
1 Andrew Forthman 7
2 Jonathan Boatrite 7
3 Jonathan Lottes 5
4 Aishani Bhalla 4
5 Anthony Taboni 4
6 Meg Arnold 3
7 Samuel Siegart 3
8 Adhish 2
9 Anna Kwietniewski 2
10 Anonymous Chupacabera 2
11 Daniel Seaberg 2
12 Maggie 2
13 Robert Kazmierczak 2
14 Adi 1
15 Dwayne The Rock Johnson 1
16 Edward Crow 1
17 Hui Duan 1
18 Kasparov 1
19 Krithika Krishnan 1
20 Megan Tuskes 1
21 Mia Sette 1
22 Prashant Shekhar 1
23 Xuli Shen 1
24 Yiqing Zhong 1
25 Hedy Chen 0
26 Sakar Sawarkar 0
27 Sanjeevani Choudhery 0
28 batman897 0

7. Which movies from the 1990s had an average rating less than 3.5 stars?

In [74]:
badavg = 'select mid, avg(stars) as avstars from rating group by mid having avstars<3.5'
q = 'select title,avstars from movie natural join (' + badavg + ') where year like "199%"'
run(q)
Out[74]:
title avstars
0 The Lost World: Jurassic Park 1.666667
1 Mulan 3.000000

8. Which movie(s) had the largest variation in their ratings (and what where the worst and best ratings of those movies)?

In [84]:
vars = 'select mid, max(stars)-min(stars) as variation from rating group by mid'
maxvar = 'select max(variation) as maxvar from (' + vars + ')'
mostvariable = 'select mid, max(stars) as maxstars, min(stars) as minstars from rating natural join (' + vars + ') where variation=(' + maxvar + ')'
q = 'select title, maxstars, minstars from (' + mostvariable + ') natural join movie'
run(q)
Out[84]:
title maxstars minstars
0 Star Wars: Episode IV - A New Hope 5 0

This would have captured all movies with the largest variation if there had been more than one.

9. (Extra credit.) What are all pairs of reviewers that reviewed a movie in common?

In [85]:
pairs = 'select r1.rid as r1id, r2.rid as r2id, r1.mid as mid from rating r1, rating r2 where r1.mid=r2.mid and r1.rid<r2.rid'
q = 'select rev1.name as a, rev2.name as b, movie.title as "a movie in common" from (' + pairs + \
') pairs, reviewer rev1, reviewer rev2, movie where pairs.r1id=rev1.rid and pairs.r2id=rev2.rid and pairs.mid=movie.mid group by a,b order by rev1.name, rev2.name'
run(q)
Out[85]:
a b a movie in common
0 Adhish Samuel Siegart Captain America-The First Avenger
1 Adi Jonathan Boatrite Harry Potter and Sorcerers Stone
2 Aishani Bhalla Jonathan Lottes The Lost World: Jurassic Park
3 Andrew Forthman Jonathan Lottes Transformers:Revenge of the Fallen
4 Anna Kwietniewski Andrew Forthman Star Wars: Episode IV - A New Hope
5 Anna Kwietniewski Daniel Seaberg Star Wars: Episode IV - A New Hope
6 Anna Kwietniewski John Ringland Star Wars: Episode IV - A New Hope
7 Anna Kwietniewski Robert Kazmierczak Star Wars: Episode IV - A New Hope
8 Anonymous Chupacabera Aishani Bhalla Hidden Figures
9 Anthony Taboni Aishani Bhalla The Lost World: Jurassic Park
10 Anthony Taboni Andrew Forthman Jurassic Park
11 Anthony Taboni Jonathan Lottes The Lost World: Jurassic Park
12 Daniel Seaberg Andrew Forthman Star Wars: Episode IV - A New Hope
13 Daniel Seaberg John Ringland Star Wars: Episode IV - A New Hope
14 Daniel Seaberg Jonathan Lottes Jurassic World
15 Daniel Seaberg Robert Kazmierczak Star Wars: Episode IV - A New Hope
16 Dwayne The Rock Johnson Yiqing Zhong Get Out
17 Edward Crow Andrew Forthman Pulp Fiction
18 John Ringland Adi Harry Potter and Sorcerers Stone
19 John Ringland Andrew Forthman Star Wars: Episode IV - A New Hope
20 John Ringland Jonathan Boatrite Harry Potter and Chamber of Secrets
21 John Ringland Robert Kazmierczak Star Wars: Episode IV - A New Hope
22 John Ringland Samuel Siegart 2001: A Space Odyssey
23 Maggie Andrew Forthman Cars
24 Meg Arnold John Ringland Harry Potter and Chamber of Secrets
25 Meg Arnold Jonathan Boatrite Harry Potter and Chamber of Secrets
26 Meg Arnold Kasparov The Shining
27 Robert Kazmierczak Andrew Forthman Star Wars: Episode IV - A New Hope
28 Robert Kazmierczak Samuel Siegart 2001: A Space Odyssey
In [ ]:

In [ ]:

In [ ]: