Day 21

Thursday, April 20, 2017

Pandas, cont'd

Fine-grained histogram

Often, when making histograms we try to find a happy medium between low resolution (too few bins) and noisy bin heights (too many bins). This is appropriate when the data values are believed to be essentially random numbers with a probability density that varies slowly on the scale of bins.

Bin size experiment: binsize.py

from numpy import *
from histogram1d import histogram1d
import matplotlib.pyplot as pl
N = 10000
x = random.rand(N)
pl.ion()

while True:
        n=int(input('n = '))
        if n<=0: break
        print('N/n = ',round(N/n,1),'sqrt(N/n) = ',round(sqrt(N/n),1),'sqrt(N/n)/(N/n) = ', sqrt(N/n)/(N/n) )
        lefts,counts = histogram1d(x,0,1,n,rawcounts=True)
        w = lefts[1]-lefts[0]
        pl.clf()
        pl.bar(lefts,counts,w,color='#ccccff')
        pl.xlim(0,1)
        pl.draw()

pl.ioff()
input()

Mean bin occupancy: 1000, 100, 10 respectively (data the same for all 3 cases):

binsize10.png binsize100.png binsize1000.png

Quiz: "bin size": What should the average bin occupancy be so that the noise is about 5%?

[Poisson variance is N/n. Binomial variance is (N/n)(1-1/n) ~ N/n.]

But other times, like the present context of hospital charges, a very fine grain may be appropriate, and large differences in adjacent bar heights can be real and meaningful.

Exercise: Make a very fine-grained histogram of the log10 of the charges for "LIVEBORN" at Maimonides Medical Center.

mmc = s[ s['Facility Name']=='Maimonides Medical Center' ]
mmcbabies = mmc[ mmc['CCS Diagnosis Description'] == 'LIVEBORN' ]
babylogcharges = log10(mmcbabies['Total Charges'])
babymax = babylogcharges.max()
babymin = babylogcharges.min()

from histogram1d import histogram1d
import matplotlib.pyplot as pl
lefts, counts = histogram1d(babylogcharges, babymin,babymax, 10000, rawcounts=True)
w = lefts[1]-lefts[0]
values = lefts + w/2
x = vstack([values,values])
y = vstack([zeros_like(counts), counts])
pl.figure(figsize=(12,6))
pl.plot(x,y,'b')
pl.show()
maimonides_babies.png

Note huge (real) jumps in bin occupancy at the low edge of the clusters.

Relevant observation about add-ons from Michael:

hold_baby_40_dollars.png

Hint for efficiency: Instead of using "bar", just draw a vertical line of the appropriate height for each bin, which is much faster, especially if you use a single call to "plot". "plot" accepts 2D array arguments for x and/or y where the columns are the curve data as shown below.

plot_many_curves_one_call_2.png

value_counts()

Exercise: a. How many of each kind of Patient Disposition on discharge were there?

b. What fraction of inpatients died in hospital?

Exercise:

a. What are people diagnosed with? How many with each 'CCS Diagnosis Description'?

b. What about just those with .str.contains('CANCER') ?

more groupby()

Exercise: Are some counties more expensive than others? Find the median charge by county.

http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/

Multiple tables in Pandas

A tiny database to illustrate pandas.merge():

import pandas
rev = pandas.DataFrame([[115,'Hui'],[199,'Xuli'],[24,'Krithika'],[666,'Jonathan']],columns=['rid','name'])
mov = pandas.DataFrame([[100,'The Godfather'],[98,'Spotlight'],[23,'Amelie'],[900,'HP I'],[902,'HP II']], columns=['mid','movie'])
rat = pandas.DataFrame([[115,100,4],[199,98,5],[24,23,4],[666,900,5],[666,902,3.8]],columns=['rid','mid','stars'])

Inner join

pandas.merge( mov, rat, on='mid')

Outer join

how = 'left'

Reading in a SQLite database

Download our movie database if you don't still have it, and read the tables in it into pandas dataframes.

import sqlite3
con = sqlite3.connect('moviedb_jb.db')
# Get the metadata of the SQLite database
metadata = pandas.io.sql.read_sql('select * from sqlite_master', con)
#print type(metadata)
#print metadata.columns
#print metadata.index
# Get all the tables into a dictionary
db = {}
for i,row in metadata.iterrows():
         #print i,row['type']
         if row['type']=='table':
             db[ row['name'] ] = pandas.io.sql.read_sql('select * from ' + row['name'], con)
for tablename in db:
         print(tablename)

Questions we asked about movies

https://docs.google.com/document/d/1ROIGH32bXHTVWBWF2v17ur4t7D32LA4tfhDTQZ8y6U8

(change first X to Z)

Each group choose one, develop code to answer it, present to class.

Enter useful things you learn in this google doc:

https://docs.google.com/document/d/11UvTNvxmPpHuL1jubcK-owFrIGybk_rJSBE-rrAQJdo

(Change first x to a z)