Day 21

Thursday, April 20, 2017

Pandas, cont'd

Fine-grained histogram

Often, when making histograms we try to find a happy medium between low resolution (too few bins) and noisy bin heights (too many bins). This is appropriate when the data values are believed to be essentially random numbers with a probability density that varies slowly on the scale of bins.

Bin size experiment:

from numpy import *
from histogram1d import histogram1d
import matplotlib.pyplot as pl
N = 10000
x = random.rand(N)

while True:
    n = int(input('n = '))          # number of bins
    if n <= 0:
        break
    print('N/n =', round(N/n, 1),
          'sqrt(N/n) =', round(sqrt(N/n), 1),
          'sqrt(N/n)/(N/n) =', sqrt(N/n)/(N/n))
    lefts, counts = histogram1d(x, 0, 1, n, rawcounts=True)
    w = lefts[1] - lefts[0]         # bin width
    pl.bar(lefts, counts, width=w, align='edge')
    pl.show()


Mean bin occupancy: 1000, 100, 10 respectively (data the same for all 3 cases):

[Figures: binsize10.png, binsize100.png, binsize1000.png]

Quiz: "bin size": What should the average bin occupancy be so that the noise is about 5%?

[The count in one bin is binomial with variance (N/n)(1 - 1/n) ~ N/n, i.e. approximately Poisson with variance equal to its mean N/n. The relative noise is therefore sqrt(N/n)/(N/n) = 1/sqrt(N/n); setting this to 0.05 gives a mean occupancy of N/n = 400.]
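As a quick numerical check (a sketch using numpy.histogram rather than the course's histogram1d helper), the spread of bin counts around the mean occupancy should give a relative noise of about 1/sqrt(N/n):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000
x = rng.random(N)

n = 25                                 # number of bins -> mean occupancy N/n = 400
counts, edges = np.histogram(x, bins=n, range=(0.0, 1.0))

mean_occ = N / n                       # expected count per bin
rel_noise = counts.std() / mean_occ    # observed relative fluctuation

print('expected relative noise ~', 1 / mean_occ**0.5)
print('observed relative noise ~', round(rel_noise, 3))
```

With a mean occupancy of 400, the observed relative fluctuation comes out near the predicted 5%.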

But other times, like the present context of hospital charges, a very fine grain may be appropriate, and large differences in adjacent bar heights can be real and meaningful.

Exercise: Make a very fine-grained histogram of the log10 of the charges for "LIVEBORN" at Maimonides Medical Center.

# s is the hospital-charges dataframe loaded earlier
mmc = s[ s['Facility Name']=='Maimonides Medical Center' ]
mmcbabies = mmc[ mmc['CCS Diagnosis Description'] == 'LIVEBORN' ]
babylogcharges = log10(mmcbabies['Total Charges'])
babymax = babylogcharges.max()
babymin = babylogcharges.min()

from histogram1d import histogram1d
import matplotlib.pyplot as pl
lefts, counts = histogram1d(babylogcharges, babymin, babymax, 10000, rawcounts=True)
w = lefts[1] - lefts[0]
values = lefts + w/2                       # bin centers
x = vstack([values, values])               # each column: one vertical line
y = vstack([zeros_like(counts), counts])
pl.plot(x, y, 'b')                         # one call draws all 10000 lines
pl.show()

Note huge (real) jumps in bin occupancy at the low edge of the clusters.

Relevant observation about add-ons from Michael:


Hint for efficiency: Instead of using "bar", just draw a vertical line of the appropriate height for each bin; this is much faster, especially if done with a single call to "plot". "plot" accepts 2D array arguments for x and/or y, whose columns are treated as separate curves — that is what the vstack construction above sets up.
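A self-contained sketch of that trick, with synthetic counts standing in for the hospital data:

```python
import numpy as np
import matplotlib

matplotlib.use('Agg')          # non-interactive backend so this runs anywhere
import matplotlib.pyplot as pl

rng = np.random.default_rng(1)
counts = rng.poisson(5, size=1000)       # fake bin counts
centers = np.linspace(0.0, 1.0, 1000)    # fake bin centers

# Columns of the 2D arrays are individual curves: here each column is
# a 2-point vertical segment from (c, 0) to (c, count).
x = np.vstack([centers, centers])
y = np.vstack([np.zeros_like(counts), counts])

lines = pl.plot(x, y, 'b')     # one call draws all 1000 segments
print(len(lines))              # one Line2D per column
```

The single plot call returns one Line2D object per column, so all 1000 vertical segments are drawn at once instead of in a Python loop.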



Exercise: a. How many of each kind of Patient Disposition on discharge were there?

b. What fraction of inpatients died in hospital?
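A hedged sketch on toy data (the real dataframe s has a 'Patient Disposition' column; here I assume the in-hospital-death category is labeled 'Expired', which you should check against the actual data):

```python
import pandas

# Toy stand-in for the discharge data
s = pandas.DataFrame({'Patient Disposition':
        ['Home or Self Care', 'Home or Self Care', 'Expired',
         'Home w/ Home Health Services', 'Home or Self Care']})

# a. counts of each disposition
counts = s['Patient Disposition'].value_counts()
print(counts)

# b. fraction of inpatients who died in hospital
frac_died = (s['Patient Disposition'] == 'Expired').mean()
print(frac_died)    # 0.2 on this toy data
```

value_counts() does the whole tally in one call, and the mean of a boolean comparison is the fraction of True values.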


a. What are people diagnosed with? How many with each 'CCS Diagnosis Description'?

b. What about just those with .str.contains('CANCER') ?
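A similarly hedged sketch for the diagnosis counts, again on toy diagnosis strings rather than the real dataframe:

```python
import pandas

s = pandas.DataFrame({'CCS Diagnosis Description':
        ['LIVEBORN', 'CANCER OF BREAST', 'PNEUMONIA',
         'CANCER OF PROSTATE', 'LIVEBORN']})

# a. counts of each diagnosis
print(s['CCS Diagnosis Description'].value_counts())

# b. restrict to diagnoses whose description contains 'CANCER'
cancer = s[s['CCS Diagnosis Description'].str.contains('CANCER')]
print(cancer['CCS Diagnosis Description'].value_counts())
```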

more groupby()

Exercise: Are some counties more expensive than others? Find the median charge by county.
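A minimal groupby sketch on toy data (the column names 'Hospital County' and 'Total Charges' are assumptions to be checked against the real dataframe s):

```python
import pandas

s = pandas.DataFrame({
    'Hospital County': ['Kings', 'Kings', 'Bronx', 'Bronx', 'Bronx'],
    'Total Charges':   [10000.0, 30000.0, 5000.0, 7000.0, 9000.0],
})

# Median charge per county, sorted from most to least expensive
med = ( s.groupby('Hospital County')['Total Charges']
         .median()
         .sort_values(ascending=False) )
print(med)
```

The median is a better summary than the mean here because hospital charges have a long right tail.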

Multiple tables in Pandas

A tiny database to illustrate pandas.merge():

import pandas
rev = pandas.DataFrame([[115,'Hui'],[199,'Xuli'],[24,'Krithika'],[666,'Jonathan']],columns=['rid','name'])
mov = pandas.DataFrame([[100,'The Godfather'],[98,'Spotlight'],[23,'Amelie'],[900,'HP I'],[902,'HP II']], columns=['mid','movie'])
rat = pandas.DataFrame([[115,100,4],[199,98,5],[24,23,4],[666,900,5],[666,902,3.8]],columns=['rid','mid','stars'])

Inner join

pandas.merge( mov, rat, on='mid')
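Running the inner join on the three little tables (re-created here so the block stands alone), then chaining a second merge to attach reviewer names:

```python
import pandas

rev = pandas.DataFrame([[115,'Hui'],[199,'Xuli'],[24,'Krithika'],[666,'Jonathan']],
                       columns=['rid','name'])
mov = pandas.DataFrame([[100,'The Godfather'],[98,'Spotlight'],[23,'Amelie'],
                        [900,'HP I'],[902,'HP II']], columns=['mid','movie'])
rat = pandas.DataFrame([[115,100,4],[199,98,5],[24,23,4],[666,900,5],[666,902,3.8]],
                       columns=['rid','mid','stars'])

inner = pandas.merge(mov, rat, on='mid')   # one row per (movie, rating) pair
print(inner)

# A second merge on 'rid' pulls in the reviewer names
full = pandas.merge(inner, rev, on='rid')
print(full[['name', 'movie', 'stars']])
```

Every rating in rat matches a movie in mov and a reviewer in rev, so both merges keep all five rating rows.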

Outer join

pandas.merge( mov, rat, on='mid', how='left' )   # how can also be 'right' or 'outer'

Reading in a SQLite database

Download our movie database if you no longer have it, and read its tables into pandas dataframes.

import sqlite3
import pandas
con = sqlite3.connect('moviedb_jb.db')
# Get the metadata of the SQLite database
metadata = pandas.read_sql('select * from sqlite_master', con)
#print(type(metadata))
#print(metadata.columns)
#print(metadata.index)
# Get all the tables into a dictionary of dataframes
db = {}
for i, row in metadata.iterrows():
    #print(i, row['type'])
    if row['type'] == 'table':
        db[ row['name'] ] = pandas.read_sql('select * from ' + row['name'], con)
for tablename in db:
    print(tablename, db[tablename].shape)
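The same pattern in a self-contained form, using an in-memory database so it runs without the moviedb file:

```python
import sqlite3
import pandas

con = sqlite3.connect(':memory:')
con.execute('create table movie (mid integer, title text)')
con.executemany('insert into movie values (?, ?)',
                [(100, 'The Godfather'), (98, 'Spotlight')])
con.commit()

# sqlite_master lists every table, index, etc. in the database
metadata = pandas.read_sql('select * from sqlite_master', con)

db = {}
for i, row in metadata.iterrows():
    if row['type'] == 'table':
        db[row['name']] = pandas.read_sql('select * from ' + row['name'], con)

print(sorted(db))            # table names found
print(db['movie'].shape)
```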

Questions we asked about movies

(change first X to Z)

Each group: choose one question, develop code to answer it, and present to the class.

Enter useful things you learn in this google doc:

(Change first x to a z)