# Day 21

Thursday, April 20, 2017

# Pandas, cont'd

## Fine-grained histogram

Often, when making histograms we try to find a happy medium between low resolution (too few bins) and noisy bin heights (too many bins). This is appropriate when the data values are believed to be essentially random numbers with a probability density that varies slowly on the scale of bins.

Bin size experiment: binsize.py

from numpy import *
from histogram1d import histogram1d
import matplotlib.pyplot as pl

N = 10000
x = random.rand(N)
pl.ion()

while True:
    n = int(input('n = '))
    if n <= 0: break
    print('N/n =', round(N/n,1), 'sqrt(N/n) =', round(sqrt(N/n),1),
          'sqrt(N/n)/(N/n) =', sqrt(N/n)/(N/n))
    lefts, counts = histogram1d(x, 0, 1, n, rawcounts=True)
    w = lefts[1] - lefts[0]
    pl.clf()
    pl.bar(lefts, counts, w, color='#ccccff')
    pl.xlim(0, 1)
    pl.draw()

pl.ioff()
input()


Example runs give mean bin occupancy 1000, 100, and 10, respectively (the data are the same in all 3 cases).

Quiz: "bin size": What should the average bin occupancy be so that the noise is about 5%?

[Poisson variance is N/n. Binomial variance is (N/n)(1-1/n) ~ N/n.]
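A sketch of the arithmetic behind the quiz (not necessarily the intended official answer): with mean bin occupancy m = N/n and Poisson-distributed counts, the relative fluctuation of a bin is sqrt(m)/m = 1/sqrt(m), so 5% noise requires m = 400.

```python
from math import sqrt

# relative noise of a bin with mean occupancy m, assuming Poisson counts
def relative_noise(m):
    return sqrt(m) / m          # = 1/sqrt(m)

print(relative_noise(400))      # 0.05 -- 5% noise at mean occupancy 400
print(relative_noise(100))      # 0.1  -- 10% noise at mean occupancy 100
```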

But at other times, as in the present context of hospital charges, a very fine grain may be appropriate, and large differences in adjacent bar heights can be real and meaningful.

Exercise: Make a very fine-grained histogram of the log10 of the charges for "LIVEBORN" at Maimonides Medical Center.

from numpy import *

# s is the SPARCS hospital-charges dataframe loaded earlier
mmc = s[ s['Facility Name']=='Maimonides Medical Center' ]
mmcbabies = mmc[ mmc['CCS Diagnosis Description']=='LIVEBORN' ]
babylogcharges = log10(mmcbabies['Total Charges'])
babymax = babylogcharges.max()
babymin = babylogcharges.min()

from numpy import *
from histogram1d import histogram1d
import matplotlib.pyplot as pl

lefts, counts = histogram1d(babylogcharges, babymin, babymax, 10000, rawcounts=True)
w = lefts[1] - lefts[0]
values = lefts + w/2                        # bin centers
x = vstack([values, values])                # each column is one vertical line
y = vstack([zeros_like(counts), counts])    # from 0 up to each bin count
pl.figure(figsize=(12,6))
pl.plot(x, y, 'b')
pl.show()


Note huge (real) jumps in bin occupancy at the low edge of the clusters.

Hint for efficiency: Instead of using "bar", just draw a vertical line of the appropriate height for each bin, which is much faster, especially if you use a single call to "plot". "plot" accepts 2D array arguments for x and/or y, where the columns are the curve data, as shown above.
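A tiny self-contained demonstration of that "plot" behavior, using invented counts rather than the hospital data: each column of the 2D x and y arrays becomes its own line, so a single call draws every vertical segment.

```python
from numpy import vstack, zeros_like, arange
import matplotlib
matplotlib.use('Agg')        # non-interactive backend for this sketch
import matplotlib.pyplot as pl

counts = arange(1, 6)                       # pretend bin counts: 1..5
values = arange(5) + 0.5                    # pretend bin centers
x = vstack([values, values])                # column k: x data of line k
y = vstack([zeros_like(counts), counts])    # column k: 0 up to counts[k]
lines = pl.plot(x, y, 'b')                  # one call draws all 5 lines
print(len(lines))                           # 5
```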

## value_counts()

Exercise: a. How many of each kind of Patient Disposition on discharge were there?

b. What fraction of inpatients died in hospital?

Exercise:

a. What are people diagnosed with? How many with each 'CCS Diagnosis Description'?

b. What about just those with .str.contains('CANCER') ?
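The value_counts() pattern for both exercises can be sketched on a made-up miniature table (the column names match the SPARCS data, but the rows here are invented, so the numbers mean nothing):

```python
import pandas

# tiny invented stand-in for the SPARCS dataframe s
s = pandas.DataFrame({
    'Patient Disposition': ['Home or Self Care', 'Expired',
                            'Home or Self Care', 'Expired'],
    'CCS Diagnosis Description': ['LIVEBORN', 'CANCER OF BREAST',
                                  'PNEUMONIA', 'CANCER OF LUNG'],
})

disp = s['Patient Disposition'].value_counts()   # counts per disposition, descending
print(disp)
print(disp['Expired'] / len(s))                  # fraction who died: 0.5

# just the cancer diagnoses
cancer = s[ s['CCS Diagnosis Description'].str.contains('CANCER') ]
print(cancer['CCS Diagnosis Description'].value_counts())
```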

## more groupby()

Exercise: Are some counties more expensive than others? Find the median charge by county.

http://wesmckinney.com/blog/groupby-fu-improvements-in-grouping-and-aggregating-data-in-pandas/
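The groupby pattern for the county exercise, sketched on an invented miniature table (the column names are assumed to match the SPARCS data; the rows and charges are made up):

```python
import pandas

# invented miniature version of the charges table
s = pandas.DataFrame({
    'Hospital County': ['Kings', 'Kings', 'Bronx', 'Bronx', 'Bronx'],
    'Total Charges':   [10000.0, 30000.0, 5000.0, 7000.0, 9000.0],
})

# median charge by county, most expensive first
medians = s.groupby('Hospital County')['Total Charges'].median()
print(medians.sort_values(ascending=False))
```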

## Multiple tables in Pandas

A tiny database to illustrate pandas.merge():

import pandas
rev = pandas.DataFrame([[115,'Hui'],[199,'Xuli'],[24,'Krithika'],[666,'Jonathan']],columns=['rid','name'])
mov = pandas.DataFrame([[100,'The Godfather'],[98,'Spotlight'],[23,'Amelie'],[900,'HP I'],[902,'HP II']], columns=['mid','movie'])
rat = pandas.DataFrame([[115,100,4],[199,98,5],[24,23,4],[666,900,5],[666,902,3.8]],columns=['rid','mid','stars'])


### Inner join

pandas.merge( mov, rat, on='mid')
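Chaining merge() attaches the reviewer names as well; this sketch just re-runs the tiny tables from above to link all three:

```python
import pandas

rev = pandas.DataFrame([[115,'Hui'],[199,'Xuli'],[24,'Krithika'],[666,'Jonathan']],
                       columns=['rid','name'])
mov = pandas.DataFrame([[100,'The Godfather'],[98,'Spotlight'],[23,'Amelie'],
                        [900,'HP I'],[902,'HP II']], columns=['mid','movie'])
rat = pandas.DataFrame([[115,100,4],[199,98,5],[24,23,4],[666,900,5],[666,902,3.8]],
                       columns=['rid','mid','stars'])

# join ratings to movies on mid, then to reviewers on rid
full = pandas.merge(pandas.merge(mov, rat, on='mid'), rev, on='rid')
print(full)     # 5 rows: every rating joined to its movie and its reviewer
```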


### Outer join

pandas.merge( mov, rat, on='mid', how='left')   # also how='right', how='outer'
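In the tiny database above every movie happens to have a rating, so the left join looks just like the inner join. To see the difference, this sketch adds a hypothetical unrated movie (mid 999, not in the course's tables), and the left join keeps it with NaN stars:

```python
import pandas

mov = pandas.DataFrame([[100,'The Godfather'],[999,'Unrated Film']],
                       columns=['mid','movie'])
rat = pandas.DataFrame([[115,100,4]], columns=['rid','mid','stars'])

left = pandas.merge(mov, rat, on='mid', how='left')
print(left)                           # 'Unrated Film' row kept, stars is NaN
print(left['stars'].isnull().sum())   # 1
```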


### Reading in a SQLite database

Download our movie database if you don't still have it, and read the tables in it into pandas dataframes.

import sqlite3
con = sqlite3.connect('moviedb_jb.db')
con.row_factory = sqlite3.Row   # rows behave like dicts: row['type'], row['name']
# Get the metadata of the SQLite database
cur = con.execute('select * from sqlite_master')
# Get all the tables into a dictionary of dataframes
db = {}
for i, row in enumerate(cur):
    #print(i, row['type'])
    if row['type']=='table':
        db[ row['name'] ] = pandas.io.sql.read_sql('select * from ' + row['name'], con)

for tablename in db:
    print(tablename)
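The same table-discovery loop can be tried without the course database by building a throwaway in-memory one (the table and its contents here are invented):

```python
import sqlite3
import pandas

con = sqlite3.connect(':memory:')    # throwaway in-memory database
con.row_factory = sqlite3.Row
con.execute('create table demo (x integer, y text)')
con.execute("insert into demo values (1,'a'), (2,'b')")

# read every table listed in sqlite_master into a dataframe
db = {}
for row in con.execute('select * from sqlite_master'):
    if row['type'] == 'table':
        db[row['name']] = pandas.read_sql('select * from ' + row['name'], con)

print(list(db))          # ['demo']
print(len(db['demo']))   # 2
```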