Fall 2019

# Day 12, Monday, October 7

## Today's topics

make our own histogram

## SPARCS: every inpatient hospital visit in NY state in 2016

A much bigger table to analyze, with 2.3 million rows. (SPARCS_2016.zip is on my server if you haven't downloaded it already.) Let's explore this data.

• How many rows, exactly?
• What does your average New Yorker get charged for hospital stays in one year?
• What were the largest and smallest charges? (and what were they for?)
• What were the mean and median charges?
• Can we understand this distribution by making a histogram? (histogram.py, matplotlib.pyplot.bar(barcenters, heights, barwidth))
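The questions above map onto a handful of pandas calls. Here is a minimal sketch on a tiny made-up table. The 'Total Charges' column name comes from the exercise later in these notes. The 'CCS Diagnosis Description' column name, the rows, and the population figure are placeholders assumed for illustration, not values from the SPARCS file.

```python
import pandas as pd

# Toy stand-in for the SPARCS table (made-up rows, assumed column names).
visits = pd.DataFrame({
    "CCS Diagnosis Description": ["Liveborn", "Septicemia", "Liveborn", "Pneumonia"],
    "Total Charges": [4000.0, 90000.0, 6000.0, 20000.0],
})

n_rows = len(visits)                       # how many rows, exactly?
charges = visits["Total Charges"]

largest = visits.loc[charges.idxmax()]     # full row with the largest charge
smallest = visits.loc[charges.idxmin()]    # full row with the smallest charge

mean_charge = charges.mean()
median_charge = charges.median()

# Charge per resident per year: total charges divided by the population.
population = 10                            # placeholder, NOT the real NY population
per_capita = charges.sum() / population

print(n_rows, mean_charge, median_charge, per_capita)
print(largest["CCS Diagnosis Description"], smallest["CCS Diagnosis Description"])
```

Looking up the whole row with idxmax() and idxmin(), rather than calling max() and min() on the column, is what answers "and what were they for?".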

## Histograms in more depth

Let's make a histogram of this data:

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

x = np.array([0.,23,15,5,7,29])

plt.plot(x,0*x,'o');
```


As always, for fast code, use available numpy methods to avoid Python loops; numpy.add.at() is useful here.
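One way to build the histogram without a Python loop is sketched below. The range 0 to 30 and the choice of three bins are assumptions made for this small array, and histogram_counts and plot_histogram are hypothetical helper names, not necessarily what histogram.py contains.

```python
import numpy as np

def histogram_counts(x, lo, hi, nbins):
    """Count the values of x in each of nbins equal-width bins spanning [lo, hi]."""
    x = np.asarray(x, dtype=float)
    barwidth = (hi - lo) / nbins
    # Bin index of every value, computed for the whole array at once.
    idx = np.floor((x - lo) / barwidth).astype(int)
    idx[x == hi] = nbins - 1               # put the right edge in the last bin
    inside = (idx >= 0) & (idx < nbins)    # ignore values outside [lo, hi]
    heights = np.zeros(nbins, dtype=int)
    # Unbuffered add: every repeated index counts, unlike heights[idx] += 1,
    # which would add only once per distinct index.
    np.add.at(heights, idx[inside], 1)
    barcenters = lo + barwidth * (np.arange(nbins) + 0.5)
    return barcenters, heights, barwidth

def plot_histogram(barcenters, heights, barwidth):
    """Draw the bars; matplotlib is imported here so counting needs only numpy."""
    import matplotlib.pyplot as plt
    plt.bar(barcenters, heights, barwidth)
    plt.xlabel("x")
    plt.ylabel("count")

x = np.array([0., 23, 15, 5, 7, 29])
barcenters, heights, barwidth = histogram_counts(x, 0, 30, 3)
print(barcenters, heights, barwidth)
```

In a notebook, plot_histogram(barcenters, heights, barwidth) then draws the bars. The result can be checked against np.histogram(x, bins=3, range=(0, 30)).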

Exercise: Make a histogram of the 'Total Charges' in the SPARCS data. Also try a histogram of the logarithm of the charges. How does it compare with a normal distribution?
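As a possible starting point for the comparison with a normal distribution, the sketch below histograms the base-10 logarithm of some charges and evaluates a normal density with the same mean and standard deviation at the bin centers. The charges here are synthetic lognormal numbers generated purely for illustration, so the close agreement is built in; whether the real 'Total Charges' behave this way is the point of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for 'Total Charges': lognormal values, NOT real SPARCS data.
charges = rng.lognormal(mean=10.0, sigma=1.0, size=100_000)

logc = np.log10(charges)
mu, sigma = logc.mean(), logc.std()

# density=True rescales the bar heights so that the total bar area is 1,
# which makes them comparable to a probability density.
heights, edges = np.histogram(logc, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Normal density with the same mean and standard deviation as the log charges.
normal_pdf = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

max_gap = np.abs(heights - normal_pdf).max()
print(f"mean={mu:.3f}, std={sigma:.3f}, largest gap={max_gap:.3f}")
```

Plotting heights as bars and normal_pdf as a line over the same centers shows where the two disagree, typically in the tails.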

### How many bins should you use?

Too many and too few are both bad. A good choice depends on your assumptions about the data.

Example 1: uniform random numbers between 0 and 1. Also look at exponentially distributed random numbers (-log(u)).
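A small experiment along these lines, as a sketch: the sample size, the seed, and the three bin counts are arbitrary choices. With a flat density every bin has the same expected count, so any bin-to-bin variation in the histogram is pure counting noise, and it grows in relative terms as the bins get narrower.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
u = 1.0 - rng.random(n)        # uniform on (0, 1]; avoids log(0) below
e = -np.log(u)                 # exponentially distributed with mean 1

for nbins in (3, 30, 300):
    counts, _ = np.histogram(u, bins=nbins, range=(0.0, 1.0))
    expected = n / nbins                   # flat density: same expected count per bin
    rel_scatter = counts.std() / expected  # about 1/sqrt(expected) for counting noise
    print(f"{nbins:4d} bins: expected {expected:7.1f} per bin, "
          f"relative scatter {rel_scatter:.2f}")

print("mean of -log(u):", e.mean())
```

Repeating the loop with e in place of u (and a suitable range) shows the other failure: with very few bins the exponential decay is barely visible.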

Example 2: "Total Charges" for "Liveborn" at Maimonides Medical Center.

Optimal binning.
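These notes do not say which rule "optimal binning" refers to. As one possible illustration, numpy provides several standard bin-width rules through np.histogram_bin_edges. The sketch below compares three of them on a synthetic exponential sample; the choice of rules and the sample are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
x = -np.log(1.0 - rng.random(1000))   # synthetic exponential sample

# Bin edges chosen by three of numpy's built-in rules.
bin_edges = {rule: np.histogram_bin_edges(x, bins=rule)
             for rule in ("sturges", "scott", "fd")}

for rule, edges in bin_edges.items():
    print(f"{rule:8s}: {len(edges) - 1} bins of width {edges[1] - edges[0]:.3f}")
```

Each rule encodes a different assumption about the data. Sturges' rule, for example, depends only on the number of points, while the Scott and Freedman-Diaconis ('fd') rules also use the spread of the data.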

### Grouping and aggregation

Pandas groupby method

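```python
import pandas as pd
from IPython.display import display  # already available inside a Jupyter notebook

# Small example table: two types, five named items, one count each
df = pd.DataFrame([['fruit','apple',23],
                   ['fruit','orange',67],
                   ['fruit','banana',113],
                   ['veg','broccoli',5],
                   ['veg','carrot',15]], columns=['type','name','count'])
display(df)
```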


groupby gives an iterable of (group name, group dataframe) tuples
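A short sketch of that iteration on the fruit and vegetable table. The totals dictionary is just a hypothetical way to collect a result per group.

```python
import pandas as pd

df = pd.DataFrame([['fruit', 'apple', 23],
                   ['fruit', 'orange', 67],
                   ['fruit', 'banana', 113],
                   ['veg', 'broccoli', 5],
                   ['veg', 'carrot', 15]], columns=['type', 'name', 'count'])

totals = {}
for group_name, group_df in df.groupby('type'):
    # group_df holds only the rows whose 'type' equals group_name
    print(group_name, list(group_df['name']))
    totals[group_name] = int(group_df['count'].sum())

print(totals)
```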

The groupby object supports aggregation, using agg({column:aggregator}) where aggregator is either a function, like numpy.std, or the name of an operation pandas understands, like 'count'.
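A minimal sketch of both kinds of aggregator on the same small table. Passing a list of names for one column and using a lambda as the function are choices made for this example; the sketch sticks to pandas operation names and a plain Python function.

```python
import pandas as pd

df = pd.DataFrame([['fruit', 'apple', 23],
                   ['fruit', 'orange', 67],
                   ['fruit', 'banana', 113],
                   ['veg', 'broccoli', 5],
                   ['veg', 'carrot', 15]], columns=['type', 'name', 'count'])

# Names of operations pandas understands; a list gives one output column per name.
summary = df.groupby('type').agg({'count': ['count', 'sum', 'mean']})

# A function: it receives each group's 'count' column as a Series.
count_range = df.groupby('type').agg({'count': lambda s: s.max() - s.min()})

print(summary)
print(count_range)
```

With a list of aggregators the result has two-level column labels, so a single value is read as summary.loc['fruit', ('count', 'sum')].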

Quiz: "most common diagnosis": What is the most common reason for an inpatient hospital stay?

Exercises:

• List diagnoses with counts in order of decreasing frequency.
• Then restrict to cancer only.
• Display in a bar chart.
• Elaborate the bar chart as a stacked bar chart with color denoting Age Group (use groupby with a list of columns).
• Mean and/or median charges by hospital county.
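The exercises above can be approached with the same groupby tools. The sketch below runs on a made-up six-row table. The column names 'CCS Diagnosis Description' and 'Hospital County' and all the values are assumptions for illustration; 'Total Charges' and 'Age Group' are the names used in these notes. Grouping on two columns is done by passing a list of column names.

```python
import pandas as pd

# Toy stand-in for the SPARCS table (made-up rows, assumed column names).
visits = pd.DataFrame({
    "CCS Diagnosis Description": ["Liveborn", "Septicemia", "Liveborn",
                                  "Pneumonia", "Liveborn", "Septicemia"],
    "Age Group": ["0 to 17", "70 or Older", "0 to 17",
                  "50 to 69", "0 to 17", "50 to 69"],
    "Hospital County": ["Kings", "Kings", "Albany", "Albany", "Kings", "Albany"],
    "Total Charges": [4000.0, 90000.0, 6000.0, 20000.0, 5000.0, 70000.0],
})
diag = "CCS Diagnosis Description"

# Diagnoses with counts, most frequent first.
diag_counts = visits.groupby(diag).size().sort_values(ascending=False)

# Counts per (diagnosis, age group); unstack turns the age groups into columns,
# which is the layout a stacked bar chart needs.
by_age = visits.groupby([diag, "Age Group"]).size().unstack(fill_value=0)

# Mean and median charges by hospital county.
by_county = visits.groupby("Hospital County").agg({"Total Charges": ["mean", "median"]})

print(diag_counts)
print(by_age)
print(by_county)
```

In a notebook, by_age.plot.bar(stacked=True) draws the stacked chart with matplotlib. A cancer-only view could filter with something like diag_counts[diag_counts.index.str.contains('cancer', case=False)], assuming the cancer diagnoses carry that word in their description.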

Altair data encoding types for reference:

| Data Type | Shorthand Code | Description |
|---|---|---|
| quantitative | Q | a continuous real-valued quantity |
| ordinal | O | a discrete ordered quantity |
| nominal | N | a discrete unordered category |
| temporal | T | a time or date value |

### Geospatial aspects

Download this county map of New York State and the function I wrote, colorny.py, which colors the map from a Series.

```python
from colorny import colorny
```


Exercise: Explore variation in charges from county to county. Be careful not to draw unwarranted conclusions.

Next time: