Fall 2019

make our own histogram

A much bigger table to analyze: with 2.3 million rows. (SPARCS_2016.zip on my server if you didn't download it already.) Let us explore this data.

- How many rows, exactly?
- What does your average New Yorker get charged for hospital stays in one year?
- What were the largest and smallest charges? (and what were they for?)
- What were the mean and median charges?
- Can we understand this distribution by making a histogram? (histogram.py, matplotlib.pyplot.bar(barcenters, heights, barwidth))

Let's make a histogram of this data:

import numpy as np import matplotlib.pyplot as plt %matplotlib inline %config InlineBackend.figure_format = 'retina' x = np.array([0.,23,15,5,7,29]) plt.plot(x,0*x,'o');

As always, for fast code, use available numpy methods to avoid Python loops: *numpy.add.at()* useful here.

**Exercise:** Make a histogram of the 'Total Charges' in the SPARCS data. (log charges, normal comparison?)

Too many and too few are both bad. A good choice depends on your assumptions about the data.

Example 1: uniform random numbers between 0 and 1. Also look at exponentially distributed random numbers (-log(u)).

Example 2: "Total Charges" for "Liveborn" at *Maimonides Medical Center*.

Optimal binning.

Pandas *groupby* method

df = pd.DataFrame([['fruit','apple',23], ['fruit','orange',67], ['fruit','banana',113], ['veg','broccoli',5], ['veg','carrot',15]], columns=['type','name','count']) display(df)

*groupby* gives an iterable of (group name, group dataframe) tuples

The groupby object supports aggregation, using *agg({column:aggregator})* where *aggregator* is either a ufunc, like *numpy.std* or
the name of an operation pandas understands, like *'count'*.

**Quiz:** "most common diagnosis": What is the most common reason for an inpatient hospital stay?

**Exercises:**

- List diagnoses with counts in order of decreasing frequency.
- Then restrict to cancer only.
- Display in a bar chart.
- Elaborate the bar chart as a stacked bar chart with color denoting Age Group (use groupby with a tuple of columns).
- mean and/or mean charges by hospital county

Altair data encoding types for reference:

Data Type | Shorthand Code | Description |
---|---|---|

quantitative | Q | a continuous real-valued quantity |

ordinal | O | a discrete ordered quantity |

nominal | N | a discrete unordered category |

temporal | T | a time or date value |

Download this county map of New York State and this function I wrote: colorny.py to color it from a Series.

from colorny import colorny

*Exercise:* Explore variation in charges from county to county. Be careful not to draw unwarranted conclusions.

Next time:

reviewer table except change first Z to K

movie table except change first Z to r

rating table except change first Z to q