MTH 448/563 Data-Oriented Computing

Fall 2019

Day 12, Monday, October 7

Todays topics

make our own histogram

SPARCS: every inpatient hospital visit in NY state in 2016

A much bigger table to analyze: with 2.3 million rows. (SPARCS_2016.zip on my server if you didn't download it already.) Let us explore this data.

  • How many rows, exactly?
  • What does your average New Yorker get charged for hospital stays in one year?
  • What were the largest and smallest charges? (and what were they for?)
  • What were the mean and median charges?
  • Can we understand this distribution by making a histogram? (histogram.py, matplotlib.pyplot.bar(barcenters, heights, barwidth))

Histograms in more depth

Let's make a histogram of this data:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

x = np.array([0.,23,15,5,7,29])

plt.plot(x,0*x,'o');

As always, for fast code, use available numpy methods to avoid Python loops: numpy.add.at() useful here.

Exercise: Make a histogram of the 'Total Charges' in the SPARCS data. (log charges, normal comparison?)

How many bins should you use?

Too many and too few are both bad. A good choice depends on your assumptions about the data.

Example 1: uniform random numbers between 0 and 1. Also look at exponentially distributed random numbers (-log(u)).

Example 2: "Total Charges" for "Liveborn" at Maimonides Medical Center.

Optimal binning.

Grouping and aggregation

Pandas groupby method

df = pd.DataFrame([['fruit','apple',23],
                       ['fruit','orange',67],
                       ['fruit','banana',113],
                       ['veg','broccoli',5],
                       ['veg','carrot',15]], columns=['type','name','count'])
display(df)

groupby gives an iterable of (group name, group dataframe) tuples

The groupby object supports aggregation, using agg({column:aggregator}) where aggregator is either a ufunc, like numpy.std or the name of an operation pandas understands, like 'count'.

Quiz: "most common diagnosis": What is the most common reason for an inpatient hospital stay?

Exercises:

  • List diagnoses with counts in order of decreasing frequency.
  • Then restrict to cancer only.
  • Display in a bar chart.
  • Elaborate the bar chart as a stacked bar chart with color denoting Age Group (use groupby with a tuple of columns).
  • mean and/or mean charges by hospital county

Altair data encoding types for reference:

Data Type Shorthand Code Description
quantitative Q a continuous real-valued quantity
ordinal O a discrete ordered quantity
nominal N a discrete unordered category
temporal T a time or date value

Geospatial aspects

Download this county map of New York State and this function I wrote: colorny.py to color it from a Series.

from colorny import colorny

Exercise: Explore variation in charges from county to county. Be careful not to draw unwarranted conclusions.

Next time:

Merging (joining) tables

Your movie reviews

reviewer table except change first Z to K

movie table except change first Z to r

rating table except change first Z to q