MTH 448/563 Data-Oriented Computing

Fall 2019

Day 13, Wednesday, October 9

SPARCS

Review of last lesson

Exercise: Reusing code from last time, make a histogram of Birth Weight in the SPARCS dataset. Start with 20 bins between 0 and 5000(g). Decide what is the optimal number of bins.

Grouping and aggregation

Pandas groupby method

df = pd.DataFrame([['fruit','apple',23],
                       ['fruit','orange',67],
                       ['fruit','banana',113],
                       ['veg','broccoli',5],
                       ['veg','carrot',15]], columns=['type','name','count'])
display(df)

groupby gives an iterable of (group name, group dataframe) tuples

The groupby object supports aggregation, using agg({column:aggregator}) where aggregator is either a ufunc, like numpy.std or the name of an operation pandas understands, like 'count'.

Quiz: "most common diagnosis": Guess what is the most common reason for an inpatient hospital stay?

Useful elaborations: rename(columns={}), reset_index()

Exercises:

  • List diagnoses with counts in order of decreasing frequency.
  • Then restrict to cancer only. str.contains()
  • Display in a bar chart.
  • Elaborate the bar chart as a stacked bar chart with color denoting Age Group (use groupby with a tuple of columns).

Altair data encoding types for reference: Nominal, Ordinal, Quantitative, Temporal

Additional exercises:

  • Is there a relationship between birth weight and the age of mothers? (!)
  • Create a histogram of the lengths of stay.
  • Make a heatmap or scatter plot of charges/day vs. length of stay. .

Geospatial aspects

Download this county map of New York State and this function I wrote: colorny.py to color it from a Series.

from colorny import colorny

Exercise: Explore variation in charges from county to county. Be careful not to draw unwarranted conclusions.

Merging (joining) tables

Download these two small tables: firstname.csv, lastname.csv, teaching.csv. The third is from UB's (Oracle) database UBInfosource.

inner, left (outer), right (outer), and (full) outer joins

on, how, etc.

Your movie reviews

reviewer table except change first Z to K

movie table except change first Z to r

rating table except change first Z to q

To download, copy, paste, and run this code:

sheets = {
'f19_reviewer.xlsx':'1ZJpIeML2f2t6UxxGFzOTOQPM0xtyUgJVVBrW3I9nyjU', # except change first Z to K
'f19_movie.xlsx'   :'1rZAUcQxNTGcwZTnUEI5kZJvDGjLE7V1LvB5Rey5rY1w', # except change first Z to r
'f19_rating.xlsx'  :'108Z2iAlSiq8AQUJCOVa5uj8lkjIpBIPLIr83C41amYw'} # except change first Z to q

import requests
for sheet in sheets:
                 url = 'https://docs.google.com/spreadsheets/d/' + sheets[sheet] + '/export?format=xlsx'
                 r = requests.get(url)
                 with open(sheet,'wb') as f:
                         f.write(r.content)
                 print('Downloaded '+sheet)

Discussion: Why are databases factored into multiple tables like this?

Quiz: "movie question". A question about the movie reviews you'd like answered. (Please write your question on a single line.)

Let's try to answer some of these.

Notes:

alt.Chart(table).mark_bar().encode(x=dd+':N',y='count',color='Age Group')

alt.data_transformers.enable('json')