MTH 448/563 Data-Oriented Computing

Fall 2019

Day 11

Todays topics

  • wrap-up of 'steering' complaints
    • pandas.melt()
  • dates and times
  • pandas for data exploration
    • airports
    • SPARCS
  • movie reviews

Steering complaints and stacked bar chart

long vs. wide tables, pandas.melt(). Note altair also provides "melt" and "fold" transformations.

My code that counts the complaints that mention something specific as well as all the others:

import requests
import json
import pandas as pd

url0 = 'http://www.nhtsa.gov/webapi/api/Complaints/vehicle/modelyear/{}/make/{}/model/{}?format=json'

def get(make,model,searchstring):
        d = {'other':[],searchstring:[],'model year':[]}
        for year in range(2004,2011):

                print(str(year)+'\r',end='')
                # download the complaints for this make,model,year
                url = url0.format(year,make,model)
                s = requests.get(url).text
                complaints = json.loads(s)

                # count complaints whose Summary contains search string
                nc = 0
                for result in complaints['Results']:
                    if searchstring in result['Summary'].lower(): nc += 1

                d['model year'].append(year)
                d[searchstring].append(nc)
                d['other'].append(complaints['Count']-nc)

        df = pd.DataFrame.from_dict(d)
        #display(df)
        df = df[['model year',searchstring,'other']]   # reorder columns
        return df

df = get('Chevrolet','Cobalt','steer')
display(df)

Nice to make a stacked bar chart with Altair but ... altair wants the data in "long" format while we have prepared it in "wide" format. "Melting" and "pivoting" are operations that convert between these formats. Try this:

df.melt( id_vars=['model year'], value_vars=['other','steer'])

Datetimes

Dates and times are a bit complicated: variable-length months, leap years, etc.

{'ODINumber': 11244379,
 'Manufacturer': 'General Motors LLC',
 'Crash': 'No',
 'Fire': 'No',
 'NumberOfInjured': 0,
 'NumberOfDeaths': 0,
 'DateofIncident': '/Date(1527825600000-0400)/',
 'DateComplaintFiled': '/Date(1565969608000-0400)/',
 'VIN': '1G1AL55F567',
 'Component': 'ELECTRICAL SYSTEM | SERVICE BRAKES | STEERING',
 'Summary': '"TAKATA RECALL" THE CAR CUTS OFF AS I\'M DRIVING, OR SLOWING DOWN, THE IGNITION SWITCH DOESN\'T WORK TOO WELL, IT ALSO STALLS, GOES LONG PERIODS OF TIME WHERE IT WON\'T START OR EVEN TURN OVER. MY POWER STEERING WARNING LIGHT COMES NEARLY EVERYTIME I DRIVE BUT ALL MY FLUIDS IN THE CAR ARE FULL, SO IT MAKES DRIVING KINDA SCARY. IT JUST FEELS UNSAFE AND I WASN\'T AWARE OF THESE PROBLEMS AND SITUATIONS WHEN I HAD GOTTEN THE 2006 COBALT. PLEASE HELP.',
 'ProductType': 'VEHICLE',
 'ModelYear': '2006',
 'Make': 'CHEVROLET',
 'Model': 'COBALT'}

The first 10 digits in '/Date(1565969608000-0400)/' give us the number of seconds since the beginning of the year 1970 (Unix timestamp).

datetime module

supports arithmetic with datetimes

from datetime import datetime
datetime.fromtimestamp( 1527825600 )
datetime.datetime(2018, 6, 1, 0, 0)

Exercise: Make a histogram of incident dates or filing dates of complaints about 1 make, model, year. .

Pandas DataFrame and Series features

A fairly simple dataset to illustrate things: airports worldwide http://ourairports.com/data/: download CSV.

  • head, tail
  • converters (again)
  • idxmax, idxmin (index of extreme values of Series)
  • sort_values(by=)
  • a few simple aggregations
    • min, max, mean, median

Grouping and aggregation

  • value_counts

More generally, we can use the groupby method

df = pd.DataFrame([['fruit','apple',23],
                       ['fruit','orange',67],
                       ['fruit','banana',113],
                       ['veg','broccoli',5],
                       ['veg','carrot',15]], columns=['type','name','count'])
display(df)

groupby gives an iterable of (group name, group dataframe) tuples

The groupby object supports aggregation, using agg({column:aggregator})

Your movie reviews

In preparation for next week, please provide reviews of at least 5 movies:

reviewer table except change first Z to K

movie table except change first Z to r

rating table except change first Z to q

SPARCS: every inpatient hospital visit in NY state in 2016

A much bigger table to analyze: with 2.3 million rows. (SPARCS_2016.zip on my server if you didn't download it already.) Let us explore this data.

  • How many rows, exactly?
  • What does your average New Yorker get charged for hospital stays in one year?
  • What were the largest and smallest charges? (and what were they for?)
  • What were the mean and median charges?
  • Can we understand this distribution by making a histogram? (histogram1d.py, matplotlib.pyplot.bar(barcenters, heights, barwidth))