MTH 448/563 Data-Oriented Computing

Fall 2018

Day 7, Tuesday, Sep 18

More complex data: the bad old days

Everyone made up their own format (some still do).

Microsoft arguably the worst offender: non-plain-text and secret formats for Office documents, etc.

Others: Maple worksheet classic mws (now replaced by XML-based mw), Mathematica notebook.

Things are getting better now ...

JSON

Supplementary reference: Jennifer Widom database lectures.

Examples of use

Example of use: Thunderbird email client's bookkeeping:

$ ls *.json

addons.json
blocklist-addons.json
blocklist-gfx.json
blocklist-plugins.json
directoryTree.json
downloads.json
extensions.json
folderTree-1.json
folderTree-2.json
folderTree-3.json
folderTree-4.json
folderTree-5.json
folderTree-6.json
folderTree-7.json
folderTree.json
logins.json
search.json
sessionCheckpoints.json
session.json
times.json
xulstore.json

Another example: Jupyter notebooks!

Exercise: Make a copy of any of your Jupyter notebook (.ipynb) files as something.json. Then view with your browser or a text editor.

Appeal: Human-readable, self-describing. And JSON data happens to look almost like Python code. (It is JavaScript object-literal syntax.)

Browser plugins that render JSON nicely are available.

Correspondence with Python data structures

  • object = dictionary (string:value)
  • array = list
  • value = string/number/true/false/null/object/array

Cautions:

  • JSON keys are always strings (not required in Python dictionaries)
  • JSON text is almost pasteable as Python code, but JSON "true/false" map to Python "True/False", and JSON "null" maps to Python "None".
  • numpy arrays can't be stored as JSON without conversion to lists.
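The correspondence and the first two cautions can be checked directly with the json module. This is a minimal sketch; the JSON text and the by_year dictionary are made up for illustration.

```python
import json

# A small JSON text using each kind of value listed above (made-up content).
text = '{"name": "cobalt", "year": 2005, "recalled": true, "notes": null, "tags": ["a", "b"]}'

obj = json.loads(text)   # JSON object -> Python dict
print(type(obj), obj["recalled"], obj["notes"], obj["tags"])

# Non-string dictionary keys are converted to strings on the way out,
# so a round trip does not give back the original dictionary.
by_year = {2005: 12, 2006: 7}
round_trip = json.loads(json.dumps(by_year))
print(round_trip)
```

For a numpy array, calling its tolist() method before dumping is one way to do the conversion.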

How to read and write JSON in Python

json module: dump, load

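import json

# Read a JSON file into Python data structures
with open('foo.json') as f:
    myobj = json.load(f)

# Write it back out, indented, with keys in sorted order
with open('bar.json', 'w') as f:
    json.dump(myobj, f, indent=3, sort_keys=True)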
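To see what the indent and sort_keys arguments do without touching any files, json.dumps returns the same text as a string. A small sketch with a made-up object:

```python
import json

# Made-up object; note that "b" comes before "a" in the source dictionary.
myobj = {"b": 1, "a": [1, 2]}

text = json.dumps(myobj, indent=3, sort_keys=True)
print(text)

# Reading the text back gives an equal object.
restored = json.loads(text)
```

The printed text has one item per line, nested levels indented by 3 spaces, and the key "a" listed before "b".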

Escape from Jupyter

Exercise (20 minutes): Write an "Escape from Jupyter" program. Imagine the Jupyter project languishes, and 10 years from now you want to run the code in the Jupyter notebooks you wrote this year. Write a plain-text Python program that extracts all the Python code and annotation from a Jupyter notebook and writes a corresponding executable plain Python script. Specifically:

def jup2py(jupfile):
    """Write a plain-text Python file [jupfile].py."""
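One possible approach, as a minimal sketch rather than a reference solution. It assumes the notebook uses the nbformat 4 layout (a top-level "cells" list whose entries have "cell_type" and "source"), and it reads "[jupfile].py" as the notebook's file name with ".py" appended; both are assumptions.

```python
import json

def jup2py(jupfile):
    """Write [jupfile].py: code cells as code, all other cells as comments."""
    with open(jupfile) as f:
        nb = json.load(f)
    chunks = []
    for cell in nb["cells"]:              # assumes nbformat 4 layout
        src = cell["source"]
        if isinstance(src, list):         # source may be stored as a list of lines
            src = "".join(src)
        if cell["cell_type"] == "code":
            chunks.append(src)
        else:                             # markdown or raw cells become comments
            chunks.append("\n".join("# " + line for line in src.splitlines()))
    outfile = jupfile + ".py"             # assumed naming convention
    with open(outfile, "w") as f:
        f.write("\n\n".join(chunks) + "\n")
    return outfile
```

The sketch does not handle IPython-only lines such as %magic commands or !shell commands, which are not valid Python and would need to be commented out or removed.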

NHTSA complaints database

JSON Exercise: Chevrolet Cobalt ignition switch

From Wikipedia article on Chevrolet Cobalt: Faulty ignition switches in the Cobalts, which cut power to the car while in motion, were eventually linked to many crashes resulting in fatalities, starting with a teenager in 2005 who drove her new Cobalt into a tree. The switch continued to be used in the manufacture of the vehicles even after the problem was known to GM. On February 21, 2014, GM recalled over 700,000 Cobalts for issues traceable to the defective ignition switches. In May 2014 the NHTSA fined the company $35 million for failing to recall cars with faulty ignition switches for a decade, despite knowing there was a problem with the switches. Thirteen deaths were linked to the faulty switches during the time the company failed to recall the cars.

http://www.nhtsa.gov/webapi/api/Complaints/vehicle/modelyear/2005/make/chevrolet/model/cobalt?format=json

Topic of Report 2: Was this problem evident from the NHTSA complaint database long before the 2014 recall? How would you go about searching the database for evidence of other serious problems?

Bug in JSON delivery (workaround: request CSV instead and convert it to a list of dicts):

import requests
import json
import pandas
from io import StringIO
url0 = 'http://www.nhtsa.gov/webapi/api/Complaints/vehicle/modelyear/{}/make/{}/model/{}?format=csv'
year,make,model = '2005','chevrolet','cobalt'
url = url0.format(year,make,model)
s = requests.get(url).text  # this is a CSV string
df = pandas.read_csv(StringIO(s)) # use pandas to parse the CSV
complaints = df.to_dict('records') # convert to list of dicts
complaints[0]

Histogram of complaint frequency

Useful matplotlib features:

  • bar()
  • xlim(), ylim()
  • xticks([],rotation=90)
  • title()
  • grid()
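The features above might be combined as follows. This is a sketch under stated assumptions: the dates are made up, the 'YYYY-MM-DD' format is assumed rather than taken from the NHTSA data, and count_by_year and plot_counts are hypothetical helper names.

```python
from collections import Counter

def count_by_year(dates):
    """Count items per calendar year; dates are 'YYYY-MM-DD' strings (assumed format)."""
    return Counter(int(d[:4]) for d in dates)

def plot_counts(counts, title):
    """Bar chart of counts per year using the features listed above."""
    import matplotlib.pyplot as plt   # imported here so counting works without matplotlib
    years = sorted(counts)
    plt.bar(years, [counts[y] for y in years])
    plt.xlim(min(years) - 1, max(years) + 1)
    plt.ylim(0, max(counts.values()) * 1.1)
    plt.xticks(years, rotation=90)    # one tick per year, labels rotated
    plt.title(title)
    plt.grid(True)
    plt.show()

# Made-up dates for illustration only, not real complaint data.
sample_dates = ["2005-03-01", "2005-11-20", "2006-01-15", "2008-07-04"]
counts = count_by_year(sample_dates)
print(sorted(counts.items()))
```

Calling plot_counts(counts, "Complaints per year (made-up sample)") would then draw the chart.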

More thoughts on Reports

Embedding data: yes or no?

Yes, give the reader a feel for the data you're dealing with:

intro_show_the_data.png

The following is bigger but still fits on one screen, and is very helpful in conveying to the reader how you have processed the data:

report1_ok_data.png

But no, do NOT include material that no one is likely to read: long lists, or tables that Jupyter puts into scrollable sub-windows or that extend over multiple screens. Put this kind of material in an Appendix if you think it's vital to have for reference.

Scaling things out

number of steering complaints / number of complaints

number of complaints / number of cars

number of Elizabeths / total number of births

james_john_robert_unscaled.png

male_to_female_numbers.png

number of distinct names / total number of births ?

names_unique_to_total.png
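Each of these ratios is a count divided by a matching total. A minimal sketch with made-up numbers (not taken from the baby-names or complaints data); scale is a hypothetical helper name.

```python
def scale(counts, totals):
    """Divide each count by the matching total, skipping years with no total."""
    return {year: counts[year] / totals[year]
            for year in counts if totals.get(year)}

# Made-up numbers for illustration only.
elizabeths = {1950: 200, 1980: 300}
births = {1950: 10000, 1980: 30000}

shares = scale(elizabeths, births)
print(shares)
```

In this made-up example the raw count rises from 200 to 300 while the share of births falls, which is the kind of effect scaling is meant to expose.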

Shifting axes

Life-cycle of names, pulses of popularity

joseph_3_pulses.png

Perhaps we could shift and rescale every name to place their peaks of popularity all at the same point, and see how the distribution looks.
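A sketch of what that shift and rescale could look like for one name. The series is made up, align_peak is a hypothetical helper name, and the sketch assumes the peak count is greater than zero.

```python
def align_peak(years, counts):
    """Shift years so the peak sits at 0 and rescale counts so the peak is 1."""
    peak_index = max(range(len(counts)), key=counts.__getitem__)
    peak_year, peak_value = years[peak_index], counts[peak_index]
    shifted = [y - peak_year for y in years]     # years relative to the peak
    scaled = [c / peak_value for c in counts]    # fraction of peak popularity
    return shifted, scaled

# Made-up series for illustration only.
years = [1990, 1991, 1992, 1993]
counts = [10, 40, 20, 5]

shifted, scaled = align_peak(years, counts)
print(shifted, scaled)
```

Applying this to every name puts all the curves on a common pair of axes, so their shapes can be overlaid and compared.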

Taking logarithms

Logarithms allow you to visually distinguish big, small, very small

Only makes sense if zero is a special value! (not an arbitrary origin)
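A small numeric illustration with made-up counts. On a linear axis topping out at 100000, the values 100 and 1 are both indistinguishable from zero; after taking log10 they are evenly separated.

```python
import math

# Made-up counts spanning several orders of magnitude.
counts = {"big": 100000, "small": 100, "very small": 1}

logs = {label: math.log10(n) for label, n in counts.items()}
for label, n in counts.items():
    print(label, n, logs[label])
```

A count of exactly zero has no logarithm (math.log10(0) raises ValueError), so zeros need separate handling. In matplotlib, plt.yscale('log') gives a logarithmic axis without transforming the data by hand.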

Miscellaneous

Stacked histograms bad?

stacked_plot.png

Can obscure what's going on for every layer except the lowest.

How many curves in a plot?

As I said last week: A few is good. Very many can be good. Intermediate not so much.

Combining anecdote with entire population statistics

I think this student-created picture is strikingly good:

examples_on_top_of_distribution.png

Marking special values on axes

Can use axvline().

names_chandler_up.png axvline_demo_forrest.png

There is also axvspan() to draw a rectangle.

Rotating axis labels

plt.xticks(np.arange(1940,1986,1),rotation=60)
plt.xlim(1940,1985)
rotated_axis_labels.png

Another little trick:

legend outside the box

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Tufte books

slide_tuftecover1.png tuftecover2.png

Good principles for effective visual display of quantitative information

Examples:

  • have graphical integrity
  • avoid "chartjunk"
  • have high data density