MTH 448/563 Data-Oriented Computing

Spring 2017

Day 3

Tues, Feb 7, 2017

Reminder: Semester-long data-collection project

By Thursday, let me know what data you propose to record.

Exercise 3, conclusion

Analyze the first Presidential debate between Hillary Clinton and Donald Trump

Some questions to answer:

  • how much did each of them speak?
  • how big is the vocabulary of each?
  • which words did each use most frequently?
  • how do the words they used least frequently compare?
  • what are the words used most unequally?
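Most of these questions reduce to counting words. Here is a minimal sketch of one way to do it, not necessarily the approach taken in class. The two short strings are made-up stand-ins for each speaker's share of the transcript, and the function names are hypothetical. "Most unequal" is interpreted here as the largest absolute difference in relative frequency, which is only one of several reasonable definitions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def summarize(text, top=3):
    """Return (total words spoken, vocabulary size, most frequent words)."""
    counts = Counter(tokenize(text))
    return sum(counts.values()), len(counts), counts.most_common(top)

def most_unequal(text_a, text_b, top=3):
    """Rank words by the absolute difference in relative frequency between two speakers."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    total_a, total_b = sum(a.values()), sum(b.values())
    diff = {w: a[w] / total_a - b[w] / total_b for w in set(a) | set(b)}
    return sorted(diff, key=lambda w: abs(diff[w]), reverse=True)[:top]

# Made-up placeholder text, NOT quotations from the debate.
speaker_a = "jobs jobs jobs and the economy and the jobs"
speaker_b = "the plan the plan and the economy"

total_a, vocab_a, top_a = summarize(speaker_a)
print(total_a, vocab_a, top_a)
print(most_unequal(speaker_a, speaker_b))
```

The least frequently used words can be read off the same Counter, for example by keeping only the words whose count is 1.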

Exercise 4, History of first names in the US

This will be the subject of your Report 1.

Consider the National data set from this US Social Security Administration page

https://www.ssa.gov/oact/babynames/limits.html

which records the names given to babies in the US from 1880 through 2014. Creating Report 1 will entail extracting something interesting from this data and writing about it.

We will use this data set to learn about:

  1. One of the oldest and most basic standardized formats for storing structured (tabular) data: CSV (comma-separated values).
  2. A number of extremely useful Unix/Linux/Mac (bash) shell commands, including cd, ls, unzip, tar, mkdir, head, tail, grep, | (pipe), cat, >, and >>.

As we work today, we will create a guide to bash shell commands in this Google Sheets document:

https://docs.google.com/spreadsheets/d/10hFOE-Z6EXMHBFsUJlNFbGh0NHr_Iuk1XXPgmlRrH2c/edit?usp=sharing

except you need to change the first F in that link to a T.

Packaging and unpackaging files

The baby names data set is a collection of CSV files in a zip archive.

When unpacking a package (archive), it is a good idea to do it in a separate folder you've created for the purpose, in case the creator of the package was antisocial enough to package a loose collection of files instead of a folder with all the files inside.

Unpack the archive by typing

unzip names.zip
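The same precaution can be taken from Python. The following is a minimal sketch using the standard-library zipfile module. The helper name unpack_zip and the destination folder name are illustrative choices. A tiny stand-in archive is built first so that the example runs without the real names.zip.

```python
import tempfile
import zipfile
from pathlib import Path

def unpack_zip(archive, dest):
    """Extract every member of a zip archive into dest, creating dest if needed."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in dest.iterdir())

# Build a small stand-in archive of loose files (this is not the real names.zip).
work = Path(tempfile.mkdtemp())
demo_zip = work / "names.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("yob2014.txt", "Emma,F,20799\nOlivia,F,19674\n")
    zf.writestr("notes.txt", "placeholder file\n")

# Unpack into a dedicated folder so loose files do not clutter the working directory.
extracted = unpack_zip(demo_zip, work / "names")
print(extracted)
```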

In the Linux world the most common package format is the gzipped tar file. Tar is the packager and gzip is used to compress the package. The following command creates a package (c for create, v for verbose, z for gzip, f for package filename):

tar zcvf mypackage.tgz myfolder

and the next command unpacks it (x for extract):

tar zxvf mypackage.tgz
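For comparison, here is a sketch of the same round trip using Python's standard-library tarfile module. The folder and file names are placeholders created in a temporary directory, so nothing outside it is touched.

```python
import tarfile
import tempfile
from pathlib import Path

# Create a small folder to package (placeholder content).
work = Path(tempfile.mkdtemp())
folder = work / "myfolder"
folder.mkdir()
(folder / "hello.txt").write_text("hello\n")

# Mode "w:gz" writes a gzip-compressed tar file, like `tar zcvf`.
package = work / "mypackage.tgz"
with tarfile.open(package, "w:gz") as tf:
    tf.add(folder, arcname="myfolder")

# Mode "r:gz" reads a gzip-compressed tar file, like `tar zxvf`.
out = work / "unpacked"
out.mkdir()
with tarfile.open(package, "r:gz") as tf:
    members = tf.getnames()   # list the contents before extracting
    tf.extractall(out)
print(members)
```

Because the folder itself was packaged, rather than its loose contents, everything lands inside unpacked/myfolder.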

CSV files

CSV is an ancient, venerable, and still widely used format for storing tabular data as plain text. The government's first names data is provided in this format.

ringland@blue:~/public_html/463/names$ head yob2014.txt
Emma,F,20799
Olivia,F,19674
Sophia,F,18490
Isabella,F,16950
Ava,F,15586
Mia,F,13442
Emily,F,12562
Abigail,F,11985
Madison,F,10247
Charlotte,F,10048
ringland@blue:~/public_html/463/names$
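Each line holds three comma-separated fields: the name, the sex, and the number of babies given that name in that year. Here is a minimal sketch of parsing such lines with Python's standard csv module. The sample text is copied from the first three lines of the listing, and the variable names are illustrative.

```python
import csv
import io

# First three lines of yob2014.txt, as shown by head.
sample = """Emma,F,20799
Olivia,F,19674
Sophia,F,18490
"""

rows = []
for name, sex, count in csv.reader(io.StringIO(sample)):
    rows.append((name, sex, int(count)))   # counts arrive as strings; convert to int

total = sum(count for _, _, count in rows)
print(rows[0])
print(total)
```

With a real file, io.StringIO(sample) would be replaced by an open file object, for example open("yob2014.txt", newline="").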

An initial peek at the names data with grep

History of your first name?

Exercise 4: Make a plot of the frequency of your first name over the years. Use the matplotlib/pylab plot function.
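One possible starting point is sketched below. It is not a model solution. It assumes the files are named yobYYYY.txt, as in the listing of yob2014.txt, and the function names name_counts and plot_name are hypothetical. The demonstration uses two tiny stand-in files; the 2013 count is invented purely for illustration.

```python
import csv
import tempfile
from pathlib import Path

def name_counts(folder, name, sex):
    """Return (years, counts) for one name, reading every yobYYYY.txt in folder."""
    years, counts = [], []
    for path in sorted(Path(folder).glob("yob*.txt")):
        year = int(path.stem[3:])      # "yob2014" -> 2014
        count = 0                      # a name absent from a file counts as 0
        with open(path, newline="") as f:
            for n, s, c in csv.reader(f):
                if n == name and s == sex:
                    count = int(c)
                    break
        years.append(year)
        counts.append(count)
    return years, counts

def plot_name(years, counts, name):
    """Plot counts against years (matplotlib is imported here, so the rest runs without it)."""
    import matplotlib.pyplot as plt
    plt.plot(years, counts)
    plt.xlabel("year")
    plt.ylabel("babies named " + name)
    plt.show()

# Stand-in data folder. The 2014 line is from the listing; the 2013 count is INVENTED.
folder = Path(tempfile.mkdtemp())
(folder / "yob2013.txt").write_text("Emma,F,100\n")
(folder / "yob2014.txt").write_text("Emma,F,20799\nOlivia,F,19674\n")

years, counts = name_counts(folder, "Emma", "F")
print(years, counts)
# plot_name(years, counts, "Emma")   # uncomment where matplotlib is available
```

Note that these are raw counts. Dividing each count by the total number of babies recorded that year would give a relative frequency, which may be the fairer quantity to compare across decades.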

Reports

Every report you write this semester should be a well-constructed, well-written document containing text and code, and often tables and/or figures. Points will be deducted for sloppiness of any kind, including ungrammatical sentences, incorrect spelling, or inconsistent formatting.

The report should be structured under a logical set of headings (and possibly subheadings).

The first section should always be an Introduction to the topic. This should provide the background to the topic and explain in broad terms what you are trying to do. From this, the reader should understand what the report is about and be motivated to read the body of the report. Some students may want to summarize very briefly here what was accomplished; others may prefer to keep that a "secret" until later in the report.

The Introduction should be followed by one or more sections whose headings may be specific to the topic. For example, here are the headings of an article in the current issue of Phys. Rev. A.

[Figure: phys_rev_a_article_structure.png]

Other mandatory sections are a Conclusions section where you will summarize and reflect on what you have discovered or accomplished, and a References section where all sources should be properly cited.

Furthermore, since reproducibility is essential in all scientific and mathematical endeavors, I want to be able to run your code to verify that it does what you claim. Your code must therefore appear somewhere in the report in a form that can be run as is, without further editing. If you choose to present your code in the body of your report as short fragments interspersed with explanatory text, that is fine (in fact, I quite like this style of presentation), but then you should also include it as a single runnable cell, or a small number of contiguous cells, in an Appendix called "Code" at the end.

Aside from the mandatory sections discussed above, I am perfectly happy for you to try your own ideas for arranging the body of the report, and I will give you prompt feedback on whether I find your structure effective or on how I think it could be improved. I will say in advance, however, that I do not like structures that force the reader to flip back and forth repeatedly between widely separated parts of the report: I much prefer a report that can be read and appreciated linearly from start to finish.

Target reader: the reader you should have in mind as you write is a smart classmate who missed class when the current topic was introduced, explored and discussed.

Report document format

You will use the Jupyter Notebook format for your reports. You will upload your ".ipynb" file to UBLearns.

Ex 4, cont'd: Reading all the names data into memory for analysis

What data structure(s) would be convenient for analyzing the data?
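One candidate, offered as a sketch and not necessarily the structure settled on in class, is a dictionary keyed by (name, sex) whose values are dictionaries mapping year to count. Looking up the history of a single name is then one dictionary access. The function name load_names is hypothetical, and the files are assumed to be named yobYYYY.txt. The demonstration folder contains stand-in files; the 2013 count is invented.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

def load_names(folder):
    """Read every yobYYYY.txt in folder into {(name, sex): {year: count}}."""
    data = defaultdict(dict)
    for path in Path(folder).glob("yob*.txt"):
        year = int(path.stem[3:])      # "yob2014" -> 2014
        with open(path, newline="") as f:
            for name, sex, count in csv.reader(f):
                data[(name, sex)][year] = int(count)
    return data

# Stand-in data folder. The 2014 lines are from the listing; the 2013 count is INVENTED.
folder = Path(tempfile.mkdtemp())
(folder / "yob2013.txt").write_text("Emma,F,100\n")
(folder / "yob2014.txt").write_text("Emma,F,20799\nOlivia,F,19674\n")

data = load_names(folder)
emma = data[("Emma", "F")]
print(sorted(emma.items()))
print(len(data))
```

Keying on the pair (name, sex) keeps a name given to both boys and girls as two separate histories. A nested dictionary keyed first by year would suit questions about a single year better, so the right choice depends on the analysis planned.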