MTH 448/563 Data-Oriented Computing

Fall 2019

Day 1

Data and society

The explosion in data collection is transforming society.

There is a tension between

surveillance by our own governments

NYT article, July 8, 2018

surveillance_govt_snowden_clapper.png

influence in politics by foreign actors

Misuse of personal data to influence elections: The Great Hack

cambridge_analytica_iowa.jpg zuckerberg_congress.jpg

and surveillance by corporations

surveillance_corporate_target.png

vs.

useful tools that make life better, like Google Maps Traffic

2016_01_26/Screenshot_2015-01-27-07-30-45.png

Science

Exploring the universe, from its smallest constituents:

CERN Large Hadron Collider

2016_01_26/cern-lhc-aerial.jpg

data rate

to its largest structures:

Square Kilometer Array radio telescope

2016_01_26/SKA.jpg

Expected data rate: 10 petabytes (10,000 terabytes) per day.

Skills needed and taught

Needed

Google "data science desired skills" (my results yesterday here)

Topics taught

  • Pure Python data wrangling (web-scraping, string-splitting, regular expressions)
  • Structured data formats (JSON, XML, ...) and validation
  • Data visualization
  • Relational databases and SQL
  • Use of Python Pandas data analysis library
  • Machine learning, supervised and unsupervised (from scratch and with Scikit-Learn, Google TensorFlow)
  • Geospatial analysis

Some datasets we'll look at

  • Open Data Buffalo (JSON)
  • first names in the US yearly since 1880
  • Northern California earthquakes
  • all airports in the world (CSV)
  • Chicago real-time bus info (XML)
  • UB class schedule
  • Sloan Digital Sky Survey
  • photographic images
  • New York State hospitalizations (SPARCS) (CSV)
  • NHTSA complaints (JSON)
  • your own handwriting
  • every file on your computer

Think about ethics when you use the power you gain in this course!

Structure and policies

More on the biweekly reports

screenshot of the beginning of one

Detailed Report Guide

Classroom setup

Every day at the beginning of class we will quietly and quickly arrange the tables and chairs like this. At the end of class we will restore them to their original state.

Getting to know each other

Let's introduce ourselves while we go outside and take some photos of ourselves in front of a green screen.

Numpy review: vectorizing, slicing, advanced indexing, broadcasting

Numpy provides array data structures and ways to efficiently operate on them. It is the basis of fast numerical computation in Python, and is what the data-analysis library Pandas uses "under the hood". Today we will explore using numpy directly.

Key ideas are vectorization (e.g. adding one array to another without writing a loop), slicing and boolean and "fancy" indexing (three ways of referring to part of an array), and broadcasting (stretching a small array to match the shape of a bigger one). We'll study each of these through examples.
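As a first taste, here is a minimal sketch of each idea on tiny arrays. The variable names (a, b, grid, ...) are made up for illustration and are not from the course materials.

```python
import numpy as np

a = np.arange(6)            # array([0, 1, 2, 3, 4, 5])
b = np.full(6, 10)          # six copies of 10

# Vectorization: elementwise arithmetic with no explicit Python loop.
total = a + b               # array([10, 11, 12, 13, 14, 15])

# Slicing: start:stop:step selects part of the array (as a view).
evens = a[::2]              # array([0, 2, 4])

# Boolean indexing: a True/False mask picks out the matching elements.
big = a[a > 2]              # array([3, 4, 5])

# "Fancy" indexing: a list of integer positions picks elements in that order.
picked = a[[4, 0, 4]]       # array([4, 0, 4])

# Broadcasting: a (3, 1) column and a (4,) row combine into a (3, 4) grid.
col = np.arange(3).reshape(3, 1)
row = np.arange(4)
grid = col * 10 + row
# grid is [[ 0,  1,  2,  3],
#          [10, 11, 12, 13],
#          [20, 21, 22, 23]]
```

Note that slicing returns a view onto the original data, while boolean and fancy indexing return copies.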

A good numpy reference by Wes McKinney is here. There is also the Numpy manual: on indexing and on broadcasting.

Image play

Photographic images are examples of multidimensional arrays that are naturally visualized. Let us play with the images we've just taken.

Slicing/cropping
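A minimal sketch of cropping by slicing. It assumes a color image is stored as an array of shape (rows, columns, 3) with uint8 values, and uses a small synthetic array as a stand-in for one of our photos so that it runs without any image file.

```python
import numpy as np

# Stand-in for a photo: 4 rows x 6 columns x 3 color channels (RGB).
img = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

# Crop: keep rows 1..2 and columns 2..4, with all three channels.
crop = img[1:3, 2:5, :]

# A slice is a view that shares memory with img, so copy it
# before modifying if the original should stay untouched.
crop_copy = crop.copy()

# Downsample by keeping every second pixel in each direction.
small = img[::2, ::2]

# Flip left-right with a negative step along the column axis.
flipped = img[:, ::-1]

print(crop.shape, small.shape, flipped.shape)
```

With a real photo the same expressions apply; only the index ranges change.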

Changing colors
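A few color manipulations, sketched on a hand-made 2 x 2 image. The channel order is assumed to be red, green, blue along the last axis; the particular operations are illustrative choices, not necessarily the ones done in class.

```python
import numpy as np

# Stand-in image: 2 x 2 pixels, RGB, uint8.
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [200, 100, 50]]], dtype=np.uint8)

# Swap the red and blue channels with fancy indexing on the last axis.
bgr = img[:, :, [2, 1, 0]]

# Remove all green, working on a copy so img is unchanged.
no_green = img.copy()
no_green[:, :, 1] = 0

# Photographic negative: vectorized arithmetic on every pixel at once.
negative = 255 - img

# Grayscale by averaging the channels. Convert to float first,
# since sums of uint8 values could otherwise wrap around past 255.
gray = img.astype(float).mean(axis=2)
```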

Green screen substitution
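One possible approach, sketched under assumptions: call a pixel "green screen" if its green value exceeds both its red and blue values by some margin, then copy in the background at those pixels using a boolean mask. The function name substitute_green and the margin of 40 are hypothetical choices for illustration; real photos will need the threshold tuned.

```python
import numpy as np

def substitute_green(foreground, background, margin=40):
    """Replace green-dominated pixels of foreground with background pixels.

    Both images must have the same shape (rows, columns, 3).
    """
    fg = foreground.astype(int)          # avoid uint8 wraparound in r + margin
    r, g, b = fg[..., 0], fg[..., 1], fg[..., 2]
    mask = (g > r + margin) & (g > b + margin)   # shape (rows, columns)
    out = foreground.copy()
    out[mask] = background[mask]         # boolean indexing selects whole pixels
    return out, mask

# Tiny synthetic example: two green pixels, two non-green pixels.
foreground = np.array([[[0, 255, 0], [200, 50, 50]],
                       [[30, 200, 40], [100, 120, 100]]], dtype=np.uint8)
background = np.full((2, 2, 3), 9, dtype=np.uint8)

out, mask = substitute_green(foreground, background)
print(mask)
```

Here the 2-D mask indexes the first two axes of the 3-D image, so each True entry selects all three channels of that pixel.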