MTH 448/563 Data-Oriented Computing

Spring 2017

Day 1

Data and society

The explosion in data collection is transforming society.

A tension ...

2016_01_26/surveillance_govt.png 2016_01_26/surveillance_corporate.png

vs. useful tools that make life better, such as the East Coast storm this past weekend, visualized by Google Maps Traffic:

2016_01_26/Screenshot_2016-01-23-20-05-22.png 2016_01_26/Screenshot_2015-01-27-07-30-45.png

Data in the news almost every day.

Science

CERN LHC

2016_01_26/cern-lhc-aerial.jpg

VLA

2016_01_26/SKA.jpg

Classroom setup

Every day at the beginning of class we will quietly and quickly arrange the tables and chairs like this. At the end of class we will restore them to their original state.

Getting to know each other

Let's introduce ourselves.

Pure Python data wrangling

Exercise 1: Split-and-select

Extract price from an Amazon.com product page

gooselamp.png

First with browser-downloaded page

gooselamp_inspect_price.png split_select.png split_select_day01.png
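The split-and-select idea can be sketched on a small string. The HTML fragment below is a hypothetical stand-in for the saved page, and split_select is an illustrative helper name, not part of the course materials.

```python
# Hypothetical fragment standing in for the browser-downloaded product page
html = ('<div><span id="priceblock_ourprice" '
        'class="a-size-medium a-color-price">$29.04</span></div>')

def split_select(text, before, after):
    """Return the text between the first `before` marker and the next `after`."""
    # split(before)[1] is everything after the first marker;
    # split(after)[0] then keeps only what precedes the closing marker
    return text.split(before)[1].split(after)[0]

price = float(split_select(html, 'a-color-price">$', '</span>'))
print(price)  # 29.04
```

If the `before` marker is missing, the index [1] raises an IndexError, which is a reasonably loud signal that the page layout has changed.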

Then with requests

We want to eliminate the browser step ...

import requests
url = 'https://www.amazon.com/Union-61100-Outdoor-Garden-Statue/dp/B0027YPQEC'
s = requests.get(url)
'29.04' in s.text
True

Oh, it actually worked. Sometimes Amazon refuses to serve the page to a script (a "robot"). In that case we need to fake our User-Agent header so that the request looks like it comes from a browser.

ua_log.png

Now we can "spoof" the user agent:

spoof_ua.png
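A minimal sketch of the spoofing step, written so that nothing is sent over the network: it prints the User-Agent that requests would announce by default, then builds (but does not send) a request carrying a browser-like string. The product URL is just the example from these notes.

```python
import requests

# What requests announces itself as when no User-Agent is supplied
default_ua = requests.utils.default_user_agent()
print(default_ua)  # a string of the form 'python-requests/<version>'

# A browser-like User-Agent string
ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36')
url = 'https://www.amazon.com/dp/B0027YPQEC'

# Build the request without sending it, to inspect the outgoing headers
prepared = requests.Request('GET', url, headers={'User-Agent': ua}).prepare()
print(prepared.headers['User-Agent'])
```

To actually fetch the page, pass the same dictionary to requests.get(url, headers={'User-Agent': ua}).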

Finally, we can wrap everything up in a function that can retrieve the price of any product:

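import requests

def getprice(pid):
    """Return the listed price (as a float) for the Amazon product ID pid."""
    # Present a browser-like User-Agent so the request is not rejected as a robot
    ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
    url = 'https://www.amazon.com/dp/' + pid
    s = requests.get(url, headers={'User-Agent': ua})
    # Split-and-select: keep what follows the price marker, up to the closing tag
    pattern = '<span id="priceblock_ourprice" class="a-size-medium a-color-price">$'
    price = float(s.text.split(pattern)[-1].split('</span>')[0])
    return price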

getprice('B0027YPQEC')
29.04

Exercise 2: More play with text

Download this list of English words: http://blue.math.buffalo.edu/448/words.txt
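Once the file is saved locally, it can be read into a Python list. The sketch below assumes the file holds one word per line, which is an assumption about words.txt rather than something stated here; load_words is an illustrative name, and the demo uses a small temporary file instead of the real list.

```python
import tempfile
from pathlib import Path

def load_words(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    text = Path(path).read_text(encoding='utf-8')
    return [line.strip() for line in text.splitlines() if line.strip()]

# Demo on a tiny temporary file standing in for the downloaded word list
with tempfile.TemporaryDirectory() as d:
    sample = Path(d) / 'sample_words.txt'
    sample.write_text('drawer\nreward\n\nlevel\n', encoding='utf-8')
    demo = load_words(sample)

print(demo)  # ['drawer', 'reward', 'level']
```

With the real file saved in the working directory, the call would be words = load_words('words.txt').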

Exercise 2a: Sort the words in right-to-left alphabetical order

Hints:

w = 'drawer'
w[::-1]
'reward'

Note that set-membership can be tested much faster than list membership.
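The difference can be measured with a rough timing sketch. The list of strings below is synthetic, standing in for the word list, and the exact timings will vary from machine to machine.

```python
import timeit

# Synthetic stand-in for the word list: 100,000 distinct strings
words = [str(i) for i in range(100_000)]
word_set = set(words)

# Worst case for the list: the target is the last element
target = '99999'

# A list is scanned element by element; a set uses a hash lookup
t_list = timeit.timeit(lambda: target in words, number=100)
t_set = timeit.timeit(lambda: target in word_set, number=100)

print(f'list: {t_list:.4f} s   set: {t_set:.6f} s')
```

Both tests return the same answer; only the cost differs, and the gap grows with the length of the list.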

Exercise 2b: List all the palindromes

Exercise 2c: List all the reversible words

Useful Python features:

  • list element access and slicing with stride
  • string replace
  • list sort
  • functions: def and lambda
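Each of the features listed above can be tried on toy data. The lists and strings below are made up for illustration and are not taken from the exercises.

```python
# List element access and slicing with stride
letters = ['a', 'b', 'c', 'd', 'e', 'f']
print(letters[1])     # b
print(letters[1:4])   # ['b', 'c', 'd']
print(letters[::2])   # ['a', 'c', 'e']
print(letters[::-1])  # ['f', 'e', 'd', 'c', 'b', 'a']

# String replace returns a new string; the original is unchanged
line = 'price: $29.04'
cleaned = line.replace('$', '')
print(cleaned)        # price: 29.04

# list.sort works in place; key= chooses what to compare
words = ['pear', 'fig', 'banana']
words.sort()          # alphabetical: ['banana', 'fig', 'pear']
words.sort(key=len)   # by length:    ['fig', 'pear', 'banana']
print(words)

# The same sort key written as a named function and as a lambda
def last_letter(w):
    return w[-1]

by_last = sorted(words, key=last_letter)
by_last_lambda = sorted(words, key=lambda w: w[-1])
print(by_last)        # ['banana', 'fig', 'pear']
```

sorted returns a new list and leaves its argument alone, whereas list.sort modifies the list and returns None.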