Day 06

Thursday, Feb 16, 2017

Regex quiz

25 minutes

Long-term data-collection project

How to deal with pages with Javascript-generated content, like this? The text "777" that we see in the browser does not exist (explicitly) in the page source:

[Image: no777.png]
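To see why a plain download is not enough, here is a minimal sketch using a made-up page. The markup below is an assumption for illustration, not the actual content of the demo page. The number is computed by a script, so searching the raw source for "777" finds nothing.

```python
# Hypothetical page source: the visible number is computed by Javascript,
# so the digits "777" never appear literally in the HTML.
page_source = """<html>
<body>
<p class="foo"></p>
<script>
document.querySelector('.foo').textContent = 7 * 111;
</script>
</body>
</html>"""

# This is all that a plain download plus a regex would have to work with.
found_in_source = '777' in page_source
print(found_in_source)   # False
```

A browser runs the script and fills in the paragraph, which is why the text is visible on screen but absent from the source.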

Selenium (see the "Selenium with Python" documentation) automates browsing.

With the following code, we can launch Firefox, go to the Google search page, and do a search:

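from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Launch Firefox and open the Google search page.
driver = webdriver.Firefox()
driver.get('http://google.com')

# The search box is the input element named 'q'.
# find_element(By.NAME, ...) replaces find_element_by_name(),
# which was removed in Selenium 4.3.
searchbox = driver.find_element(By.NAME, 'q')
searchbox.send_keys('goose lamp')
searchbox.send_keys(Keys.RETURN)
time.sleep(5)  # keep the results on screen for a few seconds
driver.quit()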

With the following code, we can grab the Javascript-generated text in the page javascript_demo.html:

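from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
#driver = webdriver.PhantomJS()   # headless alternative (older Selenium versions only)

driver.get('http://blue.math.buffalo.edu/448/javascript_demo.html')
# elt.text is the text as rendered in the browser, after the Javascript has run.
elt = driver.find_element(By.CLASS_NAME, 'foo')
with open('foo.txt', 'w') as f:
    f.write(elt.text)
driver.quit()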

[Image: 777.png]

CSV (comma-separated values)

For simple cases like First Names data, read-split-select works fine
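Read-split-select can be sketched as follows. The sample rows are invented for illustration and only assume a "name,sex,count" layout; they are not taken from the First Names data.

```python
# Made-up sample in the shape "name,sex,count" (an assumption about the layout).
text = """Emma,F,20355
Olivia,F,19553
Noah,M,19144
Liam,M,18342
"""

names = []
for line in text.splitlines():        # read: one record per line
    fields = line.split(',')          # split: break the record at commas
    if fields[1] == 'F':              # select: keep only the rows we want
        names.append(fields[0])

print(names)   # ['Emma', 'Olivia']
```

This approach breaks as soon as a field contains a comma inside quotes.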

For more elaborate CSV files, there are two options:

csv module

  1. csv.reader() (optional arguments: delimiter=, quotechar=, quoting=csv.QUOTE_NONNUMERIC), next()
  2. csv.DictReader()
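A minimal sketch of both readers, run on a made-up three-line file held in a string (the column names and values are assumptions for illustration). The quoted address contains a comma, which is exactly the case where a plain split fails.

```python
import csv
import io

# Made-up data: the address field contains a comma, so it is quoted.
text = '''id,address,year
1,"12 Main St, Apt 3",2015
2,"400 Oak Ave",2016
'''

# csv.reader yields each row as a list of strings; next() pulls off the header.
reader = csv.reader(io.StringIO(text), delimiter=',', quotechar='"')
header = next(reader)
rows = list(reader)
print(header)     # ['id', 'address', 'year']
print(rows[0])    # ['1', '12 Main St, Apt 3', '2015']

# csv.DictReader uses the header row as keys, so fields are accessed by name.
records = list(csv.DictReader(io.StringIO(text)))
print(records[0]['address'])   # 12 Main St, Apt 3

# A naive split cuts the quoted field in two.
naive = text.splitlines()[1].split(',')
print(len(naive))              # 4
```

With a real file, pass the object returned by open(filename, newline='') in place of io.StringIO(text).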

Pandas

pandas.read_csv(filename) (optional arguments: sep=, header=None, skiprows=, ...)

We will learn lots more about Pandas later.

Exercise:

CSV1. Make a plot of all the abandoned buildings in Chicago using csv.DictReader() to access the CSV file you download from here.

or

CSV2. Make a plot of the locations of all earthquakes in Northern California so far this year, using a CSV export from this site.

[Images: abandoned_buildings_head.png, abandoned_pandas_df_head.png, abandoned_plot1.png, abandoned_plot2.png]

JSON

Supplementary reference: Jennifer Widom database lectures.

Examples of use: Thunderbird email data, IPython Notebooks

object = dictionary (string:value)
array = list
value = string/number/object/array/true/false/null

json module: dumps, loads
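A minimal round trip through the two functions, using a made-up record: dumps turns a Python object into JSON text, and loads turns JSON text back into a Python object.

```python
import json

# A Python structure built only from JSON-compatible types (made-up data).
record = {'name': 'goose lamp', 'tags': ['bird', 'light'], 'price': 19.5}

s = json.dumps(record)        # Python object -> JSON text (a str)
back = json.loads(s)          # JSON text -> Python object

print(type(s).__name__)       # str
print(back == record)         # True
```

The sibling functions json.dump and json.load do the same thing with an open file instead of a string.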

Cautions:

  1. JSON keys are always strings (Python dictionary keys need not be)
  2. JSON text is almost pasteable as Python code, but JSON "true/false/null" map to Python "True/False/None"
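Both cautions can be checked directly. This is a small sketch with made-up values:

```python
import json

# Caution 1: non-string keys are converted to strings on the way out.
d = {1: 'one', 2: 'two'}
round_trip = json.loads(json.dumps(d))
print(round_trip)                          # {'1': 'one', '2': 'two'}

# Caution 2: JSON literals are spelled differently from Python's.
as_json = json.dumps([True, False, None])
print(as_json)                             # [true, false, null]
as_python = json.loads('[true, false, null]')
print(as_python)                           # [True, False, None]
```

So a dictionary with integer keys does not survive a dumps/loads round trip unchanged.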

JSON Exercise (30 minutes)

Write an "Escape Route from IPython Notebook", i.e. a program that extracts all the Python code and annotations from an IPython Notebook and writes a corresponding executable plain Python script.