MTH 448/563 Data-Oriented Computing

Fall 2019

Day 5: XML and HTML in particular

Today we will look at XML, the second of three major plain-text data markup languages in use today (CSV, XML, JSON).

But first a cautionary word about CSV ...

CSV

CSV is not entirely simple. For example, a field may contain the separator (often a comma) as long as the field is quoted, like this:

number,name,value
1,"Shakespeare, William",999
5,J.S. Bach,998

A naive parser, such as the one we wrote for the baby names files, would fail on this. It is better to use a robust parser such as:

  • pandas.read_csv()
  • csv.reader()
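
To make the difference concrete, here is a minimal sketch that parses the three-line sample above twice: once by splitting on commas, and once with the standard-library csv.reader(). The variable names are just for illustration.

```python
import csv
import io

# The sample data from above, held in a string instead of a file
text = 'number,name,value\n1,"Shakespeare, William",999\n5,J.S. Bach,998\n'

# Naive approach: split each line on every comma
naive_rows = [line.split(',') for line in text.splitlines()]

# Robust approach: csv.reader understands quoted fields
robust_rows = list(csv.reader(io.StringIO(text)))

print(naive_rows[1])   # ['1', '"Shakespeare', ' William"', '999']
print(robust_rows[1])  # ['1', 'Shakespeare, William', '999']
```

The naive split yields four fields for the Shakespeare row, while csv.reader() returns the expected three and strips the quotes.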

csv parser examples

nonsimple.csv_parsers.png

XML (eXtensible Markup Language)

The bad old days of ad-hoc formats (GIF format spec)

XML is a "meta-format": it provides a standard structure for data formats.

Wikipedia says: Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a [self-describing] format which is both human-readable and machine-readable.

Nice Introduction at w3schools.com

Example of data in an XML format:

<note>
  <to>John</to>
  <from>self</from>
  <heading>Reminder</heading>
  <body>Don't forget you have to grade 448/563 reports this weekend!</body>
</note>

Each element has an opening tag and a closing tag:

<thing> ... </thing>

unless it is self-closing:

<funnything hilarity="99.9" />

Elements can have child elements:

<thing>
        <part> ... </part>
        <part> ... </part>
</thing>

Elements can have attributes (the values must be quoted):

<thing color="blue" urgency="extreme">
        <part> ... </part>
        <part> ... </part>
</thing>

XML documents are element trees: elements contain other elements or text:

<thing color="blue" urgency="extreme">
        <temperature>35.5</temperature>
</thing>
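
The tree structure can be explored from Python. The following is a small sketch using the standard-library xml.etree.ElementTree module (one of several parsers that could be used here) on the element shown above.

```python
import xml.etree.ElementTree as ET

xml_text = '''<thing color="blue" urgency="extreme">
    <temperature>35.5</temperature>
</thing>'''

# Parse the string; the result is the root element of the tree
root = ET.fromstring(xml_text)

print(root.tag)      # thing
print(root.attrib)   # {'color': 'blue', 'urgency': 'extreme'}

# Iterating over an element yields its child elements
child_tags = [child.tag for child in root]
print(child_tags)    # ['temperature']

# Element text is always a string, so convert it explicitly
temperature = float(root.find('temperature').text)
print(temperature)   # 35.5
```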

Quandary: attributes vs. nested elements
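
As an illustration of the choice, the sketch below stores the same temperature once as an attribute and once as a child element, and reads both back with xml.etree.ElementTree. Both documents are made up for this example.

```python
import xml.etree.ElementTree as ET

# Same datum, two encodings
as_attribute = ET.fromstring('<thing temperature="35.5" />')
as_element = ET.fromstring('<thing><temperature>35.5</temperature></thing>')

# Attribute values are read with .get(); child text with .findtext()
t_from_attribute = float(as_attribute.get('temperature'))
t_from_element = float(as_element.findtext('temperature'))

print(t_from_attribute == t_from_element)  # True
```

Either works for a single value. An attribute name can appear only once per element and its value is a flat string, whereas child elements can repeat and carry their own children and attributes.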

Examples of XML formats

GPX

GPX is an XML format for exchanging GPS data (Example: Winnipeg.gpx recorded using MyTracks on my phone, Dec 31, 2015.)
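
Reading track points out of a GPX file might look like the sketch below. The snippet is a tiny hand-written stand-in, not the contents of Winnipeg.gpx, and the coordinates and elevations are invented. It assumes the GPX 1.1 namespace; a file written with a different GPX version would need a different namespace URI.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a GPX file (values are invented)
gpx_text = '''<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1" creator="example">
  <trk>
    <trkseg>
      <trkpt lat="49.8951" lon="-97.1384"><ele>232.0</ele></trkpt>
      <trkpt lat="49.8955" lon="-97.1379"><ele>233.5</ele></trkpt>
    </trkseg>
  </trk>
</gpx>'''

root = ET.fromstring(gpx_text)

# GPX elements live in a namespace, so searches must name it
ns = {'gpx': 'http://www.topografix.com/GPX/1/1'}

points = []
for trkpt in root.iterfind('.//gpx:trkpt', ns):
    lat = float(trkpt.get('lat'))                   # stored as an attribute
    lon = float(trkpt.get('lon'))                   # stored as an attribute
    ele = float(trkpt.find('gpx:ele', ns).text)     # stored as a child element
    points.append((lat, lon, ele))

print(points)
```

Note that this format uses both attributes (lat, lon) and nested elements (ele) for data about the same point.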

winnipeg_Screenshot_2016-02-22-18-15-01.png

KML

KML for plotting things on Google Earth (Example: NCEDC earthquakes)

ncedc_kml.png

SVG

SVG for vector graphics

<svg
   xmlns="http://www.w3.org/2000/svg"
   width="500" height="500">
  <circle cx="50" cy="50" r="40" stroke="gray" stroke-width="8" fill="#77cc77" />
</svg>

Exercise: copy the above into a plain text file and call it anythingyoulike.svg. Then open it in your browser.

Office formats

Microsoft Office files (.docx, .xlsx, etc.), and Libre/OpenOffice files, are (zipped) bundles of XML documents.

Exercise: Make or take a Word or Writer document, rename it to something.zip, unzip it, and observe the XML!
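
The same inspection can be done without renaming anything, since Python's zipfile module opens such a file directly. Here is a minimal sketch; the helper name list_xml_members and the filename in the usage comment are hypothetical.

```python
import zipfile

def list_xml_members(path_or_file):
    """Return the names of the .xml members inside a zip-based office document."""
    with zipfile.ZipFile(path_or_file) as archive:
        return [name for name in archive.namelist() if name.endswith('.xml')]

# Usage (hypothetical filename):
# for name in list_xml_members('report.docx'):
#     print(name)
```

Any member can then be read with archive.read(name) and handed to an XML parser.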

Ad-hoc XML formats

Many sources provide data in their own ad-hoc XML format. Example: real-time Chicago bus information

HTML

HTML - the language of the web!

Exercise

Make your own tiny HTML document that contains a table.

Exercise

For each ASL course at UB this semester, get the timeslot, the number of students registered, and the number of empty seats.

Unlike the Amazon item price, it is not very easy to locate the chunks of text we want.

Quandary: how are we going to extract the desired data?

Possible easy solution: pandas.read_html()
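
As a rough sketch of what pandas.read_html() does, the example below runs it on a small hand-written table. The course rows are invented and are not taken from the UB schedule. It assumes pandas is installed together with an HTML parser backend such as lxml.

```python
import io
import pandas as pd

# Hand-written table with invented values
html = '''
<table>
  <thead>
    <tr><th>Course</th><th>Time</th><th>Enrolled</th><th>Open</th></tr>
  </thead>
  <tbody>
    <tr><td>ASL 101</td><td>MWF 9:00</td><td>25</td><td>5</td></tr>
    <tr><td>ASL 102</td><td>TR 11:00</td><td>18</td><td>12</td></tr>
  </tbody>
</table>
'''

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(io.StringIO(html))
df = tables[0]

print(len(tables))
print(df)
```

read_html() also accepts a URL. Whether it copes with a particular real page depends on how that page's tables are laid out, so the result always needs checking.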

More general solution: bs4 module or lxml module

XML parsing with lxml.etree

url = 'http://www.buffalo.edu/class-schedule?switch=showcourses&semester=fall&division=UGRD&dept=ASL'

import requests
s = requests.get(url).text

from lxml import etree
doc = etree.fromstring( s )

Unfortunately all the UB class schedule pages are malformed XML!
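
One possible workaround within lxml is to pass an etree.HTMLParser(), which tries to recover from errors instead of rejecting the document. The sketch below uses a small hand-written fragment with unclosed td tags as a stand-in for the real page; the cell contents are invented.

```python
from lxml import etree

# Stand-in for a malformed page: the <td> and <tr> tags are never closed
broken = '<html><body><table><tr><td>ASL 101<td>25</table></body></html>'

# The default (strict XML) parser rejects it
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False
print(strict_ok)   # False

# The HTML parser repairs the structure as it goes
doc = etree.fromstring(broken, etree.HTMLParser())
cells = doc.xpath('//td/text()')
print(cells)       # ['ASL 101', '25']
```

How well the recovery works on a given real page is not guaranteed, so the extracted data should be checked against the page itself.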

HTML parsing with bs4

bs4 is a bit more tolerant of badly formed XML.

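import bs4

# Parse the page text s leniently, using lxml as the underlying parser
b = bs4.BeautifulSoup(s, 'lxml')
count = 0
# Look only at the direct children of <body>
for child in b.body:
    if child.name == 'table':
        count += 1
        print('table found')
print(count, 'tables found')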

Elements can be iterated over.

Element name can be accessed as element.name.

Attributes of an element can be read using dictionary-style indexing, e.g. element['id'].
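
These three points can be tried out on a small fragment. The sketch below uses an invented table rather than the UB page, and it uses Python's built-in 'html.parser' backend instead of 'lxml' only so that it runs without lxml installed.

```python
import bs4

# Invented fragment for illustration
html = ('<table id="courses" class="schedule">'
        '<tr><td>ASL 101</td><td>25</td></tr>'
        '</table>')

soup = bs4.BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# Element name
print(table.name)            # table

# Attributes: dictionary-style indexing
print(table['id'])           # courses
print(table['class'])        # ['schedule']  (class can hold several values)
print(table.get('summary'))  # None, because the attribute is absent

# Iteration: collect the text of every <td> inside the table
cells = [td.get_text() for td in table.find_all('td')]
print(cells)                 # ['ASL 101', '25']
```

Indexing with a missing attribute name raises KeyError, just as with a dictionary, which is why .get() is used for the absent attribute.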