MTH 448/563 Data-Oriented Computing

Fall 2019

Day 5: XML and HTML in particular

Today we will look at XML, the second of three major plain-text data formats in use today (CSV, XML, JSON).

But first a cautionary word about CSV ...


CSV is not totally simple. For example, it allows the separator (often a comma) to appear within the data, as long as the field is quoted, like this:

1,"Shakespeare, William",999
5,J.S. Bach,998

A naive parser, such as the one we wrote for the baby names files, would fail on this. I recommend using a robust parser such as:

  • pandas.read_csv()
  • csv.reader()

csv parser examples
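
To make the warning concrete, here is a minimal sketch that runs a naive comma split and the standard library's csv.reader on the two sample rows above. The variable names are made up for this illustration.

```python
import csv
import io

# The two sample rows from above, as they would appear in a file.
text = '1,"Shakespeare, William",999\n5,J.S. Bach,998\n'

# Naive approach: split every line on commas.
naive_rows = [line.split(',') for line in text.splitlines()]

# Robust approach: csv.reader understands quoted fields.
robust_rows = list(csv.reader(io.StringIO(text)))

print(naive_rows[0])   # the quoted name is wrongly cut into two pieces
print(robust_rows[0])  # the quoted name stays in one field
```

The naive split produces four fields for the first row instead of three, which is exactly the kind of silent damage a robust parser avoids.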


XML (eXtensible Markup Language)

The bad old days of ad-hoc formats (GIF format spec)

XML is a "meta-format": it provides a standard structure for data formats.

Wikipedia says: Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a [self-describing] format which is both human-readable and machine-readable.

Nice Introduction at

Example of data in an XML format:

  <body>Don't forget you have to grade 448/563 reports this weekend!</body>

Each element has an opening tag and a closing tag:

<thing> ... </thing>

unless it is self-closing:

<funnything hilarity="99.9" />

Elements can have child elements:

<thing>
        <part> ... </part>
        <part> ... </part>
</thing>

Elements can have attributes (values must be quoted):

<thing color="blue" urgency="extreme">
        <part> ... </part>
        <part> ... </part>
</thing>

XML documents are element trees: elements contain other elements or text:

<thing color="blue" urgency="extreme">
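
The tree structure can be explored from Python. The sketch below uses xml.etree.ElementTree from the standard library (not lxml, though lxml.etree offers a largely compatible interface). The document text, including the contents of the part elements, is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# A small made-up document in the spirit of the examples above.
xml_text = """
<thing color="blue" urgency="extreme">
    <part>wheel</part>
    <part>axle</part>
    <funnything hilarity="99.9" />
</thing>
"""

root = ET.fromstring(xml_text)

print(root.tag)     # name of the root element
print(root.attrib)  # its attributes, as a dict of strings

# Iterating over an element yields its child elements in document order.
for child in root:
    print(child.tag, child.attrib, child.text)
```

Note that attribute values always come back as strings, so "99.9" would need float() before doing arithmetic with it.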

Quandary: attributes vs. nested elements
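
One way to get a feel for the trade-off is to encode the same information both ways and see how the access code differs. This is only a sketch using the standard library's xml.etree.ElementTree; both snippets of XML are invented for the comparison.

```python
import xml.etree.ElementTree as ET

# The same (made-up) information, encoded two ways.
as_attributes = '<thing color="blue" urgency="extreme" />'
as_children = '<thing><color>blue</color><urgency>extreme</urgency></thing>'

a = ET.fromstring(as_attributes)
c = ET.fromstring(as_children)

# Attributes are looked up by name; a name can appear at most once per element.
color_from_attribute = a.get('color')

# Child elements are found by tag; they can repeat and can nest further.
color_from_child = c.find('color').text

print(color_from_attribute, color_from_child)
```

Attributes are compact but limited to a single string per name, while child elements cost more typing but can repeat or carry structure of their own.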

Examples of XML formats


GPX, an XML format for exchanging GPS data (Example: Winnipeg.gpx recorded using MyTracks on my phone, Dec 31, 2015.)



KML for plotting things on Google Earth (Example: NCEDC earthquakes)



SVG for vector graphics

<svg xmlns="http://www.w3.org/2000/svg"
   width="500" height="500">
  <circle cx="50" cy="50" r="40" stroke="gray" stroke-width="8" fill="#77cc77" />
</svg>

Exercise: copy the above into a plain text file and call it anythingyoulike.svg. Then open it in your browser.
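
Since SVG is just XML, the same parsing tools apply to it. The sketch below reads the circle example with the standard library's xml.etree.ElementTree. The enclosing svg element with its xmlns attribute is an assumption about how the complete file looks.

```python
import xml.etree.ElementTree as ET

# The circle example, wrapped in an <svg> root element (assumed here).
svg_text = """<svg xmlns="http://www.w3.org/2000/svg" width="500" height="500">
  <circle cx="50" cy="50" r="40" stroke="gray" stroke-width="8" fill="#77cc77" />
</svg>
"""

root = ET.fromstring(svg_text)
circle = root[0]  # first (and only) child of the root element

print(root.tag)         # ElementTree prefixes the tag with the namespace in braces
print(circle.get('r'))  # attribute values are strings
```

The namespace prefix in the tag name is a common surprise when searching namespaced XML formats such as SVG, GPX, or KML by tag.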

Office formats

Microsoft Office files (.docx, .xlsx, etc.) and Libre/OpenOffice files are (zipped) bundles of XML documents.

Exercise: Make or take a Word or Writer document, rename it to, unzip it, and observe the XML!
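
The same inspection can be done from Python with the standard library's zipfile module, which does not care about the file extension. In this sketch the helper list_members is a made-up name, and a tiny archive built in memory stands in for a real document. Its member name and content are invented.

```python
import io
import zipfile

def list_members(file_or_path):
    """Return the names of the files bundled inside a zip-based document."""
    with zipfile.ZipFile(file_or_path) as z:
        return z.namelist()

# Stand-in for a real document: a tiny zip archive built in memory.
# The member name and content are invented for this demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('content.xml', '<doc><p>Hello</p></doc>')

names = list_members(buffer)
print(names)

# Reading one member back gives the raw XML as bytes.
with zipfile.ZipFile(buffer) as z:
    xml_bytes = z.read('content.xml')
print(xml_bytes.decode('utf-8'))
```

With a real file, pass its path instead of the buffer, for example list_members('report.docx'), where the filename is hypothetical.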

Ad-hoc XML formats

Many sources provide data in their own ad-hoc XML format. Example: real-time Chicago bus information


HTML - the language of the web!


Make your own tiny HTML document that contains a table.


For each ASL course at UB this semester, get the timeslot, the number of students registered, and the number of empty seats.

Unlike the Amazon item price, it is not very easy to locate the chunks of text we want.

Quandary: how are we going to extract the desired data?

Possible easy solution: pandas.read_html()

More general solution: bs4 module or lxml module

XML parsing with lxml.etree

url = ''

import requests
s = requests.get(url).text

from lxml import etree
doc = etree.fromstring( s )

Unfortunately all the UB class schedule pages are malformed XML!
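
To see what "malformed" means in practice, here is a small sketch of a strict XML parser rejecting typical web-page markup. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (which reports such problems with its own exception type), and the HTML snippet is invented.

```python
import xml.etree.ElementTree as ET

# Typical of real-world HTML: the <br> tag is never closed.
sloppy_html = '<html><body><p>Line one<br>Line two</p></body></html>'

try:
    ET.fromstring(sloppy_html)
    parsed_ok = True
except ET.ParseError as err:
    # A strict XML parser gives up at the first mismatched tag.
    parsed_ok = False
    print('not well-formed XML:', err)
```

Browsers quietly repair this kind of markup, which is why so many pages that display correctly still fail strict XML parsing.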

HTML parsing with bs4

bs4 is a bit more tolerant of badly formed XML.

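import bs4

# Parse the page text s (fetched above) using the lxml backend.
b = bs4.BeautifulSoup(s, 'lxml')
count = 0
for child in b.body:
    # Strings between tags have name None, so only real <table> tags match.
    if child.name == 'table':
        count += 1
        print('table found')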

Elements can be iterated over.

Element name can be accessed as the .name attribute (for example, child.name).

Attributes of elements can be read using dictionary-style indexing.
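
The three points above can be tried out on a tiny page. This is a sketch only: the HTML, including the course name and timeslot, is invented, and it uses Python's built-in 'html.parser' backend rather than 'lxml' so that it does not depend on lxml being installed.

```python
import bs4

# A made-up page containing one small table.
html = """
<html><body>
<table id="schedule">
  <tr><td>ASL 101</td><td>MWF 9:00</td></tr>
</table>
</body></html>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# Iterating over an element yields its children: tags, and also the
# bare strings (here just whitespace) that sit between them.
names = [child.name for child in soup.body]
print(names)

table = soup.body.table
print(table.name)           # element name
print(table['id'])          # attribute via dictionary-style indexing
print(table.get('border'))  # .get() returns None for a missing attribute

# Text of every cell in the table, in document order.
cells = [td.get_text() for td in table.find_all('td')]
print(cells)
```

Indexing with a missing attribute name raises KeyError, just as with a dictionary, so .get() is the safer choice when an attribute may be absent.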