Fall 2019

# Day 8: XML schema, validation, regular expressions

## Todays topics

• lxml
• XML schemas
• regular expressions
• Report #2 help

Quiz "are they intensive?": Of US counties: (i) maximum distance from state capital (ii) average distance from state capital.

# XML Schema

Validation of an XML document against a schema

My simple example XML document: mydoc.xml

<!-- THIS IS MY HAND-WRITTEN XML DOCUMENT CONFORMING TO A SCHEMA -->

<mycourses
xmlns="http://blue.math.buffalo.edu/463/mycourses"
version="0.1" >

<course>
<number>MTH 448</number>
<name>Data-Oriented Computing</name>
<semester>201909</semester>
</course>

<course>
<number>MTH 448</number>
<name>Data-Oriented Computing</name>
<semester>201809</semester>
</course>

<course>
<number>MTH 649</number>
<name>Partial Differential Equations</name>
<semester>200809</semester>
</course>

</mycourses>


Exercise, part 1: (results will be collected). Type up an XML document of your own invention from scratch. Let's say you want to record a stream of notes from you to yourself. Or a set of any other kind of objects, a contacts list, movies you like, etc. - you can choose anything you like. Invent an appropriate set of tags for the task.

Syntactic validation (i.e. testing that the XML is well-formed), can be done with firefox, some text editors, as well as more powerful tooks.

Semantic validation: can be done with the free-standing program xmllint from libxml2-utils, and within Python using lxmlschema among other modules.

## Imposing semantic rules with XML Schema

My schema document: myxmls.xsd

<?xml version="1.0"?>
<!-- THIS IS ALMOST MY FIRST XML SCHEMA -->

<!--    "http://www.w3.org/2001/XMLSchema" is a magic phrase, like "Open, Sesame".
No connection is being made to that website.

"http://blue.math.buffalo.edu/463/mycourses"
The mandated use of a URL here as the namespace name is intended to ensure
that namespace names are unique.
-->

<schema
xmlns              ="http://www.w3.org/2001/XMLSchema"
targetNamespace    ="http://blue.math.buffalo.edu/463/mycourses"
>

<element name="mycourses">
<complexType>
<sequence>
<element name="course" maxOccurs="unbounded">
<complexType>
<all>
<element name="number">
<simpleType>
<restriction base="string">
<pattern value="[A-Z]{3} [0-9]{3}"/>  <!-- "number" must match this regular expression -->
</restriction>
</simpleType>
</element>
<element name="name"     type="string" />
<element name="semester" type="integer" />
</all>
</complexType>
</element>
</sequence>
<attribute name="version" type="decimal" use="required" />
</complexType>
</element>

</schema>


Now we validate the xml data document against the xsd schema document:

import xmlschema
xmlschema.validate('mydoc.xml','myschema.xsd')


If we modify the data so that it no longer conforms to the schema, xmlschema.validate will tell us.

<course>
<number>MTH 448</number>
<name>Data-Oriented Computing</name>
<semester>201809!</semester>
</course>

XMLSchemaDecodeError: failed validating '201909!' with XsdAtomicBuiltin(name='xs:integer'):
Reason: invalid literal for int() with base 10: '201909!'


or

<course>
<number>MTH 448</nombre>
<nombre>Data-Oriented Computing</nombre>
<semester>201809</semester>
</course>

XMLSchemaChildrenValidationError: failed validating <Element '{http://blue.math.buffalo.edu/463/mycourses}course' at 0x7fc9687ab3b8> with XsdGroup(model='all', occurs=[1, 1]):
Reason: Unexpected child with tag '{http://blue.math.buffalo.edu/463/mycourses}nombre' at position 1. Tag (number | name | semester) expected.


Note that with both of the changes above, we still had well-formed XML.

Exercise part 2:

Write a schema for your own XML document. Make sure your document "validates" against your schema. Then "break" your documents in several small ways to see how the validator responds. I will be asking you to turn in your xml and xsd documents.

References: my examples above, and w3schools

Note: you can even embed "regular expressions" in an XML schema: myxmlre.xml, myxmlre.xsd

Assignment is due 2:00pm Monday, September 30. A type must be specified for all atomic elements. Use a regular expression to restrict at least one element. Use both "required" and "maxOccurs" attributes where logic dictates and at least once.

# Regular expressions

"Regular expressions" is a tool for finding things in text when what you're looking for is not one specific thing, but rather one of any number of possible things that match some kind of pattern.

Many useful "cheatsheets" available online, like this one

import re
results = re.findall(expr,string)


Greedy vs. lazy matching: .?, .*

Cautions:

• May need multiple backslashes to get backslashes into search strings.
• "." does not include line terminators like newline by default: use re.DOTALL.
re.findall(expr,string,re.DOTALL)


# Help with Report 2: Story told with choropleth map(s) of US

Due 8:00am Saturday, Sep 28.

## Find some interesting data for every county in the US

Search online for some data given for every county in the nation.

You may want to consult this page about the FIPS codes for counties. Fall-back suggestion if you don't find anything else you like: annual unemployment data from the US Bureau of Labor Statistics.

Note: You don't have to use a linear color map: consider if a logarithmic one would be helpful.

## Color the counties on the map according to your data

This may require some thought, some research, and some work. Although you are welcome to consider elaborate color maps (even 2D maps representing two quantities), you should beware of creating graphics that are strikingly colorful but difficult for the viewer to interpret. Simple things like a white-to-red gradient can be very effective: the picture above shows the fraction of the population that is of African descent (generated by a student in a previous run of this course).