xml | Reliably Broken

I figured out why my XML parsing code works fine using the [pure-Python ElementTree XML parsing module][elementtree] but fails when using [the speedy and memory-optimized cElementTree XML parsing module][celementtree].

[The XPath 1.0 specification][xpath] says `’.’` is short-hand for `’self::node()’`, selecting a node itself.

Parsing an XML document and selecting the context node with ElementTree in Python 2.5:

>>> from xml.etree import ElementTree
>>> ElementTree.VERSION
‘1.2.6’
>>> doc = “BUG”
>>> node1 = ElementTree.fromstring(doc).find(‘./Example’)
>>> node1

>>> node1.find(‘.’)

>>> node1.find(‘.’) == node1
True

See how the result of `node1.find(‘.’)` is the node itself? [As it should be][selfnode].

Parsing an XML document and selecting the context node with cElementTree in Python 2.5:

>>> from xml.etree import cElementTree
>>> doc = “BUG”
>>> node2 = cElementTree.fromstring(doc).find(‘./Example’)
>>> node2

>>> node2.find(‘.’)
>>> node2.find(‘.’) == node2
False

Balls. The result of `node2.find(‘.’)` is `None`.

However! I have a kludgey work-around that works whether you use ElementTree or cElementTree. Use `’./’` instead of `’.’`:

>>> node1.find(‘./’)

>>> node1.find(‘./’) == node1
True
>>> node2.find(‘./’)

>>> node2.find(‘./’) == node2
True

*Kludgey because `’./’` is not a valid XPath expression.*

So we are back on track. Also works for Python 2.6 which has the same version of ElementTree.

Fortunately Python 2.7 got a new version of ElementTree and the bug is fixed:

>>> from xml.etree import ElementTree
>>> ElementTree.VERSION
‘1.3.0’
>>> doc = “BUG”
>>> node3 = ElementTree.fromstring(doc).find(‘./Example’)
>>> node3

>>> node3.find(‘.’)

>>> node3.find(‘.’) == node3
True

However! They also fixed my kludgey work-around:

>>> node3.find(‘./’)
>>> node3.find(‘./’) == node3
False

So I can’t code something that works for all three versions. This is annoying. I was hoping to just replace ElementTree with the C version, makes my code run in one third the time (the XML parts of it run in one tenth the time). And cannot install any compiled modules – the code can only rely on Python 2.5’s standard library.

[celementtree]: http://effbot.org/zone/celementtree.htm
[elementtree]: http://effbot.org/zone/element-index.htm
[xpath]: http://www.w3.org/TR/xpath/
[selfnode]: http://www.w3.org/TR/xpath/#path-abbrev

At work we have several [Filemaker Pro][fmp] databases. I have been slowly working through these, converting them to Web-based applications using [the Django framework][django]. My primary motive is to replace an overly-complicated Filemaker setup running on four Macs with a single 2U rack-mounted server running [Apache][apache] on [FreeBSD][fbsd].

At some point in the process of re-writing each database for use with Django I have needed to convert all the records from Filemaker to Django. There exist good [Python][python] libraries for [talking to Filemaker][pyfmp] but they rely on the XML Web interface, meaning that you need Filemaker running and set to publish the database on the Web while you are running an import.

In my experience [Filemaker’s built-in XML publishing interface][fmpxml] is too slow when you want to migrate tens of thousands of records. During development of a Django-based application I find I frequently need to re-import the records as the new database schema evolves – doing this by communicating with Filemaker is tedious when you want to re-import the data several times a day.

So my approach has been to export the data from Filemaker as XML using [Filemaker’s FMPXMLRESULT][fmpxmlresult] format. The Filemaker databases at work are _old_ (Filemaker 5.5) and perhaps things have improved in more recent versions but Filemaker 5/6 is a very poor XML citizen. When using the FMPDSORESULT format (which has been dropped from more recent versions) it will happily generate invalid XML all over the shop. The FMPXMLRESULT format is better but even then it will emit invalid XML if the original data happens to contain funky characters.

So here is [filemaker.py, a Python module for parsing an XML file produced by exporting to FMPXMLRESULT][dave] format from Filemaker.

To use it you create a sub-class of the `FMPImporter` class and over-ride the `FMPImporter.import_node` method. This method is called for each row of data in the XML file and is passed an XML node instance for the row. You can convert that node to a more useful dictionary where keys are column names and values are the column values. You would then convert the data to your Django model object and save it.

A trivial example:

import filemaker

class MyImporter(filemaker.FMPImporter):
def import_node(self, node):
node_dict = self.format_node(node)
print node[‘RECORDID’], node_dict

importer = MyImporter(datefmt=’%d/%m/%Y’)
filemaker.importfile(‘/path/to/data.xml’, importer=importer)

The `FMPImporter.format_node` method converts values to an appropriate Python type according to the Filemaker column type. Filemaker’s `DATE` and `TIME` types are converted to Python [`datetime.date`][dtdate] and [`datetime.time`][dttime] instances respectively. `NUMBER` types are converted to Python `float` instances. Everything else is left as strings, but you can customize the conversion by over-riding the appropriate methods in your sub-class (see the source for the appropriate method names).

In the case of Filemaker `DATE` values you can pass the `datefmt` argument to your sub-class to specify the date format string. See Python’s [time.strptime documentation][strptime] for the complete list of the format specifiers.

The code uses [Python’s built-in SAX parser][pysax] so that it is efficent when importing huge XML files (the process uses a constant 15 megabytes for any size of data on my Mac running Python 2.5).

Fortunately I haven’t had to deal with Filemaker’s repeating fields so I have no idea how the code works on repeating fields. Please let me know if it works for you. Or not.

[Download filemaker.py][dave]. This code is released under a 2-clause BSD license.

[dave]: http://reliablybroken.com/b/wp-content/uploads/2009/11/filemaker.py
[strptime]: http://docs.python.org/library/time.html#time.strftime
[fmp]: http://www.filemaker.com/
[django]: http://www.djangoproject.com/
[apache]: http://httpd.apache.org/
[fbsd]: http://www.freebsd.org/
[python]: http://www.python.org/
[pyfmp]: http://code.google.com/p/pyfilemaker/
[fmpxml]: http://www.filemaker.com/support/technologies/xml
[fmpxmlresult]: http://www.filemaker.com/help/html/import_export.16.30.html#1029660
[dtdate]: http://docs.python.org/library/datetime.html#date-objects
[dttime]: http://docs.python.org/library/datetime.html#time-objects
[pysax]: http://docs.python.org/library/xml.sax.html

Reliably Broken

It's a blog: let's do funch!

Tag Archives: xml

XPath bug in old versions of ElementTree

Migrating a Filemaker database to Django