
XPath bug in old versions of ElementTree

I figured out why my XML parsing code works fine using the pure-Python ElementTree XML parsing module but fails when using the speedy and memory-optimized cElementTree XML parsing module.

The XPath 1.0 specification says '.' is short-hand for 'self::node()', selecting a node itself.

Parsing an XML document and selecting the context node with ElementTree in Python 2.5:

>>> from xml.etree import ElementTree
>>> ElementTree.VERSION
'1.2.6'
>>> doc = "<Root><Example>BUG</Example></Root>"
>>> node1 = ElementTree.fromstring(doc).find('./Example')
>>> node1
<Element Example at 10e0ed8c0>
>>> node1.find('.')
<Element Example at 10e0ed8c0>
>>> node1.find('.') == node1
True

See how the result of node1.find('.') is the node itself? As it should be.

Parsing an XML document and selecting the context node with cElementTree in Python 2.5:

>>> from xml.etree import cElementTree
>>> doc = "<Root><Example>BUG</Example></Root>"
>>> node2 = cElementTree.fromstring(doc).find('./Example')
>>> node2
<Element 'Example' at 0x10e0e3660>
>>> node2.find('.')
>>> node2.find('.') == node2
False

Balls. The result of node2.find('.') is None.

However! I have a kludgey work-around that works whether you use ElementTree or cElementTree. Use './' instead of '.':

>>> node1.find('./')
<Element Example at 10e0ed8c0>
>>> node1.find('./') == node1
True
>>> node2.find('./')
<Element 'Example' at 0x10e0e3660>
>>> node2.find('./') == node2
True

Kludgey because './' is not a valid XPath expression.

So we are back on track. The work-around also works on Python 2.6, which ships the same version of ElementTree.

Fortunately Python 2.7 got a new version of ElementTree and the bug is fixed:

>>> from xml.etree import ElementTree
>>> ElementTree.VERSION
'1.3.0'
>>> doc = "<Root><Example>BUG</Example></Root>"
>>> node3 = ElementTree.fromstring(doc).find('./Example')
>>> node3
<Element 'Example' at 0x107257210>
>>> node3.find('.')
<Element 'Example' at 0x107257210>
>>> node3.find('.') == node3
True

However! They also fixed my kludgey work-around:

>>> node3.find('./')
>>> node3.find('./') == node3
False

So I can’t write a single XPath expression that works for all three versions. This is annoying. I was hoping to just replace ElementTree with the C version, which makes my code run in one third of the time (the XML parts of it run in one tenth of the time). And I cannot install any compiled modules: the code can only rely on Python 2.5’s standard library.
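One way around it might be to stop asking the library about the context node at all. This is only a sketch: find_compat is a hypothetical helper name, not part of ElementTree, and it assumes the only troublesome paths are '.' and the non-standard './'. Anything else is passed straight through to the node's own find method, so it should behave the same whichever module built the tree.

```python
from xml.etree import ElementTree


def find_compat(node, path):
    """Hypothetical helper: resolve '.' (and the non-standard './')
    to the node itself instead of asking the library to do it."""
    if path in ('.', './'):
        return node
    # Every other path is left to the library's own XPath support.
    return node.find(path)


doc = "<Root><Example>BUG</Example></Root>"
root = ElementTree.fromstring(doc)
example = find_compat(root, './Example')
same = find_compat(example, '.')
```

On Python 2.5 the import could be swapped for cElementTree without touching the helper, since it never passes '.' to the library.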

Migrating a Filemaker database to Django

At work we have several Filemaker Pro databases. I have been slowly working through these, converting them to Web-based applications using the Django framework. My primary motive is to replace an overly-complicated Filemaker setup running on four Macs with a single 2U rack-mounted server running Apache on FreeBSD.

At some point in the process of re-writing each database for use with Django I have needed to convert all the records from Filemaker to Django. There exist good Python libraries for talking to Filemaker but they rely on the XML Web interface, meaning that you need Filemaker running and set to publish the database on the Web while you are running an import.

In my experience Filemaker’s built-in XML publishing interface is too slow when you want to migrate tens of thousands of records. During development of a Django-based application I find I frequently need to re-import the records as the new database schema evolves – doing this by communicating with Filemaker is tedious when you want to re-import the data several times a day.

So my approach has been to export the data from Filemaker as XML using Filemaker’s FMPXMLRESULT format. The Filemaker databases at work are old (Filemaker 5.5) and perhaps things have improved in more recent versions but Filemaker 5/6 is a very poor XML citizen. When using the FMPDSORESULT format (which has been dropped from more recent versions) it will happily generate invalid XML all over the shop. The FMPXMLRESULT format is better but even then it will emit invalid XML if the original data happens to contain funky characters.
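If the funky characters are control characters that XML 1.0 does not allow, one option is to strip them from the exported text before parsing. This is a rough sketch under that assumption: the function name is made up for illustration, and it only covers the C0 control range, not every way an export can be malformed.

```python
import re

# Control characters that XML 1.0 forbids in document content.
# Tab (0x09), newline (0x0A) and carriage return (0x0D) are legal
# and are deliberately left alone.
_ILLEGAL_XML_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def strip_illegal_xml_chars(text):
    """Hypothetical clean-up step: drop forbidden control characters."""
    return _ILLEGAL_XML_CHARS.sub(u'', text)


cleaned = strip_illegal_xml_chars(u'<DATA>Smith\x0b, John\tJr.</DATA>')
```

The vertical tab in the sample value is removed while the ordinary tab survives.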

So here is filemaker.py, a Python module for parsing an XML file produced by exporting to FMPXMLRESULT format from Filemaker.

To use it you create a sub-class of the FMPImporter class and over-ride the FMPImporter.import_node method. This method is called for each row of data in the XML file and is passed an XML node instance for the row. You can convert that node to a more useful dictionary where keys are column names and values are the column values. You would then convert the data to your Django model object and save it.

A trivial example:

import filemaker

class MyImporter(filemaker.FMPImporter):
    def import_node(self, node):
        node_dict = self.format_node(node)
        print node['RECORDID'], node_dict

importer = MyImporter(datefmt='%d/%m/%Y')
filemaker.importfile('/path/to/data.xml', importer=importer)

The FMPImporter.format_node method converts values to an appropriate Python type according to the Filemaker column type. Filemaker’s DATE and TIME types are converted to Python datetime.date and datetime.time instances respectively. NUMBER types are converted to Python float instances. Everything else is left as strings, but you can customize the conversion by over-riding the appropriate methods in your sub-class (see the source for the appropriate method names).

In the case of Filemaker DATE values you can pass the datefmt argument to your sub-class to specify the date format string. See Python’s time.strptime documentation for the complete list of the format specifiers.
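To give a feel for the kind of conversion described above, here is a stand-alone sketch. It is not the code from filemaker.py: convert_value is a hypothetical function, the timefmt default is my guess at a sensible time format, and returning None for empty values is an assumption made so that float('') never gets called.

```python
import datetime
import time


def convert_value(fm_type, value, datefmt='%d/%m/%Y', timefmt='%H:%M:%S'):
    """Hypothetical converter from a Filemaker column type and string
    value to a Python object."""
    if value == '':
        return None  # assumption: treat empty fields as missing
    if fm_type == 'DATE':
        # struct_time[:3] is (year, month, day)
        return datetime.date(*time.strptime(value, datefmt)[:3])
    if fm_type == 'TIME':
        # struct_time[3:6] is (hour, minute, second)
        return datetime.time(*time.strptime(value, timefmt)[3:6])
    if fm_type == 'NUMBER':
        return float(value)
    # Everything else stays a string.
    return value


born = convert_value('DATE', '25/12/2009')
us_style = convert_value('DATE', '12/25/2009', datefmt='%m/%d/%Y')
```

Both calls give the same date; only the format string passed as datefmt differs.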

The code uses Python’s built-in SAX parser so that it is efficient when importing huge XML files (the process uses a constant 15 megabytes for any size of data on my Mac running Python 2.5).
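The reason memory stays flat is that a SAX handler only ever holds one row at a time. The sketch below shows the general idea and is not the actual filemaker.py code: RowHandler and the on_row callback are names invented for illustration, the sample document is a cut-down FMPXMLRESULT export, and it assumes one DATA element per COL (so no repeating fields).

```python
import xml.sax


class RowHandler(xml.sax.ContentHandler):
    """Hypothetical handler: calls on_row(dict) once per ROW element."""

    def __init__(self, on_row):
        xml.sax.ContentHandler.__init__(self)
        self.on_row = on_row
        self.fields = []   # column names, in METADATA order
        self.row = None    # values of the ROW being read
        self.text = None   # text chunks of the DATA being read

    def startElement(self, name, attrs):
        if name == 'FIELD':
            self.fields.append(attrs['NAME'])
        elif name == 'ROW':
            self.row = []
        elif name == 'DATA':
            self.text = []

    def characters(self, content):
        # The parser may deliver one value in several chunks.
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == 'DATA':
            self.row.append(''.join(self.text))
            self.text = None
        elif name == 'ROW':
            self.on_row(dict(zip(self.fields, self.row)))
            self.row = None  # nothing is kept once the row is handed over


SAMPLE = """<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
<METADATA>
<FIELD NAME="Name" TYPE="TEXT"/>
<FIELD NAME="Age" TYPE="NUMBER"/>
</METADATA>
<RESULTSET FOUND="2">
<ROW MODID="0" RECORDID="1"><COL><DATA>Ann</DATA></COL><COL><DATA>41</DATA></COL></ROW>
<ROW MODID="0" RECORDID="2"><COL><DATA>Bob</DATA></COL><COL><DATA></DATA></COL></ROW>
</RESULTSET>
</FMPXMLRESULT>"""

rows = []
xml.sax.parseString(SAMPLE.encode('utf-8'), RowHandler(rows.append))
```

Here the callback just collects rows in a list for demonstration; a real import would save each row and let it go, which is what keeps the memory use constant.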

Fortunately I haven’t had to deal with Filemaker’s repeating fields so I have no idea how the code works on repeating fields. Please let me know if it works for you. Or not.

Download filemaker.py. This code is released under a 2-clause BSD license.