
XPath bug in old versions of ElementTree

I figured out why my XML parsing code works fine with the pure-Python ElementTree module but fails with cElementTree, its faster and more memory-efficient C implementation.

The XPath 1.0 specification says '.' is short-hand for 'self::node()', which selects the context node itself.

Parsing an XML document and selecting the context node with ElementTree in Python 2.5:

>>> from xml.etree import ElementTree
>>> ElementTree.VERSION
'1.2.6'
>>> doc = "<Root><Example>BUG</Example></Root>"
>>> node1 = ElementTree.fromstring(doc).find('./Example')
>>> node1
<Element Example at 10e0ed8c0>
>>> node1.find('.')
<Element Example at 10e0ed8c0>
>>> node1.find('.') == node1
True

See how the result of node1.find('.') is the node itself? As it should be.

Parsing an XML document and selecting the context node with cElementTree in Python 2.5:

>>> from xml.etree import cElementTree
>>> doc = "<Root><Example>BUG</Example></Root>"
>>> node2 = cElementTree.fromstring(doc).find('./Example')
>>> node2
<Element 'Example' at 0x10e0e3660>
>>> node2.find('.')
>>> node2.find('.') == node2
False

Balls. The result of node2.find('.') is None.

However! I have a kludgey work-around that works whether you use ElementTree or cElementTree. Use './' instead of '.':

>>> node1.find('./')
<Element Example at 10e0ed8c0>
>>> node1.find('./') == node1
True
>>> node2.find('./')
<Element 'Example' at 0x10e0e3660>
>>> node2.find('./') == node2
True

Kludgey because './' is not a valid XPath expression.

So we are back on track. This also works in Python 2.6, which ships the same version of ElementTree.

Fortunately Python 2.7 got a new version of ElementTree and the bug is fixed:

>>> from xml.etree import ElementTree
>>> ElementTree.VERSION
'1.3.0'
>>> doc = "<Root><Example>BUG</Example></Root>"
>>> node3 = ElementTree.fromstring(doc).find('./Example')
>>> node3
<Element 'Example' at 0x107257210>
>>> node3.find('.')
<Element 'Example' at 0x107257210>
>>> node3.find('.') == node3
True

However! They also fixed my kludgey work-around:

>>> node3.find('./')
>>> node3.find('./') == node3
False

So I can't write a single expression that works for all three: '.' fails with cElementTree, and './' fails with ElementTree 1.3.0. This is annoying. I was hoping to just replace ElementTree with the C version, which makes my code run in one third the time (the XML parts of it run in one tenth the time). And I cannot install any compiled modules: the code can only rely on Python 2.5's standard library.
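One possible way out, as a sketch only: skip asking the library for '.' at all. This assumes every lookup can go through a small wrapper instead of calling find() directly. The names ET and find_from are made up for this example and are not part of ElementTree.

```python
# Prefer the C implementation where it exists as a separate module,
# and fall back to the pure-Python one where it does not.
try:
    from xml.etree import cElementTree as ET
except ImportError:
    from xml.etree import ElementTree as ET


def find_from(node, path):
    """Hypothetical wrapper around node.find(path).

    '.' means the context node, so return it directly rather than
    relying on how a particular ElementTree version handles that path.
    """
    if path == '.':
        return node
    return node.find(path)


doc = "<Root><Example>BUG</Example></Root>"
node = find_from(ET.fromstring(doc), './Example')
same = find_from(node, '.') is node  # True, the library never sees '.'
```

Only the bare '.' case is special-cased here. Every other path is still handed to find() unchanged, so any other differences between the versions would still show up.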

