I am parsing an XML file with encoded entities in it (&amp; and so on). If I use minidom to parse the XML file, minidom will unescape the entities and display the correct value. If I use pulldom, it skips the entity and moves on to the next line.
For example, given the following XML file,
<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>
    <title>The quick &amp; the dead.</title>
  </item>
</items>
and using xml.dom.minidom to parse the <title> element I get the following,
>>> from xml.dom import minidom
>>> dom = minidom.parse('test.xml')
>>> for node in dom.getElementsByTagName('item'):
...     title = node.getElementsByTagName('title')
...     print title[0].firstChild.data
...
The quick & the dead.
>>>
You can see it outputs the title with the &amp; correctly turned into a &.
If, however, I use pulldom to parse the file I get the following,
>>> from xml.dom import pulldom
>>> events = pulldom.parse('test.xml')
>>> for (event, node) in events:
...     if event == pulldom.START_ELEMENT:
...         if node.tagName == "item":
...             events.expandNode(node)
...             title = node.getElementsByTagName("title")
...             print title[0].firstChild.data
...
The quick
>>>
As you can see, it stops processing at the "&amp;" and leaves me with just "The quick".
Does it have something to do with the XML file needing a DTD defining the internal entities and how to process them? I thought I could do something like,
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE items [
<!ENTITY amp "&">
]>
<items>
  <item>
    <title>The quick &amp; the dead.</title>
  </item>
</items>
However, that does not work either. I am at a bit of a loss and any suggestions would be appreciated.
I need to use pulldom because the actual XML file I am processing is HUGE and minidom will simply not process it.
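One possible explanation, offered as an assumption rather than something confirmed above: the underlying parser may hand the text to pulldom in several chunks (for example "The quick ", "&" and " the dead."), so after expandNode() the <title> element holds several adjacent text nodes and firstChild is only the first of them. If that is the cause, calling the standard DOM method normalize() on the expanded node should merge them back together. A minimal sketch, using print() as a function and pulldom.parseString() on an inline copy of the sample so it is self-contained; the helper name item_titles is made up for illustration:

```python
from xml.dom import pulldom

# Inline copy of the sample document (the real data would come from a file).
XML = "<items><item><title>The quick &amp; the dead.</title></item></items>"

def item_titles(xml_text):
    """Return the title text of every <item>, one item at a time."""
    titles = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            # Build the DOM subtree for this <item> only.
            events.expandNode(node)
            # Merge adjacent text nodes so firstChild holds the whole string.
            node.normalize()
            title = node.getElementsByTagName("title")[0]
            titles.append(title.firstChild.data)
    return titles

print(item_titles(XML))
```

With a real file, pulldom.parse('test.xml') would take the place of parseString(); only one <item> subtree is expanded at a time, so this should still suit a very large document.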
Wayne