wdyck

I am parsing an XML file that contains escaped entities (&amp; and so on). If I use minidom to parse the file, it unescapes the entities and returns the correct value. If I use pulldom, it seems to stop at the entity and move on to the next line.

For example, given the following XML file,

<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>
    <title>The quick &amp; the dead.</title>
  </item>
</items>

and using xml.dom.minidom to read the <title> element, I get the following,

>>> from xml.dom import minidom
>>> dom = minidom.parse('test.xml')
>>> for node in dom.getElementsByTagName('item'):
...   title = node.getElementsByTagName('title')
...   print(title[0].firstChild.data)
... 
The quick & the dead.
>>>

You can see it outputs the title with the &amp; turned into a & correctly.

If, however, I use pulldom to parse the file I get the following,

>>> from xml.dom import pulldom
>>> events = pulldom.parse('test.xml')
>>> for (event, node) in events:
...   if event == pulldom.START_ELEMENT:
...     if node.tagName == "item":
...       events.expandNode(node)
...       title = node.getElementsByTagName("title")
...       print(title[0].firstChild.data)
... 
The quick 
>>>

As you can see, it stops processing at the "&" and leaves me with just "The quick".
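One thing that may be worth checking is whether the rest of the title is really dropped, or whether it is just stored in further text nodes after firstChild. The following is a minimal diagnostic sketch, not a confirmed explanation: it assumes the same document structure as test.xml, feeds it in as a string, and lists every text node under <title>. The names XML and title_text_chunks are made up for illustration.

```python
from xml.dom import pulldom

# Same structure as test.xml, inlined so the sketch is self-contained.
XML = "<items><item><title>The quick &amp; the dead.</title></item></items>"

def title_text_chunks(xml_text):
    """Return the data of every text node directly under the first <title>."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            events.expandNode(node)
            title = node.getElementsByTagName("title")[0]
            # Look at all children, not only firstChild.
            return [child.data for child in title.childNodes
                    if child.nodeType == child.TEXT_NODE]
    return []

chunks = title_text_chunks(XML)
print(chunks)           # how the parser split the text (may be several pieces)
print("".join(chunks))  # the pieces joined back together
```

If the list contains more than one piece, then firstChild.data only ever shows the first of them, which would match the truncated output above.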

Does it have something to do with the XML file needing a DTD defining the internal entities and how to process them? I thought I could do something like,

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE items [
  <!ENTITY amp "&">
]>
<items>
  <item>
    <title>The quick &amp; the dead.</title>
  </item>
</items>

However, that does not work either. I am at a bit of a loss and any suggestions would be appreciated.

I need to use pulldom because the actual XML file I am processing is huge and minidom simply cannot process it.
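Since the file has to be streamed, a loop that tidies up each <item> before reading it might be worth trying. This is only a sketch under an assumption I have not confirmed against the real file: that the title text ends up as several adjacent text nodes, which normalize() (a standard DOM node method) merges into one. The function name iter_titles and the sample data are hypothetical.

```python
import io
from xml.dom import pulldom

def iter_titles(stream):
    """Yield the full text of each <title> inside an <item>, one item at a time."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            events.expandNode(node)  # builds only this <item> subtree
            node.normalize()         # merges adjacent text nodes in place
            for title in node.getElementsByTagName("title"):
                if title.firstChild is not None:
                    yield title.firstChild.data

# Small in-memory stand-in for the real file.
sample = io.BytesIO(
    b"<items>"
    b"<item><title>The quick &amp; the dead.</title></item>"
    b"<item><title>Salt &amp; pepper</title></item>"
    b"</items>"
)
titles = list(iter_titles(sample))
print(titles)  # ['The quick & the dead.', 'Salt & pepper']
```

For the real file, an open file object or a filename could be passed to iter_titles in place of the BytesIO sample, since pulldom.parse accepts either.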

Wayne