So I have been told by various people on numerous occasions that you cannot parse XML with regular expressions, or by any means other than a parser. So, here at work, I have lxml and that is what I have to use. At the moment I am trying to remove specific journals from my "XML". I say "XML" because it's not really XML. It looks like this:
<record>
<type>Primary</type>
<type><FPI>NO</FPI><TPG>NO</TPG><FT>YES</FT></type>
<num>23594001</num>
<authdtl><author>Bolt, M&aacute;tt</author></authdtl>
<title>Every word doth almost tell my name</title>
<pub><journal>Early Modern</journal> <issue>(15:3)</issue> <dispdate>2011</dispdate>, <pages>N_A</pages>.</pub>
<startend><start>20110101</start><end>20111231</end></startend>
<allyears>2011</allyears>
</record>
<record>
<type>Primary</type>
<type><FPI>NO</FPI><TPG>NO</TPG><FT>YES</FT></type>
<num>23594141</num>
<authdtl><author>Packard, Bethany</author></authdtl>
<title>The Witch of Eddmon</title>
<pub><journal>Early Modern</journal> <issue>(15:3)</issue> <dispdate>2011</dispdate>, <pages>N_A</pages>.</pub>
<startend><start>20110101</start><end>20111231</end></startend>
<pubdate>2011</pubdate>
<allyears>2011</allyears>
</record>
The records have no single root element, so before I can do anything I have to add <rec> to the beginning of the file and </rec> to the end of the file. This is a right pain.
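One possible way to avoid editing each file by hand is to add the wrapper in memory just before parsing. The following is only a minimal sketch under some assumptions: `wrap_records` is a hypothetical helper name, the sample string is a cut-down version of the records above, and the standard library's `xml.etree.ElementTree` stands in for lxml here (`lxml.etree.fromstring` accepts the same kind of string). It also assumes the file is small enough to hold in memory.

```python
import xml.etree.ElementTree as ET


def wrap_records(fragment, root_tag="rec"):
    """Wrap a run of sibling <record> elements in one root element.

    `root_tag` defaults to the <rec> wrapper described in the post.
    """
    return "<{0}>{1}</{0}>".format(root_tag, fragment)


# Cut-down sample: two sibling records with no enclosing root element.
sample = (
    "<record><num>23594001</num><journal>Early Modern</journal></record>"
    "<record><num>23594141</num><journal>Early Modern</journal></record>"
)

# With the wrapper in place the text is well-formed and can be parsed.
root = ET.fromstring(wrap_records(sample))
nums = [record.findtext("num") for record in root.findall("record")]
print(nums)
```

For a real file, the fragment would come from reading the file's contents instead of the inline `sample` string. Note that this reads the whole file at once, so it gives up the streaming behaviour of `iterparse`.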
So far I have this:
#!/usr/local/bin/python2.6
from lxml import etree

for _, document in etree.iterparse('2011_39.xml', tag="record"):
    # './/' keeps each search inside the current <record>;
    # a bare '//' would search the whole tree parsed so far.
    _type = document.xpath('.//type')
    _journal = document.xpath('.//journal')
    print _type
    print _journal
    # if _type and _journal:
    #     '''do stuff'''
    # else:
    #     '''do other stuff'''
    # then somewhere I have to print document... I think
When I run this I get this:
lxml.etree.XMLSyntaxError: Entity 'aacute' not defined, line 4, column 28
It does not like &aacute;. When I remove this it seems to work OK. My question (for now) is: what do I need to do to make this work with all the &aacute; and &eacute; and all the other funny characters I have in my XML?
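The error arises because XML itself only predefines five entities (`amp`, `lt`, `gt`, `quot`, `apos`); names such as `aacute` come from HTML and are unknown to the parser unless a DTD declares them. One possible workaround, sketched below, is to rewrite the HTML-style named entities as numeric character references before parsing. This is a sketch under assumptions, not the only approach: `to_numeric_entities` is a hypothetical helper name, the sample string is made up for illustration, the code targets Python 3 (under Python 2.6 the same table lives in the `htmlentitydefs` module), and the standard library's `xml.etree.ElementTree` stands in for lxml. The regular expression only touches entity references; the parsing is still left to the parser.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint  # Python 2: from htmlentitydefs import name2codepoint

# The five entities every XML parser already understands; leave them alone.
XML_PREDEFINED = ("amp", "lt", "gt", "quot", "apos")

# Matches a named entity reference such as &aacute;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_entities(text):
    """Rewrite known HTML named entities as numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown or predefined: keep unchanged
        return "&#{0};".format(name2codepoint[name])
    return ENTITY_RE.sub(replace, text)


# Made-up sample containing two HTML entities and one predefined XML entity.
fixed = to_numeric_entities("<author>Bolt, M&aacute;tt &amp; Caf&eacute;</author>")
author = ET.fromstring(fixed).text
print(fixed)
```

Another option, not shown here, would be to prepend a `<!DOCTYPE ...>` with an internal subset that declares each entity, which avoids rewriting the text but requires listing every entity the files use.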