So I have been told by various people on numerous occasions that you cannot parse XML with regular expressions or anything other than a parser. So, here at work, I have LXML and that is what I have to use. At the moment I am trying to remove specific journals from my "XML". I say "XML" because it's not really XML. It looks like this:

<record> <type>Primary</type> <type><FPI>NO</FPI><TPG>NO</TPG><FT>YES</FT></type> <num>23594001</num> <authdtl><author>Bolt, M&aacute;tt</author></authdtl> <title>Every word doth almost tell my name</title> <pub><journal>Early Modern</journal> <issue>(15:3)</issue> <dispdate>2011</dispdate>, <pages>N_A</pages>.</pub> <startend><start>20110101</start><end>20111231</end></startend> <allyears>2011</allyears> </record> <record> <type>Primary</type> <type><FPI>NO</FPI><TPG>NO</TPG><FT>YES</FT></type> <num>23594141</num> <authdtl><author>Packard, Bethany</author></authdtl> <title>The Witch of Eddmon</title> <pub><journal>Early Modern</journal> <issue>(15:3)</issue> <dispdate>2011</dispdate>, <pages>N_A</pages>.</pub> <startend><start>20110101</start><end>20111231</end></startend> <pubdate>2011</pubdate> <allyears>2011</allyears> </record>

So before I can do anything I have to add <rec> to the beginning of the file and then </rec> to the end of the file. This is a right pain.
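The wrapping does not have to be done by hand in the file. Here is a minimal sketch, assuming the whole file fits in memory, that adds the root element to the text just before parsing. `wrap_records` is a hypothetical helper name, the sample records are trimmed down, and the standard library's `xml.etree.ElementTree` is used so the sketch runs without lxml.

```python
import xml.etree.ElementTree as ET


def wrap_records(text, root="rec"):
    """Enclose a run of sibling <record> elements in a single root element."""
    return "<{0}>{1}</{0}>".format(root, text)


# Two trimmed-down records with no enclosing element, as in the file above.
sample = (
    "<record><num>23594001</num><journal>Early Modern</journal></record> "
    "<record><num>23594141</num><journal>Early Modern</journal></record>"
)

tree = ET.fromstring(wrap_records(sample))
nums = [record.findtext("num") for record in tree.findall("record")]
print(nums)
```

For a real file, the text would come from reading the file and passing its contents to `wrap_records` instead of the inline sample.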

So far I have this:

#!/usr/local/bin/python2.6
from lxml import etree

for _, document in etree.iterparse('2011_39.xml', tag="record"):
    # './/' keeps the search inside this record; '//' searches the whole tree
    _type = document.xpath('.//type')
    _journal = document.xpath('.//journal')
    print _type
    print _journal
#    if _type and _journal:
#        '''do stuff'''

#    else:
#        '''do other stuff'''
#  then somewhere I have to print document... I think

When I run this I get this:

lxml.etree.XMLSyntaxError: Entity 'aacute' not defined, line 4, column 28

It does not like &aacute;. When I remove it, the parse seems to work OK. My question (for now) is: what do I need to do to make this work with &aacute;, &eacute; and all the other named entities I have in my XML?
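One possible way around the error, sketched here under the assumption that the entity names in the file are the standard HTML ones, is to rewrite each named entity as a numeric character reference before parsing. XML parsers accept `&#225;` without any declaration. `named_to_numeric` is a hypothetical helper, and the standard library parser stands in for lxml so the sketch runs anywhere.

```python
import re
import xml.etree.ElementTree as ET

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# The five entities XML predefines must be left untouched.
XML_PREDEFINED = frozenset(["amp", "lt", "gt", "quot", "apos"])
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def _to_numeric(match):
    name = match.group(1)
    if name in XML_PREDEFINED or name not in name2codepoint:
        return match.group(0)  # unknown or predefined: leave as-is
    return "&#%d;" % name2codepoint[name]


def named_to_numeric(text):
    """Replace HTML named entities with numeric character references."""
    return ENTITY_RE.sub(_to_numeric, text)


sample = "<record><author>Bolt, M&aacute;tt</author><title>A &amp; B</title></record>"
converted = named_to_numeric(sample)
record = ET.fromstring(converted)
print(converted)
```

Names that are not in the HTML table are left alone, so the parser will still complain about them, which at least shows which entities need a mapping of their own.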

This looks like what you need.

Nope. We don't have that:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named sax.saxutils

The problem is I'm not sure what I'm doing. I think I have to somehow put these "special characters" of ours in an XML schema. At least, that is what I can tell from an older DTD I managed to dig up. It looks like this:

<!ENTITY oacute CDATA "&#243;" >
<!ENTITY oelig CDATA "&#156;" >
<!ENTITY ocirc CDATA "&#244;" >
<!ENTITY otilde CDATA "&#245;" >
<!ENTITY ouml CDATA "&#246;" >
etc.
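Entity declarations like these belong in a DTD rather than an XML Schema, and the `CDATA` keyword is SGML syntax; the XML form is just `<!ENTITY name "value">`. Here is a rough sketch, assuming the entity list can be kept in a dictionary, that builds a DOCTYPE with an internal subset and prepends it to the wrapped records. `build_doctype`, the `ENTITIES` table, and the sample author are hypothetical, and the standard library parser is used in place of lxml.

```python
import xml.etree.ElementTree as ET

# Entity names taken from the old DTD, mapped to Unicode code points.
# oelig is 339 (U+0153) here; the DTD's 156 is the Windows-1252 byte
# for the same letter, not its Unicode code point.
ENTITIES = {
    "oacute": 243,
    "oelig": 339,
    "ocirc": 244,
    "otilde": 245,
    "ouml": 246,
}


def build_doctype(root, entities):
    """Return a DOCTYPE whose internal subset declares each entity."""
    decls = "".join(
        '<!ENTITY %s "&#%d;">' % (name, code)
        for name, code in sorted(entities.items())
    )
    return "<!DOCTYPE %s [%s]>" % (root, decls)


doctype = build_doctype("rec", ENTITIES)
# Made-up record that uses two of the declared entities.
body = "<rec><record><author>C&oacute;rdoba, J&ouml;rg</author></record></rec>"
root = ET.fromstring(doctype + body)
author = root.findtext("record/author")
print(doctype)
```

The root name in the DOCTYPE has to match the wrapper element, so if the file is wrapped in `<rec>` the DOCTYPE must say `rec` too.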

This morning I wrote this:

`<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

<xs:element name="records">
<xs:complexType>
<xs:sequence>
<xs:element name="record" minOccurs="1" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="type" type="xs:string">
<xs:element name="texttype" minOccurs="1" maxOccurs="1">
<xs:complexType>
<xs:sequence>
<xs:element name="FPI" type="xs:string">
<xs:element name="TPG" type="xs:string">
<xs:element name="FT" type="xs:string">
</xs:sequence>
</xs:complexType>

    <xs:element name="num" type="xs:string">

        <xs:element name="authdtl" minOccurs="1" maxOccurs="1">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="author" type="xs:string"  minOccurs="1" maxOccurs="unbounded">
              </xs:complexType>
            </xs:sequence>

    <xs:element name="title" type="xs:string">
        <xs:element name="pubd" type="xs:string">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="journal" type="xs:string">
                <xs:element name="issue" type="xs:string">
                <xs:element name="dispdate" type="xs:string">
                <xs:element name="pages" type="xs:string">
              </xs:sequence>
            </xs:complexType>

        <xs:element name="startend"  type="xs:string">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="start" type="xs:string">
                <xs:element name="end" type="xs:string">
              </xs:sequence>
            </xs:complexType>

  </xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>

</xs:element>
</xs:schema>`

But I'm not sure if this is correct. I'm also not sure what to do with the "special" characters

You are missing the "xml." prefix from the import.

Nope, I typed this:
from xml.sax.saxutils import escape
But I got this back:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named sax.saxutils
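The message complains about `sax.saxutils` rather than `xml`, which suggests Python found something called `xml` that has no `sax` inside it. One possible cause, and this is only a guess, is a file named `xml.py` or a directory named `xml` in the working directory shadowing the standard library package. A quick check of where `xml` is being loaded from:

```python
import xml

# The standard library xml is a package: it has __path__ and lives inside
# the Python installation. A stray xml.py next to the script would be a
# plain module with no __path__ and a local file path instead.
xml_location = getattr(xml, "__file__", None)
xml_is_package = hasattr(xml, "__path__")
print(xml_location)
print(xml_is_package)
```

If the location points at the current directory, renaming that file (and deleting any leftover `xml.pyc`) should let the original import through.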

I now have a working schema and I'm just trying to get the parsing going. Will post back later.
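Once the parsing works, removing the unwanted journals could look something like the sketch below. It assumes the records sit under a single root and that the journal name is at `pub/journal` inside each record, as in the sample at the top. `drop_journals` and the second journal name are made up for illustration, and the standard library parser is used again in place of lxml.

```python
import xml.etree.ElementTree as ET


def drop_journals(root, unwanted):
    """Remove every <record> whose journal is in unwanted; return how many went."""
    removed = 0
    # Copy to a list first: removing children while iterating the live
    # element would skip records.
    for record in list(root.findall("record")):
        if record.findtext("pub/journal") in unwanted:
            root.remove(record)
            removed += 1
    return removed


sample = (
    "<rec>"
    "<record><num>23594001</num><pub><journal>Early Modern</journal></pub></record>"
    "<record><num>1</num><pub><journal>Hypothetical Review</journal></pub></record>"
    "</rec>"
)

root = ET.fromstring(sample)
removed = drop_journals(root, set(["Early Modern"]))
remaining = [record.findtext("num") for record in root.findall("record")]
print(ET.tostring(root))
```

Whatever is left under the root can then be written back out with `ET.tostring`.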
