fatbob 0 Newbie Poster

Hi all
I am new to programming and Python, and am having trouble reading particular strings from a text file and writing them out to a separate file.
The file has a large number of lines in the following format, interspersed with signature sentences I don't need:

subdoc="Book=2:chapter=1" span="Cum0:dare0">
<word id="1" form="Cum" lemma="cum1" postag="c--------" head="20" relation="AuxC"/>
<word id="2" form="esset" lemma="sum1" postag="v3sisa---" head="1" relation="ADV"/>
<word id="3" form="Caesar" lemma="Caesar1" postag="n-s---mn-" head="2" relation="SBJ"/>

I only need the information relating to form, lemma and postag; the rest I can ignore. The approach I was taking was to 1) remove the unwanted sentences, 2) remove the quotation marks and leave whitespace, and 3) split the string and return the word at the relevant index. The trouble is that I need the information at form+1, lemma+1 and postag+1. E.g. for word id 3 I need Caesar, Caesar1 and n-s---mn-.

Given that the information changes on every line of the text file, how do I iterate over the file and return the right words?
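To make the three steps described above concrete, here is a minimal sketch that does them in a single pass over each line, without an intermediate file. It is written in Python 3 syntax (the code in this post is Python 2), the function name `extract_fields` is made up for the example, and it assumes every `<word .../>` line has its attributes in the same order as the sample and that no value is empty or contains spaces.

```python
def extract_fields(line):
    """Return (form, lemma, postag) from one <word .../> line, or None."""
    # Step 1: skip signature sentences and any other line without a form attribute.
    if ' form="' not in line:
        return None
    # Step 2: drop the quotation marks and turn '=' into whitespace.
    # Step 3: split on whitespace.
    tokens = line.replace('"', '').replace('=', ' ').split()
    # After splitting, 'form' is at index 3, 'lemma' at 5 and 'postag' at 7,
    # so the values sit one position further along. Guard against short lines.
    if len(tokens) < 9:
        return None
    return tokens[4], tokens[6], tokens[8]


# Sample lines copied from the excerpt above.
sample = [
    'subdoc="Book=2:chapter=1" span="Cum0:dare0">',
    '<word id="1" form="Cum" lemma="cum1" postag="c--------" head="20" relation="AuxC"/>',
    '<word id="3" form="Caesar" lemma="Caesar1" postag="n-s---mn-" head="2" relation="SBJ"/>',
]

for line in sample:
    fields = extract_fields(line)
    if fields is not None:
        print(' '.join(fields))
```

Because the work happens per line, the indexes 4, 6 and 8 always refer to the current line rather than to the whole file.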

My code so far:

import re
f=open('caesar.txt','r')
rfformat=open('blank.txt','w')

#1) removes unwanted signature sentences
for line in f:
    if re.match("(.*)(f|1)orm(.*)", line):
        print >>rfformat, line,
rfformat.close()

#2) removes quotation marks
f=open('blank.txt','r')
quotes=f.read()
noquotes=quotes.replace('"','')
f.close()

rfformat=open('blank.txt','w')
rfformat.write(noquotes)
rfformat.close()

#3) removes =
f=open('blank.txt','r')
equals=f.read()
noequals=equals.replace('=',' ')
f.close()

rfformat=open('blank.txt','w')
rfformat.write(noequals)
rfformat.close()

#4)list of words i'm interested in:
keywords=

f=open('blank.txt','r').read()
words_list=f.split()
for word in words_list:
    if word in keywords:
        print words_list.index(word)
        print word

From this I can determine that 'form' is at index 3, 'lemma' is at index 5 and 'postag' is at index 7, so I need indexes 4, 6 and 8 from every sentence. Whenever I ask Python to return the words at these indexes it returns an error. I would really appreciate it if someone could point me in the right direction. Let me know if any of the above doesn't make sense.
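One thing worth illustrating here: `list.index(word)` only ever reports the first position of a word, so on a list built from the whole file it keeps returning 3, 5 and 7 however many lines there are. The sketch below (Python 3 syntax, with the hypothetical helper `values_after_keywords` and a keyword tuple assumed from the description above) walks the flat list with `enumerate` instead and picks the token at keyword+1 each time. It assumes no value is itself one of the keywords.

```python
# Keywords assumed from the description: the attributes of interest.
KEYWORDS = ('form', 'lemma', 'postag')


def values_after_keywords(tokens, keywords=KEYWORDS):
    """Collect the token that follows each keyword in a flat token list."""
    values = []
    for i, token in enumerate(tokens):
        # i + 1 must exist, otherwise tokens[i + 1] raises IndexError.
        if token in keywords and i + 1 < len(tokens):
            values.append(tokens[i + 1])
    return values


# Two sample lines after quotes and '=' have been replaced, as in steps 2 and 3.
text = (
    '<word id 1 form Cum lemma cum1 postag c-------- head 20 relation AuxC/>\n'
    '<word id 2 form esset lemma sum1 postag v3sisa--- head 1 relation ADV/>\n'
)
tokens = text.split()

print(tokens.index('form'))           # 3, the first occurrence only
print(values_after_keywords(tokens))  # every value that follows a keyword
```

Grouping the returned values in threes gives one form, lemma and postag per word.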

Thanks in advance

PS: Instead of formatting the file in stages, I tried using a combination of read and seek: find 'form', read forward to the opening quotation mark, read the text between the quotation marks (e.g. "Caesar") and write it to the outfile; then read on to 'lemma', read forward to the quotation mark, read the text between the quotation marks ("Caesar1") and write it to the outfile. I couldn't get this approach to work, but I think it would be more efficient.
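The scanning idea in the PS can be sketched on a single line of text with `str.find` rather than file-level read and seek. This is only an illustration in Python 3 syntax; `attribute_value` is a hypothetical helper name, and it assumes each attribute is written as a space, the name, `=` and a double-quoted value, as in the sample lines.

```python
def attribute_value(line, name):
    """Return the quoted value that follows ' name="' in line, or None."""
    # The leading space avoids matching a longer attribute name ending in `name`.
    marker = ' ' + name + '="'
    start = line.find(marker)
    if start == -1:
        return None  # attribute not present on this line
    start += len(marker)          # first character after the opening quote
    end = line.find('"', start)   # position of the closing quote
    if end == -1:
        return None  # unterminated value
    return line[start:end]


line = '<word id="3" form="Caesar" lemma="Caesar1" postag="n-s---mn-" head="2" relation="SBJ"/>'
wanted = [attribute_value(line, name) for name in ('form', 'lemma', 'postag')]
print(wanted)
```

Since each value is looked up by name, this version does not depend on the attributes appearing at fixed positions in the line.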
