Hi there. This is day 2 of my programming in Python. I posted my first attempt in this thread:
http://www.daniweb.com/forums/post1231604.html#post1231604
Building on that, I am now getting into slightly deeper water.
I have three .txt files:
nvutf8.txt: new vocab items are stored here
esutf8.txt: example sentences are stored here
exoututf8.txt: example sentences from esutf8.txt that contain vocab from nvutf8.txt are supposed to be stored here
I have written the following code:
#step1: find example sentences in esutf8.txt which contain new voc items from nvutf8.txt
#step2: among those sentences find those which contain as few as possible new words from kvutf8.txt (known vocab).
import codecs
enout = codecs.open('Python/ExListBuild/exoututf8.txt', encoding = 'utf-8', mode = 'w')
nvin = codecs.open('Python/ExListBuild/nvutf8.txt', encoding = 'utf-8', mode = 'r')
for line in open('Python/ExListBuild/nvutf8.txt'):
    newvocab = nvin.readline()
    print "-"
    print "next vocab item being checked"
    print "-"
    esin = codecs.open('Python/ExListBuild/esutf8.txt', encoding = 'utf-8', mode = 'r')
    for line in open('Python/ExListBuild/esutf8.txt'):
        sentence = esin.readline()
        index = sentence.find(newvocab)
        if index==-1:
            print "nope"
        else:
            print "yes"
            enout.write(sentence)
    esin.close()
nvin.close()
The results are irregular in a way I find hard to understand.
I use the following example sentences in esutf8.txt:
我前边要拐弯了,请注意。
车来了快跑。
请排好队上车。
带好自己的东西。
方向错了!
我给你讲一个成语故事。
感谢你对我们的关心。
For the new vocab in nvutf8.txt I use:
要
我
And this is what ends up in exoututf8.txt:
我前边要拐弯了,请注意。
我给你讲一个成语故事。
感谢你对我们的关心。
So it worked fine for 我, but it did not work for 要 (which is in the first sentence).
EDIT: Apparently it ONLY ever works for the last word in nvutf8.txt (this also holds for vocab of two or more characters, like 自己).
I have a version with analogous code, but without the utf-8 handling, which works fine for Roman letters.
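One possible explanation, offered as a guess rather than a confirmed diagnosis: readline() keeps the trailing newline on each line, so every vocab item except the last is searched for as "要" plus a line break, which never appears inside a sentence. Only the last line of nvutf8.txt has no newline after it, which would match the "only the last word works" behaviour. The sketch below strips each line before searching. The function name matching_sentences and the in-memory io.StringIO stand-ins for the two files are made up for illustration; it also collects each matching sentence once, rather than once per vocab item as in my loop above.

```python
# -*- coding: utf-8 -*-
import io


def matching_sentences(vocab_file, sentence_file):
    """Return the sentences that contain at least one vocab item.

    Both arguments are iterables of unicode lines, e.g. files opened
    with codecs.open(..., encoding='utf-8').
    """
    # strip() removes the trailing newline that every line read from a file keeps
    vocab = [line.strip() for line in vocab_file if line.strip()]
    matches = []
    for line in sentence_file:
        sentence = line.strip()
        if sentence and any(word in sentence for word in vocab):
            matches.append(sentence)
    return matches


# Hypothetical stand-ins for nvutf8.txt and esutf8.txt; note that only the
# last line of each has no newline after it, as in a typical text file.
nv = io.StringIO(u"要\n我")
es = io.StringIO(u"我前边要拐弯了,请注意。\n车来了快跑。\n感谢你对我们的关心。")

result = matching_sentences(nv, es)
for sentence in result:
    print(sentence)
```

If the first vocab item still fails after stripping, a byte order mark at the start of the file might be another thing to check; opening with encoding 'utf-8-sig' instead of 'utf-8' would skip it.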