find key word in a line as well as the word before and after

Question

doomas10 0 Newbie Poster

14 Years Ago

Hello all,

How are you? Hope well. Just a quick question. I have a file that contains abstracts (short texts), and I am looking for certain keywords and their frequency. The keywords come from another file. I was thinking of reading the keyword file first and then scanning the abstract file. However, when I find a keyword, I would also like to find the word before it and the word after it. Any thoughts on how I can do that?

example

Keyword_file looks like this:

George
myself

abstract_file looks like this:

hello, my name is george. How are you? Today
i am not feeling very well. I consider myself to be
sick.

So I want to find the words 'george' and 'myself', as well as the words around them: 'is' and 'how', and 'consider' and 'to'.

Any suggestions?

python


Gribouillis 1,391 Programming Explorer

14 Years Ago

You can iterate over the words, or build a list, with a regular expression (the re module)

import re

abstract = """hello, my name is george. How are you? Today
i am not feeling very well. I consider myself to be
sick."""

word_pattern = re.compile(r"\w+")

print list(word_pattern.findall(abstract))

""" my output -->
['hello', 'my', 'name', 'is', 'george', 'How', 'are', 'you', 'Today', 'i', 'am', 'not', 'feeling', 'very', 'well', 'I', 'consider', 'myself', 'to', 'be', 'sick']
"""


Gribouillis 1,391 Programming Explorer

14 Years Ago

You can use enumerate. Supposing you have 2 lists, keywords and word_list,

keyword_set = set(keywords) # better use a set
for i, w in enumerate(word_list):
    if w in keyword_set:
        word_before = word_list[i-1] if i > 0 else ''
        word_after = word_list[i+1] if i+1 < len(word_list) else ''
        print("%s <%s> %s" % (word_before, w, word_after))
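
Putting the two posts above together, a minimal end-to-end sketch in Python 3 might look like this. It assumes that both the keywords and the abstract should be lowercased before comparison (the keyword file has "George" while the abstract has "george"); the function name `keyword_contexts` is made up for illustration and is not from the thread.

```python
import re

WORD_PATTERN = re.compile(r"\w+")

def keyword_contexts(text, keywords):
    """Return (before, keyword, after) triples for each keyword hit in text."""
    # Assumption: matching is case-insensitive, so lowercase both sides.
    words = WORD_PATTERN.findall(text.lower())
    keyword_set = {k.lower() for k in keywords}
    hits = []
    for i, w in enumerate(words):
        if w in keyword_set:
            # Use an empty string when the keyword is the first or last word.
            before = words[i - 1] if i > 0 else ""
            after = words[i + 1] if i + 1 < len(words) else ""
            hits.append((before, w, after))
    return hits

abstract = """hello, my name is george. How are you? Today
i am not feeling very well. I consider myself to be
sick."""

for before, word, after in keyword_contexts(abstract, ["George", "myself"]):
    print("%s <%s> %s" % (before, word, after))
```

Note that the neighbours are taken from the flat word list, so they can cross sentence boundaries: "how" follows "george" even though a full stop sits between them.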


snippsat 661 Master Poster

14 Years Ago

print list(word_pattern.findall(abstract))

Just a tip: re.findall already returns a list, so there is no need to wrap it in list().

import re

text = '''\
hello, my name is george. How are you? Today
i am not feeling very well. I consider myself to be
sick.
'''

word_pattern = re.findall(r'\w+', text)
print word_pattern

""" Out-->
['hello', 'my', 'name', 'is', 'george', 'How', 'are', 'you', 'Today', 'i', 'am', 'not', 'feeling', 'very', 'well', 'I', 'consider', 'myself', 'to', 'be', 'sick']
"""
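
The original question also asks for keyword frequencies. One possible way to get them from the same word list, sketched in Python 3, is `collections.Counter`. The lowercasing step and the variable names `counts` and `keyword_counts` are assumptions for illustration.

```python
import re
from collections import Counter

text = """hello, my name is george. How are you? Today
i am not feeling very well. I consider myself to be
sick."""
keywords = ["George", "myself"]

# Assumption: counting is case-insensitive, so lowercase before tokenizing.
words = re.findall(r"\w+", text.lower())
counts = Counter(words)

# A Counter returns 0 for missing keys, so absent keywords are safe to look up.
keyword_counts = {k.lower(): counts[k.lower()] for k in keywords}
print(keyword_counts)
```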


Gribouillis commented: indeed ! +4


doomas10 0 Newbie Poster · Answer 1 · 2010-08-06T19:58:41+00:00

I tried the list, and I think a solution can be found along those lines. However, I do not know how to search for the words before and after. For example:

import re

filename = 'abex.txt'
wordlist = re.split(r'\s+', open(filename).read().lower())
print 'words in text:', len(wordlist)
print wordlist
print ""

filename2 = 'singleheads.txt'
wordlist2 = re.split(r'\s+', open(filename2).read().lower())
print 'words in the single head file:', len(wordlist2)
print wordlist2

# print every keyword that also occurs in the text
for word in wordlist2:
    if word in wordlist:
        print word

As you can see, this code prints only the common words, which is great. But I cannot work out how to ask the algorithm to fetch the words before and after each keyword. Is there a function that can do that?

doomas10 0 Newbie Poster · Answer 2 · 2010-08-06T22:02:10+00:00

Thanks! It worked great! But here is another problem. With this code

filename='abex.txt' 
wordlist=re.split('\s+', file(filename).read().lower()) # convert to lowercase
filename2='singleheads.txt'
wordlist2=re.split('\s+', file(filename2).read().lower())

punctuation=re.compile(r'[.?!,":;]')   #remove the punctuation
for word in wordlist:
    word=punctuation.sub("",word) 

keyword_set = set(wordlist2)
for i,w in enumerate(wordlist): #it gives to the list items numbers
    if w in keyword_set:
        before_word = wordlist[i-1] if i > 0 else ''
        after_word = wordlist[i+1] if i+1 < len(wordlist) else ''
        print "%s <%s> %s" % (before_word,w,after_word)

it gives nice results! But words that have punctuation such as . ? ! " attached are missed, since the keyword list contains only "clean" words. I tried to fix it with

punctuation=re.compile(r'[.?!,":;]')   #remove the punctuation
for word in wordlist:
    word=punctuation.sub("",word)

these three lines, and it seems to work fine. But I cannot connect the new (now clean) wordlist to the set. I keep hitting my head against the wall, but I cannot find a way. Something tells me it is going to be very, very simple!
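
A likely reason the cleaned words never reach the set lookup: assigning to the loop variable only rebinds the name `word`, and the list itself is left unchanged. The small Python 3 sketch below illustrates this with made-up sample words and shows one common alternative, building a new list.

```python
import re

punctuation = re.compile(r'[.?!,":;]')
wordlist = ["george.", "how", "you?", "myself"]  # sample data for illustration

# Rebinding the loop variable does not modify the list.
for word in wordlist:
    word = punctuation.sub("", word)
unchanged = list(wordlist)

# Building a new list keeps the cleaned words.
cleaned = [punctuation.sub("", word) for word in wordlist]

print(unchanged)
print(cleaned)
```

The `cleaned` list can then be passed to `enumerate` in place of the original word list.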

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 3 · 2010-08-06T22:06:54+00:00

Forget about punctuation: build the words list as I did in my first post above, using the regex r'\w+'.
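
To see why this sidesteps the punctuation problem, here is a short Python 3 comparison of the two tokenizing approaches on a sample sentence. The variable names are chosen for illustration only.

```python
import re

text = "my name is george. How are you?"

# Splitting on whitespace keeps punctuation glued to the words.
by_whitespace = re.split(r"\s+", text.lower())

# Matching runs of word characters drops punctuation entirely.
by_word_chars = re.findall(r"\w+", text.lower())

print(by_whitespace)
print(by_word_chars)
```

One caveat with `\w+`: an apostrophe is not a word character, so a word like "don't" comes out as two tokens, "don" and "t".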

doomas10 0 Newbie Poster · Answer 4 · 2010-08-06T22:42:12+00:00

I did, and it works fine! Thank you very much for your help. For the record, this is my final algorithm:

import re
terms='singleheads.txt'
wordlist=re.split('\s+', file(terms).read().lower())

abstract=open('abex.txt','r')
abstract2=abstract.read().lower() 
abstract3=str(abstract2)

word_pattern = re.compile(r"\w+")
doom=list(word_pattern.findall(abstract3))
print doom
print ""

keyword_set = set(wordlist)
for i,w in enumerate(doom): #it gives to the list items numbers
    if w in keyword_set:
        before_word = doom[i-1] if i > 0 else ''
        after_word = doom[i+1] if i+1 < len(doom) else ''
        print "%s <%s> %s" % (before_word,w,after_word)
        sephiroth=open('staib.txt','a')
        sephiroth.write(str(before_word) + " <" + str(w) + "> " + str(after_word) + "\n")
        sephiroth.close()

thanks for the help!

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 5 · 2010-08-06T22:46:12+00:00

Nice, but you should not open 'staib.txt' for each iteration. Open it before the loop starts and close it after the loop. Opening a file is an expensive system call.
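
A sketch of that suggestion in Python 3, using a `with` block so the file is opened once and closed automatically. The `hits` list stands in for the results of the matching loop, and a temporary directory is used here only so the example does not touch a real 'staib.txt'; both are assumptions for illustration.

```python
import os
import tempfile

# Sample results standing in for the (before, keyword, after) matches.
hits = [("is", "george", "how"), ("consider", "myself", "to")]

# Assumption: write to a temporary location instead of the real output file.
out_path = os.path.join(tempfile.mkdtemp(), "staib.txt")

# Open once before the loop; the with block closes the file afterwards.
with open(out_path, "a") as out:
    for before, word, after in hits:
        out.write("%s <%s> %s\n" % (before, word, after))

with open(out_path) as f:
    written = f.read()
print(written)
```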

doomas10 0 Newbie Poster · Answer 6 · 2010-08-07T00:33:52+00:00

ok :-) thanks for the tip :-))

doomas10 0 Newbie Poster · Answer 7 · 2010-08-09T17:14:43+00:00

thanks snippsat :-)