Hi,

I'm trying to remove non-stop words from a text file using regular expressions, but it is not working. I used something like ('^[a-z]?or') in order to avoid removing "or" from the middle of words, e.g. "morning".

import re

Temp = []
Original_File = open('out.txt', 'r')
Original_File_Content = Original_File.read()
Original_File.close()

Temp.append("".join(Original_File_Content))

FileString = "".join(Temp)

p = re.compile("^[a-z]?is|^[a-z]?or|^[a-z]?in")
RemoveWords = p.sub('', FileString)

Thanks


This is trivial without regular expressions. Read one record, split it into words, and test each word; then read the next record, and so on. Note that the following two lines probably don't do anything, as Original_File_Content is already one long string. See Section 7.2.1 for clarification http://docs.python.org/release/2.5.2/tut/node9.html#SECTION009200000000000000000. If it is a very large file, then converting it to a set and comparing it against a set of stop words using set.difference() would be faster.

Temp.append("".join(Original_File_Content))
 
FileString = "".join(Temp)
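A minimal sketch of that split-and-test approach, with a tiny illustrative stop-word set (the real list would hold the ~200 words mentioned below):

```python
# Illustrative stop-word set; the real one would hold ~200 entries.
stop_words = set(["is", "or", "in", "the", "a"])

def remove_stop_words(text):
    # split() breaks the record into words, and set membership is an
    # O(1) hash lookup on average, so each word is tested exactly once.
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stop_words("the morning is calm or quiet"))  # morning calm quiet
```

Because the test is per-word, "morning" can never lose its "or" the way a naive substring replacement would.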

Thank you..

Well, I was testing my work on text files, but the bigger picture is that I'm working with BeautifulSoup objects, so they have to be converted to strings before I can manipulate the text. It has been successful so far, except for the non-stop words.

import re
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen(someurl).read()  # someurl holds the target URL
soup = BeautifulSoup(html)

# Remove tags and non-stop words
p = re.compile("<.*?>|^[a-z]?is|^[a-z]?or|^[a-z]?in")
RemoveWords = p.sub('', str(soup))

p = re.compile(r'\W+')
WordList = p.split(RemoveWords)

I provided a minimal number of non-stop words for simplicity. There are more than 200 non-stop words, and it will be difficult (and I assume inefficient) to test every word in my list against the non-stop words. That's why I considered re.
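For what it's worth, the ^[a-z]? prefix in the pattern above only anchors to the start of the whole string, not to each word, which is why it can't reliably protect words like "morning". A hedged sketch of the same idea using \b word boundaries instead (the sample sentence is made up):

```python
import re

text = "morning or evening in the forest is calm"
# \b matches only at word boundaries, so the 'or' inside 'morning'
# is left alone while the standalone 'or', 'in', and 'is' are removed.
p = re.compile(r"\b(?:is|or|in)\b")
cleaned = re.sub(r"\s+", " ", p.sub("", text)).strip()
print(cleaned)  # morning evening the forest calm
```

The second re.sub collapses the double spaces left behind where words were deleted.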

It will be difficult (and I assume inefficient) to test every word in my list against the non-stop words

There is no way around that, no matter what method is used: every word has to be checked somehow. 200 words is not worth worrying over; 200,000 words would require some tweaking. If you are concerned about the amount of time it will take, then consider using a set or dictionary, as they are indexed via a hash.
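For example, using set.difference() as suggested earlier in the thread (note that converting to a set discards word order and duplicate counts, which may or may not matter for your use):

```python
# Build a set of the document's words and subtract the stop-word set;
# both the construction and the difference use hash lookups.
words = set("the quick brown fox is in the barn".split())
stop = set(["is", "in", "the"])
kept = words.difference(stop)
print(sorted(kept))  # ['barn', 'brown', 'fox', 'quick']
```
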
