Figuring out most used words in text file

Question

parallel91 0 Newbie Poster

14 Years Ago

Hey all,

I have a text file and want to find its 40 most used words. I managed to do that. However, I also have a second text file containing hundreds of "stop words," and my program needs to ignore those words when it counts the most used words in the first file. I seem to be missing something, because I just can't figure this out.

Thanks in advance for the help!

Here is the code I have thus far:

from string import punctuation

#opens empty list, reads stopWords.txt
#adds all words in stopWords.txt to open list
stopWordsList = ['']
stopWordsText = open("stopWords.txt", 'r')

for words in stopWordsText:
    words = words.strip(punctuation).lower()
    words = words.strip('\n')
    stopWordsList.append(words)

#finds the top 40 words in debate.txt
#prints out the word and the frequency of the word
def sort_items(x, y):
    """Sort by value first, and by key (reverted) second."""
    return cmp(x[1], y[1]) or cmp(y[0], x[0])

N = 40
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("debate.txt")
                                             for word in line.split())
                                             
for word in words_gen:
    words[word] = words.get(word, 0) + 1


top_words = sorted(words.iteritems(), cmp=sort_items, reverse=True)[:N]
      
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

python

king_koder 0 Light Poster

14 Years Ago

Try replacing the code in line 26 (the words[word] = words.get(word, 0) + 1 line inside the "for word in words_gen:" loop) with:

if word not in stopWordsList: words.get(word, 0) + 1

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 1 · 2010-11-08T16:40:26+00:00

> Try replacing the code in line 26 with:
> if word not in stopWordsList: words.get(word, 0) + 1

You mean:

if word not in stopWordsList: words[word] = words.get(word, 0) + 1
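
To see why the assignment matters, here is a small sketch (written for Python 3; the sample dictionary and word are made up for illustration and are not from the question). The expression words.get(word, 0) + 1 only computes a value and discards it, so the dictionary changes only when the result is stored back under the key.

```python
# Hypothetical sample data, only to illustrate the difference.
words = {"tax": 2}
word = "tax"

# Expression only: the sum is computed and then thrown away.
words.get(word, 0) + 1
unchanged = dict(words)   # snapshot after the bare expression

# Assignment: the new count is stored back under the key.
words[word] = words.get(word, 0) + 1
updated = dict(words)     # snapshot after the assignment

print(unchanged, updated)
```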

king_koder 0 Light Poster · Answer 2 · 2010-11-08T17:05:59+00:00

Oops, extremely sorry for the error!

if word not in stopWordsList: words[word] = words.get(word, 0) + 1
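
For reference, the whole approach can be sketched as below. This is only one possible arrangement, not the code from the thread: it assumes Python 3 (the question uses Python 2), the helper names load_stop_words and top_words are hypothetical, and it swaps the hand-written dictionary and sort for collections.Counter. It also keeps the stop words in a set, so each membership test does not scan a list of hundreds of entries.

```python
from collections import Counter
from string import punctuation


def load_stop_words(lines):
    """Build a set of stop words from an iterable of lines (one word per line)."""
    # Strip whitespace (including the newline) first, then punctuation.
    stop_words = {line.strip().strip(punctuation).lower() for line in lines}
    # Keep the empty string so tokens that were pure punctuation are ignored too.
    stop_words.add("")
    return stop_words


def top_words(lines, stop_words, n=40):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = token.strip(punctuation).lower()
            if word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)


# Small in-memory demo; the sample lines are invented for illustration.
stop_lines = ["the\n", "and\n", "a\n"]
text_lines = ["The economy, the economy and a plan.\n", "A plan for the economy!\n"]

stop_words = load_stop_words(stop_lines)
for word, frequency in top_words(text_lines, stop_words, n=2):
    print("%s: %d" % (word, frequency))
```

Both helpers accept any iterable of lines, so an open file object for stopWords.txt or debate.txt can be passed in place of the sample lists. One difference from the question's sort_items: Counter.most_common does not break ties alphabetically, so words with equal counts may come out in a different order.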