Hey all,
I have a text file and I want to find the 40 most used words in it. I managed to do that. But I also have a second text file containing hundreds of "stop words." While looping through the first file to find the top 40 words, my program needs to ignore the stop words. I seem to be missing something, because I just can't figure this out.
Thanks in advance for the help!
Here is the code I have thus far:
from string import punctuation
#opens empty list, reads stopWords.txt
#adds all words in stopWords.txt to open list
stopWordsList = ['']
stopWordsText = open("stopWords.txt", 'r')
for words in stopWordsText:
    words = words.strip(punctuation).lower()
    words = words.strip('\n')
    stopWordsList.append(words)
#finds the top 40 words in debate.txt
#prints out the word and the frequency of the word
def sort_items(x, y):
    """Sort by value first, and by key (reverted) second."""
    return cmp(x[1], y[1]) or cmp(y[0], x[0])
N = 40
words = {}
words_gen = (word.strip(punctuation).lower() for line in open("debate.txt")
             for word in line.split())
for word in words_gen:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.iteritems(), cmp=sort_items, reverse=True)[:N]
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
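
The code above builds stopWordsList but never checks it while counting, which is probably why the stop words still show up. Below is a minimal sketch of one way to skip them. It is written for Python 3 (an assumption; the code above is Python 2), and the helper names load_stop_words, top_words and print_top_words are made up for illustration. It strips the newline before the punctuation, keeps the stop words in a set for fast lookups, and keeps the same ordering as sort_items (highest count first, ties alphabetical).

```python
from collections import Counter
from string import punctuation


def load_stop_words(lines):
    """Build a set of normalized stop words from an iterable of lines."""
    # Strip whitespace (including the newline) first, then punctuation.
    cleaned = (line.strip().strip(punctuation).lower() for line in lines)
    # Drop blank entries so an empty string never ends up in the set.
    return {word for word in cleaned if word}


def top_words(lines, stop_words, n=40):
    """Return the n most frequent words that are not stop words."""
    counts = Counter()
    for line in lines:
        for raw in line.split():
            word = raw.strip(punctuation).lower()
            # Skip tokens that were pure punctuation and any stop word.
            if word and word not in stop_words:
                counts[word] += 1
    # Highest count first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


def print_top_words(stop_path, text_path, n=40):
    """Hypothetical driver: read both files and print 'word: count' lines."""
    with open(stop_path) as stop_file:
        stop_words = load_stop_words(stop_file)
    with open(text_path) as text_file:
        for word, frequency in top_words(text_file, stop_words, n):
            print("%s: %d" % (word, frequency))
```

With the file names from the post, this would be called as print_top_words("stopWords.txt", "debate.txt"). The key change is the `word not in stop_words` check inside the counting loop; the same test could be added to the Python 2 loop above without the other changes.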