I have a txt file that contains three articles, each delimited by the HTML tags <doc> and </doc>.
I need to count the words in each article and get a result like this:
[the] -> [1, 20] -> [2, 34] -> [3, 12]
[author] -> [1, 7] -> [3, 2]
The code I'm using at the moment only counts the words across the whole txt file, so it does not give me the per-article output I'm after. Does anybody have suggestions for how I can produce the output that I want?
This is the code I have so far:
import re
import nltk
import numpy as np
import matplotlib.pyplot as plt
from operator import itemgetter
file=open('/Users/c1/Desktop/doc.txt')
def unicount(file):
    dic = {}
    for word in file.read().split():
        word = word.lower()
        if tekens(word) == False:
            continue
        elif word in dic:
            dic[word] += 1
        else:
            dic[word] = 1
    print(dic)
    print(len(dic))

def tekens(word):
    ''' Filtering out all punctuation marks'''
    regex = re.compile("^[A-Za-z0-9]+$")
    if regex.match(word):
        return True
    else:
        return False

unicount(file)
Here unicount counts ALL the words in the document, but what I want is to count the words within each <body>.
This code gives me the following output:
'fair': 1, 'po': 3, 'color': 1,
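For reference, here is a rough sketch of the per-article structure I'm after. It is not tested against my real file. It assumes each article is wrapped in <doc> ... </doc> (the pattern would need adjusting if the delimiter is really <body>), and the names count_per_doc and format_entry are made up for illustration. It reuses the same word filter as tekens().

```python
import re
from collections import defaultdict

# Assumption: articles are delimited by <doc> ... </doc>; stray spaces
# inside the tags (e.g. "< doc >") are tolerated.
DOC_RE = re.compile(r"<\s*doc\s*>(.*?)<\s*/\s*doc\s*>", re.DOTALL | re.IGNORECASE)
# Same filter as tekens(): keep tokens made only of letters and digits.
WORD_RE = re.compile(r"^[A-Za-z0-9]+$")


def count_per_doc(text):
    """Map each word to a list of (article number, count) pairs."""
    index = defaultdict(list)
    for doc_id, body in enumerate(DOC_RE.findall(text), start=1):
        counts = {}
        for word in body.lower().split():
            if WORD_RE.match(word):
                counts[word] = counts.get(word, 0) + 1
        # Articles are visited in order, so postings stay sorted by doc_id.
        for word, n in counts.items():
            index[word].append((doc_id, n))
    return dict(index)


def format_entry(word, postings):
    """Render one word as: [word] -> [doc, count] -> [doc, count] ..."""
    return "[%s] -> " % word + " -> ".join("[%d, %d]" % p for p in postings)


# Tiny made-up sample, not my real data.
sample = "<doc> the author the </doc> <doc> the cat </doc>"
index = count_per_doc(sample)
print(format_entry("the", index["the"]))  # [the] -> [1, 2] -> [2, 1]
```

With the real file, the text would come from open(...).read() instead of the sample string. Words that do not occur in an article simply get no entry for it, which matches the [author] line in the desired output.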