i have a text document 'topics.txt':
1~cocoa
2~
3~
4~
5~grain~wheat~corn~barley~oat~sorghum
6~veg-oil~linseed~lin-oil~soy-oil~sun-oil~soybean~oilseed~corn~sunseed~grain~sorghum~wheat
7~
8~
9~earn
10~acq
and so on..
here the numbers correspond to the file names, i have about 20000 files.
import os
import re
import sys
sys.stdout=open('f1.txt','w')
from collections import Counter
from glob import glob
def removegarbage(text):
text=re.sub(r'\W+',' ',text)
text=text.lower()
return text
folderpath='d:/individual-articles'
counter=Counter()
filepaths = glob(os.path.join(folderpath,'*.txt'))
num_files = len(filepaths)
with open('topics.txt','r') as filehandle:
lines = filehandle.read()
words = removegarbage(lines).split()
counter.update(words)
for word, count in counter.most_common():
probability=count//num_files
print('{} {} {}'.format(word,count,probability))
i need my output to be of the form:
word,count of word in topics.txt,probability, list of files,number of files in list,
so far my program works fine till probability, but how do i get a list of the files belonging to a word?
ex: 'grain' must contain the list(5,6,....)
would counting the line number for each word and storing the line number work?
how do i go about it?
Plz help!