Hello,
I am trying to compute word frequencies using n-grams. I have taken the Brown corpus from NLTK and adapted it for n-gram calculations by adding <s> at the beginning and </s> at the end of each sentence (in place of the final period). I need to calculate the n-gram frequencies from this data, but I am unsure how to go about it. My end goal is to generate random n-grams based on bigrams, trigrams, and quadgrams (4-grams).
How should I approach the calculations? Thank you. Here is my code so far:
from nltk.corpus import brown


def alter_list(row):
    """Return a copy of the sentence wrapped in <s> ... </s> markers."""
    row = list(row)  # copy so the corpus data is not modified in place
    if row and row[-1] == '.':
        row[-1] = '</s>'  # the end marker replaces a final period
    else:
        row.append('</s>')
    return ['<s>'] + row


news = brown.sents(categories='editorial')
print('%d\n' % len(news))

# Print every sentence in the category with its boundary markers
for row in news:
    print(alter_list(row))
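
One possible way to do the counting, as a minimal sketch and not a definitive implementation: slide a window of size n over each padded sentence, tally the resulting tuples in a collections.Counter, and then sample the next word in proportion to how often it followed the previous n-1 words. The helper names (ngrams, count_ngrams, build_model, generate) and the toy sentences list are hypothetical and only stand in for the output of alter_list. The sketch assumes Python 3.6+ (for random.choices) and n >= 2 for generation.

```python
import random
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams sentence by sentence, so none crosses a boundary."""
    counts = Counter()
    for sent in sentences:
        counts.update(ngrams(sent, n))
    return counts


def build_model(counts):
    """Map each (n-1)-token context to a Counter of the tokens following it."""
    model = defaultdict(Counter)
    for gram, freq in counts.items():
        model[gram[:-1]][gram[-1]] += freq
    return model


def generate(counts, n, max_len=20, rng=random):
    """Sample one sentence, weighting each choice by observed frequency."""
    model = build_model(counts)
    # Start from an n-gram that begins a sentence.
    starts = [g for g in counts if g[0] == '<s>']
    out = list(rng.choices(starts, weights=[counts[g] for g in starts])[0])
    while out[-1] != '</s>' and len(out) < max_len:
        followers = model.get(tuple(out[-(n - 1):]))
        if not followers:
            break  # unseen context: stop instead of guessing
        words = list(followers)
        out.append(rng.choices(words, weights=[followers[w] for w in words])[0])
    return out


# Hypothetical toy data standing in for the padded Brown sentences.
sentences = [
    ['<s>', 'the', 'cat', 'sat', '</s>'],
    ['<s>', 'the', 'cat', 'ran', '</s>'],
    ['<s>', 'the', 'dog', 'sat', '</s>'],
]

bigrams = count_ngrams(sentences, 2)
print(bigrams.most_common(3))
print(' '.join(generate(bigrams, 2)))
```

With the real corpus, sentences would be something like [alter_list(row) for row in news], and the same functions can be called with n set to 2, 3, or 4. Dividing a follower's count by the total count for its context gives a relative frequency if probabilities are needed instead of raw counts.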