For one of my university projects we have been assigned the problem below. My code works for the example output provided, but I need help with a few errors that show up with different inputs.
I am not asking you to do this for me, as I have written most if not all of the program; I just need help with the errors and with general ways to make the code more presentable.
Here is the problem statement, followed by my code:
You are to write an indexing program that will record and print out on which lines particular
words appear in a piece of text supplied as input by the user. Hence, the index you generate
will look like a book index, but each index entry will have a word followed by the line
numbers on which the word appears, rather than the page numbers.
Specifically, your program should:
a) read in lines of text one at a time, keeping track of the line numbers, stopping when a
line is read that contains only a single full-stop;
b) remove punctuation (as specified below) and change all text to lowercase;
c) remove stop words (the stop word list is specified below);
d) stem the words (the common endings to look out for are specified below);
e) add the remaining words to the index – a word should appear only once in the index
even though it may appear many times in the text, and the line numbers on which it
appears (removing duplicates) should be recorded with the word;
f) print the index, using exactly the format below, once all lines have been entered.
import string

pMarks = ".,:;!?&'"
# Note the comma after 'from': without it Python joins 'from' and 'them' into 'fromthem'
sWords = ['a','i','it','am','on','in','of','to','is','so',
          'too','my','the','and','but','are','very','here','even','from',
          'them','then','than','this','that','though']
endings = ['s','es','ed','er','ly','ing']
def removePunc(text):
    nopunc = ""
    for char in text:
        if char not in pMarks:
            nopunc = nopunc + char
    return nopunc.lower().split()

def removeStop(text):
    nostop = []
    for word in text:
        if word not in sWords:
            nostop.append(word)
    return nostop

def stemWords(words):
    for wrd in words:
        for n in range(1,4):
            if wrd[-n:] in endings:
                index = words.index(wrd)
                words.remove(wrd)
                words.insert(index,wrd[:-n])
    return words

def removeDuplicates(words):
    nodupe = []
    for wrd in words:
        if wrd not in nodupe:
            nodupe.append(wrd)
    return nodupe
def main():
    lines = []
    textTwo = ""
    text = raw_input("Indexer: type in lines, finish with a . at start of line only \n")
    if text == ".":
        exit()
    lines.append(text)
    while textTwo != ".":
        textTwo = raw_input()
        lines.append(textTwo)
        text = text + " " + textTwo
        if textTwo == ".":
            lines = lines[:len(lines)-1]
    text = removePunc(text)
    text = removeStop(text)
    text = stemWords(text)
    text = removeDuplicates(text)
    print "The Index is:"
    for word in text:
        lineNumbers = []
        for l in lines:
            if word in l:
                lineNumbers.append(lines.index(l)+1)
        print word, lineNumbers

main()
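One likely source of errors with other inputs is the line-number lookup in main(): `word in l` is a substring test against the raw line (so "wind" also matches "window", and capitalised words are missed), and `lines.index(l)` returns the first matching line when two lines are identical. A minimal sketch of one way around this, using only lists and loops, is to process each line separately and record its number as you go. The names `clean_line` and `build_index` are hypothetical, the stop word list is shortened for the example, and stemming is left out here (it would be applied to each word inside the inner loop).

```python
# Assumption: these mirror pMarks and sWords from the code above (stop list shortened).
P_MARKS = ".,:;!?&'"
STOP_WORDS = ['a', 'i', 'it', 'the', 'and', 'is', 'in']

def clean_line(line):
    """Remove punctuation, lowercase, split into words, drop stop words (one line only)."""
    no_punc = ""
    for char in line:
        if char not in P_MARKS:
            no_punc = no_punc + char
    kept = []
    for word in no_punc.lower().split():
        if word not in STOP_WORDS:
            kept.append(word)
    return kept

def build_index(lines):
    """Return two parallel lists: the words, and the list of line numbers for each word."""
    words = []
    numbers = []
    line_no = 0
    for line in lines:
        line_no = line_no + 1
        for word in clean_line(line):
            if word not in words:
                # First time this word is seen: start its list of line numbers
                words.append(word)
                numbers.append([line_no])
            else:
                # Word already indexed: add this line number unless it is a duplicate
                pos = words.index(word)
                if line_no not in numbers[pos]:
                    numbers[pos].append(line_no)
    return words, numbers

sample = ["The wind blows", "Rain in the window", "Wind, wind!", "The wind blows"]
words, numbers = build_index(sample)
```

With the sample above, "wind" is recorded for lines 1, 3 and 4 only: the whole-word comparison does not match "window" on line 2, and the repeated first line is still counted as line 4.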
What could be done to ensure that the words are stemmed fully and correctly? For example, "annoyingly" and "sings" each contain more than one ending.
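One possible approach, sketched here rather than offered as the required solution: keep stripping endings from a single word until none of them matches, and build a new list instead of removing and inserting into the list being looped over. In stemWords above, a word such as "boxes" matches both 's' and 'es', so the second `words.index(wrd)` appears to raise a ValueError because the word has already been removed. The minimum stem length `MIN_STEM` is an assumption that is not in the assignment text shown; without some such guard "sings" would shrink to "sing", then "s", then an empty string.

```python
# Same endings as in the question, ordered longest first so 'es' is tried before 's'
ENDINGS = ['ing', 'es', 'ed', 'er', 'ly', 's']
MIN_STEM = 3  # assumption: never shorten a word to fewer than 3 letters

def stem_word(word):
    """Strip endings repeatedly until no ending can be removed."""
    changed = True
    while changed:
        changed = False
        for ending in ENDINGS:
            n = len(ending)
            if word[-n:] == ending and len(word) - n >= MIN_STEM:
                word = word[:-n]
                changed = True
                break  # start again from the longest ending
    return word

def stem_words(words):
    """Return a new list of stems; the input list is not modified."""
    stemmed = []
    for word in words:
        stemmed.append(stem_word(word))
    return stemmed

print(stem_word("annoyingly"))  # annoy
print(stem_word("sings"))       # sing
```

Whether "sings" should stop at "sing" depends on the rules in the assignment, so the guard value may need adjusting to match the expected output.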
Also, for the output, my code prints "wind [1, 3, 4]" instead of "wind 1, 3, 4".
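The brackets appear because printing a list shows Python's list representation. A minimal sketch of building the line by hand with string concatenation follows; `format_entry` is a hypothetical helper name, and the single-argument `print(...)` form is used because it behaves the same in Python 2 and Python 3.

```python
def format_entry(word, line_numbers):
    """Build 'word n1, n2, n3' without the list brackets."""
    text = word + " "
    first = True
    for number in line_numbers:
        if not first:
            text = text + ", "  # separator goes before every number except the first
        text = text + str(number)
        first = False
    return text

print(format_entry("wind", [1, 3, 4]))  # wind 1, 3, 4
```

If string methods are allowed in the course, joining the numbers (converted to strings) with ", " would give the same result in fewer lines.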
Also, we are not allowed to use anything that we haven't covered in the course so far, so only the basic constructs can be used.
Any help would be great, thanks.