Hi!
I want to read a set of documents and store the information in a matrix[x][y], where x is the document and y is a boolean field denoting whether a particular word appears in document x. Each row would therefore have one field (dimension) per distinct word found across the documents. Something like:
D1: "The cat in the hat disabled"
D2: "A cat is a fine pet ponies."
D3: "Dogs and cats make good pets"
D4: "I haven’t got a hat."
    good   pet    hat    make   dog    cat    poni   fine   disabl
D1 [+0.00  +0.00  +1.00  +0.00  +0.00  +1.00  +0.00  +0.00  +1.00 ]
D2 [+0.00  +1.00  +0.00  +0.00  +0.00  +1.00  +1.00  +1.00  +0.00 ]
D3 [+1.00  +1.00  +0.00  +1.00  +1.00  +1.00  +0.00  +0.00  +0.00 ]
D4 [+0.00  +0.00  +1.00  +0.00  +0.00  +0.00  +0.00  +0.00  +0.00 ]
The column labels differ from the words in the documents because stemming is applied to the document contents and frequently occurring words are ignored.
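To make the target concrete, here is a minimal Python sketch of the kind of matrix I mean. The stop-word list and the `crude_stem` function are my own placeholders, chosen only so that the four sample documents give the stems listed above; a real stemmer (Porter's, for example) and a real stop-word list would replace them. The columns come out in alphabetical order, so their order differs from the listing above.

```python
import re

# Hypothetical stop-word list: just enough to cover the four sample documents.
STOP_WORDS = {"a", "and", "got", "haven't", "i", "in", "is", "the"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    if word.endswith("ies"):
        return word[:-3] + "i"          # ponies -> poni
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]                # disabled -> disabl
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # cats -> cat
    return word


def tokenize(text):
    """Lowercase, split into words, drop stop words, then stem what is left."""
    text = text.lower().replace("\u2019", "'")   # normalise curly apostrophes
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def build_matrix(docs):
    """Return (vocab, matrix); matrix[x][y] is 1.0 if term y occurs in document x."""
    token_sets = [set(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*token_sets))
    matrix = [[1.0 if term in tokens else 0.0 for term in vocab]
              for tokens in token_sets]
    return vocab, matrix


docs = [
    "The cat in the hat disabled",
    "A cat is a fine pet ponies.",
    "Dogs and cats make good pets",
    "I haven't got a hat.",
]
vocab, matrix = build_matrix(docs)
print(vocab)
for name, row in zip(["D1", "D2", "D3", "D4"], matrix):
    print(name, row)
```

Using a set per document keeps the entries boolean; switching to term counts would mean replacing the sets with something like `collections.Counter`.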
All the documents are in the same directory and are named sequentially.
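For the reading step, something along these lines is what I have in mind, again only as a sketch. It assumes the files are plain UTF-8 text and that "sequentially" means the names contain a running number; the helper names `natural_key` and `read_documents` are made up for illustration. The natural sort is there so that a file numbered 10 comes after the one numbered 2 rather than before it.

```python
import re
from pathlib import Path


def natural_key(path):
    """Split the file name into text and integer chunks for numeric-aware sorting."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", path.name)]


def read_documents(directory):
    """Return (file name, contents) pairs for every file in directory, in sequence order."""
    files = sorted((p for p in Path(directory).iterdir() if p.is_file()),
                   key=natural_key)
    # Assumption: the documents are UTF-8 encoded text files.
    return [(p.name, p.read_text(encoding="utf-8")) for p in files]
```

Row x of the matrix would then correspond to the x-th pair returned by `read_documents`.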