Hi!
I want to read from a set of documents and put the information into a matrix[x][y], where x is the document and y is a boolean field denoting whether a particular word appears in document x. So each row would have y fields/dimensions, where y is the number of distinct words across the documents. Something like:

D1:”The cat in the hat disabled”
D2:”A cat is a fine pet ponies.”
D3:”Dogs and cats make good pets”
D4:”I haven’t got a hat.”

    good  pet   hat   make  dog   cat   poni  fine  disabl
D1 [+0.00 +0.00 +1.00 +0.00 +0.00 +1.00 +0.00 +0.00 +1.00 ]
D2 [+0.00 +1.00 +0.00 +0.00 +0.00 +1.00 +1.00 +1.00 +0.00 ]
D3 [+1.00 +1.00 +0.00 +1.00 +1.00 +1.00 +0.00 +0.00 +0.00 ]
D4 [+0.00 +0.00 +1.00 +0.00 +0.00 +0.00 +0.00 +0.00 +0.00 ]

The column names differ somewhat from the original words because stemming is applied to the document contents and frequently occurring words are ignored.
All the documents are in the same directory and are named sequentially.

I would suggest a dictionary with each key pointing to a list, which would look similar to your example:

D_dict = {"D1":[0, 0, 1, 0, 0, 1, 0, 0, 1 ],
          "D2":[0, 1, 0, 0, 0, 1, 1, 1, 0 ] }

Come up with some code to read the file and split/compare the words, and post back with any problems.
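As a starting point, here is a minimal sketch of the word-presence matrix described in the question. It is only an illustration under assumptions: the names `tokenize` and `term_document_matrix` are made up for this example, the stop-word set is hand-picked, and no stemming is done (so "cats" and "cat" would be separate columns).

```python
from __future__ import print_function
import re

def tokenize(text):
    """Lower-case the text and return its words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def term_document_matrix(docs, stopwords=frozenset()):
    """Return (vocabulary, matrix); matrix[x][y] is 1 if word y occurs in document x."""
    token_sets = [set(tokenize(text)) - set(stopwords) for text in docs]
    vocabulary = sorted(set().union(*token_sets))
    matrix = [[1 if word in tokens else 0 for word in vocabulary]
              for tokens in token_sets]
    return vocabulary, matrix

# Hypothetical sample data, loosely based on the documents in the question.
docs = ["The cat in the hat", "A cat is a fine pet", "I haven't got a hat"]
stop = {"the", "in", "a", "is", "i", "got", "haven't"}
vocabulary, matrix = term_document_matrix(docs, stop)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a plain list, so the result can be passed on as a list of lists or converted to whatever matrix type the next program expects.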

Actually, I have to feed this as input to another program, so I want it as a matrix. I'll try to code whatever part I can and post that here. Also, is there any way of reading different files from the same directory other than os.walk? I find using it a bit tedious.
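On the os.walk question: when all the files sit in one directory, the standard-library `glob` module is usually enough. The sketch below assumes the documents end in `.txt`; the helper names `list_text_files` and `read_documents` are invented for this example.

```python
import glob
import os

def list_text_files(directory, pattern="*.txt"):
    """Return the sorted paths of files in `directory` matching `pattern` (non-recursive)."""
    return sorted(glob.glob(os.path.join(directory, pattern)))

def read_documents(directory, pattern="*.txt"):
    """Return a list of (filename, text) pairs, one per matching file."""
    documents = []
    for path in list_text_files(directory, pattern):
        with open(path) as handle:
            documents.append((os.path.basename(path), handle.read()))
    return documents
```

Sorting the paths keeps sequentially named files (file01.txt, file02.txt, ...) in order, provided the numbers are zero-padded.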

Maybe you could pick up some ideas from this word-haiku code I wrote recently (it regroups words into lines following a given count pattern):

from __future__ import print_function

bookfile = '11.txt'
pattern = (7, 5, 7)

def nwords(book, count):
    # yield up to `count` words from the front of `book`, removing each one
    for _ in range(count):
        if book:
            yield book.pop(0)
        else:
            # the book ran out before `count` words were given
            break

with open(bookfile) as thebook:
    # read text of book and split from white space
    bookaslist =  thebook.read().split()
    # until bookaslist is empty, which is considered False value
    while bookaslist:
        for count in pattern:
            # rejoin count words and center it for 60 column print
            print(' '.join(nwords(bookaslist, count)).center(60))
        # empty line between to clarify the form
        print()

Sorry for not replying all this time. I was really busy!
tony, I couldn't get much from the code you provided; I couldn't find any leads for solving the problem mentioned before.
Guys! I need help and I need it fast (hate to sound demanding! :|)
I guess I will change the problem statement.
I have this data structure (a list of dicts):
{'year': ({'file21.txt': 1}, {'year': [1040]})}
{'year': ({'file22.txt': 2}, {'year': [1604, 1846]})}
{'year': ({'file26.txt': 1}, {'year': [110]})}
{'year-old': ({'file17.txt': 1}, {'year-old': [1344]})}
{'yearlong': ({'file01.txt': 1}, {'yearlong': [4681]})}
{'yet': ({'file01.txt': 2}, {'yet': [2055, 2403]})}
{'yet': ({'file11.txt': 1}, {'yet': [4409]})}
And I have to map it to a matrix such that the zeroth row belongs to the term year and the 21st element of this row is marked as one (because year appears in file21.txt), and likewise for all the terms. I need it in matrix form because I have to carry out some matrix operations on it.

for i in grail:
    for k in i.keys():
        print float(i[k][0].keys())

I was using this to extract 21 from 'file21.txt'. It gives an error saying float argument must be a string or a number.
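The error arises because `keys()` returns a collection of keys rather than a single string, so `float()` has nothing it can convert. One possible way to pull the number out of the file name is a regular expression; this is a sketch only, and `file_number` is a hypothetical helper name, not something from the thread.

```python
from __future__ import print_function
import re

def file_number(filename):
    """Return the first run of digits in `filename` as an int, e.g. 'file21.txt' -> 21."""
    match = re.search(r"\d+", filename)
    if match is None:
        raise ValueError("no digits in %r" % filename)
    return int(match.group())

# One entry in the format shown above: term -> (file counts, positions)
entry = {'year': ({'file21.txt': 1}, {'year': [1040]})}
for term, (file_counts, positions) in entry.items():
    for filename in file_counts:
        print(term, file_number(filename))
```

Unlike slicing fixed character positions, this does not depend on the file name prefix having a particular length.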

That is done. Now I have:

def matrix(grail):
    #for i in range(1,31):
    for i in grail:
        for k in i.keys():
            index = int(i[k][0].keys()[0][4]+i[k][0].keys()[0][5])
            print index
            row[int(index)] == 1

This gives list index out of range. :(
P.S. grail is the array of dictionaries in the format mentioned above

And what does it print when you change the print line to:

print index, len(row), row

I have to say that your code is very strange: k iterates over the keys of the dictionary, and then you go on to use i[k][0].keys()[0][4]!

For the float error in your earlier snippet, you can do

print [float(i[k][0].keys()[0][4:6]) for i in grail for k in i]

Regarding the "strange" part, I can't help it. I have to get something done in two days and I am trying out anything and everything to that end! Regarding the change you mentioned:

rows = [0]*31
def matrix(grail):
    #for i in range(1,31):
    for i in grail:
        for k in i.keys():
            index = int(i[k][0].keys()[0][4]+i[k][0].keys()[0][5])
            rows[index] == 1
            print index, len(rows), rows

This gives
.
.
23 31 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
.
.
as the output.
I initialized the rows list (size 31) as a list of all zeroes.

This seems to work! Let me be sure of it though. Will post back again! Thanks tonyjv

def matrix(grail):
    #for i in range(1,31):
    for i in grail:
        for k in i.keys():
            rows = [0]*31
            index = int(i[k][0].keys()[0][4]+i[k][0].keys()[0][5])
            rows[index] = 1
            #print index, len(rows), rows
            print rows

That does not look possible: you are setting rows[23] to 1, yet rows is all zeroes. Strange. But there is no error anymore, then. You are losing the previous values of rows on every loop, so you must save them somewhere. Maybe initialize in the outer loop instead?
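The all-zero output is most likely down to `==` versus `=`: the earlier version wrote `rows[index] == 1`, which only compares and throws the result away, while the later version uses `rows[index] = 1`, which actually stores the value. A small illustration, with the list size and index taken from the output posted above:

```python
from __future__ import print_function

rows = [0] * 31
index = 23

rows[index] == 1        # comparison: evaluates to False, result discarded, rows unchanged
unchanged = list(rows)  # snapshot after the comparison

rows[index] = 1         # assignment: stores 1 at position 23
print(sum(unchanged), sum(rows))
```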

Hmm. I need a separate row for each term, so I think resetting rows is logical. For putting these rows in a 2D matrix, can I use something like mat.append(rows)?

Sounds right.
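A rough sketch of how the `mat.append(rows)` idea could look, assuming grail has the format posted earlier and that file numbers stay below 31. Note one design choice that goes beyond what was discussed: because the same term (e.g. 'year') appears in several entries, this version keeps a single row per term and marks every file in it, instead of appending a new row per entry. The name `build_matrix` and the `row_of` lookup are made up for this example.

```python
import re

def build_matrix(grail, n_files=31):
    """Return (terms, mat) where mat[r][c] is 1 if terms[r] occurs in file number c."""
    terms = []     # row labels, in first-seen order
    row_of = {}    # term -> its row (the same list object that is stored in mat)
    mat = []
    for entry in grail:
        for term, (file_counts, _positions) in entry.items():
            if term not in row_of:
                row_of[term] = [0] * n_files   # fresh row for a new term
                terms.append(term)
                mat.append(row_of[term])
            for filename in file_counts:
                index = int(re.search(r"\d+", filename).group())
                row_of[term][index] = 1
    return terms, mat

# A few entries copied from the data structure shown earlier in the thread
grail = [
    {'year': ({'file21.txt': 1}, {'year': [1040]})},
    {'year': ({'file22.txt': 2}, {'year': [1604, 1846]})},
    {'yet': ({'file01.txt': 2}, {'yet': [2055, 2403]})},
]
terms, mat = build_matrix(grail)
```

Here `terms[r]` tells you which term row `r` belongs to, so the zeroth row is 'year' with elements 21 and 22 set.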

I'll wait for a while (in case I have any doubts) before marking this thread as solved. Thanks tonyjv! :) You came to my rescue as always! Really appreciate that!
