Hi All

Hope everyone's doing well.

I have two tab-separated (\t) files with 3 columns each; file1 contains 600050 rows and file2 contains 11221133 rows.

I am comparing file2 with file1 to match common entries in the first two columns: if file1[0:2] is in file2[0:2], write file2[0:2] + column 3, else write file1[0:2] + 5.

I did this using two dictionaries, with the first two columns as the key and the third column as the value, but loading file2 into a dictionary gave a memory error:

dict2[item1,item2] = item3
MemoryError

I also tried a list, but got a memory error there too. A set works, but there are a huge number of duplicates and the order is lost; I want to preserve the same order as file1.

Please find the test files attached.

f1 = open(file1)
f2 = open(file2)   # very, very large file
f3 = open(file3, 'w')

dict1 = {}
dict2 = {}
for line in f1:
    stripped = line.strip('\n')
    item1, item2, item3 = stripped.split()
    dict1[item1, item2] = item3        # key is the (item1, item2) tuple

for line in f2:
    stripped = line.strip('\n')
    item1, item2, item3 = stripped.split()
    dict2[item1, item2] = item3        # here it gives the MemoryError

for item in dict1.keys():
    if item in dict2:
        data = item[0] + '\t' + item[1] + '\t' + dict2[item] + '\n'
        f3.write(data)
    else:
        data = item[0] + '\t' + item[1] + '\t' + str(0) + '\n'
        f3.write(data)
f1.close()
f2.close()
f3.close()

A second dictionary is not required. Create a tuple containing item1 and item2 and look it up in the dictionary. If it is found, you have a match. If you want to keep the keys in the same order as the file, use an ordered dictionary; for Python 2.7 it is collections.OrderedDict.
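
For example, a minimal sketch along those lines, assuming file1 (the smaller file) fits in memory; the outfile name and the 0 default for non-matches are borrowed from your own code, and a matched row is assumed to take file2's third column:

from collections import OrderedDict          # Python 2.7+

od = OrderedDict()
with open('file1') as f1:                    # small file: this is all that stays in memory
    for line in f1:
        item1, item2 = line.split()[:2]
        od[item1, item2] = '0'               # tuple key; 0 is the default until a match turns up

with open('file2') as f2:                    # big file: streamed line by line, nothing stored from it
    for line in f2:
        fields = line.split()
        key = tuple(fields[:2])
        if key in od:                        # single dictionary lookup, no dict2 needed
            od[key] = fields[2]              # remember file2's third column for the match

with open('outfile', 'w') as out:            # rows come out in file1's original order
    for (item1, item2), value in od.items():
        out.write('\t'.join([item1, item2, value]) + '\n')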

Umm. Too bad that the data you need to keep is in the shorter file. Still, if there is enough room to make a set of pairs from the large file, you can do it like this:

openfile2 = open(file2, 'r')
matchset = set(tuple(x.split()[:2]) for x in openfile2)   # only the key pairs are kept
openfile1 = open(file1, 'r')
openout = open(outfile, 'w')
for row in openfile1:
    items = row.split()
    if tuple(items[:2]) in matchset:
        openout.write('\t'.join(items[:3]) + '\n')                   # matched: keep the row's three columns
    else:
        openout.write('\t'.join([items[0], items[1], '0']) + '\n')   # no match: default value, as in your code
openout.close()
openfile1.close()
openfile2.close()

If that still causes a memory error, then you will have to do it multiple times. Split file2 into some smaller parts and do this:

  • Treat each sub-file of file 2 as the entire file2 in the algorithm above
  • Instead of writing the final result on a non-match, write the whole line
  • On the next pass, use the modified outfile as the infile:
    • Already modified lines will either be recreated or untouched: OK
    • Some unmodified lines will be matched and modified: OK
  • On the last pass, use the algorithm above, with file1 as the mostly modified file and the last part of file2 providing the match set.

Note that this only works because you want adjacent columns if matched. In other cases, you will have to write the whole line with a first character/column flag for matches (don't look at those lines in later passes). Then make a final pass through the file, rewriting it according to the marks.
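
If it helps, here is a rough, untested sketch of that multi-pass idea. It uses the marker-column variant mentioned in the note above, because with only three columns a row that has already been matched would otherwise look exactly like an untouched one; the chunk and output file names are made up, and the 0 default for never-matched rows is taken from your own code:

parts = ['file2_part0', 'file2_part1', 'file2_part2']       # file2 pre-split into pieces (made-up names)

infile = 'file1'
for i, part in enumerate(parts):
    with open(part) as fp:                                  # only this chunk's key pairs are in memory
        matchset = set(tuple(x.split()[:2]) for x in fp)
    outfile = 'pass%d.tmp' % i
    with open(infile) as fin, open(outfile, 'w') as fout:
        for row in fin:
            items = row.split()
            if items[0] == '*':                             # already matched in an earlier pass
                fout.write(row)
            elif tuple(items[:2]) in matchset:              # new match: flag it and keep the columns
                fout.write('*\t' + '\t'.join(items[:3]) + '\n')
            else:                                           # no match yet: keep the whole line
                fout.write(row)
    infile = outfile                                        # the next pass reads this pass's output

with open(infile) as fin, open('final_out', 'w') as fout:   # final pass: resolve the flags
    for row in fin:
        items = row.split()
        if items[0] == '*':
            fout.write('\t'.join(items[1:4]) + '\n')        # drop the flag, keep the matched row
        else:
            fout.write('\t'.join([items[0], items[1], '0']) + '\n')   # never matched: default value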

It's still giving a memory error :-(

I tried using the ordered dict, but it's still giving the memory error. I tried it on a small set, where it works, but the order is not preserved.

My code is like this:

for row in openfile2:
    line = row.split()
    if tuple(line[:2]) in od.keys():
        print line
    else:
        "here i want to print the key,value pair from od for which no entry in file2"

Can anyone help with this?

"I tried using the ordered dict, but it's still giving the memory error"

That is too vague to be of any value. Post the actual error. Also this line
if tuple(line[:2]) in od.keys():
should just be
if tuple(line[:2]) in od:

".keys" returns a list of the keys which means there is double the amount of memory for the number of keys. If you can not do this with the memory available, then you want to use an SQLite database on disk instead.

I can only suggest again that you split the work into more manageable pieces. I tried to read in that many triples from a file (generated by file.write("%s\t%s\t%s\n" % (random.random(), random.random(), random.random()))) and was unable to get much past 8 million lines. It did proceed very slowly after that, but there was obviously a lot of thrashing going on: it took anything from 0.3 seconds to 850 seconds to read the next 1000 rows. Yes: nearly 15 minutes! I'm running on OS X with 4 GB of memory.

While I was doing this work, it occurred to me that you might have duplicate keys in your long file. What should happen in that case? There are three options:

  1. First key wins
  2. Last key wins
  3. Value for a randomly chosen key is used

(Up to 11 million rows, my random data had no duplicate keys. I killed the program at that point since it was apparent it would not finish in a reasonable time.)
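
To make those three options concrete, here is a tiny illustration with made-up sample rows; the script below keeps every value instead, so a "random" winner could be picked from the list afterwards:

rows = [(('a', 'b'), '1'), (('a', 'b'), '2')]        # the same key appears twice

first_wins = {}
last_wins = {}
all_values = {}
for k, v in rows:
    first_wins.setdefault(k, v)                      # 1. first key wins: later duplicates are ignored
    last_wins[k] = v                                 # 2. last key wins: each duplicate overwrites
    all_values.setdefault(k, []).append(v)           # keep everything (option 3: choose one at random later)

print first_wins     # {('a', 'b'): '1'}
print last_wins      # {('a', 'b'): '2'}
print all_values     # {('a', 'b'): ['1', '2']}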

For reference, here's my code for reading from the file

#!/usr/bin/env python

from random import random
import time

data = {}
doublecounter = 0          # how many keys have been seen more than once

def mumble(i, lp, s, ss, e):
   # progress report: rows read so far, duplicate count, time for this chunk and in total
   looptime = e - ss
   totaltime = e - s
   print "%8d: count: %d, (lp:%d) lptime: %2.2f, ttime: %2.2f" % (i, doublecounter, lp, looptime, totaltime)

def doit(f):
   when = 1000000          # report every `when` rows; shrunk as reading slows down
   global data, doublecounter
   start = time.time()
   count = 0
   lpstart = start
   end = start
   for line in f:
      if 0 == count % when:
         lpstart = end
         end = time.time()
         mumble(count, when, start, lpstart, end)
         if count == 7000000: when /= 10
         elif count == 8800000: when /= 10
         elif count == 11000000: when /= 10
      s = line.split()
      k = tuple(s[:2])          # the (col1, col2) pair is the key
      v = s[2]
      data.setdefault(k, [])
      data[k].append(v)         # keep every value so duplicate keys show up
      if len(data[k]) > 1:
         doublecounter += 1
      count += 1

with open('f2', 'r') as f:
   doit(f)

and the first several rows of my f2 file

0.681726943412	0.317524601127	0.774220362723
0.960827529946	0.884868924006	0.805958559062
0.948431957255	0.654394548708	0.261958105771
0.790787661492	0.588754682813	0.784801700146
0.91496579649	0.65679730019	0.643389604304
0.410742283212	0.266691538578	0.251305611073
0.452187326938	0.537941526934	0.162800839411
0.298231566648	0.287904077361	0.553563473187
0.892003052642	0.483519506157	0.605940960314
0.118257450942	0.51597182572	0.868219791638

(the white spaces are tabs)
