I'm trying to write a program that uses the Google 5-gram data to compare against an English exam called TOEIC. There are 118 files of 4-grams; each file is approximately 300 MB and has 10,000,000 lines.
So, here is the problem.
It takes about 4 or 5 seconds to read even a small batch of lines from a file using readlines(some value). Although 5 seconds is fairly short, it adds up to a few minutes if the program has to check all the way to the last line.
I only started learning Python a few months ago, so I don't know how to reduce the time it takes to read lines from a file.
I know there is a way to read a specific line of a file, but it also takes time because the file has to be loaded into memory first. I've heard that in Java you can read specific lines without loading the file into memory. Is there a similar way in Python to read a specific line without paying that loading cost?
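To show the kind of access I mean, here is a rough sketch (not part of my program) that records the byte offset of every line once and then uses seek() to jump straight to a given line. The helper names build_line_index and read_line_at and the tiny sample file are made up for illustration, and building the index still needs one full pass over the file.

```python
import os
import tempfile


def build_line_index(path):
    """Return a list where entry i is the byte offset at which line i starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for raw_line in handle:  # streams the file one line at a time
            offsets.append(position)
            position += len(raw_line)
    return offsets


def read_line_at(path, offsets, line_number):
    """Seek to a 0-based line number and read only that one line."""
    with open(path, "rb") as handle:
        handle.seek(offsets[line_number])
        return handle.readline().decode("utf-8").rstrip("\r\n")


# Made-up sample data standing in for one n-gram file (hypothetical content).
sample_path = os.path.join(tempfile.mkdtemp(), "sample-4gm.txt")
with open(sample_path, "wb") as out:
    out.write(b"a big red car\t12\na big red dog\t7\nbe on the way\t30\n")

sample_offsets = build_line_index(sample_path)
third_line = read_line_at(sample_path, sample_offsets, 2)
```

If something like this is reasonable, the index would only need to be built once per file and could then be reused for every lookup.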
This is the comparison part of my code.
while bcheck == 0:
    if bcheck == 1:
        tx.close()
        break
    # open the 4-gram file that matches the first letter of the input
    if nRange == maxLine:
        if str1stLetters == 'A':
            tx = open(r"d:\##DB\Google-4gram\4gm-0034")
        elif str1stLetters == 'B':
            tx = open(r"d:\##DB\Google-4gram\4gm-0036")
        elif str1stLetters == 'C':
            tx = open(r"d:\##DB\Google-4gram\4gm-0038")
        elif str1stLetters == 'D':
            so on....
    lines = tx.readlines(2000000)  # WANT TO REDUCE THE TIME THIS CALL TAKES
    countLine = 0
    for x in range(len(lines)):
        str4gram = lines[countLine]
        if strSearch in str4gram:
            # found a match: show it and count it
            strSplit4gram = str4gram.split()
            print str4gram
            print nRange,"th line"
            tShow.insert(INSERT, str4gram + "\n")
            nRange = nRange + 1
            if nRange == 10000000:
                # reached the last line of the file
                bcheck = 1
                str4gram = lines[countLine]
                strSplit4gram = str4gram.split()
                print nRange,"th line"
                tShow.insert(INSERT, str4gram + "\n")
                print nRange
                break
            if strSearch in lines[countLine + 1]:
                countLine = countLine + 1
                continue
            else:
                # the next line does not match, so stop searching
                bcheck = 1
                break
        else:
            nRange = nRange + 1
            countLine = countLine + 1
tShow.insert(INSERT, "\n")
tInput.delete(0, END), tInput_2.delete(0, END)
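For comparison, here is a simplified sketch of the same search written as a plain loop over the file object, which reads one line at a time instead of calling readlines(). The function name find_matching_run and the sample data are hypothetical. It assumes, as my loop does, that the matching lines sit next to each other in the file, and I have not timed it against the real 300 MB files.

```python
import os
import tempfile


def find_matching_run(path, needle):
    """Collect the first contiguous run of lines that contain needle.

    Returns (first_line_number, matching_lines) with 0-based line numbers,
    or (None, []) when nothing matches. Reading stops once the run ends.
    """
    matches = []
    first_line_number = None
    with open(path) as handle:
        for line_number, line in enumerate(handle):  # one line at a time
            if needle in line:
                if first_line_number is None:
                    first_line_number = line_number
                matches.append(line.rstrip("\n"))
            elif matches:
                break  # the run of matches has ended, skip the rest
    return first_line_number, matches


# Made-up sample data standing in for one n-gram file (hypothetical content).
demo_path = os.path.join(tempfile.mkdtemp(), "demo-4gm.txt")
with open(demo_path, "w") as out:
    out.write("a big red car\t12\na big red dog\t7\nbe on the way\t30\n")

start, found = find_matching_run(demo_path, "a big")
```

With the sample data, start is the line number of the first match and found holds the matching lines in file order.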