Hi, I have a huge file (over 60 GB) whose lines all follow this consistent format:
"entry1";;"entry2";;"entry3";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
The problem is that a few lines in this file have a line break precisely after the 3rd entry, like this:
"entry1";;"entry2";;"entry3\n
";;"entry4";;"entry5";;"entry6";;"entry7";;"entry8";;"entry9";;"entry10";;"entry11"
I need to delete that extra newline and join the line below it to the one above so that it becomes a complete line again. I've come up with the following so far:
in_file = open('myhugetextfile.txt')
out_file = open('mycleaneduptextfile.txt', 'w')
# Go through the file line by line
for line in in_file:
    # Split on ;; and check whether the line has fewer than 11 entries.
    # If it does, the line has an unnecessary newline in it
    if len(line.split(';;')) < 11:
        # Strip the unneeded newline char off the end of the line
        line = line.strip('\n')
        # Read the next line (the rest of the broken line) and store it
        newline = in_file.readline()
        # Now join the original broken line and the next line
        repaired_line = line + newline
        # Write it to a new file
        out_file.write(repaired_line)
    # If there are no breaks in the line, just write it out to the new file
    else:
        out_file.write(line)
However, when I run this I get the following error:
"ValueError: Mixing iteration and read methods would lose data"
Is my program logic correct or am I doing this the wrong way?
Any help would be appreciated.
Thanks,
Adi
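
One possible way around the ValueError, as a minimal sketch rather than a definitive fix: in Python 2 the "for line in in_file" loop uses a hidden read-ahead buffer, so calling in_file.readline() inside it is refused. Pulling the continuation line from the same iterator with next() avoids mixing the two. The names repair_lines and clean_file are hypothetical helpers introduced only for illustration, and the sketch assumes that ';;' never appears inside a quoted entry and that every complete line has exactly 11 entries.

```python
def repair_lines(lines, expected_fields=11, sep=';;'):
    """Yield lines, re-joining any record that a stray newline split apart."""
    lines = iter(lines)
    for line in lines:
        # Keep pulling from the SAME iterator until the record is complete,
        # so iteration and explicit reads are never mixed.
        while len(line.split(sep)) < expected_fields:
            try:
                continuation = next(lines)
            except StopIteration:
                # End of input: emit the incomplete record as-is.
                break
            line = line.rstrip('\n') + continuation
        yield line


def clean_file(in_path, out_path):
    """Stream in_path to out_path one line at a time, repairing broken lines."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for repaired in repair_lines(src):
            dst.write(repaired)


# Example call (file names taken from the question):
# clean_file('myhugetextfile.txt', 'mycleaneduptextfile.txt')
```

Because only one line (plus any continuation) is held in memory at a time, this should cope with a 60 GB file. The while loop also handles a record that happens to be broken in more than one place.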