Hello, I have 500,000+ txt files totaling about 7+ GB of data.
I am using Python to load them into a SQLite database, creating two tables: the first holds the primary key and a hyperlink to the file, and the second holds the output of an entity extractor that a coworker developed in Perl.
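Roughly, the schema looks like this (a sketch only; the database, table, and column names below are placeholders, not my real ones):

import sqlite3

conn = sqlite3.connect('reports.db')  # placeholder database name
cur = conn.cursor()
# Table 1: primary key plus the hyperlink to the source file
cur.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, filepath TEXT)")
# Table 2: what the Perl entity extractor pulls out, keyed back to the file row
cur.execute("CREATE TABLE serials (file_id INTEGER, serial TEXT)")
conn.commit()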
To feed the extractor I am using subprocess.Popen(). The problem I am running into is load time: Perl starts, parses one file, and exits, and then the whole process repeats for the next file. I would like the Perl process to start once, stay open, and let me pass reports to it ad hoc. Here is a snippet:
import glob
import subprocess

for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
    with open(infile) as f:
        reportString = f.read()
    # A new Perl process is spawned for every file -- this is the bottleneck
    numberExtractor = subprocess.Popen(
        ["C:\\Perl\\bin\\perl5.10.0.exe",
         "D:\\MyDataExtractor\\extractSerialNumbers.pl",
         reportString],
        stdout=subprocess.PIPE, stdin=subprocess.PIPE)
    out, err = numberExtractor.communicate()  # err is None; stderr not piped
    #print out
Using just Python or just Perl isn't really an option for me. The only other workaround I have is to run all the files through the Perl scripts (there are three of them) and the Python script separately, and then write another Python script to merge the results. But that would be slow to load, and this needs to be a repeatable process for use by many people.
So my question is: is there a better way to call this Perl script? In my mind, the script would be started once, stay open, and accept input as I send it, basically acting as an interactive Perl module. But I am open to anything.
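What I am imagining is something like the sketch below, where one long-lived Perl process reads reports from its stdin. This assumes the Perl script could be rewritten to loop over stdin and to treat a sentinel line as the end of one report; both the protocol and the sentinel are assumptions on my part:

import glob
import subprocess

dirfilename = 'D:\\reports'  # stands in for self.dirfilename

# One long-lived Perl process; assumes extractSerialNumbers.pl is modified
# to read report text from stdin in a loop, using the sentinel line below
# as an end-of-report marker (hypothetical protocol)
extractor = subprocess.Popen(
    ["C:\\Perl\\bin\\perl5.10.0.exe",
     "D:\\MyDataExtractor\\extractSerialNumbers.pl"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)

for infile in glob.glob(dirfilename + '\\*\\*.txt'):
    with open(infile) as f:
        extractor.stdin.write(f.read())
    extractor.stdin.write('\n===END-OF-REPORT===\n')  # sentinel (assumed)
    extractor.stdin.flush()
    result = extractor.stdout.readline()  # assumes one result line per report

extractor.stdin.close()
extractor.wait()

The obvious risk I see is that readline() blocks forever if the Perl side buffers its output, so the Perl script would also need to flush after each report (e.g. by enabling autoflush with $| = 1).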