I'm parsing a large fixed-width .dat file (544,000+ rows). For each row I pull out several fields, build a SQL insert statement from them, and try to weed out duplicate statements as I go.
Here is my current code:
import string

try:
    fdata = open(vhdat).readlines()
except IOError, detail:
    print "Error opening file. ", detail

# Copy the file's lines into a list.
vhlist = list(fdata)
sqlstring = ''

# Process the .dat file.
for x in range(len(vhlist)):
    vhstring = str(vhlist[x])
    # Replace the trailing newline with a space, then strip whitespace.
    table = string.maketrans('\n', ' ')
    vhstring = vhstring.translate(table)
    vhstring = vhstring.strip()
    NPANXX = vhstring[:6]      # NPA
    NXXType = vhstring[16:18]  # NXXTYPE
    MVC = vhstring[45:50]      # Major Vertical Coordinate
    MHC = vhstring[50:55]      # Major Horizontal Coordinate
    RCLATA = vhstring[55:58]   # Rate Center LATA
    SPCA = vhstring[81:83]     # State, Province, or County Abbreviation
    DIND = vhstring[89]        # Dialable Indicator
    OLRS = vhstring[90:92]     # Other Line Rate Step
    PID = vhstring[92]         # Point Identification
    if DIND == '1':
        DIND = 'Y'
    else:
        DIND = 'N'
    # Create SQL statement.
    if NXXType == '00':
        sqltemp = ('insert into database.table values (scn_telco.npanxx_id_seq.nextval,0,'
                   '\'%s\',\'%s\',%s,%s,\'%s\',\'%s\',\'%s\',\'X\',\'%s\',%s,\'\');\n'
                   % (NPANXX, RCLATA, MVC, MHC, DIND, NXXType, OLRS, SPCA, PID))
        # Look for duplicates. Do not add the line to sqlstring if sqltemp already exists.
        if sqlstring.find(sqltemp) == -1:
            # Print the record number so I can see where the script is while processing.
            # This print line will be removed in production.
            print x
            sqlstring = sqlstring + sqltemp

f = open('C:\\Python25\\VHSQL.txt', 'w')
f.write(sqlstring)
f.close()
The problem is this:
There are over 544,000 lines in the original file. The script has been running for over 2.5 hours and is only (as I type this) on record 71,672. At this rate, it'll take close to an entire day to run.
Is there a more efficient way to check for the duplicates? I'm fairly new to programming in general (and Python in particular), so any help would be appreciated.
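
Edit: one idea I had after reading about Python's built-in set type (untested so far): keep every statement I've already emitted in a set, since a membership test on a set is roughly constant-time instead of rescanning the ever-growing sqlstring, and write each statement to the output file as I go rather than building one huge string in memory. A rough sketch of what I mean, reusing the same field slicing and names as above:

seen = set()  # statements already written; 'in' on a set is O(1) on average
out = open('C:\\Python25\\VHSQL.txt', 'w')
for vhstring in open(vhdat):
    vhstring = vhstring.strip()   # drops the trailing newline
    NPANXX = vhstring[:6]         # NPA
    NXXType = vhstring[16:18]     # NXXTYPE
    MVC = vhstring[45:50]         # Major Vertical Coordinate
    MHC = vhstring[50:55]         # Major Horizontal Coordinate
    RCLATA = vhstring[55:58]      # Rate Center LATA
    SPCA = vhstring[81:83]        # State, Province, or County Abbreviation
    DIND = 'Y' if vhstring[89] == '1' else 'N'  # Dialable Indicator
    OLRS = vhstring[90:92]        # Other Line Rate Step
    PID = vhstring[92]            # Point Identification
    if NXXType == '00':
        sqltemp = ('insert into database.table values (scn_telco.npanxx_id_seq.nextval,0,'
                   '\'%s\',\'%s\',%s,%s,\'%s\',\'%s\',\'%s\',\'X\',\'%s\',%s,\'\');\n'
                   % (NPANXX, RCLATA, MVC, MHC, DIND, NXXType, OLRS, SPCA, PID))
        if sqltemp not in seen:
            seen.add(sqltemp)
            out.write(sqltemp)    # write as we go; no giant string in memory
out.close()

Would that be the right approach, or is there something better I'm missing?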