Hello,

I am having some terrible issues getting data formatted in a way that is useful for me. All I am trying to do is read a file that has several points of data in comma-delimited format and break each line into multiple variables (I just need to grab each value separated by a comma, hopefully leaving it formatted as a string, and use it in another function). My current code is below, with a few lines of its output below it. I don't know what the gibberish is, though I imagine it has to do with wacky encoding issues as things got moved from Windows to Linux. Thanks, and let me know if more info is needed. (Oh, I forgot to mention: when I look at my file using the command [file textfile.txt], I get CURRENTTESTFILE: MPEG ADTS, layer I, v1, 96 kBits, 44.1 kHz, Stereo, which makes no sense...)

#!/usr/bin/env python

#Program to read comma delimited data in from a file and insert it
#into a prespecified table in psql

import os
import sys
import pgdb
import string

fa = open(datafilename, 'r')

for line in fa:
    out = line.split(",")
    print(out)
    print(line)
fa.close()

brandon@xxxxx:/$ python TabelCreationModule.py


001,ADAMS COUNTY,99344,46.827354,-119.1742,COM,1592


001,ADAMS COUNTY,99344,46.827036,-119.173798,COM,1767


001,ADAMS COUNTY,99344,46.826063,-119.15651,COM,6500


001,ADAMS COUNTY,99344,46.826063,-119.169933,COM,16000


001,ADAMS COUNTY,99344,46.824922,-119.173798,COM,1320


001,ADAMS COUNTY,99344,46.824906,-119.173798,COM,3202


001,ADAMS COUNTY,99344,46.824874,-119.173798,COM,2400

Yeah, "\x##" is how Python displays a byte that isn't a printable ASCII character: the two digits are the byte's hex value, so \x00 is a null byte. As far as I can tell, these are useless characters anyway, because if you pick through the segment '\x00A\x00D\x00A\x00M\x00S\x00 \x00C\x00O\x00U\x00N\x00T\x00Y\x00' you can see the regular letters "ADAMS COUNTY" in it. So are all these null characters useless? If so, you can just remove them from every element in your script so that only the regular letters are left...

EDIT:
Here's a way to remove those control characters and keep each value in the list:

# in that for loop of the lines...
s = ''.join(out)
final = []
for segment in s.split('\x00\x00'):
    temp = ''
    for char in segment:
        if ord(char) > 31:
            temp += char
    final.append(temp)

# or, for ugly but compressed code (not really recommended though...):
s = ''.join(out)
final = []
for segment in s.split('\x00\x00'):
    final.append(''.join(filter(lambda x: ord(x) > 31, [char for char in segment])))

s is the joined string of the out variable you had in your loop. I discovered that each field was separated by a \x00\x00, so I split the string at that. Then, within each of the resulting segments, I weeded out any characters whose ASCII value was less than 32 (the null/control characters) and appended what was left to the new final list.
This is a messy solution, but it left me with the output ['001', 'ADAMS COUNTY', '99344', '46.824922', '-119.173798', 'COM', '1320'] for one of the lines. There's most likely a much better way to do this, but I'm quite tired at the moment... :P
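
A possibly cleaner route, offered as a guess rather than a confirmed diagnosis: a null byte next to every character, plus `file` calling the text file "MPEG ADTS", is consistent with the file being UTF-16 with a byte order mark (I believe `file` can mistake the leading FF FE bytes for an MPEG header). If that assumption holds, decoding the file while reading it makes the nulls disappear and a plain split(",") works. The function name read_utf16_csv_rows and the temporary sample file are made up for this sketch.

```python
import io
import os
import tempfile


def read_utf16_csv_rows(path):
    """Read a UTF-16 text file and return each non-blank line as a list of fields."""
    rows = []
    # Assumption: the file is UTF-16 with a byte order mark. The 'utf-16'
    # codec consumes the mark and uses it to pick the byte order.
    with io.open(path, 'r', encoding='utf-16') as handle:
        for line in handle:
            line = line.strip()  # drop the line ending and stray whitespace
            if line:
                rows.append(line.split(','))
    return rows


# Build a small stand-in file so the sketch runs on its own.
sample = u"001,ADAMS COUNTY,99344,46.827354,-119.1742,COM,1592\r\n"
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(sample_path, 'w', encoding='utf-16', newline='') as handle:
    handle.write(sample)

rows = read_utf16_csv_rows(sample_path)
os.remove(sample_path)
print(rows[0])
```

If the file turns out to be in some other encoding, the decode step will either raise an error or produce garbage, so it is worth checking the first few bytes of the real file before relying on this.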

did you try dos2unix on the file first?

Oh damn, good point. I forgot about the newline being different on UNIX from Windows... UNIX is just \n and Windows is \r\n, I think?
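
To make the line-ending difference concrete, here is a minimal sketch showing that stripping either ending before splitting gives the same fields. The helper name strip_line_ending and the sample lines are invented for illustration; this only covers line endings, not the null bytes discussed above.

```python
def strip_line_ending(line):
    """Remove a trailing Windows (CR LF) or UNIX (LF) line ending."""
    return line.rstrip('\r\n')


windows_line = '001,ADAMS COUNTY,99344\r\n'  # Windows ends lines with \r\n
unix_line = '001,ADAMS COUNTY,99344\n'       # UNIX ends lines with \n

fields_windows = strip_line_ending(windows_line).split(',')
fields_unix = strip_line_ending(unix_line).split(',')

# Both lines yield the same list of fields once the ending is stripped.
print(fields_windows == fields_unix)
```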

Here is some code I used in the past to solve the non-printable character problem. The characters are hidden, so if you print the line it looks the same before as it does after. You can tell that the line has been cleaned by viewing it through binascii.hexlify.

# Python 2 only: string.maketrans('', '') and string.translate do not exist in Python 3
import string, binascii

norm = string.maketrans('', '')  # identity table: a 256-character string of every byte value
non_printable = string.translate(norm, norm, string.printable)  # every byte not in string.printable

foo="\\008PROMPT\\008\008\008\008\008 cv \08\1\2\8\3\8\8\8\4\5\6\7\10"
print binascii.hexlify(foo)
#5c30303850524f4d50545c303038003800380038003820637620003801025c38035c385c385c380405060708
print foo
#\008PROMPT\0088888 cv 8\8\8\8\8
cleaned=foo.translate(norm,non_printable)  # delete the non-printable bytes, leave the rest unchanged
print binascii.hexlify(cleaned)
#5c30303850524f4d50545c3030383838383820637620385c385c385c385c38
print cleaned
#\008PROMPT\0088888 cv 8\8\8\8\8
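
For anyone on Python 3, where the string.maketrans('', '') trick is not available, roughly the same idea can be sketched on a bytes object with bytes.translate. This is an approximation of the approach rather than a drop-in replacement, and the shorter sample value is my own.

```python
import binascii
import string

# Every byte value (0-255) that is not in string.printable.
printable = string.printable.encode('ascii')
non_printable = bytes(b for b in range(256) if b not in printable)

# Shortened sample: literal backslash sequences, null bytes and control bytes.
foo = b"\\008PROMPT\\008\x008\x008 cv \x01\x02\x03"

# A table of None means "no remapping"; the second argument lists bytes to delete.
cleaned = foo.translate(None, non_printable)

print(binascii.hexlify(foo))
print(binascii.hexlify(cleaned))
print(cleaned)
```

Note that string.printable includes tab, newline and carriage return, so those bytes survive the cleaning.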

Hope this helps

dos2unix was the first thing I tried, but for whatever reason it didn't work. I'll start on these suggestions now and let you know what happens, thanks!

Okidoki, I now get what appears to be a list for each line of the file I read. The problem I have to figure out now is how to reference each variable. Using the code below with the print statement, I get output that looks like this:

0
0
0
0
0
0
0
0
0
0

which is just the [0][0] element, i.e. the first character of the first item in the "list". From the tutorials I have looked at, indexing a list like this should give the first variable in the list, not just the first character. Is there a conversion command to convert this well-formed string into an actual list?

for line in fa:
    out = line.split(",")
    s = ''.join(out)
    final = []
    for segment in s.split('\x00\x00'):
        temp = ''
        for char in segment:
            if ord(char) > 31:
                temp += char
        final.append(temp)
    print(final[0][0])
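
The difference between the two kinds of indexing can be sketched as follows, using the cleaned list reported earlier in the thread. The variable names on the unpacking line are hypothetical labels, since the thread never says what each column means.

```python
final = ['001', 'ADAMS COUNTY', '99344', '46.824922', '-119.173798', 'COM', '1320']

first_field = final[0]     # one index: the whole first element of the list
first_char = final[0][0]   # two indexes: the first character of that element
county = final[1]

# When the number of fields is known (seven here), the list can be
# unpacked straight into named variables. These names are made up.
code, county_name, zip_code, lat, lon, kind, value = final

print(first_field)
print(first_char)
print(county_name)
```

So final is already an actual list; dropping the second [0] is enough to get the whole value as a string.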

Answered my own question, as I am apparently slightly fog-headed this morning: this works PERFECTLY. Now that I am gainfully employed and plan to ask more questions in the future as I learn more, perhaps I should start donating, as these types of forums are always so quick and helpful. Thank you all SOOOOO much, you just made my day, and a beer is owed to those who helped.

-Brandon
