I'm working with some really ugly files at the moment. When I get them, they can look like any of these:

All data on one line, delimited by ┌:
data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌

Nice data. All the bits I'm interested in already on one line per piece of information:
data1|data2|data3|
data1|data2|data3|

Mixed:
data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌
data1|data2|data3|
data1|data2|data3|

or even:
"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"

So at the moment I have this:

# -*- coding: utf-8 -*-
import os

def process_data(data):
    print '%s' % data

directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        print nfile
        with open(nfile, 'r') as infile:
            for line in infile:
                #discard blank lines
                if not line.strip():
                    continue
                else:
                    line = line.strip()
                    if '┌' in line:
                        lines = line.split('┌')
                        for sline in lines:
                            process_data(sline[:-1])
                    elif line.startswith('"') and line.endswith('"'):
                        process_data(line[1:-2])
                    else:
                        process_data(line[:-1])

This seems to work OK, but I'm not convinced this is the best way to go about it. Does anyone have any suggestions on how I can tidy this up?

Also, the delimiter character is not really the one I have, but it is the closest I could find that would display here.

Like:

# -*- coding: utf-8 -*-
for d in (u'data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|',
          u'''data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌
data1|data2|data3|
data1|data2|data3|''',
          u'''"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"'''):
    if d.strip():
        print(''.join(c for c in d.replace(u'┌', '\n') if c.isalnum() or c in ('|','\n')))
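The same idea can be packaged as a helper that turns any of the blobs into one clean record per line. A sketch (u'┌' again stands in for your real delimiter, and the sample records are invented):

```python
# -*- coding: utf-8 -*-

def normalize(text):
    # Turn the odd delimiter into newlines, then clean each record.
    records = []
    for line in text.replace(u'┌', u'\n').splitlines():
        line = line.strip().strip(u'"')   # drop surrounding quotes, if any
        if line:                          # skip blank lines
            records.append(line)
    return records

print(normalize(u'data1|data2|data3|┌data1|data2|data3|┌'))
print(normalize(u'"data1|data2|data3|"\n"data1|data2|data3|"'))
```

Both calls print two clean records, so the one-line, per-line, mixed, and quoted layouts all go through the same path.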

Thanks Tony. I'll have to figure out exactly what that does later (lunch time first :)

Quick question though. Can I then do something like this:

import os

def process_data(data):
    for d in (data):
        print d
    #    if d.strip():
    #        print(''.join(c for c in d if c.isalnum() or  c in ('|','\n')))

directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        with open(nfile, 'r') as infile:
            the_file = infile.read()
            process_data(the_file)

At the moment I'm getting:
d
a
t
a
1
d
a
t
a
2
d
a
t
.
.
.

So obviously I'm doing something wrong.
Cheers
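Iterating over a string yields one character at a time, which is exactly the output you're seeing; split the text into records first. A quick demonstration with made-up data:

```python
data = 'data1|data2|data3|\ndata1|data2|data3|'

# Iterating a string directly yields single characters -- the parentheses
# in "for d in (data)" do not make a tuple, they are just grouping.
chars = [c for c in data]
print(chars[:5])   # ['d', 'a', 't', 'a', '1']

# Split on newlines (or on the record delimiter) first to get whole records.
for record in data.splitlines():
    print(record)
```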

def process_data(data):
    for d in (data):
        print d
    #    if d.strip():
    #        print(''.join(c for c in d if c.isalnum() or  c in ('|','\n')))

To

def process_data(data):
    return ''.join(c for c in data.replace(u'┌', '\n').replace('\n\n', '\n') if c.isalnum() or c in ('|', '\n'))

print(process_data(the_line))

It is better to return the value and print it in the caller.

Ok thanks again Tony.

I ended up with this:

# -*- coding: utf-8 -*-

import os
def process_data(data):
    return ''.join(c for c in data.replace(u'┌', '\n') if c.isalnum() or c in ('|','\n'))


directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        print nfile


        with open(nfile, 'r') as infile:
            for line in infile:
                if line.strip():
                    print(process_data(line.strip()))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 10182: ordinal not in range(128)
Dernit!

Just to be clear, I have no idea how to fix this. My guess is that Python is expecting the data to be ASCII but it is something else, right?

A known solution is

import codecs
file = codecs.open(path, encoding='iso8859-1')

see if it works for you.
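For what it's worth, byte 0xe9 is 'é' in iso8859-1, which fits the traceback: the default ascii codec cannot decode it, but latin-1 can. A quick sanity check:

```python
raw = b'caf\xe9'                 # 0xe9 is 'e-acute' in iso8859-1 (latin-1)
print(raw.decode('iso8859-1'))   # decodes cleanly
# The same byte is invalid to the default ascii codec, which is what
# the UnicodeDecodeError in your traceback is complaining about.
```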

I found the problem. The records are separated by '┌', then a space, then (sometimes one, sometimes two) NULL characters. I was trying to get rid of the null characters using
.replace(u'\0', '', line)

this is what brought up the error. At the moment I'm using this:

# -*- coding: utf-8 -*-
import os
import re

def process_data(data):
    return ''.join(c for c in data.replace('┌', '\n') if c.isalnum() or c in ('|','\n'))

out = open('outddd.txt', 'w')
directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        print nfile


        with open(nfile, 'r') as infile:
            for line in infile:
                line = re.sub(r'\0', r'',line)
                if line.strip():
                    out.write(process_data(line.strip()))
out.close()

But I'm losing all my spaces and underscores. Any idea why this is?
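That filter is the likely culprit: `isalnum()` is False for spaces and underscores, so the generator discards them. Widening the whitelist keeps them (a sketch, assuming those are the only extra characters you need):

```python
# -*- coding: utf-8 -*-

def process_data(data):
    # str.isalnum() is False for ' ' and '_', so they must be listed
    # explicitly to survive the filter.
    keep = (u'|', u'\n', u' ', u'_')
    return u''.join(c for c in data.replace(u'┌', u'\n')
                    if c.isalnum() or c in keep)

print(process_data(u'data_1|da ta2|'))   # underscore and space survive
```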

\0 characters usually mean that your file is encoded. Try to open it with codecs.open and the appropriate encoding (it could be 'utf8' or 'iso8859-1' or another encoding). You could try this first:

with open(filename, 'rb') as ifh:
    print repr(ifh.read(4))

this may give you the BOM from which we could perhaps guess the encoding.
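A sketch comparing those leading bytes against the BOM constants the codecs module already provides:

```python
import codecs

# Known BOM signatures from the codecs module.
BOMS = [('utf-8-sig', codecs.BOM_UTF8),
        ('utf-16-le', codecs.BOM_UTF16_LE),
        ('utf-16-be', codecs.BOM_UTF16_BE)]

def sniff_bom(first_bytes):
    # Return the encoding whose BOM starts the data, or None.
    for name, bom in BOMS:
        if first_bytes.startswith(bom):
            return name
    return None

print(sniff_bom(b'\xef\xbb\xbfhello'))  # utf-8-sig
print(sniff_bom(b'UDC_'))               # None -- no BOM at all
```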

All I get when I run that is this:
'UDC_'
These characters appear at the beginning of the 'stream', if I understand this correctly. Does this help me?

I'm reading up on encodings at the moment...

sigh, again

So, as I was saying... If I do this:

text_file = open('example.txt')
text_file.readline()

my output looks like this:

UDC_*data|data|data|\x01 \x00\x00UDC_data|data|data|\x01 \x00\x00UDC_data|data|data|\x01 \x00\x00\x00"

If I look at the table found here: http://en.wikipedia.org/wiki/Byte_order_mark , it seems like these are not proper BOMs?

Is there a way I can do something like this:

        with open(nfile, 'r') as infile:
            for line in infile:
                match = re.search(u'(\x..)',line)
              #  line = re.sub(r'\\x..', r'',line)
              #  line = re.sub(r'\x01', r'',line)
                #if line.strip():
                #    out.write(process_data(line))
                if match != None:
                    print match.group(1)

This works fine:
line = re.sub(r'\x01', r'',line)

But this:
match = re.search(u'(\x..)',line)

gives me an error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 1-3: truncated \xXX escape

I want to look through all the files I have and see how many of these characters exists and what they are
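The SyntaxError comes from the string literal itself: `\x` in a Python string must be followed by exactly two hex digits, so `u'(\x..)'` never even reaches the regex engine. To match arbitrary control bytes, use a character class in a raw string instead; a sketch for counting which ones occur:

```python
import re
from collections import Counter

def count_control_chars(text):
    # \x in a regex needs exactly two hex digits; a character class over
    # the ASCII control range (0x00-0x1f) matches any control byte.
    return Counter(re.findall(r'[\x00-\x1f]', text))

sample = 'data|data|\x01 \x00\x00data|data|\x01'
print(count_control_chars(sample))   # tallies of \x00 and \x01
```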

It doesn't look like a BOM. Did you try to open the file with the codecs module to see if it solves your accented-letters issue?

If I do
infile = codecs.open(nfile, encoding='iso8859-1')
I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8421: ordinal not in range(128)

If I try utf8 I get:
UnicodeError: UTF-16 stream does not start with BOM

Can't you upload the file somewhere so that we can try to read it?
