Files, splitting lines etc...

Question

4evrmrepylrning 0 Newbie Poster

12 Years Ago

I'm working with some really ugly files at the moment. When I get them they can look like any of these:

So at the moment I have this:

# -*- coding: utf-8 -*-
import os

def process_data(data):
    print '%s' % data

directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        print nfile
        with open(nfile, 'r') as infile:
            for line in infile:
                #discard blank lines
                if not line.strip():
                    continue
                else:
                    line = line.strip()
                    # several records on one line, separated by the delimiter
                    if '┌' in line:
                        lines = line.split('┌')
                        for sline in lines:
                            process_data(sline[:-1])
                    # a single record wrapped in double quotes
                    elif line.startswith('"') and line.endswith('"'):
                        process_data(line[1:-2])
                    else:
                        process_data(line[:-1])

This seems to work ok, but I'm not convinced this is the best way to go about it. Does anyone have any suggestions on how I can tidy this up?

Also the delimiter character is not really the one I have but it is the closest I could find that would display here.

python


TrustyTony 888 ex-Moderator

12 Years Ago

Like:

# -*- coding: utf-8 -*-
for d in (u'data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|',
          u'''data1|data2|data3|┌data1|data2|data3|┌data1|data2|data3|┌
data1|data2|data3|
data1|data2|data3|''',
          u'''"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"
"data1|data2|data3|"'''):
    if d.strip():
        print(''.join(c for c in d.replace(u'┌', '\n') if c.isalnum() or  c in ('|','\n')))


TrustyTony 888 ex-Moderator

12 Years Ago

def process_data(data):
    for d in (data):
        print d
    #    if d.strip():
    #        print(''.join(c for c in d if c.isalnum() or  c in ('|','\n')))

To

def process_data(data):
    return (''.join(c for c in data.replace(u'┌', '\n').replace('\n\n','\n') if c.isalnum() or  c in ('|','\n')))

print(process_data(the_line))

It is better to return the value and print it in the caller.


Gribouillis 1,391 Programming Explorer

12 Years Ago

A known solution is

import codecs
file = codecs.open(path, encoding='iso8859-1')

See if it works for you.


Gribouillis 1,391 Programming Explorer

12 Years Ago

\0 characters usually mean that your file is encoded. Try to open it with codecs.open and the appropriate encoding (it could be 'utf8' or 'iso8859-1' or another encoding). You could try this first:

with open(filename, 'rb') as ifh:
    print repr(ifh.read(4))

This may give you the BOM, from which we could perhaps guess the encoding.
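
As a rough illustration of that check, the first bytes can be compared against the BOM constants that the codecs module provides. This is only a sketch: the names guess_encoding_from_bom and sniff_file are made up for this example, and a file without a BOM simply yields None, which says nothing about its real encoding.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
KNOWN_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def guess_encoding_from_bom(head):
    """Return a codec name if the byte string `head` starts with a known BOM, else None."""
    for bom, name in KNOWN_BOMS:
        if head.startswith(bom):
            return name
    return None

def sniff_file(filename):
    """Hypothetical helper: read the first 4 bytes of a file and guess from them."""
    with open(filename, 'rb') as ifh:
        return guess_encoding_from_bom(ifh.read(4))
```

If the result is None, the encoding has to be found some other way, for example by trying a candidate such as 'iso8859-1' and checking whether the accented letters come out right.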


4evrmrepylrning 0 Newbie Poster · Answer 1 · 2012-05-21T11:25:07+00:00

Thanks Tony. I'll have to figure out exactly what that does later (lunch time first :)

Quick question though. Can I then do something like this:

import os

def process_data(data):
    for d in (data):
        print d
    #    if d.strip():
    #        print(''.join(c for c in d if c.isalnum() or  c in ('|','\n')))

directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        with open(nfile, 'r') as infile:
            the_file = infile.read()
            process_data(the_file)

At the moment I'm getting:
d
a
t
a
1
d
a
t
a
2
d
a
t
.
.
.

So obviously I'm doing something wrong.
Cheers

4evrmrepylrning 0 Newbie Poster · Answer 2 · 2012-05-21T13:39:57+00:00

Ok thanks again Tony.

I ended up with this:

# -*- coding: utf-8 -*-

import os
def process_data(data):
    return ''.join(c for c in data.replace(u'┌', '\n') if c.isalnum() or  c in ('|','\n'))


directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        print nfile


        with open(nfile, 'r') as infile:
            for line in infile:
                if line.strip():
                    print(process_data(line.strip()))

4evrmrepylrning 0 Newbie Poster · Answer 3 · 2012-05-21T13:40:12+00:00

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 10182: ordinal not in range(128)
Dernit!

4evrmrepylrning 0 Newbie Poster · Answer 4 · 2012-05-21T14:46:45+00:00

Just to be clear, I have no idea how to fix this. My guess is that Python is expecting the data to be ASCII but it is something else, right?

4evrmrepylrning 0 Newbie Poster · Answer 5 · 2012-05-21T16:02:10+00:00

I found the problem. The records are separated by '┌', then a space, then (sometimes one, sometimes two) NULL characters. I was trying to get rid of the NULL characters using
.replace(u'\0', '', line)

This is what brought up the error. At the moment I'm using this:

# -*- coding: utf-8 -*-
import os
import re

def process_data(data):
    return ''.join(c for c in data.replace('┌', '\n') if c.isalnum() or c in ('|','\n'))

out = open('outddd.txt', 'w')
directory = '.'
absdir = os.path.abspath(directory)
for files in os.listdir(absdir):
    if files.startswith(('Upd', 'UPD')):
        nfile = os.path.join(absdir,files)
        print nfile


        with open(nfile, 'r') as infile:
            for line in infile:
                line = re.sub(r'\0', r'',line)
                if line.strip():
                    out.write(process_data(line.strip()))
out.close()

But I'm losing all my spaces and underscores. Any idea why this is?
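
A likely explanation: the filter in process_data keeps a character only if c.isalnum() is true or it is '|' or '\n'. Spaces and underscores are not alphanumeric, so they are dropped along with the quotes and NULLs. One possible alternative, sketched here under the assumption that only NULL characters and double quotes need to go, is to remove just those characters and keep everything else. The names DELIMITER and UNWANTED and the sample string are invented for this example.

```python
# -*- coding: utf-8 -*-
# (the coding line only matters on Python 2; the u'' prefixes work on 2 and 3.3+)

DELIMITER = u'┌'              # record separator described in this thread
UNWANTED = (u'\x00', u'"')    # assumption: only NULs and double quotes are noise

def process_data(data):
    """Turn record delimiters into newlines and strip only the unwanted characters."""
    text = data.replace(DELIMITER, u'\n')
    for ch in UNWANTED:
        text = text.replace(ch, u'')
    return text

# Made-up sample: two records, a delimiter, a space and two NULs in between.
sample = u'first name|last_name|┌ \x00\x00a b|c_d|'
cleaned = process_data(sample)
```

With this version the space and the underscore in the sample survive, because nothing is filtered by character class any more.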

4evrmrepylrning 0 Newbie Poster · Answer 6 · 2012-05-22T09:32:47+00:00

All I get when I run that is this:
'UDC_'
These characters appear at the beginning of the 'stream', if I understand this correctly. Does this help me?

I'm reading up on encodings at the moment...

4evrmrepylrning 0 Newbie Poster · Answer 7 · 2012-05-22T09:59:02+00:00

sigh
Ok if I do this:

4evrmrepylrning 0 Newbie Poster · Answer 8 · 2012-05-22T10:24:25+00:00

sigh, again

So, as I was saying... If I do this:

text_file = open('example.txt')
text_file.readline()

my output looks like this:

If I look at the table found here: http://en.wikipedia.org/wiki/Byte_order_mark , it seems like these are not proper BOMs?

4evrmrepylrning 0 Newbie Poster · Answer 9 · 2012-05-22T11:13:03+00:00

Is there a way I can do something like this:

        with open(nfile, 'r') as infile:
            for line in infile:
                match = re.search(u'(\x..)',line)
              #  line = re.sub(r'\\x..', r'',line)
              #  line = re.sub(r'\x01', r'',line)
                #if line.strip():
                #    out.write(process_data(line))
                if match != None:
                    print match.group(1)

This works fine:
line = re.sub(r'\x01', r'',line)

But this:
match = re.search(u'(\x..)',line)

gives me an error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 1-3: truncated \xXX escape

I want to look through all the files I have and see how many of these characters exist and what they are.
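
On the SyntaxError: in a non-raw string literal, \x must be followed by exactly two hex digits, so u'(\x..)' fails before the regex module ever sees it. A regular expression has no "any \x escape" wildcard anyway; a character class over a byte range does that job. The sketch below counts control bytes in raw data. It assumes that "these characters" means control bytes other than tab, line feed and carriage return, and the function names are invented for this example.

```python
import re
from collections import Counter

# Bytes pattern: every control byte except tab (\x09), LF (\x0a) and CR (\x0d).
CONTROL_BYTES = re.compile(br'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def count_control_bytes(raw):
    """Return a Counter mapping each control byte found in `raw` to its frequency."""
    return Counter(CONTROL_BYTES.findall(raw))

def count_in_file(filename):
    """Hypothetical helper: count control bytes in one file, read in binary mode."""
    with open(filename, 'rb') as ifh:
        return count_control_bytes(ifh.read())
```

Adding up the counters returned by count_in_file for every matching file in the directory would give the totals across all files.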

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 10 · 2012-05-22T11:31:03+00:00

It doesn't look like a BOM. Did you try to open the file with the codecs module to see if it solves your accented letters issue?

4evrmrepylrning 0 Newbie Poster · Answer 11 · 2012-05-22T13:13:00+00:00

If I do
infile = codecs.open(nfile, encoding='iso8859-1')
I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8421: ordinal not in range(128)

If I try utf8 I get:
UnicodeError: UTF-16 stream does not start with BOM
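
A note on the first error: "can't encode" is raised when text is turned back into bytes, so the read with iso8859-1 probably succeeded and the failure likely came later, when the decoded text was printed or written to a file opened with plain open(), where Python 2 falls back to ASCII. One way around that, sketched here with io.open (available in Python 2.6+ and 3), is to give the output file an explicit encoding as well. The function name copy_as_utf8 and the choice of UTF-8 for the output are assumptions made for this example.

```python
import io

def copy_as_utf8(src, dst, src_encoding='iso8859-1'):
    """Read `src` as text in `src_encoding` and write the same text to `dst` as UTF-8."""
    with io.open(src, 'r', encoding=src_encoding) as infile:
        with io.open(dst, 'w', encoding='utf-8') as outfile:
            for line in infile:
                # `line` is already decoded text; io.open encodes it on write
                outfile.write(line)
```

Any per-line cleaning would go between reading and writing, operating on decoded text rather than raw bytes.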

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 12 · 2012-05-22T15:14:26+00:00

Can't you upload the file somewhere so that we can try to read it?