character encoding in python

Question

MacUsers 0 Newbie Poster

14 Years Ago

Greetings everyone!!
First of all, I'm a Python newbie; my apologies in advance if my question is silly, but I really hope someone can help me with this.
I set myself a starter project: read the iTunes library file (iTunes Music Library.xml) and create a directory structure from the ID3 tag information in the format "/genre/artist/album". The last part, "album", is a symbolic link to the album directory at the original iTunes location. My problem is converting the HTML/XML character references and URL escapes (like &amp; or %20) to plain text: I get an error when the string contains sequences like "%CC%81" (as in é) or "%CC%88" (as in ï). Everything else works so far. The error looks like this:

--------------------------------------------------------------------------------------
Traceback (most recent call last):
 File "./metadata.py", line 153, in <module>
   artist_dir = "%s/%s" % (media_dir, mn)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
----------------------------------------------------------------------------------------

After a bit more digging, I think I know why I get this error, but I'm not expert enough to work out a solution. Python uses ASCII by default, and ASCII only defines values from 0 to 127, so any byte above 127 in the input data raises UnicodeDecodeError because the ASCII codec cannot decode it. The error is easy to reproduce:

>>> unicode('abcdef' + chr(127))
u'abcdef\x7f'
>>> unicode('abcdef' + chr(128))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 6: ordinal not in range(128)
>>>

In fact, any accented or otherwise non-ASCII character triggers this error:

>>> unicode('abc©')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3: ordinal not in range(128)
>>> unicode(u'abc©')
u'abc\xa9'
>>>
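For readers on Python 3, the same failure can be shown without the implicit ASCII default: bytes have to be decoded with a named codec. The following is only a minimal sketch, assuming Python 3 and assuming the input bytes are UTF-8; the variable names are illustrative.

```python
# Python 3 sketch: bytes must be decoded with an explicitly named codec.
raw = b"abc\xc2\xa9"  # assumed to be the UTF-8 bytes for "abc©"

ascii_error = None
try:
    raw.decode("ascii")  # mirrors what Python 2's unicode('abc©') does implicitly
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

text = raw.decode("utf-8")  # naming the real encoding succeeds

print(ascii_error)
print(repr(text))
```

The point of the sketch is that the error comes from the choice of codec, not from the data itself.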

In my case, the error was caused by the string "Beyoncé", which can't be handled by the ASCII encoding. I was really confused at first and was banging my head against the screen trying to explain this:

[santanu@baba test]$ ls -l
total 8
drwxrwxr-x 2 santanu santanu 4096 2010-02-03 11:20 Beyoncé
drwxrwxr-x 2 santanu santanu 4096 2010-02-03 11:18 Beyoncé

i.e. two directories with apparently the same name coexist in the same place. Then I found the difference:

>>> import os
>>> os.listdir('/home/santanu/Scripts/test')
['Beyonc\xc3\xa9', 'Beyonce\xcc\x81']
>>> unicode('Beyoncé')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
>>> unicode(u'Beyoncé')
u'Beyonc\xe9'
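The two byte strings returned by os.listdir() look like the two Unicode normalization forms of the same name: a precomposed "é" (U+00E9) versus a plain "e" followed by a combining acute accent (U+0301). A minimal sketch, assuming Python 3 and UTF-8 file names, of how the standard unicodedata module can show this:

```python
import unicodedata

# The two names from the os.listdir() output, decoded as UTF-8 (an assumption).
composed = b"Beyonc\xc3\xa9".decode("utf-8")    # "é" as one code point, U+00E9
decomposed = b"Beyonce\xcc\x81".decode("utf-8")  # "e" + combining acute, U+0301

# They render the same but are different strings of different lengths.
same_raw = composed == decomposed

# After normalizing both to the same form (NFC here), they compare equal.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

print(len(composed), len(decomposed), same_raw, same_nfc)
```

Most Linux filesystems store names as raw bytes, so both forms can exist side by side, which would match the "ls -l" listing.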

So the two directories actually have two different names. The first one (11:20) was created by the Python script and the second one (11:18) was created by hand using "mkdir". Now if I use the Python script to create a symbolic link to the manually created directory [using os.system("ln -sf ... ...")], it ends up as an invalid link. To see what I mean, create the directory by hand before running the script below: mkdir -p "/tmp/testSource/Beyoncé/B'Day [Deluxe Edition]". This is only a snippet of the original script, but it should be enough to demonstrate the problem.

#!/usr/bin/env python
# -*- coding: ISO-8859-15 -*-

import os, re, sys, string;
import urllib, htmlentitydefs;

xmlLoc = "<string>file://localhost/Volumes/DataCenter/nMedia/mMusic/iTunes/iTunes%20Music/Beyonce%CC%81/B'Day%20%5BDeluxe%20Edition%5D/Amor%20Gitano.m4a</string>"

def xmlRef2txt(text):
   def fixup(m):
       text = m.group(0)
       if text[:2] == "&#":
           # character reference
           try:
               if text[:3] == "&#x":
                   return unichr(int(text[3:-1], 16))
               else:
                   return unichr(int(text[2:-1]))
           except ValueError:
               pass
       else:
           # named entity
           try:
               text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
           except KeyError:
               pass
       return text # leave as is
   return re.sub("&#?\w+;", fixup, text)

def toTxt1(xmlString):
   String = xmlString.split("/")
   iX = xmlRef2txt(urllib.unquote(String[-4]))
   iY = xmlRef2txt(urllib.unquote(String[-3]))
   return (iX, iY)

(mn,op) = toTxt1(xmlLoc)

#print "%s\t%s" % (op,mn)
media_dir = "/tmp/testTarget"
artist_dir = "%s/%s" % (media_dir, mn)
album_dir = artist_dir + "/%s" % op
print "mkdir -p " + "\"" + album_dir + "\""
os.system("mkdir -p " + "\"" + artist_dir + "\"")
os.system("ln -sf \"/tmp/testSource/%s/%s\" " % (mn,op) + "\"" + album_dir + "\"")
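For comparison, here is a rough Python 3 sketch of the same conversion step. It is not the original script: path_component_to_text is a hypothetical helper, html.unescape stands in for the hand-written xmlRef2txt, the XML layer is undone before the URL escapes (the reverse of the order in the script), and normalizing to NFC is an assumption that only helps if the on-disk names are also NFC. The `<string>` wrapper is dropped here, so the split indexes differ from the script.

```python
import html
import unicodedata
from urllib.parse import unquote


def path_component_to_text(component):
    """Decode one path component from the iTunes XML (hypothetical helper)."""
    text = html.unescape(component)  # XML/HTML references such as &#38; or &amp;
    text = unquote(text)             # %XX escapes, decoded as UTF-8 by default
    # Assumption: we want the precomposed (NFC) form of accented characters.
    return unicodedata.normalize("NFC", text)


location = ("file://localhost/Volumes/DataCenter/nMedia/mMusic/iTunes/"
            "iTunes%20Music/Beyonce%CC%81/B'Day%20%5BDeluxe%20Edition%5D/"
            "Amor%20Gitano.m4a")
parts = location.split("/")
artist = path_component_to_text(parts[-3])
album = path_component_to_text(parts[-2])

print(repr(artist), repr(album))
```

If the source directories were created by a tool that writes decomposed names, normalizing to "NFD" instead would be the matching choice.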

Does anyone have a clue what I'm doing wrong or missing?
Sorry for the long post; I was just trying to provide as much information as I can to make it easier for everyone (and myself). Thanks in advance for your help. Cheers!!!

character processing python xml

woooee 814 Nearly a Posting Maven

14 Years Ago

You have to declare the encoding somewhere. The following works for me on python 3.x. Note that printing to screen is a different matter as some characters will not print depending on the encoding and font for the system.

#!/usr/bin/python3
# -*- coding: latin-1 -*-

y = 'abcdef' + chr(128)
print(y)


woooee 814 Nearly a Posting Maven

14 Years Ago

My guess would be that the system has a different encoding, so when you drop to system level to execute this
os.system("ln -s <src> <dst>")
you get an error. os.symlink stays within the Python program, so it understands the encoding.
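A minimal Python 3 sketch of the os.symlink approach follows. It uses temporary directories instead of /tmp/testSource and /tmp/testTarget, and link_album is a hypothetical helper name. The name is passed to the operating system exactly as Python holds it, with no shell quoting or re-encoding in between.

```python
import os
import tempfile
import unicodedata


def link_album(source_dir, link_path):
    """Create link_path as a symlink to source_dir without using a shell."""
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    if os.path.islink(link_path):
        os.remove(link_path)  # rough equivalent of the -f in "ln -sf"
    os.symlink(source_dir, link_path)


base = tempfile.mkdtemp()
# Use the decomposed spelling on purpose, as in the iTunes path.
artist = unicodedata.normalize("NFD", "Beyonc\u00e9")
album = "B'Day [Deluxe Edition]"

source = os.path.join(base, "testSource", artist, album)
os.makedirs(source)

link = os.path.join(base, "testTarget", artist, album)
link_album(source, link)

link_is_valid = os.path.exists(os.readlink(link))
print(link_is_valid)
```

This also sidesteps the quoting problems that the apostrophe and brackets in the album name could cause on a shell command line.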



MacUsers 0 Newbie Poster · Answer 1 · 2010-02-05T22:07:40+00:00

You have to declare the encoding somewhere.

Isn't urllib.unquote() handling the encoding? As I said, it works just fine for everything except strings like %CC%81.

The following works for me on python 3.x. Note that printing to screen is a different matter as some characters will not print depending on the encoding and font for the system.
#!/usr/bin/python3
# -*- coding: latin-1 -*-

y = 'abcdef' + chr(128)
print(y)

That doesn't really work on 2.5 here. If I change the string to y = 'oncé' + chr(128), it prints oncé? My problem is not with printing but with creating directories and symbolic links. If I create both the directory (with special characters in the name) and the link using the script, it works, but it doesn't work if the source was created manually. Using the above script, I can create the link:

[santanu@baba python]$ ll /tmp/testTarget/Beyoncé/
total 0
lrwxrwxrwx 1 santanu santanu 48 2010-02-05 15:42 B'Day [Deluxe Edition] -> /tmp/testSource/Beyoncé/B'Day [Deluxe Edition]

but it is not valid:

>>> import os
>>> os.access("/tmp/testSource/Beyoncé/B'Day [Deluxe Edition]", os.F_OK)
True
>>> os.access(os.readlink("/tmp/testTarget/Beyoncé/B'Day [Deluxe Edition]"), os.F_OK)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory: "/tmp/testTarget/Beyonc\xc3\xa9/B'Day [Deluxe Edition]"

Cheers!!!

MacUsers 0 Newbie Poster · Answer 2 · 2010-02-08T07:42:31+00:00

It turned out that using os.symlink() to create the link works just fine for everything. But I'm still interested to know why os.system("ln -s <src> <dst>") doesn't create a valid link. Has anyone else got a clue? Cheers!!!

Pupo 0 Newbie Poster · Answer 3 · 2010-02-08T20:15:44+00:00

Hi.
Why not use UTF-8?

# -*- coding: utf-8 -*-

import os

targ = u"/tmp/testTarget/Beyoncé/B'Day [Deluxe Edition]"
os.access(targ.encode('utf8'), os.F_OK)
os.access(os.readlink(targ.encode('utf8')), os.F_OK)

felix@theway:~$ python prueba.py
Traceback (most recent call last):
  File "prueba.py", line 7, in <module>
    os.access(os.readlink(targ.encode('utf8')), os.F_OK)
OSError: [Errno 2] No such file or directory: "/tmp/testTarget/Beyonc\xc3\xa9/B'Day [Deluxe Edition]"

The symlink does not exist, but it does not complain about the encoding.
Hope it works for you.
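One possible reason a path typed with "é" is reported as missing even though "ls" shows it is that the typed name and the on-disk name use different normalization forms. A hedged Python 3 sketch of a lookup that tolerates this is below; find_entry is a hypothetical helper, and comparing under NFC is an assumption about what should count as "the same name".

```python
import os
import tempfile
import unicodedata


def find_entry(parent, wanted):
    """Return the on-disk name in parent that equals wanted after NFC
    normalization, or None if there is no such entry (hypothetical helper)."""
    target = unicodedata.normalize("NFC", wanted)
    for name in os.listdir(parent):
        if unicodedata.normalize("NFC", name) == target:
            return name
    return None


parent = tempfile.mkdtemp()
# Create the directory with the decomposed spelling ("e" + U+0301).
on_disk = unicodedata.normalize("NFD", "Beyonc\u00e9")
os.mkdir(os.path.join(parent, on_disk))

# Look it up with the precomposed spelling, as typed at a keyboard.
found = find_entry(parent, "Beyonc\u00e9")
print(repr(found))
```

Using the returned name to build the path means os.readlink() and os.access() receive the exact bytes stored on disk.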