Greetings everyone!!
First of all, I'm a Python newbie; my apologies in advance if my question is silly, but I really hope someone can help me with this.
As a starter project, I'm reading the iTunes library XML file (iTunes Music Library.xml) and creating a directory structure from the ID3 meta tag information in this format: "/genre/artist/album". The last part, "album", is a symbolic link to the album directory at the original iTunes location. My problem is converting the HTML/XML character references and URL escapes (like &amp; or %20) to plain text: I get an error when the string contains sequences like "%CC%81" (the accent in é) or "%CC%88" (the dots in ï). Everything else works so far. The error looks like this:
--------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "./metadata.py", line 153, in <module>
    artist_dir = "%s/%s" % (media_dir, mn)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
----------------------------------------------------------------------------------------
After a bit more digging, I think I know why I get this error, but I'm not expert enough to figure out a solution. Python uses ASCII by default when it converts a byte string to unicode, so any byte with a value above 127 (ASCII only defines values 0 to 127) in the input data raises UnicodeDecodeError, because the ASCII codec can't decode it. The error is easy to reproduce:
>>> unicode('abcdef' + chr(127))
u'abcdef\x7f'
>>> unicode('abcdef' + chr(128))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 6: ordinal not in range(128)
>>>
In fact, any non-ASCII character in a plain (byte) string triggers this error:
>>> unicode('abc©')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3: ordinal not in range(128)
>>> unicode(u'abc©')
u'abc\xa9'
>>>
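The two cases above can be sketched with explicit decoding instead of the implicit ASCII conversion. This is only a minimal sketch, assuming the bytes are UTF-8 (which matches the 0xc2 byte reported for © above); the variable names are purely illustrative. It uses bytes literals so that it behaves the same on Python 2.6+ and Python 3.

```python
# Bytes for "abc©" as a UTF-8 terminal would deliver them (assumption: UTF-8).
raw = b'abc\xc2\xa9'

# Decoding with the ASCII codec fails on the first byte above 127,
# which is what the implicit conversion in unicode('abc©') attempts.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the actual encoding yields a 4-character unicode string.
text = raw.decode('utf-8')

print(ascii_ok)   # False
print(len(raw))   # 5 bytes
print(len(text))  # 4 characters
```

The point is that the byte string itself is fine; it is the choice of codec during the conversion that decides whether it can be decoded.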
In my case, the error was caused by the string "Beyoncé", which the ASCII codec can't handle. I was really confused at first and was banging my head against the screen trying to explain this:
[santanu@baba test]$ ls -l
total 8
drwxrwxr-x 2 santanu santanu 4096 2010-02-03 11:20 Beyoncé
drwxrwxr-x 2 santanu santanu 4096 2010-02-03 11:18 Beyoncé
That is, two directories with apparently the same name coexist in the same place. Then I found the difference:
>>> import os
>>> os.listdir('/home/santanu/Scripts/test')
['Beyonc\xc3\xa9', 'Beyonce\xcc\x81']
>>> unicode('Beyoncé')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
>>> unicode(u'Beyoncé')
u'Beyonc\xe9'
So the two directories actually have two different names. The first one, at 11:20, was created by the Python script and the second one, at 11:18, was created by hand using "mkdir". Now if I use the Python script to create a symbolic link to the manually created directory [using os.system("ln -sf ... ...")], it ends up as an invalid link. To see what I mean, you need to create the directory by hand before running the script below: mkdir -p "/tmp/testSource/Beyoncé/B'Day [Deluxe Edition]"
This is obviously only a snippet of the original script, but it should be enough to demonstrate the problem.
#!/usr/bin/env python
# -*- coding: ISO-8859-15 -*-
import os, re, sys, string
import urllib, htmlentitydefs

xmlLoc = "<string>file://localhost/Volumes/DataCenter/nMedia/mMusic/iTunes/iTunes%20Music/Beyonce%CC%81/B'Day%20%5BDeluxe%20Edition%5D/Amor%20Gitano.m4a</string>"

def xmlRef2txt(text):
    # Replace HTML/XML character references and named entities with unicode characters.
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

def toTxt1(xmlString):
    # Pick the artist and album components out of the location URL.
    String = xmlString.split("/")
    iX = xmlRef2txt(urllib.unquote(String[-4]))
    iY = xmlRef2txt(urllib.unquote(String[-3]))
    return (iX, iY)

(mn,op) = toTxt1(xmlLoc)
#print "%s\t%s" % (op,mn)
media_dir = "/tmp/testTarget"
artist_dir = "%s/%s" % (media_dir, mn)
album_dir = artist_dir + "/%s" % op
print "mkdir -p " + "\"" + album_dir + "\""
os.system("mkdir -p " + "\"" + artist_dir + "\"")
os.system("ln -sf \"/tmp/testSource/%s/%s\" " % (mn,op) + "\"" + album_dir + "\"")
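The two directory names that os.listdir() returned earlier look like the composed and decomposed Unicode forms of the same text: 'Beyonc\xc3\xa9' is "é" as a single code point (U+00E9), while 'Beyonce\xcc\x81' is a plain "e" followed by a combining acute accent (U+0301), which is also what %CC%81 in the iTunes location decodes to. Below is a minimal sketch of comparing and normalizing them, assuming both names are UTF-8 and using the standard unicodedata module. The helper name decode_component is made up for illustration and is not part of the script above. It is written to run on both Python 2 and Python 3.

```python
import unicodedata

try:
    from urllib import unquote          # Python 2
except ImportError:
    from urllib.parse import unquote    # Python 3


def decode_component(quoted, form='NFC'):
    """Percent-decode one path component and normalize it (hypothetical helper)."""
    raw = unquote(quoted)
    # Python 2's unquote returns a byte string; decode it as UTF-8 (assumption).
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')
    return unicodedata.normalize(form, raw)


composed = b'Beyonc\xc3\xa9'.decode('utf-8')      # "é" as the single code point U+00E9
decomposed = b'Beyonce\xcc\x81'.decode('utf-8')   # "e" followed by combining U+0301

print(composed == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
print(decode_component('Beyonce%CC%81') == composed)         # True
print(decode_component("B'Day%20%5BDeluxe%20Edition%5D"))
```

Whichever form is chosen, it would presumably have to be applied consistently to both the link target and the link name; which form the file system at each end expects is something to check on the machine in question.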
Does anyone have a clue what I'm doing wrong or missing?
Sorry for the long post; I was just trying to provide as much information as I can to make it easy for everyone (and myself). Thanks in advance for your help. Cheers!