Hi,
I have this code:

list =

def func(list):
    for e in list:
        e = re.sub(' ?&# x([0-9a-f]*);',r'\x\1',e)
        print type(e)
        e = unicode(e,'iso-8859-1')
        print type(e)
        print e

I get this output:

12 angry men
Rash\xf4mon

whereas I wanted the output to be:

12 angry men
Rashômon (ô is an o with a circumflex accent above it)

Can someone help me?

PS: I tried this in the Python interactive shell of PythonWin, Python version 2.6.

What is your question exactly?

Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print 'Rash\xf4mon'
Rashômon
>>>

I think your problem is that in the call to re.sub you're replacing the match with a raw string. Try removing the 'r' prefacing the replacement string.

My question was how to fix the behavior. Here is a verbatim copy of my PythonWin interactive shell session:

>>> list = ['12 angry men', 'Rash &# xf4;mon']
>>> def func(list):
... 	for e in list:
... 		e = re.sub(' ?&# x([0-9a-f]*);',r'\x\1',e)
... 		print type(e)
... 		e = unicode(e,'iso-8859-1')
... 		print type(e)
... 		print e
... 		
... 		
>>> func(list)
<type 'str'>
<type 'unicode'>
12 angry men
<type 'str'>
<type 'unicode'>
Rash\xf4mon
>>>

That is my problem: the last line prints the escape sequence literally, which is not the behavior I expected.

I ran this from the IDLE editor:

s = u'Rash\u00f4mon'
print s  # --> Rashômon 

a = u'\u00bfC\u00f3mo es usted?'
print a  # --> ¿Cómo es usted?

The following also works fine for me, so it may have something to do with the regex, as stated above. Try running the code with the "re.sub" line commented out and see if that makes a difference.

s = u'La Pe\xf1a'         
print s.encode('latin-1') 

x = u"Rash\xf4mon"
print x.encode('iso-8859-1') 

##-----prints-----
La Peña                                                                 
Rashômon
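
For reference, encode and decode are inverses here: encode turns a unicode string into Latin-1 bytes, and decoding those bytes gives the original string back. A minimal sketch, with variable names that are purely illustrative and not taken from the posts above:

```python
# Illustrative names only; none of these come from the thread.
text = u'Rash\xf4mon'                      # unicode string, 8 characters

# Latin-1 maps each of these characters to exactly one byte.
latin1_bytes = text.encode('iso-8859-1')

# Decoding with the same codec recovers the original unicode string.
round_trip = latin1_bytes.decode('iso-8859-1')

print(round_trip == text)  # True
```

This only helps when the string really contains the character U+00F4; it does nothing for a string that holds a backslash followed by the letters xf4.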

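I used IDLE with Python 2.5; for some odd reason IDLE handles unicode best. It looks like the regex matches fine, but the resulting string contains a literal backslash followed by 'xf4' instead of the single character, which is why repr() shows the backslash doubled ( '\' shows up as '\\' ) ...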

# -*- coding: cp1252 -*-

# fiddling with unicode

import re

def func(mylist):
    for e in mylist:
        e = re.sub(' ?&# x([0-9a-f]*);',r'\x\1',e)
        e = unicode(e,'latin-1')
        print e        # --> Rash\xf4mon
        # this is the reason for the problem ...
        print repr(e)  #  --> u'Rash\\xf4mon'

mylist = ['12 angry men', 'Rash &# xf4;mon']
func(mylist)

# however ...
print 'Rash\xf4mon'  # --> Rashômon

The problem is that the replacement is a raw string, which isn't doing what you think it will. A '\x' escape is only turned into a character when Python parses a string literal, so the substitution can't attach the matched hex digits to it at run time. Instead you get a string that contains the characters exactly as '\' + 'x' + 'f' + '4'.
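
To make the difference concrete, here is a small sketch comparing a literal whose escape is resolved by the parser with a string pieced together from the individual characters. The variable names are made up for illustration:

```python
# Hypothetical names, for illustration only.
parsed = 'Rash\xf4mon'                        # \xf4 becomes one character when the literal is parsed
pieced = 'Rash' + '\\' + 'x' + 'f4' + 'mon'   # backslash, x, f, 4 as four ordinary characters

print(len(parsed))                # 8
print(len(pieced))                # 11
print(pieced == r'Rash\xf4mon')   # True
```

The second string is what the substitution produces, so printing it shows the escape sequence rather than the accented letter.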

Here's a version that will do what you want (I hope :) ):

import re 

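def convert_to_unicode(match_obj):
    # group(1) holds the hex digits captured by the pattern, e.g. 'f4'
    raw_chr_val = match_obj.group(1)
    # turn the hex digits into the matching unicode character
    return unichr(int(raw_chr_val, 16))

def func(mylist):
    for e in mylist:
        # pass the function itself as the replacement; re.sub calls it for each match
        e = re.sub(' ?&# x([0-9a-f]*);', convert_to_unicode, e)
        print e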

mylist = ['12 angry men', 'Rash &# xf4;mon']
func(mylist)
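
The same callback idea should carry over to Python 3, where unichr no longer exists and chr covers the full Unicode range. The following is only a sketch: the helper name decode_hex_entities and the ENTITY_RE constant are hypothetical, accepting uppercase hex digits is an added assumption, and the pattern uses + rather than * so that an empty "&# x;" is left untouched instead of reaching int().

```python
import re

try:
    unichr            # exists on Python 2
except NameError:
    unichr = chr      # Python 3: chr already returns a unicode character

# Hypothetical pattern constant; '+' requires at least one hex digit.
# Allowing A-F as well as a-f is an assumption, not part of the original pattern.
ENTITY_RE = re.compile(r' ?&# x([0-9a-fA-F]+);')

def decode_hex_entities(text):
    """Replace each '&# x..;' reference with the character it names."""
    return ENTITY_RE.sub(lambda m: unichr(int(m.group(1), 16)), text)

titles = ['12 angry men', 'Rash &# xf4;mon']
decoded = [decode_hex_entities(t) for t in titles]

for title in decoded:
    print(repr(title))
```

Printing the repr sidesteps console encoding differences; printing the strings directly shows the accented character only if the terminal can display it.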