unicode riddle

Question

johndoe444 1 Posting Whiz in Training

15 Years Ago

Hi,
I have this code:

list = ['12 angry men', 'Rash &# xf4;mon']

def func(list):
... 	for e in list:
... 		e = re.sub(' ?&# x([0-9a-f]*);',r'\x\1',e)
... 		print type(e)
... 		e = unicode(e,'iso-8859-1')
... 		print type(e)
... 		print e

I get this output:

12 angry men
Rash\xf4mon

whereas I wanted the output to be:

12 angry men
Rashômon (ô is an o with a circumflex accent above it)

Can someone help me?

PS: I tried this in the Python interactive shell (of PythonWin), Python version 2.6.

All 7 Replies

vegaseat 1,735 DaniWeb's Hypocrite

15 Years Ago

I used IDLE and Python 2.5; for some odd reason IDLE handles unicode best. It looks like the regex works fine, but the resulting string escapes the escape ( '\' turns into '\\' ) ...

# -*- coding: cp1252 -*-

# fiddling with unicode

import re

def func(mylist):
    for e in mylist:
        e = re.sub(' ?&# x([0-9a-f]*);',r'\x\1',e)
        e = unicode(e,'latin-1')
        print e        # --> Rash\xf4mon
        # this is the reason for the problem ...
        print repr(e)  #  --> u'Rash\\xf4mon'

mylist = ['12 angry men', 'Rash &# xf4;mon']
func(mylist)

# however ...
print 'Rash\xf4mon'  # --> Rashômon
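If the literal backslash sequence has already ended up in the string, one possible repair is to decode it afterwards with the standard `unicode_escape` codec. The sketch below is an illustration, not something proposed in the thread; the variable names are made up, and it assumes the text contains no other backslash sequences, since the codec would interpret those as well.

```python
# What the raw-string substitution leaves behind: backslash, 'x', 'f', '4'.
literal = b'Rash\\xf4mon'

# unicode_escape interprets backslash escapes found in the data itself.
decoded = literal.decode('unicode_escape')

print(len(literal))               # 11 (the escape is still four separate characters)
print(len(decoded))               # 8  (the escape collapsed into a single character)
print(decoded == u'Rash\xf4mon')  # True
```

The same lines should run under both Python 2 and Python 3, because `b'...'` and `u'...'` literals are accepted by both.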

The_Kernel 33 Light Poster

15 Years Ago

The problem is that you're creating a raw string, which isn't doing what you think it will. Rather than letting you build a hex value attached to the '\x' escape, it creates a string that contains exactly the characters '\' + 'x' + 'f' + '4'.
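The difference between a raw literal and a normal one can be checked directly. This is a small sketch with illustrative variable names; it only compares the two literals and does not touch the regex.

```python
raw = r'\xf4'     # raw literal: four characters, backslash + 'x' + 'f' + '4'
cooked = '\xf4'   # normal literal: the parser turns the escape into one character

print(len(raw))       # 4
print(len(cooked))    # 1
print(list(raw))      # the four separate characters
print(raw == cooked)  # False
```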

Here's a version that will do what you want (I hope :) ):

import re 

def convert_to_unicode(match_obj):
    raw_chr_val = match_obj.group(1)
    return unichr(int(raw_chr_val, 16))

def func(mylist):
    for e in mylist:
        e = re.sub(' ?&# x([0-9a-f]*);', convert_to_unicode, e) 
        print e

mylist = ['12 angry men', 'Rash &# xf4;mon']
func(mylist)
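For readers on Python 3, where `unichr` no longer exists and `chr` covers every code point, the same callback idea might look like the sketch below. The function name `decode_hex_entities` is hypothetical, and the pattern uses `+` instead of the thread's `*` so that an empty match never reaches `int()`; both are assumptions of this example rather than part of the posted solution.

```python
import re

# Pattern from the thread: optional space, '&# x', hex digits, ';'
# ('+' instead of '*' is a change made for this sketch).
ENTITY = re.compile(' ?&# x([0-9a-f]+);')

def decode_hex_entities(text):
    """Replace each hex entity with the character it names."""
    return ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)

titles = ['12 angry men', 'Rash &# xf4;mon']
decoded = [decode_hex_entities(t) for t in titles]

print(decoded[0])                   # 12 angry men
print(decoded[1] == 'Rash\xf4mon')  # True
```

Passing a function as the replacement sidesteps escape handling entirely, because the character is built from the matched hex digits at run time.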

vegaseat commented: nice solution +12

jlm699 320 Veteran Poster · Answer 1 · 2009-07-01T07:56:45+00:00

What is your question exactly?

Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print 'Rash\xf4mon'
Rashômon
>>>

The_Kernel 33 Light Poster · Answer 2 · 2009-07-01T09:40:04+00:00

I think your problem is that in the call to re.sub you're replacing the match with a raw string. Try removing the 'r' prefacing the replacement string.
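One caveat worth checking before trying this: without the r prefix, '\x\1' is not a valid string literal, because \x must be followed by two hex digits. The sketch below passes that literal to `compile()` so the failure can be caught instead of aborting the script; the variable names are illustrative only.

```python
# Source text for the statement  s = '\x\1'  (held in a raw string so it can be compiled later).
source = r"s = '\x\1'"

try:
    compile(source, '<demo>', 'exec')
    rejected = False
except (SyntaxError, ValueError) as exc:
    # Python 3 raises SyntaxError; Python 2 is expected to raise ValueError.
    rejected = True
    print(type(exc).__name__)

print(rejected)  # True
```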

johndoe444 1 Posting Whiz in Training · Answer 3 · 2009-07-01T09:49:22+00:00

What is your question exactly?

Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print 'Rash\xf4mon'
Rashômon
>>>

My question was how to fix the behavior. Here is a verbatim copy of my PythonWin interactive shell session:

>>> list = ['12 angry men', 'Rash &# xf4;mon']
>>> def func(list):
... 	for e in list:
... 		e = re.sub(' ?&# x([0-9a-f]*);',r'\x\1',e)
>>> def func(list):
... 	for e in list:
... 		e = re.sub(' ?&# x([0-9a-f]*);',r'\x\1',e)
... 		print type(e)
... 		e = unicode(e,'iso-8859-1')
... 		print type(e)
... 		print e
... 		
... 		
>>> func(list)
<type 'str'>
<type 'unicode'>
12 angry men
<type 'str'>
<type 'unicode'>
Rash\xf4mon
>>>

This is my problem: the behavior that is not supposed to happen.

Lardmeister 461 Posting Virtuoso · Answer 4 · 2009-07-01T22:13:51+00:00

I ran this from the IDLE editor:

s = u'Rash\u00f4mon'
print s  # --> Rashômon 

a = u'\u00bfC\u00f3mo es usted?'
print a  # --> ¿Cómo es usted?

woooee 814 Nearly a Posting Maven · Answer 5 · 2009-07-01T22:47:39+00:00

And the following works fine for me, so it may have something to do with the regex as stated above. Try running the code with the "re.sub" line commented out and see if that makes a difference.

s = u'La Pe\xf1a'         
print s.encode('latin-1') 

x = u"Rash\xf4mon"
print x.encode('iso-8859-1') 

##-----prints-----
La Peña                                                                 
Rashômon
