Hi,
I have been searching high and low on Google, and I cannot seem to figure out how to convert Unicode to integers. Take the Unicode code point U+3001 (u'\u3001'), for example. I know this is supposed to be the ideographic comma, and that in UTF-8 its hexadecimal representation is 0xE38081. I know that if I convert 0xE38081 to an integer, it is supposed to be 14909569. 14909569 is the answer I want, but I cannot seem to figure out how to do this in Python.

>>> unichr(0x3001)
u'\u3001'
>>> str(unichr(0x3001))
'\xe3\x80\x81'
>>> int('\xe3\x80\x81',16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 16: '\xe3\x80\x81'
>>> int('0xe38081',16)
14909569
>>>

How come int() won't take the syntax \xE3\x80\x81? How can I strip or replace the \x? The string functions strip() and replace() do not work either. Is there another method that can deal with a Unicode code point?
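
A small sketch of what seems to be going on here: '\xe3\x80\x81' is only how the interpreter displays three raw bytes, so the string holds no backslash or x characters for strip() or replace() to find, and int() sees no hex digits at all. The variable names below are just for illustration, and the b'' literal is used so the snippet should run on Python 2.6+ as well as Python 3.

```python
data = b'\xe3\x80\x81'  # the UTF-8 encoding of U+3001

# The \x escapes are display notation only; the value is three bytes long.
length = len(data)

# No literal backslash is stored, so there is nothing to strip or replace.
has_backslash = b'\\' in data

# bytearray yields plain integers on both Python 2 and Python 3.
byte_values = list(bytearray(data))

print(length)
print(has_backslash)
print(byte_values)
```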

Here is my roundabout attempt. I got there in the end, though maybe not in the most elegant way, as I could not get Python to give me the ord of the individual bytes that make up the UTF-8 letter. So I ended up manipulating the repr of that letter with string operations.

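a = unichr(0x3001)      # the Unicode character U+3001
b = a.encode('utf8')    # its UTF-8 bytes
print b
c = repr(b)             # the escaped text, quotes and backslashes included
print c
# delete every backslash, x and quote character, then prefix with 0x
d = r"0x" + c.translate(None, r"\x'")
print d, int(d, 16)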
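
As a possible alternative to going through repr(), the bytes can be folded into one integer directly. This is only a sketch: utf8_as_int is a made-up helper name, and wrapping the encoded string in bytearray is an assumption chosen so the loop sees integers on both Python 2 and Python 3. It also sidesteps a weakness of the repr approach, which gives wrong results for printable bytes (a plain ASCII letter has no \x escape in its repr, and a literal x would be deleted).

```python
def utf8_as_int(char):
    """Return the UTF-8 encoding of char read as one big-endian integer.

    Hypothetical helper for illustration, not part of any library.
    """
    value = 0
    # bytearray iterates as integers on Python 2 and Python 3 alike
    for byte in bytearray(char.encode('utf-8')):
        value = (value << 8) | byte  # shift in one byte at a time
    return value


ideographic_comma = u'\u3001'
result = utf8_as_int(ideographic_comma)
print(result)
```

For u'\u3001' this should give the 14909569 from the original question, since 0xE3, 0x80 and 0x81 are shifted in one after another.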

Thanks for the help, tonyjv!
