OK, so I'm sort of a Python newbie, and a Unicode newbie too.
From the webpage http://www.mangaupdates.com/series.html?id=1580
I want to scrape the following lines (HTML):
Associated Names</b></div>
<div class="sContent">ロザリオとバンパイア<br>吸血鬼女友<br>로자리오와 뱀파이어<br>Rosario + Vampire<br>Rosario+Vampire<br>
However, when I scrape it, all the Asian characters come back like this: &# 12525 ;
(I put spaces in there because this site auto-converts the character to UTF-8.)
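To illustrate what I mean: the reference itself is plain ASCII, so decoding the page as UTF-8 leaves it untouched; only converting the number with unichr() gives the real character. A quick interpreter check (as far as I understand it):

>>> unicode('&#12525;', 'utf-8')  # the reference is plain ASCII; decoding changes nothing
u'&#12525;'
>>> print unichr(12525)           # the code point it stands for
ロ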
Here is my test method:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import urllib
import urllib2
import socket
import re
from cookielib import LWPCookieJar

class MangaUpdates(object):

    def __init__(self):
        # Globalized header string sent to servers
        self._header = {
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
            'Content-Type': 'application/x-www-form-urlencoded'}
        # set up the cookie handler
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(LWPCookieJar()))
        urllib2.install_opener(opener)
        socket.setdefaulttimeout(40)

    def getAtlTitleNames(self, series_id):
        """Login not required."""
        base_url = 'http://www.mangaupdates.com/series.html'
        query_string = urllib.urlencode({'id': series_id})
        request = urllib2.Request(base_url, query_string, self._header)
        try:
            response = unicode(urllib2.urlopen(request).read(), "utf-8", 'replace')
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print 'Failed to reach mangaupdates.com'
                print 'Reason: ', e.reason
            elif hasattr(e, 'code'):
                print 'The server couldn\'t fulfill the request.'
                print 'Error code: ', e.code
            return False
        test = u'旅游'
        print test  # This works
        regx1 = re.compile(r'Associated Names.*\n.*')
        print unicode(regx1.search(response).group())

d = MangaUpdates()
print d.getAtlTitleNames('1580')
I could manually convert each character via unichr(int), but is there a way to get the whole thing back as unicode?
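For reference, here is a sketch of the kind of blanket conversion I mean, generalized with re.sub() and unichr(); it assumes every reference is a decimal &#NNNN; one, so hex forms (&#xNNNN;) and named entities (&amp; etc.) would pass through untouched:

# -*- coding: utf-8 -*-
import re

def decode_decimal_refs(text):
    # Replace each decimal numeric character reference (&#NNNN;) with
    # the character it names. Sketch only: hex references and named
    # entities are left as-is.
    return re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), text)

print decode_decimal_refs(u'&#12525;&#12470;&#12522;&#12458;')  # prints ロザリオ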
You'll notice that urllib2.urlopen(request).read() returns a string, which is the origin of this problem.