OK, so I'm sort of a Python newbie, and a Unicode newbie too.
From the webpage http://www.mangaupdates.com/series.html?id=1580
I want to scrape the following lines (HTML):
Associated Names</b></div>
<div class="sContent">ロザリオとバンパイア<br>吸血鬼女友<br>로자리오와 뱀파이어<br>Rosario + Vampire<br>Rosario+Vampire<br>
However, when I scrape it, all the Asian characters come back like this: &# 12525 ;
(I put spaces in there because this site auto-converts the character to UTF-8.)
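To illustrate what I mean: the reference itself is plain ASCII, so decoding the page as UTF-8 leaves it untouched; only converting the number with unichr() gives the real character. A quick interpreter check (as far as I understand it):

>>> unicode('&#12525;', 'utf-8')  # the reference is plain ASCII; decoding changes nothing
u'&#12525;'
>>> print unichr(12525)           # the code point it stands for
ロ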
Here is my test method:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import urllib
import urllib2
import socket
import re
from cookielib import LWPCookieJar

class MangaUpdates(object):

    def __init__(self):
        # Globalized header string sent to servers
        self._header = {
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
            'Content-Type': 'application/x-www-form-urlencoded'}
        # set up the cookie handler
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(LWPCookieJar()))
        urllib2.install_opener(opener)
        socket.setdefaulttimeout(40)

    def getAtlTitleNames(self, series_id):
        """Login not required."""
        base_url = 'http://www.mangaupdates.com/series.html'
        query_string = urllib.urlencode({'id': series_id})
        request = urllib2.Request(base_url, query_string, self._header)
        try:
            response = unicode(urllib2.urlopen(request).read(), "utf-8", 'replace')
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print 'Failed to reach mangaupdates.com'
                print 'Reason: ', e.reason
            elif hasattr(e, 'code'):
                print 'The server couldn\'t fulfill the request.'
                print 'Error code: ', e.code
            return False
        test = u'旅游'
        print test  # This works
        regx1 = re.compile(r'Associated Names.*\n.*')
        print unicode(regx1.search(response).group())

d = MangaUpdates()
print d.getAtlTitleNames('1580')
I could manually convert each character via unichr(int), but is there a way to get the whole thing back as unicode?
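For reference, here is a sketch of the kind of blanket conversion I mean, generalized with re.sub() and unichr(); it assumes every reference is a decimal &#NNNN; one, so hex forms (&#xNNNN;) and named entities (&amp; etc.) would pass through untouched:

# -*- coding: utf-8 -*-
import re

def decode_decimal_refs(text):
    # Replace each decimal numeric character reference (&#NNNN;) with
    # the character it names. Sketch only: hex references and named
    # entities are left as-is.
    return re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), text)

print decode_decimal_refs(u'&#12525;&#12470;&#12522;&#12458;')  # prints ロザリオ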
You'll notice that urllib2.urlopen(request).read() returns a string, which is the origin of this problem.