istib 0 Newbie Poster

Dear all,

I've been struggling with an issue whilst scraping a site that contains both Roman and Arabic text (and, it seems, Asian characters too). I want to extract only the entries that are in English, i.e. those that contain no foreign letters. I've been reading up on Unicode in some guides but can't get my Python script to handle it. I confess that I've only just started with Python.

I've tried several styles of encoding, but in essence the (non-working) code is meant to do something like this:

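# -*- coding: utf-8 -*-
import re
from BeautifulSoup import BeautifulSoup

# soup: the parsed entries, built earlier (not shown here)
for i in range(1, len(soup)):
    title = unicode(soup[i].b.string)
    # unicode pattern, so the umlauts are compared as characters, not bytes
    test = re.search(u"[?!&#äöü]", title)
    if test:
        # skip entries whose title contains any of the listed characters
        continue
    # treat entry [i] in soup.b ...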
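
To make the goal concrete, here is a rough sketch of the kind of check I'm after, rather than listing every unwanted character by hand. It assumes Python 3 (where every str is already Unicode), and the helper names is_plain_ascii and is_latin_text as well as the sample titles are made up for illustration; they are not from the site I'm scraping.

```python
import unicodedata


def is_plain_ascii(text):
    """Strict check: True only if every character is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in text)


def is_latin_text(text):
    """Looser check: True if every letter belongs to the Latin script.

    Digits, spaces and punctuation are ignored; accented letters such as
    'ü' count as Latin because their Unicode names start with 'LATIN'.
    """
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True


# Hypothetical sample titles, standing in for the scraped entries.
titles = ["Hello world", "مرحبا", "Grüße", "東京 Tower"]

# Keep only the titles that contain nothing outside ASCII.
english_like = [t for t in titles if is_plain_ascii(t)]
```

The strict version would also throw away English titles that happen to contain an accented letter or a typographic quote, so the looser Latin-script test may be closer to what I need; I'm not sure which is the better fit.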

Thanks for any help!