I'm having a problem with passing a variable.
Basically, I'm trying to find every URL in an HTML document. I'm doing this with the HTMLParser class (I made a subclass of it).
This is the code:
import HTMLParser
import urllib2
import urlparse

class Parser(HTMLParser.HTMLParser):
    def __init__(self, url):
        HTMLParser.HTMLParser.__init__(self)
        req = urllib2.urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        if tag.lower() == 'a' and attrs:
            # attrs is a list of (name, value) tuples
            for attribute in attrs:
                if attribute[0] == 'href':
                    # 'base' is not defined here; this is the problem
                    link = urlparse.urljoin(base, attribute[1])
                    print link
Everything works fine except for the urlparse.urljoin call. It needs the base URL, which is the url passed to __init__.
I can't hardcode it, since the URL will change. I also can't add it as an extra argument to handle_starttag, since then the superclass won't be able to call that method anymore...
So I'm lost on how to get the argument "url" from __init__ into my method handle_starttag.
(PS. I can't use BeautifulSoup or lxml or anything like that, and yes, it's killing me)
EDIT:
I managed to make this work:
- make a variable 'base' in the class Parser
- in __init__, set it with: Parser.base = url
- in handle_starttag, read it back with: Parser.base
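
One caveat with the fix above: Parser.base is a class attribute, so every Parser instance shares it, and creating a second parser for another URL overwrites the value. Below is a minimal sketch of the same idea using an instance attribute (self.base) instead. The names LinkParser and links and the example.com URL are made up for illustration, and the HTML is fed in as a string rather than fetched with urllib2 so the sketch runs without network access. The try/except import is only there so it runs on both Python 2 and Python 3.

```python
try:
    # Python 3 module names
    from html.parser import HTMLParser
    from urllib.parse import urljoin
except ImportError:
    # Python 2 module names, as used in the question
    from HTMLParser import HTMLParser
    from urlparse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base):
        HTMLParser.__init__(self)
        self.base = base   # stored per instance, readable in any handler
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                self.links.append(urljoin(self.base, value))


# Hypothetical base URL and page content, for illustration only.
parser = LinkParser('http://example.com/docs/')
parser.feed('<a href="page.html">one</a> <a href="/top">two</a>')
print(parser.links)
```

If feed() is called inside __init__ as in the question, the attribute has to be assigned before that call, because feed() is what triggers handle_starttag.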