Hello
I'm building my own HTML parser in Python, and have run into some problems.
First off, I'm using Python 3, so I can't use the old bundled SGMLParser or Beautiful Soup, and I could not find Windows binaries for lxml, so I'm rolling my own. It is for my master's thesis, so the effort is not wasted anyway. The parser will be used to parse pages I find with my crawler for statistical analysis.
What I use: regex. I found this beautiful regex (?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>
that works like a charm. I get every tag in the page and I track the start and end positions of the tag.
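To make this concrete, here is a minimal sketch of how the regex is applied. The function name `find_tags` and the tuple layout are only for illustration, not my exact code:

```python
import re

# The tag regex from this post, compiled once. The raw triple-quoted
# string lets it contain both quote characters unchanged.
TAG_RE = re.compile(
    r"""(?i)<(\/?\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>"""
)

def find_tags(html):
    """Return (tag_name, start, end) for every tag the regex matches.

    Closing tags keep their leading slash, e.g. '/p'.
    """
    return [(m.group(1).lower(), m.start(), m.end())
            for m in TAG_RE.finditer(html)]

if __name__ == "__main__":
    for name, start, end in find_tags('<p class="a">Hi</p><br/>'):
        print(name, start, end)
```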
The problem: I'm really not interested in whatever goes on between <script></script> tags. Since script tags cannot contain HTML, I thought it was just a matter of matching the start and end tags and removing whatever is in between. But it was not that easy. The biggest problem I face is JavaScript code that itself outputs more script tags!
An example:
document.write('<SCRIPT LANGUAGE=VBScript\> \n');
document.write('on error resume next \n');
document.write('ShockMode = (IsObject(CreateObject("ShockwaveFlash.ShockwaveFlash.6")))\n');
document.write('<\/SCRIPT\> \n');
My regex matches the <SCRIPT> tag inside document.write, but I really don't want that, especially since it doesn't match the <\/SCRIPT\> tag, and that really messes up my parsing.
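For reference, the "remove whatever is in between" idea could be sketched roughly as below. It assumes that a script block ends at the first literal `</script` after its opening tag (so the escaped `<\/SCRIPT\>` in the example does not count), and that the opening tag has no `>` inside an attribute value. The name `strip_script_bodies` is hypothetical, and this is only a sketch, not a full solution:

```python
import re

# Opening <script ...> tag and the literal closing </script> tag.
SCRIPT_OPEN_RE = re.compile(r"<script\b[^>]*>", re.IGNORECASE)
SCRIPT_CLOSE_RE = re.compile(r"</script\s*>", re.IGNORECASE)

def strip_script_bodies(html):
    """Return html with the text between <script> and </script> removed."""
    out = []
    pos = 0
    while True:
        opener = SCRIPT_OPEN_RE.search(html, pos)
        if opener is None:
            # No more scripts: keep the remainder of the page.
            out.append(html[pos:])
            break
        # Keep everything up to and including the opening tag.
        out.append(html[pos:opener.end()])
        closer = SCRIPT_CLOSE_RE.search(html, opener.end())
        if closer is None:
            # Unterminated script: drop the rest (an assumption of this sketch).
            break
        # Resume at the closing tag, skipping the script body entirely.
        pos = closer.start()
    return "".join(out)
```

The tag regex would then be run on the stripped text. Note that positions in the stripped text no longer line up with the original page; if the original offsets matter, the same search could record the spans to skip instead of removing them.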
Does anyone have any good ideas for what I can do to solve my problem?
And if someone spots any other problems I might run into with this method of parsing, I would love to be made aware of them :)
Best regards
Vidaj