I am writing a web spider for research purposes and have run into an error I do not understand. I am fairly new to web programming and need a bit of guidance. I use http.client to open a connection, request a page, get the response, and read the response into a variable. Then, using HTMLParser, I attempt to feed() the variable to the parser, but I get this error:
Traceback (most recent call last):
  File "C:\Users\snorris4\Desktop\FLOSSmoleSpiderSavannah\src\SavannahSpider.py", line 45, in <module>
    main()
  File "C:\Users\snorris4\Desktop\FLOSSmoleSpiderSavannah\src\SavannahSpider.py", line 41, in main
    spider.feed(page)
  File "C:\Python31\lib\html\parser.py", line 107, in feed
    self.rawdata = self.rawdata + data
TypeError: Can't convert 'bytes' object to str implicitly
Any help would be very much appreciated. Thank you.
'''
Created on May 26, 2009
@author: Steven Norris
This program runs as a spider for savannah.gnu.org to add information about
both the GNU projects and non-GNU projects to a database for further investigation.
'''
from html import parser
from http import client
import re

class SpiderSavannahProjectsList(parser.HTMLParser):
    check_links=[]

    def get_page(self, site, page):
        conn=client.HTTPConnection(site)
        conn.request("GET","http://"+site+page)
        resp=conn.getresponse()
        html_page=resp.read()
        return html_page

    def handle_starttag(self,tag,attrs):
        if tag=='a':
            link=attrs[0][1]
            if re.search('\.\./projects/',link)!=None:
                self.check_links.append(link)

    def add_to_database(self,links):
        for link in links:
            page=self.get_page('savannah.gnu.org',link[3:len(link)])
            #add page to database here.

def main():
    spider=SpiderSavannahProjectsList()
    page=spider.get_page('savannah.gnu.org','/search/?type_of_search=soft&words=%2A&type=1&offset=0&max_rows=400#results')
    print (page)
    spider.feed(page)
    for i in spider.check_links:
        print (i)

main()
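
The traceback points at a type mismatch: in Python 3, the response's read() returns bytes, while HTMLParser.feed() appends its argument to a str buffer. Below is a minimal sketch of one possible fix, not a definitive implementation. It assumes the page is UTF-8 when no charset is known, and it uses a hard-coded sample byte string instead of a live request so it can run offline. LinkCollector and decode_page are hypothetical names introduced only for this illustration, and the href lookup by name is a small variation on the positional attrs[0][1] used above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for SpiderSavannahProjectsList."""

    def __init__(self):
        super().__init__()
        # Per-instance list, so separate parsers do not share results.
        self.check_links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Look up href by name instead of assuming it is the first attribute.
            href = dict(attrs).get('href')
            if href and '../projects/' in href:
                self.check_links.append(href)


def decode_page(raw, charset=None):
    """Convert the bytes from a response's read() into str for feed()."""
    # Assumption: fall back to UTF-8 when no charset is supplied.
    return raw.decode(charset or 'utf-8', errors='replace')


# Sample bytes standing in for resp.read(); not real output from savannah.gnu.org.
raw_page = b'<a href="../projects/example/">Example</a><a href="/about">About</a>'

collector = LinkCollector()
collector.feed(decode_page(raw_page))  # feed() receives str, so no TypeError
print(collector.check_links)  # prints ['../projects/example/']
```

In the spider itself, the same idea would amount to decoding the result of resp.read() in get_page before returning it. The UTF-8 fallback is only a guess; where the server declares a charset in its Content-Type header, that value is the safer choice.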