Hi again. I have been assigned a project to create a web crawler in Python, but I have no idea where to start, so all help will be welcome.

This is a good place to start. http://cis.poly.edu/cs912/parsing.txt

That is sample code you can use to gather all of the links on a particular web page. Once you have the list of links on a page, you can repeat the process for each of those links, and keep going until you have all the links you want.
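In case it helps to see the whole idea in one place, here is a rough, untested sketch of that loop. The parser class is a stripped-down version of the one in the linked example, and names like get_links, to_visit and the example.com start page are just placeholders:

import urllib, urlparse, htmllib, formatter

class LinksExtractor(htmllib.HTMLParser):    # same idea as the parser in the linked example
    def __init__(self, fmt):
        htmllib.HTMLParser.__init__(self, fmt)
        self.links = []                      # collected href values
    def start_a(self, attrs):                # called for every <a ...> tag
        for name, value in attrs:
            if name == "href":
                self.links.append(value)

def get_links(url):                          # fetch one page and return its links
    parser = LinksExtractor(formatter.NullFormatter())
    parser.feed(urllib.urlopen(url).read())
    parser.close()
    return parser.links

to_visit = ["http://example.com/"]           # placeholder start page
seen = set()                                 # pages already parsed
while to_visit and len(seen) < 10:           # stop after 10 pages for this sketch
    url = to_visit.pop(0)
    if url in seen:
        continue
    seen.add(url)
    try:
        found = get_links(url)
    except IOError:
        continue                             # skip anything urllib cannot open
    for link in found:
        # relative links ("/about") must be joined with the page they came from
        to_visit.append(urlparse.urljoin(url, link))
print(seen)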

Thanks, that works, but I have a little problem with the code:

import urllib, htmllib, formatter

a = []
running = 0

class LinksExtractor(htmllib.HTMLParser):            # derive new HTML parser
    def __init__(self, formatter):                   # class constructor
        htmllib.HTMLParser.__init__(self, formatter) # base class constructor
        self.links = []                              # create an empty list for storing hyperlinks

    def start_a(self, attrs):                        # override handler of <A ...>...</A> tags
        # process the attributes
        if len(attrs) > 0:
            for attr in attrs:
                if attr[0] == "href":                # ignore all non-HREF attributes
                    self.links.append(attr[1])       # save the link info in the list

    def get_links(self):
        return self.links

format = formatter.NullFormatter()                   # create default formatter
htmlparser = LinksExtractor(format)                  # create new parser object

data = urllib.urlopen("http://uk.youtube.com/")
htmlparser.feed(data.read())                         # parse the file saving the info about links
htmlparser.close()

links = htmlparser.get_links()                       # get the hyperlinks list
print(links)                                         # print all the links

while running <= 3:
    for item in links:
        a.append(item)
        for item in a:
            data = urllib.urlopen(item)
        htmlparser.feed(data.read())
        htmlparser.close()

        links = htmlparser.get_links()
        print(links)

It raises this error:

Traceback (most recent call last):
  File "C:\Python26\web crawler start.py", line 30, in <module>
    data = urllib.urlopen(item)
  File "C:\Python26\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "C:\Python26\lib\urllib.py", line 203, in open
    return getattr(self, name)(url)
  File "C:\Python26\lib\urllib.py", line 461, in open_file
    return self.open_local_file(url)
  File "C:\Python26\lib\urllib.py", line 486, in open_local_file
    return addinfourl(open(localname, 'rb'),
IOError: [Errno 2] No such file or directory: '\\'


In this trial I used YouTube.

Why does this error happen, and how can I solve it?

If you add a line print(item) before your line data = urllib.urlopen(item), you might see why urlopen can't open the URL.
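Most likely the print will show relative links such as "/" or "#". urllib.urlopen sees no http:// scheme on those, so it falls back to treating them as local file paths, which is exactly the IOError in the traceback. One way around it is a quick sketch like this, using urljoin from Python 2's urlparse and the links list from your code above (the base variable is just the page the links were taken from):

import urllib, urlparse

base = "http://uk.youtube.com/"              # the page the links were taken from
for item in links:
    print(item)                              # many of these will be relative, e.g. "/" or "#"
    absolute = urlparse.urljoin(base, item)  # turn them back into full URLs
    if absolute.startswith("http"):          # skip mailto:, javascript:, etc.
        data = urllib.urlopen(absolute)
        # then feed data.read() to the parser as before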

Hi, I worked out that it does that whenever a site has a login link. How can I solve this problem, and after that, how can I implement it into a search engine?

I guess I'm going to need knowledge of PHP and the information to be in a txt file, but how can I go about doing this?
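If the first step is just getting the gathered links into a txt file, plain Python is enough for that part. Something like this sketch would do (links.txt and crawled_links are only placeholder names):

# write whatever list of links the crawler built to a plain text file, one per line
crawled_links = htmlparser.get_links()       # or any list of URLs you have collected
out = open("links.txt", "w")
for link in crawled_links:
    out.write(link + "\n")
out.close()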
