help with web crawler

Question

leegeorg07

15 Years Ago

A while ago I asked for help with a web crawler and got it. Now, in the Air Cadets, we are looking at the history of the RAF, and I wanted to know whether there is any way I can edit the code below so that it only finds pages containing a certain phrase or string.

#these modules do most of the work
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO

def log_stdout(msg):
    """Print msg on the screen"""
    print msg

def get_page(url, log):
    """Retrieve URL and return contents, log errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body

def find_links(html):
    """Return a list of links in html."""
    #we're using the parser just to get the HREFs
    writer = formatter.DumbWriter(StringIO())
    f = formatter.AbstractFormatter(writer)
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist

class Spider:
    
    """
    The heart of this program, finds all links within a website.
    
    run() contains the main loop.
    process_page() retrieves each page and finds all the links.
    """
    def __init__(self,startURL, log=None):
        #this method sets initial values
        self.URLs = set()
        self.URLs.add(startURL)
        self.include = startURL
        self._links_to_process = [startURL]
        if log is None:
            #use log_stdout function if no log provided
            self.log = log_stdout
        else:
            self.log = log
    
    def run(self):
        #processes lists of URLs one at a time
        while self._links_to_process:
            url = self._links_to_process.pop()
            self.log("Retrieving: " +url)
            self.process_page(url)
        
    def url_in_site(self, link):
        #checks whether the link starts with the base URL
        return link.startswith(self.include)
    
    def process_page(self, url):
        #Retrieves page and finds links in it
        html = get_page(url, self.log)
        for link in find_links(html):
            #Handle relative links
            link = urlparse.urljoin(url, link)
            self.log("Checking: " + link)
            #make sure this is a new URL within current site
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)

if __name__ == '__main__':
    #this code runs when script is started from command line
    startURL = 'http://www.raf.mod.uk/'
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL

python


lllllIllIlllI 178 Veteran Poster

15 Years Ago

You could just do a quick count function to see if enough relevant words are in your page.
So for example:

body = "The Royal Air Force (RAF) is the United Kingdom's air force, the oldest independent air force in the world.[2] Formed on 1 April 1918,[3] the RAF has taken a significant role in British military history ever since, playing a large part in World War II and in more recent conflicts. The RAF operates almost 1,100 aircraft and, as of 31 March 2008, had a projected trained strength of 41,440 regular personnel.[4]The majority of the RAF's aircraft and personnel are based in the UK with many others serving on operations (principally Iraq, Afghanistan, Middle East, Balkans, and South Atlantic) or at long-established overseas bases (notably the Falkland Islands, Qatar, Germany, Cyprus, and Gibraltar)."

# count() is case-sensitive, so each word must match the text exactly
if body.count('RAF') and body.count("history") and\
  body.count("air force"):
    print "This source is good! :)"

Do you see what I mean? From there you can make the check more accurate.


leegeorg07 · Answer 1 · 2009-03-18T15:28:57+00:00

Thanks, I'll give it a go when I get home!

leegeorg07 · Answer 2 · 2009-03-19T00:04:49+00:00

Just wondering... how could I use it with the code I posted above?

lllllIllIlllI 178 Veteran Poster · Answer 3 · 2009-03-19T01:50:55+00:00

What it does is count the number of occurrences of the words RAF, air force, and history. The count() function returns how many times each word was found, so any result of 1 or more is treated as true.
So if you had a bit of text with only RAF and history, you wouldn't get a match, because when a word is not found count() returns 0, which is treated as false.

You could make it more accurate by requiring a minimum number of occurrences of each word, for example if body.count("RAF") > 4. That way you can be more confident that the whole page really is about the RAF's history.
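As a rough sketch of that idea, the thresholds could be kept in a dictionary and checked in one place. The function name `meets_thresholds`, the `RAF_HISTORY` dictionary, and the numbers in it are made up for illustration; they are not part of the crawler posted above.

```python
def meets_thresholds(body, min_counts):
    """Return True if every phrase occurs at least its minimum number of times."""
    for phrase, minimum in min_counts.items():
        # str.count() is case-sensitive and returns 0 when nothing is found
        if body.count(phrase) < minimum:
            return False
    return True

# Hypothetical thresholds; tune them for the pages you care about
RAF_HISTORY = {"RAF": 3, "history": 1, "air force": 2}

sample = ("The RAF is the United Kingdom's air force. The RAF has a long "
          "history, and the RAF is the oldest independent air force.")
matched = meets_thresholds(sample, RAF_HISTORY)
```

Keeping the match case-sensitive is deliberate here: a lowercase search for "raf" would also count words such as "aircraft".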

Oh, and replace that large string above with the text you download from the website.
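One possible way to wire this into the crawler is sketched below. It assumes you are happy to search the raw HTML (tags included) rather than the visible text only. `page_matches` and `MatchCollector` are hypothetical names, not something from the original code, and the example.com URLs are placeholders.

```python
def page_matches(html, phrases, ignore_case=False):
    """Return True if every phrase appears somewhere in the downloaded page."""
    if ignore_case:
        html = html.lower()
        phrases = [phrase.lower() for phrase in phrases]
    return all(phrase in html for phrase in phrases)

class MatchCollector(object):
    """Hypothetical helper: remembers URLs whose pages contain all phrases."""

    def __init__(self, phrases):
        self.phrases = phrases
        self.matches = set()

    def check(self, url, html):
        # Record the URL only when the page passes the phrase test
        if page_matches(html, self.phrases):
            self.matches.add(url)

collector = MatchCollector(["RAF", "history"])
collector.check("http://example.com/a", "<p>The RAF has a long history</p>")
collector.check("http://example.com/b", "<p>aircraft history</p>")
```

In the Spider class, process_page() already has the page in the html variable, so a call like collector.check(url, html) placed right after get_page() would record the matching pages, and you could print collector.matches at the end instead of every URL.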