Hello,

I am building a search engine with PHP. It is nearly finished.
Now, I need to build the web crawler.
Since most websites have an XML sitemap for web crawlers to use to find their site links, I would prefer to build an XML sitemap crawler rather than a general HTTP crawler.
I am not having much luck finding a PHP tutorial on it, using these keywords on Google:

"sitemap crawler"+"php tutorial" OR "php", -"sitemap generator"

If you know of any tutorial or free code then let me know.
The simpler the code, the better.
I will keep this thread updated with what I have found so far. If you find anything better than mine, then drop me a line here.

Thanks

Disclaimer: I am not a sitemap/seo expert (or an expert in anything else, for that matter).

Correct me if I'm wrong, but I think that in order to find the sitemap, you would have to crawl the website anyway. You would start by looking for the sitemap. If you find the sitemap, download and parse it. Otherwise, continue crawling the website to build your own sitemap.
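For example, just to sketch that idea (untested, with example.com as a placeholder), you might check the conventional /sitemap.xml location first and only fall back to crawling if it is not there:

<?php
// Untested sketch: try the conventional sitemap location first, and only
// crawl the site to build your own sitemap if it is not there.
$ch = curl_init( 'https://example.com/sitemap.xml' );
curl_setopt( $ch, CURLOPT_NOBODY, true );          // we only need the status code
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_exec( $ch );
$status = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
curl_close( $ch );

if ( 200 === $status ) {
    echo "Sitemap found -- download and parse it\n";
} else {
    echo "No sitemap there -- fall back to crawling the site\n";
}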

Finally, I searched Google using the following keywords: site crawler tutorial php and the first hit was: https://www.freecodecamp.org/news/web-scraping-with-php-crawl-web-pages/

Hope this helps!

@gce517

Thanks. Checking your link out now.
There is no need to write PHP code for the crawler to sniff out the sitemap URL, as site owners will submit the URLs of their sitemaps on the "Link Submit" form. All the crawler needs to do is load the sitemap URL, extract all the links, and then visit each link and extract the following (a rough sketch follows the list below):

meta keywords
meta descriptions
headers
links (urls & anchors)

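Something like this is what I have in mind for the extraction step. It is only a rough, untested DOMDocument sketch, and extract_page_data is just a name I made up for illustration:

<?php
// Rough, untested sketch of the extraction step.
function extract_page_data( $html ) {

    $doc = new DOMDocument();
    // Suppress warnings caused by the messy HTML found in the wild.
    @$doc->loadHTML( $html );

    $data = array(
        'keywords'    => '',
        'description' => '',
        'headers'     => array(),
        'links'       => array(),
    );

    // Meta keywords and meta description.
    foreach ( $doc->getElementsByTagName( 'meta' ) as $meta ) {
        $name = strtolower( $meta->getAttribute( 'name' ) );
        if ( 'keywords' === $name ) {
            $data['keywords'] = $meta->getAttribute( 'content' );
        } elseif ( 'description' === $name ) {
            $data['description'] = $meta->getAttribute( 'content' );
        }
    }

    // Headers h1 to h6.
    foreach ( array( 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' ) as $tag ) {
        foreach ( $doc->getElementsByTagName( $tag ) as $h ) {
            $data['headers'][] = trim( $h->textContent );
        }
    }

    // Links: URLs and anchor texts.
    foreach ( $doc->getElementsByTagName( 'a' ) as $a ) {
        $data['links'][] = array(
            'url'    => $a->getAttribute( 'href' ),
            'anchor' => trim( $a->textContent ),
        );
    }

    return $data;
}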
That's all for now.
There is no need for the crawler to have PHP code to deal with the robots.txt file.
Is there anything else important that I am forgetting or missing here?

@dani

I am curious: have you ever built a web crawler before?

Programming Buddies,

If you know of any small site's XML sitemap URL, then let me know so I can test each sitemap crawler code I come across on that small site.
I do not want to be testing on large sites.

Thanks

Ok, I get it now. You just want a sitemap parser, right? Maybe this will help? https://github.com/VIPnytt/SitemapParser There may be others, of course; that's just the first hit in my search.

I haven't tested it, but I think it fits what you need it to do: load the sitemap and extract the links and metadata.

If you do put it to the test, I would go the medium-to-large site route. Why? In my experience testing with a reduced data sample, once I got my code working, a larger data sample invariably broke it. Even with a large data sample, you can test incrementally to make sure you cover all the bases. Just my two cents.

Edit: Because sitemaps are created following a sitemap protocol, I think that understanding the protocol is the best first step to creating (or implementing) your sitemap parser. If anything, it might even help you write the very tutorial you were seeking ;-)
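For reference, the protocol boils down to two document types: a urlset that lists page URLs, and a sitemapindex that lists other sitemap files. Here is a rough, untested sketch of a minimal urlset and how SimpleXML can read it (the URLs are made up):

<?php
// Untested sketch: a minimal urlset document and how to read it with SimpleXML.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2022-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
XML;

$sitemap = new SimpleXMLElement( $xml );

// The root element name tells you whether you have a urlset or a sitemapindex.
if ( 'urlset' === $sitemap->getName() ) {
    foreach ( $sitemap->url as $url ) {
        echo (string) $url->loc . PHP_EOL;
    }
}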

It has been a looooong time since anyone asked me that. 'gce' are my initials. The number 517 is part of my ICQ number (127517).

I'm glad you found what you needed!

Cheers!

There is no need to write PHP code for the crawler to sniff out the sitemap URL, as site owners will submit the URLs of their sitemaps on the "Link Submit" form.

Many sites, DaniWeb included, also include a link to the sitemap file in their robots.txt.
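For example, a rough, untested sketch for pulling those out would be:

<?php
// Untested sketch: print any "Sitemap:" lines found in a robots.txt file.
// The daniweb.com URL is just the example discussed in this thread.
$robots = file_get_contents( 'https://www.daniweb.com/robots.txt' );

if ( false !== $robots ) {
    foreach ( preg_split( '/\R/', $robots ) as $line ) {
        $line = trim( $line );
        // The directive name is conventionally matched case-insensitively.
        if ( 0 === stripos( $line, 'sitemap:' ) ) {
            echo trim( substr( $line, strlen( 'sitemap:' ) ) ) . PHP_EOL;
        }
    }
}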

That being said, the code at https://www.daniweb.com/programming/web-development/threads/538869/can-this-oop-be-convert-to-procedural-style that you posted should be exactly what you're looking for, but unfortunately there is no way of converting it to procedural style. It was designed to be object oriented for a reason ... Objects give you functionality that is exponentially more difficult to accomplish with procedural code. And, in this case, because we're using built-in PHP classes and objects, it's not possible to rewrite them without thousands and thousands of lines of code.

If you know of any small site's XML sitemap URL, then let me know so I can test each sitemap crawler code I come across on that small site.
I do not want to be testing on large sites.

We use sitemap index files to break our sitemap up. You can try parsing https://www.daniweb.com/home-sitemap.xml, which is a pretty small sitemap file of ours.

I am curious: have you ever built a web crawler before?

I have not, but I have 20 years of PHP experience, and built this site. While not a crawler, per se, we do have a cURL-based link validator in the backend.

Struggling to understand the code on this one

There is a class called BJ_Crawler that defines objects of the type BJ_Crawler. Objects of this type have two properties: $_sitemaps and $_urls.

The constructor function is executed when you create a new object of the type BJ_Crawler. That means I can use this class by doing something such as:

$mySitemaps = array('sitemap1.xml', 'sitemap2.xml');
$myCrawler = new BJ_Crawler($mySitemaps);

and when I run that code, it calls the __construct() function first. In that constructor function, for each sitemap that is passed in (e.g. in this case, sitemap1.xml and sitemap2.xml), we call the class's add_sitemap() function.

This add_sitemap() function fetches each sitemap file, adds it to the object's list of sitemap files, finds all of the sitemap's URLs, and adds them all to the object's list of URLs.

Now, the $myCrawler variable is an object that is storing a list of sitemaps and a list of URLs contained within those sitemaps.

We can then call:

$myCrawler->run();

and what that will do is perform cURL requests to look up all of the URLs associated with $myCrawler.

You can see, in the example code, that just above the Crawler class the developer presents a similar use case:

$sitemaps = array(
    'https://bjornjohansen.no/sitemap_index.xml',
);

$crawler = new BJ_Crawler( $sitemaps );
$crawler->run();

Just so that everyone will always know what's being referred to, in case the other page is taken offline, here's the code I'm referring to:

#!/usr/bin/php
<?php
/**
 * @license http://www.wtfpl.net/txt/copying/ WTFPL
 */

date_default_timezone_set( 'UTC' );


$sitemaps = array(
    'https://bjornjohansen.no/sitemap_index.xml',
);

$crawler = new BJ_Crawler( $sitemaps );
$crawler->run();


/**
 * Crawler class
 */
class BJ_Crawler {

    protected $_sitemaps = null;
    protected $_urls = null;

    /**
     * Constructor
     *
     * @param array|string $sitemaps A string with an URL to a XML sitemap, or an array with URLs to XML sitemaps. Sitemap index files works well too.
     *
     */
    function __construct( $sitemaps = null ) {

        $this->_sitemaps = [];
        $this->_urls = [];

        if ( ! is_null( $sitemaps ) ) {
            if ( ! is_array( $sitemaps ) ) {
                $sitemaps = array( $sitemaps );
            }

            foreach ( $sitemaps as $sitemap ) {
                $this->add_sitemap( $sitemap );
            }
        }

    }

    /**
     * Add a sitemap URL to our crawl stack. Sitemap index files works too.
     *
     * @param string $sitemapurl URL to a XML sitemap or sitemap index
     */
    public function add_sitemap( $sitemapurl ) {

        if ( in_array( $sitemapurl, $this->_sitemaps ) ) {
            return;
        }

        $this->_sitemaps[] = $sitemapurl;

        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, $sitemapurl );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        $content = curl_exec( $ch );
        $http_return_code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );

        if ( '200' != $http_return_code ) {
            return false;
        }

        $xml = new SimpleXMLElement( $content, LIBXML_NOBLANKS );

        if ( ! $xml ) {
            return false;
        }

        switch ( $xml->getName() ) {
            case 'sitemapindex':
                foreach ( $xml->sitemap as $sitemap ) {
                    $this->add_sitemap( reset( $sitemap->loc ) );
                }
                break;

            case 'urlset':
                foreach ( $xml->url as $url ) {
                    $this->add_url( reset( $url->loc ) );
                }
                break;

            default:
                break;
         }

    }

    /**
     * Add a URL to our crawl stack
     *
     * @param string $url URL to check
     */
    public function add_url( $url ) {

        if ( ! in_array( $url, $this->_urls ) ) {
            $this->_urls[] = $url;
        }

    }

    /**
     * Run the crawl
     */
    public function run() {

        // Split our URLs into chunks of 5 URLs to use with curl multi
        $chunks =  array_chunk( $this->_urls, 5 );

        foreach ( $chunks as $chunk ) {

            $mh = curl_multi_init();

            foreach ( $chunk as $url ) {
                $ch = curl_init();
                curl_setopt( $ch, CURLOPT_URL, $url );
                curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
                curl_multi_add_handle( $mh, $ch );
            }

            $active = null;
            do {
                $mrc = curl_multi_exec( $mh, $active );
            } while ( CURLM_CALL_MULTI_PERFORM == $mrc );

            while ( $active && CURLM_OK == $mrc ) {
                if ( curl_multi_select( $mh ) != -1) {
                    do {
                        $mrc = curl_multi_exec( $mh, $active );
                    } while ( CURLM_CALL_MULTI_PERFORM == $mrc );
                }
            }
        }
    }

}

@dani

How come your sitemap lists URLs without extensions such as:

.html
.htm
.shtml
.shtm
.php

Because, as much of the web does nowadays, we use URI routing to dynamically build our URLs. They don’t map one-to-one with actual files on our web server.

@Dani

I am not sure I understand.
I have seen dynamic URLs that have extensions. If your website is dynamic, then it too should have an extension on the particular file running the scripts.
I am missing something here; I do not understand the URI thing.

I do not know how to explain it better than I already have, so here is a Google blog article from 2008 that explains why you shouldn't use URI routing:

https://developers.google.com/search/blog/2008/09/dynamic-urls-vs-static-urls

The article is actually outdated because nowadays (as this article is quite old) URI routing is built into most web frameworks, so there’s actually not an easy way to NOT use it. But it does explain what URI routing is.
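To illustrate the idea only (this is not DaniWeb's actual code), a front controller is typically a single PHP script that every request gets rewritten to; the request path, not a file on disk, decides what is rendered, which is why the URLs carry no .php or .html extension:

<?php
// index.php -- simplified front-controller sketch, for illustration only.
// A rewrite rule sends every request to this script, and the path decides
// what gets rendered, so URLs do not map to files and need no extension.
$path = parse_url( $_SERVER['REQUEST_URI'], PHP_URL_PATH );

switch ( $path ) {
    case '/':
        echo 'Home page';
        break;
    case '/programming/web-development':
        echo 'Web Development forum';
        break;
    default:
        http_response_code( 404 );
        echo 'Not found';
}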
