PHP HTTP Screen-Scraping Class with Caching

Troy

class_http.php is a "screen-scraping" utility that makes it easy to scrape content and cache scraped content for any number of seconds desired before hitting the live source again. Caching makes you a good neighbor!
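
The caching idea described above boils down to a freshness check on the cache file's modification time. The following is a minimal sketch of that check, not the class's actual code path. The function name `cache_is_fresh` and its `$now` parameter are hypothetical and exist only for illustration; the "daily" keyword mirrors a TTL value the class accepts.

```php
<?php
// Hypothetical helper: decide whether a cache file is still usable.
// $ttl is a number of seconds, or the string "daily" (valid until the date changes).
// $now can be passed explicitly to make the check easy to test.
function cache_is_fresh($filename, $ttl, $now = null) {
    if ($now === null) { $now = time(); }
    if (!file_exists($filename)) { return false; }   // nothing cached yet
    $modified = filemtime($filename);
    if ($ttl === "daily") {
        // Fresh only while the cache file was written on today's date.
        return date('Y-m-d', $modified) === date('Y-m-d', $now);
    }
    // Fresh while fewer than $ttl seconds have passed since the file was written.
    return ($now - $modified) < (int)$ttl;
}
```

With a check like this, a TTL of 600 means the live source is contacted at most once every 10 minutes, no matter how many visitors request the page.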

The class has 2 static methods that make it easy to extract individual tables of data out of web pages. The class even comes with a companion script that makes it easy to use and cache external images directly within img elements.
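
To illustrate what "extracting a table into an array" means, here is a sketch that uses PHP's DOM extension instead of the class's own string-based parser. It is an alternative technique, not the author's method. The function name `first_table_to_array` is hypothetical, and the sketch assumes the DOM extension is enabled.

```php
<?php
// Alternative sketch using the DOM extension (not the class's string-based parser).
// Returns the rows of the first <table> in $html as an array of arrays of cell text.
function first_table_to_array($html) {
    if (trim($html) === "") { return false; }
    $doc = new DOMDocument();
    // Suppress warnings from the malformed markup that scraped pages often contain.
    if (!@$doc->loadHTML($html)) { return false; }
    $table = $doc->getElementsByTagName('table')->item(0);
    if ($table === null) { return false; }
    $rows = array();
    // Note: this also picks up rows of tables nested inside the first table.
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = array();
        foreach ($tr->childNodes as $cell) {
            // Only th and td elements are cells; skip whitespace text nodes.
            if ($cell->nodeType === XML_ELEMENT_NODE
                && in_array(strtolower($cell->nodeName), array('th', 'td'))) {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}
```

The result has the same general shape the class's array method produces: one array per row, each holding the text of that row's cells.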

The class cloaks itself as the User Agent of the user making the request to your script. It also sends your script as the Referer, since in essence, it is the referrer. This means you should be able to screen-scrape sites that normally block screen-scraping. This class is not meant to help you break any company's usage policies. Be a good neighbor, and always use caching when you can.
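
The header handling described above can be sketched as a small function that derives both headers from a `$_SERVER`-style array. This is an illustration only: the function name `build_cloak_headers` is hypothetical, and the fallback User-Agent string is the one the class file uses as its default.

```php
<?php
// Hypothetical helper: derive User-Agent and Referer headers from a $_SERVER-style array.
function build_cloak_headers(array $server) {
    $headers = array();
    // Reuse the visitor's User-Agent when present; otherwise use a fixed fallback string.
    $headers["User-Agent"] = isset($server['HTTP_USER_AGENT'])
        ? $server['HTTP_USER_AGENT']
        : "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)";
    // The Referer is the URL of the script making the request, when it can be determined.
    if (isset($server['HTTP_HOST'], $server['REQUEST_URI'])) {
        $https = !empty($server['HTTPS']) && strtolower($server['HTTPS']) !== "off";
        $headers["Referer"] = ($https ? "https://" : "http://")
            . $server['HTTP_HOST'] . $server['REQUEST_URI'];
    }
    return $headers;
}
```

Passing `$_SERVER` as the argument gives the headers for the current request. When the script runs from the command line, the host and URI are missing and the sketch simply omits the Referer.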

Need to access protected content? The class can do basic authentication. However, a lot of sites that require login do not use basic authentication.
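
For reference, basic authentication amounts to sending one extra request header: the username and password joined by a colon and base64-encoded. A minimal sketch follows; the function name `basic_auth_header` is hypothetical and the credentials are placeholders.

```php
<?php
// Build the value of an HTTP Basic "Authorization" header.
function basic_auth_header($user, $pwd) {
    return "Basic " . base64_encode($user . ":" . $pwd);
}

// Placeholder credentials for illustration only.
echo "Authorization: " . basic_auth_header("MyUserName", "MyPassword") . "\n";
```

Because the encoding is trivially reversible, basic authentication only protects credentials when the request travels over HTTPS. Sites that use a login form with cookies need a different approach.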

The most current information, documentation, and downloads can be found at
http://www.troywolf.com/articles/php/class_http.

There are three complete PHP files listed below. First is the class file, class_http.php. The second is example.php to show you how to use the class. The third file is image_cache.php--a companion script to cache images for use within the src attribute of img elements.
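
As a rough illustration of how the companion script is meant to be referenced from an img element, the sketch below builds the src value. The helper name `image_cache_src` is hypothetical, and it assumes a numeric TTL in seconds.

```php
<?php
// Hypothetical helper: build the src attribute value for the image_cache.php companion script.
function image_cache_src($image_url, $ttl) {
    // urlencode() the remote URL so its own "?" and "&" do not leak into our query string;
    // htmlspecialchars() then escapes the "&" separator for use inside an HTML attribute.
    return htmlspecialchars("image_cache.php?ttl=" . (int)$ttl . "&url=" . urlencode($image_url));
}

echo '<img src="' . image_cache_src("http://www.somedomain.com/someimage.gif", 300) . '" alt="" />' . "\n";
```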

Troy Wolf operates ShinySolutions Webhosting, and is the author of SnippetEdit--a PHP application providing browser-based website editing that even non-technical people can use. "Website editing as easy as it gets." Troy has been a professional Internet and database application developer for over 10 years. He has many years' experience with ASP, VBScript, PHP, Javascript, DHTML, CSS, SQL, and XML on Windows and Linux platforms.

=====================================================
class_http.php
=====================================================
<?php
/*
* Filename.......: class_http.php
* Author.........: Troy Wolf [troy@troywolf.com]
* Last Modified..: Date: 2006/03/06 10:15:00
* Description....: Screen-scraping class with caching. Includes image_cache.php
                   companion script. Includes static methods to extract data
                   out of HTML tables into arrays or XML. Now supports sending
                   XML requests and custom verbs with support for making
                   WebDAV requests to Microsoft Exchange Server.
*/

class http {
    var $log;
    var $dir;
    var $name;
    var $filename;
    var $url;
    var $port;
    var $verb;
    var $status;
    var $header;
    var $body;
    var $ttl;
    var $headers;
    var $postvars;
    var $xmlrequest;
    var $connect_timeout;
    var $data_ts;
    
    /*
    The class constructor. Configure defaults.
    */
    function __construct() {
        $this->log = "New http() object instantiated.<br />\n";
        
        /*
        Seconds to attempt socket connection before giving up.
        */
        $this->connect_timeout = 30;
        
        /*
        Set the 'dir' property to the directory where you want to store the cached
        content. I suggest a folder that is not web-accessible.
        End this value with a "/".
        */
        $this->dir = realpath("./")."/"; //Default to current dir.

        $this->clean();               

        return true;
    }
    
    /*
    fetch() method to get the content. fetch() will use 'ttl' property to
    determine whether to get the content from the url or the cache.
    */
    function fetch($url="", $ttl=0, $name="", $user="", $pwd="", $verb="GET") {
        $this->log .= "--------------------------------<br />fetch() called<br />\n";
        $this->log .= "url: ".$url."<br />\n";
        $this->status = "";
        $this->header = "";
        $this->body = "";
        if (!$url) {
            $this->log .= "OOPS: You need to pass a URL!<br />";
            return false;
        }
        $this->url = $url;
        $this->ttl = $ttl;
        $this->name = $name;
        $need_to_save = false;
        if ($this->ttl == "0") {
            if (!$fh = $this->getFromUrl($url, $user, $pwd, $verb)) {
                return false;
            }
        } else {
            if (strlen(trim($this->name)) == 0) { $this->name = MD5($url); }
            $this->filename = $this->dir."http_".$this->name;
            $this->log .= "Filename: ".$this->filename."<br />";
            $this->getFile_ts();
            if ($this->ttl == "daily") {
                if (date('Y-m-d',$this->data_ts) != date('Y-m-d',time())) {
                    $this->log .= "cache has expired<br />";
                    if (!$fh = $this->getFromUrl($url, $user, $pwd, $verb)) {
                        return false;
                    }
                    $need_to_save = true;
                } else {
                    if (!$fh = $this->getFromCache()) {
                        return false;
                    }
                }
            } else {
                if ((time() - $this->data_ts) >= $this->ttl) {
                    $this->log .= "cache has expired<br />";
                    if (!$fh = $this->getFromUrl($url, $user, $pwd, $verb)) {
                        return false;
                    }
                    $need_to_save = true;
                } else {
                    if (!$fh = $this->getFromCache()) {
                        return false;
                    }
                }
            }
        }
        
        /*
        Get response header.
        */
        $this->header = fgets($fh, 1024);
        $this->status = substr($this->header,9,3);
        while ((trim($line = fgets($fh, 1024)) != "") && (!feof($fh))) {
            $this->header .= $line;
            if ($this->status=="401" and strpos($line,"WWW-Authenticate: Basic realm=\"")===0) {
                fclose($fh);
                $this->log .= "Could not authenticate<br />\n";
                return FALSE;
            }
        }
        
        /*
        Get response body.
        */
        while (!feof($fh)) {
            $this->body .= fgets($fh, 1024);
        }
        fclose($fh);
        if ($need_to_save) { $this->saveToCache(); }
        return $this->status;
    }
    
    /*
    PRIVATE getFromUrl() method to scrape content from url.
    */
    function getFromUrl($url, $user="", $pwd="", $verb="GET") {
        $this->log .= "getFromUrl() called<br />";
        preg_match("~([a-z]*://)?([^:^/]*)(:([0-9]{1,5}))?(/.*)?~i", $url, $parts);
        $protocol = $parts[1];
        $server = $parts[2];
        $port = $parts[4];
        $path = $parts[5];
        if ($port == "") {
            if (strtolower($protocol) == "https://") {
                $port = "443";
            } else {
                $port = "80";
            }
        }

        if ($path == "") { $path = "/"; }
        
        if (!$sock = @fsockopen(((strtolower($protocol) == "https://")?"ssl://":"").$server, $port, $errno, $errstr, $this->connect_timeout)) {
            $this->log .= "Could not open connection. Error "
                .$errno.": ".$errstr."<br />\n";
            return false;
        }
        
        $this->headers["Host"] = $server.":".$port;
        
        if ($user != "" && $pwd != "") {
            $this->log .= "Authentication will be attempted<br />\n";
            $this->headers["Authorization"] = "Basic ".base64_encode($user.":".$pwd);
        }
        
        if (count($this->postvars) > 0) {
            $this->log .= "Variables will be POSTed<br />\n";
            $request = "POST ".$path." HTTP/1.0\r\n";
            $post_string = "";
            foreach ($this->postvars as $key=>$value) {
                $post_string .= "&".urlencode($key)."=".urlencode($value);
            }
            $post_string = substr($post_string,1);
            $this->headers["Content-Type"] = "application/x-www-form-urlencoded";
            $this->headers["Content-Length"] = strlen($post_string);
        } elseif (strlen($this->xmlrequest) > 0) {
            $this->log .= "XML request will be sent<br />\n";
            $request = $verb." ".$path." HTTP/1.0\r\n";
            $this->headers["Content-Length"] = strlen($this->xmlrequest);
        } else {
            $request = $verb." ".$path." HTTP/1.0\r\n";
        }

        #echo "<br />request: ".$request;

        
        if (fwrite($sock, $request) === FALSE) {
            fclose($sock);
            $this->log .= "Error writing request type to socket<br />\n";
            return false;
        }
        
        foreach ($this->headers as $key=>$value) {
            if (fwrite($sock, $key.": ".$value."\r\n") === FALSE) {
                fclose($sock);
                $this->log .= "Error writing headers to socket<br />\n";
                return false;
            }
        }
        
        if (fwrite($sock, "\r\n") === FALSE) {
            fclose($sock);
            $this->log .= "Error writing end-of-line to socket<br />\n";
            return false;
        }
        
        #echo "<br />post_string: ".$post_string;
        if (count($this->postvars) > 0) {
            if (fwrite($sock, $post_string."\r\n") === FALSE) {
                fclose($sock);
                $this->log .= "Error writing POST string to socket<br />\n";
                return false;
            }
        } elseif (strlen($this->xmlrequest) > 0) {
            if (fwrite($sock, $this->xmlrequest."\r\n") === FALSE) {
                fclose($sock);
                $this->log .= "Error writing xml request string to socket<br />\n";
                return false;
            }
        }
        
        return $sock;
    }
    
    /*
    PRIVATE clean() method to reset the instance back to mostly new state.
    */
    function clean()
    {
        $this->status = "";
        $this->header = "";
        $this->body = "";
        $this->headers = array();
        $this->postvars = array();
        /*
        Try to use user agent of the user making this request. If not available,
        default to IE6.0 on WinXP, SP1.
        */
        if (isset($_SERVER['HTTP_USER_AGENT'])) {
            $this->headers["User-Agent"] = $_SERVER['HTTP_USER_AGENT'];
        } else {
            $this->headers["User-Agent"] = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)";
        }
        
        /*
        Set referrer to the current script since in essence, it is the referring
        page.
        */
        if (!empty($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) != "off") {
            $this->headers["Referer"] = "https://".$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
        } else {
            $this->headers["Referer"] = "http://".$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
        }
    }
    
    /*
    PRIVATE getFromCache() method to retrieve content from cache file.
    */
    function getFromCache() {
        $this->log .= "getFromCache() called<br />";
        //create file pointer
        if (!$fp=@fopen($this->filename,"r")) {
            $this->log .= "Could not open ".$this->filename."<br />";
            return false;
        }
        return $fp;
    }
    
    /*
    PRIVATE saveToCache() method to save content to cache file.
    */
    function saveToCache() {
        $this->log .= "saveToCache() called<br />";
        
        //create file pointer
        if (!$fp=@fopen($this->filename,"w")) {
            $this->log .= "Could not open ".$this->filename."<br />";
            return false;
        }
        //write to file
        if (!@fwrite($fp,$this->header."\r\n".$this->body)) {
            $this->log .= "Could not write to ".$this->filename."<br />";
            fclose($fp);
            return false;
        }
        //close file pointer
        fclose($fp);
        return true;
    }
    
    /*
    PRIVATE getFile_ts() method to get cache file modified date.
    */
    function getFile_ts() {
        $this->log .= "getFile_ts() called<br />";
        if (!file_exists($this->filename)) {
            $this->data_ts = 0;
            $this->log .= $this->filename." does not exist<br />";
            return false;
        }
        $this->data_ts = filemtime($this->filename);
        return true;
    }
    
    /*
    Static method table_into_array()
    Generic function to return data array from HTML table data
    rawHTML: the page source
    needle: optional string to start parsing source from
    needle_within: 0 = needle is BEFORE table, 1 = needle is within table
    allowed_tags: list of tags to NOT strip from data, e.g. "<a><b>"
    */
    static function table_into_array($rawHTML,$needle="",$needle_within=0,$allowed_tags="") {
        $upperHTML = strtoupper($rawHTML);
        $idx = 0;
        if (strlen($needle) > 0) {
            $needle = strtoupper($needle);
            $idx = strpos($upperHTML,$needle);
            if ($idx === false) { return false; }
            if ($needle_within == 1) {
                $cnt = 0;
                while(($cnt < 100) && (substr($upperHTML,$idx,6) != "<TABLE")) {
                    $idx = strrpos(substr($upperHTML,0,$idx-1),"<");
                    $cnt++;
                }
            }
        }
        $aryData = array();
        $rowIdx = 0;
        /*    If this table has a header row, it may use TD or TH, so
        check special for this first row. */
        $tmp = strpos($upperHTML,"<TR",$idx);
        if ($tmp === false) { return false; }
        $tmp2 = strpos($upperHTML,"</TR>",$tmp);
        if ($tmp2 === false) { return false; }
        $row = substr($rawHTML,$tmp,$tmp2-$tmp);
        $pattern = "/<TH>|<TH\ |<TD>|<TD\ /";
        $hdrTag = preg_match($pattern,strtoupper($row),$matches) ? $matches[0] : "";
        
        /* Parentheses matter here: without them, $tmp receives the boolean result
        of the !== comparison instead of the position of the tag. */
        while ($hdrTag != "" && ($tmp = strpos(strtoupper($row),$hdrTag)) !== false) {
            $tmp = strpos(strtoupper($row),">",$tmp);
            if ($tmp === false) { return false; }
            $tmp++;
            $tmp2 = strpos(strtoupper($row),"</T",$tmp);
            if ($tmp2 === false) { break; }
            $aryData[$rowIdx][] = trim(strip_tags(substr($row,$tmp,$tmp2-$tmp),$allowed_tags));
            $row = substr($row,$tmp2+5);
            $hdrTag = preg_match($pattern,strtoupper($row),$matches) ? $matches[0] : "";
        }
        $idx = strpos($upperHTML,"</TR>",$idx)+5;
        $rowIdx++;
        
        /* Now parse the rest of the rows. */
        $tmp = strpos($upperHTML,"<TR",$idx);
        if ($tmp === false) { return false; }
        $tmp2 = strpos($upperHTML,"</TABLE>",$idx);
        if ($tmp2 === false) { return false; }
        $table = substr($rawHTML,$tmp,$tmp2-$tmp);
        
        while (($tmp = strpos(strtoupper($table),"<TR")) !== false) {
            $tmp2 = strpos(strtoupper($table),"</TR",$tmp);
            if ($tmp2 === false) { return false; }
            $row = substr($table,$tmp,$tmp2-$tmp);
            
            while (($tmp = strpos(strtoupper($row),"<TD")) !== false) {
                $tmp = strpos(strtoupper($row),">",$tmp);
                if ($tmp === false) { return false; }
                $tmp++;
                $tmp2 = strpos(strtoupper($row),"</TD",$tmp);
                if ($tmp2 === false) { break; }
                $aryData[$rowIdx][] = trim(strip_tags(substr($row,$tmp,$tmp2-$tmp),$allowed_tags));
                $row = substr($row,$tmp2+5);
            }
            $table = substr($table,strpos(strtoupper($table),"</TR>")+5);
            $rowIdx++;
        }
        return $aryData;
    }
    
    /*
    Static method table_into_xml()
    Generic function to return xml dataset from HTML table data
    rawHTML: the page source
    needle: optional string to start parsing source from
    needle_within: 0 = needle is BEFORE table, 1 = needle is within table
    allowedTags: list of tags to NOT strip from data, e.g. "<a><b>"
    */
    static function table_into_xml($rawHTML,$needle="",$needle_within=0,$allowedTags="") {
        if (!$aryTable = http::table_into_array($rawHTML,$needle,$needle_within,$allowedTags)) { return false; }
        $xml = "<?xml version=\"1.0\" standalone=\"yes\" ?>\n";
        $xml .= "<TABLE>\n";
        $rowIdx = 0;
        foreach ($aryTable as $row) {
            $xml .= "\t<ROW id=\"".$rowIdx."\">\n";
            $colIdx = 0;
            foreach ($row as $col) {
                $xml .= "\t\t<COL id=\"".$colIdx."\">".trim(utf8_encode(htmlspecialchars($col)))."</COL>\n";
                $colIdx++;
            }
            $xml .= "\t</ROW>\n";
            $rowIdx++;
        }
        $xml .= "</TABLE>";
        return $xml;
    }
}

?>

=====================================================
example.php
=====================================================
<?php
/*
* example.php
* class_http.php example usage
* Author: Troy Wolf (troy@troywolf.com)
* Comments: Please be a good neighbor when screen-scraping. Don't write code
            that will needlessly make hits to third-party websites. Use
            class_http's caching feature whenever possible. It is designed to
            make you a good neighbor!
*/

/*
Include the http class. Modify path according to where you put the class
file.
*/
require_once(dirname(__FILE__).'/class_http.php');

/* -----------------------------------------------------------------------------
Example to screen-scrape the Google home page without caching.
----------------------------------------------------------------------------- */
/* First, instantiate a new http object. */
$h = new http();

/* Screen-scrape a url. */
if (!$h->fetch("http://www.google.com")) {
  /*
  The class has a 'log' property that contains a log of events. This log is
  useful for testing and debugging.
  */
  echo "<h2>There is a problem with the http request!</h2>";
  echo $h->log;
  exit();
}

/* Echo out the body content fetched from the url. */
echo $h->body;

/* If you just want to know the HTTP status code: */
echo "Status: ".$h->status;

/* If you are interested in seeing all the response headers: */
echo "<pre>".$h->header."</pre>";


/* -----------------------------------------------------------------------------
Example to screen-scrape the MSFT stock page at moneycentral.com WITH caching.
----------------------------------------------------------------------------- */
/* First, instantiate a new http object. */
$h = new http();

/*
Pass a TTL (Time-to-Live) in seconds as the second argument to fetch(). A value
of 600 means your site will not hit the source site more than once every 10
minutes. This makes your page faster and makes you a better neighbor to the
external site. Note that fetch() overwrites the 'ttl' property with its own
argument, so setting $h->ttl beforehand has no effect.
*/
$ttl = 600;

/*
You can give the request a 'name' as the third argument to fetch(). It will be
used in the cache file name. If you don't name the request, an MD5 hash of the
url will be used.
*/
#$name = "msft_quote";

/* Screen-scrape a url. */
if (!$h->fetch("http://moneycentral.msn.com/detail/stock_quote?Symbol=MSFT", $ttl)) {
  /*
  The class has a 'log' property that contains a log of events. This log is
  useful for testing and debugging.
  */
  echo "<h2>There is a problem with the http request!</h2>";
  echo $h->log;
  exit();
}

/* Echo out the body content fetched from the url. */
echo $h->body;


/* -----------------------------------------------------------------------------
Example to extract a specific table of data out of scraped content. The class
comes with 2 static methods you can use for this purpose.
  table_into_array() will rip a single table into an array.
  table_into_xml() will internally call table_into_array() then create an
  XML document from the array. I thought this would be cool, but in practice,
  I've never used this method since the array is so easy to work with.

This example builds on the previous example to extract the MSFT stats out
of the body content. Read the comments in the class file to learn how to use
this static method.
----------------------------------------------------------------------------- */
$msft_stats = http::table_into_array($h->body, "Avg Daily Volume", 1, null);

/* Print out the array so you can see the stats data. */
echo "<pre>";
print_r($msft_stats);
echo "</pre>";


/* -----------------------------------------------------------------------------
Scraping content that is username/password protected. The class can do basic
authentication. Pass your username and password in like this:
----------------------------------------------------------------------------- */
#$h->fetch("http://someprivatesite.net", 0, "", "MyUserName", "MyPassword");


/* -----------------------------------------------------------------------------
If you need to access content on a port other than 80, just put the port in
the URL in the standard way:
----------------------------------------------------------------------------- */
#$h->fetch("http://somedomain.org:8088");


/* -----------------------------------------------------------------------------
Example of using the image_cache.php companion script to cache images. Why not
just link directly to a neighbor's images? If your site has a lot of traffic,
that's a lot of hits to your neighbor's site. So why not just copy their image
to your own server? That's fine for images that do not change, but some sites
create dynamic images such as stock charts that are generated new every minute.
image_cache.php in conjunction with class_http.php makes it easy to directly
link to third-party images and cache the image data for whatever ttl makes
sense for your application.

In this example, we will cache the chart image found at this moneycentral page:
http://moneycentral.msn.com/investor/charts/chartdl.asp?FC=1&Symbol=MSFT&CA=1&CB=1&CC=1&CD=1&CP=0&PT=5
You have to look at the page source code to find the url to their image. Then
you url encode their image url, and pass it as a parameter to image_cache.php.
----------------------------------------------------------------------------- */
?>

<img src="image_cache.php?ttl=60&url=http%3A%2F%2Fdata.moneycentral.msn.com%2Fscripts%2Fchrtsrv.dll%3FSymbol%3DMSFT%26C1%3D0%26C2%3D1%26C9%3D2%26CA%3D1%26CB%3D1%26CC%3D1%26CD%3D1%26CF%3D0%26EFR%3D236%26EFG%3D246%26EFB%3D254%26E1%3D0" width="448" height="300" alt="Chart Graphic" />

<?php


/*
The class has a 'log' property that is very useful for testing and debugging.
During development, I suggest you always print this out so you can see what is
happening.
*/
echo "<h3>http log</h3>";
echo $h->log;

?>


=====================================================
image_cache.php
=====================================================
<?php
/*
* Filename.......: image_cache.php
* Author.........: Troy Wolf [troy@troywolf.com]
* Last Modified..: Date: 2005/06/21 10:30:00
* Description....: Companion script to class_http.php. When used in conjunction
                   with class_http.php, can be used to "screen-scrape" images
                   and cache them locally for any number of seconds. You use
                   this script in-line within img tags like so:
<img src="image_cache.php?ttl=300&url=http%3A%2F%2Fwww.somedomain.com%2Fsomeimage.gif" />
                  (You must url encode the url within the src attribute.)
*/

/*
Include the http class. Modify path according to where you put the class
file.
*/
require_once(dirname(__FILE__).'/class_http.php');

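$h = new http();
/* fetch() takes the TTL as its second argument and overwrites the 'ttl' property. */
$ttl = isset($_GET['ttl']) ? $_GET['ttl'] : 0;
$h->fetch($_GET['url'], $ttl);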
header("Content-Type: image/jpeg");
echo $h->body;
?>

Samir

I'm not a programmer, but I have a coding problem in front of me where I think this snippet could help.

I go to a particular link and it returns a random image via a url to that image. I don't want the image--I want the url to the image in a variable.

Can this piece of code do this, and if so, what would an example look like? I think I could figure it out from there.
