Greetings. I am trying to scrape data from search results in a library catalog, but the script returns nothing at all. The same script (below) works fine against another catalog, but not against this one. (It's a Voyager catalog by Ex Libris, in case that helps.)
For simplicity, below is a boiled-down version of the script with all scraping functions removed. The script runs on this page.
As you might already know, many library catalogs generate session URLs, but that is not the issue in this case: the script cannot even fetch the catalog's 'home page,' the first link above.
Is there a way to diagnose what the catalog server is sending that prevents its HTML from being returned? And, if so, which CURLOPT setting would overcome it?
Thank you for your thoughts!
<?php
function curl($url, $curl_data = null) {
    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER => true,
        CURLOPT_CONNECTTIMEOUT => 90,
        CURLOPT_TIMEOUT => 90,
        CURLOPT_MAXREDIRS => 10,
        CURLOPT_URL => $url,
        CURLOPT_HEADER => false,
        CURLOPT_ENCODING => "", // accept any encoding cURL supports
        CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13",
        CURLOPT_SSL_VERIFYHOST => 0,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_VERBOSE => true
    );
    // POST only when the caller supplies data; otherwise the request is a plain GET.
    if ($curl_data !== null) {
        $options[CURLOPT_POST] = true;
        $options[CURLOPT_POSTFIELDS] = $curl_data;
    }
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    $data = curl_exec($ch); // false on failure
    curl_close($ch);
    return $data;
}
//SETS UP A (STABLE) URL OF A SEARCH RESULTS PAGE:
$DDCnumber = 873;
$url = "http://pilot.passhe.edu:8042/cgi-bin/Pwebrecon.cgi?DB=local&CNT=90&Search_Arg=" . $DDCnumber . "&Search_Code=CALL%2B&submit.x=23&submit.y=23";
echo "The URL we'd like to scrape is " . $url . "<br />";
$results_page = curl($url);
if ($results_page != "") {echo "Something was retrieved"; }
?>
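One possible way to approach the diagnosis asked about above, as a minimal sketch rather than a confirmed fix: capture cURL's own error code, the HTTP status, and the verbose trace instead of only the response body. The function names `fetch_with_diagnostics` and `describe_result` are hypothetical helpers invented for this illustration, and the short timeouts are arbitrary choices.

```php
<?php
// Hypothetical helper: fetch a URL with a plain GET and collect diagnostics.
function fetch_with_diagnostics(string $url): array {
    // In-memory stream that receives cURL's verbose trace.
    $log = fopen('php://temp', 'w+');
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 15,   // assumed value, short enough to fail fast
        CURLOPT_TIMEOUT        => 30,   // assumed value
        CURLOPT_VERBOSE        => true,
        CURLOPT_STDERR         => $log, // send the verbose trace to $log
    ]);
    $body = curl_exec($ch);             // false on failure
    $result = [
        'body'      => $body,
        'errno'     => curl_errno($ch), // 0 means the transfer itself succeeded
        'error'     => curl_error($ch),
        'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
    ];
    curl_close($ch);
    rewind($log);
    $result['verbose'] = stream_get_contents($log);
    fclose($log);
    return $result;
}

// Hypothetical helper: turn the collected diagnostics into a one-line summary.
function describe_result(array $r): string {
    if ($r['errno'] !== 0) {
        return "cURL error {$r['errno']}: {$r['error']}";
    }
    if ($r['http_code'] >= 400) {
        return "HTTP {$r['http_code']} from server";
    }
    if ($r['body'] === '') {
        return "HTTP {$r['http_code']} with an empty body";
    }
    return "HTTP {$r['http_code']}, " . strlen($r['body']) . " bytes";
}
```

Calling `echo describe_result(fetch_with_diagnostics($url));` and then printing the `verbose` entry should show whether the request fails before any HTTP exchange (a non-zero `errno`, such as 7 for a refused connection or 28 for a timeout) or whether the server answers with a status or redirect that the script does not handle. Which of these applies to this catalog is not something the post establishes.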