Hi, I'm having two technical difficulties with a web search engine I have created. The first is that the search bot sometimes reaches a URL and never stops loading it. Below is the script I currently use to validate a URL and capture its contents, but my question is: how do you time how long file_get_contents() takes to load a URL, and if it takes longer than 10 seconds, how do you make it skip that URL automatically and continue with the rest?
<?php
// Check that a URL responds before fetching it (HEAD-style request via cURL).
function url_exists($url) {
    $handle = curl_init($url);
    if (false === $handle) {
        return false;
    }
    curl_setopt($handle, CURLOPT_HEADER, false);
    curl_setopt($handle, CURLOPT_FAILONERROR, true); // treat HTTP 4xx/5xx as failure
    curl_setopt($handle, CURLOPT_HTTPHEADER, array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15")); // request as if Firefox
    curl_setopt($handle, CURLOPT_NOBODY, true);          // headers only, no body
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, false);
    $connectable = curl_exec($handle);
    curl_close($handle);
    return $connectable;
}

$address = 'http://www.google.com.au/';
if (url_exists($address)) {
    file_get_contents($address);
}

$address = 'http://www.example.com/';
if (url_exists($address)) {
    file_get_contents($address);
}
?>
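One approach I'm aware of is passing a stream context to file_get_contents(): the `timeout` option in the `http` context caps how long the call may spend, and microtime() can measure the elapsed time. A minimal sketch, assuming a 10-second limit and using example.com as a stand-in URL:

```php
<?php
// Cap file_get_contents() at 10 seconds via a stream context.
$context = stream_context_create(array(
    'http' => array(
        'timeout' => 10, // give up after 10 seconds
    ),
));

$start   = microtime(true);
$html    = @file_get_contents('http://www.example.com/', false, $context);
$elapsed = microtime(true) - $start;

if ($html === false) {
    // Timed out or failed: skip this URL and move on to the next one.
    echo "Skipped after {$elapsed}s\n";
} else {
    echo "Fetched in {$elapsed}s\n";
}
?>
```

Since the script already uses cURL for validation, an alternative would be setting CURLOPT_TIMEOUT (and CURLOPT_CONNECTTIMEOUT) on the handle and fetching with cURL instead of file_get_contents().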
My second question, which I thought I would ask while I'm posting: how do I correct the preg_replace below so it reduces any URL to just the domain? The code only works about half the time; the rest of the time it returns the entire URL.
<?php
$domain = @preg_replace('/((http\:\/\/|https\:\/\/)?.*(\.[a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]\/))(.*)(\.[a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]\/)?(.*)?(\.[a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]\/)?(.*)?/i', '$1', $url);
?>
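Rather than patching the regex, PHP's built-in parse_url() isolates the host part of a well-formed URL reliably. A sketch, assuming the hypothetical helper name extract_domain and that "just the domain" means the hostname without scheme or path:

```php
<?php
// Return the host portion of a URL, or false if it cannot be determined.
function extract_domain($url) {
    // parse_url() only finds a host when a scheme is present, so add one if missing.
    if (!preg_match('~^https?://~i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    return ($host === null || $host === false) ? false : $host;
}

echo extract_domain('http://www.google.com.au/search?q=php'); // www.google.com.au
echo "\n";
echo extract_domain('example.com/some/page.html');            // example.com
?>
```

This sidesteps the pattern-matching entirely, and handles two-, three-, and four-letter TLDs (and multi-part ones like .com.au) without special cases.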