Hi, I'm having two technical difficulties with a web search engine I have created. The first is that the search bot sometimes reaches a URL and never stops loading it. Below is the script I currently use to validate a URL and capture its contents, but my question is: how do you time how long file_get_contents() takes to load a URL, and if it takes longer than 10 seconds, how do you make it skip that URL automatically and continue with the rest?
<?php
// Check that a URL responds before fetching it (HEAD-style request via cURL).
function url_exists($url) {
    $handle = curl_init($url);
    if (false === $handle) {
        return false;
    }
    curl_setopt($handle, CURLOPT_HEADER, false);
    curl_setopt($handle, CURLOPT_FAILONERROR, true); // treat HTTP 4xx/5xx as failure
    curl_setopt($handle, CURLOPT_HTTPHEADER, array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15")); // request as if Firefox
    curl_setopt($handle, CURLOPT_NOBODY, true);          // headers only, no body
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, false);
    $connectable = curl_exec($handle);
    curl_close($handle);
    return $connectable;
}

$address = 'http://www.google.com.au/';
if (url_exists($address)) {
    file_get_contents($address);
}

$address = 'http://www.example.com/';
if (url_exists($address)) {
    file_get_contents($address);
}
?>
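One approach I'm aware of is passing a stream context to file_get_contents(): the `timeout` option in the `http` context caps how long the call may spend, and microtime() can measure the elapsed time. A minimal sketch, assuming a 10-second limit and using example.com as a stand-in URL:

```php
<?php
// Cap file_get_contents() at 10 seconds via a stream context.
$context = stream_context_create(array(
    'http' => array(
        'timeout' => 10, // give up after 10 seconds
    ),
));

$start   = microtime(true);
$html    = @file_get_contents('http://www.example.com/', false, $context);
$elapsed = microtime(true) - $start;

if ($html === false) {
    // Timed out or failed: skip this URL and move on to the next one.
    echo "Skipped after {$elapsed}s\n";
} else {
    echo "Fetched in {$elapsed}s\n";
}
?>
```

Since the script already uses cURL for validation, an alternative would be setting CURLOPT_TIMEOUT (and CURLOPT_CONNECTTIMEOUT) on the handle and fetching with cURL instead of file_get_contents().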
My second question, which I thought I would ask while I'm posting: how do I correct the preg_replace below so it reduces any URL to just the domain? The code only works about half the time; the rest of the time it returns the entire URL.
<?php
$domain = @preg_replace('/((http\:\/\/|https\:\/\/)?.*(\.[a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]\/))(.*)(\.[a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]\/)?(.*)?(\.[a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z]\/|\.[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]\/)?(.*)?/i', '$1', $url);
?>
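Rather than patching the regex, PHP's built-in parse_url() isolates the host part of a well-formed URL reliably. A sketch, assuming the hypothetical helper name extract_domain and that "just the domain" means the hostname without scheme or path:

```php
<?php
// Return the host portion of a URL, or false if it cannot be determined.
function extract_domain($url) {
    // parse_url() only finds a host when a scheme is present, so add one if missing.
    if (!preg_match('~^https?://~i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    return ($host === null || $host === false) ? false : $host;
}

echo extract_domain('http://www.google.com.au/search?q=php'); // www.google.com.au
echo "\n";
echo extract_domain('example.com/some/page.html');            // example.com
?>
```

This sidesteps the pattern-matching entirely, and handles two-, three-, and four-letter TLDs (and multi-part ones like .com.au) without special cases.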