Can anyone see anything wrong with this? I'm stuck :(

$sql1 = mysql_query("DELETE FROM spider WHERE url='$addtolist'");

if (!mysql_query($sql1,$con))
  {
  die('Error Deleting: ' . mysql_error());
  }
echo "Record Deleted <br />";

$sql2="INSERT INTO list (title, url, description)
VALUES
('$title','$url','$description')";

if (!mysql_query($sql2,$con))
  {
  die('Error Adding: ' . mysql_error());
  }
echo "1 record added";

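The first line you posted already runs the query, so $sql1 holds the return value of mysql_query() (TRUE or FALSE) rather than an SQL string. The if statement then passes that value to mysql_query() a second time, which fails because it is not valid SQL, so the die() is triggered even though the row has already been deleted. Change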

$sql1 = mysql_query("DELETE FROM spider WHERE url='$addtolist'");

to

$sql1 = "DELETE FROM spider WHERE url='$addtolist'";

Thanks for that.

Also, when people submit data there is a lot of duplicate data being submitted. How can I filter through everything to avoid this?

From what I have seen of your posts, your database contains URLs.

You could use a regex to strip each URL down to its hostname, so that instead of http://www.something.com/somepage.somefile you have something.com, and store that in another column in the table. On submission, strip the submitted URL the same way and check whether the result is already in the db; if it is, throw an error.
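As a rough illustration of the stripping step described above, here is a minimal sketch. It swaps the regex for PHP's built-in parse_url() so that https:// URLs are handled as well; normalize_hostname() is a hypothetical helper name, not something from the thread, and dropping a leading "www." is an assumption about what counts as the same site.

```php
<?php
// Hypothetical helper (not from the thread): reduce a URL to a bare hostname,
// e.g. http://www.something.com/somepage.somefile becomes something.com.
function normalize_hostname(string $url): ?string
{
    // parse_url() only recognises the host when a scheme is present,
    // so prepend one if the submitted URL lacks it.
    if (!preg_match('@^[a-z][a-z0-9+.-]*://@i', $url)) {
        $url = 'http://' . $url;
    }

    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // no usable host in this input
    }

    // Lower-case and drop a leading "www." so both spellings compare equal.
    // Assumption: www.something.com and something.com are the same site.
    return preg_replace('/^www\./', '', strtolower($host));
}

echo normalize_hostname('http://www.something.com/somepage.somefile'), "\n";
echo normalize_hostname('https://Something.com/other'), "\n";
```

The returned value is what would go into the extra hostname column and be compared on each new submission.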

I can get the domain name using this:

preg_match('@^(?:http://)?([^/]+)@i',
    $addtolist, $hostname);
$site = $hostname[1];

But then where would I go from here?

Sorry, I've not done any work with MySQL before.

Make another column in the database for hostname (call it whatever you like) and store the result of $hostname[1] in there.

Then when a new URL is submitted, get the hostname, and see if it already exists in the database, for example:

preg_match ('@^(?:http://)?([^/]+)@i', $addtolist, $hostname);
$site = $hostname[1];
$result = mysql_query("SELECT `hostname` FROM `spider` WHERE `hostname` = '$site'");
if (mysql_num_rows($result) != 0) {
  // Hostname already in db
} else {
  // Not in the db
}
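The check above interpolates $site straight into the SQL string, and the mysql_* functions it relies on were removed in PHP 7. Below is a hedged sketch of the same "look up the hostname, insert only if it is new" idea using PDO prepared statements. To keep it runnable without a server it assumes an in-memory SQLite database (which needs the pdo_sqlite extension); for MySQL you would build the PDO object with a mysql: DSN instead. The function names hostname_exists() and add_if_new_host() are hypothetical, and the spider table here is a simplified stand-in for the real one.

```php
<?php
// Sketch only: in-memory SQLite stands in for the MySQL server in the thread.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE spider (url TEXT, hostname TEXT)');

// Returns true when at least one row already stores this hostname.
function hostname_exists(PDO $db, string $hostname): bool
{
    $stmt = $db->prepare('SELECT 1 FROM spider WHERE hostname = ? LIMIT 1');
    $stmt->execute([$hostname]);
    return $stmt->fetchColumn() !== false; // false means no matching row
}

// Inserts the URL unless its hostname is already stored.
// Returns true if a row was inserted, false if it was skipped.
function add_if_new_host(PDO $db, string $url, string $hostname): bool
{
    if (hostname_exists($db, $hostname)) {
        return false;
    }
    $stmt = $db->prepare('INSERT INTO spider (url, hostname) VALUES (?, ?)');
    $stmt->execute([$url, $hostname]);
    return true;
}

var_dump(add_if_new_host($db, 'http://www.something.com/a', 'www.something.com'));
var_dump(add_if_new_host($db, 'http://www.something.com/b', 'www.something.com'));
```

The placeholders let the driver handle quoting, so a URL containing a quote character cannot break the query.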

Right, OK. I understand it but I'm reeeaaaally confused :|

This is what I've come up with. What do you make of it?

$result = mysql_query("SELECT url FROM blacklist WHERE url = '$hostname'");

for ($i = 0; $i < $hrefs->length; $i++) {
	$href = $hrefs->item($i);
	$url = $href->getAttribute('href');

	if (mysql_num_rows($result) != 0) {
		echo 'URL already in DB';
	} else {
		echo $url.'<br />';
		$query = "INSERT INTO spider (url, site) VALUES ('$url', '$hostname')";
		mysql_query($query) or die('Error, insert query failed');
		$query1 = "INSERT INTO blacklist (url) VALUES ('$url')";
		mysql_query($query1) or die('Error, Blacklisting failed');
	}
}

I am genuinely trying. I'm not just taking advantage here :p

Any chance you can post some more of the script, for example what $hrefs contains?

I don't think that will work, but if you post up some more it would be easier to provide a decent response/suggestion.

Yeah sure. Essentially it's a spider. It finds all the links on a page :)

<?php
include('includes/config.php');

//Collect URLS

$target_url = $_GET['url'];

//Get host name from URL
preg_match('@^(?:http://)?([^/]+)@i',
    $target_url, $hostname);
$hostname = $hostname[1];


$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
	echo "<br />cURL error number:" .curl_errno($ch);
	echo "<br />cURL error:" . curl_error($ch);
	exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

$result = mysql_query("SELECT url FROM blacklist WHERE url = '$hostname'");

for ($i = 0; $i < $hrefs->length; $i++) {
	$href = $hrefs->item($i);
	$url = $href->getAttribute('href');

	if (mysql_num_rows($result) != 0) {
		echo 'URL already in DB';
	} else {
		echo $url.'<br />';
		$query = "INSERT INTO spider (url, site) VALUES ('$url', '$hostname')";
		mysql_query($query) or die('Error, insert query failed');
		$query1 = "INSERT INTO blacklist (url) VALUES ('$url')";
		mysql_query($query1) or die('Error, Blacklisting failed');
	}
}

?>

The last part should be

$result = mysql_query("SELECT url FROM blacklist WHERE url = '$hostname'");
if (mysql_num_rows($result) != 0) {
  echo 'URL already in DB';
} else {
  for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url.'<br />';
    $query = "INSERT INTO spider (url, site) VALUES ('$url', '$hostname')";
    mysql_query($query) or die('Error, insert query failed');
    $query1 = "INSERT INTO blacklist (url) VALUES ('$url')";
    mysql_query($query1) or die('Error, Blacklisting failed');
  }
}

Ah, we were getting mixed messages there. I needed to check if $url was already in the database, whereas you were checking for $hostname :p

I ended up with this if anyone else was curious...

for ($i = 0; $i < $hrefs->length; $i++) {
	$href = $hrefs->item($i);
	$url = $href->getAttribute('href');
	$result = mysql_query("SELECT * FROM blacklist WHERE url = '$url'");
	if (mysql_num_rows($result) != 0) {
		echo 'URL already in database';
	} else {
		echo $url.'<br />';
		$query = "INSERT INTO spider (url, site) VALUES ('$url', '$hostname')";
		mysql_query($query) or die('Error, insert query failed');
		$query1 = "INSERT INTO blacklist (url) VALUES ('$url')";
		mysql_query($query1) or die('Error, Blacklisting failed');
	}
}
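
The loop above runs one SELECT per link, and a single page often repeats the same link many times. As an optional refinement, here is a small sketch that collapses the href values to unique entries before any query is made. unique_hrefs() is a hypothetical helper, and it assumes the href values have already been collected into a plain array of strings.

```php
<?php
// Hypothetical helper: keep the first occurrence of each non-empty href,
// preserving the order in which the links appeared on the page.
function unique_hrefs(array $hrefs): array
{
    $seen = [];
    $out = [];
    foreach ($hrefs as $href) {
        $href = trim($href);
        if ($href === '' || isset($seen[$href])) {
            continue; // skip blanks and anything already collected
        }
        $seen[$href] = true;
        $out[] = $href;
    }
    return $out;
}

print_r(unique_hrefs(['/a', '/b', '/a', ' /b ', '']));
```

The database check is still needed for links stored by earlier runs; this only removes repeats within one page.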