Folks,
This is absurd!
As you know, there are crawler scripts on the internet that navigate to a page and extract all of its html links (hrefs).
Code such as this one:
//Sitemap Protocol: https://www.sitemaps.org/protocol.html
include_once('simplehtmldom_1_9_1/simple_html_dom.php');
//WORKS.
//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
//FAILS. Shows blank page.
$sitemap = "https://bytenota.com/sitemap.xml";
$html = new simple_html_dom();
$html->load_file($sitemap);
foreach($html->find("loc") as $link)
{
echo $link->innertext."<br>";
}
And there are those that extract links from xml files.
Like this one:
//Sitemap Crawler: If starting url is an xml file listing further xml files then it will show blank page and not visit the found xml files to extract links from them.
//Sitemap Protocol: https://www.sitemaps.org/protocol.html
// sitemap url or sitemap file
//FAILS.
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
//WORKS
$sitemap = "https://bytenota.com/sitemap.xml";
//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
// get sitemap content
$content = file_get_contents($sitemap);
// parse the sitemap content to object
$xml = simplexml_load_string($content);
// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement)
{
// get properties
$url = $urlElement->loc;
$lastmod = $urlElement->lastmod;
$changefreq = $urlElement->changefreq;
$priority = $urlElement->priority;
// print out the properties
echo 'url: '. $url . '<br>';
echo 'lastmod: '. $lastmod . '<br>';
echo 'changefreq: '. $changefreq . '<br>';
echo 'priority: '. $priority . '<br>';
echo '<br>---<br>';
}
But guess what?
Neither of these works if you point the crawler at an xml sitemap that lists further xml links or sitemaps.
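To show what I mean, here is a minimal sketch using an inline sitemap index instead of a live site. The example.com URLs are made up for illustration. As far as I understand the Sitemap Protocol, an index file uses a sitemapindex root with sitemap children rather than a urlset root with url children, so a loop over $xml->url has nothing to iterate:

```php
<?php
// Inline sample of a sitemap index (made-up URLs, for illustration only).
$indexXml = '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>';

// parse the inline sample to an object
$xml = simplexml_load_string($indexXml);

// name of the root element, and how many <url> / <sitemap> children it has
$rootName = $xml->getName();
$urlCount = count($xml->url);
$sitemapCount = count($xml->sitemap);

echo "root: $rootName, url entries: $urlCount, sitemap entries: $sitemapCount\n";
```

So the root element name looks like a more reliable signal than anything else for telling the two kinds of file apart.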
So I am trying to build my own crawler: when I point it at an xml sitemap, it should check whether the listed links are page links (hrefs) or links to further xml sitemaps.
What I did first was get my crawler to navigate to an xml file.
Now I want it to extract all found links and check whether the found links are hrefs or further xml links.
If the links are hrefs, then add them to the $extracted_urls array.
Else add them to the $crawl_xml_files array.
So later on, the crawler can crawl those extracted href & xml links.
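Roughly, the flow I have in mind could be sketched like this. It is only a sketch under assumptions: crawl_sitemaps(), the $fetch callable and the $maxFiles limit are names I made up, the example.com documents are canned stand-ins so it runs without hitting a real site, and it sorts links by the sitemap and url element names rather than by file extension:

```php
<?php
/**
 * Hypothetical sketch: walk a sitemap tree breadth-first.
 * $fetch is an assumed callable that returns the XML text for a URL (or false).
 */
function crawl_sitemaps(string $startUrl, callable $fetch, int $maxFiles = 50): array
{
    $crawl_xml_files = [$startUrl]; // xml sitemaps still to visit
    $extracted_urls = [];           // page links found so far
    $seen = [];                     // xml files already visited

    while ($crawl_xml_files !== [] && count($seen) < $maxFiles) {
        $current = array_shift($crawl_xml_files);
        if (isset($seen[$current])) {
            continue; // avoid crawling the same file twice
        }
        $seen[$current] = true;

        $content = $fetch($current);
        $xml = is_string($content) ? simplexml_load_string($content) : false;
        if ($xml === false) {
            continue; // could not fetch or parse this file
        }

        // links to further xml sitemaps (sitemap index files)
        foreach ($xml->sitemap as $sitemapElement) {
            $crawl_xml_files[] = trim((string) $sitemapElement->loc);
        }
        // links to the site's pages (regular sitemaps)
        foreach ($xml->url as $urlElement) {
            $extracted_urls[] = trim((string) $urlElement->loc);
        }
    }
    return $extracted_urls;
}

// Stand-in for file_get_contents(): canned documents keyed by made-up URLs.
$ns = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"';
$fake = [
    'https://example.com/sitemap_index.xml' =>
        "<sitemapindex $ns><sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap></sitemapindex>",
    'https://example.com/post-sitemap.xml' =>
        "<urlset $ns><url><loc>https://example.com/hello-world/</loc></url></urlset>",
];
$fetch = fn(string $url) => $fake[$url] ?? false;

$urls = crawl_sitemaps('https://example.com/sitemap_index.xml', $fetch);
print_r($urls);
```

Against a live site, I assume 'file_get_contents' could be passed as the $fetch argument in place of the canned array.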
Now I am stuck: the code fails to echo the extensions of the links found on the initially navigated page.
It also fails to add any links to the respective arrays.
Here is the code. Test it and see for yourself where I am going wrong. I am scratching my head.
My NON-WORKING CODE:
//Sitemap Crawler: If starting url is an xml file listing further xml files then it will show blank page and not visit the found xml files to extract links from them.
//Sitemap Protocol: https://www.sitemaps.org/protocol.html
//$sitemap = 'https://www.rocktherankings.com/post-sitemap.xml';
//$sitemap = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
$sitemap = 'https://bytenota.com/sitemap.xml';
//$sitemap = 'https://www.daniweb.com/home-sitemap.xml';
// get sitemap content
//$sitemap = 'sitemap.xml';
// get sitemap content
$content = file_get_contents($sitemap);
// parse the sitemap content to object
$xml = simplexml_load_string($content);
//var_dump($xml);
// Init arrays
$crawl_xml_files = [];
$extracted_urls = [];
$extracted_last_mods = [];
$extracted_changefreqs = [];
$extracted_priorities = [];
// retrieve properties from the sitemap object
foreach ($xml->url as $urlElement) {
// provide path of current xml/html file
$path = (string)$urlElement->loc;
// get pathinfo
$ext = pathinfo($path, PATHINFO_EXTENSION);
echo 'The extension is: ' . $ext;
echo '<br>'; //DELETE IN DEV MODE
echo $urlElement; //DELETE IN DEV MODE
if ($ext == 'xml') //This means, the links found on the current page are not links to the site's webpages but links to further xml sitemaps. And so need the crawler to go another level deep to hunt for the site's html pages.
{
echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE
//Add Xml Links to array.
$crawl_xml_files[] = $path;
} elseif ($ext == 'html' || $ext == 'htm' || $ext == 'shtml' || $ext == 'shtm' || $ext == 'php' || $ext == 'py') //This means, the links found on the current page are the site's html pages and are not links to further xml sitemaps.
{
echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE
//Add hrefs to array.
//$extracted_urls[] = $path;
// get properties
$extracted_urls[] = $extracted_url = $urlElement->loc; //Add hrefs to array.
$extracted_last_mods[] = $extracted_lastmod = $urlElement->lastmod; //Add lastmod to array.
$extracted_changefreqs[] = $extracted_changefreq = $urlElement->changefreq; //Add changefreq to array.
$extracted_priorities[] = $extracted_priority = $urlElement->priority; //Add priority to array.
}
}
var_dump($crawl_xml_files); //Print all extracted Xml Links.
var_dump($extracted_urls); //Print all extracted hrefs.
var_dump($extracted_last_mods); //Print all extracted last mods.
var_dump($extracted_changefreqs); //Print all extracted changefreqs.
var_dump($extracted_priorities); //Print all extracted priorities.
foreach($crawl_xml_files as $crawl_xml_file)
{
echo 'Xml File to crawl: ' .$crawl_xml_file; //Print all extracted Xml Links.
}
echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE
foreach($extracted_urls as $extracted_url)
{
echo 'Extracted Url: ' .$extracted_url; //Print all extracted hrefs.
}
echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE
foreach($extracted_last_mods as $extracted_last_mod)
{
echo 'Extracted last Mod: ' .$extracted_last_mod; //Print all extracted last mods.
}
echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE
foreach($extracted_changefreqs as $extracted_changefreq)
{
echo 'Extracted Change Frequency: ' .$extracted_changefreq; //Print all extracted changefreqs.
}
echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE
foreach($extracted_priorities as $extracted_priority)
{
echo 'Extracted Priority: ' .$extracted_priority; //Print all extracted priorities.
}
echo __LINE__;
echo '<br>'; //DELETE IN DEV MODE
How do I fix this?
This is what gets echoed:
The extension is:
The extension is:
The extension is:
The extension is:
The extension is:
The extension is:
C:\wamp64\www\Work\buzz\Templates\crawler_Test.php:66:
array (size=0)
empty
C:\wamp64\www\Work\buzz\Templates\crawler_Test.php:67:
array (size=0)
empty
C:\wamp64\www\Work\buzz\Templates\crawler_Test.php:68:
array (size=0)
empty
C:\wamp64\www\Work\buzz\Templates\crawler_Test.php:69:
array (size=0)
empty
C:\wamp64\www\Work\buzz\Templates\crawler_Test.php:70:
array (size=0)
empty
77
85
93
101
109
Obviously, I get tonnes of lines of ...
The extension is:
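One thing I could check is what pathinfo() returns for links that have no file extension at all. This is a small sketch under an assumption I have not confirmed: that the loc entries in that sitemap are extension-less permalinks such as https://example.com/some-post/. The url_extension() helper and the example.com URLs are made up for illustration. It runs parse_url() first so that a query string does not end up inside the extension:

```php
<?php
// Hypothetical helper: file extension of the path part of a URL ('' if none).
function url_extension(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return ''; // no path component, or a malformed URL
    }
    return strtolower(pathinfo($path, PATHINFO_EXTENSION));
}

foreach ([
    'https://example.com/some-post/',
    'https://example.com/post-sitemap.xml',
    'https://example.com/page.php?id=1',
] as $url) {
    echo $url . ' => "' . url_extension($url) . '"' . "\n";
}
```

If the first URL prints an empty extension, that would match the blank "The extension is:" lines above, since neither the xml branch nor the html branch would ever match such a link.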