Folks,
I am trying to build a crawler using DOMDocument. When I feed it a starting URL (the page where crawling and link extraction begin), it should fetch that page and extract all the links found on it.
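To make the link-extraction step concrete, here is a minimal sketch of what I mean. The function name `extractLinks` and the sample HTML are placeholders I made up for illustration; a real crawler would fetch the page first and would also need to resolve relative URLs, which this sketch does not do.

```php
<?php
// Hypothetical helper: returns every non-empty href found in <a> tags of an HTML string.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Sample markup standing in for a fetched page (assumed data).
$html = '<html><body><a href="https://example.com/a">A</a><a href="/b">B</a><a>no href</a></body></html>';
print_r(extractLinks($html));
```

Anchors without an href are skipped, so only the two real links come back.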
Here is the sitemap-parsing code I have so far:
<?php
// $sitemapUrl is assumed to hold the sitemap's URL (set elsewhere).
$xmlString = file_get_contents($sitemapUrl); // Should I stick with this line or the simplexml line below?
if ($xmlString === false) {
    exit('Could not fetch the sitemap.');
}
// Parse the sitemap content into an object.
// simplexml_load_string() expects XML text, not a URL, so it gets the fetched string.
$xml = simplexml_load_string($xmlString); // Should I stick with this line or the line above?
$dom = new DOMDocument();
// loadXML() also takes the XML string, not the SimpleXMLElement.
$dom->loadXML($xmlString);
// The document node itself is named '#document'; the root element carries the tag name.
$rootName = $dom->documentElement->nodeName;
if ($rootName === 'sitemapindex')
{
    // Parse the index: each <sitemap> child points to another sitemap file.
    foreach ($xml->sitemap as $sitemapElement) // Extracts sitemap URLs.
    {
        // get properties (index entries only carry <loc> and <lastmod>)
        $url = $sitemapElement->loc;
        $lastmod = $sitemapElement->lastmod;
        // print out the properties
        echo 'url: '. $url . '<br>';
        echo 'lastmod: '. $lastmod . '<br>';
        echo '<br>---<br>';
    }
}
else if ($rootName === 'urlset')
{
    // Parse the URL set: each <url> child describes one page.
    foreach ($xml->url as $urlElement) // Extracts HTML file URLs.
    {
        // get properties
        $url = $urlElement->loc;
        $lastmod = $urlElement->lastmod;
        $changefreq = $urlElement->changefreq;
        $priority = $urlElement->priority;
        // print out the properties
        echo 'url: '. $url . '<br>';
        echo 'lastmod: '. $lastmod . '<br>';
        echo 'changefreq: '. $changefreq . '<br>';
        echo 'priority: '. $priority . '<br>';
        echo '<br>---<br>';
    }
}
Now, how do I write code to extract meta tags using DOMDocument?
Where on this page can I find example code for that?
https://www.php.net/domdocument
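To show what I am after, here is a minimal sketch of meta tag extraction with DOMDocument. The function name `extractMetaTags` and the sample HTML are my own placeholders, and I am assuming it is enough to key each tag by its `name` attribute (or `property`, as Open Graph tags use) and keep its `content`.

```php
<?php
// Hypothetical helper: maps each meta tag's name/property to its content value.
function extractMetaTags(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings from imperfect HTML while parsing.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $meta = [];
    foreach ($dom->getElementsByTagName('meta') as $tag) {
        // Standard tags use name="..."; Open Graph tags use property="...".
        $key = $tag->getAttribute('name') ?: $tag->getAttribute('property');
        if ($key !== '') {
            // Lowercase the key so "Description" and "description" match.
            $meta[strtolower($key)] = $tag->getAttribute('content');
        }
    }
    return $meta;
}

// Sample markup standing in for a fetched page (assumed data).
$html = '<html><head><meta name="Description" content="Demo page">'
      . '<meta property="og:title" content="Demo"><meta charset="utf-8"></head><body></body></html>';
print_r(extractMetaTags($html));
```

Tags with neither `name` nor `property`, such as the charset declaration, are skipped in this sketch.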