Bing AI Fixed My Crawler

Question

borobhaisab 117 Posting Whiz

2 Years Ago

@dani

I got ChatGpt to fix my Crawler that was showing error.
This code that was showing error:

My Buggy Code

<?php

//START OF SCRIPT FLOW.

//Preparing Crawler & Session: Initialising Variables.

//Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
//SiteMaps Details Scraped from SiteMaps or Xml Files.
$sitemaps  = []; //This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods  = []; //This will list dates of SiteMap pages last modified - found on Sitemaps.
$sitemaps_change_freqs  = []; //his will list SiteMap dates of html pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities  = []; //This will list SiteMap pages priorities - found on Sitemaps.

//Webpage Details Scraped from SiteMaps or Xml Files.
$html_page_urls  = []; //This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods  = []; //This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs  = []; //his will list dates of html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities  = []; //This will list html pages priorities - found on Sitemaps.

//Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
//Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_meta_names  = []; //This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions  = []; //This will list crawled pages Meta Tag Descriptions - found on html pages.
$html_page_titles  = []; //This will list crawled pages Titles - found on html pages.
// -----

//Step 1: Initiate Session - Feed Xml SiteMap Url. Crawing Starting Point.
//Crawl Session Starting Page/Initial Xml Sitemap. (NOTE: Has to be .xml SItemap).
//$initial_url = "https://www.rocktherankings.com/sitemap_index.xml"; //Has more xml files.
$initial_url = "http://localhost/Work/buzz/Templates/0.xml";

//$xmls = file_get_contents($initial_url); //Should I stick to this line or below line ?
//Parse the sitemap content to object
//$xml = simplexml_load_string($xmls); //Should I stick to this line or above line ?
$xml = simplexml_load_string(file_get_contents($initial_url)); //Code from Dani: https://www.daniweb.com/programming/web-development/threads/540168/what-to-lookout-for-to-prevent-crawler-traps

$dom = new DOMDocument();
$dom->loadXML($xml); //LINE: 44z
//$result = @$dom->loadXML($xml); //LINE: 44

echo __LINE__; echo '<br>'; //LINE: 46

extract_links($xml);

echo __LINE__; echo '<br>';  //LINE: 50

foreach($sitemaps AS $sitemap)
{
    echo __LINE__; echo '<br>';
    extract_links($sitemap); //Extract Links on page.
}

foreach($html_page_urls AS $html_page_url)
{
    echo __LINE__; echo '<br>';
    $scrape_page_data($html_page_url); //Extract Links on page.
}

//END OF SCRIPT FLOW.

//DUNCTIONS BEYOND THIS POINT.

//Links Extractor.
function extract_links()
{
    echo __LINE__; echo '<br>';  //LINE: 73

    GLOBAL $dom;
    //Trigger following IF/ELSEs on each Crawled Page to check for link types. Whether Links lead to more SiteMaps (.xml) or webpages (.html, .htm, .php, etc.).
    if ($dom->nodeName === 'sitemapindex')  //Current Xml SiteMap Page lists more Xml SiteMaps. Lists links to Xml links. Not lists links to html links.
    {
        echo __LINE__; echo '<br>';

        //parse the index
        // retrieve properties from the sitemap object
        foreach ($xml->sitemapindex as $urlElement) //Extracts xml file urls.
        {
            // get properties
            $sitemaps[] = $sitemap_url = $urlElement->loc;
            $sitemaps_last_mods[] = $last_mod = $urlElement->lastmod;
            $sitemaps_change_freqs[] = $change_freq = $urlElement->changefreq;
            $sitemaps_priorities[] = $priority = $urlElement->priority;

            // print out the properties
            echo 'url: '. $sitemap_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';

            echo '<br>---<br>';
        }
    } 
    else if ($dom->nodeName === 'urlset')  //Current Xml SiteMap Page lists no more Xml SiteMap links. Lists only html links.
    {
        echo __LINE__; echo '<br>';

        //parse url set
        // retrieve properties from the sitemap object
        foreach ($xml->urlset as $urlElement) //Extracts Sitemap Urls.
        {
            // get properties
            $html_page_urls[] = $html_page_url = $urlElement->loc;
            $html_page_last_mods[] = $last_mod = $urlElement->lastmod;
            $html_page_change_freqs[] = $change_freq = $urlElement->changefreq;
            $html_page_priorities[] = $priority = $urlElement->priority;

            // print out the properties
            echo 'url: '. $html_page_url . '<br>';
            echo 'lastmod: '. $last_mod . '<br>';
            echo 'changefreq: '. $change_freq . '<br>';
            echo 'priority: '. $priority . '<br>';

            echo '<br>---<br>';
        }
    } 

    GLOBAL $sitemaps;
    GLOBAL $sitemaps_last_mods;
    GLOBAL $sitemaps_change_freqs;
    GLOBAL $sitemaps_priorities;

    GLOBAL $html_page_urls;
    GLOBAL $html_page_last_mods;
    GLOBAL $html_page_change_freqs;
    GLOBAL $html_page_priorities;

    echo 'SiteMaps Crawled: ---'; echo '<br><br>'; 
    if(array_count_values($sitemaps)>0)
    {   
        print_r($sitemaps);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_last_mods)>0)
    {   
        print_r($sitemaps_last_mods);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_change_freqs)>0)
    {   
        print_r($sitemaps_change_freqs);
        echo '<br>';
    }
    elseif(array_count_values($sitemaps_priorities)>0)
    {   
        print_r($sitemaps_priorities);
        echo '<br><br>'; 
    }

    echo 'Html Pages Crawled: ---'; echo '<br><br>'; 

    if(array_count_values($html_page_urls)>0)
    {   
        print_r($html_page_urls);
        echo '<br>';
    }
    if(array_count_values($html_page_last_mods)>0)
    {   
        print_r($html_page_last_mods);
        echo '<br>';
    }
    if(array_count_values($html_page_change_freqs)>0)
    {   
        print_r($html_page_change_freqs);
        echo '<br>';
    }
    if(array_count_values($html_page_priorities)>0)
    {   
        print_r($html_page_priorities);
        echo '<br>';
    }
}

//Meta Data & Title Extractor.
function scrape_page_data()
{
    GLOBAL $html_page_urls;
    if(array_count_values($html_page_urls)>0)
    {       
        foreach($html_page_urls AS $url)
        {
            // https://www.php.net/manual/en/function.file-get-contents
            $html = file_get_contents($url);

            //https://www.php.net/manual/en/domdocument.construct.php
            $doc = new DOMDocument();

            // https://www.php.net/manual/en/function.libxml-use-internal-errors.php
            libxml_use_internal_errors(true);

            // https://www.php.net/manual/en/domdocument.loadhtml.php
            $doc->loadHTML($html, LIBXML_COMPACT|LIBXML_NOERROR|LIBXML_NOWARNING);

            // https://www.php.net/manual/en/function.libxml-clear-errors.php
            libxml_clear_errors();

            // https://www.php.net/manual/en/domdocument.getelementsbytagname.php
            $meta_tags = $doc->getElementsByTagName('meta');

            // https://www.php.net/manual/en/domnodelist.item.php
            if ($meta_tags->length > 0)
            {
                // https://www.php.net/manual/en/class.domnodelist.php
                foreach ($meta_tags as $tag)
                {
                    // https://www.php.net/manual/en/domnodelist.item.php
                    echo 'Meta Name: ' .$meta_name = $tag->getAttribute('name'); echo '<br>';
                    echo 'Meta Content: ' .$meta_content = $tag->getAttribute('content');  echo '<br>';
                    $html_page_meta_names[] = $meta_name;
                    $html_page_meta_descriptions[] = $meta_content;
                }
            }

            //EXAMPLE 1: Extract Title
            $title_tag = $doc->getElementsByTagName('title');
            if ($title_tag->length>0)
            {
                echo 'Title: ' .$title = $title_tag[0]->textContent; echo '<br>';
                $html_page_titles[] = $title;
            }

            //EXAMPLE 2: Extract Title
            $title_tag = $doc->getElementsByTagName('title');

            for ($i = 0; $i < $title_tag->length; $i++) {
                echo 'Title: ' .$title = $title_tag->item($i)->nodeValue . "\n";
                $html_page_titles[] = $title;
            }
        }
    }
}

if(array_count_values($html_page_meta_names)>0)
{   
    print_r($html_page_meta_names);
    echo '<br>';
}

if(array_count_values($html_page_meta_descriptions)>0)
{   
    print_r($html_page_meta_descriptions);
    echo '<br>';
}

if(array_count_values($html_page_titles)>0)
{   
    print_r($html_page_titles);
    echo '<br>';
}

//END OF FUNCTIONS.

Bing AI Fix v1:

<?php
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);

//No Page Loading Time-Out feature.
//No Page Loading Status Codes Feature.

// Preparing Crawler & Session: Initializing Variables.

// Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
// SiteMaps Details Scraped from SiteMaps or Xml Files.
$sitemaps = []; // This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods = []; // This will list dates of SiteMap pages last modified - found on Sitemaps.
$sitemaps_change_freqs = []; // This will list SiteMap dates of html pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities = []; // This will list SiteMap pages priorities - found on Sitemaps.

// Webpage Details Scraped from SiteMaps or Xml Files.
$html_page_urls = []; // This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods = []; // This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs = []; // This will list dates of html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities = []; // This will list html pages priorities - found on Sitemaps.

// Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
// Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_meta_names = []; // This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions = []; // This will list crawled pages Meta Tag Descriptions - found on html pages.
$html_page_titles = []; // This will list crawled pages Titles - found on html pages.

// Crawl Session Starting Page/Initial Xml Sitemap. (NOTE: Has to be .xml Sitemap).
$initial_url = "http://localhost/Work/buzz/Templates/0.xml";

// Call the extract_links function with the initial sitemap URL
extract_links($initial_url);

// Links Extractor.
function extract_links($url)
{
    global $sitemaps, $sitemaps_last_mods, $sitemaps_change_freqs, $sitemaps_priorities;
    global $html_page_urls, $html_page_last_mods, $html_page_change_freqs, $html_page_priorities;

    // Load the sitemap content
    $xml = simplexml_load_string(file_get_contents($url));

    // Check if the sitemap contains other sitemaps
    if (isset($xml->sitemap)) {
        // Loop through the sitemaps
        foreach ($xml->sitemap as $sitemap) {
            // Recursively call the extract_links function on the sitemap URL
            extract_links($sitemap->loc);

            // Store information about the sitemap
            $sitemaps[] = (string)$sitemap->loc;
            if (isset($sitemap->lastmod)) {
                $sitemaps_last_mods[] = (string)$sitemap->lastmod;
            }
            if (isset($sitemap->changefreq)) {
                $sitemaps_change_freqs[] = (string)$sitemap->changefreq;
            }
            if (isset($sitemap->priority)) {
                $sitemaps_priorities[] = (string)$sitemap->priority;
            }
        }
    } else if (isset($xml->url)) {
        // Loop through the URLs in the sitemap
        foreach ($xml->url as $url) {
            // Store information about the URL
            $html_page_urls[] = (string)$url->loc;
            if (isset($url->lastmod)) {
                $html_page_last_mods[] = (string)$url->lastmod;
            }
            if (isset($url->changefreq)) {
                $html_page_change_freqs[] = (string)$url->changefreq;
            }
            if (isset($url->priority)) {
                $html_page_priorities[] = (string)$url->priority;
            }
        }
    }
}

// Scrape meta data from HTML pages
foreach ($html_page_urls as $url) {
    scrape_page_data($url);
}

// Meta Data & Title Extractor.
function scrape_page_data($url)
{
    global $html_page_meta_names, $html_page_meta_descriptions, $html_page_titles;

    // Load the HTML content
    @$doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));

    // Extract meta tags
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->hasAttribute('name') && $meta->hasAttribute('content')) {
            $name = strtolower($meta->getAttribute('name'));
            if ($name === 'description') {
                $html_page_meta_descriptions[] = trim($meta->getAttribute('content'));
            } else {
                $html_page_meta_names[] = trim($meta->getAttribute('content'));
            }
        }
    }

    // Extract title tag
    foreach ($doc->getElementsByTagName('title') as $title) {
        $html_page_titles[] = trim($title->nodeValue);
    }
}

// Print out the extracted data in a more readable format
echo "<pre>";
echo "SITEMAPS:\n";
echo print_r($sitemaps, true);
echo "SITEMAPS LAST MODS:\n";
echo print_r($sitemaps_last_mods, true);
echo "SITEMAPS CHANGE FREQS:\n";
echo print_r($sitemaps_change_freqs, true);
echo "SITEMAPS PRIORITIES:\n";
echo print_r($sitemaps_priorities, true);

echo "HTML PAGE URLS:\n";
echo print_r($html_page_urls, true);
echo "HTML PAGE LAST MODS:\n";
echo print_r($html_page_last_mods, true);
echo "HTML PAGE CHANGE FREQS:\n";
echo print_r($html_page_change_freqs, true);
echo "HTML PAGE PRIORITIES:\n";
echo print_r($html_page_priorities, true);

echo "HTML PAGE META NAMES:\n";
echo print_r($html_page_meta_names, true);
echo "HTML PAGE META DESCRIPTIONS:\n";
echo print_r($html_page_meta_descriptions, true);
echo "HTML PAGE TITLES:\n";
echo print_r($html_page_titles, true);
echo "</pre>";

I told Bing AI to add Time-Out. Crawler v2:

<?php
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);

// Set the timeout value in seconds
$timeout = 10;

// Preparing Crawler & Session: Initializing Variables.

// Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
// SiteMaps Details Scraped from SiteMaps or Xml Files.
$sitemaps = []; // This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods = []; // This will list dates of SiteMap pages last modified - found on Sitemaps.
$sitemaps_change_freqs = []; // This will list SiteMap dates of html pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities = []; // This will list SiteMap pages priorities - found on Sitemaps.

// Webpage Details Scraped from SiteMaps or Xml Files.
$html_page_urls = []; // This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods = []; // This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs = []; // This will list dates of html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities = []; // This will list html pages priorities - found on Sitemaps.

// Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
// Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_meta_names = []; // This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions = []; // This will list crawled pages Meta Tag Descriptions - found on html pages.
$html_page_titles = []; // This will list crawled pages Titles - found on html pages.

// Crawl Session Starting Page/Initial Xml Sitemap. (NOTE: Has to be .xml Sitemap).
$initial_url = "http://localhost/Work/buzz/Templates/0.xml";

// Call the extract_links function with the initial sitemap URL
extract_links($initial_url);

// Links Extractor.
function extract_links($url)
{
    global $timeout;
    global $sitemaps, $sitemaps_last_mods, $sitemaps_change_freqs, $sitemaps_priorities;
    global $html_page_urls, $html_page_last_mods, $html_page_change_freqs, $html_page_priorities;

    // Create a stream context with the timeout option
    $context = stream_context_create([
        'http' => [
            'timeout' => $timeout
        ]
    ]);

    // Load the sitemap content
    $xml = simplexml_load_string(file_get_contents($url, false, $context));

    // Check if the sitemap contains other sitemaps
    if (isset($xml->sitemap)) {
        // Loop through the sitemaps
        foreach ($xml->sitemap as $sitemap) {
            // Recursively call the extract_links function on the sitemap URL
            extract_links($sitemap->loc);

            // Store information about the sitemap
            $sitemaps[] = (string)$sitemap->loc;
            if (isset($sitemap->lastmod)) {
                $sitemaps_last_mods[] = (string)$sitemap->lastmod;
            }
            if (isset($sitemap->changefreq)) {
                $sitemaps_change_freqs[] = (string)$sitemap->changefreq;
            }
            if (isset($sitemap->priority)) {
                $sitemaps_priorities[] = (string)$sitemap->priority;
            }
        }
    } else if (isset($xml->url)) {
        // Loop through the URLs in the sitemap
        foreach ($xml->url as $url) {
            // Store information about the URL
            $html_page_urls[] = (string)$url->loc;
            if (isset($url->lastmod)) {
                $html_page_last_mods[] = (string)$url->lastmod;
            }
            if (isset($url->changefreq)) {
                $html_page_change_freqs[] = (string)$url->changefreq;
            }
            if (isset($url->priority)) {
                $html_page_priorities[] = (string)$url->priority;
            }
        }
    }
}

// Scrape meta data from HTML pages
foreach ($html_page_urls as $url) {
    scrape_page_data($url);
}

// Meta Data & Title Extractor.
function scrape_page_data($url)
{
    global $html_page_meta_names, $html_page_meta_descriptions, $html_page_titles;

    // Load the HTML content
    @$doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));

    // Extract meta tags
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->hasAttribute('name') && $meta->hasAttribute('content')) {
            $name = strtolower($meta->getAttribute('name'));
            if ($name === 'description') {
                $html_page_meta_descriptions[] = trim($meta->getAttribute('content'));
            } else {
                $html_page_meta_names[] = trim($meta->getAttribute('content'));
            }
        }
    }

    // Extract title tag
    foreach ($doc->getElementsByTagName('title') as $title) {
        $html_page_titles[] = trim($title->nodeValue);
    }
}

// Print out the extracted data in a more readable format
echo "<pre>";
echo "SITEMAPS:\n";
echo print_r($sitemaps, true);
echo "SITEMAPS LAST MODS:\n";
echo print_r($sitemaps_last_mods, true);
echo "SITEMAPS CHANGE FREQS:\n";
echo print_r($sitemaps_change_freqs, true);
echo "SITEMAPS PRIORITIES:\n";
echo print_r($sitemaps_priorities, true);

echo "HTML PAGE URLS:\n";
echo print_r($html_page_urls, true);
echo "HTML PAGE LAST MODS:\n";
echo print_r($html_page_last_mods, true);
echo "HTML PAGE CHANGE FREQS:\n";
echo print_r($html_page_change_freqs, true);
echo "HTML PAGE PRIORITIES:\n";
echo print_r($html_page_priorities, true);

echo "HTML PAGE META NAMES:\n";
echo print_r($html_page_meta_names, true);
echo "HTML PAGE META DESCRIPTIONS:\n";
echo print_r($html_page_meta_descriptions, true);
echo "HTML PAGE TITLES:\n";
echo print_r($html_page_titles, true);
echo "</pre>";

I told Bing AI to add Status Codes for 4xx & 5xx ranges. Crawler v3a:

<?php
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);

// Set the timeout value in seconds
$timeout = 10;

// Preparing Crawler & Session: Initializing Variables.

// Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
// SiteMaps Details Scraped from SiteMaps or Xml Files.
$sitemaps = []; // This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods = []; // This will list dates of SiteMap pages last modified - found on Sitemaps.
$sitemaps_change_freqs = []; // This will list SiteMap dates of html pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities = []; // This will list SiteMap pages priorities - found on Sitemaps.

// Webpage Details Scraped from SiteMaps or Xml Files.
$html_page_urls = []; // This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods = []; // This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs = []; // This will list dates of html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities = []; // This will list html pages priorities - found on Sitemaps.

// Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
// Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_meta_names = []; // This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions = []; // This will list crawled pages Meta Tag Descriptions - found on html pages.
$html_page_titles = []; // This will list crawled pages Titles - found on html pages.

// Crawl Session Starting Page/Initial Xml Sitemap. (NOTE: Has to be .xml Sitemap).
$initial_url = "http://localhost/Work/buzz/Templates/0.xml";

// Call the extract_links function with the initial sitemap URL
extract_links($initial_url);

// Links Extractor.
function extract_links($url)
{
    global $timeout;
    global $sitemaps, $sitemaps_last_mods, $sitemaps_change_freqs, $sitemaps_priorities;
    global $html_page_urls, $html_page_last_mods, $html_page_change_freqs, $html_page_priorities;

    // Check the status code of the URL
    $status_code = get_status_code($url);
    if ($status_code >= 400) {
        echo "Error: Unable to process URL $url (status code: $status_code)\n";
        return;
    }

    // Create a stream context with the timeout option
    $context = stream_context_create([
        'http' => [
            'timeout' => $timeout
        ]
    ]);

    // Load the sitemap content
    $xml = simplexml_load_string(file_get_contents($url, false, $context));

    // Check if the sitemap contains other sitemaps
    if (isset($xml->sitemap)) {
        // Loop through the sitemaps
        foreach ($xml->sitemap as $sitemap) {
            // Recursively call the extract_links function on the sitemap URL
            extract_links($sitemap->loc);

            // Store information about the sitemap
            $sitemaps[] = (string)$sitemap->loc;
            if (isset($sitemap->lastmod)) {
                $sitemaps_last_mods[] = (string)$sitemap->lastmod;
            }
            if (isset($sitemap->changefreq)) {
                $sitemaps_change_freqs[] = (string)$sitemap->changefreq;
            }
            if (isset($sitemap->priority)) {
                $sitemaps_priorities[] = (string)$sitemap->priority;
            }
        }
    } else if (isset($xml->url)) {
        // Loop through the URLs in the sitemap
        foreach ($xml->url as $url) {
            // Store information about the URL
            $html_page_urls[] = (string)$url->loc;
            if (isset($url->lastmod)) {
                $html_page_last_mods[] = (string)$url->lastmod;
            }
            if (isset($url->changefreq)) {
                $html_page_change_freqs[] = (string)$url->changefreq;
            }
            if (isset($url->priority)) {
                $html_page_priorities[] = (string)$url->priority;
            }
        }
    }
}

// Scrape meta data from HTML pages
foreach ($html_page_urls as $url) {
    scrape_page_data($url);
}

// Meta Data & Title Extractor.
function scrape_page_data($url)
{
    global $html_page_meta_names, $html_page_meta_descriptions, $html_page_titles;

    // Check the status code of the URL
    $status_code = get_status_code($url);
    if ($status_code >= 400) {
        echo "Error: Unable to process URL $url (status code: $status_code)\n";
        return;
    }

    // Load the HTML content
    @$doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));

    // Extract meta tags
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->hasAttribute('name') && $meta->hasAttribute('content')) {
            $name = strtolower($meta->getAttribute('name'));
            if ($name === 'description') {
                $html_page_meta_descriptions[] = trim($meta->getAttribute('content'));
            } else {
                $html_page_meta_names[] = trim($meta->getAttribute('content'));
            }
        }
    }

    // Extract title tag
    foreach ($doc->getElementsByTagName('title') as $title) {
        $html_page_titles[] = trim($title->nodeValue);
    }
}

// Get HTTP status code for a given URL
function get_status_code($url)
{
    global $timeout;

    // Create a stream context with the timeout option
    $context = stream_context_create([
        'http' => [
            'timeout' => $timeout,
            'method' => 'HEAD'
        ]
    ]);

    // Send a HEAD request to get only headers and not body content
    @$headers = get_headers($url, 0, $context);

    // Check if headers were returned
    if ($headers === false) {
        return false;
    }

    // Get the first header line (status line)
    $status_line = $headers[0];

    // Extract the status code from the status line
    list(, $status_code) = explode(' ', $status_line);

    // Return the status code
    return (int)$status_code;
}

// Print out the extracted data in a more readable format
echo "<pre>";
echo "SITEMAPS:\n";
echo print_r($sitemaps, true);
echo "SITEMAPS LAST MODS:\n";
echo print_r($sitemaps_last_mods, true);
echo "SITEMAPS CHANGE FREQS:\n";
echo print_r($sitemaps_change_freqs, true);
echo "SITEMAPS PRIORITIES:\n";
echo print_r($sitemaps_priorities, true);

echo "HTML PAGE URLS:\n";
echo print_r($html_page_urls, true);
echo "HTML PAGE LAST MODS:\n";
echo print_r($html_page_last_mods, true);
echo "HTML PAGE CHANGE FREQS:\n";
echo print_r($html_page_change_freqs, true);
echo "HTML PAGE PRIORITIES:\n";
echo print_r($html_page_priorities, true);

echo "HTML PAGE META NAMES:\n";
echo print_r($html_page_meta_names, true);
echo "HTML PAGE META DESCRIPTIONS:\n";
echo print_r($html_page_meta_descriptions, true);
echo "HTML PAGE TITLES:\n";
echo print_r($html_page_titles, true);
echo "</pre>";

?>

I told Bing AI to add Status Codes for 4xx & 5xx ranges. Crawler v3b:

<?php
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);

// Set the timeout value in seconds
$timeout = 10;

// Preparing Crawler & Session: Initializing Variables.

// Preparing $ARRAYS For Step 1: To Deal with Xml Links meant for Crawlers only.
// SiteMaps Details Scraped from SiteMaps or Xml Files.
$sitemaps = []; // This will list extracted further Xml SiteMap links (.xml) found on Sitemaps (.xml).
$sitemaps_last_mods = []; // This will list dates of SiteMap pages last modified - found on Sitemaps.
$sitemaps_change_freqs = []; // This will list SiteMap dates of html pages frequencies of page updates - found on Sitemaps.
$sitemaps_priorities = []; // This will list SiteMap pages priorities - found on Sitemaps.

// Webpage Details Scraped from SiteMaps or Xml Files.
$html_page_urls = []; // This will list extracted html links Urls (.html, .htm, .php) - found on Sitemaps (.xml).
$html_page_last_mods = []; // This will list dates of html pages last modified - found on Sitemap.
$html_page_change_freqs = []; // This will list dates of html pages frequencies of page updates - found on Sitemaps.
$html_page_priorities = []; // This will list html pages priorities - found on Sitemaps.

// Preparing $ARRAYS For Step 2: To Deal with html pages meant for Human Visitors only.
// Data Scraped from Html Files. Not Xml SiteMap Files.
$html_page_meta_names = []; // This will list crawled pages Meta Tag Names - found on html pages.
$html_page_meta_descriptions = []; // This will list crawled pages Meta Tag Descriptions - found on html pages.
$html_page_titles = []; // This will list crawled pages Titles - found on html pages.

// Crawl Session Starting Page/Initial Xml Sitemap. (NOTE: Has to be .xml Sitemap).
$initial_url = "http://localhost/Work/buzz/Templates/0.xml";

// Call the extract_links function with the initial sitemap URL
extract_links($initial_url);

// Links Extractor.
function extract_links($url)
{
    global $timeout;
    global $sitemaps, $sitemaps_last_mods, $sitemaps_change_freqs, $sitemaps_priorities;
    global $html_page_urls, $html_page_last_mods, $html_page_change_freqs, $html_page_priorities;

    // Check the status code of the URL
    $status_code = get_status_code($url);
    if ($status_code >= 400 && $status_code < 600) {
        echo "Error: Unable to process URL $url (status code: $status_code)\n";
        return;
    }

    // Create a stream context with the timeout option
    $context = stream_context_create([
        'http' => [
            'timeout' => $timeout
        ]
    ]);

    // Load the sitemap content
    $xml = simplexml_load_string(file_get_contents($url, false, $context));

    // Check if the sitemap contains other sitemaps
    if (isset($xml->sitemap)) {
        // Loop through the sitemaps
        foreach ($xml->sitemap as $sitemap) {
            // Recursively call the extract_links function on the sitemap URL
            extract_links($sitemap->loc);

            // Store information about the sitemap
            $sitemaps[] = (string)$sitemap->loc;
            if (isset($sitemap->lastmod)) {
                $sitemaps_last_mods[] = (string)$sitemap->lastmod;
            }
            if (isset($sitemap->changefreq)) {
                $sitemaps_change_freqs[] = (string)$sitemap->changefreq;
            }
            if (isset($sitemap->priority)) {
                $sitemaps_priorities[] = (string)$sitemap->priority;
            }
        }
    } else if (isset($xml->url)) {
        // Loop through the URLs in the sitemap
        foreach ($xml->url as $url) {
            // Store information about the URL
            $html_page_urls[] = (string)$url->loc;
            if (isset($url->lastmod)) {
                $html_page_last_mods[] = (string)$url->lastmod;
            }
            if (isset($url->changefreq)) {
                $html_page_change_freqs[] = (string)$url->changefreq;
            }
            if (isset($url->priority)) {
                $html_page_priorities[] = (string)$url->priority;
            }
        }
    }
}

// Scrape meta data from HTML pages
foreach ($html_page_urls as $url) {
    scrape_page_data($url);
}

// Meta Data & Title Extractor.
function scrape_page_data($url)
{
    global $html_page_meta_names, $html_page_meta_descriptions, $html_page_titles;

    // Check the status code of the URL
    $status_code = get_status_code($url);
    if ($status_code >= 400 && $status_code < 600) {
        echo "Error: Unable to process URL $url (status code: $status_code)\n";
        return;
    }

    // Load the HTML content
    @$doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));

    // Extract meta tags
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->hasAttribute('name') &&$meta->hasAttribute('content')) {
            $name = strtolower($meta->getAttribute('name'));
            if ($name === 'description') {
                $html_page_meta_descriptions[] = trim($meta->getAttribute('content'));
            } else {
                $html_page_meta_names[] = trim($meta->getAttribute('content'));
            }
        }
    }

    // Extract title tag
    foreach ($doc->getElementsByTagName('title') as$title) {
        $html_page_titles[] = trim($title->nodeValue);
    }
}

// Get HTTP status code for a given URL
function get_status_code($url)
{
    global$timeout;

    // Create a stream context with the timeout option
    $context=stream_context_create([
        'http'=>[
            'timeout'=>$timeout,
            'method'=>'HEAD'
        ]
    ]);

    // Send a HEAD request to get only headers and not body content
   @$headers=get_headers($url,0,$context);

   // Check if headers were returned
   if ($headers===false){
       return false;
   }

   // Get the first header line(status line)
   $status_line=$headers[0];

   // Extract the status code from the status line
   list(,$status_code)=explode(' ',$status_line);

   // Return the status code
   return(int)$status_code;
}

// Print out the extracted data in a more readable format
echo "<pre>";
echo "SITEMAPS:\n";
echo print_r($sitemaps,true);
echo "SITEMAPS LAST MODS:\n";
echo print_r($sitemaps_last_mods,true);
echo "SITEMAPS CHANGE FREQS:\n";
echo print_r($sitemaps_change_freqs,true);
echo "SITEMAPS PRIORITIES:\n";
echo print_r($sitemaps_priorities,true);

echo "HTML PAGE URLS:\n";
echo print_r($html_page_urls,true);
echo "HTML PAGE LAST MODS:\n";
echo print_r($html_page_last_mods,true);
echo "HTML PAGE CHANGE FREQS:\n";
echo print_r($html_page_change_freqs,true);
echo "HTML PAGE PRIORITIES:\n";
echo print_r($html_page_priorities,true);

echo "HTML PAGE META NAMES:\n";
echo print_r($html_page_meta_names,true);
echo "HTML PAGE META DESCRIPTIONS:\n";
echo print_r($html_page_meta_descriptions,true);
echo "HTML PAGE TITLES:\n";
echo print_r($html_page_titles,true);
echo "</pre>";

?>

php seo web-browser

1 Contributor
1 Reply
212 Views
51 Seconds Discussion Span
Latest Post 2 Years Ago Latest Post by borobhaisab

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

borobhaisab 117 Posting Whiz · Answer 1 · 2023-05-28T22:00:16+00:00

@dani

Which of these 2 I should stick to and why ?
Bing Ai Fixed: Crawler v3a
Bing Ai Fixed: Crawler v3b: