Hi,
Can anyone help me read website data in PHP?
Please help, I have been trying to do this for a long time.
Thanks for any help.

I mean getting data from a website by reading its content.

Or simply $var = file_get_contents("http://www.example.com"); That should work on most servers, provided allow_url_fopen is enabled.
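For anyone who wants a bit more control, here is a minimal sketch of the same idea with a timeout and a failure check. The function name fetch_page and the user agent string are made up for illustration; they are not from any library.

```php
<?php
// Hypothetical helper: return the response body as a string, or null on failure.
function fetch_page(string $url, int $timeoutSeconds = 5): ?string
{
    // The 'http' options only take effect for http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleFetcher/1.0', // placeholder value
        ],
    ]);

    // @ hides the warning PHP raises on failure; the return value is checked instead.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

Something like $html = fetch_page('http://www.example.com'); then gives you either the raw markup or null, so you can tell a failed request apart from an empty page.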

True enough, but it won't get output that has been modified on the client (e.g. by JavaScript).

Also true. But is there anything in PHP that can do that?
Short of actually copying the entire thing, JavaScript files included, and having a browser execute it all.

I mean, building a PHP function that would actually execute JavaScript would be a massive project.
I don't believe such a project exists. Am I wrong?

If you use cURL (see the link I posted above), you extract the html (after all scripts have run), so it's job done! I couldn't believe it when I used it for the first time. I was trying to get hold of js files all over the place, but this seems to do the trick.

That would be awesome, but I can't seem to get it to work like that.

That is, cURL itself doesn't execute any of the scripts. It simply downloads the response, like file_get_contents does. It's obviously a lot more powerful than that, but I can't get it to actually execute the scripts before returning the output to me.

As an example, I'm using:

<?php
// Send plain text so we see the raw response and the
// browser doesn't execute the JavaScript itself.
header('Content-Type: text/plain');

// URL of my test HTML page
$myurl = 'http://localhost/test.html';

// Fetch the page
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $myurl);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 1);    // seconds to wait for the connection
$buffer = curl_exec($curl_handle);
if ($buffer === false) {
    // Report the failure instead of echoing false as an empty string
    $buffer = 'cURL error: ' . curl_error($curl_handle);
}
curl_close($curl_handle);

// Print the content
echo $buffer;
?>

This returns:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
    <head>
        <title></title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <script type="text/javascript">
        window.onload = function() {
            document.getElementById('MainContent').innerHTML = "This is the content!";
        }
        </script>
    </head>
    <body>
        <h1>There should be text below this!</h1>
        <p id="MainContent"></p>
        <p id="InlineGenerated">
            <script type="text/javascript">
                document.write('This is generated on the spot, within no event.');
            </script>
        </p>
    </body>
</html>

All the scripts remain in the markup, and none of them are executed.
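One rough way to confirm this from PHP is to parse the fetched string and look at it, sketched below. It assumes the DOM extension is available; summarize_markup is a made-up name, and the id passed in is assumed to be a plain identifier without quotes.

```php
<?php
// Hypothetical helper: report what a non-browser client sees in fetched markup.
function summarize_markup(string $html, string $elementId): array
{
    // Collect parser warnings internally so sloppy HTML doesn't spam the output.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Look the element up by its id attribute ($elementId must not contain quotes).
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $elementId . '"]');

    return [
        // Number of <script> elements still present in the markup
        'script_count' => $doc->getElementsByTagName('script')->length,
        // Text inside the element, or null if no such element exists
        'element_text' => $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null,
    ];
}
```

Run against the response above with 'MainContent' as the id, this should report both script elements and an empty element_text, because nothing ever ran the onload handler.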

Obviously, if I had printed this directly into my browser, my browser would have executed the scripts, but that's not what I'm after here.

Am I missing something?

It seems to work for me. You need to change the references to the JS files to absolute URLs (e.g. <script src="/js/myfile.js"></script> to <script src="http://www.example.com/js/myfile.js"></script>). This can be done easily enough with a str_replace().

Ok, so you are talking about getting the raw HTML (including the JavaScript), like I did, altering it (changing the script URIs), and then printing it to your browser to be executed?

Initially you made it sound like you were getting the markup to PHP after the scripts had run:

If you use cURL (see the link I posted above), you extract the html (after all scripts have run), so it's job done!

Maybe I'm misinterpreting what you meant?

Nah, my fault - I didn't make it clear that you need to change the JS src and/or link href to absolute URLs. As the whole page is placed into a string, a str_replace() will do it, e.g.

$buffer = str_replace('src="/','src="http://www.example.com/',$buffer);
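The one-liner only catches double-quoted src attributes. As a sketch of a slightly broader version (the function name is made up, and this swaps str_replace() for a regex), the following also handles href, single quotes, and leaves protocol-relative URLs such as //cdn.example.com/x.js alone:

```php
<?php
// Hypothetical helper: prefix root-relative src/href values with a base URL.
function absolutize_root_relative(string $html, string $baseUrl): string
{
    $base = rtrim($baseUrl, '/');

    // Match src="/..." or href='/...', but skip protocol-relative "//host/..." values.
    $pattern = '~\b(src|href)=(["\'])/(?!/)~i';

    // A callback avoids surprises if the base URL contains "$" or "\".
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($base): string {
            return $m[1] . '=' . $m[2] . $base . '/';
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}
```

Usage would be $buffer = absolutize_root_relative($buffer, 'http://www.example.com'); before echoing. Paths that are relative to the current page (e.g. src="js/myfile.js") are not touched, so those would still need separate handling.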