In PHP, I've tried using simple_html_dom in order to extract URLs from web pages.

And it works a lot of the time, but not all of the time.

For example, it doesn't work on ArsTechnica.com, because that site marks up its URLs differently in the HTML.

One thing I do know is that Firefox parses every link on a page perfectly, which is how you can load up a web page in Firefox and have all the links be clickable.

So I was wondering: is it possible to download the open-source Firefox browser engine, or Chrome, or whatever, pass some parameters to it somehow, and have it give me a list of all the URLs on the page?

I could then feed that into PHP by whatever means, whether it's shell_exec() or something else.

Is this possible? How do I do it?

I can get the links from that site using file_get_contents, DOMDocument, and DOMXPath. If you're looking for more browser-like behavior, I would recommend looking at a library like http://phantomjs.org/
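
If you do want the browser-engine route you asked about, here is a rough sketch of one way to do it. It assumes a headless Chrome/Chromium binary is installed (the binary name varies by system, e.g. chromium, chromium-browser, or google-chrome), and it just dumps the rendered DOM and parses it in PHP:

// Sketch only: have a headless browser render the page and print the DOM,
// then parse that output with DOMDocument/DOMXPath as usual.
$url  = 'https://arstechnica.com/';
$html = shell_exec('chromium --headless --dump-dom ' . escapeshellarg($url));

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//a[@href]') as $a) {
    echo $a->getAttribute('href') . "\n";
}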

OK, I've done as you said with DOMDocument and DOMXPath.

Here is my code:

$html = file_get_contents('https://arstechnica.com/');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> tags on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url . "\n";
}

----

The issue is that this only gets the URLs. How do I get the text overlay for the URLs?

Thanks

Do you mean the text value of the link?

You can use this to get the text of the link:

$text = $href->textContent;
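
For example, folded into your loop (just a sketch, assuming the same $xpath and query you already have):

$hrefs = $xpath->evaluate("/html/body//a");
foreach ($hrefs as $href) {
    $url  = $href->getAttribute('href');
    $text = trim($href->textContent);
    echo $text . ' => ' . $url . "\n";
}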
