Hi, I am trying to build an interactive web application that scrapes a movie web page and extracts the movie title, year of release, user rating, director, runtime and the first three cast members.

Below is the typical markup containing the data I am trying to extract from the page. I am also not sure whether this is the correct way to approach the problem.


<div id="tn15title">
<h1>Letters to Juliet <span>(<a href="/year/2010/">2010</a>) <span class="pro-link"><a href="http://pro.imdb.com/rg/maindetails-title/tconst-pro-header-link/title/tt0892318/">More at <strong>IMDbPro</strong></a>&nbsp;&raquo;</span><span class="title-extra"></span></span></h1>
</div>


<div class="info-content">
14 May 2010 (USA)
<a class="tn15more inline" href="/title/tt0892318/releaseinfo" onClick="(new Image()).src='/rg/title-tease/releasedates/images/b.gif?link=/title/tt0892318/releaseinfo';">See more</a>&nbsp;&raquo;
</div>


<div class="starbar-meta">
<b>6.3/10</b>


&nbsp;&nbsp;<a href="ratings" class="tn15more">5,802 votes</a>&nbsp;&raquo;


</div>

<h5>Runtime:</h5><div class="info-content">USA:105 min </div>

<div id="director-info" class="info">
<h5>Director:</h5>
<div class="info-content">

<a href="/name/nm0935095/" onclick="(new Image()).src='/rg/directorlist/position-1/images/b.gif?link=name/nm0935095/';">Gary Winick</a><br/>

</div>

This is the code I have tried so far:

<?php 
$url = "http://www.imdb.com/title/tt0892318/";

$raw = file_get_contents($url);

echo $raw;// this value displays the whole web page

// nothing works from this point

$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");

$content = str_replace($newlines, "", html_entity_decode($raw));


$start = strpos($content,'<table cellpadding="2" class="standard_table"');

$end = strpos($content,'</table>',$start) + 8;

$table = substr($content,$start,$end-$start);


preg_match_all("|<tr(.*)</tr>|U",$table,$rows);

foreach ($rows[0] as $row){

    if ((strpos($row,'<th')===false)){
   
        preg_match_all("|<td(.*)</td>|U",$row,$cells);
       
        $number = strip_tags($cells[0][0]);
       
        $name = strip_tags($cells[0][1]);
       
        $position = strip_tags($cells[0][2]);
       
        echo "{$position} - {$name} - Number {$number} <br>\n";
   
    }

}
?>

Thanks for your help in advance

DJ

Hi there, I have been doing something similar recently to obtain data contained in an external webpage. I have used the DOMDocument technique to achieve this. Take a look:

<?php
$some_link = 'http://www.imdb.com/webpage.html';   // placeholder page; the http:// scheme is required for a remote URL
$tagName = 'div';
$attrName = 'id';
$attrValue = 'tn15title';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);

$html = getTags($dom, $tagName, $attrName, $attrValue);
echo $html;

function getTags($dom, $tagName, $attrName, $attrValue)
{
    	$html = array();   // must start as an array; string offsets break this on PHP 7.1+
    	$domxpath = new DOMXPath($dom);
	
	$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
		
    	$i = 0;
   	$x = 0;
    	while( $myItem = $filtered->item($i++) )
	{		
		$html[$x] = $filtered->item($x)->nodeValue ."<br>";
    		$x++;
	}
    	return $html;
	
}
?>

You could adapt this for your needs... Here is the man page for it:
http://www.php.net/manual/en/book.dom.php
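
For anyone who wants to try the approach without hitting the live site, here is a minimal sketch that runs the same kind of XPath query against the title markup quoted in the first post. The $sample string and the $titles variable are just names picked for this illustration, and libxml_use_internal_errors() is used in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Sample markup copied from the first post; a real script would load the page instead.
$sample = '<div id="tn15title">'
        . '<h1>Letters to Juliet <span>(<a href="/year/2010/">2010</a>)</span></h1>'
        . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($sample);
libxml_clear_errors();

// Select the <h1> inside the div whose id is tn15title.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//div[@id='tn15title']/h1");

$titles = array();
foreach ($nodes as $node) {
    // nodeValue is the text of the element and all of its children.
    $titles[] = trim($node->nodeValue);
}

echo $titles[0], "\n";
```

If the query matches nothing, $titles stays an empty array, so it is worth checking count($titles) before reading $titles[0].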

Hi Nonshatter, thanks for the reply. I have run your example on my server and there is no output to the screen. This is consistent with similar tests that I have performed. What output, if any, are you getting?

Thanks in advance

David

Hey David, you were right: the code I posted just returned an empty array. No problem though, I just tested the code below and it works on my setup.

<?php

$some_link = 'http://www.imdb.com/title/tt1055292/';
$tagName = 'h1';
$attrName = 'class';
$attrValue = 'header';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);

$html = getTags($dom, $tagName, $attrName, $attrValue);

	$i=0;
	while(isset($html[$i]))
	{
		echo $html[$i];

		$take = preg_split("/\ /", $html[$i]);

		for($n=0;$n<count($take);$n++)
		{ 
			print($take[$n]);
			echo "<br/>";
		}		
		$i++;
	}

function getTags($dom, $tagName, $attrName, $attrValue)
{
    	$html = array();   // array, not a string
    	$domxpath = new DOMXPath($dom);
	
	$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
		
    	$i = 0;
   	$x = 0;
    	while( $myItem = $filtered->item($i++) )
	{		
		$html[$x] = $filtered->item($x)->nodeValue ."<br>";
    		$x++;
	}
    	return $html;
	
}
?>

This returns:

Life as We Know It (2010) 
Life
as
We
Know
It (2010)

I used the preg_split function to chop up the array value. This is useful if you are returning rows of a table, <tr>'s for example. I hope this helps and let me know how it goes!
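
If you want the title and the year as two separate values instead of individual words, a single preg_match() may be easier than preg_split(). This is only a sketch: splitTitleAndYear() is a made-up helper name, and it assumes the heading text looks like "Title (YYYY)" with no "<br>" appended to it.

```php
<?php
// Hypothetical helper (not part of the code above): split "Title (YYYY)" into its parts.
function splitTitleAndYear($heading)
{
    // Everything before the final "(YYYY)" is the title; the four digits are the year.
    if (preg_match('/^\s*(.+?)\s*\((\d{4})\)\s*$/', $heading, $m)) {
        return array('title' => $m[1], 'year' => (int) $m[2]);
    }
    return null;   // the heading did not have the expected shape
}

$parts = splitTitleAndYear('Life as We Know It (2010) ');
echo $parts['title'], ' / ', $parts['year'], "\n";
```

Returning null for anything that does not match makes it easy to spot a page whose heading is laid out differently.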

Hi Nonshatter, it works! I have played about with $tagName = 'h1'; $attrName = 'class'; $attrValue = 'header';

Now I want to extract this data: <div class="starbar-meta"> <b>6.3/10</b>. I tried the following, but it does not work:

$tagName = 'div';
$attrName = 'class';
$attrValue = 'starbar-meta';

Can you point me in the right direction?

Thanks in advance

David

Great! I'm glad it's working. I'm not 100% sure, as I haven't had a chance to play around, but try something like this: (I got this from Here)

$tagName = 'div';
$attrName = 'class';
$attrValue = 'starbar-meta';
$subtag = 'b';                   // add the new subtag <b>

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);

$html = getTags($dom, $tagName, $attrName, $attrValue, $subtag);      //call the function with the new subtag

	$i=0;
	$x=0;
	while(isset($html[$i]))
	{
		echo $html[$i];		
                                  //remove the preg_split function if you don't need it
		$i++;
	}

function getTags($dom, $tagName, $attrName, $attrValue, $subtag)      //pass the subtag parameter
{
    	$html = array();   // array, not a string
    	$domxpath = new DOMXPath($dom);
	
	$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']/$subtag");     //the new query should select all <b> tags within the node tree div -> class -> starbar-meta -> b
		
    	$i = 0;
   	$x = 0;
    	while( $myItem = $filtered->item($i++) )
	{		
		$html[$x] = $filtered->item($x)->nodeValue ."<br>";
    		$x++;
	}
    	return $html;
	
}

Hi Nonshatter, I have tested this and the output is an empty string. I have looked at the website you recommended, but I don't see the solution yet.

Are you able to assist me further?

Thanks in advance

David

Hmm... It should work. Can you send me the URL you are trying to get that data from and I will have a look

No, I meant what imdb url you were trying to get the starbar-meta class from. In this page I am testing from, the rating is in a span tag as follows:

<span style="display:none" id="star-bar-user-rate"><b>4.9</b>

Here is the same code I posted before, and it works!

<?php
$some_link = 'http://www.imdb.com/title/tt1055292/';
$tagName = 'span';
$attrName = 'id';
$attrValue = 'star-bar-user-rate';
$subtag = 'b';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);

$html = getTags($dom, $tagName, $attrName, $attrValue);

	$i=0;
	while(isset($html[$i]))
	{
		echo $html[$i];

		
		$i++;
	}

function getTags($dom, $tagName, $attrName, $attrValue)
{
    	$html = array();   // array, not a string
    	$domxpath = new DOMXPath($dom);
	
	$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
		
    	$i = 0;
   	$x = 0;
    	while( $myItem = $filtered->item($i++) )
	{		
		$html[$x] = $filtered->item($x)->nodeValue ."<br>";
    		$x++;
	}
    	return $html;
	
}
?>

Hi Nonshatter, this is the link: $some_link = 'http://www.imdb.com/title/tt0892318/';

But your example works well with it. I must be interpreting the source code incorrectly. When I open the whole HTML file in Dreamweaver I see this code:

<div class="starbar-meta">
<b>6.3/10</b>

which is clearly not span, id, star-bar-user-rate, b. I interpreted it as div, class, starbar-meta, b.

Can you explain how you interpreted the code? Is this where I am going wrong?

Thanks

David

Hmm, yeah, that's a bit strange. Personally, I don't trust IDEs like Dreamweaver. They can often be more trouble than they're worth!

To interpret the source code, I just navigated to that page (http://www.imdb.com/title/tt0892318/) in Mozilla Firefox, then right-clicked and selected 'View Page Source'. Then I used Ctrl+F to search the source code for the interesting bit (the rating).

The rating as I see it on this page is held in the html:

<span style="display:none" id="star-bar-user-rate"><b>6.3</b><span class="mellow">/10</span></span>
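
As a rough, self-contained check, the snippet above can be parsed directly. The wrapping <div> and the variable names are assumptions added for the illustration; only the <span> comes from the page source.

```php
<?php
// The <span> is the markup quoted above; the outer <div> is only a container for the test.
$sample = '<div><span style="display:none" id="star-bar-user-rate">'
        . '<b>6.3</b><span class="mellow">/10</span></span></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings off the page
$dom->loadHTML($sample);
libxml_clear_errors();

// Ask for the <b> directly under the span with the given id.
$xpath = new DOMXPath($dom);
$node = $xpath->query("//span[@id='star-bar-user-rate']/b")->item(0);

// item(0) is null when nothing matched, so guard before reading nodeValue.
$rating = ($node !== null) ? (float) trim($node->nodeValue) : null;

echo $rating, "\n";
```

Selecting the <b> child instead of the whole span leaves out the "/10" part, so the value can be cast straight to a number.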

Hope this helps! :)

Hi Nonshatter, I can see that now. For the next example I want to extract the director from the HTML below, and I have substituted the values shown underneath it into $tagName etc. Would this be correct?

<div class="txt-block"><h4 class="inline">Director:</h4><a
href="/name/nm0935095/">Gary Winick</a></div>

$tagName = 'a';
$attrName = 'href';
$attrValue = '/name/nm0935095/';

Cheers

David

Yes, that should work just fine. Just remember to remove the "/$subtag" part of the domxpath query if you're only using three nodes:

$domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']/$subtag"); //remove "/$subtag"

Or failing that, you could keep the $subtag and try:
$tagName = 'div';
$attrName = 'class';
$attrValue = 'txt-block';
$subtag = 'a';
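
Another option, sketched here and not tested against the live page, is to let the XPath expression pick the block by its <h4> label, so that only the director link comes back. The first <div> in $sample is the markup you posted; the second one (the Country block and its href) is made up purely to show that other txt-block divs are skipped.

```php
<?php
$sample = '<div class="txt-block"><h4 class="inline">Director:</h4>'
        . '<a href="/name/nm0935095/">Gary Winick</a></div>'
        // Invented second block, only here to prove the filter works.
        . '<div class="txt-block"><h4 class="inline">Country:</h4>'
        . '<a href="/country/us">USA</a></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings off the page
$dom->loadHTML($sample);
libxml_clear_errors();

// Only txt-block divs whose <h4> text contains "Director", then their <a> children.
$xpath = new DOMXPath($dom);
$query = "//div[@class='txt-block'][h4[contains(., 'Director')]]/a";

$directors = array();
foreach ($xpath->query($query) as $link) {
    $directors[] = trim($link->nodeValue);
}

echo implode(', ', $directors), "\n";
```

The same pattern should work for other labels by changing the text inside contains(), assuming those blocks follow the same layout.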

Make sense?

Hi Nonshatter, Thanks again for your help.

<?php 

$some_link = 'http://www.imdb.com/title/tt1055292/';
$tagName = 'div';
$attrName = 'class';
$attrValue = 'txt-block';
$subtag = 'a';
 
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);
 
$html = getTags($dom, $tagName, $attrName, $attrValue, $subtag);
 
	$i=0;
	while(isset($html[$i]))
	{
		echo $html[$i];
 
 
		$i++;
	}
 
function getTags($dom, $tagName, $attrName, $attrValue)
{
    	$html = array();   // array, not a string
    	$domxpath = new DOMXPath($dom);
 
	$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
 //$domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']/$subtag"); //remove "/$subtag"
    	$i = 0;
   	$x = 0;
    	while( $myItem = $filtered->item($i++) )
	{		
		$html[$x] = $filtered->item($x)->nodeValue ."<br>";
    		$x++;
	}
    	return $html;
 
}
?>

The output contains a lot of unwanted data, as below:

Director: Greg Berlanti
Writers: Ian Deitchman, Kristin Rusk Robinson
Release Date: 8 October 2010 (UK)
Taglines: A comedy about taking it one step at a time.
Motion Picture Rating (MPAA) Rated PG-13 for sexual material, language and some drug content. See all certifications »
Parents Guide: Add content advisory for parents »
Official Sites: Official site |  »
Country: USA
Language: English
Release Date: 8 October 2010 (UK) See more »
Also Known As: Ilyen az élet See more »
Filming Locations: Atlanta, Georgia, USA See more »
Production Co: Josephson Entertainment, Gold Circle Films, Village Roadshow Pictures See more »
Show detailed company contact information on IMDbPro »
Runtime: USA: 112 min
Sound Mix: Dolby Digital
Color: Color
Soundtracks "For You Now" Written and Performed by Bruno MerzCourtesy of MassiveTalent by MassiveMusic See more »

Any ideas?

Thanks

David

Hi Nonshatter, thanks for your help. I resolved the last problem with a loop.

To display the output in one table, would it be better to call this bit of code as a function each time I request an element of the web page?

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);
 
$html = getTags($dom, $tagName, $attrName, $attrValue, $subtag);
 
	$i=0;
	while(isset($html[$i]))
	{
		echo $html[$i];
 
 
		$i++;
	}
 
function getTags($dom, $tagName, $attrName, $attrValue)
{
    	$html = array();   // array, not a string
    	$domxpath = new DOMXPath($dom);
 
	$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
 //$domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']/$subtag"); //remove "/$subtag"
    	$i = 0;
   	$x = 0;
    	while( $myItem = $filtered->item($i++) )
	{		
		$html[$x] = $filtered->item($x)->nodeValue ."<br>";
    		$x++;
	}
    	return $html;
 
}

Thanks

David

Hi again, having said that, each field needs different settings to extract its data. What would you suggest?

Thanks

David

Yes, you're correct. I'm not sure how to return just the one element, other than to echo only the first value of the $html array, e.g. $html[0];

This works:

<?php 

$some_link = 'http://www.imdb.com/title/tt1055292/';
$tagName = 'div';
$attrName = 'class';
$attrValue = 'txt-block';
$subtag = 'a';
 
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);
 
$html = getTags($dom, $tagName, $attrName, $attrValue, $subtag);
 
	
		echo $html[0];
	
 
function getTags($dom, $tagName, $attrName, $attrValue, $subtag)
{
    	$html = array();   // array, not a string
    	$domxpath = new DOMXPath($dom);
 
	//$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
 	$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']/$subtag");
    	$i = 0;
   	$x = 0;
    	while( $myItem = $filtered->item($i++) )
	{		
		$html[$x] = $filtered->item($x)->nodeValue ."<br>";
    		$x++;
	}
    	return $html;
 
}
?>

And yeah, it would be best to call it as a function. It might be a bit tricky though, as each page may use different tags to display the same information, particularly with IMDb as it's so rich in information.
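
A rough sketch of what such a function might look like: parse the page once, then run one XPath query per field. extractFields() and the contents of $queries are assumptions based on the markup quoted in this thread, so the expressions will probably need adjusting per page. The $sample string stands in for the downloaded page.

```php
<?php
// Hypothetical helper: run several XPath queries against one parsed page.
function extractFields(DOMDocument $dom, array $queries)
{
    $xpath = new DOMXPath($dom);
    $result = array();
    foreach ($queries as $field => $query) {
        // Take the first match only; null marks a field that was not found.
        $node = $xpath->query($query)->item(0);
        $result[$field] = ($node !== null) ? trim($node->nodeValue) : null;
    }
    return $result;
}

// Stand-in for the real page, built from markup quoted in this thread.
$sample = '<div>'
        . '<h1 class="header">Letters to Juliet (2010)</h1>'
        . '<span id="star-bar-user-rate"><b>6.3</b><span class="mellow">/10</span></span>'
        . '<div class="txt-block"><h4 class="inline">Director:</h4>'
        . '<a href="/name/nm0935095/">Gary Winick</a></div>'
        . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings off the page
$dom->loadHTML($sample);
libxml_clear_errors();

$fields = extractFields($dom, array(
    'title'    => "//h1[@class='header']",
    'rating'   => "//span[@id='star-bar-user-rate']/b",
    'director' => "//div[@class='txt-block'][h4[contains(., 'Director')]]/a",
    'runtime'  => "//div[@class='txt-block'][h4[contains(., 'Runtime')]]",   // not in the sample
));

print_r($fields);
```

Keeping the per-field settings in one array means a layout change only touches the query strings, and the page is downloaded and parsed a single time.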

I've only used this technique manually - plugging in the tag names and attributes as and when I need them. Let me know if you find a solution to this. Cheers

Hi Nonshatter, I certainly will let you know; I shall work on it over the weekend. Another area I am looking at is presenting the user with a text field that can take any of these inputs:

the URL, such as http://www.imdb.com/title/tt0892318/
or
the title code, such as tt0892318
or
a number that will be prefixed with tt and interpreted as the title code, such as 0892318

On pressing the submit button the application should find the relevant page on the site.

Hope you can help, or if you would prefer I can start a new thread?

Thanks


David

Hey yea, that's no problem. Just start a new thread when you want to get started. Nonshatter
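
In the meantime, here is a rough starting point for the input handling. titleUrlFromInput() is a hypothetical helper, and the URL pattern is assumed from the example links in this thread; it has not been checked against every form a title link can take.

```php
<?php
// Hypothetical helper: turn any of the three accepted inputs into a title URL.
function titleUrlFromInput($input)
{
    $input = trim($input);

    // First pattern: "tt" followed by digits, either bare or inside a URL.
    // Second pattern: nothing but digits, which get the "tt" prefix added.
    if (preg_match('/tt(\d+)/', $input, $m) || preg_match('/^(\d+)$/', $input, $m)) {
        return 'http://www.imdb.com/title/tt' . $m[1] . '/';
    }

    return null;   // input not recognised
}

$inputs = array('http://www.imdb.com/title/tt0892318/', 'tt0892318', '0892318');
foreach ($inputs as $input) {
    echo titleUrlFromInput($input), "\n";
}
```

All three inputs end up as the same URL, which can then be passed on as $some_link. A null return is the place to show the user an error message.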

Hi Nonshatter, I am still trying to get to grips with controlling the output so that I extract only the precise value I want, i.e. the release date in the output below. Do you have any idea what the best way to do this is?

<?php
 // produce the release date of film
 
$some_link = 'http://www.imdb.com/title/tt0892318/';

$tagName = 'div';
$attrName = 'class';
$attrValue = 'txt-block';
$subtag = 'h4';
 


$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);
 
$html = getTags($dom, $tagName, $attrName, $attrValue, $subtag);
	
	
	$i=0;
	$x=0;
	while(isset($html[$i]) and ($i < 3))
	//while(isset($html[$i]))
	{
		//echo $html[$i];
 
		$take = preg_split("/\ /", $html[$i]);
 
		for($n=0;$n<count($take); $n++)
		{ 
			
			
			print($take[$n]);// output line
			echo "<br/>";
		
		}
 
		$i++;
		
	}
 
function getTags($dom, $tagName, $attrName, $attrValue, $subtag)
{
    	$html = array();   // array, not a string
    	$domxpath = new DOMXPath($dom);
 
	$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
 
  
    	$i = 0;
   		$x = 0;
    	while( $myItem = $filtered->item($i++) )
	{		
		$html[$x] = $filtered->item($x)->nodeValue ."<br>";
    		$x++;
	}
    	return $html;
 
}
?>

Director:

Gary
Winick

Writers:

Jose
Rivera,
Tim
Sullivan

Release
Date: 9
June
2010 (UK)

Taglines:
What
if
you
had
a
second
chance
to
find
true
love? See
more »

Motion
Picture
Rating
(MPAA) Rated
PG
for
brief
rude
behavior,
some
language
and
incidental
smoking.
See
all
certifications »

Parents
Guide:
View
content
advisory »

Official
Sites: Official
site
|  »

Country:
USA

Language: English |
Italian |
Spanish

Release
Date: 9
June
2010 (UK) See
more »

Also
Known
As:
Cartas
a
Julieta See
more »

Filming
Locations:
Bryant
Park,
Manhattan,
New
York
City,
New

Hey bud. Well, from printing out the entire contents of the $html array, you can see that the release date is the 3rd row down.

$i=0;
	$x=0;
	while(isset($html[$i]))
	{
			echo $html[$i];
		$i++;
	}
Director: Gary Winick
Writers: Jose Rivera, Tim Sullivan
Release Date: 9 June 2010 (UK)
Taglines: What if you had a second chance to find true love? See more » ..................
..................

This means that the line you want is located at array position 2, as the array starts from 0 and counts up. So it would be:

$html[0] = Director: Gary Winick
$html[1] = Writers: Jose Rivera, Tim Sullivan
$html[2] = Release Date: 9 June 2010 (UK)
$html[3] = Taglines: What if you had a second chance to find true love? See more

So the easy solution would be to simply echo out position 2.

echo $html[2];

However, if the Release date from a different imdb page was put into a different array position, this wouldn't work. So the most precise and dynamic way of doing this would be to create a regular expression to check what each array value contains before deciding whether or not to print it out. (This is something I've just started to learn as well, so it's probably not the most efficient way of doing it, but hey - if it ain't broke don't fix it!)

<?php
 // produce the release date of film
 
$some_link = 'http://www.imdb.com/title/tt0892318/';

$tagName = 'div';
$attrName = 'class';
$attrValue = 'txt-block';
$subtag = 'h4';

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTMLFile($some_link);
 
$html = getTags($dom, $tagName, $attrName, $attrValue, $subtag);
	
	$i=0;
	$x=0;
	while(isset($html[$i]))
	{
		//Print the array value that contains "Release" but not "See" (the "See more" row). The "i" means case-insensitive
		if (preg_match("/Release/i", $html[$i]) && !preg_match("/See/i", $html[$i]))
		{ 
			echo $html[$i] ."<br>";
		}
		}
		$i++;
	}
 
function getTags($dom, $tagName, $attrName, $attrValue, $subtag)
{
    	$html = array();   // array, not a string
    	$domxpath = new DOMXPath($dom);
 
	$filtered = $domxpath->query("//$tagName" . '[@' . $attrName . "='$attrValue']");
 
  
    	$i = 0;
   		$x = 0;
    	while( $myItem = $filtered->item($i++) )
	{		
		$html[$x] = $filtered->item($x)->nodeValue ."<br>";
    		$x++;
	}
    	return $html;
}
?>

You could also adapt this to rating, director, title etc. So regardless of the page you're pulling, it should always print out the values you're interested in.

P.S. I will look into a better preg_match() pattern to achieve this.
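
One possible tighter pattern, as a sketch only: match the "Release Date:" label and capture the date and country in one go, so the "See more" check is no longer needed. The rows in $html below are copied from the output earlier in the thread, and the pattern assumes the whitespace in the real node text is ordinary spaces or newlines, which I have not verified.

```php
<?php
// Sample rows copied from the output shown earlier in the thread.
$html = array(
    'Director: Gary Winick',
    'Writers: Jose Rivera, Tim Sullivan',
    'Release Date: 9 June 2010 (UK)',
    'Taglines: What if you had a second chance to find true love? See more »',
    'Release Date: 9 June 2010 (UK) See more »',
);

$releaseDate = null;
foreach ($html as $row) {
    // Label, then day / month name / four-digit year, then the country in brackets.
    $pattern = '/^\s*Release\s+Date:\s*(\d{1,2}\s+\w+\s+\d{4})\s*\(([^)]+)\)/i';
    if (preg_match($pattern, $row, $m)) {
        $releaseDate = array('date' => $m[1], 'country' => $m[2]);
        break;   // the first matching row is enough
    }
}

print_r($releaseDate);
```

Because the pattern is not anchored at the end of the row, it does not matter whether "See more" or "<br>" follows the country.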

Hi Nonshatter, thanks for that.

Look forward to hearing from you

Thanks

David
