Hi, I am trying to build an interactive web application that scrapes a movie web page and extracts the movie title, year of release, user rating, director, runtime, and the first three cast members.
Below is a typical excerpt of the page markup containing the data I am trying to extract. I am not sure how to extract it, or even whether this is the correct way to approach the problem.
<div id="tn15title">
<h1>Letters to Juliet <span>(<a href="/year/2010/">2010</a>) <span class="pro-link"><a href="http://pro.imdb.com/rg/maindetails-title/tconst-pro-header-link/title/tt0892318/">More at <strong>IMDbPro</strong></a> »</span><span class="title-extra"></span></span></h1>
</div>
<div class="info-content">
14 May 2010 (USA)
<a class="tn15more inline" href="/title/tt0892318/releaseinfo" onClick="(new Image()).src='/rg/title-tease/releasedates/images/b.gif?link=/title/tt0892318/releaseinfo';">See more</a> »
</div>
<div class="starbar-meta">
<b>6.3/10</b>
<a href="ratings" class="tn15more">5,802 votes</a> »
</div>
<h5>Runtime:</h5><div class="info-content">USA:105 min </div>
<div id="director-info" class="info">
<h5>Director:</h5>
<div class="info-content">
<a href="/name/nm0935095/" onclick="(new Image()).src='/rg/directorlist/position-1/images/b.gif?link=name/nm0935095/';">Gary Winick</a><br/>
</div>
This is the code I have tried so far:
<?php
$url = "http://www.imdb.com/title/tt0892318/";
$raw = file_get_contents($url);
echo $raw; // this displays the whole web page, so the fetch itself works

// Nothing works from this point on.
// Strip tabs, line breaks, double spaces, and NUL/vertical-tab characters.
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

// Cut out the first table whose opening tag matches this marker.
$start = strpos($content, '<table cellpadding="2" class="standard_table"');
$end = strpos($content, '</table>', $start) + 8; // 8 = strlen('</table>')
$table = substr($content, $start, $end - $start);

// Split the table into rows, then each non-header row into cells.
preg_match_all("|<tr(.*)</tr>|U", $table, $rows);
foreach ($rows[0] as $row) {
    if (strpos($row, '<th') === false) {
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        $number = strip_tags($cells[0][0]);
        $name = strip_tags($cells[0][1]);
        $position = strip_tags($cells[0][2]);
        echo "{$position} - {$name} - Number {$number} <br>\n";
    }
}
?>
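In case it helps frame an answer, here is a minimal sketch of a different direction: parsing the markup with PHP's DOMDocument and DOMXPath instead of string functions and regular expressions. It assumes the live page matches the excerpt above (the ids and classes `tn15title`, `starbar-meta`, `director-info`, `info-content`). The function name `extractMovieFields` is a placeholder of mine, and the cast members are left out because the excerpt does not show their markup.

```php
<?php
// Sketch only: extract fields with DOMXPath, assuming the markup shown in the excerpt.
// extractMovieFields() is a hypothetical helper name, not part of any library.
function extractMovieFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    // The XML prolog is a common workaround to make loadHTML() treat input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Return the trimmed text of the first node matching $query, or null if none.
    $first = function (string $query) use ($xpath): ?string {
        $nodes = $xpath->query($query);
        return ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->textContent)
            : null;
    };

    return [
        // First text node of the <h1>, i.e. the text before the nested <span>.
        'title'    => $first('//div[@id="tn15title"]/h1/text()[1]'),
        // The year is the link whose href starts with /year/.
        'year'     => $first('//div[@id="tn15title"]/h1//a[starts-with(@href, "/year/")]'),
        'rating'   => $first('//div[@class="starbar-meta"]/b'),
        'director' => $first('//div[@id="director-info"]//a'),
        // The runtime sits in the <div> immediately following the "Runtime:" heading.
        'runtime'  => $first('//h5[text()="Runtime:"]/following-sibling::div[1]'),
    ];
}
```

With this, `extractMovieFields($raw)` would return an associative array with the keys `title`, `year`, `rating`, `director`, and `runtime`, where any field whose markup is not found comes back as `null` rather than causing an error.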
Thanks in advance for your help.
DJ