Hi guys!
I've been using PHP for fun for a while, and now I'm interested in playing with some scraping. As far as I know, regex is the way to go. So I'm trying to scrape a 4chan page. I want to grab the images and the title of the thread each image belongs to.
So here's the URL I'm trying to scrape: http://boards.4chan.org/p/
It's a photography board, and the idea is to grab all of the images and know the title of the thread each one came from. If you look at the page, you can see a bunch of threads, each with the title in blue, then the poster ("anonymous") in green.
Right now, this grabs all of the images:
preg_match_all('#<img[^>]*>#i',
$html, // the page source
$posts, // will contain the matched <img> tags
PREG_SET_ORDER // one array per match, so $posts[0][0] is the first <img> tag
);
But now I want to grab the title I described above. How in the world do I do that? The hard part is that not every picture has a title of its own, because only the first post of a thread carries one.
I want to be able to say:
$posts[0] to get the <img> tag
$posts[1] to grab the title of the thread the image is from
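To make that concrete, here's a rough sketch of the result I'm imagining. The sample markup is made up for illustration, and two things in it are pure assumptions on my part: that threads are separated by an <hr> tag, and that a thread's title sits in a <span class="filetitle"> element. The real page will almost certainly differ, so the patterns would need adjusting.

```php
<?php
// Made-up sample markup, NOT the real 4chan HTML.
// Assumptions: threads are separated by <hr>, and a thread's title
// (if it has one) is inside <span class="filetitle">.
$html = <<<HTML
<img src="a.jpg"><span class="filetitle">Sunsets</span>
<img src="b.jpg">
<hr>
<img src="c.jpg">
<hr>
HTML;

$posts = [];

// Step 1: cut the page into one chunk per thread.
foreach (preg_split('#<hr[^>]*>#i', $html) as $thread) {
    // Step 2: find this thread's title, or null when it has none.
    $title = preg_match('#<span class="filetitle">(.*?)</span>#is', $thread, $m)
        ? trim(strip_tags($m[1]))
        : null;

    // Step 3: pair every <img> tag in the thread with that title.
    preg_match_all('#<img[^>]*>#i', $thread, $imgs);
    foreach ($imgs[0] as $img) {
        $posts[] = [$img, $title];
    }
}

// Each entry is [<img> tag, thread title or null].
foreach ($posts as $post) {
    echo $post[0], ' => ', $post[1] ?? '(no title)', "\n";
}
```

With this shape, $posts[$i][0] is the <img> tag and $posts[$i][1] is the title of the thread it came from, so images that aren't the first post still get their thread's title.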
Thanks!