How can I write my web crawler for data extraction?

Question

gunbuster363 0 Junior Poster

15 Years Ago

Hi, I am new here. I am working on my honours project at university, which is about data mining. Before I can do any data mining, I first need data.

I want to extract data from a site: www.tripadvisor.com
Should I write my web crawler in Python? I don't know Python, but I've seen people use it for this.

I don't need the extracted hyperlinks (although I need to extract them in the process); I only want to extract the words (strings) within the pages. Can Python do that? Is there anything that could help me write it?

Thank you very much.

Raymond

python

jlm699 320 Veteran Poster

15 Years Ago

You have a lot of options. If you just want the raw HTML, you can use urllib2. If you'd rather have a module parse out the elements for you and give you only the text, you'd be better off using BeautifulSoup. Search this forum to find plenty of examples of both.
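As a rough illustration of the text-only extraction described above, here is a minimal Python 3 sketch. Note that it is not BeautifulSoup: to stay dependency-free it uses the standard library's html.parser as a stand-in, and urllib.request (the Python 3 successor to urllib2) for fetching. The names TextExtractor, extract_text and fetch_html are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []       # non-empty pieces of text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_html(url):
    """Download a page and decode it (hypothetical helper, no error handling)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling extract_text(fetch_html(url)) would give you the words of one page as a single string. Real-world pages are often malformed, which is one reason a dedicated parser such as BeautifulSoup is usually the more robust choice.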

willygstyle 5 Junior Poster in Training

15 Years Ago

Yes, it can grab specific links, but not without instructions: you will have to tell it where to crawl and why. I'm not sure what you mean by the link being encrypted. To me it looks like it contains valuable information that you could use in your logic, such as "showuserreviews", "place", and "hongkong". You could extract all the links, look for the ones containing these kinds of keywords, and use that to decide where to go next.
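The idea of extracting every link and then choosing by keyword could be sketched roughly like this in Python 3, using only the standard library. All function and class names here are invented for the example, and matching the "next" link by its anchor text is an assumption about how the page labels it, so check the real markup first.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html, base_url):
    """Return every link on the page as (absolute_url, anchor_text)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def links_with_keywords(links, keywords):
    """Keep URLs containing any keyword, compared case-insensitively."""
    lowered = [k.lower() for k in keywords]
    return [url for url, _ in links if any(k in url.lower() for k in lowered)]


def find_next_link(links):
    """Return the first URL whose anchor text starts with 'next', else None."""
    for url, text in links:
        if text.lower().startswith("next"):
            return url
    return None
```

A crawl loop would then call collect_links on each fetched page, queue whatever links_with_keywords returns for something like ["showuserreviews"], and follow find_next_link until it returns None.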

gunbuster363 0 Junior Poster · Answer 1 · 2009-11-09T20:20:51+00:00

Does it have the ability to grab a specific link inside the HTML, for example the link to the "next" page?
I ask because the links on the website look encrypted, like this:
http://www.tripadvisor.com/ShowUserReviews-g294217-d305813-r45616996-Langham_Place_Hong_Kong-Hong_Kong_Hong_Kong_Region.html

To get to the next page, the only option is to grab the "next" hyperlink on the current page and have the crawler parse that page next.

gunbuster363 0 Junior Poster · Answer 2 · 2009-11-11T21:34:21+00:00

bump

somnia 0 Newbie Poster · Answer 3 · 2009-11-22T00:14:51+00:00

Try Scrapy.

It's a very simple (though quite powerful) web crawling and screen scraping framework for Python. It's also pretty well documented, and has a growing community.

gunbuster363 0 Junior Poster · Answer 4 · 2010-01-08T16:46:49+00:00

I've already managed to write it using BeautifulSoup.
Thank you all!