Extracting main contents from webpage

Question

realmayo 0 Newbie Poster

15 Years Ago

Basically, I want to do something like this:

Given a URL, find the part of the webpage that holds the main content.

For example, the main page of a blog should return the div that contains the posts and nothing more.

A blog post page should return the div that contains only the main article.

A forum page should return the div that contains the list of forum threads.

A forum thread page should return the container that holds all the posts.

And so on.

What I have worked out so far is a series of weights that weigh the size of a div's text against the child elements of the div, and so on. It works OK, but it isn't perfect (well, I don't expect it to be). It handles most cases. I am using lxml (I switched from Beautiful Soup for speed reasons).
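A minimal sketch of this kind of weighting might look like the following. It is not the poster's actual code, which was never posted. For illustration it uses the standard library's `html.parser` instead of lxml so that it runs without extra dependencies. The names `Node`, `TreeBuilder`, `score`, `find_main_div`, and `SAMPLE`, and the `child_weight` value, are all assumptions made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TEXT_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []
        self.text_len = 0  # characters of text anywhere below this node
        self.link_len = 0  # the part of that text that sits inside <a>


class TreeBuilder(HTMLParser):
    """Builds a small element tree and accumulates text lengths per node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack[-1].tag in SKIP_TEXT_TAGS:
            return
        n = len(data.strip())
        in_link = any(node.tag == "a" for node in self.stack)
        for node in self.stack:  # credit the text to every open ancestor
            node.text_len += n
            if in_link:
                node.link_len += n


def count_descendants(node):
    return sum(1 + count_descendants(child) for child in node.children)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def score(node, child_weight=5.0):
    # Assumed heuristic: reward non-link text, penalise markup-heavy blocks.
    return (node.text_len - node.link_len) - child_weight * count_descendants(node)


def find_main_div(html, child_weight=5.0):
    """Return the highest-scoring <div> node, or None if there is no div."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    divs = [n for n in iter_nodes(builder.root) if n.tag == "div"]
    if not divs:
        return None
    return max(divs, key=lambda n: score(n, child_weight))


SAMPLE = """
<html><body>
<div id="page">
  <div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
  <div id="content">
    <p>This is the first paragraph of the article, with plenty of plain text.</p>
    <p>This is the second paragraph, which adds even more plain text.</p>
  </div>
  <div id="footer"><a href="/terms">Terms</a></div>
</div>
</body></html>
"""

main = find_main_div(SAMPLE)
print(main.attrs.get("id"))
```

On this sample the `content` div wins: the outer `page` div holds the same amount of plain text but pays the penalty for more descendant elements, and the navigation and footer divs contain only link text. The weight would need tuning on real pages, and the same scoring could be applied to a tree parsed with lxml.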

I was wondering if anyone has heard of such a thing.

Thanks

python



Gribouillis 1,391 Programming Explorer Team Colleague · Answer 1 · 2010-02-06T13:26:03+00:00

I've never heard of such a thing, but if it works, you should post the code so that everybody can try it on some web pages!