I'm using BeautifulSoup to grab text from HTML files, but it's not perfect: for example, it seems to keep CSS and JavaScript code that was added haphazardly. My overall goal is to build a list of words and their frequencies so I can compare and contrast HTML files and categorize them. Dealing with thousands of words, I have to trim the fat, but it's difficult to know what's important to keep without context.
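Roughly what I'm doing now, stripped down (the file name and the tokenizing regex are just placeholders, not my exact code):

```python
from collections import Counter
import re

from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

text = soup.get_text()                           # this is where the CSS/JS sneaks in
words = re.findall(r"[a-z0-9']+", text.lower())  # crude word tokenizer
freq = Counter(words)                            # word -> frequency, used to compare documents
```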
Here are the first 1000 characters of bs4.get_text() output from parsing a Wikipedia article. get_text is supposed to return only the textual content between tags:
```
Socks (novel) - Wikipedia, the free encyclopedia
@-webkit-keyframes centralAuthPPersonalAnimation{0%{opacity:0;-webkit-transform:translateY(-20px)}100%{opacity:1;-webkit-transform:translateY(0)}}@-moz-keyframes centralAuthPPersonalAnimation{0%{opacity:0;-moz-transform:translateY(-20px)}100%{opacity:1;-moz-transform:translateY(0)}}@-o-keyframes centralAuthPPersonalAnimation{0%{opacity:0;-o-transform:translateY(-20px)}100%{opacity:1;-o-transform:translateY(0)}}@keyframes centralAuthPPersonalAnimation{0%{opacity:0;transform:translateY(-20px)}100%{opacity:1;transform:translateY(0)}}.centralAuthPPersonalAnimation{-webkit-animation-duration:1s;-moz-animation-duration:1s;-o-animation-duration:1s;animation-duration:1s;-webkit-animation-fill-mode:both;-moz-animation-fill-mode:both;-o-animation-fill-mode:both;animation-fill-mode:both;-webkit-animation-name:centralAuthPPersonalAnimation;-moz-animation-name:centralAuthPPersonalAnimation;-o-animation-name:centralAuthPPersonalAnimation;animation
```
A lot of junk. I would like to find a fast way to cut this stuff out. The only thing I can think of is to keep statistics on the frequency of non-word characters (anything that isn't a letter or a digit) over segments of varying size, arranged in something like a tree structure, and filter down through the segments until I find the ones with the highest non-word density, then cut away from there. Like with a binary tree.
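A rough sketch of the kind of recursive density check I have in mind (purely illustrative; the threshold and minimum segment size are made-up numbers):

```python
import re

NON_WORD = re.compile(r"[^A-Za-z0-9\s]")

def non_word_density(segment):
    """Fraction of characters that are neither letters, digits, nor whitespace."""
    if not segment:
        return 0.0
    return len(NON_WORD.findall(segment)) / len(segment)

def prune_junk(text, min_len=200, threshold=0.25):
    """Binary-tree style: split the text in half recursively and drop
    segments whose non-word density stays above the threshold."""
    if len(text) <= min_len:
        return "" if non_word_density(text) > threshold else text
    if non_word_density(text) <= threshold:
        return text  # looks like prose, keep the whole segment
    mid = len(text) // 2
    return prune_junk(text[:mid], min_len, threshold) + prune_junk(text[mid:], min_len, threshold)
```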
But that seems like a terrible idea, especially if the document is really long.
Traditional parsing looks like it would be very time-consuming to do, and most likely wouldn't work anyway, since I'm already using a third-party library that can't get it right.
The trick is that since the subject matter can be literally anything, I have to be careful not to cut relevant parts. But if the document is large, I can afford to lose a few words without the overall categorization process being affected.