I'm using BeautifulSoup to grab text from HTML files, but it's not perfect: for example, it seems to keep CSS and JavaScript code that was added haphazardly. My overall goal is to build a list of words and their frequencies so I can compare and contrast HTML files and categorize them. Dealing with thousands of words, I have to trim the fat, but it's difficult to know what's important to keep without context.
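Roughly what I'm doing now, stripped down (the file name and the tokenizing regex are just placeholders, not my exact code):

```python
from collections import Counter
import re

from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

text = soup.get_text()                           # this is where the CSS/JS sneaks in
words = re.findall(r"[a-z0-9']+", text.lower())  # crude word tokenizer
freq = Counter(words)                            # word -> frequency, used to compare documents
```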
Here are the first 1000 characters of bs4.get_text() output from parsing a Wikipedia article. get_text is supposed to return only the textual content between tags:
```
Socks (novel) - Wikipedia, the free encyclopedia
@-webkit-keyframes centralAuthPPersonalAnimation{0%{opacity:0;-webkit-transform:translateY(-20px)}100%{opacity:1;-webkit-transform:translateY(0)}}@-moz-keyframes centralAuthPPersonalAnimation{0%{opacity:0;-moz-transform:translateY(-20px)}100%{opacity:1;-moz-transform:translateY(0)}}@-o-keyframes centralAuthPPersonalAnimation{0%{opacity:0;-o-transform:translateY(-20px)}100%{opacity:1;-o-transform:translateY(0)}}@keyframes centralAuthPPersonalAnimation{0%{opacity:0;transform:translateY(-20px)}100%{opacity:1;transform:translateY(0)}}.centralAuthPPersonalAnimation{-webkit-animation-duration:1s;-moz-animation-duration:1s;-o-animation-duration:1s;animation-duration:1s;-webkit-animation-fill-mode:both;-moz-animation-fill-mode:both;-o-animation-fill-mode:both;animation-fill-mode:both;-webkit-animation-name:centralAuthPPersonalAnimation;-moz-animation-name:centralAuthPPersonalAnimation;-o-animation-name:centralAuthPPersonalAnimation;animation
```
A lot of junk. I would like to find a fast way to cut this stuff out. The only thing I can think of is to keep statistics on the frequency of non-word characters (anything that isn't a letter or a digit) over segments of varying size, arranged in something like a tree structure, and filter down through the segments until I find the ones with the highest non-word density, then cut away from there. Like with a binary tree.
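A rough sketch of the kind of recursive density check I have in mind (purely illustrative; the threshold and minimum segment size are made-up numbers):

```python
import re

NON_WORD = re.compile(r"[^A-Za-z0-9\s]")

def non_word_density(segment):
    """Fraction of characters that are neither letters, digits, nor whitespace."""
    if not segment:
        return 0.0
    return len(NON_WORD.findall(segment)) / len(segment)

def prune_junk(text, min_len=200, threshold=0.25):
    """Binary-tree style: split the text in half recursively and drop
    segments whose non-word density stays above the threshold."""
    if len(text) <= min_len:
        return "" if non_word_density(text) > threshold else text
    if non_word_density(text) <= threshold:
        return text  # looks like prose, keep the whole segment
    mid = len(text) // 2
    return prune_junk(text[:mid], min_len, threshold) + prune_junk(text[mid:], min_len, threshold)
```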
But that seems like a terrible idea, especially if the document is really long.
Traditional parsing looks like it would be very time-consuming to do, and most likely wouldn't work anyway, since I'm already using a third-party library that can't get it right.
The trick is that since the subject matter can be literally anything, I have to be careful not to cut relevant parts. But if the document is large, I can afford to lose a few words without the overall categorization process being affected.