I don't understand the documentation

Your question doesn't match your title. Are you saying that you don't know how to use it, or that you know how to use it but want to make it faster?

BeautifulSoup is a third-party module for Python 2 that lets you work with even badly coded HTML. What do you want to do with it?
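For instance, it will quietly repair sloppy markup for you. A minimal sketch (BeautifulSoup 3, the broken HTML is made up just for illustration) ...

from BeautifulSoup import BeautifulSoup

# unclosed <b> and <i> tags, no <html> or <body> wrapper
sloppy = "<b>bold <i>bold italic"
soup = BeautifulSoup(sloppy)
# prettify() shows the tags properly nested and closed
print soup.prettify()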

If you have very large HTML documents, you have the option of parsing only selected parts of the document. Here is an example (Python 2 code) ...

import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

html = urllib.urlopen("http://python.org").read()

# parse only the <a> tags
a_tag = SoupStrainer('a')
# collect the parsed tags into a list
a_tags = [tag for tag in BeautifulSoup(html, parseOnlyThese=a_tag)]

# show all the <a> tag lines
for line in a_tags:
    print(line)

If you use Python 2, you can also try the psyco module from:
http://psyco.sourceforge.net/
Psyco is a JIT compiler that translates Python bytecode into native i386 machine code, giving speed improvements of roughly 3 to 10 fold.
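If I remember correctly, enabling it takes just a couple of lines at the top of your script; a minimal sketch, assuming psyco is installed on a 32-bit Python 2 build ...

# enable psyco if it is available; the script still runs without it
try:
    import psyco
    psyco.full()  # JIT-compile all functions in the program
except ImportError:
    pass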

Recoding never hurts. Usually the first thing I do to optimize code is replace all range() calls with xrange(). Then I look for loops that can be replaced with lambda, then for if statements that can be changed to elif so unnecessary tests are skipped, then for regex functions that can be replaced with faster string functions. Finally, I see if I can compact the code into fewer lines and characters, making it sleeker and more compact. Two of these tweaks are sketched below.
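A minimal Python 2 sketch of two of those tweaks; the numbers and branches are made up just for illustration ...

# xrange() yields numbers lazily instead of building the whole list first
total = 0
for i in xrange(1000000):
    total += i

# an elif chain stops testing once a branch matches, while a series of
# separate if statements evaluates every condition
n = 7
if n < 5:
    size = 'small'
elif n < 10:
    size = 'medium'
else:
    size = 'large'
print total, size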

This line returns a list instead of a BeautifulSoup object, so I cannot use findAll() on it ...

a_tags = [tag for tag in BeautifulSoup(html, parseOnlyThese=a_tag)]

Then you do something like this ...

# search the http://python.org html code for all the
# <a> tag lines whose title contains the word Python

import re
import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

html = urllib.urlopen("http://python.org").read()

# parse only the <a> tags
a_tag = SoupStrainer('a')

html_atag = BeautifulSoup(html, parseOnlyThese=a_tag)

# find all <a> tags whose title attribute contains the word "Python"
title_py = html_atag.findAll(attrs={'title': re.compile("Python")})
for line in title_py:
    print(line)

Parsing a page with 8000+ URLs with BeautifulSoup

This is the page:

http://www.thehindubusinessline.com/cgi-bin/bl2002.pl?mainclass=03

This is my code:

from urllib2 import URLError,urlopen
import re
from BeautifulSoup import BeautifulSoup, SoupStrainer

def gethtml(address):
    try:
        raw = urlopen(address)
        raw = raw.read()
    except URLError:
        raw = 'Error occurred'
    return raw


dat = gethtml("http://www.thehindubusinessline.com/cgi-bin/bl2002.pl?mainclass=03")
print 'got html'
a_tag = SoupStrainer('a')
html_atag = BeautifulSoup(dat, parseOnlyThese=a_tag)
print 'soup done'
linklist = html_atag.findAll('a', href=re.compile(r'stories'))

The last step, findAll(), takes forever. Is there any other way to do it faster?

Thanks.

I do not know much about this BeautifulSoup, and your URL did not work for me, but I tried your pickup of data with another URL, filtering the <a> tags with partition() on the New York Times front page.

It seemed fast enough for me.

from urllib2 import URLError, urlopen
import re

def gethtml(address):
    try:
        raw = urlopen(address)
        raw = raw.read()
    except URLError:
        raw = 'Error occurred'
    return raw

## your url did not work for me, so I put in one that does in order to test your program
dat = gethtml("http://www.nytimes.com/")
print 'got html'
print 'length', len(dat)
##a_tag = SoupStrainer('a')
##html_atag = BeautifulSoup(dat, parseOnlyThese=a_tag)
##print 'soup done'
##linklist = html_atag.findAll('a', href=re.compile(r'stories'))
print 'simple filter with partition'
print '-' * 80
rest = dat
find = ' '
# walk the document one <a>...</a> element at a time; partition() returns
# (before, separator, after), and the separator is an empty string when
# nothing more is found, which ends the loop
while find:
    start, find1, rest = rest.partition('<a')
    i, find, rest = rest.partition('</a>')
    if 'india' in i.lower() or 'asia' in i.lower():
        print find1 + i + find
print '-' * 80
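If you would rather stay with BeautifulSoup, it might also be worth moving the href filter into the SoupStrainer itself, since the strainer accepts the same filters as findAll; that way the regex runs once during parsing instead of in a second pass. An untested sketch of that idea ...

import re
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup, SoupStrainer

dat = urlopen("http://www.nytimes.com/").read()

# only <a> tags whose href matches the regex are parsed at all
stories = SoupStrainer('a', href=re.compile(r'stories'))
linklist = BeautifulSoup(dat, parseOnlyThese=stories).findAll('a')
print 'found', len(linklist), 'links'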

That website is closed for 2 hours every day, from 2:30 am to 4:30 am, for updating the news.

Thanks tonyjv, you really went the extra mile there. That was very informative. I am very new to Python (less than a week) and I am learning through hacking and duct-taping for now, and it's going great. BeautifulSoup is awesome for newbies.
There was a minor problem with my code; vegaseat's code cleared it up.

For now I am going to stick with BeautifulSoup. After this project I will dive into hardcore Python.
