HTML data parsing

Question

sravi.pearl 0 Newbie Poster

13 Years Ago

Hi, I am new to Python. Can anyone please help me parse data from an HTML file?
I want to display the content that lies under a particular tag. Also, can you please tell me where I can find tutorials on this topic with sample examples?


griswolf 304 Veteran Poster

13 Years Ago

Look here for Python documentation: http://docs.python.org/
And here for various parsers: http://docs.python.org/library/markup.html

griswolf 304 Veteran Poster

13 Years Ago

http://docs.python.org/library/htmlparser.html#example-html-parser-application
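
For readers who want to see the shape of that approach, here is a minimal sketch; it is not copied from the linked documentation. It assumes Python 3, where the module is named html.parser (in Python 2 it was HTMLParser). The class name TagTextCollector and the SAMPLE_HTML page are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.depth = 0      # > 0 while we are inside the wanted tag
        self.chunks = []    # text pieces of the element being read
        self.results = []   # one string per completed element

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.wanted_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        # Text is only kept while we are inside the wanted tag.
        if self.depth > 0:
            self.chunks.append(data)


# Hypothetical sample page, made up for this sketch.
SAMPLE_HTML = """
<html><body>
<h2>What do I need to create HTML?</h2>
<p>You don't need any special equipment.</p>
<p>Current price: <strong>$6.36</strong></p>
</body></html>
"""

collector = TagTextCollector("strong")
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.results)  # ['$6.36']
```

Passing "h2" instead of "strong" would collect the heading text in the same way. This only gathers text inside the tag itself; collecting what follows a heading needs extra state.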

snippsat 661 Master Poster

13 Years Ago

Python has two very good third-party parsers, BeautifulSoup and lxml.
These parsers can handle malformed HTML, which can be important.
Here is an example with BeautifulSoup.
We want the price of beans from this site:
http://beans.itcarlow.ie/prices.html

from BeautifulSoup import BeautifulSoup
import urllib2

#Read in website
url = urllib2.urlopen('http://beans.itcarlow.ie/prices.html')
soup = BeautifulSoup(url)
print soup #website contents

tag = soup.findAll('strong') #Find strong tag
print tag                    #[<strong>$6.36</strong>]
print tag[0].string          #Print out info we want "$6.36"

Firebug is a good tool for navigating the source code of a website.
http://getfirebug.com/


sravi.pearl 0 Newbie Poster · Answer 1 · 2010-09-28T01:26:15+00:00

Thanks for your reply. Can you please provide sample working code so that I can understand clearly?

sravi.pearl 0 Newbie Poster · Answer 2 · 2010-09-28T09:20:23+00:00

Can I do this without using BeautifulSoup?
I heard that I have to use regular expressions to search for the tag and display the text in between.

sravi.pearl 0 Newbie Poster · Answer 3 · 2010-09-28T09:29:54+00:00

Please give me a sample program that works. For example, consider this site:

http://www.quackit.com/html/tutorial/introduction.cfm

After saving it as an HTML file, I should write a program that displays the content under "What do I need to create HTML?". If you look closely at that site, you will find this heading.

Requirement: the program has to display only the following:

You don't need any special equipment or software to create HTML. In fact, you probably already have everything you need. Here is what you need:

•Computer
•Text or HTML editor. Most computers already have a text editor and you can easily create HTML files using a text editor. Having said that, there are definite benefits to be gained in downloading an HTML editor.
If you want the best HTML editor, and you don't mind paying money for it, you can't go past Adobe Dreamweaver. Dreamweaver is probably the best HTML editor available, and you can download a trial version for starters.

If you don't have the cash to purchase an editor, you can always download a free one. Examples include SeaMonkey, Coffee Cup (Windows) and TextPad (Windows).

If you don't have an HTML editor, and you don't want to download one just now, a text editor is fine. Most computers already have a text editor. Examples of text editors include Notepad (for Windows), Pico (for Linux), or Simpletext/Text Edit/Text Wrangler (Mac).

•Web Browser. For example, Internet Explorer or Firefox.

So for this kind of task, I think I have to write a regular expression to find that particular tag and display that particular data. But I don't know how to do this. Please help me, guys!

griswolf 304 Veteran Poster · Answer 4 · 2010-09-28T09:45:10+00:00

If you insist, you can look for just the particular tag using regex, but that is the hard way.

If you really have a one-shot parsing need that looks for a particular "<h2>", then reads until the next one, you can just look in each line for "<h2>" and "</h2>" and use a simple state machine to copy all the lines between the particular "</h2>" and the next "<h2>". This is easier than regex, and may even be faster.

If you need to be able to parse HTML in general, with this case as a particular example, then look here (again) for the easy way that does not use BeautifulSoup: http://docs.python.org/library/htmlparser.html This really is (one of) the right way(s) to do what you want.

And no, it is not our job to provide your code.

sravi.pearl 0 Newbie Poster · Answer 5 · 2010-09-28T09:53:51+00:00

Thanks for your reply.

Sorry, I don't need code; I just need a sample example using regular expressions, because I am supposed to follow only that procedure. Yes, I already had a look at that site, but it only shows the methods I have to use.

snippsat 661 Master Poster · Answer 6 · 2010-09-28T11:21:40+00:00

Read this post by bobince on why regex is a very bad idea when it comes to parsing (X)HTML.
It is one of the best answers on Stack Overflow.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

For a small website it is possible to use only regex.
But the better way is to use regex in combination with BeautifulSoup or lxml.
I use these two parsers because they are the best at parsing HTML.
HTMLParser will break if the HTML is a little malformed, and very few sites have perfect HTML.
Only around 5% of all websites on the internet have 100% valid HTML.

Here I get the price with just regex.

import urllib2
import re

url = urllib2.urlopen('http://beans.itcarlow.ie/prices.html').read()
print re.search(r'\$\d.+\d', url).group()  #$5.30

sravi.pearl 0 Newbie Poster · Answer 7 · 2010-09-28T11:33:55+00:00

Thank you so much.
Can you please explain this line?

print re.search(r'\$\d.+\d', url).group() #$5.30

snippsat 661 Master Poster · Answer 8 · 2010-09-28T11:50:53+00:00

You have to read about regex; there is no short way to explain it.
http://docs.python.org/library/re.html

Here is one more example, with a text from which I take out a price.
\$ matches a literal $.
\d matches any digit.
. matches any character except a newline.

import re

text = '''\
Hi this is a string with a price $5.30.
Text has also a number we dont want $55.99
'''

test_match = re.findall(r'\$\d.\d+', text)
print test_match  # ['$5.30']
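
The pattern pieces listed above can be tried one at a time. The following is a hedged sketch, assuming Python 3 (so print is a function). The stricter pattern and the sentence fed to the greedy search are additions for illustration and do not come from the posts.

```python
import re

text = (
    "Hi this is a string with a price $5.30.\n"
    "Text has also a number we dont want $55.99\n"
)

# \$   a literal dollar sign
# \d   exactly one digit
# .    any single character except a newline (here it happens to be the dot)
# \d+  one or more digits
loose = re.findall(r"\$\d.\d+", text)

# Escaping the dot and allowing several leading digits matches both prices.
strict = re.findall(r"\$\d+\.\d{2}", text)

# A greedy ".+" runs on to the last digit it can reach on the line,
# so it can swallow far more than the price.
greedy = re.search(r"\$\d.+\d", "beans $5.30, was $6.36 on day 2").group()

print(loose)   # ['$5.30']
print(strict)  # ['$5.30', '$55.99']
print(greedy)  # $5.30, was $6.36 on day 2
```

The loose pattern skips $55.99 only because the unescaped dot consumes the second 5 and the next character is then not a digit. That is a side effect of the pattern rather than a deliberate filter.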

griswolf 304 Veteran Poster · Answer 9 · 2010-09-28T11:53:10+00:00

You need to think about what regex patterns you will need; and you need to think about the fact that opening tags may not appear on the same line as their closing tags. Regex information is here: http://docs.python.org/library/re.html There are code snippets at that URL. Read the whole page (or at least scan it and read the interesting parts). Look for the difference between search and find functions, and for a way to split lines at a particular regex occurrence.

The general idea is that you read the file one line at a time, you parse each line looking for the next thing that is needed (which changes depending on what you have already seen). At some point, you begin to collect (partial) lines until you find another regex hit for the ending line, when you may store a partial line, then break out of the read/parse loop and display/write the data you were looking for.
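
The read-and-parse loop described above can be sketched as a small state machine. This is one possible reading of the idea, not griswolf's own code, and it assumes Python 3. The function names, the state names, and the SAMPLE_HTML text are hypothetical; a real saved page would be read with open() instead of splitlines(). It assumes each tag sits on a single line, although a heading's text may span several lines.

```python
import re

H2_OPEN = re.compile(r"<h2\b[^>]*>", re.IGNORECASE)
H2_CLOSE = re.compile(r"</h2\s*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def clean(fragments):
    """Strip tags from the collected fragments and drop empty lines."""
    lines = []
    for fragment in fragments:
        text = " ".join(ANY_TAG.sub(" ", fragment).split())
        if text:
            lines.append(text)
    return "\n".join(lines)


def section_after_heading(lines, wanted_heading):
    """Return the text between the wanted <h2> heading and the next <h2>."""
    state = "seek"  # "seek" -> "heading" -> "collect"
    heading = []
    collected = []
    for line in lines:
        rest = line
        while rest:
            if state == "seek":
                hit = H2_OPEN.search(rest)
                if not hit:
                    break
                rest = rest[hit.end():]
                heading = []
                state = "heading"
            elif state == "heading":
                hit = H2_CLOSE.search(rest)
                if not hit:
                    heading.append(rest)  # heading continues on next line
                    break
                heading.append(rest[:hit.start()])
                rest = rest[hit.end():]
                title = clean([" ".join(heading)])
                state = "collect" if title == wanted_heading else "seek"
            else:  # state == "collect"
                hit = H2_OPEN.search(rest)
                if not hit:
                    collected.append(rest)
                    break
                collected.append(rest[:hit.start()])  # keep the partial line
                return clean(collected)
    return clean(collected)


# Hypothetical saved page, made up for this sketch.
SAMPLE_HTML = """\
<h2>What is HTML?</h2>
<p>HTML is a markup language.</p>
<h2>What do I need
to create HTML?</h2>
<p>You don't need any special equipment.</p>
<ul>
<li>Computer</li>
<li>Web Browser</li>
</ul>
<h2>Do I need to be online?</h2>
<p>No.</p>
"""

result = section_after_heading(
    SAMPLE_HTML.splitlines(), "What do I need to create HTML?"
)
print(result)
```

On the sample this prints the paragraph and the two list items, one per line. Real pages with nested headings, comments, or tags broken across lines would need more states, which is where a proper parser becomes the easier route.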