How do I extract data from an HTML file?

Question

masterinex 0 Newbie Poster

15 Years Ago

Hi,
I want to write a program that extracts 'Rated PG for some scary moments and mild language' from the following HTML file and returns it as a list.

html file:
<div class="info">
<h5><a href="/mpaa">MPAA</a>:</h5>

<div class="info-content">
Rated PG for some scary moments and mild language. (also 2009 extended version)
</div>
</div>

Why doesn't this code work?
mpaaget = re.compile('<h5><a href="/mpaa">MPAA</a>:</h5><div class="info-content">(.*?)</div>')
mpaa = mpaaget.findall(htmlr)

html-css python


ghostdog74 57 Junior Poster

15 Years Ago

Use an HTML parser for this job, such as BeautifulSoup. If you don't want to, another way is to read the whole HTML, split it on "</div>", go through each element of the resulting list, and check for "<div class="info-content">"; if it is found, replace it with an empty string. You will get your string.
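The split-based idea can be sketched with plain string methods. This is only a rough illustration, not a robust HTML parser: the names `html`, `marker`, and `extract_info_content` are made up for the example, and instead of replacing the opening tag with an empty string it keeps only what follows the tag, so that the markup before it is dropped as well.

```python
# Sample input copied from the question.
html = '''<div class="info">
<h5><a href="/mpaa">MPAA</a>:</h5>

<div class="info-content">
Rated PG for some scary moments and mild language. (also 2009 extended version)
</div>
</div>'''

marker = '<div class="info-content">'


def extract_info_content(text):
    """Return the text of every info-content div, found by string splitting."""
    results = []
    for chunk in text.split('</div>'):
        if marker in chunk:
            # Keep only what follows the opening tag, then trim whitespace.
            content = chunk.split(marker, 1)[1].strip()
            results.append(content)
    return results


print(extract_info_content(html))
```

This breaks down as soon as the info-content div contains nested divs, which is one reason a real parser is usually the safer choice.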

ghostdog74 57 Junior Poster

15 Years Ago

If you want to use a regex, compile it with re.DOTALL so that '.' also matches newlines. re.M is only needed if you also use ^ or $ and want them to match at every line.
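As a small sketch of that suggestion, the snippet below applies re.DOTALL to the shorter pattern from the thread. The variable `htmlr` is assumed to hold the HTML from the question; the `.strip()` call is an addition to remove the newlines that the group now captures.

```python
import re

# Sample input copied from the question.
htmlr = '''<div class="info">
<h5><a href="/mpaa">MPAA</a>:</h5>

<div class="info-content">
Rated PG for some scary moments and mild language. (also 2009 extended version)
</div>
</div>'''

# Without re.DOTALL, '.' stops at the newline right after the opening tag.
plain = re.compile(r'<div class="info-content">(.*?)</div>')
without_flag = plain.findall(htmlr)

# With re.DOTALL, '.' matches newlines too, so the group spans several lines.
dotall = re.compile(r'<div class="info-content">(.*?)</div>', re.DOTALL)
with_flag = [text.strip() for text in dotall.findall(htmlr)]

print(without_flag)
print(with_flag)
```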

vegaseat 1,735 DaniWeb's Hypocrite

15 Years Ago

They are flags for re.compile():
re.MULTILINE (or re.M) ^ and $ match at the start and end of the string and of each line
re.DOTALL (or re.S) '.' matches any character, including a newline
re.IGNORECASE (or re.I) case-insensitive matching
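A short sketch of what each flag changes, using a made-up two-line string (`text` and the result names are only for illustration):

```python
import re

text = "first line\nSecond LINE"

# re.MULTILINE: ^ also matches right after each newline.
start_default = re.findall(r'^\w+', text)
start_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# re.DOTALL: '.' is allowed to match the newline between the two lines.
dot_default = re.search(r'line.Second', text)
dot_dotall = re.search(r'line.Second', text, re.DOTALL)

# re.IGNORECASE: 'line' also matches 'LINE'.
case_default = re.findall(r'line', text)
case_ignore = re.findall(r'line', text, re.IGNORECASE)

print(start_default, start_multiline)
print(dot_default is None, dot_dotall is not None)
print(case_default, case_ignore)
```

Flags can be combined with the | operator, for example re.compile(pattern, re.DOTALL | re.IGNORECASE).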



masterinex 0 Newbie Poster · Answer 1 · 2009-12-23T06:28:42+00:00

Hey, I tried it with
mpaaget = re.compile('<div class="info-content">(.*?)</div>')
but then I got something else. Could it be because there is a newline after <div class="info-content">? How do I take care of that?

<div class="info-content">
Rated PG for some scary moments and mild language. (also 2009 extended version)
</div>

masterinex 0 Newbie Poster · Answer 2 · 2009-12-23T06:37:11+00:00

Oh, I get it now, thanks for pointing it out.

jlm699 320 Veteran Poster · Answer 3 · 2009-12-23T06:44:34+00:00

Hey, I tried it with
mpaaget = re.compile('<div class="info-content">(.*?)</div>')
but then I got something else. Could it be because there is a newline after <div class="info-content">? How do I take care of that?

Yes, the whitespace between the tags is not accounted for in your regular expression. Modify it like so to match zero or more (*) whitespace characters (\s):

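>>> import re
>>> # h is assumed to hold the HTML text from the question
>>> m = re.compile(r'<h5><a href="/mpaa">MPAA</a>:</h5>\s*<div class="info-content">\s*(.*?)\s*</div>')
>>> m.findall(h)
['Rated PG for some scary moments and mild language. (also 2009 extended version)']
>>> m.match(h)  # match() only tries at the very start of h, so it returns None
>>>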
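To make the \s* part concrete: \s is a character class that covers spaces, tabs, and newlines, not just \n, and the * lets it match any run of them, including none at all. A minimal sketch with made-up strings:

```python
import re

# \s matches each kind of whitespace character individually.
whitespace_found = re.findall(r'\s', 'a b\tc\nd')

# \s* matches zero or more whitespace characters of any kind.
no_gap = re.fullmatch(r'a\s*b', 'ab') is not None
mixed_gap = re.fullmatch(r'a\s*b', 'a \n\t b') is not None

# A literal \n matches exactly one newline and nothing else.
newline_only = re.fullmatch(r'a\nb', 'a \nb') is not None

print(whitespace_found)
print(no_gap, mixed_gap, newline_only)
```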

masterinex 0 Newbie Poster · Answer 4 · 2009-12-23T07:14:09+00:00

Yeah, that was the problem, thanks for pointing it out again.
Looks like \n and \s* are the same character.

I have another question. Let's say I want to extract the number 7.2 from the HTML string below:

<a href="/ratings_explained">weighted average</a> vote of <a href="/List?ratings=7">7.2</a> / 10</p><p>

How come this doesn't work?

averageget = re.compile('<a href="/List?ratings=7">(.*?)</a>')
average = averageget.findall(htmlr)

Could it be that there are some special structures in the HTML file again that I missed?

jlm699 320 Veteran Poster · Answer 5 · 2009-12-24T01:04:18+00:00

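This time it's because '?' is a special character in regular expressions (you're already using it inside your group, where it makes .* non-greedy). After an ordinary character, the question mark makes that character optional (zero or one occurrence), whereas the asterisk (*) is a greedy match of zero or more. So 'List?ratings' in your pattern means 'Lis', an optional 't', then 'ratings', and it never matches a literal '?'. To match the question mark character itself you need to escape it in your regex like so: \? . The full regular expression then becomes: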

>>> import re
>>> # t is assumed to hold the HTML snippet from the question
>>> c = re.compile(r'<a href="/List\?ratings=7">(.*?)</a>')
>>> c.findall(t)
['7.2']
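For a version that runs as a script, here is a minimal sketch. The variable `t` is assigned the HTML snippet from the question, and the last part uses re.escape as an optional alternative to escaping by hand (that part is an addition, not something proposed in the thread).

```python
import re

# t holds the HTML snippet from the question.
t = ('<a href="/ratings_explained">weighted average</a> vote of '
     '<a href="/List?ratings=7">7.2</a> / 10</p><p>')

# Unescaped: '?' makes the 't' optional, so the literal '?' is never matched.
unescaped = re.compile(r'<a href="/List?ratings=7">(.*?)</a>')
unescaped_result = unescaped.findall(t)

# Escaped by hand: \? matches a literal question mark.
escaped = re.compile(r'<a href="/List\?ratings=7">(.*?)</a>')
escaped_result = escaped.findall(t)

# re.escape() escapes every special character in a literal string for us.
prefix = re.escape('<a href="/List?ratings=7">')
built = re.compile(prefix + r'(.*?)</a>')
built_result = built.findall(t)

print(unescaped_result)
print(escaped_result)
print(built_result)
```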

masterinex 0 Newbie Poster · Answer 6 · 2009-12-24T01:45:26+00:00

Oh, I see, so it's the '?' that is causing the trouble.
What is t, by the way? Do I need to assign a value to it?

masterinex 0 Newbie Poster · Answer 7 · 2009-12-25T01:32:50+00:00

I'm a little unfamiliar with Python. What are re.DOTALL and re.M? Are they modules?
