Python Regular Expression Help

Question

debasishgang7 0 Junior Poster in Training

13 Years Ago

Hi all,

I want to extract a certain link from a web page using a Python regular expression.

The scenario is like this..

The code:

blah...
...
....
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;"
....
blah
blah
blah

I want to extract the URL "http://www.test.com/file.ext" from the page using a Python regular expression.

Thanks in advance!

python regex



snippsat 661 Master Poster

13 Years Ago

Read this.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

So regex is not the right tool when it comes to HTML/XML.
There is a reason why parsers exist; Python has two very good ones, lxml and BeautifulSoup.

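# BeautifulSoup 4 import (the original post used the older BeautifulSoup 3 module)
from bs4 import BeautifulSoup

html = """\
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">
"""

# Parse the page and take the first <div> in the document
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('div')
print(tag['src'])
#--> http://www.test.com/file.ext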

So in a larger page your search would be more specific, something like this.

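# BeautifulSoup 4 import (the original post used the older BeautifulSoup 3 module)
from bs4 import BeautifulSoup

html = """\
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">
"""

# Restrict the search to <div> tags whose class is "test"
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find_all('div', {'class': 'test'})
print(tag[0]['src'])
#--> http://www.test.com/file.ext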
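
If installing a third-party parser is not an option, roughly the same lookup can be sketched with the standard library's html.parser module. This is only an illustration, not part of the original answer: the names DivSrcCollector and find_div_sources are made up for this sketch, and it returns an empty list instead of raising when nothing matches.

```python
from html.parser import HTMLParser


class DivSrcCollector(HTMLParser):
    """Collect the src attribute of every <div> carrying a given class (hypothetical helper)."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        if tag != "div":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.wanted_class in classes and attributes.get("src"):
            self.sources.append(attributes["src"])


def find_div_sources(html, wanted_class="test"):
    """Return the src values of all matching <div> tags, or [] if there are none."""
    parser = DivSrcCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.sources


html = '<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">'
sources = find_div_sources(html)
# Check for an empty result before indexing, so a miss does not raise IndexError.
print(sources[0] if sources else "no matching div found")
```

Note that a parser only reports real tags: markup sitting inside an HTML comment is handed over as comment text, so it would not show up in the result list here.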


Gribouillis 1,391 Programming Explorer

13 Years Ago

Here is the solution.



debasishgang7 0 Junior Poster in Training · Answer 1 · 2012-01-14T21:13:36+00:00

Well, thanks for your suggestion, but in this case it's not working. I am getting an "IndexError: list index out of range" error. Maybe it's because I am trying it on a huge page. One more thing: that part of the HTML code is inactive, meaning it is between these tags.
I would be very thankful if you could solve this with a regular expression that extracts the URL between <div class="test" src=" and " style="top:0px

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 2 · 2012-01-14T22:14:15+00:00

What have you tried? It looks like a simple match between 'start and end tags'.

snippsat 661 Master Poster · Answer 3 · 2012-01-15T02:02:27+00:00

Well, thanks for your suggestion, but in this case it's not working. I am getting an "IndexError: list index out of range" error. Maybe it's because I am trying it on a huge page. One more thing: that part of the HTML code is inactive, meaning it is between these tags.

That may be because you are making an error; it's impossible to say without seeing some code.
Regex ... no, but here is something you can look at.

>>> import re
>>> re.findall(r'class="test" src="(.*?)"', html)
['http://www.test.com/file.ext']
>>> ''.join(re.findall(r'class="test" src="(.*?)"', html))
'http://www.test.com/file.ext'
>>>
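
For a script rather than an interactive session, the same idea might be wrapped in a small function. This is a rough sketch, not the poster's code: extract_url, START and END are names invented here, the two delimiters are the literal strings from the question, and the sample page assumes (this is a guess) that the "inactive" markup is inside an HTML comment. Because a regex works on the raw text, it does not care whether the div is commented out, and re.search returns None on a miss instead of raising IndexError.

```python
import re

# Literal delimiters taken from the question; re.escape keeps them from being read as regex syntax.
START = '<div class="test" src="'
END = '" style="top:0px'
# Capture everything up to the next double quote, so a match cannot run past the attribute.
URL_BETWEEN = re.compile(re.escape(START) + r'([^"]*)' + re.escape(END))


def extract_url(page):
    """Return the first URL found between START and END, or None if there is no match."""
    match = URL_BETWEEN.search(page)
    return match.group(1) if match else None


# Assumed sample: the div is wrapped in an HTML comment.
page = """
blah
<!-- <div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;"> -->
blah
"""

print(extract_url(page))           # http://www.test.com/file.ext
print(extract_url("no div here"))  # None
```

This stays tied to the exact attribute order and quoting in the sample, so it will miss the tag if the page writes its attributes differently.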