parsing question.

Question

Bluerain 0 Newbie Poster

14 Years Ago

I'm trying to extract the URLs from the text below, without the surrounding HTML tags.
Is there any way I can get just the parts starting with (http://) and ending with (")?

<a href="http://www.gumtree.sg/?ChangeLocation=Y" rel="nofollow">Singapore</a>, <a href="http://www.gumtree.com.au/?ChangeLocation=Y" rel="nofollow">Australia</a>, <a href="http://www.gumtree.co.nz/?ChangeLocation=Y" rel="nofollow">New Zealand</a>, <a href="http://www.gumtree.com" rel="nofollow">England</a>, <a href="http://edinburgh.gumtree.com" rel="nofollow">Scotland</a>, <a href="http://cardiff.gumtree.com" rel="nofollow">Wales</a>, <a href="http://www.gumtree.ie" rel="nofollow">Ireland</a>, <a

Just looking for the simplest way. Thanks, this community is excellent.
I'm using Python 2.6.
(I'll store the HTML in a .txt file.)


All 6 Replies

cghtkh 9 Junior Poster

14 Years Ago

line = '''<a href="http://www.gumtree.sg/?ChangeLocation=Y" rel="nofollow">Singapore</a>'''

startpos = line.find('http')
endpos = line.find('"', startpos)  # closing quote of the href value, not the tag's '>'
print line[startpos:endpos]
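To pull every URL out of a longer string rather than only the first one, a regular expression is one option. This is a minimal sketch of the literal request (start at http://, stop at the next double quote), assuming every URL sits inside a double-quoted attribute; the sample text is a shortened copy of the HTML from the question. print(url) with a single argument runs on Python 2.6 and on Python 3.

```python
import re

# Sample text copied from the question (shortened to three links).
text = ('<a href="http://www.gumtree.sg/?ChangeLocation=Y" rel="nofollow">Singapore</a>, '
        '<a href="http://www.gumtree.com" rel="nofollow">England</a>, '
        '<a href="http://www.gumtree.ie" rel="nofollow">Ireland</a>')

# Match "http://" followed by every character up to,
# but not including, the next double quote.
urls = re.findall(r'http://[^"]+', text)

for url in urls:
    print(url)
```

This is fine for tidy input like the above, but it will also pick up any http:// text outside of href attributes, so a real parser is safer for arbitrary pages.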

snippsat 661 Master Poster

14 Years Ago

Use a parser; BeautifulSoup is a good one.

from BeautifulSoup import BeautifulSoup

html = '''\
<a href="http://www.gumtree.sg/?ChangeLocation=Y" rel="nofollow">Singapore</a>,
<a href="http://www.gumtree.com.au/?ChangeLocation=Y" rel="nofollow">Australia</a>,
<a href="http://www.gumtree.co.nz/?ChangeLocation=Y" rel="nofollow">New Zealand</a>,
<a href="http://www.gumtree.com" rel="nofollow">England</a>, <a href="http://edinburgh.gumtree.com" rel="nofollow">Scotland</a>,
<a href="http://cardiff.gumtree.com" rel="nofollow">Wales</a>,
<a href="http://www.gumtree.ie" rel="nofollow">Ireland</a>, <a>
'''

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
for link in links:
    print link['href']
    
''' output-->
http://www.gumtree.sg/?ChangeLocation=Y
http://www.gumtree.com.au/?ChangeLocation=Y
http://www.gumtree.co.nz/?ChangeLocation=Y
http://www.gumtree.com
http://edinburgh.gumtree.com
http://cardiff.gumtree.com
http://www.gumtree.ie
'''

Gribouillis commented: very simple +4
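If installing BeautifulSoup is not an option, the parser that ships with Python can do the same job with a bit more code. A rough sketch, not taken from the thread: HrefCollector is a made-up helper name, and the try/except import is only there so the same file runs on Python 2.6 and Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 (e.g. 2.6)


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)  # explicit base call, works on Python 2 and 3
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


# Shortened copy of the HTML from the question, including the bare <a> at the end.
html = ('<a href="http://www.gumtree.com" rel="nofollow">England</a>, '
        '<a href="http://www.gumtree.ie" rel="nofollow">Ireland</a>, <a>')

collector = HrefCollector()
collector.feed(html)
collector.close()
print(collector.hrefs)
```

The bare <a> has no href, so it is skipped, which mirrors what href=True does in the findAll call.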


Bluerain 0 Newbie Poster · Answer 1 · 2010-10-20T21:39:34+00:00

Thank you, you guys are badass.
Going to use BeautifulSoup.

Thanks again

Bluerain 0 Newbie Poster · Answer 2 · 2010-10-20T21:44:45+00:00

Again I can't say how much that helped.
Here's how I'm using it to get all the links.

import urllib2,sys
from BeautifulSoup import BeautifulSoup
import re
adress = sys.argv
html = urllib2.urlopen('http://www.mylinkwenthere.com')
soup = BeautifulSoup(html)

cost = soup.findAll('a')
for link in cost:
    print link

aml25 0 Newbie Poster · Answer 3 · 2010-11-03T05:07:11+00:00

Can someone please explain the

for link in links:
    print link

part of the code?

I am trying to store each of the links found into a list but am unable to.

Thank you
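On the loop itself: for link in links walks through the results one at a time and binds each one to the name link, so storing them is a matter of appending inside the loop instead of printing. A small sketch of that idea; to keep it runnable without BeautifulSoup, the links list here is stand-in data made of plain dicts (an assumption for illustration), which support the same link['href'] lookup as the tags.

```python
# Stand-in data: plain dicts instead of BeautifulSoup tags, so the example
# runs without the library. Both support link['href'] lookups.
links = [
    {'href': 'http://www.gumtree.com'},
    {'href': 'http://edinburgh.gumtree.com'},
    {'href': 'http://cardiff.gumtree.com'},
]

# The loop visits each element of links in turn and binds it to "link".
found = []
for link in links:
    found.append(link['href'])   # store the URL instead of printing it

# The same thing written as a list comprehension.
found_short = [link['href'] for link in links]

print(found)
```

With the code from earlier in the thread, links would come from soup.findAll('a', href=True) and the rest stays the same.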

aml25 0 Newbie Poster · Answer 4 · 2010-11-03T05:53:36+00:00

Never mind, got it. Thanks a bunch.
