How to extract data from an HTML document

Question

akie2741 0 Newbie Poster

14 Years Ago

How can I extract a personal address from an HTML file?
After I get the source of the HTML file with the read() method, what pattern should I look for if I want to extract the address?
Currently I plan to use the compile() method to build a pattern that matches the address, but what rule should I use for the address?

E.g. Unit 1,1 King St,Sydney NSW 2123

python

jice 53 Posting Whiz in Training

14 Years Ago

This may work:

import re
datas = open("file.html").read()  # whole file as one string

# Non-greedy match of everything inside an <address> element
expr = re.compile("<address>(.*?)</address>")
for match in expr.findall(datas):
    print(match)
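
One caveat with a pattern like the one above, sketched here under the assumption that the page really wraps the address in the standard HTML `<address>` element: `.` does not match newlines by default, so an address spread over several lines is missed unless `re.DOTALL` is set. The sample HTML and the `find_addresses` name are made up for illustration, and the code uses Python 3 syntax.

```python
import re

# Hypothetical sample; a real page would be read from a file or URL.
sample_html = """<html><body>
<address>
Unit 1, 1 King St,
Sydney NSW 2123
</address>
</body></html>"""

# re.DOTALL lets "." match newlines; re.IGNORECASE tolerates <ADDRESS>.
address_re = re.compile(r"<address[^>]*>(.*?)</address>",
                        re.DOTALL | re.IGNORECASE)

def find_addresses(html):
    """Return the text of every <address> element, whitespace-normalised."""
    return [" ".join(m.split()) for m in address_re.findall(html)]

print(find_addresses(sample_html))
```

Many pages do not use `<address>` at all, so an empty result here says nothing about whether the page contains an address.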

jice 53 Posting Whiz in Training

14 Years Ago

It's easier with a clear request ;-)
Here, there is no easy pattern to isolate the address.
You'll need to look at the source of the HTML page to find elements that can help you identify the address.
Here, for example, I'd look for "Permanent Address" and take the text from the following line up to the next </tr> (everything between the two is the address).
Then you just have to clean the text by removing all the tags.

import re
infile = "personal-details.htm"
patternIN = "Permanent Address"  # Marker after which we start keeping text
patternOUT = "</tr>"  # Marker at which we stop keeping text
keepText = False  # Are we currently keeping lines?
address = ""  # Accumulates the raw address markup
# Read the file line by line and keep the text between the two markers
for line in open(infile):
    if keepText:
        address += line.strip()  # Store the line without surrounding whitespace
        if patternOUT in line:  # Lines after this one are no longer kept
            keepText = False
    if patternIN in line:  # Start keeping text from the next line
        keepText = True

# Now clean up what was collected
rTags = re.compile("<.*?>")  # Regexp that matches any tag
# Replace each tag with ":" (any placeholder would do; pick another one
# if the address itself may contain ":")
address = rTags.sub(":", address)
rSep = re.compile(":+")  # Collapse each run of ":" into a single newline
address = rSep.sub("\n", address)
print(address)
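
The same marker-based idea can also be wrapped in a function that works on a string, which makes it easier to try out. This is only a sketch in Python 3 syntax, and a variation rather than a line-for-line port: it starts right after the marker instead of on the next line. `extract_between` is a hypothetical helper name, and the sample HTML is invented, not copied from the real page.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # matches any HTML tag

def extract_between(html, start_marker, end_marker="</tr>"):
    """Return the text lines found between start_marker and end_marker."""
    start = html.find(start_marker)
    if start == -1:
        return []  # marker not present on this page
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        end = len(html)  # no end marker: keep everything that follows
    # Turn every tag into a line break, then drop the empty pieces.
    pieces = TAG_RE.sub("\n", html[start:end]).splitlines()
    cleaned = [p.strip(" \t:") for p in pieces]
    return [c for c in cleaned if c]

# Invented sample laid out like a table row with <br> separators.
sample_html = """<tr><td>Permanent Address :</td>
<td>B. Bandyopadhyay,<br>
P.O. Kirnahar 731302,<br>
Dist. Birbhum,<br>
West Bengal, INDIA</td></tr>"""

for part in extract_between(sample_html, "Permanent Address"):
    print(part)
```

Entities such as `&nbsp;` are left untouched by this sketch; they would need a separate clean-up step.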

akie2741 0 Newbie Poster · Answer 1 · 2009-09-24T19:11:28+00:00

Thanks, but it doesn't work for some HTML files.

If I want to extract the address from this HTML document: http://www.sc.iitb.ac.in/~bijnan/personal-details.htm
where it appears as:
Permanent Address :
B. Bandyopadhyay,
P.O. Kirnahar 731302,
Dist. Birbhum,
West Bengal, INDIA

what pattern should I use to match it?

akie2741 0 Newbie Poster · Answer 2 · 2009-09-24T21:35:33+00:00

Thanks very much, that is very helpful.
I have another question: if the address does not begin with "Permanent Address" and does not end with </tr>, this program cannot be used, so it only works in this one situation.
E.g. if the address begins with something like Location, Home, Live in, etc., how can I extract those addresses? Is it possible to write a program that finds and extracts the contact address from any website?

jice 53 Posting Whiz in Training · Answer 3 · 2009-09-24T22:00:29+00:00

I'm afraid not:
in HTML pages, there is no way to know where the address is: there is no special tag or anything like that you can count on. So you have to look at the sites you want to process and work out how you can identify the address.
I'd even say that sometimes it may be impossible (for example, without "Permanent Address", the example you gave me would have been very difficult to process).
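
A rough way to stretch the approach over a few more pages is to try several candidate labels in turn. This is a heuristic sketch only, in Python 3 syntax: the label list, the end markers and the `find_labelled_address` name are assumptions for illustration, and it still fails on any page that uses none of the labels or lays the address out differently.

```python
import re

# Hypothetical labels; extend the list after inspecting the target pages.
CANDIDATE_LABELS = ["Permanent Address", "Address", "Location", "Home"]
END_RE = re.compile(r"</tr>|</p>|</div>", re.IGNORECASE)  # assumed end markers
TAG_RE = re.compile(r"<[^>]*>")  # matches any HTML tag

def find_labelled_address(html):
    """Return (label, text) for the first label that yields text, else None."""
    for label in CANDIDATE_LABELS:
        m = re.search(re.escape(label), html, re.IGNORECASE)
        if m is None:
            continue  # label not on this page, try the next one
        end = END_RE.search(html, m.end())
        stop = end.start() if end else len(html)
        # Drop tags, collapse whitespace, trim a leading ":" after the label.
        text = TAG_RE.sub(" ", html[m.end():stop])
        text = " ".join(text.split()).strip(" :")
        if text:
            return label, text
    return None

page = "<p><b>Location:</b> Unit 1, 1 King St, Sydney NSW 2123</p>"
print(find_labelled_address(page))
```

Short labels such as "Home" will also match navigation links, so each hit still needs to be checked against the page it came from.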

akie2741 0 Newbie Poster · Answer 4 · 2009-09-24T22:41:51+00:00

Thanks very much!

akie2741 0 Newbie Poster · Answer 5 · 2009-09-25T00:22:20+00:00

I don't know what's wrong with my code:

import re
import urllib 
import urllib2 

webURL="http://www.sc.iitb.ac.in/~bijnan/personal-details.htm" #the website is
connect=urllib.urlopen(webURL) #connect to this website
htmlDoc=connect.read()#get the html document from this website

patternIN="Permanent Address" # Where to begin to keep the text
patternOUT="</tr>"  # Where to end to keep the text (after the begining)
keepText=False  # Do we keep the text ?
address=""      # We init the address

# Now, we read the file to keep the text
for line in htmlDoc:
    if keepText:
        address+=line.strip()  # We store the line, stripping the \n
        if patternOUT in line: # Next line won't be kept any more
            keepText=False
    if patternIN in line: # Starting from next line, we keep the text
        keepText=True

# Now, it's time to clean all this
rTags=re.compile("<.*?>") # the regexp to recognise any tag
address=rTags.sub(":", address) # we replace the tags with ":" (I could have chosen anything else,
                # especially if there is some ":" in the address
rSep=re.compile(":+") # Now, we replace any number of ":" with a \n
address=rSep.sub("\n", address)
print address

jice 53 Posting Whiz in Training · Answer 6 · 2009-09-25T22:33:58+00:00

And what is the error?
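
No error message was posted, so what follows is a guess rather than a diagnosis: read() returns the whole page as one string, and a for loop over a string yields single characters, not lines. In that case `patternIN in line` is never true and `address` stays empty. Splitting the text with splitlines() restores the line-by-line behaviour. The sketch below uses Python 3 syntax and a made-up HTML string instead of the live URL; `collect_after` is a hypothetical name.

```python
# Made-up stand-in for the string returned by read().
html_doc = ("<tr><td>Permanent Address :</td>\n"
            "<td>West Bengal, INDIA</td></tr>\n"
            "<tr><td>Phone</td></tr>\n")

# Iterating over a str gives one character at a time,
# so a multi-character marker can never be "in" any item.
found_by_chars = any("Permanent Address" in ch for ch in html_doc)

def collect_after(text, pattern_in, pattern_out):
    """Keep the lines after pattern_in, up to and including pattern_out."""
    keep = False
    kept = []
    for line in text.splitlines():  # real lines, not characters
        if keep:
            kept.append(line.strip())
            if pattern_out in line:
                keep = False
        if pattern_in in line:
            keep = True
    return kept

print(found_by_chars)
print(collect_after(html_doc, "Permanent Address", "</tr>"))
```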