HELP!! Why can't I read the HTML file?

Question

akie2741 0 Newbie Poster

15 Years Ago

My code is:

import re
import urllib 
import urllib2 

webURL="http://www.sc.iitb.ac.in/~bijnan/personal-details.htm" #the website is
connect=urllib.urlopen(webURL) #connect to this website
htmlDoc=connect.read()#get the html document from this website

patternIN="Permanent Address" # Where to begin to keep the text
patternOUT="</tr>"  # Where to end to keep the text (after the begining)
keepText=False  # Do we keep the text ?
address=""      # We init the address

# Now, we read the file to keep the text
for line in htmlDoc:
    if keepText:
        address+=line.strip()  # We store the line, stripping the \n
        if patternOUT in line: # Next line won't be kept any more
            keepText=False
    if patternIN in line: # Starting from next line, we keep the text
        keepText=True

# Now, it's time to clean all this
rTags=re.compile("<.*?>") # the regexp to recognise any tag
address=rTags.sub(":", address) # we replace the tags with ":" (I could have chosen anything else,
                # especially if there is some ":" in the address
rSep=re.compile(":+") # Now, we replace any number of ":" with a \n
address=rSep.sub("\n", address)
print address

What's wrong at line 15 (the for loop)? Why can't I loop over the lines of the HTML file?

html-css python

vegaseat 1,735 DaniWeb's Hypocrite Team Colleague · Answer 1 · 2009-09-25T22:26:37+00:00

htmlDoc=connect.read()
gives you the whole page as a single string, so your for loop iterates over it one character at a time. No single character can contain "Permanent Address", so keepText never becomes True.

You need to use:
htmlDoc=connect.readlines()
to get a list of lines instead.
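
To see the difference without fetching anything, here is a minimal sketch. It works on a small in-memory string: the sample HTML, the address in it, and the helper name extract_between are made up for illustration and are not taken from the real page. It uses splitlines() on the string, which should give roughly what readlines() gives you, minus the trailing newlines.

```python
# Hypothetical stand-in for the downloaded page; not the real page contents.
html_doc = (
    "<tr><td>Permanent Address</td>\n"
    "<td>12 Example Road</td>\n"
    "</tr>\n"
    "<tr><td>Phone</td></tr>\n"
)

# Iterating over a string yields one character at a time, so a
# multi-character pattern is never found "in" any of the items.
char_hits = [c for c in html_doc if "Permanent Address" in c]

# Splitting into lines first makes the same membership test work.
lines = html_doc.splitlines()
line_hits = [line for line in lines if "Permanent Address" in line]


def extract_between(lines, pattern_in, pattern_out):
    """Join the stripped lines that follow pattern_in, up to and
    including the first line that contains pattern_out."""
    keep = False
    kept = []
    for line in lines:
        if keep:
            kept.append(line.strip())
            if pattern_out in line:  # stop keeping after this line
                keep = False
        if pattern_in in line:  # start keeping from the next line
            keep = True
    return "".join(kept)


address = extract_between(lines, "Permanent Address", "</tr>")
print(len(char_hits), len(line_hits))
print(address)
```

With the original loop unchanged, swapping read() for readlines() should have the same effect as the splitlines() call here, since the loop then receives whole lines rather than characters.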