remove HTML markup in the input text, return a plain text string

Question

boiishuvo 0 Junior Poster in Training

13 Years Ago

I want this program to read a text file, then find and replace anything that starts with < and ends with >.
For example, if it finds <html>, it should replace it with ****.
But when I tested it, it didn't work the way I expected. Any suggestions?

def remove_html(text):
    txtLIST = list(text)
    i = 0
    while i < len(txtLIST):
        if txtLIST[i] == '<':
            while txtLIST[i] != '>':
                txtLIST.pop(i)
            txtLIST.pop(i)
        else:
            i = i + 1
    replace = 4*'*'
    return replace.join(txtLIST)

file = open('remHTML.txt','r')
test = file
display = remove_html(test)
print display

All 10 Replies

snippsat 661 Master Poster

13 Years Ago

By the way, you may want to look at the BeautifulSoup Python library for working with html files (and extracting text from them).

I agree with this, but now it looks like boiishuvo will destroy the structure of the HTML.
Should it replace like this, or keep the <> intact?

>>> s = '<html>'
>>> s.replace('<html>', '***')
'***'

Something like this with regex.

import re

html = '''\
<html>
<head>
    <title></title>
</head>
<body>

</body>
</html>'''

print re.sub(r'<.*>', '****', html)
"""Output-->
****
****
    ****
****
****

****
****
"""

snippsat 661 Master Poster

13 Years Ago

For example, a text file that shows: <title>Lachlan Osborn</title>
and the output should look like this: Lachlan Osborn

It can depend on how that text file looks.
You can change the regex to something like this.

import re

data = '''\
<title>Lachlan Osborn</title>
<head>hello world</head>
'''

text = re.sub(r'<.*?>', '', data)
print text.strip()
"""Output-->
Lachlan Osborn
hello world
"""

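The difference between the two patterns in the posts above comes down to greedy versus non-greedy matching. A minimal sketch in Python 3 syntax, reusing the one-line sample from the thread (the variable names are only for illustration):

```python
import re

# Sample line, reusing the example from the thread.
line = '<title>Lachlan Osborn</title>'

# Greedy: '.*' runs to the LAST '>' on the line, so the text between
# the tags is swallowed together with the tags.
greedy = re.sub(r'<.*>', '', line)

# Non-greedy: '.*?' stops at the FIRST '>', so each tag is removed
# separately and the text in between survives.
lazy = re.sub(r'<.*?>', '', line)

print(repr(greedy))
print(repr(lazy))
```

Note that '.' does not match a newline by default, so a tag that is split across two lines would be left alone by either pattern unless re.DOTALL is passed.
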
TrustyTony 888 ex-Moderator

13 Years Ago

You are not returning anything.

nosehat 0 Newbie Poster · Answer 1 · 2012-03-11T14:32:49+00:00

Lines 11 and 12 put "****" between every single character. Other than this, is the output what you expect?

By the way, you may want to look at the BeautifulSoup Python library for working with html files (and extracting text from them).
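
To illustrate the point about join, and one way the character-by-character idea from the question could be made to work, here is a rough sketch in Python 3 syntax. It assumes the function receives the file's contents as a string (for example from file.read()) rather than the file object itself, and the inside_tag flag is just one possible approach, not the only one.

```python
def remove_html(text):
    """Return text with everything between '<' and '>' dropped."""
    kept = []
    inside_tag = False
    for ch in text:
        if ch == '<':
            inside_tag = True          # start skipping characters
        elif ch == '>' and inside_tag:
            inside_tag = False         # tag finished, resume keeping
        elif not inside_tag:
            kept.append(ch)
    # Join with an empty string so nothing is inserted between characters.
    return ''.join(kept)


# Joining with '****' puts the stars between every element of the list.
print('****'.join(list('abc')))
print(remove_html('<title>Lachlan Osborn</title>'))
```

Building a new list avoids popping from the list while walking it, which is where the original loop can run past the end if a '<' is never closed.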

boiishuvo 0 Junior Poster in Training · Answer 2 · 2012-03-11T16:59:37+00:00

Wait, I read the guide: they expect me to write code that removes all HTML markup, including < and >, from a text file, then displays the text that is left.

I've never heard of the BeautifulSoup Python module, and the guide doesn't expect me to use it anyway.

For example, a text file that shows: <title>Lachlan Osborn</title>
and the output should look like this: Lachlan Osborn
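
If a third-party module is off the table, the standard library's html.parser module is another option besides hand-written scanning or a regex. The sketch below is Python 3 and is not what the guide asks for; the class name TextOnly is made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside markup.
        self.chunks.append(data)


def remove_html(text):
    parser = TextOnly()
    parser.feed(text)
    parser.close()
    return ''.join(parser.chunks)


print(remove_html('<p>Hello <b>world</b></p>'))
```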

boiishuvo 0 Junior Poster in Training · Answer 3 · 2012-03-11T18:05:01+00:00

Thanks. That's a good example, but I modified the code to meet the guide's requirements and the output didn't show anything.

def remove_html(text):
    import re
    info = open('remHTML.txt','r')
    data = info
    info.close()
    text = re.sub(r'<.*?>', '', data)

text = []
text_list = remove_html(text)
print(text_list.strip())

boiishuvo 0 Junior Poster in Training · Answer 4 · 2012-03-11T18:24:07+00:00

Yeah, I forgot to add that,

but lines 8-10 seem incorrect.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 5 · 2012-03-11T18:41:02+00:00

You are modifying the passed-in list text, aren't you? I do not understand why you try to set another variable, text_list.
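
Putting the replies together, a corrected version of the file-reading attempt might look roughly like this in Python 3. It is a sketch under a few assumptions: the helper name remove_html_from_file is invented here, and a temporary file stands in for remHTML.txt so the example runs on its own.

```python
import os
import re
import tempfile


def remove_html(text):
    """Return a copy of text with every <...> tag removed."""
    return re.sub(r'<.*?>', '', text)


def remove_html_from_file(path):
    # read() gives the file's contents as one string; the file object
    # itself is not a string and cannot be passed to re.sub.
    with open(path, 'r') as handle:
        data = handle.read()
    return remove_html(data)


# Demo: write a small stand-in for remHTML.txt to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('<title>Lachlan Osborn</title>\n')
    sample_path = tmp.name

result = remove_html_from_file(sample_path)
os.remove(sample_path)
print(result.strip())
```

The two changes that matter are reading the contents before the file is closed and returning the result, so the caller has a string to call strip() on.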

boiishuvo 0 Junior Poster in Training · Answer 6 · 2012-03-11T19:03:46+00:00

How do I run this procedure?

def remove_html(text)

boiishuvo 0 Junior Poster in Training · Answer 7 · 2012-03-11T19:54:04+00:00

info = '''<table>
    <tr align = "center">
        <h1> Lachlan Osborn </h1>
        <p> Address: 5 Smith Street, Manly <br>
        Date of Birth: 26th April 1993 </p>
        
        <a href="semester.html"><b>My Semester Units</b></a>
        <p><b>Check out my <a href="hobbies.html">hobbies.</a></b></p>
    </tr>
</center>'''

def remove_html(text, info):
    import re
    text = re.sub(r'<.*?>', '', info)
    return text

remove_html(text.strip())
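
In the snippet above the function is defined with two parameters but called with one, text is not defined at the point of the call, and the returned string is thrown away. One possible way to fix this, sketched in Python 3 with a shortened copy of the sample (the closing tag is changed to </table> here, and the line clean-up step is an addition, not something the guide is known to require):

```python
import re

# Shortened version of the sample from the post above.
info = '''<table>
    <tr align = "center">
        <h1> Lachlan Osborn </h1>
        <p> Address: 5 Smith Street, Manly <br>
        Date of Birth: 26th April 1993 </p>
    </tr>
</table>'''


def remove_html(text):
    """Strip tags, then trim each line and drop the empty ones."""
    without_tags = re.sub(r'<.*?>', '', text)
    lines = [line.strip() for line in without_tags.splitlines()]
    return '\n'.join(line for line in lines if line)


plain = remove_html(info)   # pass the string in, keep what comes back
print(plain)
```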
