Regular expression in Python: search vs. finditer

Question

jtbcswiss 0 Newbie Poster

14 Years Ago

I run the following regular-expression search in Python on about 3000 HTML documents saved as .txt files. In about 600 cases the function below returns 1 (it finds the pattern):

import re

def risk_committee_search1(str):

    ## get the re object
    co = re.compile(r"""

        
    (?P<nam>:.{300} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{300}
    |
    .{300} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{300})
    
    
    """,re.VERBOSE|re.IGNORECASE|re.DOTALL)
    #""",re.VERBOSE|re.IGNORECASE)
    ## search the input string for the pattern compiled in co
    com_risk = co.search(str)
    ## if nothing is found
    if com_risk is None:
        risk_committee = '0'
    else:
        risk_committee = '1'

    return risk_committee

In a second stage, I ran the following function on the same files. It uses the same regular expression, but I want it to return the actual matched text rather than just a 0/1 value:

def risk_committee_search2(str):
  
    ## get the re object: the word RISK within five words of the word COMMITTEE
    co = re.compile(r"""
    

    (?P<nam>:.{300} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{300}
    |
    .{300} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{300})
    
    
    """,re.VERBOSE|re.IGNORECASE|re.DOTALL)
    ## find every occurrence of the pattern compiled in co
    com_risk = co.finditer(str)

    txt_to_ret = []
    
    for i in com_risk :
        txt_to_ret.append(i.group('nam'))

    out = '^^^'.join(txt_to_ret)
    return out

It turns out that there are far fewer cases (about 50% fewer) than above, despite searching for the same pattern. Is it because of the use of “finditer” rather than “search”? I heard of the “multiline” issue in the forums, but I am not sure whether that is the problem. If anyone can help, I would appreciate it tremendously.

Thanks!


snippsat 661 Master Poster

14 Years Ago

As a general rule, using regular expressions on HTML is not a good idea at all.
If you want to read why it is a bad idea:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

BeautifulSoup and lxml are good tools for this.

To get better help, give us a sample of the HTML.
And be very precise about what you want to get out of the HTML.
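
To illustrate the idea of parsing first and matching afterwards, here is a minimal sketch. It uses the standard library's html.parser instead of BeautifulSoup or lxml so that it runs without extra packages. The names TextExtractor, html_to_text and proximity are made up for this example, and the sample HTML is invented, not taken from the real documents.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with a space so words in adjacent elements do not run together,
    # then collapse whitespace. A word split by inline tags would be broken up.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


# Invented sample: the two keywords sit in different elements.
sample = "<html><body><p>The <b>Risk</b> Management</p><p>Committee met.</p></body></html>"
text = html_to_text(sample)

# Proximity part of the pattern only (no fixed 300-character context).
proximity = re.compile(r"RISK\W+(\w+\W+){0,5}?COMMITTEE", re.IGNORECASE)
print(text)
print(proximity.search(text) is not None)
```

On the raw HTML the tags between the two words would count as extra "words" for the proximity test; on the extracted text they do not.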


Gribouillis 1,391 Programming Explorer Team Colleague · Answer 1 · 2010-02-14T04:28:25+00:00

It seems unbelievable to me that you obtain fewer matches with finditer. I wrote a small function to test this. Run it instead of risk_committee_search. It compares the output of search and finditer. If it raises AssertionError, you should post the output here. If it does not, there is no difference between the two methods.

import re

co = re.compile(r"""
    

    (?P<nam>:.{300} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{300}
    |
    .{300} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{300})
    
    
    """,re.VERBOSE|re.IGNORECASE|re.DOTALL)

def compare_searches(s):

    matches = list(co.finditer(s))
    single_match = co.search(s)

    try:
        assert(bool(matches) == (single_match is not None))
    except AssertionError:
        m = matches[0] if matches else single_match
        print("Match found, span = (%d, %d)." % m.span(0))
        print("Finditer found %d matches." % len(matches))
        print("Search found %d match." % int(single_match is not None))
        print("Matched string was:\n")
        print(repr(s))
        raise
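
For a quick check that does not need the real files, here is a small sketch of the same comparison on toy strings. It assumes a shortened pattern (5 characters of context instead of 300, no named group) so the samples stay readable. The names small, found_by_search and found_by_finditer and the sample strings are invented for this illustration.

```python
import re

# Hypothetical shortened pattern: 5 characters of context instead of 300,
# and without the named group, so the sample strings stay small.
small = re.compile(r"""
    .{5} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{5}
    |
    .{5} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{5}
    """, re.VERBOSE | re.IGNORECASE | re.DOTALL)


def found_by_search(s):
    # True if search() reports a match anywhere in s.
    return small.search(s) is not None


def found_by_finditer(s):
    # True if finditer() yields at least one match.
    return any(True for _ in small.finditer(s))


samples = [
    "xxxxx risk management committee xxxxx",   # enough context on both sides
    "risk committee xxxxx",                    # too little text before RISK
    "xxxxx committee on risk xxxxx and more text risk committee xxxxx",
    "no relevant words here",
]
results = [(found_by_search(s), found_by_finditer(s)) for s in samples]
for s, r in zip(samples, results):
    print(r, repr(s))
```

In every sample the two flags agree. The second sample also shows that a hit too close to the start of the text is missed by both methods alike, because the pattern demands a fixed amount of context on each side. The third sample contains two non-overlapping hits, which finditer returns separately.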

By the way, 'str' is a bad variable name in Python (it is the name of a built-in type).
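
A short sketch of what goes wrong when a parameter is called 'str'. The function names shadowed and not_shadowed are invented for this example.

```python
def shadowed(str):
    # Inside this function, `str` is the argument, not the built-in type,
    # so trying to convert a number with str(...) fails.
    try:
        return str(42)
    except TypeError as exc:
        return "TypeError: %s" % exc


def not_shadowed(text):
    # With a different parameter name the built-in str() is still reachable.
    return str(42) + " " + text


message = shadowed("some document text")
ok = not_shadowed("some document text")
print(message)
print(ok)
```

Renaming the parameter to something like text or s avoids the problem.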