I run the following regular-expression search in Python on about 3000 HTML documents saved as .txt files, and the function below returns 1 (finds the pattern) in about 600 cases:

def risk_committee_search1(str):

    ## get the re object
    co = re.compile(r"""

        
    (?P<nam>:.{300} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{300}
    |
    .{300} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{300})
    
    
    """,re.VERBOSE|re.IGNORECASE|re.DOTALL)
    #""",re.VERBOSE|re.IGNORECASE)
    ## search the input string for the sequence described in co
    com_risk = co.search(str)
    ## if nothing is found
    if com_risk is None:
        risk_committee = '0'
    else:
        risk_committee = '1'

    return risk_committee

In a second stage, I ran the following function on the same files. It uses the same regular expression, but now I want it to return the actual matched text rather than just a 0/1 value:

def risk_committee_search2(str):
  
    ## get the re object The word risk within five words of the word committee
    co = re.compile(r"""
    

    (?P<nam>:.{300} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{300}
    |
    .{300} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{300})
    
    
    """,re.VERBOSE|re.IGNORECASE|re.DOTALL)
    ## search the input string for the sequence described in cro
    com_risk = co.finditer(str)

    txt_to_ret = []
    
    for i in com_risk :
        txt_to_ret.append(i.group('nam'))

    ## join the matches with a '^^^' separator (str.join needs no 'string' module import)
    out = '^^^'.join(txt_to_ret)
    return out

It turns out that the second function finds far fewer cases (about 50% fewer) than the first, despite searching for the same pattern. Is it because I use “finditer” instead of “search”? I have read about the “multiline” issue in the forums, but I am not sure whether that is the problem. If anyone can help, I would appreciate it tremendously.

Thanks!

It seems unbelievable to me that you obtain fewer matches with finditer. I wrote a small function to test this. Run it instead of risk_committee_search: it compares the output of search and finditer. If it raises AssertionError, please post the output here. If it does not, there is no difference between the two methods.

import re

co = re.compile(r"""
    

    (?P<nam>:.{300} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{300}
    |
    .{300} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{300})
    
    
    """,re.VERBOSE|re.IGNORECASE|re.DOTALL)

def compare_searches(s):

    matches = list(co.finditer(s))
    single_match = co.search(s)

    try:
        assert(bool(matches) == (single_match is not None))
    except AssertionError:
        m = matches[0] if matches else single_match
        print("Match found, span = (%d, %d)." % m.span(0))
        print("Finditer found %d matches." % len(matches))
        print("Search found %d match." % int(single_match is not None))
        print("Matched string was:\n")
        print(repr(s))
        raise
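
As a quick sanity check of the idea, here is a minimal sketch with a shortened pattern. The context is cut from `.{300}` to `.{0,20}`, the literal `:` after `(?P<nam>` is dropped, and the sample strings are made up, so treat all of that as assumptions for illustration only. It checks that search finds something exactly when finditer yields at least one match, and that adding re.MULTILINE changes nothing here, since that flag only affects the `^` and `$` anchors and the pattern has none.

```python
import re

FLAGS = re.VERBOSE | re.IGNORECASE | re.DOTALL

# Illustrative, shortened pattern (not the original): at most 20 chars of context.
PATTERN = r"""
    (?P<nam>
        .{0,20} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{0,20}
        |
        .{0,20} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{0,20}
    )
"""

co = re.compile(PATTERN, FLAGS)
co_multiline = re.compile(PATTERN, FLAGS | re.MULTILINE)


def found_by_search(pattern, text):
    """True if pattern.search finds a match anywhere in text."""
    return pattern.search(text) is not None


def found_by_finditer(pattern, text):
    """True if pattern.finditer yields at least one match."""
    return next(pattern.finditer(text), None) is not None


# Hypothetical sample documents.
samples = [
    "The board set up a risk management committee last year.",
    "The committee on\noperational risk met twice.",
    "Nothing relevant here.",
]

results = []
for text in samples:
    results.append((
        found_by_search(co, text),
        found_by_finditer(co, text),
        found_by_search(co_multiline, text),
    ))

for text, row in zip(samples, results):
    print(row, repr(text))
```

If the three values ever differ within a row for one of your real files, that file is the one worth posting.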

By the way, 'str' is a bad variable name in Python, because it shadows the name of a builtin type.
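
To illustrate why, here is a small sketch (the function names `shadowed` and `not_shadowed` are made up for this example). Inside a function whose parameter is called `str`, the builtin is no longer reachable under that name, so a call such as `str(42)` fails.

```python
def shadowed(str):
    # Here `str` is the argument (a string), not the builtin type,
    # so trying to call it raises TypeError.
    try:
        return str(42)
    except TypeError as exc:
        return "TypeError: %s" % exc


def not_shadowed(text):
    # With a different parameter name, the builtin str() works as usual.
    return str(42) + text


print(shadowed("abc"))
print(not_shadowed("!"))
```

Renaming the parameter to something like `text` or `s` avoids the problem.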
