I run the following regular-expression search in Python on about 3,000 HTML documents saved in .txt format, and get about 600 cases where the function below returns '1' (i.e. it finds the pattern):
import re

def risk_committee_search1(str):
    ## get the re object
    co = re.compile(r"""
    (?P<nam>:.{300} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{300}
    |
    .{300} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{300})
    """,re.VERBOSE|re.IGNORECASE|re.DOTALL)
    #""",re.VERBOSE|re.IGNORECASE)
    ## search the input string for the sequence described in co
    com_risk = co.search(str)
    ## if nothing is found
    if com_risk is None:
        risk_committee = '0'
    else:
        risk_committee = '1'
    return risk_committee
In a second stage, I ran the following function on the same files. It uses the same regular expression, but now I want it to return the actual matched text rather than just a 0/1 value:
def risk_committee_search2(str):
    ## get the re object: the word risk within five words of the word committee
    co = re.compile(r"""
    (?P<nam>:.{300} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{300}
    |
    .{300} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{300})
    """,re.VERBOSE|re.IGNORECASE|re.DOTALL)
    ## search the input string for the sequence described in co
    com_risk = co.finditer(str)
    txt_to_ret = []
    ## collect the text of every match
    for i in com_risk:
        txt_to_ret.append(i.group('nam'))
    ## join all matches into one string, separated by '^^^'
    out = '^^^'.join(txt_to_ret)
    return out
It turns out that the second function finds far fewer cases (about 50% fewer) than the first, despite searching for the same pattern. Is this because I used “finditer” rather than “search”? I have read about a “multiline” issue in the forums, but I am not sure whether that is the problem here. If anyone can help, I would appreciate it tremendously.
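To rule out the method itself, a minimal sketch like the following could compare the two calls on the same strings. The shortened 10-character windows, the sample sentences, and the helper names are assumptions for illustration only; this is a stand-in, not my exact pattern or data.

```python
import re

# Hypothetical stand-in pattern: same shape as the one above, but with
# 10-character windows instead of 300 so that short samples can match.
PATTERN = re.compile(r"""
    (?P<nam> .{10} RISK\W+ (\w+\W+){0,5}? COMMITTEE .{10}
    |
    .{10} COMMITTEE\W+ (\w+\W+){0,5}? RISK .{10})
    """, re.VERBOSE | re.IGNORECASE | re.DOTALL)

def found_by_search(text):
    # True if search() finds at least one match
    return PATTERN.search(text) is not None

def found_by_finditer(text):
    # List of matched texts returned by finditer()
    return [m.group('nam') for m in PATTERN.finditer(text)]

# Made-up sample strings (not taken from the real files)
samples = [
    "The board has formed a risk management committee to oversee exposure.",
    "No relevant wording appears anywhere in this document.",
    "risk committee",  # too short: no 10-character window on either side
]

# Strings on which the two methods disagree about whether a match exists
disagreements = [t for t in samples
                 if found_by_search(t) != bool(found_by_finditer(t))]
print(disagreements)
```

If the list of disagreements also comes back empty on the real files, the gap presumably lies somewhere other than “finditer” versus “search”, for example in how the returned values are counted afterwards.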
Thanks!