Searching for string in regular expression

Question

aint 13 Newbie Poster

12 Years Ago

Hi,

I am working on Protein and DNA sequences. When you have a protein sequence it can be translated in to DNA sequence in many ways. A nice way to express this is using regular expression. I would like to create this long regular expression for a protein sequence, and than seach for a smaller string. So that is the other way around of what regular expression does. Is there a way to do this?

a short example would be

DNAsequence
r'atata[atg]ca[atgc]...atatagtagcctgt'

search for
'agcacg'

Thanks

python regex

Edited 12 Years Ago by aint

4 Contributors
3 Replies
319 Views
4 Hours Discussion Span
Latest Post 12 Years Ago Latest Post by TrustyTony

Gribouillis 1,391 Programming Explorer

12 Years Ago

I see a way, but there is some work to do. First you convert your regular expression to a NFA (non deterministic finite automaton) with a start state and
an accept state. In this NFA, you add an epsilon-transition from the start state to every other state and an epsilon transition from any state to the accept state.
This NFA should be able to find your subsequences.

A second step could be to convert the NFA into a deterministic automaton (DFA) to produce a fast tool. You could use a program like this one for the conversion (again there is some work, because the NFA in this link doesn't seem to have epsilon transitions).

Edited 12 Years Ago by Gribouillis

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

griswolf 304 Veteran Poster · Answer 1 · 2012-08-05T22:38:58+00:00

You can use an NDFA (NFA) where each transition character in your short string triggers a transition in the NFA, if possible. The google knows about how you might want to start doing that.

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 2 · 2012-08-05T23:54:50+00:00

If you have only not optional alternatives without repeats in [], you could do something like this to avoid complicated transformations (you could improve the matching with more advanced string matching algorithm) :

dna = r'atata[atg]ca[atgc]atatagtagcctgt'

grouping = []
s = set()
reading = False
for acid in dna:
    if acid == '[':
        s = set()
        reading = True
    elif acid == ']':
        reading = False
        grouping.append(s)
    elif reading:
        s.add(acid)
    else:
        grouping.append(acid)

print grouping

target = 'tat'
matches = list()

for index in range(len(grouping) - len(target) + 1):
    if all((c1 in c2) for c1, c2 in zip(target, grouping[index:])):
        g = grouping[index:index+len(target)]
        matches.append((index, g))

if matches:
    print 'Match at:'
    for index, g in matches:
        print index,':',g
else:
    print 'Match not found'