An interesting Perl problem to extract file content

Question

ghosh22 0 Junior Poster in Training

14 Years Ago

Hi everybody. Here's an interesting problem to solve. I have a text file like this (also attached):

>first
TTCCCAAAAAAGACCTACTAAGTCAAGCGGATGCGTTTTGTGTCTTATGG
AAAGTCCCTGACGGATACGAGGCTTTGGGTGATTCGGTACGAATGATTCG
GTTACCAGAACTTACCGAAGAAGAAATGGGACGAACCGAGGTTTCTCGTT
CGTGTGCTAATCCTACATTCAAACATCGATTTCGATCAGAGTTTGTTTTT
CATGAAGAACAGACATTCGTATTACGTGTTTACGATGAAGATTTGAGGTA
>firsta
TTCCCAAAAAAGACCTACTAAGTCAAGCGGATGCGTTTTGTGTCTTATGG
AAAGTCCCTGACGGATACGAGGCTTTGG----------------------
-----------------AAGAAGAAATGGGACGAACCGAGGTTTCTCGTT
CGTGTGCTAATCCTACATTCAAACATCGATTTCGATCAGAGTTT------
CATGAAGAACAGACATTCGTATTACGTGTTTACGATGAAGATTTGAGGTA

Both >first and >firsta contain the same characters, except for the parts replaced by hyphens. Is it possible to write a Perl script that extracts, for each line under >firsta, the text that comes before the hyphens? Also, would it be possible to extract the corresponding unmatched text from >first?
Please note that >first and >firsta are in the same text file, and the other similar text files I am using might contain more lines like these.
Thanks a lot in advance.

perl

example.txt (0.52 KB)



k_manimuthu 43 Junior Poster in Training

14 Years Ago

Pass your file name to the script as a command-line argument, as below.

perl filename.pl example.txt

The script then reads $ARGV[0] as the input file.
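
A minimal sketch of how a script can pick up the file name from the command line, assuming the whole file fits in memory. The helper name slurp_file is hypothetical and does not appear in the thread's scripts; any path, relative or absolute, can be passed.

```perl
use strict;
use warnings;

# slurp_file() is a hypothetical helper name (not from the thread).
# It reads the whole file at $path and returns it as one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!\n";
    local $/;                  # no record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Runs only when a file name was actually given, e.g.
#   perl filename.pl example.txt
if (@ARGV) {
    my $content = slurp_file($ARGV[0]);
    printf "%s: %d bytes read\n", $ARGV[0], length $content;
}
```

Checking the result of open means a mistyped path produces a clear error message instead of an empty result.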

yuvanbala 2 Newbie Poster

14 Years Ago

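Try this code. You no longer need to specify >first and >firsta; the script detects the headers automatically.

use strict;
use warnings;

# Read the whole input file (first command-line argument) into $cont.
my $file = shift @ARGV or die "Usage: perl $0 file.txt\n";
open my $txt, "<", $file or die "Cannot open $file: $!\n";
read $txt, my $cont, -s $txt;
close $txt;

# Body of the first record (up to the next '>') and body of the second
# record (to end of file), whatever the two headers are called.
my ($first)  = $cont =~ m#\A^>.*?$(.*?)(?=>)#sm;
my ($firsta) = $cont =~ m#.+?^>.*?$(.*?)\Z#sm;
die "Expected two '>' records in $file\n"
    unless defined $first and defined $firsta;

# Index the lines of each record by zero-padded line number ("0001", ...)
# so that sorting the keys as strings keeps the file order.
my (%first, %firsta);
my $n = 0;
while ($first  =~ m#(.+)#g) { $first{  sprintf("%04d", ++$n) } = $1 }
$n = 0;
while ($firsta =~ m#(.+)#g) { $firsta{ sprintf("%04d", ++$n) } = $1 }

my (@matched, @unmatched);
foreach (sort keys %firsta) {
    next unless defined $first{$_};    # second record has extra lines
    if ($first{$_} eq $firsta{$_}) {
        push @matched, $_ + 0;
    }
    else {
        # Drop the hyphens, then delete what is left from the reference
        # line; the remainder is the unmatched text. This works when the
        # hyphens sit at one end of the line, as in the sample file.
        $firsta{$_} =~ s#-##g;
        $first{$_}  =~ s#\Q$firsta{$_}\E##g if length $firsta{$_};
        push @unmatched, $_ + 0;
    }
}

print "List of Matched lines\n", '=' x 25, "\n";
print "matched line is $_\n" for @matched;

print "\n\nList of Unmatched lines\n", '=' x 25, "\n";
foreach my $line (@unmatched) {
    printf "Line %d:\t%s\n", $line, $first{ sprintf("%04d", $line) };
}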

Thanks and regards,
yuvanbala

k_manimuthu commented: Continuous help!!! +2


yuvanbala 2 Newbie Poster · Answer 1 · 2010-12-08T16:25:55+00:00

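Try this code:

use strict;
use warnings;

# Read the whole input file (first command-line argument) into $cont.
my $file = shift @ARGV or die "Usage: perl $0 file.txt\n";
open my $txt, "<", $file or die "Cannot open $file: $!\n";
read $txt, my $cont, -s $txt;
close $txt;

# Body of >first (everything up to the >firsta header) and body of >firsta.
my ($first)  = $cont =~ m#\Q>first\E$(.*?)\Q>firsta\E$#sm;
my ($firsta) = $cont =~ m#\Q>firsta\E$(.*)$#sm;
die "Could not find >first and >firsta in $file\n"
    unless defined $first and defined $firsta;

# Index the lines of each record by zero-padded line number.
my (%first, %firsta);
my $n = 0;
while ($first  =~ m#(.+)#g) { $first{  sprintf("%04d", ++$n) } = $1 }
$n = 0;
while ($firsta =~ m#(.+)#g) { $firsta{ sprintf("%04d", ++$n) } = $1 }

# Report every >first line that differs from its >firsta counterpart.
foreach (sort keys %firsta) {
    next unless defined $first{$_};    # >firsta has extra lines
    if ($first{$_} ne $firsta{$_}) {
        print "$first{$_} unmatched in line ", $_ + 0, "\n";
    }
}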

Thanks and regards,
yuvanbala

ghosh22 0 Junior Poster in Training · Answer 2 · 2010-12-08T16:35:49+00:00

Hi, thanks. But it doesn't give me a way to specify my file name and path. Also, as I am a new programmer, comments would help me. Thanks a lot.

ghosh22 0 Junior Poster in Training · Answer 3 · 2010-12-08T18:31:29+00:00

Hi, thanks. This one gives only the unmatched part, though. One more thing: my other files do not contain >first and >firsta, so do I need to change the script every time?
Thanks a lot.
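
One way to avoid editing the script for every file is to treat any line that starts with '>' as a header. The sketch below assumes that layout; parse_records is a hypothetical helper name, and the sample text is shortened from the file in the question.

```perl
use strict;
use warnings;

# parse_records() is a hypothetical helper (not from the thread).
# It splits FASTA-style text into [ $name, \@lines ] pairs, in file
# order, without hard-coding any header name.
sub parse_records {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*$/;               # skip blank lines
        if ($line =~ /^>\s*(.*?)\s*$/) {        # header such as ">first"
            push @records, [ $1, [] ];
        }
        elsif (@records) {                      # line of the current record
            push @{ $records[-1][1] }, $line;
        }
    }
    return @records;
}

# Shortened sample in the same layout as the question's file.
my $sample = <<'END';
>first
TTCCCAAAAA
AAAGTCCCTG
>firsta
TTCCCAAAAA
AAAG------
END

for my $rec (parse_records($sample)) {
    my ($name, $lines) = @$rec;
    printf "%s: %d lines\n", $name, scalar @$lines;
}
```

With the records held in a list, consecutive pairs can be compared whatever their headers are called.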

ghosh22 0 Junior Poster in Training · Answer 4 · 2010-12-08T18:35:33+00:00

Another bug: it actually extracts the whole line, not only the unmatched part. My problem was this:
first, the script should extract the part that matches between >first and >firsta; then it should extract the unmatched part from >first and >firsta.
Thanks.
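
The two-step extraction described here can be sketched by comparing the lines position by position, assuming the reference line and the hyphenated line have the same length. split_by_gaps is a hypothetical helper name; the two sample lines are the second lines of >first and >firsta from the question.

```perl
use strict;
use warnings;

# split_by_gaps() is a hypothetical helper (not from the thread).
# $ref is a line from the reference record, $gapped the same line with
# runs of '-'. It returns two array refs: the runs of text kept in
# $gapped, and the pieces of $ref found at the hyphen positions.
sub split_by_gaps {
    my ($ref, $gapped) = @_;
    my (@kept, @missing);
    while ($gapped =~ /(-+|[^-]+)/g) {
        my $run   = $1;
        my $start = pos($gapped) - length $run;   # offset where the run begins
        if ($run =~ /^-/) {
            # Assumes equal line lengths; skip the run if $ref is too short.
            push @missing, substr($ref, $start, length $run)
                if $start < length $ref;
        }
        else {
            push @kept, $run;
        }
    }
    return (\@kept, \@missing);
}

my ($kept, $missing) = split_by_gaps(
    'AAAGTCCCTGACGGATACGAGGCTTTGGGTGATTCGGTACGAATGATTCG',
    'AAAGTCCCTGACGGATACGAGGCTTTGG----------------------',
);
print "matched:   @$kept\n";
print "unmatched: @$missing\n";
```

Because the comparison works on positions rather than on deleting a substring, it also handles hyphens in the middle of a line.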

yuvanbala 2 Newbie Poster · Answer 5 · 2010-12-08T19:46:34+00:00

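Hi, you already said that '-' is the only difference, so I fixed it accordingly. The script then becomes:

use strict;
use warnings;

# Read the whole input file (first command-line argument) into $cont.
my $file = shift @ARGV or die "Usage: perl $0 file.txt\n";
open my $txt, "<", $file or die "Cannot open $file: $!\n";
read $txt, my $cont, -s $txt;
close $txt;

# Body of >first (everything up to the >firsta header) and body of >firsta.
my ($first)  = $cont =~ m#\Q>first\E$(.*?)\Q>firsta\E$#sm;
my ($firsta) = $cont =~ m#\Q>firsta\E$(.*)$#sm;
die "Could not find >first and >firsta in $file\n"
    unless defined $first and defined $firsta;

# Index the lines of each record by zero-padded line number ("0001", ...)
# so that sorting the keys as strings keeps the file order.
my (%first, %firsta);
my $n = 0;
while ($first  =~ m#(.+)#g) { $first{  sprintf("%04d", ++$n) } = $1 }
$n = 0;
while ($firsta =~ m#(.+)#g) { $firsta{ sprintf("%04d", ++$n) } = $1 }

my (@matched, @unmatched);
foreach (sort keys %firsta) {
    next unless defined $first{$_};    # >firsta has extra lines
    if ($first{$_} eq $firsta{$_}) {
        push @matched, $_ + 0;
    }
    else {
        # Drop the hyphens, then delete what is left from the >first line;
        # the remainder is the unmatched text. This works when the hyphens
        # sit at one end of the line, as in the sample file.
        $firsta{$_} =~ s#-##g;
        $first{$_}  =~ s#\Q$firsta{$_}\E##g if length $firsta{$_};
        push @unmatched, $_ + 0;
    }
}

print "List of Matched lines\n", '=' x 25, "\n";
print "matched line is $_\n" for @matched;

print "\n\nList of Unmatched lines\n", '=' x 25, "\n";
foreach my $line (@unmatched) {
    printf "Line %d:\t%s\n", $line, $first{ sprintf("%04d", $line) };
}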

thanks and regards,
yuvanbala

ghosh22 0 Junior Poster in Training · Answer 6 · 2010-12-08T20:59:32+00:00

Hi, thanks. It works fine. I will mark the thread as solved so you get proper credit for it.
It would be great if you could tell me which book(s) you follow or followed to learn Perl. Since I am new, it would help me. Cheers.
