how to compare multiple columns from two seperate files and identify matches

Question

bio-grad 0 Junior Poster in Training

14 Years Ago

Hi,
Currently I have two files: (I've labeled the columns to help with the explanation)
mature.txt
[0m] [1m] [2m] [3m] [4m] [5m] [6m] [7m]
scaffold1088 121550 D 18 ppy mi 164g 88.6
scaffold1141 262270 D 18 ppy mi 896 90.2
scaffold1168 54635 D 18 peu mi 2138 87.5
scaffold1168 56190 D 18 ppt mi 2218 87.5

hairpin.txt
[0h] [1h] [2h] [3h] [4h] [5h] [6h] [7h]
ptc 164e scaffold1088 97.56 41 121570 121530 73.8
ppt 896 scaffold113 90 60 478993 478934 71.9
ppt 896 scaffold1141 90 60 257204 257145 71.9
ppt 896 scaffold1141 90 60 262302 262243 71.9
mmu 2138 scaffold1168 97.18 71 54618 54688 88.4
peu 2914 scaffold1168 100 63 56162 56224 86.7

What I want to do is:
1) find ever occurrence where [0m] matches [2h] AND [1m] falls between [5h] and [6h].
2) If condition 1) is met, then I need to know if [1h] and [6m] match (but the program should only compare the numbers - not letters). If yes, then I want to print to a new file [0h] [1h] [2h] [7m] [1m] [3h] [5h] [6h] as one line.

Sample output would look like this:
ptc 164 scaffold1088 88.6 121550 97.56 121570 121530
ppt 896 scaffold1141 90.2 262270 90 262302 262243
mmu 2138 scaffold1168 87.5 54635 97.18 54618 54688

*Note that the last item in hairpin.txt meets criteria 1 but because [6m] and [1h] differ, it is not a match

Below is the code I started working on, but I need some help figuring out how to compare the array pieces.

Many thanks!

use strict;
use FileHandle;
use Data::Dumper;

my ($hairpin,$mature) = @ARGV;
my $fh = new FileHandle;

#build array from hairpin miRNA data
open ($h,$hairpin) or die "Could not open $hairpin: $!";
my (@h,@hout)
while (<$h>) {
   chomp;
push @h, [split(/\t/)];#Build an array of arrays

   my $hhit = $t[0]."\t".$t[1]."\t".$t[2]."\t".$t[9]."\t".$t[10]; #species, miRNAfamily, scaffold_location, start, end

   }

#build array from mature miRNA data
open ($m,$mature) or die "Could not open $mature: $!";
my (@m,@mout)
while (<$m>) {
   chomp;
push @m, [split(/\t/)]; #Build array   
   my $mhit = $t[0]."\t".$t[1]."\t".$t[4]."\t".$t[6]."-".$t[7]; #scaffold_location, start, species, family, subidentity

    }

#compared scaffold name e.g. "scaffold201_4" in the hairpin data to the scaffold names in the mature data to identify matches
#then compare 
my ($hscaffold,$hlocatn,$hfamily,$mscaffold,$mlocatn,$mfamily);
foreach my $mscaffold (@m[0]) {#foreach scaffold reference
    if (!defined($mscaffold)

perl

3 Contributors
14 Replies
409 Views
2 Days Discussion Span
Latest Post 14 Years Ago Latest Post by bio-grad

dch26 2 Newbie Poster

14 Years Ago

hi.....i didnt test maba001's solution...but i dont think it handles duplicate keys in his hashes?? anyway, i had already started this code before i saw his response. i continued as a mental exercise. it works perfectly (according to your spec!). there are a few comments in there. if you need anything explained please shout. cheers.

use strict;
open(H,"<$ARGV[0]")||die"open h failed";
open(M,"<$ARGV[1]")||die"open m failed";
my %h_hash=();
while(<H>)  # first load h into hash of arrays
{
  chomp;
  my $key=(split/\s|\t/)[2];   # split by space or tab
  push @{$h_hash{$key}},[split(/\s|\t/)];
}
close H;
while(<M>) # then sequentially read this file and do the compares on the fly
{
  chomp;
  my @fields=split/\s|\t/;    #  split by space or tab
  foreach my $i ( 0 .. $#{ $h_hash{$fields[0]} } ) {  # for each array of this type
   # have to do this next check both ways round coz sometimes h5 bigger than h6 and vice versa
    	if ( (($fields[1] >= $h_hash{$fields[0]}[$i][5])&&($fields[1] <= $h_hash{$fields[0]}[$i][6]))  ||
	     (($fields[1] >= $h_hash{$fields[0]}[$i][6])&&($fields[1] <= $h_hash{$fields[0]}[$i][5])) )
		 { $h_hash{$fields[0]}[$i][1]=~s/[^0-9]//g;  # strip non-numerics
		   $fields[6]=~s/[^0-9]//g;
		   if($h_hash{$fields[0]}[$i][1] eq $fields[6])  # compare
		   {   # all good....so print
     		   print "$h_hash{$fields[0]}[$i][0] $h_hash{$fields[0]}[$i][1] $h_hash{$fields[0]}[$i][2] $fields[7] $fields[1] $h_hash{$fields[0]}[$i][3] $h_hash{$fields[0]}[$i][5] $h_hash{$fields[0]}[$i][6]\n";
		   }
		 }
  }
}
close M;
exit 0;

Edited 14 Years Ago by dch26 because: added a comment or 2

dch26 2 Newbie Poster

14 Years Ago

ps.....the syntax highlighter in the forum, incorrectly thinks that line 16 of the code has a comment in the middle of it......the # is part of the code....so just copy it as is......command line to execute is i guess same as yours:
perl frog.pl hairpin.txt mature.txt

best of luck
dch

dch26 2 Newbie Poster

14 Years Ago

yeah.....the OP's original test files has 'dupes' in both files if you call the 'scaffold*' type field the 'key', which i have.
would have been a whole lot easier if files were sorted in this order and no dupes!!
but life is never like that.
@bio-grad......what sort of data is this anyway? what you up to??
cheers!!

dch26 2 Newbie Poster

14 Years Ago

the 2 lines you commented out were correct....you just missed the curly brackets (had to split the line to comment them out). it now produces more output. hope its ok. dch

use strict;
open(H,"<$ARGV[0]")||die"open h failed";
open(M,"<$ARGV[1]")||die"open m failed";
my %h_hash=();
while(<H>)  # first load h into hash of arrays
{
  chomp;
  my $key=(split/\s|\t/)[2];   # split by space or tab
  push @{$h_hash{$key}},[split(/\s|\t/)];
}
close H;
while(<M>) # then sequentially read this file and do the compares on the fly
{
  chomp;
  my @fields=split/\s|\t/;    #  split by space or tab
  foreach my $i ( 0 .. $#{ $h_hash{$fields[0]} } ) {  # for each array of this type
   # have to do this next check both ways round coz sometimes h5 bigger than h6 and vice versa
    	#if ( (($fields[1] >= $h_hash{$fields[0]}[$i][5])&&($fields[1] <= $h_hash{$fields[0]}[$i][6]))  ||
	     #(($fields[1] >= $h_hash{$fields[0]}[$i][6])&&($fields[1] <= $h_hash{$fields[0]}[$i][5])) )
		# { 
		   $h_hash{$fields[0]}[$i][1]=~s/[^0-9]//g;
		   $fields[6]=~s/[^0-9]//g;
		   if($h_hash{$fields[0]}[$i][1] eq $fields[6])
		   { 
     		   print "$h_hash{$fields[0]}[$i][0] $h_hash{$fields[0]}[$i][1] $h_hash{$fields[0]}[$i][2] $fields[7] $fields[1] $h_hash{$fields[0]}[$i][3] $h_hash{$fields[0]}[$i][5] $h_hash{$fields[0]}[$i][6]\n";
		   }
		# }
  }
}
close M;
exit 0;

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

maba001 -1 Junior Poster in Training · Answer 1 · 2010-11-10T13:51:18+00:00

Hello,

I see that you have quite some code already developed that looks nice and clean. Where are you stuck?

From an algorithm point of view, I would create one numerically indexed hash from maturity.
In your loop to fill the "mature hash" you could do something like

# you are interested in the first two columns only
list($indexfull, $m1, $m2, $m3, $m4, $m5, $m6, $m7) = split(/\t/);
$indexfull =~ m/\w+(\d+)/;
$indexnumeric = $1;
# now you have the numeric portion only and for your comparison step 1 you could
# use a plain array style hash
$mature_test1[$indexnumeric] = $m1;
# for test 2 you could prepare a second plain hash
$mature_test2[$indexnumeric] = $m6;
# for the printing you could prepare the third hash
$mature_print[$indexnumeric] = "$m7\t$m1";

This will give you hashes that you can index by the number behind scaffold.
With that you could loop through hairpin directly.

Part two - looping through hairpin:

while (<$h>) {
  list($h0, $h1, $h2, $h3, $h5, $h6, $h7) = split(/\t);
  $testindex = $h2;
  $testindex =~ m/\w+(\d+)/;
  # here you might add some error handling in case it does not match
  $testnumeric = $1;
  # first test case
  if (! defined($mature_test1[$testnumeric])) next;
  # second test case
  if ($h5 < $mature_test2[$testnumeric]) next;
  if ($h6 > $mature_test2[$testnumeric]) next;
  # here you would do the rest of your processing.
}

maba001 -1 Junior Poster in Training · Answer 2 · 2010-11-10T13:52:53+00:00

Just a note of caution. I did not test this in any way. So there might be typos and syntax errors. But it should get across how I would go about it from an algorithm point of view.

maba001 -1 Junior Poster in Training · Answer 3 · 2010-11-10T14:48:35+00:00

hi.....i didnt test maba001's solution...but i dont think it handles duplicate keys in his hashes?

This is true. I am assuming that "maturity" would not contain duplicate indexes.
Duplicates in hairpin would be handled properly.

bio-grad 0 Junior Poster in Training · Answer 4 · 2010-11-10T22:02:32+00:00

yeah.....the OP's original test files has 'dupes' in both files if you call the 'scaffold*' type field the 'key', which i have.
would have been a whole lot easier if files were sorted in this order and no dupes!!
but life is never like that.
@bio-grad......what sort of data is this anyway? what you up to??
cheers!!

Hi Guys, Thanks for the help. I'll give this a go and see if it works for the data. The dupes are the continual annoyance in this process as the data is coming from multiple sources. I'm such a newbie to code that everything is difficult and while I appreciate the help I get here, much of it I don't fully understand, but nevertheless I'm trying to piece together the parts that will get the job done. So far, so good.

I am a PhD student in biology. As you may have heard, genomes are getting cheaper to sequence and so many more genomes are being sequenced. I'm working on a genome which is not completely assembled - that is why the pieces are referred to as scaffolds and not chromosomes. I am trying to match up the sequences in the scaffolds with a listing of biologically relevant sequences from other genomes which were previously identified. Every scaffold contains data. Where the data matches the biologically relevant sequences, I get a hit. Because I'm dealing with sequences from every genome published I get a lot of overlap - i.e. rice and corn genomes may have a hit at the same location on the scaffold data. I've got to sort these to get rid of the duplicates and determine which hits are "real" are which are due to chance. This piece I am working on now will help separate out the "real" hits by cross-referencing two data sets.

bio-grad 0 Junior Poster in Training · Answer 5 · 2010-11-11T02:45:54+00:00

I need to tweak the code to not include criteria 1B "AND [1m] falls between [5h] and [6h]." It is possible that the sequence could match 120 outside of the range and still be a match, so I want to remove this piece. I thought it was performing this function at lines 18 and 19 so I commented them out, but it gave me the same data as before when I know I should be pulling more hits.

Here is what I tried to do:

use strict;
open(H,"<$ARGV[0]")||die"open h failed";
open(M,"<$ARGV[1]")||die"open m failed";
my %h_hash=();
while(<H>)  # first load h into hash of arrays
{
  chomp;
  my $key=(split(/\t/))[2];   # split by space or tab
  push @{$h_hash{$key}},[split(/\t/)];
}
close H;
while(<M>) # then sequentially read this file and do the compares on the fly
{
  chomp;
  my @fields=split(/\t/);    #  split by space or tab
  foreach my $i ( 0 .. $#{ $h_hash{$fields[0]} } ) {  # for each array of this type
   # have to do this next check both ways round coz sometimes h5 bigger than h6 and vice versa
    	#if ( (($fields[1] >= $h_hash{$fields[0]}[$i][5])&&($fields[1] <= $h_hash{$fields[0]}[$i][6]))  ||
	 #    (($fields[1] >= $h_hash{$fields[0]}[$i][6])&&($fields[1] <= $h_hash{$fields[0]}[$i][5])) )
		 { $h_hash{$fields[0]}[$i][1]=~s/[^0-9]//g;  # strip non-numerics
		   $fields[6]=~s/[^0-9]//g;
		   if($h_hash{$fields[0]}[$i][1] eq $fields[6])  # compare
		   {   # all good....so print
     		   print "$h_hash{$fields[0]}[$i][0] $h_hash{$fields[0]}[$i][1] $h_hash{$fields[0]}[$i][2] $fields[7] $fields[1] $h_hash{$fields[0]}[$i][3] $h_hash{$fields[0]}[$i][5] $h_hash{$fields[0]}[$i][6]\n";
		   }
		 }
  }
}
close M;
exit 0;

bio-grad 0 Junior Poster in Training · Answer 6 · 2010-11-11T18:45:36+00:00

Thanks. I made the change (copied your script just to be sure) but it still isn't pulling all the hits. Here is another example I was running as my test.

mature.txt
scaffold262_22_ 19606 D 22 bta mi 2406 90.91
scaffold262_27_ 880 D 21 csi mi 390 100
scaffold262_27_ 49641 D 19 cin mi 4012 89.47
scaffold262_27_ 942 D 20 gma mi 390a 100
scaffold262_27_ 880 D 21 ath mi 390a 100
scaffold262_28_ 1790 D 18 mmu mi 696 88.89
scaffold324_25_ 36990 D 20 bta mi 1814b 90
scaffold324_25_ 33765 D 20 bta mi 1814c 90
scaffold324_33_ 639 D 22 zma mi 393b 90.91
scaffold324_34_ 1231 D 17 hsa mi 1279 88.24
scaffold324_34_ 13198 D 20 bta mi 2325a 90

hairpin.txt
[h0] [h1] [h2]
csi 390 scaffold262_27 97.67 43 939 981 77.8
mtr 393 scaffold324_33 97.5 40 677 638 71.9

When I run the script I get nothing. Both of the items in hairpin.txt should generate a hit within mature.txt. Both scaffolds[h2] are present with the corresponding family # [h1].

I should get the following hits from mature.txt:

scaffold262_27_ 880 D 21 csi mi 390 100
scaffold262_27_ 942 D 20 gma mi 390a 100
scaffold262_27_ 880 D 21 ath mi 390a 100
scaffold324_33_ 639 D 22 zma mi 393b 90.91

dch26 2 Newbie Poster · Answer 7 · 2010-11-11T20:26:36+00:00

:)

there is an error in your data....you have an underscore at the end of the scaffold field in the mature file. if you remove the underscore from the end, you get the four results as you expect. or do you need a version of the script that can handle these extra underscores now? cheers dch

if so.....here it is....added line 16.............let me know

use strict;
open(H,"<$ARGV[0]")||die"open h failed";
open(M,"<$ARGV[1]")||die"open m failed";
my %h_hash=();
while(<H>)  # first load h into hash of arrays
{
  chomp;
  my $key=(split/\s|\t/)[2];   # split by space or tab
  push @{$h_hash{$key}},[split(/\s|\t/)];
}
close H;
while(<M>) # then sequentially read this file and do the compares on the fly
{
  chomp;
  my @fields=split/\s|\t/;    #  split by space or tab
  $fields[0]=~s/_$//;
  foreach my $i ( 0 .. $#{ $h_hash{$fields[0]} } ) {  # for each array of this type
   # have to do this next check both ways round coz sometimes h5 bigger than h6 and vice versa
    	#if ( (($fields[1] >= $h_hash{$fields[0]}[$i][5])&&($fields[1] <= $h_hash{$fields[0]}[$i][6]))  ||
	     #(($fields[1] >= $h_hash{$fields[0]}[$i][6])&&($fields[1] <= $h_hash{$fields[0]}[$i][5])) )
		# { 
		   $h_hash{$fields[0]}[$i][1]=~s/[^0-9]//g;
		   $fields[6]=~s/[^0-9]//g;
		   if($h_hash{$fields[0]}[$i][1] eq $fields[6])
		   { 
     		   print "$h_hash{$fields[0]}[$i][0] $h_hash{$fields[0]}[$i][1] $h_hash{$fields[0]}[$i][2] $fields[7] $fields[1] $h_hash{$fields[0]}[$i][3] $h_hash{$fields[0]}[$i][5] $h_hash{$fields[0]}[$i][6]\n";
		   }
		# }
  }
}
close M;
exit 0;

bio-grad 0 Junior Poster in Training · Answer 8 · 2010-11-11T20:39:08+00:00

Perfect! And lesson learned...I need to double check the formatting of the data before setting up the scripts.

@dch26 - Thanks for your help in working out the kinks here. I was breaking down the code trying to see what each piece was doing but am having a little trouble with the first section you set up. Can you explain what's happening on line 4-9? I know you are setting up the reference key but I don't understand the [2] on line 8. I'm guessing that it looks up the scaffolds, but how do you then reference the family #'s later [1] if no reference is made to [1] before closing H?

dch26 2 Newbie Poster · Answer 9 · 2010-11-11T20:58:22+00:00

maybe this will help??.....

4. declare a new empty hash (associative array)
5. read the hairpin file
6. nothing!
7. strip carriage return and line feeds if present
8. split the input line, delimited by space OR tab, and then grab the 3rd entry (index 2) - to get the scaffold id and store it temporarily in a var called $key
9. split the input line, delimited by space OR tab, and then insert each of the components as an array entry into the hash of arrays.
the hash of arrays means we effectively can have more than one array stored under the same hash key name (scaffold id).

for example:

$h_hash{scaffold1168} will end up containing 2 complete arrays....one for each of:
mmu 2138 scaffold1168 97.18 71 54618 54688 88.4
peu 2914 scaffold1168 100 63 56162 56224 86.7

so....to get to all entries of the first array, you would use:
$h_hash{scaffold1168}[0]
and the second
$h_hash{scaffold1168}[1]

and then because each of those contains an array, we then have to reference the entry of that array...
so, to get to any of the entries we specify a further index:
$h_hash{scaffold1168}[1][3]
would return 100
(the 4th entry from the 2nd array)

make sense? sorry, i'm not the best at explaining...better at code...

bio-grad 0 Junior Poster in Training · Answer 10 · 2010-11-11T21:09:09+00:00

thanks. I think I understand. I haven't done anything like that before so I appreciate you explaining it. I think it will come in handy down the road.