Hi,
So I am modifying some code from discussion thread: http://www.daniweb.com/software-development/perl/threads/323871
I am still very much a newbie, but am trying to get a better handle on hashes - so if I've messed mine up please tell me ;).
I am attaching 3 files. One is an image of the scenario I am evaluating and the other two are the data files to be use by this code.
While looking at the image - we have two sets of data -
1) gene id [c0], scaffold location [c1], start [c2], stop [c3]and other info [c4]
2) hairpin ids [h0], scaffold locations [h1], start [h2], stop [h3],other info [h4], and sequence read [h15].
Note: c2 may be > c3 or < c3; likewise h2 may be > h3 or < h3 as these sequences have directionality and can read right to left or left to right.
I want to:
find every occurrence where h0 eq c0 (ie. scaffold1000 = scaffold1000) with the hairpin scaffold being the $key
when the hairpin scaffold $key matches any gene scaffold location [c1], I need it to test for conditions (ie. if h2<=c2 && h3>c2 etc.). However, there will be numerous genes with a matching scaffold for a single hairpin and I only want the output for the gene located closest to the hairpin start and stop, hence the need for a b-tree.
I'm thinking I should run the b-tree to locate the closes gene before testing the conditions, but I'm not sure. If there is another way to run this I'm open to suggestions.
Thanks everyone!
#!/usr/bin/perl;
#comparepremiRNAtocoding-pulloverlap3.pl;
use strict;
use warnings;
use FileHandle;
use Data::Dumper;
open(H,"<$ARGV[0]")||die"open h failed";
open(C,"<$ARGV[1]")||die"open c failed";
my %h_hash=();
while(<H>) # first load h into hash of arrays
{
chomp;
my $key=(split/\t/)[1]; # split by tab and make column 2 (index 1) the key (i.e. the scaffolds);
push @{$h_hash{$key}},[split(/\t/)];
}
close H;
while(<C>) # then sequentially read this file and do the compares on the fly
{
chomp;
my @c=split/\t/; # split by tab
foreach my $i ( 0 .. $#{ $h_hash{$c[1]} } ) # for each array of this type (scaffold)
{
if ( (($h_hash{$c[1]}[$i][2] <= $c[2] #outside hits
&& $h_hash{$c[1]}[$i][3] <= $c[2]
&& $h_hash{$c[1]}[$i][2] <= $c[3]
&& $h_hash{$c[1]}[$i][3] <= $c[3]))
|| (($h_hash{$c[1]}[$i][2] >= $c[2]
&& $h_hash{$c[1]}[$i][3] >= $c[2]
&& $h_hash{$c[1]}[$i][2] >= $c[3]
&& $h_hash{$c[1]}[$i][3] >= $c[3])) )
{
# need a B-tree for every $key[1] (aka scaffold) so that for every hairpin array I can which coding "region" is nearest- calculate distance between $h_hash{$c[1]}[$i][2] and $h_hash{$c[1]}[$i][3] to $c[2] and $c[3] foreach $i ( 0 .. $#{ $h_hash{$c[1]} } ) (i.e. for each $c[2] and $c[3] where $c[1] eq $key[1])
# such as:
#$h_hash{$c[1]}[$i][2] - $c[2] = |$a|
#$h_hash{$c[1]}[$i][2] - $c[3] = |$b|
#$h_hash{$c[1]}[$i][3] - $c[2] = |$d|
#$h_hash{$c[1]}[$i][3] - $c[3] = |$e|
# for all $c[2] and $c[3] where $key[1] == $c[1]
# sort ($a <==> $b) and select the smallest value - return scalar to be used in print output
{print "$h_hash{$c[1]}[$i][0]."\t".$h_hast{$c[1]}[$i][4]."\t outside"."\t".$c[0]."\t".$c[1]."\t".$h_hash{$c[1]}[$i][2]."\t".$h_hash{$c[1]}[$i][3]."\t".$h_hash{$c[1]}[$i][15]."\n"";}
}
else if ( (($h_hash{$c[1]}[$i][2] <= $c[2] #inside hits:
&& $h_hash{$c[1]}[$i][3] <= $c[2]
&& $h_hash{$c[1]}[$i][2] >= $c[3]
&& $h_hash{$c[1]}[$i][3] >= $c[3]))
|| (($h_hash{$c[1]}[$i][2] >= $c[2]
&& $h_hash{$c[1]}[$i][3] >= $c[2]
&& $h_hash{$c[1]}[$i][2] <= $c[3]
&& $h_hash{$c[1]}[$i][3] <= $c[3])) )
{ print "$h_hash{$c[1]}[$i][0]."\t".$h_hast{$c[1]}[$i][4]."\t inside"."\t".$c[0]."\t".$c[1]."\t".$h_hash{$c[1]}[$i][2]."\t".$h_hash{$c[1]}[$i][3]."\t".$h_hash{$c[1]}[$i][15]."\n""}
else (
{ print "$h_hash{$c[1]}[$i][0]."\t".$h_hast{$c[1]}[$i][4]."\t overlap"."\t".$c[0]."\t".$c[1]."\t".$h_hash{$c[1]}[$i][2]."\t".$h_hash{$c[1]}[$i][3]."\t".$h_hash{$c[1]}[$i][15]."\n"";}
}
}
close C;
exit 0;