I need some help setting up a Perl script that will compute statistics on some data I have in an Excel file. My plan was to export the Excel file to a text file with tab-delimited columns and run the script on that. I downloaded a simple statistics package from http://search.cpan.org/~brianl/Statistics-Lite-3.2/Lite.pm and copied the file to my bin folder. However, Perl cannot locate the package when I include use Statistics::Lite qw(:all); in the script.
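As far as I understand it, use Statistics::Lite makes Perl look for a file named Statistics/Lite.pm under each directory listed in @INC, so a bare Lite.pm dropped into a bin folder would not be picked up. Below is a minimal sketch of pointing Perl at a private module directory. It assumes a hypothetical layout in which the file is saved as lib/Statistics/Lite.pm next to the script; the lib directory name is my own choice, not something the package requires.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use FindBin qw($Bin);   # core module: $Bin is the directory holding this script
use lib "$Bin/lib";     # hypothetical layout: $Bin/lib/Statistics/Lite.pm

# Load the module at run time and report what happened, instead of
# dying at compile time the way a plain "use" would.
if (eval { require Statistics::Lite; 1 }) {
    print "Statistics::Lite loaded from $INC{'Statistics/Lite.pm'}\n";
} else {
    print "Statistics::Lite not found. Directories searched:\n";
    print "  $_\n" for @INC;
}
```

If the module loads this way, the eval/require check could presumably be swapped back for the usual use Statistics::Lite qw(:all); line placed after the use lib line.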
Meanwhile, I've included some sample data and the script as it stands now. If anyone can help me with this script, or knows of a simple way to compute the basic statistics (min, max, mean, mode and standard deviation) without installing an extra package, that would be great.
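To make the package-free option concrete, here is a rough sketch of the five statistics in plain Perl with no modules at all. The sub names (min_of, max_of, mean_of, mode_of, stddev_of) are hypothetical names of my own, the standard deviation is the sample form (dividing by n - 1), and breaking mode ties in favour of the smallest value is an arbitrary choice on my part.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Smallest and largest value of a list
sub min_of { my $m = shift; for (@_) { $m = $_ if $_ < $m } return $m }
sub max_of { my $m = shift; for (@_) { $m = $_ if $_ > $m } return $m }

# Arithmetic mean
sub mean_of {
    my $sum = 0;
    $sum += $_ for @_;
    return $sum / @_;
}

# Most frequent value; ties go to the smallest value (my assumption)
sub mode_of {
    my %count;
    $count{$_}++ for @_;
    my ($mode) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;
    return $mode;
}

# Sample standard deviation (divides by n - 1); 0 for a single value
sub stddev_of {
    return 0 if @_ < 2;
    my $mean = mean_of(@_);
    my $ss   = 0;
    $ss += ($_ - $mean) ** 2 for @_;
    return sqrt($ss / (@_ - 1));
}

# Start sites of scaffold657_6__ taken from my sample data
my @start = (11361, 13353, 20463);
printf "%d\t%d\t%.2f\t%d\t%.2f\n",
    min_of(@start), max_of(@start), mean_of(@start),
    mode_of(@start), stddev_of(@start);
```

When every value in a group is distinct, as in this example, the mode is not very meaningful; this sketch simply reports the smallest value in that case.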
The script needs to:
1) first group the lines of the file by scaffold, which is the value in column [1] (counting columns from 0);
2) determine whether more than one line item has the same value in column [1]; if so, then...
3) take all line items with the same value in column [1] (i.e. they are all part of the same group) and put their start sites (column [2]) through the statistical analysis of minimum, maximum, mean, mode, and standard deviation. (To do this, all line items belonging to a single group would need to be sorted by column [2].)
I would like the script to generate an output file with the statistical analysis for each group on a separate line.
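The three steps above could be sketched roughly as follows. This is only an outline under my own assumptions: the lines are hard-coded from my sample and trimmed to four columns, the variable names (%starts, @report) are made up for illustration, and List::Util (which ships with Perl) only covers min, max and sum, so mode and standard deviation are left out here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max sum);   # core module, ships with Perl

# Three lines copied from my sample, trimmed to the first four columns
my @lines = (
    "12 scaffold657_4__ 259 D",
    "12 scaffold657_5__ 8776 D",
    "12 scaffold657_5__ 14581 D",
);

# Step 1: bucket the start sites (column [2]) by scaffold (column [1])
my %starts;
for my $line (@lines) {
    my @col = split ' ', $line;
    push @{ $starts{ $col[1] } }, $col[2];
}

# Steps 2 and 3: skip single-item groups, sort the rest numerically,
# and build one tab-separated output line per group:
# scaffold, item count, min, max, mean
my @report;
for my $scaffold (sort keys %starts) {
    my @sorted = sort { $a <=> $b } @{ $starts{$scaffold} };
    next unless @sorted > 1;
    push @report, join "\t", $scaffold, scalar(@sorted),
        min(@sorted), max(@sorted), sum(@sorted) / @sorted;
}
print "$_\n" for @report;
```

The sketch prints to standard output, so the output file could be produced by redirecting it on the command line.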
Here is the sample file:
12 scaffold656_7__ 793 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
17 scaffold657_1__ 10860 D 17 ptc mi 482.1_MI 0 1 2.36e 1 31 94.12
12 scaffold657_3__ 226 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_3__ 1348 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_4__ 259 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_5__ 8776 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_5__ 14581 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_6__ 11361 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_6__ 13353 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_6__ 20463 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
21 scaffold657_9__ 4998 D 21 ath mi 414_MIMA 0 2 3.42e 2 36 90.48
12 scaffold657_9__ 6733 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
12 scaffold657_9__ 6855 D 12 mdv1 mi M3*_MI 0 2 2.82e 3 18 83.33
Here is the script:
#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Lite qw(:all);

# Take the data file name from the command line
my $file = shift @ARGV or die "Usage: $0 datafile\n";
open my $fh, '<', $file or die "Cannot open data file '$file': $!\n";

# Load the data into an array of arrays, one inner array per line
my @in;
while (my $line = <$fh>) {
    chomp $line;
    next unless $line =~ /\S/;        # skip blank lines
    push @in, [ split ' ', $line ];   # split on any run of spaces or tabs
}
close $fh;

# Sort by scaffold (column [1]), then by start site (column [2])
@in = sort my_sort @in;

# Collect the start sites of each scaffold into a hash of arrays,
# remembering the order in which the scaffolds first appear
my (%starts, @order);
foreach my $aref (@in) {
    my $scaffold = $$aref[1];
    push @order, $scaffold unless exists $starts{$scaffold};
    push @{ $starts{$scaffold} }, $$aref[2];
}

# Print one line of statistics per scaffold with more than one line item
foreach my $scaffold (@order) {
    my @start = @{ $starts{$scaffold} };
    next unless @start > 1;
    print join("\t", $scaffold, min(@start), max(@start), mean(@start),
        mode(@start), stddev(@start)), "\n";
}

sub my_sort {
    my $r = $$a[1] cmp $$b[1];      # Compare scaffold (group)
    if ($r == 0) {                  # If both lines are in the same group
        $r = $$a[2] <=> $$b[2];     # Compare start site
    }
    return $r;
}
exit 0;