First of all, hello!
I've been reading the forums for quite a while now, but this time I need real help.
To summarize: I'm building a parser that has to consolidate data based on values contained in an array.
The source file contains tab-separated values, which are parsed into an array whose rows contain:
pdbID | resNum | resID | secstructID
These rows are then consolidated into a file which should contain:
pdbID | startRes | endRes | secstructID
After parsing a file, the source array holds the following data for consolidation:
1b6g 1 M \N
1b6g 2 V \N
1b6g 3 N \N
1b6g 4 N H
1b6g 5 N H
1b6g 6 N \N
3hba 7 W H
2cdg 8 N H
2cdg 9 V \N
2cdg 10 M \N
2cdg 11 A B
2cdg 12 M \N
The expected result after consolidation is:
1b6g 1 3 \N
1b6g 4 5 H
1b6g 6 6 \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 10 \N
2cdg 11 11 B
2cdg 12 12 \N
As you can see, each pdbID is assigned a secStructID sequentially, and any interruption in the secStructID should be treated as a point where the assignment restarts.
Each pdbID can thus have multiple occurrences of, for example, \N in different places in the sequence; they are differentiated by the startRes and endRes values, which are derived from resNum.
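The rule can be written down as a minimal standalone sketch run on the sample rows above. It assumes the rows are already in file order, and the helper name consolidate_runs is made up for illustration; it is not part of the script further down.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapse consecutive rows that share the same pdbID
# and secStructID into one [pdbID, startRes, endRes, secStructID] record.
# Assumes the rows are passed in file order.
sub consolidate_runs {
    my @rows = @_;    # each row: [pdbID, resNum, resID, secStructID]
    my @runs;
    for my $row (@rows) {
        my ($pdb, $res, undef, $sec) = @$row;
        my $last = $runs[-1];
        if ($last && $last->[0] eq $pdb && $last->[3] eq $sec) {
            $last->[2] = $res;                       # same run: extend endRes
        }
        else {
            push @runs, [$pdb, $res, $res, $sec];    # ID changed: start a new run
        }
    }
    return @runs;
}

# Sample rows copied from the listing above (whitespace separated here)
my $sample = <<'END';
1b6g 1 M \N
1b6g 2 V \N
1b6g 3 N \N
1b6g 4 N H
1b6g 5 N H
1b6g 6 N \N
3hba 7 W H
2cdg 8 N H
2cdg 9 V \N
2cdg 10 M \N
2cdg 11 A B
2cdg 12 M \N
END
my @rows = map { [ split ' ', $_ ] } split /\n/, $sample;

# One tab-separated line per run
print join("\t", @$_), "\n" for consolidate_runs(@rows);
```

The key point is that a run is closed as soon as either the pdbID or the secStructID differs from the previous row, so the same secStructID can appear several times per pdbID.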
I have working code that consolidates the data. Unfortunately, it does not treat the occurrence of a new secstructID as the end of the previous one; instead, it finds the last occurrence of that secstructID in the whole sequence for a pdbID and takes that as the end.
As a result, my output is incorrect:
1b6g 4 5 H
1b6g 1 6 \N ---- error here: this should in fact be two separate "entities" (1-3 and 6), because 4 and 5 do not belong to \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 12 \N ---- same here (11 belongs to B, so this should be split into 9-10 and 12)
2cdg 11 11 B
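The effect can be reproduced in isolation. The sketch below uses only the 1b6g rows, groups the residue numbers by secStructID in a hash, and then takes the first and last element of each list. The variable names (@rows, %by_sec) are illustrative and do not come from the script.

```perl
use strict;
use warnings;

# Rows for one pdbID in file order: [resNum, secStructID]
my @rows = ([1, '\N'], [2, '\N'], [3, '\N'], [4, 'H'], [5, 'H'], [6, '\N']);

# One list of residue numbers per secStructID. The position of each row
# relative to rows with a different ID is lost at this point.
my %by_sec;
push @{ $by_sec{ $_->[1] } }, $_->[0] for @rows;

for my $sec (sort keys %by_sec) {
    my @nums = @{ $by_sec{$sec} };
    printf "%s: residues %s -> start %d, end %d\n",
        $sec, "@nums", $nums[0], $nums[-1];
}
# prints:
# H: residues 4 5 -> start 4, end 5
# \N: residues 1 2 3 6 -> start 1, end 6
```

Because residues 1, 2, 3 and 6 all land in the same list, the first and last elements span the H block at 4-5, which is exactly the wrong "1 6 \N" line in the output above.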
And here's my code:
#!/usr/bin/perl -w
use strict;
use warnings;
# --------------------------------------------------------------
# This script uses the residue.txt file generated by
# resTabmakerBatch.pl and creates a new file called
# SecStructList.txt
# each protein is described by secondary structures with a
# pdbID, secondary structure ID (a character or \N), startResidue, endResidue
# Input: residue.txt (this file is the output of resTabmakerBatch.pl)
# Output: secStructList.txt
# usage: secStructList.txt to populate the SecStructure entity
# --------------------------------------------------------------
#Read arguments, print usage message if insufficient
if ($#ARGV<0)
{
    die("\n\nUsage: sstruct.pl [residue_table_file.txt]\n\n");
}
my $filename = $ARGV[0];
#if the input file is not found, exit with an error message
if (! -e "$filename")
{
    die("\n\nresidue file $filename does not exist!\n\n");
}
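# Read the residue.txt file, extracting only the data of interest:
# pdbID, resNum, resID, secondaryStructID
#First read the file, storing each line in the array 'dssplines'
open (MYFILE, "<", $filename) or die ("\nERROR: Can't open $filename: $!\n");
#slurp the whole file and split on any line ending (\r\n, \r or \n);
#split(/\r/, <MYFILE>) reads a single record, so it only handles CR-only files
my @dssplines = do { local $/; split(/\r\n|\r|\n/, <MYFILE>) };
my $arraySize=@dssplines;
close(MYFILE);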
#read one line at a time from the loaded array dssplines and split its
#values on the tabs
my @dsspdata;
my $dsspdataSize=@dssplines;
my $n=0;
for (my $i=0; $i < $arraySize; $i++)
{
    #each line from the array goes into a new dsspline variable
    my $dsspline = $dssplines[$i];
    #the values in the line are separated using the tabs
    my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct,
        $activesite) = split(/\t/, $dsspline);
    # now each value of interest is stored into a new array @dsspdata
    $dsspdata[$n][0] = $pdbID;
    $dsspdata[$n][1] = $resNo;
    $dsspdata[$n][2] = $resID;
    $dsspdata[$n][3] = $secStruct;
    $n++;
}
#the @dsspdata array is now ready to be regrouped into a hash keyed by
#pdbID and secondary structure ID
#initialize the hash
my %dane;
#loop over the dsspdata array
for (my $i=0; $i < $dsspdataSize; $i++)
{
    #pull the fields of each row into variables for the hash
    my $pdb = $dsspdata[$i][0];
    my $residueNum = $dsspdata[$i][1];
    my $secStructure = $dsspdata[$i][3];
    #store each residue number once under its pdbID and secondary structure ID
    push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
}
#open the secondary structures list once, in append mode
open (SStruc, ">>", "secStructList.txt") or die "Can't open file: $!";
#now for each pdbID using the hash keys
foreach my $pdbID ( keys %dane )
{
    #go through the secondary structure IDs stored under this pdbID
    foreach my $secID ( keys %{ $dane{$pdbID} } )
    {
        #take the first and last residue numbers recorded for this ID
        my @resnums = ( $dane{$pdbID}->{$secID}->[0],
                        $dane{$pdbID}->{$secID}->[-1] );
        #append each line to the file with tab separated data
        print SStruc ("$pdbID \t @resnums \t $secID\n");
    }
}
close(SStruc);
I have attached the source file (it is the output of another script, which wasn't nearly as complicated as this issue ;))
Run the program with residue.txt as its argument.
If anyone has an idea how to deal with this, I would be very grateful for suggestions. As you can see from my code, I am slightly Java-twisted.
Cheers,
Matt