Hi all,

I am trying to write a script that translates DNA to protein.
I found a script on the internet and tried to adapt it so that the results are printed in columns, but I was not successful.
Could you please show me how to solve this problem? For example:

I hope the output looks like this:
Ref_aa   Mul_aa   Ref_Pro   Mul_pro   infor
AGA      AAA      R         K         change
CCA      CCT      P         P         same
GCA      ACA      A         T         change
GCA      ACA      A         T         change

print "\n\n\t\#################### DNA 2 PROTEIN #################### \n\n";
print "This script will convert your DNA sequence to PROTEIN Sequence\n\n";
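print 'ENTER THE DNA SEQUENCE THAT COMPARE WITH TRIPRET_1ST := ';
my $DNA_filename = <STDIN>;
chomp $DNA_filename;
# Three-argument open; report the system error if the file cannot be read.
open(DNAFILE, '<', $DNA_filename)
    or die "Cannot open file \"$DNA_filename\": $!\n\n";
@DNA = <DNAFILE>;
close DNAFILE;
$DNA = join('', @DNA);
print " \nThe original DNA file is:\n$DNA \n";
# Remove all whitespace (including newlines) before splitting into codons.
$DNA =~ s/\s//g;


open(BASE, '>', 'Ref_aa1.txt') or die "Cannot open Ref_aa1.txt: $!\n";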
my $protein='';
my $codon;
for(my $i=0;$i<(length($DNA)-2);$i+=3)
{

$codon=substr($DNA,$i,3);
$protein.=&codon2aa($codon);
$out=$protein;
}
print (BASE "Ref_aa\t   Mul_aa \t Ref_Pro\t   Mul_pro\n ");
print BASE $DNA."\t";
print BASE $out."\n\n";
close (BASE);


sub codon2aa{
my($codon)=@_;
$codon=uc $codon;
my(%g)=(
'TCA'=>'S', #Serine
'TCC'=>'S', #Serine
'TCG'=>'S',  #Serine
'TCT'=>'S', #Serine 
'TTC'=>'F', #Phenylalanine 
'TTT'=>'F', #Phenylalanine 
'TTA'=>'L', #Leucine 
'TTG'=>'L', #Leucine 
'TAC'=>'Y', #Tyrosine 
'TAT'=>'Y', #Tyrosine 
'TAA'=>'_', #Stop 
'TAG'=>'_', #Stop 
'TGC'=>'C', #Cysteine 
'TGT'=>'C', #Cysteine 
'TGA'=>'_', #Stop 
'TGG'=>'W', #Tryptophan 
'CTA'=>'L', #Leucine 
'CTC'=>'L', #Leucine 
'CTG'=>'L', #Leucine 
'CTT'=>'L', #Leucine 
'CCA'=>'P', #Proline 
'CAT'=>'H', #Histidine 
'CAA'=>'Q', #Glutamine 
'CAG'=>'Q', #Glutamine 
'CGA'=>'R', #Arginine 
'CGC'=>'R', #Arginine 
'CGG'=>'R', #Arginine 
'CGT'=>'R', #Arginine 
'ATA'=>'I', #Isoleucine 
'ATC'=>'I', #Isoleucine 
'ATT'=>'I', #Isoleucine 
'ATG'=>'M', #Methionine 
'ACA'=>'T', #Threonine 
'ACC'=>'T', #Threonine 
'ACG'=>'T', #Threonine 
'ACT'=>'T', #Threonine 
'AAC'=>'N', #Asparagine 
'AAT'=>'N', #Asparagine 
'AAA'=>'K', #Lysine 
'AAG'=>'K', #Lysine 
'AGC'=>'S', #Serine 
'AGT'=>'S', #Serine 
'AGA'=>'R', #Arginine 
'AGG'=>'R', #Arginine 
'CCC'=>'P', #Proline 
'CCG'=>'P', #Proline 
'CCT'=>'P', #Proline 
'CAC'=>'H', #Histidine 
'GTA'=>'V', #Valine 
'GTC'=>'V', #Valine 
'GTG'=>'V', #Valine 
'GTT'=>'V', #Valine 
'GCA'=>'A', #Alanine 
'GCC'=>'A', #Alanine 
'GCG'=>'A', #Alanine 
'GCT'=>'A', #Alanine 
'GAC'=>'D', #Aspartic Acid 
'GAT'=>'D', #Aspartic Acid 
'GAA'=>'E', #Glutamic Acid 
'GAG'=>'E', #Glutamic Acid 
'GGA'=>'G', #Glycine 
'GGC'=>'G', #Glycine 
'GGG'=>'G', #Glycine 
'GGT'=>'G', #Glycine 
 ); 
if(exists $g{$codon})
{
return $g{$codon};
}
else
{
print STDERR "Bad codon \"$codon\"!!\n";
exit;
}

}

I don't see how your script decides whether to print 'change' or 'same'. What is the rule that determines this?

Sorry, I forgot to log out, so I did not answer your question. I do not know how to make the script print 'change' or 'same' in this case. If Ref_pro equals Mul_pro, it should print 'same'; otherwise it should print 'change'.

The rule in this case is:
If Ref_pro = Mul_pro, print 'same';
else print 'change'.

#!/usr/bin/perl
use strict;
use warnings;

my %codon2proteins = build_hash();

my $dna_filename = 'dna.txt';

open my $dna_fh, '<', $dna_filename or die "Failed to open $dna_filename: $!";

while (my $rec = <$dna_fh>){
    chomp($rec);
    my ($ref, $mul) = split /\s+/, $rec;
    my $ref_pro = convert_codon2prot($ref);
    my $mul_pro = convert_codon2prot($mul);
    my $info = compare_pros($ref_pro, $mul_pro);
    print "$ref\t$mul\t$ref_pro\t$mul_pro\t$info\n";
}

sub compare_pros{
    my ($r, $m) = @_;
    if ($r eq $m){
        return 'same';
    }
    else {
        return 'change';
    }
}

sub convert_codon2prot{
    my ($codon) = @_;
    
    if(exists $codon2proteins{$codon}){
        return $codon2proteins{$codon};
    }
    else{
        die "Bad codon $codon!!\n";
    }
}

sub build_hash{
    my(%g)=(
            'TCA'=>'S', #Serine
            'TCC'=>'S', #Serine
            'TCG'=>'S',  #Serine
            'TCT'=>'S', #Serine 
            'TTC'=>'F', #Phenylalanine 
            'TTT'=>'F', #Phenylalanine 
            'TTA'=>'L', #Leucine 
            'TTG'=>'L', #Leucine 
            'TAC'=>'Y', #Tyrosine 
            'TAT'=>'Y', #Tyrosine 
            'TAA'=>'_', #Stop 
            'TAG'=>'_', #Stop 
            'TGC'=>'C', #Cysteine 
            'TGT'=>'C', #Cysteine 
            'TGA'=>'_', #Stop 
            'TGG'=>'W', #Tryptophan 
            'CTA'=>'L', #Leucine 
            'CTC'=>'L', #Leucine 
            'CTG'=>'L', #Leucine 
            'CTT'=>'L', #Leucine 
            'CCA'=>'P', #Proline 
            'CAT'=>'H', #Histidine 
            'CAA'=>'Q', #Glutamine 
            'CAG'=>'Q', #Glutamine 
            'CGA'=>'R', #Arginine 
            'CGC'=>'R', #Arginine 
            'CGG'=>'R', #Arginine 
            'CGT'=>'R', #Arginine 
            'ATA'=>'I', #Isoleucine 
            'ATC'=>'I', #Isoleucine 
            'ATT'=>'I', #Isoleucine 
            'ATG'=>'M', #Methionine 
            'ACA'=>'T', #Threonine 
            'ACC'=>'T', #Threonine 
            'ACG'=>'T', #Threonine 
            'ACT'=>'T', #Threonine 
            'AAC'=>'N', #Asparagine 
            'AAT'=>'N', #Asparagine 
            'AAA'=>'K', #Lysine 
            'AAG'=>'K', #Lysine 
            'AGC'=>'S', #Serine 
            'AGT'=>'S', #Serine 
            'AGA'=>'R', #Arginine 
            'AGG'=>'R', #Arginine 
            'CCC'=>'P', #Proline 
            'CCG'=>'P', #Proline 
            'CCT'=>'P', #Proline 
            'CAC'=>'H', #Histidine 
            'GTA'=>'V', #Valine 
            'GTC'=>'V', #Valine 
            'GTG'=>'V', #Valine 
            'GTT'=>'V', #Valine 
            'GCA'=>'A', #Alanine 
            'GCC'=>'A', #Alanine 
            'GCG'=>'A', #Alanine 
            'GCT'=>'A', #Alanine 
            'GAC'=>'D', #Aspartic Acid 
            'GAT'=>'D', #Aspartic Acid 
            'GAA'=>'E', #Glutamic Acid 
            'GAG'=>'E', #Glutamic Acid 
            'GGA'=>'G', #Glycine 
            'GGC'=>'G', #Glycine 
            'GGG'=>'G', #Glycine 
            'GGT'=>'G', #Glycine 
    );
    return %g;
}

Outputs

AGA	AAA	R	K	change
CCA	CCT	P	P	same
GCA	ACA	A	T	change
GCA	ACA	A	T	change
etc...

Thank you very much for your help. It works very well. Could you show me how to process the data in DNA1.txt if I want to add a position column to the output?

Output:
Posi   Ref   Mul   pro_ref   pro_mul   infor
1      AGA   AAA   R         K         change
2      CCA   CCT   P         P         same
etc.

You could create another variable to hold the position value and include that variable in the string that you print.

Instead of my ($ref, $mul) = split /\s+/, $rec; you could have my ($pos, $ref, $mul) = split /\s+/, $rec;
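
Both suggestions can be sketched roughly as follows. This is only an illustration under some assumptions: the helper name format_row, the cut-down codon table, and the sample records are made up here, and the real script would read from the file and use the full hash returned by build_hash().

```perl
use strict;
use warnings;

# Hypothetical cut-down table; the real script would use build_hash().
my %codon2aa = (AGA => 'R', AAA => 'K', CCA => 'P', CCT => 'P');

# Build one tab-separated output row from a position and two codons.
sub format_row {
    my ($pos, $ref, $mul) = @_;
    for my $codon ($ref, $mul) {
        die "Bad codon $codon!!\n" unless exists $codon2aa{$codon};
    }
    my ($ref_pro, $mul_pro) = @codon2aa{$ref, $mul};
    my $info = $ref_pro eq $mul_pro ? 'same' : 'change';
    return join("\t", $pos, $ref, $mul, $ref_pro, $mul_pro, $info);
}

# Option 1: the input has no position column, so keep a counter.
my @no_pos = ("AGA\tAAA", "CCA\tCCT");
my $pos = 0;
for my $rec (@no_pos) {
    my ($ref, $mul) = split ' ', $rec;
    print format_row(++$pos, $ref, $mul), "\n";
}

# Option 2: the input already has a position column; read it with the codons.
my @with_pos = ("1 AGA AAA", "2 CCA CCT");
for my $rec (@with_pos) {
    my ($p, $ref, $mul) = split ' ', $rec;
    print format_row($p, $ref, $mul), "\n";
}
```

When the records come from a file handle, Perl's built-in $. variable holds the current input line number and could replace the counter in option 1, provided that every line in the file is a data record.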

Thank you very much. But with the data in 'DNA.txt' the script fails with the error "Bad codon $ref".
I fixed the error with

while (<$dna_fh>){
    next unless /^\s*\d/; 
         
    my ($pos, $ref, $mul) =  split ;

Could you explain what the difference between the two versions of the script is?

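Adding a chomp; statement before the split is good practice, but it is probably not what makes the difference here. Both split with no arguments and split /\s+/ treat the trailing newline as a separator and drop the empty field after it, so the last codon on a line never includes the newline. The likely differences are these: next unless /^\s*\d/; skips every line that does not begin with a number, such as a header line, whose words would otherwise be looked up as codons; split with no arguments ignores leading whitespace, whereas split /\s+/ returns an empty first field when a line starts with spaces, which shifts every value into the wrong variable; and your version reads three fields, so the position no longer ends up in $ref.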
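
To see exactly which fields each form of split returns, a small experiment helps. The sample lines below are invented for illustration and are not taken from DNA.txt; note that neither line is chomped before it is split.

```perl
use strict;
use warnings;

# Invented sample line: leading spaces, a position, two codons, a newline.
my $line = "  1\tAGA\tAAA\n";

# An explicit /\s+/ pattern keeps an empty field before the leading spaces.
my @explicit = split /\s+/, $line;

# split ' ' behaves like split with no arguments: leading whitespace is skipped.
my @bare = split ' ', $line;

# Print each field in brackets so that stray whitespace would be visible.
print "explicit: ", join(',', map { "[$_]" } @explicit), "\n";
print "bare:     ", join(',', map { "[$_]" } @bare), "\n";

# A header line does not start with a digit, so the guard would skip it.
my $header = "Posi\tRef\tMul\n";
print "header skipped\n" unless $header =~ /^\s*\d/;
```

The first form yields four fields, the first of them empty, and the second yields three. In both cases the last field is AAA with no newline attached.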
