Hello all, I want to convert an EMBL-like file to FASTA. I cannot use BioPerl because the file is not in complete EMBL format. Please suggest how to get this done. Here is a sample of the input:
ID 013789-0068
PS TBD
OO huringiensis
OS ringiensis
OX
SI 68
RA
RL 2010. OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD
FT source 1..1176
MT
AC 67106
SV
CT
PN 013789
PT PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM
PA AMA UNIVERSITY,JAPAN LAMB CO LTD.
PI HAYAKAWA TORU (JP) SAKAI, HIROSHI, HAYAKAWA, TORU
P8
P4 10013789
P5 0
PC International Classification: \nUS Classification: \nEuropean Classification: C12N15/62; C07K14/47A25
PR 80199166;
PE 199166
AN 09JP63603
KC 1
P1 ng the DNA into a host bacterium to transform the host bacterium; and (c) causing the expression of the fusion protein in the transformed host bacterium.; The method may further comprise a step of removing the peptide chain (B) from the fusion protein. \n \n
P7
P9 112
PO
PM 10013789;
PB 10013789
PQ 10013789;
EM esentative
W1 PRT
D1 0204
D2 0217
D3 0730
D4 0801
D5 0204
HL [L[P9_GQ;0;3,WO2010013789,45,67]] [L[PM_PN_GQNUC;0;12,WO2010013789]] [L[PQ_PN_GQNUC;0;12,WO2010013789]]
CC mer C1-1-f FH Key Location/Qualifiers Copyright (c)Inc. 2011
LS Application
L2 Publ. Of int. appl. w4
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI
EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV
YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV
LDIVALFSNYDSRRYPIRTVSQLTREIYTNPVLENFDGSFRGMAQRIEQNIRQPHLMDILNSITIYTDVHRGFNYWSGHQ
ITASPVGFSGPEFAFPLFGNAGNAAPPVLVSLTGLGIFRTLSSPLYRRIILGSGPNNQELFVLDGTEFSFASLTTNLPST
IYRQRGTVDSLDVIPPQDNSVPPRAGFSHRLSHVTMLSQAAGAVYTLRAPTFSWQHRSAEFNNIIPSSQITQIPLTKSTN
LGSGTSVVKGPGFTGGDILRRTSPGQISTLRVNITAPLSQRYRVRIRYASTTNLQFHTSIDGRPINQGNFSATMSSGSNL
QSGSFRTVGFTTPFNFSNGSSVFTLSAHVFNSGNEVYIDRIEFVPAEVTFEAEYDLERAQKAVNELFTSSNQIGLKTDVT
DYHIDQVSNLVECLSDEFCLDEKQELSEKVKHAKRLSDERNLLQDPNFRGINRQLDRGWRGSTDITIQGGDDVFKENYVT
LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC
APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR
//
ID 0223489-0068
PS TBD
OO huringiensis
OS ringiensis
OX
SI 68
RA
RL 2010. OKAYAMA UNIVERSITY,JAPAN LAMB CO LTD
FT source 1..1176
MT
AC 67106
SV
CT
PN 013789
PT PRN METHOD, FUSION PROTEIN, AND ANTISERUM
PA AMERSITY,JAMB CO LTD.
PI HAYAKAWA TORU (JP) SAKAI, HIROSHI, HAYAKAWA, TORU
P8
P4 10013789
P5 0
PC International Classification: \nUS Classification: \nEuropean Classification: C12N15/62; C07K14/47A25
PR 80199166;
PE 199166
AN 09JP63603
KC 1
P1 ng the DNA into a host bacterium to transform the host bacterium; and (c) causing the expression of the fusion protein in the transformed host bacterium.; The method may further comprise a step of removing the peptide chain (B) from the fusion protein. \n \n
P7
P9 112
PO
PM 10013789;
PB 10013789
PQ 10013789;
EM esentative
W1 PRT
D1 0204
D2 0217
D3 0730
D4 0801
D5 0204
HL [L[P9_GQ;0;3,WO2010013789,45,67]] [L[PM_PN_GQNUC;0;12,WO2010013789]] [L[PQ_PN_GQNUC;0;12,WO2010013789]]
CC mer C1-1-f FH Key Location/Qualifiers Copyright (c)Inc. 2011
LS Application
L2 Publ. Of int. appl. w4
VLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI
EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV
The output should be in FASTA format: a header line built from the values of the ID, PT and PA lines, followed by the sequence. The two slashes ("//") are the dividing line between two EMBL-like records. Expected output:
>013789-0068 ; PROTEIN PRODUCTION METHOD, FUSION PROTEIN, AND ANTISERUM PA ; AMA UNIVERSITY,JAPAN LAMB CO LTD.
MDNNPNINECIPYNCLSNPEVEVLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQI
EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV
YVQAANLHLSVLRDVSVFGQRWGFDAATINSRYNDLTRLIGNYTDYAVRWYNTGLERVWGPDSRDWVRYNQFRRELTLTV
LDIVALFSNYDSRRYPIRTVSQLTREIYTNPVLENFDGSFRGMAQRIEQNIRQPHLMDILNSITIYTDVHRGFNYWSGHQ
ITASPVGFSGPEFAFPLFGNAGNAAPPVLVSLTGLGIFRTLSSPLYRRIILGSGPNNQELFVLDGTEFSFASLTTNLPST
IYRQRGTVDSLDVIPPQDNSVPPRAGFSHRLSHVTMLSQAAGAVYTLRAPTFSWQHRSAEFNNIIPSSQITQIPLTKSTN
LGSGTSVVKGPGFTGGDILRRTSPGQISTLRVNITAPLSQRYRVRIRYASTTNLQFHTSIDGRPINQGNFSATMSSGSNL
QSGSFRTVGFTTPFNFSNGSSVFTLSAHVFNSGNEVYIDRIEFVPAEVTFEAEYDLERAQKAVNELFTSSNQIGLKTDVT
DYHIDQVSNLVECLSDEFCLDEKQELSEKVKHAKRLSDERNLLQDPNFRGINRQLDRGWRGSTDITIQGGDDVFKENYVT
LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC
APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR
>0223489-0068 ; PRN METHOD, FUSION PROTEIN, AND ANTISERUM PA ; AMERSITY,JAMB CO LTD.
VLGGERIETGYTPIDISLSLTQFLLSEFVPGAGFVLGLVDIIWGIFGPSQWDAFPVQIMNSALTTAIPLLAVQREEMRIQLE
EQLINQRIEEFARNQAISRLEGLSNLYQIYAESFREWEADPTNPALREEMRIQFNDMNSALTTAIPLLAVQNYQVPLLSV
LLGTFDECYPTYLYQKIDESKLKAYTRYQLRGYIEDSQDLEIYLIRYNAKHETVNVPGTGSLWPLSAQSPIGKCGEPNRC
APHLEWNPDLDCSCRDGEKCAHHSHHFSLDIDVGCTDLNEDLGVWVIFKIKTQDGHARLGNLEFLEEKPLVGEALARVKR
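One possible way to do this without BioPerl is a small line-by-line parser. The sketch below is in Python and rests on assumptions drawn only from the sample above: annotation lines are a two-character tag, optionally followed by a space and a value; sequence lines are untagged runs of three or more uppercase letters; "//" ends a record, and the last record may lack it. The function names (embl_like_to_fasta, format_record) are made up for illustration, and the header layout simply copies the example headers, including the literal "PA ;".

```python
import re
from typing import Dict, Iterable, Iterator, List

# Assumption: a sequence line is an untagged run of 3+ uppercase letters.
# Tag-only lines such as "OX" or "RA" are two characters long, so they never match.
SEQ_LINE = re.compile(r"^[A-Z]{3,}$")
WANTED = ("ID", "PT", "PA")  # tags whose values go into the FASTA header


def format_record(fields: Dict[str, str], seq: List[str]) -> str:
    """Build one FASTA entry; the header mirrors the layout in the example output."""
    header = ">{} ; {} PA ; {}".format(
        fields.get("ID", ""), fields.get("PT", ""), fields.get("PA", "")
    )
    return "\n".join([header] + seq)


def embl_like_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA entry per record of the EMBL-like input."""
    fields: Dict[str, str] = {}
    seq: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # Record terminator: emit what was collected and start over.
            if fields or seq:
                yield format_record(fields, seq)
            fields, seq = {}, []
        elif SEQ_LINE.match(line):
            seq.append(line)
        else:
            # Annotation line: keep the value only for the tags we need.
            tag, _, value = line.partition(" ")
            if tag in WANTED:
                fields[tag] = value.strip()
    # The final record in the sample has no trailing "//".
    if fields or seq:
        yield format_record(fields, seq)
```

To use it, open the input file and print each entry, for example: for entry in embl_like_to_fasta(open("input.txt")): print(entry), where "input.txt" is a placeholder file name. If a real file contains tagged values or sequence lines that break the assumptions above (for instance lowercase residues), the SEQ_LINE pattern would need adjusting.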