A protein entry in the Swiss-Prot database looks something like this:
ID ATF6A_HUMAN
AC P18850; O15139; Q5VW62; Q6IPB5; Q9UEC9;
DT 01-NOV-1990, integrated into UniProtKB/Swiss-Prot.
DE AltName: Full=Activating transcription factor 6 alpha;
DE Short=ATF6-alpha;
OS Homo sapiens (Human).
RN [10]
RP REVIEW.
RX MEDLINE=21376119; PubMed=11483355; DOI=10.1016/S0378-1119(01)00551-0;
RA Hai T., Hartman M.G.;
RT "The molecular biology and nomenclature of the activating
RT transcription factor/cAMP responsive element binding family of
RT transcription factors: activating transcription factor proteins and
RT homeostasis.";
RL Gene 273:1-11(2001).
DR NextBio; 43639; -.
DR PMAP-CutDB; P18850; -.
DR ArrayExpress; P18850; -.
DR Bgee; P18850; -.
DR CleanEx; HS_ATF6; -.
DR GermOnline; ENSG00000118217; Homo sapiens.
DR Pfam; PF00170; bZIP_1; 1.
DR PROSITE; PS50217; BZIP; 1.
DR PROSITE; PS00036; BZIP_BASIC; 1.
PE 1: Evidence at protein level;
KW Activator; Complete proteome; DNA-binding; Endoplasmic reticulum;
KW Glycoprotein; Membrane; Nucleus; Phosphoprotein; Polymorphism;
KW Signal-anchor; Transcription; Transcription regulation; Transmembrane;
KW Unfolded protein response.
//
I wish to parse the protein database to get the names and Pfam numbers of all proteins that are human transcription factors and are also DNA-binding, ignoring all other information.
My protocol is as follows:
Find "ID" line.
Find the "OS" line and see if it contains the word "Human" (compared case-insensitively, since the entry above reads "Homo sapiens (Human).").
See if there are any Pfam numbers in the "DR" lines.
See if the terms "transcription" and "DNA-binding" appear in the "KW" lines.
If the three conditions are met, then print out the result like this:
$ID, $Pfam_number1, $Pfam_number2,...(if more Pfam numbers exist)
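To make the protocol above concrete, here is a rough sketch of how it might look in Python. This is only an outline under some assumptions: every line starts with a two-letter code followed by whitespace, each entry ends with a "//" line, and the function names (parse_entries, summarize, report) are made up for illustration. The keyword test is a guess at what "transcription factor" should mean here (any keyword containing "transcription", plus the exact keyword "DNA-binding").

```python
def parse_entries(lines):
    """Yield one list of lines per entry; an entry ends at a '//' line."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # tolerate a final entry without a closing '//'
        yield entry


def summarize(entry):
    """Return (entry_id, pfam_list) if the entry passes every filter, else None."""
    entry_id = None
    is_human = False
    pfam = []
    keywords = []
    for line in entry:
        code, _, rest = line.partition(" ")
        rest = rest.strip()
        if code == "ID" and entry_id is None and rest:
            entry_id = rest.split()[0]          # first token on the ID line
        elif code == "OS" and "human" in rest.lower():
            is_human = True                     # case-insensitive match
        elif code == "DR":
            fields = [f.strip() for f in rest.split(";")]
            if fields[0] == "Pfam" and len(fields) > 1:
                pfam.append(fields[1])          # e.g. PF00170
        elif code == "KW":
            for kw in rest.split(";"):
                kw = kw.strip().rstrip(".")
                if kw:
                    keywords.append(kw.lower())
    has_transcription = any("transcription" in kw for kw in keywords)
    has_dna_binding = "dna-binding" in keywords
    if entry_id and is_human and pfam and has_transcription and has_dna_binding:
        return entry_id, pfam
    return None


def report(lines):
    """Return one 'ID, Pfam1, Pfam2, ...' string per matching entry."""
    results = []
    for entry in parse_entries(lines):
        hit = summarize(entry)
        if hit is not None:
            results.append(", ".join([hit[0]] + hit[1]))
    return results
```

An open file handle can be passed straight to report(), since the file is read line by line and only one entry is held in memory at a time; printing each returned string gives the output format described above.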
It should be a pretty easy one, but I am not sure how to write a script for this. It has puzzled me for days. Can anyone please help?
I do not need a full script; just the main structure of the script would be very helpful.
Thanks.