Hi there,
I'm kind of new to Python and I'm trying to extract a protein sequence from this webpage:
http://www.ncbi.nlm.nih.gov/protein/BAH23558.1
When I use urllib.urlopen, the HTML it returns does not contain the sequence data. When I open the page in Firefox and inspect it with Firebug, I can see the data. It looks like simply using Python to grab the HTML file won't work. I'm not sure why this happens, so if anyone could explain it to me I'd much appreciate it. I suspect the data is loaded server-side and I need to tweak my Python code to include it somehow. I'm currently reading up on the DOM, but any pointers would be appreciated.
P.S. I know I could simply copy the data by hand, but I have to process over 1000 of these links (which I have saved in a text file), so I need to figure out how this works.
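For context, here is a minimal sketch of how the saved links might be turned into record identifiers, written for Python 3. It assumes the text file holds one URL per line in the same form as the link above; the file name `links.txt` and both function names are hypothetical. It only parses the URLs and does not fetch anything.

```python
from urllib.parse import urlparse


def accession_from_url(url):
    """Return the last path component of a protein-page URL, e.g. 'BAH23558.1'."""
    path = urlparse(url.strip()).path
    return path.rstrip("/").rsplit("/", 1)[-1]


def read_accessions(lines):
    """Yield one accession per non-empty line of an iterable of URL strings."""
    for line in lines:
        if line.strip():
            yield accession_from_url(line)


# Hypothetical usage: 'links.txt' is an assumed file name, one URL per line.
# with open("links.txt") as handle:
#     accessions = list(read_accessions(handle))
```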
EDIT:
I'm sure it has something to do with the div with id="viewercontent1". In the HTML pulled by urllib.urlopen(), this div looks like this:
<div id="viewercontent1" class="seq gbff" val="224176120" SequenceSize="7562" VirtualSequence=""></div>
However, when I look at the page in Firefox using Firebug, the div looks like this:
<div style="display: block;" id="viewercontent1" class="seq gbff" val="224176120" sequencesize="7562" virtualsequence=""><div><div class="sequence"><a name="locus_224176120"></a><div class="hnav" id="hnav224176120_0"><div class="goto"><a aria-expanded="false" role="button" href="#goto224176120_0" class="tgt_dark jig-ncbipopper" config="openMethod : 'click', closeMethod : 'click', destPosition: 'bottom left', adjustFit: 'none', triggerPosition: 'bottom left'" id="gotopopper224176120_0">Go to:</a></div></div><div class="tabPopper nonstd_popper" style="display: none;" id="goto224176120_0"><ul class="locals"><li><a href="#feature_224176120" title="Jump to the feature table of this record">Features</a></li><li><a href="#sequence_224176120" title="Jump to the sequence of this record">Sequence</a></li></ul></div>
<pre class="genbank">LOCUS BAH23558 362 aa linear VRL 26-FEB-2009
DEFINITION VP1 [BK polyomavirus].
ACCESSION BAH23558
VERSION BAH23558.1 GI:224176120
DBSOURCE accession <a href="http://www.ncbi.nlm.nih.gov/nuccore/224176116">AB485712.1</a>
KEYWORDS .
SOURCE BK polyomavirus
ORGANISM <a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10629">BK polyomavirus</a>
Viruses; dsDNA viruses, no RNA stage; Polyomaviridae; Polyomavirus.
REFERENCE 1
AUTHORS Sugimoto,C., Hara,K., Taguchi,F. and Yogo,Y.
TITLE Growth efficiency of naturally occurring BK virus variants in vivo
and in vitro
JOURNAL J. Virol. 63 (7), 3195-3199 (1989)
PUBMED <a href="http://www.ncbi.nlm.nih.gov/pubmed/2542627">2542627</a>
REFERENCE 2 (residues 1 to 362)
AUTHORS Zhong,S. and Yogo,Y.
TITLE Direct Submission
JOURNAL Submitted (20-FEB-2009) Contact:Shan Zhong Graduate School of
Medicine, The University of Tokyo, Department of Urology; Hongo
7-3-1, Bunkyo-ku, Tokyo 113-8655, Japan
<a name="comment_224176120"></a><a name="feature_224176120"></a>FEATURES Location/Qualifiers
source 1..362
/organism="BK polyomavirus"
/isolate="MT clone 111"
/isolation_source="urine of a patient with systemic lupus
erythematosus"
/db_xref="taxon:<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10629">10629</a>"
/country="Japan"
/note="complete genome;
vector: pAT153"
<a href="http://www.ncbi.nlm.nih.gov/protein/224176120?from=1&to=362&report=gpwithparts">Protein</a> 1..362
/product="VP1"
<a href="http://www.ncbi.nlm.nih.gov/protein/224176120?from=2&to=362&report=gpwithparts">Region</a> 2..362
/region_name="PHA02614"
/note="Major capsid protein VP1; Provisional"
/db_xref="CDD:<a href="http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=177437">177437</a>"
<a href="http://www.ncbi.nlm.nih.gov/nuccore/224176116?from=1587&to=2675&report=gbwithparts">CDS</a> 1..362
/coded_by="AB485712.1:1587..2675"
ORIGIN
<a name="sequence_224176120"></a> 1 maptkrkgec pgaapkkpkd pvqvpkllik ggvevlevkt gvdaitevec flnpemgdpd
61 enlrgfslkl saendfssds perkmlpcys tariplpnln edltcgnllm weavtvqtev
121 igitsmlnlh agsqkvhehg ggkpiqgsnf hffavggdpl emqgvlmnyr tkypegtitp
181 knptaqsqvm ntdhkayldk nnaypvecwi pdpsrnentr yfgtltggen vppvlhvtnt
241 attvlldeqg vgplckadsl yvsaadicgl ftnssgtqqw rglaryfkir lrkrsvknpy
301 pisfllsdli nrrtqrvdgq pmygmesqve evrvfdgtek lpgdpdmiry idkqgqlqtk
361 ml
//</pre>
<a name="slash_224176120"></a></div>
</div></div>
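To make the goal concrete, this is roughly what I want to end up with once I can get hold of the contents of the `<pre class="genbank">` block. It is a minimal sketch in Python 3 that assumes the text of that block is already available as a string; it does not solve the problem of fetching it. The function name `sequence_from_genbank` is made up for illustration, and the upper-casing is my own choice.

```python
import re


def sequence_from_genbank(text):
    """Return the residues between the ORIGIN line and the closing '//' line.

    Returns None when no ORIGIN ... // section is found.
    """
    match = re.search(r"^ORIGIN.*?$(.*?)^//", text, flags=re.MULTILINE | re.DOTALL)
    if match is None:
        return None
    # Drop any leftover HTML tags (e.g. the <a name="sequence_..."> anchor).
    body = re.sub(r"<[^>]+>", "", match.group(1))
    # Drop position numbers and whitespace, keeping only the residue letters.
    return re.sub(r"[^A-Za-z]", "", body).upper()


# Shortened sample taken from the start of the record shown above.
sample = """ORIGIN
<a name="sequence_224176120"></a> 1 maptkrkgec pgaapkkpkd
 61 enlrgfslkl
//"""
print(sequence_from_genbank(sample))  # MAPTKRKGECPGAAPKKPKDENLRGFSLKL
```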