Hey all. First post to this forum, though I've casually browsed threads here in the past. I'm pretty new to Python but have several years of experience programming in C, C++, and Java. I have a Java app currently deployed to Google App Engine and wanted to do some batch processing on a local machine (basically going through the data store, pulling records that represent web pages, and checking whether those pages are still active). The easiest way to do this seemed to be the Remote API (http://code.google.com/appengine/articles/remote_api.html), which is Python-only, so I figured I'd just go ahead and try working in Python. Cut to two days later and I feel I have a decent handle on the language.

I'm posting here because I've hit an odd (or maybe not so odd) problem. I wrote a script that goes to my data store and fetches records (each of which contains a URL for a web page and some other properties), then iterates through them and uses urllib2.urlopen() and readline() to look at the contents of the page associated with each record. If the page is no longer active (these are craigslist housing listings, so if I get a 404 or a similar message), the script removes the associated record from the data store.

This works alright for the most part, but I've noticed that about 10% of the time urllib2 grabs a different page than Firefox does for the same URL. An example from just a minute ago: for "http://newyork.craigslist.org/aap/jsy/abo/1297710020.html", the page received by the Python script has a "this posting has been deleted by its author" message, whereas the page downloaded by Firefox is totally active. When I try to download these odd pages with wget or from a Java program, I get the same erroneous content that my Python script gets.
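For reference, the core of the check looks roughly like this (stripped down; the function name is just a placeholder, and I've left out the datastore record handling):

import urllib2

def page_is_gone(url):
    # A listing counts as gone if the server returns a hard 404
    # or serves a page containing craigslist's "deleted" notice.
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError, e:
        return e.code == 404
    contents = page.read()
    return 'this posting has been deleted by its author' in contents.lower()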
Thinking it might be a user-agent discrimination type thing, I added the headers from my browser to my request (basically trying to spoof the server):

import urllib2

url = 'http://newyork.craigslist.org/aap/jsy/abo/1297710020.html'  # the example page from above
txdata = None  # no POST data, so this is a plain GET
txheaders = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2',
    'Accept': 'text/html, image/jpeg, image/png, text/*, image/*, */*',
    'Accept-Language': 'en-us',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Keep-Alive': '300',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
}
req = urllib2.Request(url, txdata, txheaders)
pagefile = urllib2.urlopen(req)
The only header I left out was Accept-Encoding: gzip,deflate, because it (logically enough, I guess) resulted in the server sending me back a compressed page. I still don't end up with the right page (i.e., the page I see in Firefox or Konqueror). So I guess the question is: what is happening here, and what can I do so that my script is able to access the same pages my browser is accessing?
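(P.S. If keeping the Accept-Encoding header turns out to matter, I assume I could just decompress the response myself, something like the following, though I haven't needed to so far:)

import gzip
import StringIO
import urllib2

req = urllib2.Request(url, None, {'Accept-Encoding': 'gzip'})  # url as above
response = urllib2.urlopen(req)
data = response.read()
# The server may or may not actually compress; check the response header.
if response.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()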
Thanks,
Nick Z