I've been tearing my hair out for 2 days over this, so hopefully someone here can help me. I'm trying to scrape the price data off the following webpage:
http://www.morningstar.co.uk/UK/snapshot/snapshot.aspx?lang=en-GB&id=F0GBR04S4X
The value I want currently stands at 6.19 (i.e. the NAV value on the right-hand side).
I currently have a working macro written in VBA in Excel that uses the following regular expression to do this:
(GBP).\d{1,2}[.]\d\d
but for some reason I can't get this to work in Python, and I want to move to Python for a few reasons I won't go into here. (I repeat this for various unit trusts, hence the {1,2} bit.)
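To make it concrete, here is a minimal sketch of how I understand the same pattern would be written in Python. The `sample` string is made up for illustration (I'm assuming "GBP" is followed by a single separator character and then the number; the real page markup may differ). Moving the parentheses around the number means `group(1)` returns just the price, whereas `group(0)` would return the whole match including "GBP".

```python
import re

# Hypothetical fragment for illustration only; the real page markup may differ.
sample = "<td>NAV</td><td>GBP\xa06.19</td>"

# Raw string so the backslashes reach the regex engine untouched.
# The parentheses capture only the number, leaving out "GBP" and the separator.
pattern = re.compile(r"GBP.(\d{1,2}\.\d\d)")

match = pattern.search(sample)
nav = match.group(1) if match else None
print(nav)
```

Note that `.` does not match a newline by default, so this would miss the price if "GBP" and the number end up on different lines.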
Below is a Python script I've written to download the webpage contents and then prettify them using Beautiful Soup. If I don't do this, the encoding of the webpage is difficult to decipher.
After 2 days I still can't get a Python-compatible regular expression to grab this data. In VBA I also use the Left() and Right() functions to remove any whitespace and text characters from the resulting string; any ideas on how to do that in Python would be most gratefully received!
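For reference, this is roughly what I'd expect the Python equivalent of that Left()/Right() clean-up to look like, using `str.strip()` and slicing. The `raw` string is a made-up example of matched text, not actual output from the page, so treat this as a sketch only.

```python
# Hypothetical matched text with stray whitespace around it.
raw = "  GBP 6.19 \n"

cleaned = raw.strip()     # drop leading/trailing whitespace
left3 = cleaned[:3]       # like Left(s, 3): the first three characters
right4 = cleaned[-4:]     # like Right(s, 4): the last four characters
price = float(right4)     # convert the numeric text to a number

print(price)
```

The fixed width of 4 only works for prices like 6.19; a two-digit price such as 12.34 would need 5 characters, which is one reason capturing the number with a regex group seems less fragile.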
How do I grab the 6.19 from this page (or whatever the price is when you look!)?
#!/usr/bin/python
import re
import sys
import urllib
from BeautifulSoup import BeautifulSoup  # requires python-beautifulsoup package
# documentation = http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick%20Start

#pattern = r'(>GBP).\d{1,2}[.]\d\d'  # this is the VBA regex pattern that works in MS Excel
pattern = r'\d{1,2}[.]\d\d'
urladdress = "http://www.morningstar.co.uk/UK/snapshot/snapshot.aspx?lang=en-GB&id=F0GBR04S4X"

try:
    # get data from web into one string
    url = urllib.urlopen(urladdress)
    htmltext = url.read()
    url.close()

    # Beautiful Soup bit: parse, then turn back into a prettified string
    soup = BeautifulSoup(htmltext)
    html = soup.prettify()

    # use regular expression to search through for price using above pattern
    price = re.search(pattern, html)
    if price is None:
        print 'no result'
        sys.exit(1)
    else:
        print price.group(0)
except StandardError, e:
    print str(e)
    sys.exit(1)