I'm trying to write my first web scraper in Python, using simple regular expressions to match the info I want to extract. (I realize BeautifulSoup is available, but I'm not ready to use it yet, so I want to figure out regular expressions first.)
I want to extract information about tennis players and their rankings from a table on a tennis site. The rank of each player is contained within lines like the ones below:
<div class="entrylisttext">1</div>
<div class="entrylisttext">2</div>
<div class="entrylisttext">3</div>
So I wrote a regular expression that matches on ">[\d+]</div>", which I thought would output all of the ranks as a list of numbers like this: . I'll associate these ranks with their respective players later as I develop the code further, but right now the brackets [] are not working the way I thought they would in Python (that is, as specifying the set of characters to match).
import re
import urllib
f = urllib.urlopen("http://www.atptennis.com/3/en/rankings/entrysystem/")
tennis_rankings = f.read()
tennis_players = re.compile(">[\d+]</div>", re.I | re.S | re.M)
find_result = tennis_players.findall(tennis_rankings)
print find_result
The code above outputs this:
There are two problems I need to solve: 1) this outputs the whole match, not just the number, and 2) this outputs only the numbers 1-9, not 10 or above. If I delete the brackets [] from the above code, it matches all 100 players listed, but it still returns the whole ">3</div>" string.
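To make this easier to reproduce without fetching the page, here is a minimal sketch that runs both patterns against a hard-coded sample. The `sample` string is my own assumption, written in the same shape as the lines above (including a made-up rank 10 line to show the multi-digit case); it is not copied from the site.

```python
import re

# Hypothetical sample in the same shape as the page's rank lines.
# The rank-10 line is an assumption, added to show the multi-digit case.
sample = (
    '<div class="entrylisttext">1</div>\n'
    '<div class="entrylisttext">2</div>\n'
    '<div class="entrylisttext">10</div>\n'
)

# The pattern from my code, with brackets around \d+
with_brackets = re.findall(r">[\d+]</div>", sample)

# The same pattern with the brackets deleted
without_brackets = re.findall(r">\d+</div>", sample)

print(with_brackets)     # ['>1</div>', '>2</div>']
print(without_brackets)  # ['>1</div>', '>2</div>', '>10</div>']
```

With the brackets, the rank 10 line is skipped; without them, every line matches, but each result still includes the surrounding ">" and "</div>".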
Any help would be appreciated.