Hi, I just started learning Python a few days ago... and I'm already stuck on something easy TT.TT
I have a tab-separated values file, data.tsv, that contains 3 columns (country name, area, and population).
Here's a snippet of my current TSV file:
country area population
MACAU 28.2 578025
MONACO 2 30510
SINGAPORE 697 5353494
HONG KONG 1104 7153519
GAZA STRIP 360 1710257
GIBRALTAR 6.5 29034
HOLY SEE (VATICAN CITY) 0.44 836
BAHRAIN 760 1248348
MALDIVES 298 394451
MALTA 316 409836
BERMUDA 54 69080
SINT MAARTEN 34 39088
BANGLADESH 143998 161083804
..........
I would like to aggregate the data by geographic region (North America, South America, etc.). Since the region info is not in the file, I need to pull it from www.indexmundi.com/factbook/regions and merge the region names into the file, so that each region is paired with the correct country, to produce this output:
(what I want my final tsv file to look like)
country region area population
AFGHANISTAN Asia 652230 30419928
ALBANIA Europe 28748 3002859
ALGERIA Africa 2381741 37367226
AMERICAN SAMOA Oceania 199 54947
ANDORRA Europe 468 85082
ANGOLA Africa 1246700 18056072
ANGUILLA Central America & the Caribbean 91 15423
ANTIGUA AND BARBUDA Central America & the Caribbean 442.6 89018
ARGENTINA South America 2780400 42192494
ARMENIA Asia 29743 2970495
ARUBA Central America & the Caribbean 180 107635
AUSTRALIA Oceania 7741220 22015576
AUSTRIA Europe 83871 8219743
AZERBAIJAN Asia 86600 9493600
.............
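To make the target concrete, here is a rough sketch of the merge step I have in mind. It assumes the columns really are separated by tabs and that a country-to-region dict already exists. The names `add_region_column`, `country_to_region`, and the `UNKNOWN` placeholder are made up for illustration, and the sample rows are just a couple of lines from the output above.

```python
def add_region_column(tsv_lines, country_to_region, missing="UNKNOWN"):
    """Return new TSV lines with a 'region' column inserted after 'country'."""
    out = []
    for i, line in enumerate(tsv_lines):
        fields = line.rstrip("\n").split("\t")
        if i == 0:
            # Header row: country, area, population
            # becomes country, region, area, population
            fields.insert(1, "region")
        else:
            # Look the region up by country name; use a placeholder if absent
            key = fields[0].strip().upper()
            fields.insert(1, country_to_region.get(key, missing))
        out.append("\t".join(fields))
    return out


# Hypothetical sample data, only to show the shape of the result
sample = [
    "country\tarea\tpopulation",
    "ALBANIA\t28748\t3002859",
    "ARUBA\t180\t107635",
]
regions = {"ALBANIA": "Europe", "ARUBA": "Central America & the Caribbean"}

for row in add_region_column(sample, regions):
    print(row)
```

Countries that are missing from the dict get the placeholder instead of raising an error, so mismatched names are easy to spot in the output.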
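This is my code right now:
import urllib2
from bs4 import BeautifulSoup

base_url = "http://www.indexmundi.com"
response = urllib2.urlopen(base_url + '/factbook/regions').read()
soup = BeautifulSoup(response, 'html.parser')

# Each <li> on the regions page is expected to hold a link to one region's page
for item in soup.findAll('li'):
    anchor = item.find('a')
    if anchor is None or not anchor.get('href'):
        continue  # skip list items without a link
    region_url = base_url + anchor['href']
    region_soup = BeautifulSoup(urllib2.urlopen(region_url).read(), 'html.parser')
    # The link text inside each <td> is what gets printed (see output below)
    for cell in region_soup.findAll('td'):
        country_link = cell.find('a')
        if country_link is not None:
            print country_link.text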
This only prints out the country names (without their regions), like below:
Algeria
Angola
Benin
Botswana
Burkina Faso
Burundi
Cameroon
Cape Verde
Central African Republic
Chad
Comoros
Congo, Democratic Republic of the
Congo, Republic of the
Cote d'Ivoire
Djibouti
Egypt
etc....
The result I want should be achievable using only BeautifulSoup4 and urllib2 (which I'm already using), so I don't need other, more complicated modules (again, newbie).
I don't think I need to keep following the links from where I am, right? But I'm not sure how to merge the regions into the file with the correct country. I think I would somehow need to save the country names first, so that when I write the region names to my current TSV file, each one ends up next to the country it belongs to.
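My rough idea for "saving the country name first" is a dict keyed by country name, something like the sketch below. It is only a guess at an approach: `build_country_lookup` and `scraped` are hypothetical names, the sample pairs stand in for whatever the scraper actually collects, and I'm assuming the scraped names only need upper-casing and whitespace cleanup to match the TSV (which may not hold for names like "Congo, Democratic Republic of the").

```python
def build_country_lookup(region_pages):
    """Map upper-cased country names to their region name.

    region_pages is an iterable of (region_name, [country names]) pairs,
    for example one pair per region page that was scraped.
    """
    lookup = {}
    for region_name, countries in region_pages:
        for country in countries:
            # The TSV uses upper-case names, so normalise the scraped names:
            # collapse runs of whitespace, then upper-case
            key = " ".join(country.split()).upper()
            lookup[key] = region_name
    return lookup


# Hypothetical scraped results (not the real page contents)
scraped = [
    ("Africa", ["Algeria", "Angola", "Burkina  Faso"]),
    ("Europe", ["Albania", "Andorra"]),
]
lookup = build_country_lookup(scraped)

print(lookup["BURKINA FASO"])
print(lookup.get("ATLANTIS", "UNKNOWN"))
```

With a dict like this, the region for each TSV row could be found with a single `lookup.get(country_name, ...)` call while writing the new file.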
Any help would be greatly appreciated.