I am trying to extract economy-related blog posts from RSS feeds in Python. I have no idea how to fetch a specific number of posts, or how to restrict them to a particular domain (such as the economy).
My project requires analysing these blogs with NLP techniques, but I'm stuck on the first step and don't know how to start.

An RSS feed is XML data, so you need to know how to parse XML. You can parse it with either ElementTree or minidom.
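As a minimal sketch of the ElementTree route, here is one way to take a fixed number of posts on one topic from a feed. The feed text is a made-up sample, the helper name pick_items and its parameters are hypothetical, and matching on each item's <category> element is an assumption: not every blog fills that element in. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; a real one would be downloaded from a blog's RSS URL.
SAMPLE_RSS = """<rss version="2.0">
<channel>
<title>Example blog</title>
<item><title>Inflation outlook</title><link>http://example.com/1</link><category>Economy</category></item>
<item><title>Cup final recap</title><link>http://example.com/2</link><category>Sports</category></item>
<item><title>Interest rates explained</title><link>http://example.com/3</link><category>Economy</category></item>
<item><title>Trade deficits</title><link>http://example.com/4</link><category>Economy</category></item>
</channel>
</rss>"""


def pick_items(rss_text, topic, limit):
    """Return up to `limit` (title, link) pairs whose <category> equals `topic`."""
    root = ET.fromstring(rss_text)
    picked = []
    for item in root.iter("item"):
        if len(picked) >= limit:
            break  # enough posts collected
        # An item may carry several <category> elements; compare case-insensitively.
        categories = [c.text.strip().lower() for c in item.findall("category") if c.text]
        if topic.lower() in categories:
            picked.append((item.findtext("title"), item.findtext("link")))
    return picked


economy_posts = pick_items(SAMPLE_RSS, "economy", 2)
for title, link in economy_posts:
    print(title, link)
```

Feeds without category information would need a different filter, for example a keyword search over each item's title or description.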

Thank you, krystosan.
However, I don't know how to get the RSS feeds for blogs in a particular field (like economics, sports, etc.).

Give an example of, or a link to, what you are trying to extract/parse.
As krystosan mentioned, it's XML data, and there are good tools for this in Python.
There is also a library made only for parsing RSS: Universal Feed Parser.
I like both BeautifulSoup and lxml.
A quick demo with BeautifulSoup:

from bs4 import BeautifulSoup

rss = '''\
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>Python</title>
<link>http://www.reddit.com/r/Python/</link>
<description>
news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
</description>'''

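# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(rss, 'html.parser')
title_tag = soup.find('title')
description_tag = soup.find('description')
# print() with a single argument works in both Python 2 and Python 3
print(title_tag.text)
print(description_tag.text)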

"""Output-->
Python

news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
"""

Thanks snippsat!
It really helped.
However, when I try to get the content of the blog using
content_tag = soup.find('content:encoded')
it gives only the first paragraph of the website (the first place where "content:encoded" occurs).

Use find_all().

The find_all() method scans the entire document and returns every match, whereas find() stops at the first one.
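The same first-match versus every-match distinction can be shown without third-party packages. This sketch uses the standard library's ElementTree (find and findall) rather than BeautifulSoup, on a made-up two-post feed; the namespace URI is the one used by the RSS content module that defines content:encoded. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the RSS "content" module, which defines <content:encoded>.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Made-up two-post feed for illustration.
SAMPLE_RSS = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item><title>Post one</title><content:encoded>Body of post one</content:encoded></item>
<item><title>Post two</title><content:encoded>Body of post two</content:encoded></item>
</channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# find() stops at the first match (".//" searches all descendants).
first = root.find(".//content:encoded", NS)
print(first.text)

# findall() returns a list containing every match.
bodies = [node.text for node in root.findall(".//content:encoded", NS)]
print(bodies)
```

In BeautifulSoup the corresponding calls are soup.find('content:encoded') and soup.find_all('content:encoded').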

I have done this:
Getting only the text and removing the HTML tags with get_text() generally works, but somehow it doesn't work in this code. What am I doing wrong?

import urllib2
page = urllib2.urlopen("http://www.frugalrules.com")
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print link['href']
rss = urllib2.urlopen(link['href']).read()
souprss = BeautifulSoup(rss)
content_tag = souprss.find_all('content:encoded')
for row in content_tag:
    print(row.get_text())
for node in row.findAll('p'):
    print''.join(node.findAll(text=True))

The code above does not work, and neither does defining a list of invalid tags to strip (the tags are nested and there are too many invalid ones).
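One likely cause, offered as an assumption because the feed itself is not shown: feeds usually store the post body inside content:encoded as escaped or CDATA-wrapped HTML, so once the feed is parsed the body is a single string of markup rather than a tree of tags, and it has to be parsed a second time as HTML. Note also that the second for loop in the code above sits outside the first, so it only ever looks at the last row. Below is a sketch of that second pass using the standard library's HTMLParser in Python 3; the class name TextExtractor, the helper strip_tags and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_tags(html_text):
    """Return the visible text of an HTML fragment, however deeply tags nest."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.parts)


# Made-up stand-in for the string found inside one <content:encoded> element.
body = '<p>Saving <strong>money</strong> is hard.</p><div><p>Nested <a href="#">link</a> text.</p></div>'
plain = strip_tags(body)
print(plain)
```

With BeautifulSoup, the equivalent would be to pass each row's text to BeautifulSoup again and call get_text() on the result, inside the loop over the rows.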
