Extracting blogs from RSS feeds

Question

Remy the cook 0 Newbie Poster

10 Years Ago

I am trying to extract economy-related blogs from RSS feeds in Python. I have no idea how to fetch a specific number of blogs, or how to restrict them to a particular domain (such as the economy).
My project requires analysing these blogs using NLP techniques, but I'm stuck on the first step and I don't know how to start.


krystosan 0 Junior Poster

10 Years Ago

An RSS feed is XML data, so you need to know how to parse XML. You can parse it with either ElementTree or minidom.
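As a rough illustration of the ElementTree route, here is a minimal sketch using only the standard library. The feed text, the URLs in it, and the helper name `first_items` are made up for the example; a real feed would be downloaded first. Slicing the list of items is one simple way to keep only a specific number of entries.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed (an assumption for this sketch, not a real blog).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Economy Blog</title>
    <item><title>Inflation basics</title><link>http://example.com/1</link></item>
    <item><title>Interest rates</title><link>http://example.com/2</link></item>
    <item><title>Trade deficits</title><link>http://example.com/3</link></item>
  </channel>
</rss>"""


def first_items(rss_text, limit):
    """Return (title, link) pairs for at most `limit` items in the feed."""
    root = ET.fromstring(rss_text)
    # In RSS 2.0 the entries are <item> elements under <channel>.
    items = root.findall("./channel/item")[:limit]
    return [(item.findtext("title"), item.findtext("link")) for item in items]


for title, link in first_items(SAMPLE_RSS, 2):
    print(title, link)
```

Calling `first_items(SAMPLE_RSS, 2)` keeps the first two entries and ignores the third.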

snippsat 661 Master Poster

10 Years Ago

Give an example of, or a link to, what you are trying to extract/parse.
As mentioned by krystosan, it's XML data, and there are good tools for this in Python.
There is also a library made only for parsing RSS: Universal Feed Parser.
I like both BeautifulSoup and lxml.
A quick demo with BeautifulSoup:

from bs4 import BeautifulSoup

rss = '''\
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>Python</title>
<link>http://www.reddit.com/r/Python/</link>
<description>
news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
</description>'''

soup = BeautifulSoup(rss, 'html.parser')  # name the parser explicitly
title_tag = soup.find('title')
description_tag = soup.find('description')
print(title_tag.text)
print(description_tag.text)

"""Output-->
Python

news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
"""


Remy the cook 0 Newbie Poster · Answer 1 · 2013-11-06T16:22:17+00:00

Thank you krystosan.
Umm, I don't know how to get the RSS feed for the blogs of a particular field (like economics, sports, etc.).
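RSS itself is only a feed format and does not offer a search by subject, so one possible approach is to collect feed URLs by hand and then keep only the entries that mention the topic. The sketch below assumes the feed carries `<category>` and `<description>` elements, which not every feed does; the sample feed, the keyword list, and the helper name `items_about` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed mixing two topics.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Why inflation matters</title>
      <category>Economy</category>
      <description>Prices and wages.</description>
    </item>
    <item>
      <title>Cup final report</title>
      <category>Sports</category>
      <description>A late goal decided it.</description>
    </item>
  </channel>
</rss>"""

# Hypothetical keyword list for the "economy" domain.
ECONOMY_KEYWORDS = ("economy", "inflation", "finance")


def items_about(rss_text, keywords):
    """Return titles of items whose title, category or description mentions a keyword."""
    root = ET.fromstring(rss_text)
    matches = []
    for item in root.iter("item"):
        fields = [item.findtext("title", ""), item.findtext("description", "")]
        fields += [c.text or "" for c in item.findall("category")]
        text = " ".join(fields).lower()
        # Plain substring matching; crude, but enough to show the idea.
        if any(word in text for word in keywords):
            matches.append(item.findtext("title"))
    return matches


print(items_about(SAMPLE_RSS, ECONOMY_KEYWORDS))
```

Substring matching like this is crude and will produce false positives, so for an NLP project it is probably only a first filter.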

Remy the cook 0 Newbie Poster · Answer 2 · 2013-11-08T14:34:49+00:00

Thanks snippsat!
It really helped.
Although, when I try to find the content of the blog using
content_tag = soup.find('content:encoded')
it just gives the first paragraph of the website (the first occurrence of "content:encoded").

snippsat 661 Master Poster · Answer 3 · 2013-11-08T14:52:37+00:00

Use find_all().

The find_all() method scans the entire document looking for results, whereas find() stops at the first match.
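The same first-match versus all-matches distinction exists in the standard library's ElementTree, shown here as a dependency-free sketch rather than with BeautifulSoup. The sample feed is made up; the namespace URI is the one commonly declared for the `content` prefix in RSS feeds, and it is an assumption that a given feed uses it.

```python
import xml.etree.ElementTree as ET

# Made-up feed with two entries, each carrying a content:encoded element.
SAMPLE_RSS = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item><content:encoded>First post body</content:encoded></item>
    <item><content:encoded>Second post body</content:encoded></item>
  </channel>
</rss>"""

# Prefix-to-URI map so "content:encoded" can be used in search paths.
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

root = ET.fromstring(SAMPLE_RSS)

first = root.find(".//content:encoded", NS)     # first match only
every = root.findall(".//content:encoded", NS)  # list of all matches

print(first.text)
print([node.text for node in every])
```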

Remy the cook 0 Newbie Poster · Answer 4 · 2013-11-11T23:02:18+00:00

Getting the text only and removing the HTML tags with get_text() generally works, but somehow it doesn't work in the code below. What am I doing wrong?
I have done this:

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
from bs4 import BeautifulSoup

# Find the RSS link advertised on the home page.
page = urlopen("http://www.frugalrules.com")
soup = BeautifulSoup(page)
link = soup.find('link', type='application/rss+xml')
print(link['href'])

# Download the feed and print the text of every content:encoded element.
rss = urlopen(link['href']).read()
souprss = BeautifulSoup(rss)
content_tag = souprss.find_all('content:encoded')
for row in content_tag:
    print(row.get_text())

Remy the cook 0 Newbie Poster · Answer 5 · 2013-11-11T23:46:39+00:00

for node in row.findAll('p'):
    print(''.join(node.findAll(text=True)))

The above code, and defining a list of invalid tags to strip, also do not work (the tags are nested and there are too many invalid tags).
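One likely cause, assuming the feed wraps its post HTML in a CDATA section as many feeds do, is that the markup inside `content:encoded` reaches the parser as a single string rather than as nested tags, so searching it for `p` elements finds nothing. This has not been checked against the feed above. A two-step sketch with the standard library: read the feed as XML, then strip the tags from each entry's HTML separately. The sample feed and the names `TextExtractor`, `strip_tags`, and `post_texts` are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up feed whose post body is HTML wrapped in CDATA.
SAMPLE_RSS = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Saving money</title>
      <content:encoded><![CDATA[<div><p>Spend <b>less</b>.</p> <p>Save <a href="#">more</a>.</p></div>]]></content:encoded>
    </item>
  </channel>
</rss>"""

NS = {"content": "http://purl.org/rss/1.0/modules/content/"}


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_text):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def post_texts(rss_text):
    """Return the plain text of every content:encoded element in the feed."""
    root = ET.fromstring(rss_text)
    nodes = root.findall(".//content:encoded", NS)
    return [strip_tags(node.text or "") for node in nodes]


print(post_texts(SAMPLE_RSS))
```

Because the extractor only collects text nodes, nesting depth and the variety of tags should not matter, and no list of invalid tags is needed.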