I made a scraper for a web site, but I'm having problems running my code...

#!/usr/bin/env python

from bs4 import BeautifulSoup
import urllib2
import re

# Get the links...

html = urllib2.urlopen('http://www.blah.fi/asdf.html').read()

links = re.findall(r'''<a\s+.*?href=['"](.*?)['"].*?(?:</a|/)>''', html, re.I)

links_range = links[6:len(links)]


# Scrape and append the output...
f = open("test.html", "a")

for link in links_range:
    html = urllib2.urlopen('http://www.blah.fi/' + link).read()
    soup = BeautifulSoup(open(html))
    content = soup.find(id="content") 
    f.write(content.encode('utf-8') + '<hr>')


f.close()

Here is the error...

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
IOError: [Errno 36] File name too long: '\xef\xbb\xbf<!DOCTYPE html PUBLIC "...

If I remove the 'for' loop and run a single instance of a page, it runs correctly.
What does the error mean?

The error message doesn't make any sense to me, as it only references line 3:

Traceback (most recent call last):
File "<stdin>", line 3, in <module>
IOError: [Errno 36] File name too long: '\xef\xbb\xbf<!DOCTYPE html PUBLIC "...

which is

from bs4 import BeautifulSoup

Is BeautifulSoup installed correctly?

I installed it via the Ubuntu package manager. I can get other output from it, for example...

f = open("test.html", "a")
html = urllib2.urlopen('http://www.blah.fi/asdf.html').read()
soup = BeautifulSoup(open(html))
content = soup.find(id="content")
f.write(content.encode('utf-8') + '<hr>')
f.close()

I'm not really sure how to test a python library for successful installation though.

The following works for me. Perhaps there is no id="content" on the site you're scraping.

import urllib2
from bs4 import BeautifulSoup

f = open("test.html", "a")
html = urllib2.urlopen('http://www.google.com').read()
soup = BeautifulSoup(html)
content = soup.find(id="csi")
print "content", content
f.write(content.encode('utf-8') + '<hr>')
f.close()

You are doing some strange stuff.
Does urllib2 even work for this site?

import urllib2

url = "http://www.blah.fi/"
read_url = urllib2.urlopen(url).read()
print read_url  # gives a 403 error

This site is blocking urllib2.
I had to use Requests to get the source code.
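
If you'd rather stay with urllib2, the 403 usually comes from the default Python-urllib User-Agent string, so sending a browser-like header often gets past it. A minimal sketch (the header value is just an example, and this isn't guaranteed for every site):

import urllib2

url = "http://www.blah.fi/"
# Many servers reject the default "Python-urllib/x.y" User-Agent with a 403,
# so present a browser-like one instead.
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()
print html[:200]  # first 200 characters of the page source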

You can use BeautifulSoup to get all the links; there is no need for a regex.

import requests
from bs4 import BeautifulSoup

url = "http://www.blah.fi/"

url_read = requests.get(url)
soup = BeautifulSoup(url_read.content)
links = soup.find_all('a', href=True)
for link in links:
    print link['href']

Is urllib2.urlopen('http://www.blah.fi/' + link).read() really what you want?

If link is itself an absolute URL, it will give you output like this:

>>> 'http://www.blah.fi/' + 'http://v-reality.info/' + '<hr>'
'http://www.blah.fi/http://v-reality.info/<hr>'
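
If the hrefs can be either relative or absolute, urlparse.urljoin handles both cases, instead of blind string concatenation. A small sketch:

from urlparse import urljoin

base = "http://www.blah.fi/"
# Relative links are resolved against the base...
print urljoin(base, "asdf.html")               # http://www.blah.fi/asdf.html
# ...while absolute links are left alone.
print urljoin(base, "http://v-reality.info/")  # http://v-reality.info/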

soup = BeautifulSoup(open(html)) is not the way to do it; the normal way is:

url = urllib2.urlopen("http://www.blah.fi/")
soup = BeautifulSoup(url)
tags = soup.find_all('a')  # or whatever you are after

IOError: [Errno 36] File name too long:
The error itself is pretty clear: a filename cannot be longer than 255 characters, and open(html) hands the entire page source to open() as if it were a filename. The puzzling "line 3" most likely comes from pasting the code into the interactive interpreter, where the traceback counts lines within the pasted statement, so line 3 would be the soup = BeautifulSoup(open(html)) line of the for loop.
Look at the output from content = soup.find(id="content"):

print content
print type(content)
print repr(content)
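
The whole failure can be reproduced in a few lines: open() expects a filename, and the page source is being handed to it instead. A sketch against the original URL:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.blah.fi/asdf.html').read()
# html now holds the page source, far longer than the 255-character
# filename limit, so open(html) raises IOError: File name too long.
soup = BeautifulSoup(html)           # correct: parse the markup directly
content = soup.find(id="content")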

I'm not really sure how to test a python library for successful installation though.

>>> import bs4
>>> bs4.__version__
'4.1.0'
>>> from bs4 import BeautifulSoup
>>> print BeautifulSoup.__doc__
# You get the class description if it is installed correctly

Is urllib2.urlopen('http://www.blah.fi/' + link).read() really what you want?

Lol, no. That's just a fake site. I didn't want to mention the real site I'm scraping ;)

You were totally right about this part...

url = urllib2.urlopen("http://www.blah.fi/")
soup = BeautifulSoup(url)
tags = soup.find_all('a')  # or whatever you are after

I've been piecing together bits from various tutorials, and somehow it looked like it was working when I tried small snippets out.
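
Putting the fixes together, here is roughly what the loop should have been. This is just a sketch: it assumes the hrefs on the index page can be relative paths and that the target pages really have an id="content" element (and if the site 403s urllib2, Requests would go in its place):

import urllib2
from urlparse import urljoin
from bs4 import BeautifulSoup

base = 'http://www.blah.fi/'
index = urllib2.urlopen(base + 'asdf.html').read()
# Collect the hrefs with BeautifulSoup instead of a regex.
links = [a['href'] for a in BeautifulSoup(index).find_all('a', href=True)]

f = open("test.html", "a")
for link in links[6:]:                        # skip the first six links, as before
    page = urllib2.urlopen(urljoin(base, link)).read()
    soup = BeautifulSoup(page)                # parse the markup, don't open() it
    content = soup.find(id="content")
    if content is not None:                   # some pages may lack id="content"
        f.write(content.encode('utf-8') + '<hr>')
f.close()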

Thanks for your help.

If the poster is still working on this, some five years later, I'd like to suggest his problems probably run deeper than any application can help with...
