So I'm building a web crawler for a pet project I've been working on. I started from tutorial code for the crawler and have been building on it. I've done extensive troubleshooting and haven't had any luck.
The problem:
- Roughly half the websites return content, but all of them return headers.
- Some websites return content for some pages but not for others.
What I've tried:
- Setting the user agent to my browser's user-agent string, to rule out the sites blocking me as a robot.
- Comparing headers from all the sites to look for a pattern: that is, whether all the working sites share a certain field, or all the failing sites do.
- Setting both setConnectTimeout and setReadTimeout to 30 seconds.
- Verifying that the content is available over plain HTTP, i.e. it isn't blocked content.
- Googling every combination of search terms I could think of related to the problem, without finding anything.
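To make the header comparison systematic, I wrote a small throwaway helper (class name and the sample maps below are my own, not real responses) that lists the header names present in one response map but not the other, using the same Map shape that getHeaderFields() returns:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Diff two response-header maps (as returned by getHeaderFields())
// and report the header names present in one but not the other,
// so any pattern between working and failing sites stands out.
public class HeaderDiff {
    public static Set<String> diff(Map<String, List<String>> a,
                                   Map<String, List<String>> b) {
        Set<String> onlyInOne = new TreeSet<>();
        for (String k : a.keySet()) if (!b.containsKey(k)) onlyInOne.add(k);
        for (String k : b.keySet()) if (!a.containsKey(k)) onlyInOne.add(k);
        return onlyInOne;
    }

    public static void main(String[] args) {
        // Made-up sample headers for illustration only.
        Map<String, List<String>> working = Map.of(
            "Content-Type", List.of("text/html; charset=UTF-8"),
            "Server", List.of("Apache"));
        Map<String, List<String>> failing = Map.of(
            "Content-Type", List.of("text/html"),
            "Server", List.of("Apache"),
            "Transfer-Encoding", List.of("chunked"));
        System.out.println(diff(working, failing)); // [Transfer-Encoding]
    }
}
```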
Other Notes:
- The Java code throws no errors, warnings, or exceptions.
- All the sites I've tried return a 200 status code, indicating no problems.
- I don't often work with data streams, so the issue may be there; however, a friend who works with streams regularly reviewed the code and didn't see any errors.
- I've consulted server system administrators (both Linux and Windows) about servers blocking robots, and I've ruled out all of their suggestions as possible causes.
Potential causes that I don't know how to test:
- The pages that work tend to be served by Microsoft servers, with the exception of one Red Hat Linux server.
- The pages that work also tend to belong to major companies such as eBay, Google, and Amazon, which all have very high bandwidth.
There may be a different method I'm not aware of that accomplishes the same goal, but being new to this, I would need an example to make it work.
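For reference, here is a simpler stream-reading approach I've been experimenting with (class and method names are my own): it drains the stream into a ByteArrayOutputStream, which tracks the exact byte count itself, and decodes with ISO-8859-1 (HTTP's historical default) when the server's Content-Type header didn't advertise a charset. The stream would come from conn.getInputStream() in the class below; I don't know if this behaves differently on the failing sites.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Drain an HTTP body into memory and always return a String,
// falling back to ISO-8859-1 when no charset was advertised.
public class StreamReader {
    public static String readAll(InputStream in, String charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            out.write(buf, 0, n); // copies exactly n bytes, no padding
        }
        String cs = (charset != null) ? charset : StandardCharsets.ISO_8859_1.name();
        return out.toString(cs);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for conn.getInputStream(), for demonstration only.
        InputStream demo = new ByteArrayInputStream(
            "<html>hi</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(demo, null)); // <html>hi</html>
    }
}
```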
Below I've included the class, along with a basic snippet that uses it to print the HTML.
This URL doesn't work: http://www.webworldindex.com/countcharacters.htm
This URL does: http://www.velocityreviews.com/forums/t136999-so-what-is-the-max-length-of-a-string.html
/**
 * Get a web file.
 */
public final class WebFile {
    // Saved response.
    private java.util.Map<String,java.util.List<String>> responseHeader = null;
    private java.net.URL responseURL = null;
    private int responseCode = -1;
    private String MIMEtype = null;
    private String charset = null;
    private Object content = null;

    /** Open a web file. */
    public WebFile( String urlString )
        throws java.net.MalformedURLException, java.io.IOException {
        // Open a URL connection.
        final java.net.URL url = new java.net.URL( urlString );
        final java.net.URLConnection uconn = url.openConnection( );
        if ( !(uconn instanceof java.net.HttpURLConnection) )
            throw new java.lang.IllegalArgumentException(
                "URL protocol must be HTTP." );
        final java.net.HttpURLConnection conn =
            (java.net.HttpURLConnection)uconn;

        // Set up a request.
        conn.setConnectTimeout( 100000 );  // 100 sec
        conn.setReadTimeout( 100000 );     // 100 sec
        conn.setInstanceFollowRedirects( true );
        conn.setRequestProperty( "User-Agent",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13" );

        // Send the request.
        conn.connect( );

        // Get the response.
        responseHeader = conn.getHeaderFields( );
        System.out.println( responseHeader );
        responseCode = conn.getResponseCode( );
        responseURL = conn.getURL( );
        final int length = conn.getContentLength( );
        final String type = conn.getContentType( );
        if ( type != null ) {
            final String[] parts = type.split( ";" );
            MIMEtype = parts[0].trim( );
            for ( int i = 1; i < parts.length && charset == null; i++ ) {
                final String t = parts[i].trim( );
                final int index = t.toLowerCase( ).indexOf( "charset=" );
                if ( index != -1 )
                    charset = t.substring( index + 8 );
            }
        }

        // Get the content.
        final java.io.InputStream stream = conn.getErrorStream( );
        if ( stream != null ) {
            content = readStream( length, stream );
        } else if ( (content = conn.getContent( )) != null
                    && content instanceof java.io.InputStream ) {
            content = readStream( length, (java.io.InputStream)content );
            conn.disconnect( );
        }
    }

    /** Read stream bytes and transcode. */
    private Object readStream( int length, java.io.InputStream stream )
        throws java.io.IOException {
        final int buflen = Math.max( 1024, Math.max( length, stream.available() ) );
        byte[] buf = new byte[buflen];
        byte[] bytes = null;
        for ( int nRead = stream.read(buf); nRead != -1; nRead = stream.read(buf) ) {
            if ( bytes == null ) {
                bytes = buf;
                buf = new byte[buflen];
                continue;
            }
            final byte[] newBytes = new byte[ bytes.length + nRead ];
            System.arraycopy( bytes, 0, newBytes, 0, bytes.length );
            System.arraycopy( buf, 0, newBytes, bytes.length, nRead );
            bytes = newBytes;
        }
        if ( charset == null )
            return bytes;
        try {
            return new String( bytes, charset );
        }
        catch ( java.io.UnsupportedEncodingException e ) { }
        return bytes;
    }

    /** Get the content. */
    public Object getContent( ) {
        return content;
    }

    /** Get the response code. */
    public int getResponseCode( ) {
        return responseCode;
    }

    /** Get the response header. */
    public java.util.Map<String,java.util.List<String>> getHeaderFields( ) {
        return responseHeader;
    }

    /** Get the URL of the received page. */
    public java.net.URL getURL( ) {
        return responseURL;
    }

    /** Get the MIME type. */
    public String getMIMEType( ) {
        return MIMEtype;
    }
}
WebFile file = new WebFile( "http://example.com" );
Object content = file.getContent( );
if ( content instanceof String ) {
    String html = (String)content;
    System.out.println( html );
}
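While debugging, I also pulled the charset-parsing step out of the constructor so I could exercise it on sample Content-Type values (class name is mine; the logic is copied verbatim from WebFile). One thing I noticed: when it yields null, readStream returns a raw byte[] rather than a String, and the printing snippet above only prints String content.

```java
// The charset-parsing logic from the WebFile constructor, extracted
// so it can be tested in isolation. A Content-Type header without a
// charset parameter yields null, which makes readStream hand back a
// byte[] instead of a String.
public class CharsetParse {
    public static String extract(String type) {
        if (type == null) return null;
        String charset = null;
        String[] parts = type.split(";");
        for (int i = 1; i < parts.length && charset == null; i++) {
            String t = parts[i].trim();
            int index = t.toLowerCase().indexOf("charset=");
            if (index != -1)
                charset = t.substring(index + 8);
        }
        return charset;
    }

    public static void main(String[] args) {
        System.out.println(extract("text/html; charset=UTF-8")); // UTF-8
        System.out.println(extract("text/html"));                // null
    }
}
```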