I'm building a web crawler for a pet project. I started from tutorial code for the crawler and am building on it. I've done extensive troubleshooting and haven't had any luck.

The problem:

  • Roughly half the websites return content, but all of them return headers.
  • Some websites return content for some pages but not for others.

What I've tried:

  • Setting the user agent to my browser's user agent to rule out the robots.txt file as the cause.
  • Comparing headers from all the sites to look for a pattern, e.g. a field that every working site has or that every failing site has.
  • Setting the timeouts for setReadTimeout and setConnectTimeout to 30 seconds.
  • Verifying that the content is available over HTTP, i.e. that it isn't blocked content.
  • Googling every combination of search terms I could think of related to the problem, without finding anything.

Other Notes:

  • The Java code returns no errors, warnings, or exceptions.
  • All the sites I've tried return a 200 status in the header, indicating no problems.
  • I don't often work with data streams, so the issue may be there. However, I did consult a friend who is familiar with them and he didn't see any errors.
  • I've consulted a system administrator who works with both Linux and Windows servers about servers blocking robots, and I've eliminated all of his suggestions as possible causes.

Potential causes that I don't know how to test:

  • The web pages that work tend to be on Microsoft servers, with the exception of one Red Hat Linux server.
  • The web pages that work also tend to belong to major companies such as eBay, Google, and Amazon, which all have very high bandwidth.

There may be a different method that I'm not aware of to accomplish the same goal, but being new to this, I would need an example to make it work.

Below I've included the class and a basic snippet that uses the class to print the HTML.

This URL doesn't work: http://www.webworldindex.com/countcharacters.htm

This URL does: http://www.velocityreviews.com/forums/t136999-so-what-is-the-max-length-of-a-string.html

/**
	 * Get a web file.
	 */
public final class WebFile {
	    // Saved response.
	    private java.util.Map<String,java.util.List<String>> responseHeader = null;
	    private java.net.URL responseURL = null;
	    private int responseCode = -1;
	    private String MIMEtype  = null;
	    private String charset   = null;
	    private Object content   = null;
	    
	    
	 
	    /** Open a web file. */
	    public WebFile( String urlString )
	        throws java.net.MalformedURLException, java.io.IOException {
	        // Open a URL connection.
	        final java.net.URL url = new java.net.URL( urlString );
	        final java.net.URLConnection uconn = url.openConnection( );
	        
	        if ( !(uconn instanceof java.net.HttpURLConnection) )
	            throw new java.lang.IllegalArgumentException(
	                "URL protocol must be HTTP." );
	        final java.net.HttpURLConnection conn =
	            (java.net.HttpURLConnection)uconn;
	 
	        // Set up a request.
	        conn.setConnectTimeout( 100000 );    // 100 sec (value is in milliseconds)
	        conn.setReadTimeout( 100000 );       // 100 sec (value is in milliseconds)
	        conn.setInstanceFollowRedirects( true );
	        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13");
	 
	        // Send the request.
	        conn.connect( );
	 
	        // Get the response.
	        responseHeader    = conn.getHeaderFields( );
	        System.out.println(responseHeader);
	        responseCode      = conn.getResponseCode( );

	        responseURL       = conn.getURL( );
	        final int length  = conn.getContentLength( );
	        
	        final String type = conn.getContentType( );
	        if ( type != null ) {
	            final String[] parts = type.split( ";" );
	            MIMEtype = parts[0].trim( );
	            for ( int i = 1; i < parts.length && charset == null; i++ ) {
	                final String t  = parts[i].trim( );
	                final int index = t.toLowerCase( ).indexOf( "charset=" );
	                if ( index != -1 )
	                    charset = t.substring( index+8 );
	            }
	        }
	 
	        // Get the content.
	        final java.io.InputStream stream = conn.getErrorStream( );
	        if ( stream != null ){
	            content = readStream( length, stream );
	        }else if ( (content = conn.getContent( )) != null &&  content instanceof java.io.InputStream ){
	            content = readStream( length, (java.io.InputStream)content );
	            conn.disconnect( );
	        }
	    }
	 
	    /** Read stream bytes and transcode. */
	    private Object readStream( int length, java.io.InputStream stream )
	        throws java.io.IOException {
	        final int buflen = Math.max( 1024, Math.max( length, stream.available() ) );
	        byte[] buf   = new byte[buflen];
	        byte[] bytes = null;
	 
	        for ( int nRead = stream.read(buf); nRead != -1; nRead = stream.read(buf) ) {
	            if ( bytes == null ) {
	                bytes = buf;
	                buf   = new byte[buflen];
	                continue;
	            }
	            final byte[] newBytes = new byte[ bytes.length + nRead ];
	            System.arraycopy( bytes, 0, newBytes, 0, bytes.length );
	            System.arraycopy( buf, 0, newBytes, bytes.length, nRead );
	            bytes = newBytes;
	        }
	 
	        if ( charset == null )
	            return bytes;
	        try {
	            return new String( bytes, charset );
	        }
	        catch ( java.io.UnsupportedEncodingException e ) { }
	        return bytes;
	    }
	 
	    /** Get the content. */
	    public Object getContent( ) {
	        return content;
	    }
	 
	    /** Get the response code. */
	    public int getResponseCode( ) {
	        return responseCode;
	    }
	 
	    /** Get the response header. */
	    public java.util.Map<String,java.util.List<String>> getHeaderFields( ) {
	        return responseHeader;
	    }
	 
	    /** Get the URL of the received page. */
	    public java.net.URL getURL( ) {
	        return responseURL;
	    }
	 
	    /** Get the MIME type. */
	    public String getMIMEType( ) {
	        return MIMEtype;
	    }
	}

WebFile file   = new WebFile( "http://example.com" );

Object content = file.getContent( );
if (content instanceof String )
{
    String html = (String)content;
    System.out.println(html);
}

Hey all,

I've not had any luck with this problem yet. Even if you don't know what's going wrong, any ideas of what else I could check would be great.

Thanks!

Is the content length being returned as zero?
Does the if on line 61 (the else-if that calls conn.getContent()) evaluate to false?
Is stream.available() on line 70 (at the top of readStream) zero?

Just a gut feel - but the whole loop starting at line 74 (the for loop in readStream) seems insanely complex - why not just create buf of the right length and read the whole stream into it directly?
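As a rough illustration of the simpler approach suggested above, here is a minimal sketch that reads the stream into a ByteArrayOutputStream instead of a fixed-size array, since the content length can be reported as -1 when the server does not send one. The class name StreamUtil and the methods readAll and decode are hypothetical and not part of the original WebFile class, and falling back to UTF-8 when no charset is given is an assumption made for this example. Note that the original loop appears to keep the entire first buffer even when the first read returns fewer bytes than the buffer holds, which this version avoids by copying only the bytes actually read.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the original WebFile class. */
final class StreamUtil {
    private StreamUtil() { }

    /** Read every byte from the stream until end-of-stream is reached. */
    static byte[] readAll(InputStream stream) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buf = new byte[8192];
        int nRead;
        while ((nRead = stream.read(buf)) != -1) {
            out.write(buf, 0, nRead);   // copy only the bytes actually read
        }
        return out.toByteArray();
    }

    /**
     * Decode bytes to a String. Falls back to UTF-8 (an assumption for this
     * sketch) when the charset name is missing, malformed, or unsupported.
     */
    static String decode(byte[] bytes, String charsetName) {
        Charset cs = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                cs = Charset.forName(charsetName.trim());
            } catch (IllegalArgumentException e) {
                // Unknown or illegal charset name: keep the UTF-8 fallback.
            }
        }
        return new String(bytes, cs);
    }
}
```

With something like this, readStream could reduce to decode(readAll(stream), charset). Because decode always returns a String, the `content instanceof String` check in the usage snippet would also pass for pages whose Content-Type header carries no charset, which the original code returns as a byte array.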

Unfortunately I don't have time to test this until later tonight. I really appreciate the ideas and will work through them and let you know how it goes. I've also read that it may be a port issue, but I would think the problem would be consistent one way or another if that were the case.

Success! The problem was on line 58. For some reason the tutorial used getErrorStream() instead of getInputStream(), so when there were no errors, it wouldn't return content. getErrorStream() returns null unless the server responded with an error, while getInputStream() returns the body of a successful response.

Thanks for the reply!
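For anyone who lands here with the same symptom, the fix described above might be sketched like this. The class name ResponseBody and its methods are hypothetical, and treating any status of 400 or above as an error is an assumption of this sketch. It relies on two documented behaviours of HttpURLConnection: getErrorStream() returns null when the response is not an error, and getInputStream() throws an IOException when the server returns an error status.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;

/** Hypothetical helper for illustration; not part of the original WebFile class. */
final class ResponseBody {
    private ResponseBody() { }

    /** True when the status code indicates an HTTP error (4xx or 5xx). */
    static boolean isError(int responseCode) {
        return responseCode >= HttpURLConnection.HTTP_BAD_REQUEST;   // 400
    }

    /**
     * Return the stream that carries the response body: the error stream for
     * error responses, the input stream otherwise. The result may be null if
     * the server sent an error response without a body.
     */
    static InputStream open(HttpURLConnection conn) throws IOException {
        if (isError(conn.getResponseCode())) {
            return conn.getErrorStream();
        }
        return conn.getInputStream();
    }
}
```

In the WebFile constructor, the "Get the content" section could then call ResponseBody.open(conn), check the result for null, and pass it to readStream.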
