PDF Download Issue, Stumped

Question

blur0224 0 Light Poster

16 Years Ago

I'm trying to develop a simple program that automatically downloads PDF files from a web server and organizes them into files. When I download any .pdf file with it, the result is roughly 30% bigger than the same file downloaded with a browser. The PDF opens, but does not display.

So far I've narrowed the source of the additional bytes down to this line:

java.io.BufferedInputStream in = new java.io.BufferedInputStream(new java.net.URL("http://www.anywhere.com/test.pdf").openStream() );

The issue is not in the writing of the file because I can open a FileInputStream from a local copy of the pdf then output it and have it open fine. So somewhere between the web server and the BufferedInputStream I'm getting additional bytes added in.

I've tried using the URI decoding function but it appears it's only for the URL, not for content.

I've tried writing the file using the char data type instead of byte, but this didn't change anything.

I've verified that HTML pages downloaded and saved to a .txt file match exactly what's seen when the page source is viewed (i.e. nothing is added to HTML files).

I know it's possible because web crawlers such as Nutch, written in Java, are able to crawl and index PDFs without changing them.

Any help would be greatly appreciated.

Thanks!

// This code runs, but the PDF it writes contains additional bytes and does not display.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;

public class Main
{
	public static void main(String[] args) throws IOException
	{
		// Open the remote PDF and the local output file.
		BufferedInputStream in = new BufferedInputStream(new URL("http://www.anywhere.com/test.pdf").openStream());
		FileOutputStream fos = new FileOutputStream("test.pdf");
		BufferedOutputStream bout = new BufferedOutputStream(fos);
		byte data[] = new byte[1024];

		// Copy in 1024-byte blocks until the end of the stream.
		while(in.read(data,0,1024)>=0){
			bout.write(data);
		}

		bout.close();
		in.close();

	}
}

java open-source pdf web-browser web-server

All 2 Replies

JamesCherrill 4,733 Most Valuable Poster

16 Years Ago

If the last block you read is <1024 bytes, you still write 1024 bytes to the output stream.
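One way to address this is to keep the value returned by read() and write only that many bytes. The following is a minimal sketch, not the poster's code: the class name StreamCopy and the helper copy() are hypothetical, and an in-memory byte array stands in for the download so the example runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper class, named for illustration only.
class StreamCopy {

    // Copies everything from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] data = new byte[1024];
        long total = 0;
        int n;
        // read() returns how many bytes it actually stored in data, or -1 at end of stream.
        while ((n = in.read(data, 0, data.length)) != -1) {
            out.write(data, 0, n); // write only the n bytes just read
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the download: 2500 is not a multiple of 1024.
        byte[] source = new byte[2500];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(source), sink);
        System.out.println(copied + " bytes copied, sink holds " + sink.size());
    }
}
```

Run as written, this prints "2500 bytes copied, sink holds 2500". In the original program the same loop body would sit between the BufferedInputStream and the BufferedOutputStream, with bout.write(data, 0, n) in place of bout.write(data).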

blur0224 0 Light Poster · Answer 1 · 2009-05-20T08:27:35+00:00

The difference between the file lengths far exceeds 1024 bytes, and I've already ruled out the output stream as the culprit: I tested it by running a local copy of a PDF through the program, and it came out unaltered.

In a textual comparison of the local copy with the downloaded copy, I found additional 10-digit numbers in a list towards the beginning of the document, so somewhere between the web server and the Java program the data changes. I just don't know where.
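A possible explanation for why the local-file test passes while the download grows by far more than 1024 bytes: InputStream.read(byte[], int, int) is allowed to return fewer bytes than requested on any call, not only the last one, and streams fed from a network commonly do. Every short read then costs a full 1024-byte write, with the tail of the buffer holding leftovers from an earlier read. The sketch below simulates this under stated assumptions: ChunkedStream is a hypothetical stand-in for a network stream that caps each read at 800 bytes, and a 10,000-byte array stands in for the PDF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class, named for illustration only.
class ShortReadDemo {

    // Stand-in for a network stream: never returns more than maxChunk bytes per read.
    static class ChunkedStream extends FilterInputStream {
        private final int maxChunk;

        ChunkedStream(InputStream in, int maxChunk) {
            super(in);
            this.maxChunk = maxChunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, maxChunk));
        }
    }

    // The loop from the question: writes the whole buffer however much was read.
    static int copyWholeBuffer(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] data = new byte[1024];
        while (in.read(data, 0, 1024) >= 0) {
            out.write(data);
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = new byte[10_000]; // stand-in for the real file
        int full = copyWholeBuffer(new ByteArrayInputStream(pdf));
        int chunked = copyWholeBuffer(new ChunkedStream(new ByteArrayInputStream(pdf), 800));
        System.out.println("full reads:  " + full);
        System.out.println("short reads: " + chunked);
    }
}
```

With full reads the output is 10240 bytes, only 240 more than the source, which is easy to miss in a local test. With reads capped at 800 bytes the output is 13312 bytes, about 33% larger than the source. The real chunk sizes on a network connection vary, so the 800-byte cap is only illustrative.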
