I'm trying to develop a simple program that automatically downloads PDF files from a web server and organizes them into folders. When I download any .pdf file, it is roughly 30% larger than when I download it with a browser, and the resulting file opens in a PDF reader but does not display.
So far I've narrowed the source of the extra bytes down to this line:
java.io.BufferedInputStream in = new java.io.BufferedInputStream(new java.net.URL("http://www.anywhere.com/test.pdf").openStream());
The issue is not in the writing of the file: if I open a FileInputStream on a local copy of the PDF and write it back out, the result opens fine. So somewhere between the web server and the BufferedInputStream, extra bytes are being added.
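For completeness, the local round-trip test I ran is essentially the following sketch (the file names local.pdf and copy.pdf are just placeholders):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class LocalCopyTest
    {
        public static void main(String[] args) throws IOException
        {
            FileInputStream in = new FileInputStream("local.pdf");   // local copy saved by a browser
            FileOutputStream out = new FileOutputStream("copy.pdf"); // round-tripped output
            byte[] data = new byte[1024];
            int count;
            // Copy only the bytes actually returned by each read
            while ((count = in.read(data, 0, 1024)) >= 0) {
                out.write(data, 0, count);
            }
            out.close();
            in.close();
        }
    }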
I've tried using the URI decoding function, but it appears to apply only to the URL itself, not to the downloaded content.
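(By the URI decoding function I mean something along the lines of java.net.URLDecoder; as far as I can tell it only rewrites percent-escapes in the URL string, for example:)

    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;

    public class DecodeDemo
    {
        public static void main(String[] args) throws UnsupportedEncodingException
        {
            // Decodes escapes like %20 in the URL string; it never touches
            // the bytes of the file downloaded from that URL.
            String decoded = URLDecoder.decode("http://www.anywhere.com/test%20file.pdf", "UTF-8");
            System.out.println(decoded); // prints http://www.anywhere.com/test file.pdf
        }
    }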
I've also tried writing the file using the char data type instead of byte, but that made no difference.
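That attempt looked roughly like this (reconstructed from memory, so the exact types are approximate; I swapped the byte streams for a Reader/Writer pair):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Reader;
    import java.io.Writer;
    import java.net.URL;

    public class CharCopy
    {
        public static void main(String[] args) throws IOException
        {
            Reader in = new InputStreamReader(
                    new URL("http://www.anywhere.com/test.pdf").openStream());
            Writer out = new OutputStreamWriter(new FileOutputStream("test.pdf"));
            char[] data = new char[1024];
            int count;
            // read() returns the number of chars actually read, or -1 at end of stream
            while ((count = in.read(data, 0, 1024)) >= 0) {
                out.write(data, 0, count);
            }
            out.close();
            in.close();
        }
    }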
I've verified that HTML pages downloaded and saved to a .txt file match exactly what I see when I view the page source (i.e., nothing is added to HTML content).
I know it's possible, because web crawlers written in Java, such as Nutch, can crawl and index PDFs without altering them.
Any help would be greatly appreciated.
Thanks!
// This code runs, but it adds extra bytes to the output PDF, which then fails to display
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;

public class Main
{
    public static void main(String[] args) throws IOException
    {
        BufferedInputStream in = new BufferedInputStream(
                new URL("http://www.anywhere.com/test.pdf").openStream());
        FileOutputStream fos = new FileOutputStream("test.pdf");
        BufferedOutputStream bout = new BufferedOutputStream(fos);
        byte[] data = new byte[1024];
        // Read until the stream signals end-of-file (read returns -1)
        while (in.read(data, 0, 1024) >= 0) {
            bout.write(data); // write the buffer to the output stream
        }
        bout.close();
        in.close();
    }
}