I am having some problems merging a large number of text files. I am attempting to merge upwards of 2000 text files that can range anywhere from 1 megabyte to multiple gigabytes each, so the implementation must read each file in buffered blocks rather than loading an entire file into memory.
My current approach works as follows (a rough sketch of this loop in code follows the list):
- Delete the output file if it exists.
- Copy the first file to be merged to the output file.
- Open a TextWriter on the output file.
- For each remaining file to be merged:
  - Get the next input file's location.
  - Call the function that does the merge (shown below).
  - Mark the file as successfully written.
  - Print to the status box that the file was merged.
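To make the overall flow concrete, here is a rough sketch of that driver loop. The helper names (GetSegmentFiles, MarkAsMerged) and the status-box call are placeholders rather than my exact code:

// Rough sketch of the driver loop; GetSegmentFiles, MarkAsMerged, and statusBox
// are hypothetical placeholders for my actual code.
// Requires: using System; using System.IO; using System.Collections.Generic;
string outFile = @"C:\merge\output.txt";        // hypothetical output path
List<string> segmentFiles = GetSegmentFiles();  // hypothetical: the 2000+ input file paths

if (File.Exists(outFile))
{
    File.Delete(outFile);                       // delete the output file if it exists
}
File.Copy(segmentFiles[0], outFile);            // copy the first segment as the initial output

using (TextWriter writer = new StreamWriter(outFile, true)) // opened once, in append mode
{
    for (int i = 1; i < segmentFiles.Count; i++)
    {
        concatenateFiles(segmentFiles[i], writer);  // merge routine shown below
        MarkAsMerged(segmentFiles[i]);              // hypothetical bookkeeping step
        statusBox.AppendText("Merged " + segmentFiles[i] + Environment.NewLine); // status update
    }
}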
The code for actually doing the merge is as follows:
/// <summary>
/// Concatenates the next segment file with the merged file.
/// </summary>
/// <param name="inFile">The in file.</param>
/// <param name="writer">The output file writer.</param>
private void concatenateFiles(string inFile, TextWriter writer)
{
    FileInfo fi = new FileInfo(inFile);   // Used to determine the file size.
    int bufferSize = 1024 * 1024 * 10;    // Default to 10 megabytes.
    int bytesRead = 0;                    // Number of characters read by the last ReadBlock call.
    if (fi.Length < bufferSize)
    {
        bufferSize = (int)fi.Length;      // If the file is smaller than 10 MB, shrink the buffer to the file size.
    }
    char[] buffer = new char[bufferSize]; // Character buffer of size bufferSize.
    StreamReader reader = new StreamReader(inFile); // Open the reading stream.
    while ((bytesRead = reader.ReadBlock(buffer, 0, bufferSize)) != 0)
    {
        writer.Write(buffer, 0, bytesRead); // Write the characters that were read to the output file.
    }
    reader.Close();   // Close the reading stream.
    reader.Dispose(); // Clean up the reading stream.
}
The implementation works; my problem is how long it takes. Multiple threads write the individual fragments from various other machines across the network, and once they are complete the fragments are merged using the process above. The merge takes approximately 2 seconds per file regardless of size: 20 ten-megabyte files take about 38 seconds to merge (the initial file copy finishes in well under 1 second, so the remaining 19 files each take roughly 2 seconds). However, 20 hundred-megabyte files also take 38 seconds, and even 20 files of 500 MB each take just over 40 seconds.
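In case it matters, the per-file overhead can be confirmed by timing each merge call with a Stopwatch, roughly like this (the loop and names are illustrative, not my exact measurement code):

// Illustrative timing check (not my actual code): times each merge call individually.
// Requires: using System; using System.Diagnostics;
Stopwatch sw = new Stopwatch();
for (int i = 1; i < segmentFiles.Count; i++)
{
    sw.Restart();                               // reset and start the timer for this file
    concatenateFiles(segmentFiles[i], writer);  // the merge routine shown above
    sw.Stop();
    Console.WriteLine("{0} merged in {1} ms", segmentFiles[i], sw.ElapsedMilliseconds);
}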
I wondered whether the problem was that I kept re-opening the output stream, so I changed the code to open it once and pass it into the function, never closing it until all the writing is finished; the only streams created repeatedly are now the reading streams.
I have also tried varying the buffer size, and it seems to make little difference: anywhere from 1 KB to 200 MB at a time produces roughly the same speed.
Any idea why this implementation is so slow? Any suggestions for improving the speed?
Thanks in advance for any help.