I am having some problems merging a large number of text files. I am attempting to merge upwards of 2000 text files that can range anywhere from 1 megabyte to multiple gigabytes each, so the implementation must read each file in buffered blocks rather than loading an entire file into memory.
My current approach works as follows (a rough sketch of this loop in code follows the list):
- Delete the output file if it exists.
- Copy the first file to be merged to the output file.
- Open a TextWriter on the output file.
- For each remaining file to be merged:
  - Get the next input file's location.
  - Call the function that does the merge (shown below).
  - Mark the file as successfully written.
  - Print to the status box that the file was merged.
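To make the overall flow concrete, here is a rough sketch of that driver loop. The helper names (GetSegmentFiles, MarkAsMerged) and the status-box call are placeholders rather than my exact code:

// Rough sketch of the driver loop; GetSegmentFiles, MarkAsMerged, and statusBox
// are hypothetical placeholders for my actual code.
// Requires: using System; using System.IO; using System.Collections.Generic;
string outFile = @"C:\merge\output.txt";        // hypothetical output path
List<string> segmentFiles = GetSegmentFiles();  // hypothetical: the 2000+ input file paths

if (File.Exists(outFile))
{
    File.Delete(outFile);                       // delete the output file if it exists
}
File.Copy(segmentFiles[0], outFile);            // copy the first segment as the initial output

using (TextWriter writer = new StreamWriter(outFile, true)) // opened once, in append mode
{
    for (int i = 1; i < segmentFiles.Count; i++)
    {
        concatenateFiles(segmentFiles[i], writer);  // merge routine shown below
        MarkAsMerged(segmentFiles[i]);              // hypothetical bookkeeping step
        statusBox.AppendText("Merged " + segmentFiles[i] + Environment.NewLine); // status update
    }
}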
The code for actually doing the merge is as follows:
/// <summary>
/// Concatenates the next segment file with the merged file.
/// </summary>
/// <param name="inFile">The in file.</param>
/// <param name="writer">The output file writer.</param>
private void concatenateFiles(string inFile, TextWriter writer)
{
    FileInfo fi = new FileInfo(inFile);   // Used to determine the file size.
    int bufferSize = 1024 * 1024 * 10;    // Default to 10 megabytes.
    int bytesRead = 0;                    // Number of characters read by the last ReadBlock call.
    if (fi.Length < bufferSize)
    {
        bufferSize = (int)fi.Length;      // If the file is smaller than 10 MB, shrink the buffer to the file size.
    }
    char[] buffer = new char[bufferSize]; // Character buffer of size bufferSize.
    StreamReader reader = new StreamReader(inFile); // Open the reading stream.
    while ((bytesRead = reader.ReadBlock(buffer, 0, bufferSize)) != 0)
    {
        writer.Write(buffer, 0, bytesRead); // Write the characters that were read to the output file.
    }
    reader.Close();   // Close the reading stream.
    reader.Dispose(); // Clean up the reading stream.
}
The implementation works; my problem is how long it takes. Multiple threads write the individual fragments from various other machines across the network, and once they are complete the fragments are merged using the process above. The merge takes approximately 2 seconds per file regardless of size: 20 ten-megabyte files take about 38 seconds to merge (the initial file copy finishes in well under 1 second, so the remaining 19 files each take roughly 2 seconds). However, 20 hundred-megabyte files also take 38 seconds, and even 20 files of 500 MB each take just over 40 seconds.
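In case it matters, the per-file overhead can be confirmed by timing each merge call with a Stopwatch, roughly like this (the loop and names are illustrative, not my exact measurement code):

// Illustrative timing check (not my actual code): times each merge call individually.
// Requires: using System; using System.Diagnostics;
Stopwatch sw = new Stopwatch();
for (int i = 1; i < segmentFiles.Count; i++)
{
    sw.Restart();                               // reset and start the timer for this file
    concatenateFiles(segmentFiles[i], writer);  // the merge routine shown above
    sw.Stop();
    Console.WriteLine("{0} merged in {1} ms", segmentFiles[i], sw.ElapsedMilliseconds);
}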
I wondered whether the problem was that I kept re-opening the output stream, so I changed the code to open it once and pass it into the function, never closing it until all the writing is finished; the only streams created repeatedly are now the reading streams.
I have also tried varying the buffer size, and it seems to make little difference: anywhere from 1 KB to 200 MB at a time produces roughly the same speed.
Any idea why this implementation is so slow? Any suggestions for improving the speed?
Thanks in advance for any help.