Hi. I'm having some difficulty with a project that involves working with very large binary files. These are PCL files, where a byte with the decimal value 12 represents a Form Feed, but only if it's not embedded within a string of binary data.
In other words, I'm looking for a byte with the value 12, then looking at the next few bytes to make sure it's really a Form Feed. I want to record the byte position of each "real" Form Feed for further processing down the line.
Here's what I have. It works, but it's extremely slow. The interesting part is in the "while" loop:
static void Main(string[] args)
{
    // The variables used here are declared elsewhere in the class (not shown).
    // Need to initialize the header and the position of the first page.
    filename = @"C:\Statements-05-03-05.pcl";
    infile = new FileStream(filename, FileMode.Open, FileAccess.Read);
    test = new byte[1024];
    infile.Read(test, 0, test.Length);
    asciiString = ascii.GetString(test, 0, test.Length);
    header = asciiString.Substring(0, asciiString.IndexOf("*b0M") + 4);
    page_positions.Add(header.Length);
    counter = 1024;
    // Stop once every byte has been consumed ("<=" would attempt one read past the end).
    while (counter < infile.Length)
    {
        pcl_char = infile.ReadByte();
        counter++;
        if (pcl_char == 12)
        {
            // Candidate Form Feed: compare the next 14 bytes with the page-start marker.
            test = new byte[14];
            curr_pos = infile.Position; // position of the byte just after the 12
            infile.Read(test, 0, test.Length);
            counter = counter + 14;
            asciiString = ascii.GetString(test, 0, test.Length);
            if (asciiString == bgn_of_page)
            {
                page_positions.Add(curr_pos);
            } // if (asciiString == bgn_of_page)
        } // if (pcl_char == 12)
    } // while (counter < infile.Length)
    infile.Close();
}
This is slow because I'm reading a single byte at a time. I check whether the byte is 12; if so, I read the next 14 bytes, convert them to a string, and compare it to a target string. If I get a match, I record the byte position (the stream position just after the 12) in an ArrayList.
I would really like to speed this up, for example through buffering. If I use a StreamReader and its "Read()" method, with char[] instead of byte[], I can loop through a file in SECONDS rather than MINUTES.
However, the .Position property of the BaseStream (infile) is wildly off. This is because it's reporting the position in the source stream, which is no longer reading a byte at a time. It's buffering 1024 at a time. So the .Position property is useless to me.
That's why the "counter" is in the above code. In that code, "counter" and "infile.Position" are synchronized and I could use them interchangeably.
I had hoped that "counter" would still be accurate when I switched to StreamReader. It isn't!
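A possible explanation, sketched under assumptions: StreamReader decodes bytes into chars (UTF-8 unless told otherwise), so one char read does not necessarily mean one byte consumed. The helper below counts how many chars a StreamReader yields for a small byte array. The names CharCountDemo and CountChars are made up for this illustration and are not part of my project.

```csharp
using System;
using System.IO;
using System.Text;

static class CharCountDemo
{
    // Counts how many chars a StreamReader produces for the given bytes
    // when decoding with the given encoding.
    public static int CountChars(byte[] data, Encoding encoding)
    {
        using (var reader = new StreamReader(new MemoryStream(data), encoding))
        {
            int count = 0;
            while (reader.Read() >= 0)
            {
                count++;
            }
            return count;
        }
    }
}
```

For the four bytes { 0x41, 0xC3, 0xA9, 0x0C }, CountChars returns 3 with Encoding.UTF8, because 0xC3 0xA9 decodes to a single char, and 4 with Encoding.GetEncoding("iso-8859-1"), which maps every byte to exactly one char. If that is what is happening with the binary data in my PCL files, a char counter would drift from the real byte offset.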
Can anyone shed light on this? I need a way to combine the speed of StreamReader/buffering while keeping track of absolute byte positions in the file.
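One direction that might work, as a minimal sketch rather than a tested solution: stay with bytes, read the file in large blocks with Stream.Read, and keep track of the absolute offset of the start of each block. FormFeedScanner, FindPageStarts, marker and blockSize are hypothetical names for this sketch. The marker bytes would be the ASCII bytes of bgn_of_page. Unlike the loop above, this checks every byte with the value 12 and does not skip the 14 bytes after a candidate.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FormFeedScanner
{
    // Returns the absolute offset of the byte following each Form Feed (12)
    // that is immediately followed by the bytes in marker.
    public static List<long> FindPageStarts(Stream input, byte[] marker, int blockSize)
    {
        if (blockSize <= marker.Length)
            throw new ArgumentException("blockSize must be larger than marker.Length");

        var positions = new List<long>();
        byte[] buffer = new byte[blockSize];
        long bufferStart = 0; // absolute offset of buffer[0] in the stream
        int filled = 0;       // number of valid bytes currently in buffer

        while (true)
        {
            int read = input.Read(buffer, filled, buffer.Length - filled);
            filled += read;
            bool atEnd = read == 0;

            // Before the end of input, leave marker.Length bytes unscanned so that
            // every candidate has its full look-ahead available in the buffer.
            int limit = atEnd ? filled : filled - marker.Length;
            int i = 0;
            for (; i < limit; i++)
            {
                if (buffer[i] == 12 && Matches(buffer, i + 1, filled, marker))
                    positions.Add(bufferStart + i + 1);
            }
            if (atEnd) break;

            // Carry the unscanned tail to the front of the buffer and refill behind it.
            int remaining = filled - i;
            Array.Copy(buffer, i, buffer, 0, remaining);
            bufferStart += i;
            filled = remaining;
        }
        return positions;
    }

    // True if marker appears in buffer at start, within the filled region.
    static bool Matches(byte[] buffer, int start, int filled, byte[] marker)
    {
        if (start + marker.Length > filled) return false;
        for (int k = 0; k < marker.Length; k++)
        {
            if (buffer[start + k] != marker[k]) return false;
        }
        return true;
    }
}
```

With a FileStream as input and a blockSize of, say, 64 * 1024, each recorded value is bufferStart plus the index inside the block, so it should correspond to what infile.Position reports in the byte-at-a-time version, without a Read call per byte.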
Thank you SO MUCH for reading this far!