I am reading a file that is about 20 MB, which takes about 10 seconds.
A few lines from this file are shown below. The first value is a date, in ascending order, and the second value, after the comma, is a number.
In the while loop that reads this file from top to bottom there is an if-statement that says: if( Date == "01/03/2008" )

Now my question is this: let us say that I am only interested in reading the lines between the dates 01/03/2008 and 01/06/2008.
This would mean that I only read the necessary lines from this 20 MB file.
So is it possible to search for an entry point and an exit point in this file instead of reading the file from top to bottom?


01/01/2008,1
01/02/2008,2
01/03/2008,3
01/04/2008,4
01/05/2008,5
01/06/2008,6
01/07/2008,7

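ifstream ReadFile("C:\\File1.txt");

std::string Date;
double Number = 0;

// Each line has the form MM/DD/YYYY,Number
while( getline(ReadFile, Date, ',') )
{
	ReadFile >> Number;
	ReadFile.get();	// consume the newline that follows the number

	if( Date == "01/03/2008" )
	{
		int i = 5;	// placeholder for the real work
	}
}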

Convert the dates to time_t using struct tm in <ctime>, then you can easily compare two dates without any problems. Or you can rearrange the dates so that they are in the form YYYY/MM/DD, then you can use normal string comparisons on them.
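A minimal sketch of the second idea, as a variation: instead of a rearranged string it builds the integer YYYYMMDD, which compares the same way. The names date_key and in_range are hypothetical, not from any library, and the MM/DD/YYYY layout is assumed from the sample lines.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: turns "MM/DD/YYYY" into the integer YYYYMMDD,
// so that ordinary integer comparison orders dates chronologically.
// Returns -1 if the text does not start with three numbers separated by '/'.
long date_key(const std::string& text)
{
    int month = 0, day = 0, year = 0;
    if (std::sscanf(text.c_str(), "%d/%d/%d", &month, &day, &year) != 3)
        return -1;
    return year * 10000L + month * 100L + day;
}

// True if date lies in the closed range [first, last].
bool in_range(const std::string& date,
              const std::string& first,
              const std::string& last)
{
    const long k = date_key(date);
    return k != -1 && date_key(first) <= k && k <= date_key(last);
}
```

With this, the test inside the read loop could become something like in_range(Date, "01/03/2008", "01/06/2008").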

>>So is it possible to search for an entry and an exitpoint in this file instead of reading the file from top to bottom ?

Depends. If the file is sorted by date, then you start from the beginning of the file and keep reading until the last date that you want has been read. At that point the program can stop reading the file. It's not possible to jump directly to the beginning of the dates that you want.

I understand. It is a good idea to stop reading at the last date. I know how to manage this.

This is easy for me to say, of course, but I have a program that reads a .txt file of 20 MB, and when I choose to read between two dates somewhere in the middle of the file it takes perhaps 0.1 seconds, compared with 10 seconds when I choose to read the whole file. I just wonder how that is done.
The reason I ask is that I will read thousands of files daily, so this detail could save a lot of time.

If it isn't possible to search for a start point or date, I wonder if it is possible to tell the while loop to start at a specific line number.
For example: "Start reading from line number 50 in the file."?


If you know something about the distribution of dates in the file you can get the fastest time, but in any case you'll have to do a random-access search.

This should help, but may not if the date is in a worst-case position:

  1. read the first date in the file
  2. seek to the end of the file, backpedal until you have the last (non-empty) line of the file, and read that date.
  3. calculate the percentage that your target date falls between the first and last date of the file.
  4. jump to that position in the file and scan forward until you are at the beginning of a line.
  5. read the date.
  6. compare the date with the target date.
    1. if the date is what you want, you're done.
    2. if the date is too late, calculate the new percentage to be halfway between the new position and the current 'beginning position' (which will be the beginning of the file the very first time you do this).
    3. if the date is too early, calculate the new percentage to be halfway between the new position and the current 'ending position' (which will be the end of the file the very first time you do this).
    4. reset your new beginning and ending positions.
  7. goto 4.

Hope this helps.

Thank you. This seems to be a solution.
I wonder about one detail: how is it possible to jump to a position in the file?

Let's say I already know that I will read from the middle of the file (50%).
How do I "jump" and begin to read from there?
What method could be used for this?

Duoas's suggested binary search will work only if all the lines in the file are exactly the same length. If the lines differ in length, due to the length of the number following the comma, then it won't work because you will not know the location of any given random line.

If you have the source code to the program that writes the entries in those files, then it would help a lot if you changed it to write lines that are exactly the same length, even if it has to embed spaces or pad the numbers after the comma on the left with 0s so that they are all the same length.

01/01/2008,0001
01/02/2008,0002
01/03/2008,0003
01/04/2008,0004
01/05/2008,0005
01/06/2008,0006
01/07/2008,0007
// etc
01/07/2008,9999

Call ifstream's seekg() method to move the file pointer to any location in the file. But you must know the byte offset.

Thank you. I will look up the seekg() method in MSDN.
Though, as you said, this might be a problem because the lines in the file are not exactly the same length. They differ by at most about 5 characters, and I believe I can't change them.

>>Thank you. I will look the seekg() method up in msdn.
You might find it there, but that is a standard C++ fstream function. Look here instead.

Actually, you got me interested, so I'm writing a little library for you to do it... :)

My method accounts for the possibility that lines are different lengths.
As a matter of interest, do you open the file in textual or binary mode?

Hehe, that is too nice of you :) I looked at the lines and they differ by at most about 5 characters.
I open the file like this. I am not sure myself if this is binary or textual.
Perhaps you mean what the actual file contains, and it looks like this:
(perhaps this is meant to be textual?)
12/01/1999,1,2,3,4,5,6

ifstream ReadFile("C:\\File1.txt");

>>ifstream ReadFile("C:\\File1.txt");
That is text mode. Here is binary mode: ifstream ReadFile("C:\\File1.txt", ios::binary);

Text mode is good (makes things simpler).

Give me just a little longer to finish testing all the various corner cases where things can go wrong.

Yes, of course! I will be really happy to see what you can make of this.
It will be interesting. I am trying to build a solution myself from your idea, but I find it quite difficult. I am still trying...

/J

Alright! :)

Sorry for the delay. There were a few more corner cases than I thought there would be when I started this. But anyway, here you go. Hope you find it useful.

The algorithm presumes that the dates in your file are more or less linearly distributed. If this is not the case by more than a standard deviation, open "datefile.cpp" and comment out the line #define LINEAR_DISTRIBUTION

Sorry for the boilerplate, but companies get nervous without it. Basically it just says you can do anything you like with the files except claim that anyone but me wrote the original versions, or alter the boilerplate (and it excludes me from legal responsibility if someone manages to destroy something with it).

The file "a.cc" is what I used to test the datefile algorithm. You don't need it, but I've attached it anyway so you can play with it if you like. It's messy though...

Let me know if anything goes horribly wrong. ;)

So, here's a quick primer:

#include <iostream>
#include <fstream>

#include "datefile.hpp"

using namespace std;
using namespace datefile;

int main()
  {
  ifstream megafile( ... );

  time_t date = string_to_date( "4/28/1974" );
  streampos linepos = find_date( megafile, date );

  if (megafile)  // or (!megafile.eof())
    {
    string line;
    getline( megafile.seekg( linepos ), line );
    cout << "Found the line> " << line << endl;
    }
  else 
    {
    megafile.clear();
    cout << "No such date found.\n";
    }

  ...
  megafile.close();
  return 0;
  }

Enjoy! :)


Hmm. OK. A while back a server failure stuffed my inbox with thousands of messages. Since I can only delete 25 at a time I've given up.
Apparently this is also preventing me from attaching files. So I'll double-post them into the next post...

OK, sorry for the public service... (I also lost my webspace a short while back...)

Don't forget to read my comments in the last post.

datefile.hpp
datefile.cpp
a.cc

Just copy the datefile files into your project folder and add them to your project.
Make sure to #include "datefile.hpp" in your program (in the files where you intend to use it).

Heh...

I managed to download the files. They opened in a browser, and from there I saved them as datefile.hpp and datefile.cpp. I am not sure if this was the correct way to do it.
I put "datefile.cpp" and "datefile.hpp" in my project folder.
When I #include "datefile.hpp" I get a compiler error:

'#include' : expected a filename, found '&'

\datefile.cpp(170) : fatal error C1010: unexpected end of file while looking for precompiled header. Did you forget to add '#include "stdafx.h"' to your source?

You said something about adding the file to the project first. I am not sure if this is something I have to do. I am not sure if I have done that before.
Thank you...

Your project is using precompiled headers, so the first line after the comments at the top of the *.cpp file must contain #include "stdafx.h". If it doesn't appear on the first line, add it.

I chose "View Code" for "datefile.cpp", and the code appears.
At the top there is a line #include "stdafx.h", yet I still get the compiler error.
For the file "datefile.hpp" I chose "Include In Project", and this file compiles without problems.
But when I "Include In Project" the file "datefile.cpp", which has #include "stdafx.h" as its first line, I get this compiler error anyway.
I am not sure what could cause this. I have also tested this without #include "datefile.hpp" anywhere.

I attach the datefile.cpp. I copied its content and pasted it into a .txt file.
As far as I can see, "stdafx.h" is included there.

I don't know if you downloaded them correctly. They are plain-text files, not HTML or anything else. If you open them in notepad you shouldn't see anything funny.

Go to the download, click download. When you get the web page showing the file, select File --> Save Page As... and make sure the bottom-most drop-down box says the document type is "Text Document".

If you've done that (and I see that you put the include in the right spot), then you'll have to ask AD for more help. (Stupid MS VC++.)

Let me go dig out my old version of VC++ and see what weirdness it wants...

(Alas, I was looking forward to a happy you.)

Don't worry, even if it doesn't work I am happy for this effort. I looked at your test code and it wasn't little : )

I did as you said anyway now. I opened the .cpp and .hpp files in the browser and saved these files as text files. The project compiles except for the .cpp file.
I have attached 2 files. One is the original, as it looked when I saved it (Saved datefile),
and the second one has a few adjustments to the #include lines.
It still doesn't compile.
I have trouble understanding the .cpp file. To be honest I don't understand much of it, so it is hard for me to search for errors.

However, when I exclude datefile.cpp from my project and #include "datefile.hpp" in one of my forms, this compiles fine.
I am not sure if I need the .cpp file anyway to make it work?

Here is the datefile.cpp that I get with the link posted by Duoas in post #17. As you see, stdafx.h is not included here. I deleted most of the code to show only the relevant parts. If there is a more recent copy then you should post it. Those text files you have posted are worthless because they are unreadable. Post the *.cpp file.

// datefile.cpp
// Copyright (c) 2008 Michael Thomas Greer.
//
// Boost Software License - Version 1.0 - August 17th, 2003
//
// Permission is hereby granted, free of charge, to any person or organization
// obtaining a copy of the software and accompanying documentation covered by
// this license (the "Software") to use, reproduce, display, distribute,
// execute, and transmit the Software, and to prepare derivative works of the
// Software, and to permit third-parties to whom the Software is furnished to
// do so, all subject to the following:
//
// The copyright notices in the Software and this entire statement, including
// the above license grant, this restriction and the following disclaimer,
// must be included in all copies of the Software, in whole or in part, and
// all derivative works of the Software, unless such copies or derivative
// works are solely in the form of machine-executable object code generated by
// a source language processor.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
// SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
// FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
// ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
// DEALINGS IN THE SOFTWARE.
//

#include <algorithm>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

#include <cmath>    // for floor()
#include <cstring>  // for memset()
#include <ctime>

#include "datefile.hpp"

#define LINEAR_DISTRIBUTION


//////////////////////////////////////////////////////////////////////////////
namespace datefile {
//////////////////////////////////////////////////////////////////////////////

const std::time_t INVALID_TIME_T = (std::time_t)(-1);

<snipped>

// end datefile.cpp

I downloaded all three files and successfully compiled them with VC++ 2008 Express after making a couple of changes.

  • rename a.cc to a.cpp
  • add #include <limits> to each of the two *.cpp files
  • VC++ 2008 Express does not recognize and and or, so add these macros after the includes in each *.cpp file
    #ifndef or
    #define or ||
    #endif
    #ifndef and
    #define and &&
    #endif
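As an alternative to hand-written macros, the standard header <iso646.h> defines and, or, not and friends as macros on compilers that do not treat them as keywords by default (MSVC in its default mode is one such case); on compilers that already accept them it has no effect. A tiny sketch, with function names made up just to show the tokens in use:

```cpp
#include <iso646.h>  // supplies and, or, not, ... where they are not built in

// Hypothetical example functions using the alternative tokens.
bool in_closed_range(int value, int low, int high)
{
    return value >= low and value <= high;
}

bool outside_closed_range(int value, int low, int high)
{
    return value < low or value > high;
}
```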

I believe my saved copy of the file became messy compared to this one.
Yes, you are right, #include "stdafx.h" is missing on line 28.
So I copied this text and added the include on line 28. There are still errors that seem to point at these 3 lines in the datefile.hpp file:
(I am not sure if an #include can look like that, with ; etc...)

#include &lt;ctime&gt;
#include &lt;iostream&gt;
#include &lt;string&gt;

The errors look like this:
error C2006: '#include' : expected a filename, found '&'
fatal error C1083: Cannot open include file: '': No such file or directory

>>I did do as you said anyway now. I opened the .cpp and .hpp files in the browser and saved these files as textfiles. The project compiles except for the .ccp file.

That was a huge mistake. When you download the three files you have to click the Save button in the download window, not open them with the browser. From what I see in the text files you posted, your browser added a lot of HTML code to the files that the C++ compiler doesn't know how to handle.

Here are the files I compiled in .zip form. Just use winzip to uncompress them in the directory of your choice.
[edit]DaniWeb will not upload *.zip files, so I changed the filename to *.txt. After you download it, change the file back to *.zip and use winzip on it.[/edit]

[edit]
Argh. Sorry about leaving out <limits>. (I forgot and GCC doesn't complain.)

I come from a Pascal background so I tend to like things you can actually read... I didn't know that VC++ doesn't understand and or or. (Stupid VC++.)

You should be very pleased with the results. The algorithm is very quick, and typically needs to examine only a very few lines before finding the answer.

I also forgot to mention earlier that lines that do not begin with a valid date of the form MM/DD/YYYY or MM-DD-YYYY are completely ignored, so it is OK if your file contains blank lines or other information.

Enjoy. Thanks a ton AD. You rock!

Yes, that was a huge mistake :) I had opened it in the browser and saved it from there.
I downloaded the zip file and unzipped it. The .hpp and .cpp files compile now :)
I am not sure what was done to make them compile, but I will examine the files as well as I can. The a.cpp file did not compile, but I think that is no problem because it only contains test code for the function.

Duoas, you have really done a lot of work here. I have to thank you a lot!
I will really try to understand this as much as I can and test it out now.
The files that I have do not contain empty lines, but it is nice that the function can ignore blank lines. I will try to use the example code that you sent earlier showing how to use this, and I will see what I can do.
Thank you again AD and Duoas !!!
