(I hope this is not already covered in another post, a quick search didn't turn up anything in the archive)
So, I have recently made a lot of changes to a program, and want to quantify what those changes have done to the speed.
Let us call the old version A, and the new version B.
Just running A and B with the same settings and timing the runs with a clock gives me that B's runtime is 20 % longer than A's. This is in itself a puzzler, as what I have done is clean up and simplify the code, which naivity would suggest lead to a speed-up, but never mind that.
Using the GNU gprof tool to time the two runs, I find that B is more than 50 % slower than A. This is even more surprising, and it's a bit scary that my clean-up was that destructive. I realize that gprof changes the time to run the code, but it surprises me that the two programs are not slowed down equally.
Looking at the call-trees in the output .txt-file, I also learn that the majority of the time is not spent in the member function LotsOfWork(), as expected, but rather in stl::vector < data-structure > [] (long int). This is also a bit surprising, but I did add a lot of those calls, and I guess profiling is supposed to surprise sometimes.
Using the callgrind tool in valgrind (visualised with KCachegrind), I learn the majority of the time is actually taken up by LotsOfWork() - a whopping 34% when running B (and 43 % when running A). This is to be compared to gprof's estimate of 13%. stl::vector < data-structure > [] (long int) has been demoted to just taking 7 % of the time (instead of 17% according to gprof).
The relationship between A and B is by callgrind estimated as B being about 25% slower than A.
So, is gprof just not very trustworthy, or could there be something else going on? I have on previous occasions refrained from reducing the time spent in LotsOfWork(), because gprof indicated that that would be a waste of time, but looking in KCachegrind, it seems that that would very much be a good place to start?