What's the most challenging bug you've ever encountered in your programming career, and how did you eventually solve it?

I’ll start. The only one I can recall from the recent past is not realizing that you can’t use a MySQL transaction from within a persistent connection. Here’s when I learned that lesson.

As far as the single biggest threat to my career from a technical perspective, I would have to definitively say the miscommunication between myself and James that led to our database being hacked in 2015. We are still experiencing the fallout from that today.

How about everyone else?

The most challenging bug was one I introduced myself. While making a minuscule code change, I decided to improve the readability of the following line by inserting a space. In today's code that would not be a problem; however, most of our code (AGC/SCADA) was written in FORTRAN. Those of you who made your bones in the punch card era, or know a little of the history of computing, will know where I am going. In the age of 80-column punch cards, columns 73-80 were reserved for sequence numbers. Supposedly there existed somewhere a resequencing machine. I've never seen one.

Unfortunately this convention was continued into our system where all editing was done on VT100 CRT terminals and sequence numbers no longer made sense. The offending line was of the form:

CALL MYFUNC(parm, parm, ... ,,,,,,,

where the rightmost comma was in column 72. The next line continued the previous statement. By pushing the comma into the sequence area it was ignored by the compiler. The code still compiled, but now all the parameters following that were off by one position. Apparently the function was not frequently used so it took a while for the problem to show up and much longer to track it down.
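For anyone who has not met fixed-form FORTRAN, here is a rough C++ sketch of the effect. It is only a model of the column rule, not the real compiler: `statement_field`, `count_commas`, `make_card`, and `insert_space_after_call` are made-up helper names, and `P1`/`P2` stand in for the real parameters, which I am not reproducing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Fixed-form FORTRAN: columns 1-72 hold the statement text; columns 73-80
// were the sequence field, which the compiler ignores.
constexpr std::size_t kLastStatementColumn = 72;

// Returns the part of a source line that the compiler actually reads.
std::string statement_field(const std::string& card) {
    return card.substr(0, kLastStatementColumn);
}

// Counts commas, a stand-in for "argument separators the compiler saw".
std::size_t count_commas(const std::string& text) {
    std::size_t n = 0;
    for (char c : text) {
        if (c == ',') ++n;
    }
    return n;
}

// Builds a hypothetical line whose rightmost comma sits exactly in column 72.
std::string make_card() {
    std::string card = "      CALL MYFUNC(P1, P2,";
    assert(card.size() <= kLastStatementColumn - 7);
    card.resize(kLastStatementColumn - 7, ' ');  // pad with blanks up to column 65
    card.append(7, ',');                         // trailing commas fill columns 66-72
    return card;
}

// The "readability" edit: one extra space after CALL shifts the rest right.
std::string insert_space_after_call(std::string card) {
    card.insert(10, 1, ' ');  // index 10 is just past "      CALL"
    return card;
}
```

With these helpers, `count_commas(statement_field(make_card()))` is 9, while the same count on `insert_space_after_call(make_card())` is 8: the last comma has moved into column 73 and silently vanished, which is why everything after it shifted by one position.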

As far as I know my university likewise lacked a resequencer. For that matter I don't think they had a machine to add the sequence numbers either.

For the record, the wailing of someone who has just dropped their five hundred card program the night before the project is due is frightening.

The one with the most impact was a Firebird library update, which appeared to have little to no impact. After we deployed it at a customer's site, the library soon showed a complete disregard for database transactions. This mangled the client's database completely and forced us to retrace all changes by hand to undo the damage done. Several weeks' worth of misery...

The most dangerous update I ever had to do, which I was glad to see working correctly, was a software change to a gold-plating assembly line for computer chips. A mishap would have caused thousands of dollars' worth of damage per second.

This https://en.wikipedia.org/wiki/Peterson%27s_algorithm
Plus two different kinds of processors.
Plus an error rate of less than once a week.

After many months, the cause was eventually captured on a bus analyser.

After which, the solution was obvious after a bit of RTFM.

Turned out that one of the processors had the then-awesome new feature of out-of-order memory operations. So very rarely, a write-read in the algorithm turned into a read-write on the bus. Fine if it was the only thing on the bus, as the memory logic in the processor would have sorted it out.

But it wasn't the only thing on the bus, and the other processor was in that super critical window for some few µs at exactly the wrong moment.

Result, both spent the rest of eternity waiting for the other one to get on with it.

Turned off said feature; beers all round, chaps!
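For the curious, here is roughly how the same ordering requirement would be expressed in C++ today. To be clear, this is not what that system did (there the fix was switching the feature off in hardware); it is only a sketch, assuming two threads with ids 0 and 1, and `PetersonLock` and `run_demo` are names made up for the illustration. The point is that the store to your own flag must become visible before you read the other side's flag, which sequentially consistent atomics guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Peterson's algorithm for exactly two threads (ids 0 and 1).
class PetersonLock {
public:
    PetersonLock() : turn_(0) {
        flag_[0] = false;
        flag_[1] = false;
    }

    void lock(int self) {
        assert(self == 0 || self == 1);
        const int other = 1 - self;
        flag_[self].store(true, std::memory_order_seq_cst);  // "I want in"
        turn_.store(other, std::memory_order_seq_cst);       // "but you go first"
        // seq_cst keeps these loads from being reordered ahead of the stores
        // above, which is the write-then-read order the algorithm depends on.
        while (flag_[other].load(std::memory_order_seq_cst) &&
               turn_.load(std::memory_order_seq_cst) == other) {
            std::this_thread::yield();
        }
    }

    void unlock(int self) {
        flag_[self].store(false, std::memory_order_seq_cst);
    }

private:
    std::atomic<bool> flag_[2];
    std::atomic<int> turn_;
};

// Two threads bump a plain (non-atomic) counter under the lock and the
// function returns the final count.
long run_demo(int iterations) {
    PetersonLock lock;
    long counter = 0;
    auto worker = [&](int self) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(self);
            ++counter;
            lock.unlock(self);
        }
    };
    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return counter;
}
```

If the lock holds, `run_demo(n)` returns `2 * n`. Weakening the orderings to `std::memory_order_relaxed` would allow the same kind of write-read swap described above, although, as in the original story, you might wait a very long time to actually see it go wrong.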
