Posts Tagged ‘debugging’

A Showstopper And A Nuisance

December 15, 2012

I recently spent 3.5 days hunting down and squashing a “showstopper” bug that ended up being a side effect of an earlier fix that I had made to eradicate a long-standing “nuisance” bug.

SIDEBAR: The nuisance bug occurred only during system shutdown. The system would crash on exit, but no data was lost or corrupted. It was long thought to be an “out of sequence” object destruction problem, but because hundreds of lines of nested destructor code are called during shutdown in multiple threads of execution and the customer never formally reported the bug (because the system is very rarely shut down), its annihilation was put on the shelf – until I “fixed” it.

When I was initially told about the showstopper, I was confounded because the bug seemed to be located somewhere in the code of a simple feature that I thought I had tested pretty thoroughly. And yet, there it was, plainly obvious and easily reproducible – a crash that happens at runtime whenever a specific sequence of operator actions is performed. WTF?

With the system model below embedded in my genius mind, I thought the bug HAD to be located somewhere in the massive, preexisting, 150K line legacy code base. After all, the number of code lines I added to the beast in order to implement the feature was so small and unassuming that the odds favored my hypothesis.

Even though my hypothesis was that the new code I added had uncovered and triggered some other dormant bug deep in the bowels of the software, I first inspected the measly few lines I recently added to the code base. Of course, the inspection yielded no “aha, there it is” moment. Bummer.

Next, I fired up the debugger, sprinkled a bunch of breakpoints throughout the code, and stepped through my brilliant and elegantly simple code. I found that when the control of execution descended into the netherworld below my impeccable work, the bug came out of hiding – crash! However, since a bunch of event-driven callbacks were triggered each time the control of execution left my code, I couldn’t trace the execution path so easily.

Exasperated that the debugger didn’t tell me exactly where and what the freakin’ bug was, I started reading and reverse engineering (via targeted UML class and sequence diagram sketches) segments of the legacy code. Since it was my first focused foray down into the dungeon, it was a slow-going, but beneficial, learning experience.

Finally, after a couple of days of inspection, reverse engineering, and a bazillion debugger runs, I stumbled upon a note written by yours truly in one of the infrastructure callback functions:

“The line of code below was commented out because it triggers a crash on shutdown”.

Bingo, a light went on! Quickly, I uncommented the line of code and reran the program. Yep, the bug was gone! As I initially thought, the critter did turn out to be living within the infrastructure, but I had unwittingly put it there a while ago in order to kill the long-standing “nuisance” shutdown bug. Ain’t life grand?

Of course, the tradeoff for re-enabling the line of code that killed the nasty bug is that the nuisance bug is alive and well again. And no, unless I’m directly ordered to, I ain’t gonna go uh huntin’ fer it aginn. No good deed goes unpunished.

Categories: technical

Reasonable Debugging

In Rich Hickey’s QCon talk, “Simple Made Easy”, he hoisted this slide:

So, what can enhance one’s ability to “reason about” a program, especially a big, multi-threaded, multi-processing beast that maps onto a heterogeneous hodge-podge network of hardware and operating systems? Obviously, a stellar memory helps, but come on, how many human beings can remember enough detail in a >100K line code base to be able to debug field turds effectively and efficiently?

How about simplicity of design structure (whatever that means)? How about the deliberate and intentional use of a small set of nested, recurring patterns of interaction – both of the GoF kind and application-specific ones? Or, shhhh, don’t say it too loudly, how about a set of layered blueprints that allow you and others to mentally “fly” over the software quickly at different levels of detail and from different aspect angles, without having to slog through reams of “flat” code?

Do you, your managers, and/or your colleagues value and celebrate: simplicity of design structure; use of a small set of patterns of interaction; use of a set of blueprints? Do you and they walk the talk? If not, then why not? If so, then good for you, your org, your colleagues, your customers, and your shareholders.

Bah Hum BUG

April 2, 2010

Note: For those readers not familiar with C++ programming, you still might get some value out of this nerd-noid blarticle by skipping to the last paragraph.

The other day, a colleague called me over to help him find a bug in some C++ code. The bug was reported from the field by system testers. Since the critter was causing the system to crash (which was better than not crashing and propagating bad “logical” results downstream), a slew of people, including the customer, were “waiting” for it to be fixed so the testers could continue testing. My friend had localized the bug to the following 3 simple lines:

After extracting the three lines of code out of the production program and wrapping them in a simple test driver, he found that after line 3 was executed, “val” was being set to “1” as expected. However, the “int” pointer, p_msg, was not being incremented as assumed by the downstream code – which subsequently crashed the system. WTF?

After figuring it out by looking up the C++ operator precedence rules, we recreated the crime scene in the diagram below. The left scenario at the bottom of the figure was what we initially thought the code was doing, but the actual behavior on the right was what we observed. Not only was the pointer not incremented, but the value in msg[0] was being incremented – a double freakin’ whammy.

Before analyzing the precedence rules, we thought that there was a bug in the compiler (yeah, right). However, after thinking about it for a while, we understood the behavior. Line 3 was:

  1. extracting the value in msg[0] (which is “1”)
  2. assigning it to “val”
  3. incrementing the value in msg[0]

Changing the third line to “int val = *p_msg++” solved the problem. Because of the operator precedence rules, the new behavior is what was originally intended:

  1. extract the value in msg[0] (which is “1”)
  2. assign it to “val”
  3. increment the pointer to point to the value in msg[1]

A simple “const” qualifier placed at the beginning of line 2 would have caused the compiler to barf on the code in line 3: “you cannot assign to a variable that is const”. The bug would’ve been squashed before making it out into the field.

It’s great to be brought down to earth and occasionally humbled by experiences like these, especially when you’re not the author of the bug 🙂 Plus, after-the-fact fire fighting is cherished by institutions over successful prevention. After all, how can you reward someone for a problem that didn’t occur because of his/her action? Even worse, most institutions actively thwart the application of prevention techniques by imposing Draconian schedules upon those doing the work.

The world is full of willing people, some willing to work, the rest willing to let them. – Robert Frost

My contortion of this quote is:

The world is full of willing people, some willing to work, the rest willing to manage them while simultaneously pressuring them into doing a poor job. – Bulldozer00
