Debugging can be haphazard and chaotic, but the CSI approach to troubleshooting electronic product design, which we refer to as Debug Forensics, is methodical and, indeed, forensic. How can this method lead to more successful and efficient debugging of electronic and software products, saving time and establishing confidence that bugs have been correctly identified and fixed?
The consequences of poor and incorrect debug diagnosis are obvious and substantial. For electronic and software design houses, the cost of non-conformance can equally be significant.
Waste, inefficiencies, and dissatisfaction occur when the wrong root cause is identified; lost engineering time, working components being unnecessarily replaced, frustrated engineers and unhappy customers are then the consequence of project delays and bugs that remain.
When diagnosing a bug, it is easy, and unfortunately all too common in busy development organisations, to jump straight in, guess what the problem is and act on a crude ‘I have seen this before’ diagnosis.
By applying a more rigorous, repeatable method to debugging, better results can be achieved, and engineers can be assured that a product in production is safe.
When debugging a problem, several stakeholders may be pressurising for a resolution, and this can create a sort of headless chicken approach to debugging.
Paradoxically, debugging can be fun. As a society, we spend a lot of time getting enjoyment out of solving issues and tackling problems. That's what debugging is.
Today debugging takes a forensic approach, where tests are repeatable and scientific. It’s key in any investigation to picture success since it can be demoralising to look at the same bug for a few days or weeks. The bug after all will be fixed.
Gathering evidence
When it comes to debugging, the important thing is not to try immediately to guess or solve the problem. The key is to gather the evidence without judgement.
I'd strongly advise companies to use a free bug tracking tool, such as Bugzilla or The Bug Genie, as these force you into a rigorous forensic approach to debugging. Initially, during bug submission, a customer might say: "I've got this bug. This is what happened." You then reproduce it, and the bug goes through specific steps; at each step, everyone with an interest is alerted.
Further advantages are communication, where all stakeholders can log in for updates; and bugs that can't be fixed immediately can be parked.
A note of caution, however. Quite often, different bugs have the same symptoms. To give you an example, I created boards for a digital signal processing company. One system had 16 audio ports, ADCs and an FPGA, and it communicated over the internet to a Linux board, which captured the audio. We didn’t know at the time that it was destined to be used by an intelligence agency for phone tapping. I designed the hardware for the board and a colleague was developing the BSP code. It was all functioning fine, but then after about five hours, it would just stop working.
And it was very repeatable. We’d run the system, hear a quiet bang and it would stop working at around five hours. However, sometimes it would stop working at one hour 10 minutes. We tracked down the five-hour bug to a memory overflow in the software, which we fixed. We decided to ship the product if it ran successfully for 24 hours. We reached 23 and a half hours and it stopped working. After that, it never ran that long and would bang much earlier.
After many despair-ridden sessions, we identified that we had a metastability bug in the FPGA. There were in fact two bugs. They both had the same symptom, i.e. audio channel failure, but there were very subtle differences in that one fell over at a precise time interval and the other was randomly distributed.
This is really important because engineers sometimes gather contradictory information about what they think is one bug and then start ignoring some of the evidence. When it is actually two bugs, the second bug remains live.
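One way to make that distinction visible is to log the uptime before every failure and look at how the times are spread. The sketch below is only an illustration: the failure times are made-up values, and the function names and the 0.25-hour tolerance are assumptions rather than anything taken from the project described above.

```python
from statistics import mean, stdev

def coefficient_of_variation(times_h):
    """Spread of failure times relative to their mean (near 0 = very regular)."""
    return stdev(times_h) / mean(times_h)

def split_by_interval(times_h, expected_h, tolerance_h):
    """Separate failures that land near a suspected fixed interval from the rest."""
    regular = [t for t in times_h if abs(t - expected_h) <= tolerance_h]
    scattered = [t for t in times_h if abs(t - expected_h) > tolerance_h]
    return regular, scattered

# Hypothetical run log: hours of uptime before each failure (illustrative only).
failure_times_h = [5.0, 5.1, 1.2, 4.9, 5.0, 23.5, 5.1, 0.4]

regular, scattered = split_by_interval(failure_times_h, expected_h=5.0, tolerance_h=0.25)
print("near 5 h:", regular)
print("elsewhere:", scattered)
print("CV of all failures:", round(coefficient_of_variation(failure_times_h), 2))
print("CV of the 5 h cluster:", round(coefficient_of_variation(regular), 3))
```

A tight cluster alongside a handful of scattered times is a hint, not proof, that more than one mechanism is at work, and that the scattered failures deserve their own entry in the bug tracker.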
Reproducing the problem
Often the hardest step in debugging is reproducing the problem, especially for intermittent bugs. The reasons are manifold. Firstly, bug reports are often very poor with scant detail.
Secondly, environmental factors are critical. We design and test devices in offices or labs, but a product could be used in the middle of the rainforest, out at sea or in a car park.
The attached equipment is important. When we work in our offices, we have a specific set of devices that we attach our unit under test to, but the end-user may be using a completely different set of equipment. Sadly, there is also user error.
We recommend a checklist. When trying to reproduce the problem, ask: is the test setup the same environment as the user's? Can you make it the same? Push the user for a set of steps, one at a time, that reproduce the problem they reported, so you know exactly what happened and can recreate it in your own tests.
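As a rough illustration of such a checklist, the sketch below records the user's setup and the lab setup in the same structure and lists where they differ. The field names and example values are hypothetical; a real checklist would carry whatever items matter for the product in question.

```python
from dataclasses import dataclass, fields

@dataclass
class SetupRecord:
    """Hypothetical checklist of the conditions a bug was seen (or tested) under."""
    firmware_version: str
    power_source: str
    attached_equipment: str
    ambient_conditions: str
    steps: tuple  # ordered user actions, one per entry

def setup_differences(user, lab):
    """Return the checklist items where the lab setup differs from the user's."""
    return [f.name for f in fields(SetupRecord)
            if getattr(user, f.name) != getattr(lab, f.name)]

# Illustrative values only.
user_setup = SetupRecord("1.4.2", "vehicle 12 V", "third-party display",
                         "outdoors, damp", ("power on", "plug in display", "press start"))
lab_setup = SetupRecord("1.4.2", "bench supply", "reference display",
                        "office", ("power on", "plug in display", "press start"))

print(setup_differences(user_setup, lab_setup))
```

Each item that comes back is a candidate for something to change in the lab before concluding that the bug cannot be reproduced.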
Automating a test set up
Something that we like to do is automate the test setup. Often, we use an Arduino to power boards on and off. A common bug is a board that fails to boot, perhaps only one time in a hundred.
So, we've got the Arduino that we attach to a power supply, so it can just automate turning on and off overnight or over the weekend. Then we can have test equipment attached, or a data logger or a PC.
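The host side of such a rig can be sketched as a simple loop. This is a minimal illustration rather than our actual test code: `set_power` and `booted_ok` are hypothetical callables that you would implement yourself, for example as thin wrappers around a serial link to the Arduino and to the board's console. Here they are replaced by stand-ins so the loop can be run without hardware.

```python
import time

def power_cycle_test(set_power, booted_ok, cycles, off_s=2.0,
                     boot_timeout_s=30.0, sleep=time.sleep):
    """Power the board off and on `cycles` times.

    Returns the list of cycle numbers on which the board failed to boot.
    set_power(on) and booted_ok(timeout_s) are supplied by the caller.
    """
    failures = []
    for cycle in range(1, cycles + 1):
        set_power(False)
        sleep(off_s)              # give the supply rails time to discharge
        set_power(True)
        if not booted_ok(boot_timeout_s):
            failures.append(cycle)
    return failures

# Stand-ins for real hardware: pretend the board fails on cycles 3 and 7.
state = {"cycle": 0}

def fake_set_power(on):
    if on:
        state["cycle"] += 1

def fake_booted_ok(timeout_s):
    return state["cycle"] not in (3, 7)

failed = power_cycle_test(fake_set_power, fake_booted_ok, cycles=10,
                          sleep=lambda seconds: None)  # skip real delays in the demo
print("failed cycles:", failed)
```

Recording which cycles failed, rather than just a count, makes it easier to line the failures up later with data logger or scope captures.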
If you're trying to reproduce a problem, another caution here is to not get sidetracked. Very often you're debugging something, and you come across another bug, and this other bug might be a lot more interesting or more easily fixable. To avoid getting distracted, my advice is to get The Bug Genie open, or whatever bug tracking system you're using, put the new bug in and carry on with the existing bug.
We now have a system that is reproducing the bug, hopefully more often than once every five days, so the next step is to attach test equipment. If the bug only happens once every five days, you may see it and immediately try to fix it. In that case, you might miss gathering some of the evidence of exactly what happened when it went wrong.
So, I’d recommend that you look at it dispassionately, gather the data and record it, so that you can replay the steps consistently.
Talking through bugs with other team members is key, and they don't need to be experts in your field to help. It really does help to have somebody who just asks you some questions. Even if they are just basic questions about the process, they can sometimes lead you to a light bulb moment.
Try the easy stuff first
In my experience, a lot of bugs turn out to be due to something simple, in retrospect. Often when you're working on a bug it can be as simple as an electrostatic discharge or some bizarre EMC or radio issue. Human error is possible too. If it's a schematic design, perhaps a chip has been pinned in the wrong way or the soldering was bad.
In a complex embedded system, we can have the electronics of a unit under test with firmware running on it which might then plug into a display system and one that runs cloud software, for instance. So, if it is not obvious where the bug is, it's worthwhile breaking the problem down into its constituent parts.
At ByteSnap, we designed a motherboard with a Digi SOM and an FPGA, which connected to the customer’s hardware, a shaker system, by a synchronous serial interface. However, the huge shaker system was in Cambridgeshire and the size of a filing cabinet. We couldn’t easily debug this system because it was located elsewhere.
We also didn't know if there was a software issue with the booting of the board or if it was a hardware issue. We quickly eliminated the bug being anything to do with the SOM, the boot loader or the motherboard. We started to identify that the problem was likely the interface between the motherboard and the customer's hardware.
As we couldn't actually use the customer's hardware during debugging due to logistics, we recreated the same variables with a simple simulation of the shaker system. The duplicate shaker system was an EEPROM (E²PROM) and resistors, which simulated this interface and enabled us to apply power and see what happened when we booted. We were able to reproduce the problem using that process.
Apply the fix
In the case of the Cambridgeshire-based shaker system, we found strange behaviour on the SPI bus at boot up and we tracked down the cause. Simply having that SPI PROM on that serial bus changed the boot mode of the processor and it wasn't booting.
As we'd taken a forensic approach to debugging and used a one step at a time process, we got a really clear idea of what the bug was, quickly. We could reproduce it and see the evidence on the scope. So the next obvious thing is to apply the fix, then try and break it again.
The key when trying to break it again is to try to break it in the way that you originally broke it. That may sound obvious, but if you had to turn it on a thousand times to recreate the breakage, turning it on once or twice isn't going to be enough to show that the problem has gone. If the system failed to boot five times out of a hundred and you turn it on once and it boots, that tells you very little: an unfixed system would have booted 95% of the time anyway, so there is a 95% chance that you are simply seeing a normal boot rather than evidence of a fix.
If you turn it on twice and it boots twice, an unfixed system would still have done that about 90% of the time, so it could easily be luck that it worked both times. After a third clean boot, the figure is about 86%, and so on. If we want to be 99% confident that we have fixed it, we want 0.95 (one minus the 5% failure rate that we had before) to the power of X to be less than 1%. That's where the logarithm comes in. The log to base 0.95 of 0.01 is 89.78. So, if I turn this system on and off 90 times and it doesn't fail once, then I'm 99% confident that I've fixed the issue.
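The same arithmetic is easy to wrap in a couple of helper functions so it can be reused for other failure rates. This is a small sketch with function names of my own choosing; it assumes each power-up is independent and that the original failure rate was measured reasonably accurately.

```python
import math

def chance_of_clean_runs(failure_rate, n):
    """Probability that an UNFIXED system passes n runs in a row by luck."""
    return (1 - failure_rate) ** n

def cycles_needed(failure_rate, confidence):
    """Smallest n for which chance_of_clean_runs(failure_rate, n) <= 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))

# The 5-in-100 boot failure discussed above, at 99% confidence.
n = cycles_needed(failure_rate=0.05, confidence=0.99)
print("clean power cycles needed:", n)
print("luck after 3 cycles:", round(chance_of_clean_runs(0.05, 3), 3))
print("luck after n cycles:", round(chance_of_clean_runs(0.05, n), 4))
```

Note how quickly the required count grows for rarer bugs: at a 1-in-1,000 failure rate, the same 99% target needs several thousand clean cycles, which is exactly why automating the power cycling pays off.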
That's for one specific example of a failure mechanism, but there is other maths relevant to debugging. For example, mean time between failures (MTBF), which is more commonly used in production, where we look at what the weakest link in the product is and how likely it is to fail in, say, 10 years. But there are cumulative distribution functions as well.
What we're talking about here is if there is a random bug that causes your system to fail at a random time, the longer you leave it on, the more confident you are that you fixed it. But it's an exponential equation so you can never be 100% sure you fixed it by just leaving it on.
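To put numbers on that, the sketch below assumes the simplest model of a random bug: failure times that follow an exponential distribution with a known mean time to failure. Both the model and the two-hour figure are assumptions for illustration, and the function names are my own.

```python
import math

def confidence_after(run_time, mean_time_to_failure):
    """Probability that an unfixed bug would already have shown itself."""
    return 1 - math.exp(-run_time / mean_time_to_failure)

def run_time_needed(mean_time_to_failure, confidence):
    """Failure-free run time needed to reach the given confidence."""
    return -mean_time_to_failure * math.log(1 - confidence)

mttf_h = 2.0  # hypothetical: the bug used to strike every two hours on average
for target in (0.90, 0.99, 0.999):
    hours = run_time_needed(mttf_h, target)
    print(f"{target:.1%} confidence needs {hours:.1f} h without a failure")
```

Each extra "nine" of confidence costs the same additional run time, and no finite run time gets the confidence to exactly 100%.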
For example, suppose you had a system whose time to failure followed a normal distribution. You can then apply the cumulative distribution function to work out the probability. If the mean was 10 minutes and the standard deviation was two minutes, 14 minutes is two standard deviations above the mean, and the cumulative distribution function gives about 97.7% at that point. So, if the system makes it to 14 minutes, you’re roughly 98% sure that the issue is fixed; for 99%, you would need it to run for a little under 15 minutes.
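Python's standard library can do this calculation directly. The sketch below uses the 10-minute mean and two-minute standard deviation from the example; the variable names are my own, and it assumes the failure times really are normally distributed. A quick check like this suggests that surviving to 14 minutes corresponds to roughly 97.7% rather than a full 99%.

```python
from statistics import NormalDist

# Time to failure of the unfixed system, in minutes (figures from the example).
failure_time = NormalDist(mu=10, sigma=2)

# Probability that an unfixed system would already have failed by 14 minutes.
p_failed_by_14 = failure_time.cdf(14)

# Run time needed before an unfixed system would have failed with 99% probability.
t_for_99 = failure_time.inv_cdf(0.99)

print(f"confidence at 14 minutes: {p_failed_by_14:.1%}")
print(f"run time for 99% confidence: {t_for_99:.2f} minutes")
```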
‘Disappearing’ bugs
Sometimes bugs appear and then magically go away. The problem is that the same bug can then recur at inopportune times. So, it's important to remember that if you see a bug, unless you've actually taken action yourself to fix it, the bug is probably still there and could come back to bite you.
Using The Bug Genie or Bugzilla, or a bug tracking system, is important as you can log all bugs. Even if you only see a bug once, don't forget about it because quite likely, you'll need to be coming back to it at some time in the future.
There's a principle called the Pentium principle of 1994, which is that an intermittent technical problem may pose a sustained and expensive public relations problem. In 1994, Intel had an issue with the floating-point division unit in their Pentium processor, which initially they tried to brush away because most people didn't need that level of precision with their maths. They soon found out that a lot of people did need that level of precision!
To remain competitive in global markets, development organisations cannot carry the burden caused by high costs and poor efficiencies in achieving product quality. Reducing the cost of quality is fundamental and the Debug Forensics approach helps them to achieve this.
We have seen many benefits from deploying Debug Forensics. In addition to improving troubleshooting efficiency and seeing reductions in waste, this method has delivered positive results in the areas of reduced time to market, improved customer satisfaction, better project profitability and enhanced competitiveness.
By using our Debug Forensics method, we take a structured and meticulous approach to troubleshoot both hardware and software designs. This has established a consistent protocol from our most senior engineers through to the graduate members of the team.
11 steps to debug forensics:
Picture success: First thing, get your mental attitude right. Try and brush aside all the naysayers, the people putting you under pressure and instead picture your success. Somebody's going to fix the bug and it will either be you or a team you are working with.
Keep notes: But do so without trying to guess what the problem is.
Reproduce the problem: This is the hardest step a lot of the time. If you can automate the bug mechanism, that can really help.
Gather the evidence: Get as much evidence as you can about exactly what went on when the bug happened.
Try the easy stuff first: Many bugs turn out to be due to really simple things in retrospect, so always try out the easy solutions first.
Break the problem down: If it's a complicated system, try and break it into separate sections and just work on each individual section one at a time.
Talk through the problem: If you're getting really stuck, talk it through with someone. Bear in mind that person doesn't need to be an expert. Just getting them to ask you questions can unleash your inner expert to help tackle the issue.
Apply the fix: When you think you've fixed it, apply the fix to ensure it is the right one for that bug.
Try to break it again: Once your fix is applied, the only way to truly test it is to try and break it again in the same way as you originally broke it.
Disappearing bugs: Remember those bugs don't go away magically by themselves. If there's a bug there and you haven't fixed it or someone else hasn't, it's still there.
Celebrate your success: This impresses on you the psychology of having the right mental attitude from the start. Lock in that feeling of "I fixed the bug" because that will help you next time you come across one.
Author details: Dunstan Power, Director, ByteSnap Design