As technology becomes more complex, so too does working out what’s going on inside an mcu
Visibility is everything in debug, but it is hard to achieve in the world of embedded systems. Compared to desktop work, where a software monitor can easily show the internal state of a system, debugging an embedded target can seem more like keyhole surgery.
For years, the in circuit emulator (ICE) was the staple of mcu users. A specialised connector made it possible to watch bus activity and provide shadow regions of memory that would help control what the cpu was running. This was vital for rom based mcus as the emulator memory could be changed on the fly, in contrast to the fixed memory on the target mcu.
State machines in the emulator watched bus activity and used it to determine where the cpu was in its execution and what data it saw. When the state machine saw that a particular combination of events had occurred, it could trigger a breakpoint: stopping execution and downloading the registers and other state information from the cpu core. Alternatively, it could simply record the matching accesses, non-intrusively, as watchpoints.
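As a rough sketch of the idea, the C fragment below models the kind of two stage trigger such a state machine implemented in hardware: arm on one bus event, then break on the next matching access. The bus cycle record and the addresses are invented purely for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified bus cycle, as an ICE might capture it. */
typedef struct {
    uint32_t address;
    uint16_t data;
    bool     is_write;
} bus_cycle_t;

/* Two-level trigger: arm on a write to a control register, then
 * break on the next fetch from a target address range. */
typedef enum { IDLE, ARMED } trigger_state_t;

static trigger_state_t state = IDLE;

bool trigger_breakpoint(const bus_cycle_t *c)
{
    switch (state) {
    case IDLE:
        if (c->is_write && c->address == 0xFF02)   /* hypothetical register */
            state = ARMED;
        return false;
    case ARMED:
        if (!c->is_write && c->address >= 0x8000 && c->address < 0x8100) {
            state = IDLE;
            return true;    /* halt the cpu and capture its state */
        }
        return false;
    }
    return false;
}
```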
As mcus became more complex and began to incorporate additional peripherals – such as on chip memory and caches to increase speed and reduce system cost – visibility suffered. ICE makers had to buy custom 'bondout' versions of the target mcu that made internal bus, address and data lines – normally hidden from view – available to the emulator. But, as these connections were not brought out through proper I/O pads, they did not have the same level of electrostatic discharge protection. Bondout versions were far more electrically fragile – and expensive to replace when a misplaced connection left the device permanently latched up.
Although programmers had the option of falling back on the standby of printf debugging – instrumenting the code to spit out hints about its behaviour on a serial port – this approach was generally prone to timing problems. Code that worked fine in debug mode would suddenly break when compiled for production. The delays caused by the printf statements or other signals might have masked fatal deadlocks and race conditions.
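The sketch below is a contrived C example of the effect. The variable names and the interrupt handler's behaviour are assumptions for illustration, but the pattern – a printf whose delay hides an ordering bug between a flag and the data it guards – is typical.

```c
#include <stdio.h>
#include <stdbool.h>

volatile bool data_ready;    /* raised by an interrupt handler */
volatile int  sensor_value;  /* written by the same handler, after the flag
                              * (the ordering bug assumed for this example) */

int read_sensor(void)
{
    while (!data_ready)
        ;                    /* spin until the handler signals */
#ifdef DEBUG
    /* The milliseconds spent inside printf give the handler time to
     * finish writing sensor_value. Compile without DEBUG and this
     * read races with the handler: the 'working' debug build was
     * simply masking the bug. */
    printf("data ready\n");
#endif
    data_ready = false;
    return sensor_value;
}
```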
Embedded programmers got a lucky break in the late 1980s after the IEEE Joint Test Access Group (Jtag) put together a plan to reduce the cost of test. Chip designers realised that increased integration was making it tougher for OEMs to test assembled boards at the end of production using conventional 'bed of nails' testers – so many functions were hidden away inside individual pieces of silicon, and the chips themselves were so tightly spaced, that board level probes could not reach them individually to see if they worked.
The answer lay in providing access to test logic inside each chip through a serial bus that wove its way around the board and inside each device. Using this bus, a tester could disable the core logic cells inside each chip and allow signals to pass through, making it possible to test continuity between the devices on the board without probing the pcb traces directly. Designers realised the ability to send commands to logic blocks inside a chip could be extended beyond board test: the same port could drive test functions inside the device itself, and could issue commands to logic blocks and extract information from them as they ran in a functioning system. The first target for this extension of Jtag was the mcu's central processing unit (cpu), which was beginning to suffer a visibility problem similar to that experienced by board level test engineers.
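To make the serial bus concrete, the C sketch below bit-bangs the four Jtag signals to walk the port's standard state machine and shift bits through a selected register. The GPIO helper functions are hypothetical; the TMS sequences follow the standard state diagram, and the example assumes the IDCODE register is selected, as it is after a TAP reset.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical GPIO helpers for the four Jtag signals. */
void set_tms(bool v);
void set_tdi(bool v);
bool get_tdo(void);
void pulse_tck(void);

/* Advance the TAP controller one state per TMS bit, LSB first. */
static void tms_seq(uint8_t bits, int n)
{
    for (int i = 0; i < n; i++) {
        set_tms(bits & 1);
        pulse_tck();
        bits >>= 1;
    }
}

/* Shift n bits through the selected register, LSB first,
 * capturing TDO as we go (timing simplified for the sketch). */
static uint32_t shift_bits(uint32_t out, int n)
{
    uint32_t in = 0;
    for (int i = 0; i < n; i++) {
        set_tdi(out & 1);
        set_tms(i == n - 1);    /* leave the shift state on the last bit */
        if (get_tdo())
            in |= 1u << i;
        pulse_tck();
        out >>= 1;
    }
    return in;
}

/* From Run-Test/Idle, read the 32bit IDCODE register. */
uint32_t read_idcode(void)
{
    tms_seq(0x01, 3);           /* Select-DR-Scan, Capture-DR, Shift-DR */
    uint32_t id = shift_bits(0, 32);
    tms_seq(0x01, 2);           /* Update-DR, back to Run-Test/Idle */
    return id;
}
```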
Intel quickly adopted the Jtag port for hardware assisted debug, as well as for board level test – implementing it on the 80486 – which pushed other manufacturers to use it. Motorola – now Freescale – developed its Background Debug Mode (BDM) interface, later embracing Jtag as the access port for the BDM logic on its products. However, Jtag implementations were far from equal. Although many companies put the debug functions on a different internal scan chain to the test functions, some popular devices – such as the IBM PowerPC 600 series – did not, which often made accessing the debug support quite tedious.
Since its early days, the x86 had included an instruction intended for use by software debuggers. INT 3 is a single byte instruction – with an opcode of 0xCC – that forces the processor to take a software interrupt. A debugger would typically use it to patch a location in memory, forcing a breakpoint when the processor reached that location. The use of a single byte made it possible to patch any instruction in the x86 set, some of which are only one byte long.
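On a system such as Linux/x86, the patching can be seen directly through the real ptrace interface. The sketch below shows the classic sequence: read the word at the target address, save the original byte and splice in 0xCC.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Plant an INT 3 breakpoint at addr in a traced process, keeping
 * the byte it replaces so it can be restored later. */
long set_breakpoint(pid_t pid, uintptr_t addr, uint8_t *saved)
{
    errno = 0;
    long word = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
    if (word == -1 && errno)
        return -1;
    *saved = (uint8_t)word;                 /* original opcode byte */
    long patched = (word & ~0xFFL) | 0xCC;  /* splice in INT 3 */
    return ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)patched);
}
```

To run past the breakpoint, the debugger restores the saved byte, rewinds the instruction pointer by one and single steps before re-arming the trap.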
The 80386 greatly extended hardware support for debugging with the inclusion of six debug registers that could be used to set breakpoints without patching memory directly. The registers could point to any location in memory – allowing breakpoints on data as well as instruction accesses. Setting the trap flag in the processor's flags register made it possible to single step through code: the processor would halt after every executed instruction. The inclusion of the Jtag port on the 80486 made it possible for external hardware to program these registers without interfering with the running program.
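Linux still exposes these registers for a traced process through ptrace's user area. The sketch below arms hardware breakpoint 0 on an instruction address; the DR7 encoding (bit 0 enables slot 0, and condition and length fields left at zero mean 'break on execution') follows the x86 manuals.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* Arm hardware breakpoint 0 on an instruction address: no memory
 * patching is needed, so this works even on code in rom or flash. */
int set_hw_breakpoint(pid_t pid, uintptr_t addr)
{
    /* DR0 holds the address to match. */
    if (ptrace(PTRACE_POKEUSER, pid,
               offsetof(struct user, u_debugreg[0]), (void *)addr) == -1)
        return -1;
    /* DR7: bit 0 (L0) enables the breakpoint locally; the condition
     * and length fields for slot 0 stay at 00, meaning break on
     * instruction execution. */
    return (int)ptrace(PTRACE_POKEUSER, pid,
                       offsetof(struct user, u_debugreg[7]), (void *)1);
}
```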
Motorola's BDM brought greater visibility to lower cost mcus and processors, such as its popular 68332, without adding dedicated breakpoint registers. Instead, BDM provided an on chip controller for the cpu that could change or fetch register and memory contents through its serial connection without halting the target, using a mechanism similar to direct memory access (DMA). As a separate piece of hardware, the BDM controller could interrogate the system after a software crash or force the core into single step mode. Later parts added breakpoint registers to avoid the need to patch memory with trap instructions.
Unfortunately, a comparatively slow serial connection is only good for relaying start-stop commands and extracting small chunks of data at a time. One big advantage of the ICE was its ability to provide a real time trace of program execution. Without trace, programmers again had to lace their code with printf statements to work out which branches the processor took during execution. Some real time operating systems, such as VxWorks, added their own level of instrumentation to improve debug visibility – storing a record of recent system calls in a chunk of memory set aside for the purpose – but this could also alter timing related behaviour. And it was only suitable for targets that could justify the inclusion of an rtos.
Evolutionary process
The next step in the evolution of debug was on chip trace: an attempt to provide more of the features of an emulator by letting the debugger follow the progress of the code running inside the cpu. The problem was one of pins. Despite being a core part of any embedded development project, debug is far from being a priority when it comes to allocating pins. On small package mcus, pins are relatively expensive. But full trace is bandwidth hungry as, potentially, every address visited by the program needs to be output on a dedicated trace bus.
However, even in tightly looping code, only a fraction of the instructions are responsible for a change in flow; otherwise, instruction flow is highly predictable. In a typical C program, a branch is encountered roughly every ten instructions. Companies such as Freescale implemented branch trace modes that output small packets of data only on a change in flow. This greatly reduced the amount of data that needed to be sent off chip, allowing the use of a narrower trace bus. In the rare cases where the flow of control does not change for 200 instructions or so, the debug controller outputs a synchronisation message so the debugger can stay locked to the program's position.
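The debugger reconstructs the full instruction stream offline by combining those packets with the program image: between branches, execution is linear, so only the targets need to be transmitted. The C sketch below assumes a hypothetical packet format and image analysis helpers.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical change-of-flow packet: emitted only when the cpu
 * branches, plus periodic synchronisation messages. */
typedef struct {
    bool     sync;      /* full-address synchronisation message */
    uint32_t target;    /* branch target (or sync address) */
} trace_packet_t;

/* Helpers assumed to come from the debugger's image loader. */
uint32_t next_branch_after(uint32_t pc);    /* static code analysis */
void     report_executed(uint32_t from, uint32_t to);

/* Replay: between packets execution is linear, so every address from
 * the last known pc up to the next branch can be inferred from the
 * program image alone. */
void reconstruct(const trace_packet_t *pkt, int n)
{
    uint32_t pc = 0;
    bool have_pc = false;
    for (int i = 0; i < n; i++) {
        if (pkt[i].sync) {
            pc = pkt[i].target;     /* resynchronise on a full address */
            have_pc = true;
            continue;
        }
        if (!have_pc)
            continue;               /* wait for the first sync message */
        uint32_t branch = next_branch_after(pc);
        report_executed(pc, branch);    /* linear run up to the branch */
        pc = pkt[i].target;             /* flow resumes at the target */
    }
}
```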
To allow the recording of more detailed trace data – covering the memory locations instructions access, not just instruction execution – vendors have developed two main mechanisms. One is to multiplex the trace bus pins with those of regular peripherals so that, when the mcu is in debug mode, the pins carry high bandwidth trace data. Clearly, this strategy only works if the pins are not needed for regular I/O work.
Compression reduces the bandwidth demand further. Because the flow of a program is relatively predictable, the number of address bits passed through the trace port can be compressed, with full addresses reserved for monitored data locations.
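A minimal sketch of the idea, assuming a byte oriented trace port: send only the low order address bytes that differ from the previously transmitted address, since consecutive trace addresses usually share their upper bits. The receiver keeps the same running value and reverses the process (how the byte count is framed is left out here).

```c
#include <stdint.h>

/* Emit only the low order bytes of addr that differ from the
 * previously sent address; returns how many bytes were written. */
int compress_address(uint32_t prev, uint32_t addr, uint8_t out[4])
{
    int n = 0;
    uint32_t diff = prev ^ addr;
    do {
        out[n++] = addr & 0xFF;     /* send this byte */
        addr >>= 8;
        diff >>= 8;
    } while (diff);                 /* stop once the rest matches */
    return n;                       /* 1..4 bytes instead of always 4 */
}
```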
If access to pins is highly restricted, an alternative is to dedicate an area of on chip memory to a circular buffer that records the most recent cpu activity – its contents can be read while the target is halted, either through the Jtag port or by switching the relevant multiplexed pins into trace mode.
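A sketch of such a buffer, with the trace hardware modelled in C for illustration: a power of two depth makes the wraparound a simple mask, and after a halt the entries from the oldest index onwards give the most recent execution history.

```c
#include <stdint.h>

#define TRACE_DEPTH 256u            /* power of two for cheap wraparound */

/* Circular buffer: the newest entries overwrite the oldest, so what
 * remains after a halt is always the most recent cpu activity. */
static uint32_t trace_buf[TRACE_DEPTH];
static uint32_t trace_head;

void trace_record(uint32_t addr)
{
    trace_buf[trace_head++ & (TRACE_DEPTH - 1)] = addr;
}

/* After halting the target, the debugger reads the buffer out,
 * starting from the oldest entry. */
uint32_t trace_oldest_index(void)
{
    return trace_head & (TRACE_DEPTH - 1);
}
```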
Because it is often not practical or desirable to stop all the cores on a multitasking multicore device, trace support is becoming more important. And it is no longer confined to processors: hardware accelerators are beginning to sport trace functions to make it easier to visualise the interactions between them and the on chip processors. A trace buffer that can record events in the correct order is vital to understanding whether the application is suffering from race conditions. A further enhancement turning up in these multicore systems is a new take on printf debugging. More recent ARM cores, for example, contain instrumentation buffers that record events under software control. Because events are logged to an on chip buffer, the program does not tie up potentially precious hardware I/O resources such as serial ports. The debugger can read out the logged events through the Jtag or trace port.
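On a Cortex-M part, for instance, this style of logging goes through the Instrumentation Trace Macrocell (ITM) using the standard CMSIS register definitions; the sketch below mirrors the approach of CMSIS's own ITM_SendChar, with the device header an assumption about the target.

```c
#include "stm32f4xx.h"  /* hypothetical target: any CMSIS device header works */

/* Log a byte through ITM stimulus port 0: the event is buffered on
 * chip and drained by the debugger over the trace or Jtag/SWD port,
 * so no uart or other I/O resource is consumed. */
void log_byte(uint8_t ch)
{
    if ((ITM->TCR & ITM_TCR_ITMENA_Msk) &&  /* ITM enabled... */
        (ITM->TER & 1UL)) {                 /* ...and port 0 enabled */
        while (ITM->PORT[0].u32 == 0UL)
            ;                               /* wait for FIFO space */
        ITM->PORT[0].u8 = ch;
    }
}
```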
In many cases, it is not practical to halt the processor completely on every breakpoint. In a multitasking system where interrupts need to be handled in real time – because, for example, they may be used to control a motor – the breakpoint is used to halt a particular thread, but not the processor itself. ARM calls this 'monitor mode debugging'. Instead of stopping the processor, control for the thread of execution passes to a debug monitor that runs as a privileged task alongside normal threads. This monitor takes care of accesses to debug registers, trace buffers and stored register contents – which are flushed to memory while other tasks run. When the thread needs to be restarted from where it stopped, the debug monitor returns its state to ready to run and restores the correct stack state so the rtos can schedule the thread for execution.
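On Cortex-M devices this corresponds to the DebugMonitor exception. The sketch below uses the standard CMSIS names to enable it; what the handler does with the stopped thread's registers is left open, and the device header is again an assumption about the target.

```c
#include "stm32f4xx.h"  /* hypothetical target: any CMSIS device header works */

/* Route breakpoints to the DebugMonitor exception instead of halting
 * the core, so higher priority interrupts – a motor control loop,
 * say – keep running while a thread is debugged. */
void enable_monitor_mode_debug(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_MON_EN_Msk;
}

/* Standard CMSIS vector name: this runs as the debug monitor. The
 * stacked frame (r0-r3, r12, lr, pc, xpsr) of the stopped thread sits
 * on its stack, where the monitor can inspect or save it for the host
 * before returning so the rtos can reschedule the thread. */
void DebugMon_Handler(void)
{
}
```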
Despite the success of Jtag as a low level standard for debug, standardisation at a higher level has proved elusive. The core Jtag pins are generally available: the questions focus on the additional pins that might be present for relaying data to the emulator or host computer, and on the protocols used to control the target.
The longest lived attempt to unify debug is the Nexus 5001 consortium, formed in 1998 in response to demands from car makers for silicon suppliers to converge on one debug interface they could use for all the processors they bought. Although Freescale promoted Nexus enthusiastically and other silicon manufacturers selling into the automotive industry embraced it, support for Nexus elsewhere is far from widespread.
In practice, de facto standards such as ARM's CoreSight, Motorola/Freescale's BDM and OnCE port or the Intel Extended Debug Port (XDP) – the successor to the 80486's debug port – are the winners so far.
Things began to change in the mid 2000s, when it became clear that multicore architectures were the future for the embedded world. In the desktop world, de facto standards could be expected to survive as the focus has, so far, been on homogeneous multicore processors. However, in the embedded space, heterogeneous architectures are becoming common, making it difficult to use single vendor debug architectures, even one as pervasive as ARM's.
In principle, debug systems for different processors can coexist on the same Jtag connection. The protocol is designed to support multiple test targets within a device, cycling through the scan chain until the correct one for a given client is found. However, this is not an efficient use of resources and a lack of communication between cores makes certain functions that can be useful during multicore debugging – such as synchronised halt and restart or cross triggering between threads running on different cores – hard, if not impossible, to achieve.
During the second half of the last decade, a number of bodies decided to create their own multicore debugging standards. The OCP-IP group, which defined the on chip interconnect architecture used by companies such as Texas Instruments, teamed up with Nexus to define a set of multicore debug standards. Other efforts included the EU funded Sprint Consortium, the Taiwan SoC Consortium and a proposal for mobile phone processors put together by the MIPI Alliance. While OCP-IP is currently the most active, there is still a lack of commonality in multicore debug protocols.
Making debug tools work better
For version 3.0 of its interconnect specification, OCP-IP is planning to make debug tools work better with systems where the power state of individual cores can change rapidly over time. For example, without specific support, the chip's power manager may attempt to power down a core while it is stopped, preventing the debugger from accessing its registers. Another focus is cache coherency – making the debugger aware of state changes in the cache that might affect the timing of software loops. For example, a cache update may block the main memory bus, affecting a number of processor cores in a cluster.
Outside the standards efforts, a wider range of data is likely to be captured by trace ports. Xilinx has made it possible to send temperature and other environmental data over the Jtag port. TI has indicated that future versions of its low power microcontrollers will record more data about energy consumption in real time for use in debuggers, making it easier to see the effect of code changes on overall power consumption.
Support for all these different aspects of debug is still patchy across the industry, but the trend is gradually towards greater visibility which, in turn, should cut the time it takes to verify that an embedded system works as planned.