Seeking perfection: The growing role of verification in SoC design
Verification has consumed the lion's share of the time and cost in chip design for at least the past decade. The reason is clear: once the design has 'taped out' – even though no-one has actually sent a layout out on tape for years – it's gone. Only when you get the silicon back do you find out whether you have a working device or the source material for some attractive pendants. Verification is the one means you have of determining how successful the project will be before it's too late.
The FPGA environment has a different set of demands, but the trends are pushing towards greater levels of verification before committing a design to silicon. FPGAs are easy to fix, unlike custom silicon: even with antifuse parts, a showstopping bug only costs you the devices you have already programmed. The catch is that programming time is now dwarfed by the time it takes to compile a design so that it can be loaded into a PROM or into the configuration memory of a flash-based device.
Even though simulation on a workstation is much slower than running the same logic on hardware, the rise in compile time on larger devices has made simulation steadily more attractive, except for those instances where only at-speed verification will do. The opposite trend has taken hold in custom silicon design, where the sheer size of projects – running into millions of gates – means simulation can be prohibitively slow. In these cases, engineers have turned to emulation or to debugging on virtual prototypes implemented on FPGA-based breadboards.
The overall trend in verification is to make the most of the simulation cycles you can run. That has meant augmenting simulation-based verification with static or formal verification techniques and bringing some of the tools developed for formal use into the simulation domain.
The most common form of formal verification, equivalence checking, appeared directly as a result of simulation taking too long. Until the end of the 1990s, gate-level simulation was the primary method for determining whether what was laid out on the chip matched the register-transfer level (RTL) hardware description used to generate the logic. Although logic synthesis has driven a massive increase in design productivity since it supplanted schematic-based design, no-one has ever fully trusted the output of the synthesis and logic-optimisation algorithms – and they are still wise not to.
Comparing a gate-level simulation with the RTL simulation results provided the main means of determining whether something was wrong with the design as it would be implemented on-chip. But this 'golden', or sign-off, simulation was even more time-consuming than RTL simulation. Having completed a design, no-one wanted to wait days for the gate-level simulation to finish before they could even start analysing the testbench outputs.
This opened the door to equivalence checking, in which a synthesised circuit is analysed to ensure it matches the original hardware description. But this is not quite as simple as it looks; the logic could have gone through a number of optimisation steps, even pipelining changes, with some of the combinatorial logic moving from one side of a register to another. In some cases, the combinatorial logic implemented could be very different to that inferred by the simulator when it read a line of Verilog or VHDL.
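To make the problem concrete, here is a minimal, hypothetical sketch in SystemVerilog; the module and signal names are invented for illustration. The two descriptions implement exactly the same multiplexer, but the second looks nothing like the first. An equivalence checker has to prove mathematically that they always produce the same output, without running a single simulation vector.

```systemverilog
// Hypothetical illustration: two structurally different descriptions of the
// same function, which an equivalence checker must prove identical.
module rtl_intent (
  input  logic a, b, sel,
  output logic y
);
  // Straightforward RTL: a multiplexer written the way the designer thinks of it.
  assign y = sel ? a : b;
endmodule

module netlist_form (
  input  logic a, b, sel,
  output logic y
);
  // What synthesis and optimisation might emit: the same function rebuilt
  // from AND/OR/NOT terms. Logically identical, structurally unrecognisable.
  assign y = (a & sel) | (b & ~sel);
endmodule
```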
Equivalence checking has migrated beyond ASIC and SoC design into FPGA design because of the use of very large devices and the time it takes to compare simulation with hardware, given the limited internal visibility that a programmed FPGA offers.
The technique is steadily moving up the abstraction levels, now being used in some cases to check a system-level description against the RTL synthesised by a tool. As system-level synthesis tends to be an iterative process of architecture selection and optimisation, it is difficult to analyse whether the generated logic is working correctly or not. Equivalence checking makes it possible to treat the system-level description as the correct model for part of a design and to then ensure any downstream logic matches it: all the way to gate-level equivalence checking.
This usage at the system level is steadily making its way across the SoC industry: Intel, for example, used a form of equivalence checking to verify the RTL implementation of a cache-coherency protocol against its definition, which itself was tested for correctness using formal techniques.
Demonstrating correctness is, however, a far knottier problem than equivalence.
The ultimate principle behind verification is simple enough: work out what could go wrong and use that to test the design. Unfortunately, a testbench that provides a high degree of confidence that all possible failures have been ruled out is tough to build by hand. Simply defining the correct behaviour of a system is not necessarily straightforward for an SoC that has many different modes of operation. And then you have the problem of knowing when you are done.
For a surprisingly long time, testbenches were produced and assessed manually. It's only in the past decade that the idea of verification coverage has taken hold among design teams. This uses tools to work out how much of the system is probed by a set of tests and how much remains untested.
A law of diminishing returns tends to govern testbenches: the first 60 to 80% of coverage is generally straightforward to obtain. The remainder takes progressively longer, and it's extremely difficult and time-consuming to get close to 100%; only avionics and other safety-critical systems will be pushed to that level. It's especially difficult to approach complete coverage using simulation alone because of the number of cycles it can take to put a system into the right state to exercise a particular condition, which makes obscure bugs hard to tease out. The answer is, generally, to combine simulation with formal verification.
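To see why simulation alone struggles, consider a minimal, hypothetical example: a design in which an alarm only fires after a counter has run for roughly four billion cycles. Random or directed simulation is unlikely ever to reach that state, whereas a formal tool can reason about the counter symbolically rather than cycle by cycle.

```systemverilog
// Hypothetical deep-state example: the alarm condition sits billions of
// cycles away from reset, far beyond what simulation can reach in practice.
module deep_state (
  input  logic clk,
  input  logic rst_n,
  output logic alarm
);
  logic [31:0] count;

  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) count <= '0;
    else        count <= count + 32'd1;

  // Reaching this state needs around 4.3 billion clock cycles of simulation.
  assign alarm = (count == 32'hFFFF_FFF0);
endmodule
```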
Formal verification is potentially very fast because it does not have to evaluate every possible state to demonstrate that a given piece of logic meets a set of properties under all conditions. But it has two problems. One is capacity. The big advantage of simulation is that, as long as you have enough memory and CPU cycles, there is no fundamental limit on the size of design you can feed to a tool. Formal verification is very sensitive, not just to block size, but also to the type of logic a block contains.
For example, multipliers are tough to verify exhaustively because of the sheer number of possible input combinations. Early formal verification tools also ran into difficulties with them because the solvers, then largely based on the same binary decision diagrams (BDDs) used in early equivalence checking tools, quickly ran out of memory. It took a change in the way multipliers could be represented to make the problem tractable.
Other types of logic and datapath presented problems for BDDs, leading to a proliferation of formal verification methods that slowed adoption. Users had to become experts, not just in using the tools, but also in the theory of formal verification, in order to pick the right tool for each job. The EDA companies responded by building better heuristics into their tools that attempt to recognise types of logic and then pick the best solver for the job. Capacity remains an issue, so formal techniques are still largely focused on block-level or protocol verification and have not replaced simulation by any means, particularly at the level of an integrated SoC. However, techniques developed to help bring formal into the mainstream are now widely used in simulation-based verification.
The second problem formal verification faces is how to describe what a given piece of logic should be doing. Originally, formal verification tools needed mathematical descriptions to provide something to prove, and these were difficult to put together even for those with experience of the underlying theory. The response from the computer and EDA industries was to develop languages that could describe the desired behaviour in terms familiar to a hardware engineer, but which a tool could convert internally into a mathematical description. One of the results was IBM's Sugar, which became the Property Specification Language (PSL) when standardised by Accellera. Languages such as PSL became the basis for assertion-based verification.
Designers insert assertions into their code to tell users how a block should be used and to test for violations of those conditions. For example, a designer could state in an assertion that an acknowledgement signal should follow a request within no more than ten clock cycles. When that assertion is attached to the simulation as a monitor, the tool will flag any situation where it does not hold. And, instead of testing all possible conditions, the testbench can generate a randomised subset of them. The idea soon caught on because the unusual values that pop out locate bugs that entirely hand-crafted testbenches, using more 'reasonable' values, could not uncover.
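The request/acknowledge rule above maps directly onto a temporal property. The sketch below expresses it in SystemVerilog Assertions rather than PSL, purely as an illustration; the module and the req and ack signal names are hypothetical. The same property can be handed to a simulator as a run-time monitor or to a formal tool as something to prove.

```systemverilog
// Hypothetical req/ack handshake monitor; signal names are illustrative only.
module handshake_monitor (
  input logic clk,
  input logic rst_n,
  input logic req,
  input logic ack
);
  // Property: once req rises, ack must follow within one to ten clock cycles.
  property p_ack_within_10;
    @(posedge clk) disable iff (!rst_n)
      $rose(req) |-> ##[1:10] ack;
  endproperty

  // As a simulation monitor, any violation is flagged the moment it occurs;
  // a formal tool can instead try to prove the property holds for all inputs.
  a_ack_within_10: assert property (p_ack_within_10)
    else $error("ack did not follow req within ten cycles");

  // The same property doubles as a coverage target, confirming the testbench
  // actually exercised the handshake rather than merely never breaking it.
  c_ack_within_10: cover property (p_ack_within_10);
endmodule
```

A constrained-random testbench would then drive req with randomised gaps and payloads, relying on the monitor to catch any case in which ack arrives late.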
The standardisation effort has now shifted to coverage. Today, different tools use their own ways of collecting and displaying metrics. Accellera set up the Unified Coverage Interoperability Standard (UCIS) project several years ago to find a common way to collect and store the data so that it can be used by different vendors' tools, and to define common models of coverage.
Coverage can be represented in many different ways, from how many lines of HDL code are covered by all the tests to how much of the possible state space has been explored. Although the two are linked, a testbench can result in very different measures for these metrics. Determining which metrics provide the best results and pulling them together into a unified, sharable database is now the challenge facing the verification-tools industry.
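The difference between the two views is easiest to see in a small, hypothetical functional-coverage model written in SystemVerilog. Line coverage would only report which statements executed; the covergroup below instead records which occupancy levels and read/write combinations of an assumed 16-entry FIFO were actually reached during the run.

```systemverilog
// Hypothetical coverage model for a 16-entry FIFO; signal names are illustrative.
module fifo_coverage (
  input logic       clk,
  input logic       wr_en,
  input logic       rd_en,
  input logic [4:0] fill_level   // assumed occupancy count, 0 to 16
);
  covergroup cg_fifo @(posedge clk);
    // Functional coverage: were empty, partially full and full states all reached?
    cp_fill: coverpoint fill_level {
      bins empty   = {0};
      bins partial = {[1:15]};
      bins full    = {16};
    }
    cp_wr: coverpoint wr_en;
    cp_rd: coverpoint rd_en;
    // Cross coverage exposes corner cases such as a write attempted while full.
    cx_ops: cross cp_fill, cp_wr, cp_rd;
  endgroup

  cg_fifo cov = new();
endmodule
```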
When that work is complete, we will at least be closer to understanding when the verification task can be considered complete.