Fault tolerant designs allow defective chips to be used instead of scrapped
Circuit designers are rapidly coming to terms with the idea that it is impossible to build error free chips. The solution is fault tolerant architectures, combined with mechanisms for detecting and repairing errors, or at least mitigating their impact.
New Electronics talked to three experts in the field to find out how they are addressing the issues. At the recent DATE 2011 conference in Grenoble, fault tolerance was a recurring theme. One popular tutorial was 'Architectures for online error detection and recovery in multicore processors'. Organiser and moderator Dimitris Gizopoulos, from the University of Athens, said the move to nanometre geometries and the complexity it brings is threatening the reliability of future devices.
He identified three causes for this vulnerability. "First, there are environmental disturbances that produce transient or soft errors," he said. For some years, device manufacturers have been aware that, with the shift to deep submicron geometries, naturally occurring alpha particle radiation can impact memory and logic circuitry. The most common soft error is a bit flip in a memory device, whereby a 0 becomes a 1 or vice versa. System crashes, data corruption and system resets are all real risks.
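To make the failure mode concrete, here is a minimal C sketch, not taken from any of the work described here, in which a single even parity bit detects (but cannot correct) a soft error bit flip in a stored word. The values are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Even parity over a 32bit word: returns 1 if the number of set bits is odd. */
static uint32_t parity32(uint32_t w)
{
    w ^= w >> 16; w ^= w >> 8; w ^= w >> 4; w ^= w >> 2; w ^= w >> 1;
    return w & 1u;
}

int main(void)
{
    uint32_t stored = 0x5A5A5A5Au;
    uint32_t check  = parity32(stored);   /* parity bit written alongside the data */

    stored ^= 1u << 13;                   /* alpha particle strike: one bit flips */

    if (parity32(stored) != check)
        puts("soft error detected: parity mismatch on read");
    return 0;
}
```

Real memories use stronger error correcting codes that can also repair the flipped bit, but the detection principle is the same.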
The second factor is latent manufacturing defects. "Variability in advanced processes causes heterogeneous operation of identical components on the same chip," he explained. "Multicore microprocessors and memories are manufactured using inherently unreliable technologies." Process variation leads to changes in circuit parameters which, if undetected, can mean, at best, unpredictable behaviour and, at worst, component failure. Inevitably, it leads to reduced yield, increased cost and delays.
Allied to this, ageing phenomena, more pronounced as geometries shrink, are increasingly producing permanent hard errors that cause digital circuits to fail over time. "Then there are verification inefficiencies that allow significant design bugs to escape into the system," Gizopoulos continued. "Due to the extreme complexity of multicore processors and the pressure towards reduced time to market, even after comprehensive pre-silicon verification and post-silicon validation, major design errors or bugs may be missed."
While complex multicore devices are regarded as part of the reliability problem, they are also inspiring some innovative potential solutions. Spare resources are inherent to multicore designs and these are being exploited successfully, together with the application of evolving error detection, recovery and repair schemes, to make faulty devices usable, improve reliability and provide predictable operation.
Salvage operations

Intel's Arijit Biswas has been researching core salvaging techniques, designed to reduce the scrapheap of devices with defective cores. "Multicore devices can be dominated by regular memory structures, mainly caches," he explained. "Fortunately, caches can be protected from manufacturing defects using well tried and tested techniques." CPU cores are the vulnerable area now, he said. "In the past, we had only one solution: throw out all defective parts. But recent work has shown that, properly managed, defects can be tolerated."
One obvious solution is to disable defective cores, Biswas explained. "Core disabling reduces the sales price as a four core part becomes a two core part, for example." Core sparing – a standby core held in reserve – is another option; like any redundancy, this consumes precious die area while providing no performance or economic benefit on a non defective die. "A more desirable option is core salvaging, which allows defective cores to continue operation."
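As a rough illustration of the economics of core disabling, the following C sketch fuses off defective cores and bins the die by surviving core count. The defect map and the power of two binning rule are our assumptions for the example, not Intel's practice.

```c
#include <stdio.h>

#define NUM_CORES 4

/* Per-core defect flags recorded at production test (illustrative values). */
static const int defective[NUM_CORES] = { 0, 1, 0, 0 };

int main(void)
{
    int enabled = 0;
    for (int c = 0; c < NUM_CORES; c++)
        if (!defective[c])
            enabled++;            /* fuse off defective cores, count the rest */

    /* A four core die with a bad core is sold as a smaller part: bin down
       to the nearest power of two so product configurations stay uniform. */
    int bin = 1;
    while (bin * 2 <= enabled)
        bin *= 2;

    printf("die ships as a %d core part (%d cores usable)\n", bin, enabled);
    return 0;
}
```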
Biswas said most modern cores contain large amounts of redundant logic, which could be used to compensate for the defective logic. "The key is to be able to detect the defect to a finer granularity than just a core," he said. Microarchitectural core salvaging techniques disable defective execution pipelines or schedule operations on alternate or spare resources, steering work away from the defective area.
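The following C sketch illustrates the idea under simple assumptions of ours: a core with two identical ALU ports and a fault map populated at test time, where the issue logic steers micro ops to a surviving port.

```c
#include <stdio.h>

/* Illustrative only: two identical ALU ports; the fault map marks ports
   found defective at test. Issue logic steers micro ops to a good port. */
#define NUM_PORTS 2
static const int port_faulty[NUM_PORTS] = { 1, 0 };   /* port 0 failed test */

static int pick_alu_port(void)
{
    for (int p = 0; p < NUM_PORTS; p++)
        if (!port_faulty[p])
            return p;    /* schedule on the surviving redundant resource */
    return -1;           /* no usable port: the whole core must be disabled */
}

int main(void)
{
    int p = pick_alu_port();
    if (p >= 0)
        printf("issuing ALU micro ops on port %d; throughput drops, "
               "but the core still executes the full ISA\n", p);
    return 0;
}
```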
Because it exploits redundancy within the core, microarchitectural salvaging relies on the core's ability to execute the entire instruction set architecture (ISA) correctly, even in the presence of certain defects. Biswas' research team found, however, that there are few opportunities to use this approach, because large portions of many redundant structures comprise non redundant logic, such as decoders, buffers and interconnect.
Further, the technique imposes a significant overhead. Architectural core salvaging, by contrast, considers resources outside the core and the fact that a single core need not be ISA compatible, providing the cpu as a whole is. "Crucially, if a defect means that a core cannot execute certain instructions, the core can still be used if we can detect and move the unexecutable instructions to a different core."
Certain defects, such as the inability to execute memory operations, will render a core useless, but many others need not. "Most ISAs contain numerous instructions which are used infrequently, yet occupy significant area – SIMD instructions are a good example," he added. In such cases, it is possible to trap the offending instruction and migrate the thread to a fully functional core.
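A minimal C sketch of that trap and migrate control flow follows. The capability masks, feature bits and handler below are hypothetical, intended only to show the mechanism: an undefined instruction trap looks up a core whose tested capabilities cover the faulting feature and moves the thread there.

```c
#include <stdio.h>

#define NUM_CORES 4
#define FEAT_BASE 0x1u     /* integer/memory ops: every usable core has these */
#define FEAT_SIMD 0x2u     /* large, rarely used SIMD unit (Biswas' example)  */

/* Per-core capability masks built from production test results (illustrative). */
static const unsigned caps[NUM_CORES] = {
    FEAT_BASE,              /* core 0: SIMD unit defective, core salvaged */
    FEAT_BASE | FEAT_SIMD,
    FEAT_BASE | FEAT_SIMD,
    FEAT_BASE,
};

/* Undefined instruction trap handler: find a core that supports the needed
   feature and migrate the faulting thread there instead of killing it. */
static int migrate_on_trap(int cur, unsigned needed)
{
    for (int c = 0; c < NUM_CORES; c++)
        if (c != cur && (caps[c] & needed) == needed)
            return c;
    return -1;   /* no capable core: the instruction really is unexecutable */
}

int main(void)
{
    int target = migrate_on_trap(0, FEAT_SIMD);
    if (target >= 0)
        printf("SIMD op trapped on core 0, thread migrated to core %d\n", target);
    return 0;
}
```

The cost of a migration is high, which is why the approach suits instructions that are rarely executed.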
The technique can be exploited more fully, says Biswas, with better thread scheduling and thread swapping algorithms. "Architectural core salvaging becomes compelling as the number of cores on a device increases." Beyond five cores, Biswas maintains, the performance penalty from architectural core salvaging will be less than 5%. "And the technique is orthogonal to core sparing," he concluded.
Growing old gracefully

Graceful chip degradation is the aim of the CRISP (Cutting edge Reconfigurable ICs for Stream Processing) consortium. This EU funded project, led by Dutch dsp IP specialist Recore Systems, demonstrated a self testing, self repairing nine core chip at DATE. Gerard Rauwerda, Recore's CTO, said the key to the CRISP technique is the use of dynamically reconfigurable cores and run time resource management to exploit the natural redundancy in multicore designs.
"A key innovation is the Dependability Manager, a test generation unit which accesses the built in self test scan chain to effectively perform production testing at run time. This determines which cores are working correctly," he explained. To do this, the consortium has created an IP 'wrapper' around Recore's reconfigurable dsp core.
The addition of multiplexers allows the software to switch from functional mode to diagnosis mode to detect faults. "There are some timing issues to consider," Rauwerda explained, "as the circuitry is running at, say, 200MHz online, instead of 25MHz offline." Once the device has been analysed, the run time resource manager reroutes tasks to error free parts of the chip, effectively repairing it for continuous operation.
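The following C sketch models that flow under assumptions of ours: the wrapper multiplexers and BIST scan chain are reduced to a signature check, and the resource manager simply skips cores that fail it. None of this code is from the CRISP project.

```c
#include <stdio.h>

#define NUM_CORES 9      /* the DATE demonstrator used a nine core matrix */

/* Stand-in for the wrapper hardware: in silicon, multiplexers switch the
   core from functional to diagnosis mode and the Dependability Manager
   drives the BIST scan chain; here a stored signature models the result. */
static const unsigned golden_signature = 0xC0FFEEu;

static unsigned run_bist(int core)
{
    return (core == 4) ? 0xBADu : golden_signature;  /* pretend core 4 failed */
}

int main(void)
{
    int healthy[NUM_CORES];

    /* Diagnosis pass: test each core at run time and record pass/fail. */
    for (int c = 0; c < NUM_CORES; c++)
        healthy[c] = (run_bist(c) == golden_signature);

    /* Run time resource manager: route tasks only to error free cores
       (at least one core is assumed healthy, or mapping would fail). */
    for (int task = 0; task < 6; task++) {
        int c = task % NUM_CORES;
        while (!healthy[c])
            c = (c + 1) % NUM_CORES;    /* reroute around the faulty core */
        printf("task %d -> core %d\n", task, c);
    }
    return 0;
}
```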
For the demonstration, the CRISP software ran on an ARM9 device, accessing a matrix of dsp cores, but Rauwerda said the technique could be applied to a variety of cores in a truly heterogeneous SoC, even in 3d stacked chip devices. Currently, CRISP's approach determines which faulty cores are unusable and, where only the logic part of a core is affected, whether the core's memory might still be usable.
"In the future, the aim is to diagnose to a deeper level, to see if we can use more parts of a faulty core," Rauwerda concluded. "A fault tolerant interconnect is going to be very important. We will need to insert test structures into the network on chip interconnect IP for better diagnosis."