Back to the future - digital design
It appears that, to know where we're going, we have to know where we've been.
Self-modifying software, complex instructions and parallel processing: it's as if the 1980s never went away. But these ideas are being reworked for new generations of processors that focus on energy per operation, rather than performance at all costs.
"Energy efficiency is critically important at both ends of the performance spectrum," says Professor Bill Dally, chief scientist at graphics processor maker nVidia. "At the high end, we are getting to the point where we are limited by what we can power and cool. At the other end, we are limited by battery life."
It is not just dissatisfaction with the energy performance of conventional processors that is driving this, but a demand for more flexible hardware that is easier to design.
"Lots of markets are drying up because the cost of innovation is so high," says Professor Mark Horowitz of Stanford University. "This happened before in the 1980s until people invented a new form of chip design. We need to rethink the way we approach design."
Prof Horowitz sees the future as being based on arrays of processors taking over complex functions from dedicated hardware. But you cannot take a bunch of standard processors and expect them to perform like hardware: in his comparisons, a multiprocessor built from standard cores was 500 times less energy efficient than dedicated hardware.
Prof Horowitz says: "Eight or sixteen bit operations should take a fraction of a picojoule in a 90nm process, but the lowest that a processor will operate at is of the order of 80pJ. There is a tremendous opportunity in the gap between the overhead of an instruction and its actual operation."
According to David Balfour, who researched low power processor design at Stanford with Prof Dally before also moving to nVidia, a processor spends most of its energy supplying instructions and data to the execution units, rather than performing computations. It is the main reason why programmable systems are nowhere near as power efficient as fixed function hardware. For example, a 32bit addition in a 45nm low power cmos technology consumes about 0.52pJ; a 16bit multiplication consumes 2.2pJ. Executing an addition instruction on a risc processor, however, takes around 5.3pJ, even if the instruction that triggers the operation is sitting in a cache.
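A back-of-the-envelope calculation using the figures Balfour quotes makes the point: the useful arithmetic is only a small slice of the instruction's energy. The sketch below is purely illustrative and simply restates the published numbers; it is not a new measurement.

    /* Back-of-the-envelope sums using the figures quoted above for a 45nm
     * low power cmos process: the useful arithmetic is a small slice of
     * the energy of the risc instruction that triggers it. */
    #include <stdio.h>

    int main(void)
    {
        const double add32_pj    = 0.52; /* 32bit addition                     */
        const double mul16_pj    = 2.2;  /* 16bit multiplication               */
        const double risc_add_pj = 5.3;  /* add instruction, hitting the cache */

        printf("useful work in an add instruction: %.0f%%\n",
               100.0 * add32_pj / risc_add_pj);
        printf("overhead (fetch, decode, register access): %.0f%%\n",
               100.0 * (risc_add_pj - add32_pj) / risc_add_pj);
        printf("a 16bit multiply alone costs %.1fpJ\n", mul16_pj);
        return 0;
    }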
Work by Prof Dally's Stanford team on the ESC low power processor project found that instruction supply was responsible for 42% of total processor power consumption. The response was to reduce the need for an instruction cache by using a small register file to hold frequently used instructions instead. The compiler explicitly loads those registers; in a sense, it writes a program that modifies itself, much like some minicomputer architectures of the 1970s, when instruction space was at a premium.
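A minimal sketch of the trade-off follows, assuming a hypothetical eight-entry instruction register file and illustrative energy constants; it is not the Stanford design, just the shape of the saving when a loop kernel is fetched from registers rather than from the cache.

    /* Illustrative sketch of a compiler-managed instruction register file
     * (IRF): the inner loop is preloaded into a few cheap registers once
     * and then fetched from there rather than from the instruction cache.
     * The energy constants are assumptions for illustration only. */
    #include <stdio.h>

    #define IRF_SIZE    8      /* hypothetical number of instruction registers */
    #define E_CACHE_PJ  5.0    /* assumed cost of one instruction cache fetch  */
    #define E_IRF_PJ    0.5    /* assumed cost of one register file fetch      */

    int main(void)
    {
        unsigned kernel_insns = IRF_SIZE;  /* loop body fits in the IRF */
        unsigned iterations   = 1000;

        /* Conventional core: every iteration fetches its instructions
         * from the cache. */
        double e_cache = (double)kernel_insns * iterations * E_CACHE_PJ;

        /* IRF core: one explicit preload by the compiler, then cheap
         * fetches from the register file on every iteration. */
        double e_irf = kernel_insns * E_CACHE_PJ
                     + (double)kernel_insns * iterations * E_IRF_PJ;

        printf("cache-fed loop: %.0fpJ\n", e_cache);
        printf("IRF-fed loop:   %.0fpJ (%.1f times less)\n", e_irf, e_cache / e_irf);
        return 0;
    }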
Putting more into each instruction is another option. One instruction fetch can trigger a multitude of operations in single instruction, multiple data (SIMD) architectures. Prof Horowitz says: "We built SIMD engines and tailored them for the application. We got a 20x performance improvement, but we did start 500 times worse than hardware. It's clear that generic data parallel operations are important and useful, but they are not enough."
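For readers unfamiliar with the style, the fragment below shows generic SIMD on an x86 host using SSE intrinsics. It is not the tailored Stanford engine, but it illustrates the principle: a single _mm_add_ps performs four additions from one instruction fetch.

    /* Generic SIMD example using x86 SSE intrinsics: a single _mm_add_ps
     * instruction performs four single precision additions, so one
     * instruction fetch covers four operations. */
    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void)
    {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load four floats               */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* four adds from one instruction */
        _mm_storeu_ps(c, vc);

        for (int i = 0; i < 4; i++)
            printf("%g ", c[i]);
        printf("\n");
        return 0;
    }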
Focusing on elements that carry high software overhead but can be implemented as hardware accelerators attached to the processors got Prof Horowitz and his colleagues to within three times the power of dedicated hardware, with more of the design generated automatically.
Professor Alberto Sangiovanni-Vincentelli, of the University of California at Berkeley, says of Horowitz's plan: "It's not a new idea, but the way he articulates it is attracting interest from a lot of people."
Two different approaches were taken by researchers at IMEC. Harmke de Groot, project director of ultralow power dsp and wireless, and colleagues are working to build body monitoring systems that can run for years from a lithium cell or by harvesting body heat or movement. The problem is that transmitting data from a heart monitor for storage elsewhere chews up power.
De Groot says: "The radio represents 80% of the power consumption. We could do two things: we could make a better radio, or we could do more local processing so we send less data over the air."
In the first iteration of the design, the hunch about the radio was correct. "But the processing power needed from a commercial microprocessor was so much that the gain from compressing the data for radio transmission was lost. We decided we needed a specialised biological signal processor that could do this," de Groot says.
Using concepts similar to those employed at Stanford, the IMEC group added hardware support for wavelet transforms – which have proven to be very useful in compressing biological data. For a second generation processor called BioFlux, IMEC has used a more general purpose core from NXP as the basis for trying out a number of low level power reduction techniques, such as controlling the forward and reverse bias of transistors and taking the supply voltage close to or even below the switching threshold.
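To see why wavelets suit the job, the sketch below applies one level of the Haar transform, the simplest member of the wavelet family and not necessarily the one IMEC implemented, to a slowly varying sample. The detail coefficients come out small, which is what makes such signals compressible.

    /* One level of the Haar wavelet transform: pairwise averages
     * approximate the signal, pairwise differences (details) are small
     * when the signal varies slowly, as biological signals tend to. */
    #include <stdio.h>

    #define N 8

    void haar_step(const double *in, double *avg, double *diff, int n)
    {
        for (int i = 0; i < n / 2; i++) {
            avg[i]  = (in[2 * i] + in[2 * i + 1]) / 2.0;
            diff[i] = (in[2 * i] - in[2 * i + 1]) / 2.0;
        }
    }

    int main(void)
    {
        double x[N] = {4.0, 4.1, 4.3, 4.2, 4.0, 3.9, 3.7, 3.8};
        double avg[N / 2], diff[N / 2];

        haar_step(x, avg, diff, N);

        printf("approximation: ");
        for (int i = 0; i < N / 2; i++) printf("%.2f ", avg[i]);
        printf("\ndetails:       ");
        for (int i = 0; i < N / 2; i++) printf("%.2f ", diff[i]);
        printf("\n");
        return 0;
    }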
A group at the University of Michigan at Ann Arbor led by Professor David Blaauw pushed the operating voltage of its Phoenix processor down to 0.5V on a 0.18µm process. But such low voltages expose the leakiness of large memory arrays, so the team looked for ways to reduce not only the leakage but also the amount of memory a processor needs, coming full circle to the concerns of the 1970s and 1980s. They reduced the bit length of instructions, cutting the energy needed to transfer them across a bus, by using more complex addressing modes than a standard risc. And, to cut the memory footprint of data ram, the researchers used Huffman compression, decoding the data on the fly as it is read into the processor's registers.
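The flavour of that on-the-fly decoding is easy to show in software. The toy decoder below uses a made-up four-symbol Huffman table and bitstream, not the Phoenix encoding, expanding each symbol as it is consumed, much as the Michigan design expands compressed data as it moves towards the registers.

    /* Toy on-the-fly Huffman decoder. The code table and bitstream are
     * illustrative, not the Phoenix encoding.
     * Codes: 'A'=0, 'B'=10, 'C'=110, 'D'=111. */
    #include <stdio.h>
    #include <stdint.h>

    static int get_bit(const uint8_t *buf, unsigned pos)
    {
        return (buf[pos / 8] >> (7 - (pos % 8))) & 1;   /* MSB first */
    }

    int main(void)
    {
        /* Encodes "ABACAD": 0 10 0 110 0 111 -> 0100 1100 111.....  */
        const uint8_t stream[] = {0x4C, 0xE0};
        const unsigned nbits = 11;

        unsigned pos = 0;
        while (pos < nbits) {
            char sym;
            if (get_bit(stream, pos) == 0)          { sym = 'A'; pos += 1; }
            else if (get_bit(stream, pos + 1) == 0) { sym = 'B'; pos += 2; }
            else if (get_bit(stream, pos + 2) == 0) { sym = 'C'; pos += 3; }
            else                                    { sym = 'D'; pos += 3; }
            putchar(sym);   /* in hardware, this value lands in a register */
        }
        putchar('\n');
        return 0;
    }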
Ironically, srams are more energy efficient when run at higher speeds, according to Prof Blaauw's group. Running an sram at a higher voltage makes it feasible for clusters of slow processors to share a cache without blocking each other, because the cache can be clocked faster than the cores. The drawback is that the cache may 'thrash' when processors conflict over cache lines, which could increase energy usage as data has to be repeatedly shifted in and out of main memory.
3d integration may help further by cutting the distance between processors and memory. Intel has proposed stacking processors on top of memories to cut the time taken to access data, but Professor Yves Leduc, Texas Instruments chair at Polytech Sophia Antipolis, says: "The big guy doing big processors is not so enthusiastic about using 3d integration. The problem is heat." But Prof Leduc added: "If you are in the low power business, 3d is for you."
Meanwhile, high speed supercomputer projects are re-examining other assumptions that underlie software architecture, in the hope that altering some of them will deliver greater power reductions in the future. Much as the shift to risc philosophies in the 1990s centred on compiler design, the next generation of processors will need a similar change in approach.