In the mid-1970s, Burroughs introduced the first of its B1000 mainframes, built around the idea of a 'writeable control store' that could be reprogrammed on the fly to emulate different machines.
Because hardware was so expensive, the earliest mainframes implemented many of their software-visible instructions as short 'microcode' routines: at execution time, each visible instruction was carried out as a short sequence of low-level machine operations. Later versions used a combination of early field programmable gate arrays – mainly for the instruction decoder – and SRAM.
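The mechanism can be pictured as an interpreter in which each software-visible opcode selects a short routine of micro-operations held in the control store. The C sketch below is purely illustrative: the instruction encoding, register count and 'micro_add' routine are invented for the example, not taken from any Burroughs or IBM design.

```c
#include <stdint.h>

/* Illustrative sketch only, not any particular machine's control store:
 * each software-visible opcode selects a short routine of low-level
 * micro-operations. */
typedef struct { uint32_t reg[16]; uint32_t pc; } cpu_t;

static void micro_add(cpu_t *c, uint8_t rd, uint8_t rs)
{
    uint32_t a = c->reg[rd];   /* micro-op: read first operand  */
    uint32_t b = c->reg[rs];   /* micro-op: read second operand */
    c->reg[rd] = a + b;        /* micro-op: ALU add, write back */
}

void step(cpu_t *c, const uint16_t *mem)
{
    uint16_t insn = mem[c->pc++];            /* fetch a visible instruction */
    uint8_t opcode = insn >> 12;
    uint8_t rd = (insn >> 4) & 0xF, rs = insn & 0xF;

    switch (opcode) {                        /* decode selects the routine  */
    case 0x1: micro_add(c, rd, rs); break;
    default:  break;                         /* other opcodes omitted       */
    }
}
```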
The idea behind the B1000 series was not to create a chameleon machine able to emulate the behaviour of other machines, such as the IBM System/370 – that idea came later. Instead, Burroughs' aim was to close the gap between high-level languages and machine code. The microcode was designed to give compilers access to instructions that directly supported language features, such as allocating memory to objects.
Although the B1000 had limited success and was discontinued in the mid-1980s, IBM found success with a similar approach, albeit designed for a single instruction set, when it launched the AS/400 minicomputer. Every instruction generated by the OS/400 compiler is, in effect, a piece of microcode: software never 'sees' the actual instruction set of the core processor.
The design let IBM swap out the custom 48-bit CISC processor and replace it with 64-bit PowerPC-architecture processors when the time came to expand the machine's memory address space, with no change to the actual programs. Because programmers never dealt with the memory address map directly – everything on the AS/400 is handled through object references – even the change in addressing range was transparent to applications.
The rise of RISC architectures, driven by the realisation that compiler writers rarely took advantage of specialised high-level instructions, removed much of the motivation for developing other computers that could remap instructions. Compilers would simply target these stripped-down instruction sets directly.
In the 1990s, a different motivation for translating instruction sets took hold. Intel's dominance of the PC market and an increasing stranglehold on servers led competing manufacturers to look for ways to run Intel-compatible code on their own processors.
In an attempt to make its Alpha architecture more attractive, Digital Equipment developed software that could translate Intel binaries into native Alpha machine code. Rather than performing a one-off translation, the software could profile and optimise code as it ran. In this way, the translator could take advantage of information available at runtime to which a traditional compiler or static binary translator had no access.
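As a simple illustration of the kind of runtime information involved – a hypothetical record, not Digital's actual translator – a dynamic translator can count how often each conditional branch in the guest binary goes each way and, when it retranslates a hot block, lay the code out so that the common path falls through. A static compiler can only guess at those frequencies.

```c
#include <stdint.h>

/* Hypothetical profile record a dynamic translator might keep for one
 * conditional branch in the guest binary. */
struct branch_profile {
    uint64_t guest_pc;    /* address of the branch in the original code */
    uint64_t taken;       /* times the branch was taken at runtime      */
    uint64_t not_taken;   /* times it fell through                      */
};

/* When retranslating, emit the likely successor inline so the common
 * case falls through; the 4:1 threshold is an arbitrary choice. */
int lay_out_taken_path_inline(const struct branch_profile *p)
{
    return p->taken > 4 * p->not_taken;
}
```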
Transmeta then tried to build a processor around the idea of dynamic binary translation, aiming squarely at Intel's core market. But neither Digital nor Transmeta made much headway against native x86 processors. A key problem with dynamic translation was its startup cost: before it had anything to execute, the software translator first had to work through the code – much of which might only be run once – which slowed execution significantly.
With the rise of ARM, there is far less need for chipmakers to emulate an instruction set. There is a price for licensing the cores, but it is likely to be insignificant compared with the cost of developing a new architecture and translation infrastructure simply to run ARM instructions. Yet nVidia has decided to design a processor that performs dynamic translation from ARM binaries. The reason? Because, as Digital and Transmeta engineers determined, a processor has a better view of the runtime behaviour of code than a static compiler ever can.
Launched in late 2014 – and used in the HTC Nexus 9 tablet – the nVidia Denver processor (see figs 1 and 2) can dispatch seven instructions per clock cycle, as long as it can find sufficient parallelism. Hardly any processors aimed at the mobile market attempt this level of complexity because, with the exception of DSP algorithms, statically generated code rarely provides parallelism of more than two instructions per clock. Wide superscalar machines reorder the execution of code dynamically without changing the binary, but they have to perform the same analysis on the same code over and over again, which consumes a lot of power and rules them out of mobile designs.
The Denver is designed to avoid the power consumption of conventional wide superscalar processors by using software to profile code that is executed repeatedly and to recompile it based on the runtime information the processor collects. Unlike the Alpha or Transmeta machines, Denver can run ARM code natively, so it does not suffer the same startup cost, and it only translates segments of code when it has little other work to do.
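In outline, and using invented names rather than nVidia's actual firmware interfaces, the flow looks something like the sketch below: ARM code runs directly on the hardware decoder while the translator counts executions per region, and a region is only handed to the optimiser once it crosses a hotness threshold and the core is otherwise idle.

```c
#include <stdbool.h>
#include <stdint.h>

#define HOT_THRESHOLD 1000          /* assumed value, for illustration */

typedef void (*native_block)(void); /* a recompiled, wide-issue version */

/* Hypothetical per-region bookkeeping for a Denver-style translator. */
struct region {
    uint64_t     guest_pc;    /* start of the ARM code region           */
    uint64_t     exec_count;  /* how many times it has run natively     */
    native_block optimised;   /* NULL until a recompiled version exists */
};

extern void         run_arm_natively(uint64_t guest_pc); /* hardware ARM decoder */
extern native_block recompile(uint64_t guest_pc);        /* software optimiser   */
extern bool         core_is_idle(void);

void execute(struct region *r)
{
    if (r->optimised) {               /* use the recompiled form if present */
        r->optimised();
        return;
    }
    run_arm_natively(r->guest_pc);    /* no startup cost: ARM runs directly */
    r->exec_count++;

    /* Spend translation effort only on demonstrably hot code, and only
     * when there is little other work to do. */
    if (r->exec_count > HOT_THRESHOLD && core_is_idle())
        r->optimised = recompile(r->guest_pc);
}
```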
Mainframes are turning once again to dynamic translation. IBM designed its Power8 processor to better support the profiling needed by translation tools and to allow the fine-tuning of Java applications, which are designed to be dynamically compiled or interpreted.
The Power8 processor builds a table of 'hot' addresses – those of instructions the machine encounters many times during execution. Above a threshold, the processor fires an interrupt that kicks off an optimiser, which analyses the code to see whether it can produce a faster version. If the hot address is a branch instruction, the translator may attempt to unroll the loop, both to avoid the need for branch prediction and to overlap execution of multiple iterations of the same loop.
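A generic example (not Power8-specific output) shows why unrolling helps: the original loop ends every iteration with a conditional branch that has to be predicted, while the unrolled version executes one branch per four elements and exposes four independent operations that can be scheduled to overlap.

```c
/* Original hot loop: one predicted branch per element. */
void scale(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

/* Unrolled by four: one branch per four elements, and the four
 * independent multiplies can be issued in parallel. Handling of a
 * remainder when n is not a multiple of four is omitted for brevity. */
void scale_unrolled(float *dst, const float *src, int n)
{
    for (int i = 0; i + 3 < n; i += 4) {
        dst[i]     = src[i]     * 2.0f;
        dst[i + 1] = src[i + 1] * 2.0f;
        dst[i + 2] = src[i + 2] * 2.0f;
        dst[i + 3] = src[i + 3] * 2.0f;
    }
}
```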
The performance advantages of chameleon architectures remain debatable. Although the Nexus 9 performs well in some Android benchmarks, it falls behind conventional processors on others. A stronger argument for altering programs dynamically to suit the underlying hardware may come from power consumption. And that could bring the ideas promoted by Burroughs full circle, moving hardware closer to the software's source code.
Five years ago, ARM CTO Mike Muller warned of the problem of 'dark silicon', in which large areas of an SoC have to be turned off to stop the whole chip from overheating. Researchers have proposed SoC designs that move software workloads from processor to processor as each nears its thermal limit and so has to be turned off and allowed to cool. One option is to have specialised processors – called conservation cores or c-cores (see fig 1) by Michael Taylor at the University of California at San Diego – that run only selected parts of an application.
For an experimental SoC called GreenDroid, software analysed the source code of a variety of programs written for Android and, from them, synthesised custom processors that could accelerate the high-level algorithms the system decided would be used heavily. A compiler then generated code that allowed sections to be swapped in and out more easily, so the programs could access the c-cores as needed. Taylor's group has claimed energy efficiency improvements of up to ten times using the technique.
Although it would probably be difficult for binary translation to take full advantage of c-cores, compilers could generate intermediate code that is easier for a runtime system to interpret and convert into a form the c-cores can use, with the remainder converted to ordinary binary code. The technique would also allow routines to be remapped to general-purpose processors if a c-core is locked out because it has started to overheat.
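One way to picture that remapping, using an invented runtime interface rather than the GreenDroid toolchain itself: each accelerated routine keeps an ordinary software version, and the runtime falls back to it whenever the matching c-core is unavailable, for instance because it has been shut off to cool.

```c
#include <stdbool.h>

/* Hypothetical runtime interface to a conservation core. */
extern bool ccore_available(int ccore_id);        /* false if thermally locked out */
extern void ccore_run(int ccore_id, void *args);  /* offload the hot region        */

struct hot_region {
    int    ccore_id;                  /* specialised core synthesised for this code */
    void (*software_version)(void *); /* same routine for a general-purpose core    */
};

void run_region(const struct hot_region *r, void *args)
{
    if (ccore_available(r->ccore_id))
        ccore_run(r->ccore_id, args);  /* energy-efficient path            */
    else
        r->software_version(args);     /* fall back while the c-core cools */
}
```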
Security is a potential roadblock to dynamic binary translation. Attackers could target the translation mechanism itself to insert their own malicious code, so researchers are now trying to identify ways to spot these attacks and lock them out. But as the power problem facing silicon increases, hardware and the binary code that runs on it will have to become much more flexible.