Since the mid-2000s, computer architecture has fractured. Up to that point, processor designers and their users could rely on clock speed and process improvements to deliver more compute cycles for the same money and energy. Then Dennard scaling, which lay behind the ability to deliver many of those improvements, suddenly hit a wall. Peak clock speed stopped dead in its tracks at around 3GHz and the evolution of superscalar processors slowed dramatically, because the dynamic instruction schedulers they needed chewed through too much power.
Homogeneous multicore architectures at least made it possible to exploit the fact that, while Dennard scaling was no more, Moore’s Law was still providing twice the transistors every two or three years. But even that has slammed into a wall because trying to keep multiple high-speed processors going, complete with caches and memory interfaces, needs more cooling than most system designs can afford.
If you do not want to keep all the transistors powered at once and are happy to keep large chunks of the die dark, those additional transistors can go into accelerator units tuned for specific classes of application. The cost comes in software. Compilers need to become much more complex because of the sheer range of architectures with which they now have to deal.
In the glory days of the RISC processor, the individual architectures were conceptually similar. You might get differences in pipeline depth and fine-grained instruction scheduling. But, fundamentally, code generation for pretty much all processor architectures designed since the end of the 1980s could employ the same core optimisations.
Accelerators greatly expand the architectural space and call for entirely different compilation strategies. Organising data for optimum performance on a processor armed with a cache can be quite different to doing the same for a single-instruction, multiple-data (SIMD) accelerator designed to work on vectors.
The cache-based memory accesses of the processor favour short offsets that can hit elements in the same or nearby cache lines that have already been filled. The compiler will unroll tight inner for() loops to support this, but it has to watch out for dependencies between iterations if the results need to be summed.
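As a rough illustration, consider a reduction over a contiguous array: an unrolled version walks consecutive elements sitting in the same cache lines, but the compiler has to break the dependency on a single running sum, for example by keeping several partial accumulators. This is only a hand-written sketch of the kind of code a compiler might generate; the function names are illustrative.

```c
#include <stddef.h>

/* Straightforward reduction: every iteration depends on the previous sum. */
float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four: consecutive elements share cache lines, and four
 * independent accumulators break the serial dependency chain. */
float sum_unrolled(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* remainder iterations */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```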
For a vector unit, it may make more sense to operate on more widely distributed data with a much bigger stride between elements and possibly with the luxury of working with fewer dependencies.
In effect, it will run multiple copies of the same inner loop, each with a different group of data, in parallel. In order to do so, the compiler has to unpick the source code and determine how nested for() loops are related to each other.
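A crude way to picture this is a matrix-vector product in which several rows are processed together: each "lane" runs the same inner-loop body on data a whole row apart, and the lane loop is what a SIMD unit would collapse into a single instruction. The sketch below is illustrative only; the lane count and function name are assumptions, and a real compiler would also handle dimensions that do not divide evenly.

```c
#include <stddef.h>

#define N     1024
#define LANES 4   /* assumed number of rows processed together */

/* Matrix-vector product, y = A*x, with A stored row-major.
 * The vector-friendly variant runs LANES copies of the inner loop at once,
 * each working on a different row, so neighbouring lanes touch data a whole
 * row (N elements) apart rather than adjacent cache-line entries. */
void matvec_lanes(const float A[N][N], const float x[N], float y[N])
{
    for (size_t i = 0; i < N; i += LANES) {
        float acc[LANES] = {0};
        for (size_t j = 0; j < N; j++) {
            for (size_t lane = 0; lane < LANES; lane++)   /* conceptually one SIMD op */
                acc[lane] += A[i + lane][j] * x[j];
        }
        for (size_t lane = 0; lane < LANES; lane++)
            y[i + lane] = acc[lane];
    }
}
```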
Polyhedral compilers
Derived from ideas first proposed in the late 1960s, polyhedral compilers use geometric techniques to map the sequence of operations in nested loops and their dependencies. Those techniques include carving up matrices of operations into tiles that can then be distributed in parallel across accelerators that match the tile dimensions.
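A very small example of the kind of transformation involved is loop tiling: rewriting a matrix multiply so that it works on TILE-by-TILE blocks that can be farmed out to accelerator units sized to match. The sketch below assumes the dimensions divide evenly by the tile size; a polyhedral compiler would derive the tiling itself and handle the awkward edge cases.

```c
#include <stddef.h>

#define N    512
#define TILE 32   /* assumed tile edge matching an accelerator's preferred size */

/* Tiled matrix multiply, C += A*B (C must be zeroed by the caller).
 * The three outer loops walk tiles of the iteration space; the three inner
 * loops work entirely within one tile. Each (ii, jj) tile of C is
 * independent and could be dispatched to a separate accelerator unit. */
void matmul_tiled(const float A[N][N], const float B[N][N], float C[N][N])
{
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t kk = 0; kk < N; kk += TILE)
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t j = jj; j < jj + TILE; j++)
                        for (size_t k = kk; k < kk + TILE; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```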
Whereas the handling of multidimensional matrices, or tensors, used to be confined to high-performance computing, applications such as machine learning are shifting these kinds of workloads onto embedded systems.
An additional wrinkle for embedded systems that have to deal with machine learning and similar tensor-processing problems is that they need to handle not just highly regular convolutional operations, which can take advantage of the pipelining and parallelisation that suit dense matrices, but also irregular sparse matrices.
For example, the pruning of deep-learning networks equates to setting many of the cells to zero in the tensor used to calculate the overall effect of the neuronal weights on an input.
An optimisation employed in some machine-learning processors is to use data compression on the weights to reduce the impact of the zero weights: they are discarded at decode time, with only non-zero weights fed to an execution unit. But a more general-purpose technique is to adapt polyhedral compilation to map sparse matrices into denser forms. To do so, the code generator may recruit address-generation logic to perform scatter-gather operations and feed data into multiple execution units.
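One common dense-to-sparse mapping that captures the flavour of this is compressed sparse row (CSR) storage: only the non-zero weights are kept, along with index arrays, and a gather step pulls in the matching input elements so the execution units only ever see useful work. The layout and names below are a standard textbook form, not the scheme used by any particular machine-learning processor.

```c
#include <stddef.h>

/* Compressed sparse row (CSR) matrix: only non-zero weights are stored.
 * row_ptr[i]..row_ptr[i+1] indexes the non-zeros of row i;
 * col_idx[] says which input element each weight multiplies. */
struct csr {
    size_t rows;
    const size_t *row_ptr;   /* length rows + 1 */
    const size_t *col_idx;   /* length nnz */
    const float  *val;       /* length nnz: the non-zero weights */
};

/* Sparse matrix-vector product y = W*x: the col_idx lookup is the gather
 * that address-generation logic would perform in hardware. */
void spmv_csr(const struct csr *W, const float *x, float *y)
{
    for (size_t i = 0; i < W->rows; i++) {
        float acc = 0.0f;
        for (size_t p = W->row_ptr[i]; p < W->row_ptr[i + 1]; p++)
            acc += W->val[p] * x[W->col_idx[p]];   /* gathered input element */
        y[i] = acc;
    }
}
```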
A big issue for the compiler is inferring what is happening in these increasingly complex codes.
Polyhedral compilers do a reasonably good job of determining the relationships between the nested loops that characterise mathematically intensive code. But one approach that may help improve results is to move to higher levels of abstraction in the source code. A high-level abstraction can explicitly show what the operation is meant to achieve and relieve the compiler of trying to work it out from the shape of the code.
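As a contrived illustration, the two fragments below compute the same thing, but the first leaves the compiler to recognise a reduction from the shape of the loop, while the second names the operation outright through a hypothetical high-level call, so the intent survives into code generation. The dot() routine here is an assumption, not a real library interface; its body is only a fallback so the sketch compiles.

```c
#include <stddef.h>

/* Intent buried in loop shape: the compiler must recognise the reduction and
 * prove it safe before it can use a dot-product or MAC unit. */
float dot_loops(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Hypothetical high-level operation: in a real toolchain this would be an
 * intrinsic or DSL construct lowered directly to whatever hardware exists. */
static float dot(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

float dot_explicit(const float *x, const float *y, size_t n)
{
    return dot(x, y, n);   /* intent stated outright, no inference needed */
}
```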
Domain-specific languages
Even more intent can be captured using a domain-specific language (DSL). Operations that are commonly used in specific sectors can be baked into the code analysis and optimisation functions. DSLs need not demand a complete move away from C or C++, nor force developers to learn entirely new syntaxes.
One example from Professor Jerónimo Castrillón’s group at TU Dresden is CFDLang, an extension to Fortran for the tensor operations used in fluid dynamics. He claims the extensions demanded comparatively little work on the underlying compiler, yet it was able to outperform Google’s TensorFlow because the compiler could take advantage of domain knowledge baked into the extended language syntax.
Although the focus of these compiler techniques is on the development flow, the high-level information captured by DSLs may wind up being taken through to the point that code is deployed. One of the arguments for edge computing is that it brings cloud-like agility to the way in which processes are assigned to hardware.
Industrial machine tools and robots may offload intensive AI and sensor-processing routines to a nearby edge server rather than relying entirely on their own hardware, and not always to the same server. Having the code move around can boost overall utilisation by scheduling work according to priority and resources. But the client cannot rely on the same configuration of accelerators always being available. Ideally, applications would be able to switch between accelerators to choose from a wider range of hardware at runtime.
Another proposal from the TU Dresden group is its TETRiS system, named after the block-based game.
Applications are divided into chunks that can take advantage of different resources on a target platform. The scheduler fits these chunked applications around each other based on their use of resources to minimise conflict. In experiments the technique reduced variance in execution time and energy by orders of magnitude compared to the CFS scheduler used by Linux on a Samsung Exynos-based target.
Although compilers such as LLVM, with their multiple plugin back-end code generators, can readily target multiple accelerators, the issue for a runtime linker is which one to pick at any given time on arbitrary platforms.
This is where a standardisation effort at the IEEE may have an effect. The SHIM specification, which is expected to become the IEEE 2804 standard early next year, provides platform suppliers with a consistent way of describing the capabilities of accelerators so that modelling tools can estimate their performance.
One potential issue with an estimate of peak throughput is when multiple tasks are competing for resources on the same machine. Many workloads are far from consistent in their use of processor cycles over time, and prediction needs to take this into account as it can easily affect the decision on which accelerator to choose. It may make sense, for example, to pick an accelerator that lets tasks switch in and out quickly over a faster one with a large setup overhead, because the quick switcher will suffer fewer delays caused by contention.
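That trade-off can be made concrete with a toy cost model: estimated completion time is switch-in overhead plus compute time plus the expected queuing delay behind other tasks, and the accelerator with the lowest total wins. All the figures and names below are purely illustrative, not taken from any real scheduler.

```c
#include <stddef.h>
#include <stdio.h>

/* Toy cost model: pick the accelerator with the lowest estimated time for a
 * task, where total = setup overhead + compute time + expected wait caused
 * by other tasks contending for the same device. */
struct accel {
    const char *name;
    double setup_ms;         /* cost of switching a task in */
    double compute_ms;       /* raw execution time for this task */
    double expected_wait_ms; /* predicted queuing delay from contention */
};

static double total_cost(const struct accel *a)
{
    return a->setup_ms + a->compute_ms + a->expected_wait_ms;
}

int main(void)
{
    /* A fast unit with a large setup cost versus a slower one that switches
     * tasks quickly: under heavy contention the "slower" unit can win. */
    struct accel options[] = {
        { "big-batch accelerator",    20.0,  5.0, 30.0 },
        { "quick-switch accelerator",  1.0, 12.0,  4.0 },
    };

    const struct accel *best = &options[0];
    for (size_t i = 1; i < sizeof options / sizeof options[0]; i++)
        if (total_cost(&options[i]) < total_cost(best))
            best = &options[i];

    printf("choose: %s (%.1f ms estimated)\n", best->name, total_cost(best));
    return 0;
}
```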
There are trade-offs in these systems. Experiments by Mina Niknafs and colleagues at Linköping University in Sweden showed that the overhead of running a predictor has an effect on system efficiency, as does inaccurate estimation based on partial information.
But such performance analysers seem likely to prove essential as developers wrestle with the challenge of maximising performance across a changing mixture of workloads on edge servers and embedded systems.