Embedded machine learning (ML) is attracting a far bigger feeding frenzy than earlier waves of interest, as established MCU players and AI-acceleration start-ups try to demonstrate their commitment to the idea, which mostly goes under the banner of TinyML.
Daniel Situnayake, founding TinyML engineer at software-tools company Edge Impulse and co-author of a renowned book on the technology, says the situation today is very different to that of the 1990s.
“The exciting thing about embedded ML is that machine learning and deep learning are not new, unproven technologies: they've in fact been deployed successfully on server-class computers for a relatively long time, and are at the heart of a ton of successful products. Embedded ML is about applying a proven set of technologies to a new context that will enable many new applications that were not previously possible.”
ABI Research predicts the market for low-power AI-enabled MCUs and accelerators for the TinyML market will climb from less than $30m in annual revenues this year to more than $2bn by the start of the next decade.
Despite the rapid growth, ABI analyst Lian Jye Su expects competition to become fiercer as large companies such as Bosch enter the market. Already, some start-ups such as Eta Compute have moved away from silicon to software tools.
“We do see some consolidation. At the same time, the huge fragmentation in the IoT market means a significant number of providers will survive, like the MCU or IoT chipset markets in general,” he says, pointing to the large number of suppliers who focus on specific vertical markets.
TinyML faces severe constraints. Pete Warden, technical lead of the TensorFlow Micro framework at the search-engine giant and Situnayake’s co-author on “TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers”, said at the Linley Group’s Fall Processor Conference that the aim is to take deep-learning models and “get them running on devices that have as little as 20KB of RAM. We want to take models built using this cutting-edge technology and crush them down to run on very low-power processors.
“Because it’s open-source software, we get not only to interact with product teams inside Google but also get a lot of requests from product teams all over the world who are trying to build interesting products. And we often have to say: no, that’s not possible yet. We get to see, in aggregate, a lot of unmet requirements,” says Warden.
The core issue is that deep-learning models ported from the server environment call for millions or even billions of multiply-accumulate (MAC) operations to be performed in a short space of time, even for relatively simple models. Linley Gwennap, president of the Linley Group, says relatively simple audio applications, such as picking out the words in speech that activate voice recognition, call for around 2 million MACs per second. Video needs far more.
Silicon vendors have been able to push up the MAC count by taking advantage of the relatively low accuracy requirements of individual calculations during inference. Whereas training on servers generally demands single- or double-precision floating-point arithmetic, byte-wide integer (int8) calculations seem to be sufficient for most applications.
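The TensorFlow Lite converter's post-training quantisation path shows roughly what the shift to int8 involves. The sketch below is minimal and illustrative; `model` and `sample_inputs` are hypothetical stand-ins for a trained Keras model and a handful of representative inputs.

```python
import numpy as np
import tensorflow as tf

# A minimal sketch of post-training int8 quantisation with the
# TensorFlow Lite converter. `model` and `sample_inputs` are
# hypothetical stand-ins for a trained Keras model and a small
# set of representative input examples.
def quantise_to_int8(model, sample_inputs):
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # The converter calibrates activation ranges from a handful
    # of real examples supplied by this generator.
    def representative_dataset():
        for sample in sample_inputs:
            yield [np.expand_dims(sample, 0).astype(np.float32)]

    converter.representative_dataset = representative_dataset
    # Force full-integer quantisation so weights and activations
    # are stored and computed as int8.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # flatbuffer ready for an MCU runtime
```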
There are indications that for selected layers in a model, even int8 MACs are unnecessary. Binary or ternary calculations, which can be performed using little more than a few gates each, do not hurt overall accuracy in many cases. The potential performance gains are enormous, but the combination of hardware and software support needed to exploit them fully is still lacking, says Situnayake.
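To see why such low precision is attractive, consider a toy binarised dot product: once weights and activations are constrained to +1/-1 and packed into machine words, each MAC collapses into an XNOR followed by a population count. The pure-Python sketch below is for clarity only; real accelerators do this in a handful of gates per weight.

```python
# Toy illustration of a binarised dot product: with weights and
# activations constrained to +1/-1 and packed as bits, the MAC
# reduces to XNOR plus popcount.
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer bit mask."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def binary_dot(x_bits, w_bits, n):
    # XNOR counts the positions where the sign bits agree; each
    # agreement contributes +1 to the dot product, each disagreement -1.
    agree = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * agree - n

x = pack_bits([1, -1, 1, 1])
w = pack_bits([1, 1, -1, 1])
print(binary_dot(x, w, 4))  # 0: two agreements, two disagreements
```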
Though the tooling for the TensorFlow Lite framework typically supports int8 weights, support for lower resolutions is far from widespread. “This is changing fast,” Situnayake notes, pointing to accelerators such as Syntiant’s that support binary, 2-bit and 4-bit weights, as well as work by Plumerai on training binarised neural networks directly.
“While these technologies are still on the cutting edge and have yet to make it into the mainstream for embedded ML developers, it won't be long before they are part of the standard toolkit,” he adds.
Reducing the arithmetic burden
There are other options for TinyML work that reduce the arithmetic burden. Speaking at the TinyML Asia conference late last year, Jan Jongboom, co-founder and CTO of Edge Impulse, said the key attraction of ML is its ability to find correlations in data that conventional algorithms do not pick up. The issue lies in the sheer number of parameters most conventional models have to process to find those correlations if the inputs are raw samples.
“You want to lend your machine-learning algorithm a hand to make its life easier,” Jongboom says. The most helpful technique for typical real-time signals is the use of feature extraction: transforming the data into representations that make it possible to build neural networks with orders of magnitude fewer parameters.
Taking speech as an example, a transformation into the mel-cepstrum domain massively reduces the number of parameters needed to encode the changes in sound efficiently.
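A minimal sketch of that reduction, using the off-the-shelf `librosa` library (the filename and parameters here are placeholders):

```python
import librosa  # a commonly used audio DSP library; any MFCC implementation works

# One second of 16kHz audio collapses from 16,000 raw samples to a
# compact grid of mel-frequency cepstral coefficients (MFCCs), which
# is what a small keyword-spotting network would actually consume.
y, sr = librosa.load("speech.wav", sr=16000)  # "speech.wav" is a placeholder
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=256)
print(y.shape)      # e.g. (16000,): raw samples
print(mfccs.shape)  # e.g. (13, 63): orders of magnitude fewer inputs
```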
In other sensor data, such as the feed from an accelerometer used for vibration detection in rotating machinery, other forms of joint time-frequency representations will often work.
This approach was used by John Edwards, consultant and DSP engineer at Sigma Numerix and a visiting lecturer at the University of Oxford, in a project for vibration analysis. In this case, a short-time Fourier transform offered the best trade-off, coupled with transformations that compensate for variable-speed motors. The feature extraction cut the model down to just two layers that could easily be processed on an NXP LPC55S69, which combines Arm Cortex-M33 cores with a DSP accelerator.
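A rough sketch of that kind of pipeline, assuming a hypothetical accelerometer feed and sample rate, and eliding the variable-speed compensation Edwards describes:

```python
import numpy as np
from scipy import signal

# Sketch of STFT-based feature extraction for vibration data. The
# accelerometer trace and sample rate are stand-ins; compensation for
# variable-speed motors (e.g. resampling against shaft speed) is elided.
fs = 4000                               # samples per second (assumed)
accel = np.random.randn(8 * fs)         # placeholder accelerometer feed

f, t, Zxx = signal.stft(accel, fs=fs, nperseg=256)
features = np.log1p(np.abs(Zxx))        # log-magnitude spectrogram frames
print(features.shape)                   # (129, frames): input to a tiny two-layer model
```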
Jongboom says though it may be tempting to go down the route of deep learning, other machine-learning algorithms can deliver results. “Our best anomaly detection model is not a neural network: it's basic k-means clustering.”
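A minimal sketch of that approach, using scikit-learn and placeholder data: cluster feature vectors gathered during normal operation, then score new samples by their distance to the nearest learned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# K-means anomaly detection sketch: anything far from every centroid
# learned on healthy data is flagged. Data and threshold are placeholders.
rng = np.random.default_rng(0)
normal_features = rng.normal(0.0, 1.0, size=(500, 8))  # features from healthy runs

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(normal_features)

def anomaly_score(x):
    # Distance to the nearest centroid: small for familiar inputs,
    # large for inputs unlike anything seen during fitting.
    return np.min(np.linalg.norm(kmeans.cluster_centers_ - x, axis=1))

threshold = 4.0  # assumed; in practice tuned on held-out normal data
sample = rng.normal(6.0, 1.0, size=8)     # a deliberately abnormal vector
print(anomaly_score(sample) > threshold)  # True: flagged as anomalous
```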
Where deep learning is a requirement, sparsity provides a further reduction in model overhead. This can take the form of pruning, in which weights that have little effect on model output are simply removed from the pipeline. Another option is to focus effort on parts of the data stream that demonstrate changes over time. For example, in surveillance videos this may mean the use of image processing to detect moving objects and separate them from the background before feeding the processed pixels to a model.
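A toy sketch of the magnitude pruning mentioned above (sizes and the sparsity target are arbitrary); frameworks such as TensorFlow's model-optimisation toolkit wrap the same idea with retraining so accuracy recovers:

```python
import numpy as np

# Magnitude pruning sketch: zero out the weights that contribute least
# to the output, leaving a sparse matrix that a sparsity-aware runtime
# or accelerator can skip over entirely.
def prune_by_magnitude(weights, sparsity=0.8):
    """Zero the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(64, 64)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.8)
print(mask.mean())  # ~0.2: only a fifth of the MACs remain
```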
It’s been a learning experience for Jongboom and others. Describing his progress through the stages of TinyML, he says that in the summer of 2017 he thought the whole concept was impossible. By the summer of 2020, having looked at ways to optimise application and model design together, his attitude had changed: real-time image classification on low-power hardware is feasible. As low-power accelerators that handle low precision and sparsity more efficiently appear, the range of models that can run at micropower should expand.
The result, Situnayake claims, is likely to be that “ML will end up representing a larger fraction than any other type of workload. The advantages of on-device ML will drive the industry towards creating and deploying faster, more capable low-power chips that will come to represent the majority of all embedded compute in the world”. Though there will be plenty of devices that do not run these workloads, the need for speed as model sizes inevitably grow will focus attention on ML's requirements, which will begin to dominate the development of software and hardware architectures, as long as the applications follow through.