Employing architectural innovations using analogue technology, Sagence AI says it can deliver multiple orders of magnitude improvement in energy efficiency and cost, while sustaining performance equivalent to high-performance GPU/CPU-based systems.
According to the company, when compared with the leading volume GPU running the Llama2-70B large language model, with performance normalised to 666K tokens/sec, Sagence’s technology uses 10X less power, costs 20X less, and occupies 20X less rack space.
By using a modular chiplet architecture for maximum integration, Sagence has made it possible to deliver a highly efficient inference machine that scales from data centre generative AI to edge computer vision applications across multiple industries. This balance of high performance and low power at affordable cost addresses the growing ROI problem for generative AI applications at scale, as AI compute in the data centre shifts from training models to deploying them for inference tasks.
“A fundamental advancement in AI inference hardware is vital to the future of AI. Use of large language models (LLMs) and Generative AI drives demand for rapid and massive change at the nucleus of computing, requiring an unprecedented combination of highest performance at lowest power and economics that match costs to the value created,” said Vishal Sarin, CEO & Founder, Sagence AI. “The legacy computing devices today that are capable of extreme high performance AI inference cost too much to be economically viable and consume too much energy to be environmentally sustainable. Our mission is to break those performance and economic limitations in an environmentally responsible way.”
“The demands of the new generation of AI models have resulted in accelerators with massive on-package memory and consequently extremely high-power consumption. Between 2018 and today, the most powerful GPUs have gone from 300W to 1200W, while top-tier server CPUs have caught up to the power consumption levels of NVIDIA’s A100 GPU from 2020,” said Alexander Harrowell, Principal Analyst, Advanced Computing, Omdia. “This has knock-on effects for data centre cooling, electrical distribution, AI applications’ unit economics, and much else. One way out of the bind is to rediscover analogue computing, which offers much lower power consumption, very low latency, and permits working with mature process nodes.”
Sagence technology is the first to do deep subthreshold compute inside multi-level memory cells, a combination that opens doors to the orders of magnitude improvements necessary to deliver inference at scale.
As digital technology reaches the limits of its ability to scale power and cost, Sagence leverages the inherent benefits of analogue in energy efficiency and cost to make possible mass adoption of AI that is both economically viable and environmentally sustainable.
In-memory computing aligns closely with the essential elements of efficiency in AI inference applications. Merging storage and compute inside memory cells eliminates single-purpose memory storage and the complex scheduled multiply-accumulate circuits that run the vector-matrix multiplication integral to AI computing. The resulting chips and systems are much simpler, cheaper and lower in power, with vastly more compute capability.
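To make the idea of computing inside multi-level memory cells concrete, the following is a minimal numerical sketch, not a description of Sagence’s actual circuits. It assumes each cell stores a weight as one of a small number of evenly spaced levels and that each output column simply sums the contributions of its cells. The function names, the 16-level default and the weight range are all hypothetical choices made for illustration.

```python
# Illustrative model only: function names, level count and weight range
# are assumptions, not details published by Sagence.

def quantize_to_levels(weight, levels=16, w_max=1.0):
    """Snap a weight in [-w_max, w_max] to the nearest of `levels` evenly spaced values."""
    step = 2 * w_max / (levels - 1)
    clipped = max(-w_max, min(w_max, weight))
    index = round((clipped + w_max) / step)
    return -w_max + index * step


def program_array(weights, levels=16, w_max=1.0):
    """Store a weight matrix (rows = inputs, columns = outputs) in multi-level cells."""
    return [[quantize_to_levels(w, levels, w_max) for w in row] for row in weights]


def in_memory_vmm(inputs, cells):
    """Vector-matrix multiply: every column accumulates input * stored weight.

    The per-column sum stands in for contributions adding up on a shared
    output line, so no separate multiply-accumulate unit is scheduled.
    """
    n_cols = len(cells[0])
    return [sum(x * row[j] for x, row in zip(inputs, cells)) for j in range(n_cols)]


weights = [[0.5, -0.25], [1.0, 0.75]]
cells = program_array(weights, levels=16)
approx = in_memory_vmm([1.0, 2.0], cells)    # result using quantised cells
exact = in_memory_vmm([1.0, 2.0], weights)   # full-precision reference
```

In this toy model the stored weights double as the compute elements, and the only loss relative to the reference result is the quantisation error introduced when the cells are programmed.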
According to Sagence, the AI inference challenge should not be seen as a general-purpose computing problem, but as a mathematically intensive data processing problem. Managing the massive amount of arithmetic processing needed to “run” a neural network on CPU/GPU digital machines requires extremely complicated hardware reuse and hardware scheduling.
The natural hardware solution is not a general-purpose computing machine, but rather an architecture that more closely mirrors how biological neural networks operate.
The statically scheduled deep subthreshold in-memory compute architecture employed by Sagence chips is much simpler and eliminates the variabilities and complexities of the dynamic scheduling required of CPUs and GPUs. Dynamic scheduling places extreme demands on the SDK to generate the runtime code and contributes to cost and power inefficiencies.
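The difference between static and dynamic scheduling can be sketched in software terms. The example below is a loose analogy under stated assumptions, not Sagence’s toolchain. It fixes the execution order of a small, hypothetical three-layer network once, ahead of time, so that the runtime loop makes no scheduling decisions. The layer names, the toy operations and the helper functions are invented for illustration; `graphlib.TopologicalSorter` is part of the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical network: each op maps to the set of ops whose outputs it consumes.
DEPENDENCIES = {"layer1": set(), "layer2": {"layer1"}, "layer3": {"layer2"}}

# Toy stand-ins for layer computations (assumed for illustration only).
OPS = {
    "layer1": lambda results, x: x * 2,
    "layer2": lambda results, x: results["layer1"] + 1,
    "layer3": lambda results, x: results["layer2"] * 3,
}


def compile_static_schedule(dependencies):
    """Resolve the execution order once, before any input arrives."""
    return tuple(TopologicalSorter(dependencies).static_order())


def run(schedule, ops, x):
    """Walk the precomputed order; nothing is decided at run time."""
    results = {}
    for name in schedule:
        results[name] = ops[name](results, x)
    return results[schedule[-1]]


SCHEDULE = compile_static_schedule(DEPENDENCIES)  # done once, at "compile time"
output = run(SCHEDULE, OPS, 5)
```

A dynamically scheduled system would instead decide at run time which operation to issue next and on which hardware unit, which is the source of the variability and software burden described above.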
The Sagence AI design flow imports a trained neural network using standards-based interfaces like PyTorch, ONNX and TensorFlow, and automatically converts it into Sagence format. The Sagence system receives the neural network long after GPU software created it, so the GPU software is no longer needed.