Just a couple of years after Google published its paper on the underlying Transformer architecture, a study from the University of Massachusetts Amherst estimated that training the largest models of the time, from start to finish, would emit as much carbon dioxide as five petrol-driven cars over their lifetimes.
Since then, the numbers have only grown. OpenAI’s GPT-4 needed some 50GWh to train from inception to release. But such huge new models only come along every so often. The key characteristic of these large language models (LLMs) is that they are fully trained rarely and fine-tuned for applications afterwards, even in the case of open-source models like Meta’s Llama.
Now that generative AI is a consumer service, inferencing, where the model responds to user prompts, is the problem. Not only are there many, many more inferencing operations than training runs, the core operations do not run well on most hardware. The chief obstacle to performance is the need to build a long chain of tokens at the output, each one based on the content of all of its predecessors.
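In code, that serial dependency looks roughly like the loop below: a minimal sketch using a toy stand-in for the model, in which every new token is scored against the cached state of all the tokens that came before it.

```python
# Toy sketch of autoregressive decoding. The "model" here is a stand-in that
# scores the next token from the pooled cache of earlier tokens; real LLMs use
# attention over a key-value cache, but the serial dependency is the same.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 1000, 64
embeddings = rng.standard_normal((VOCAB, DIM))

def next_token(cache: np.ndarray) -> int:
    # Every step re-reads the entire cache of earlier state, which is what
    # makes token generation memory-bound rather than compute-bound.
    context = cache.mean(axis=0)        # crude stand-in for attention
    logits = embeddings @ context       # one score per vocabulary entry
    return int(np.argmax(logits))

prompt = [1, 42, 7]
cache = [embeddings[t] for t in prompt]  # grows by one entry per generated token
generated = []
for _ in range(10):
    tok = next_token(np.stack(cache))
    generated.append(tok)
    cache.append(embeddings[tok])        # the new token depends on all predecessors
print(generated)
```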
Getting the probabilities of each predicted token, each roughly equivalent to a short word, demands accesses to different parts of memory in quick succession. That is a big change from convolutional neural networks, where computer architects have been able to take advantage of memory locality to improve throughput. And user expectations are not good news for operators either.
“As we get answers from things like chatbots, we want to be able sometimes to skip ahead or skim. Having a higher tokens-per-second rate is key to delivering great user experiences,” says Dave Salvator, director of accelerated computing products at Nvidia.
Though the serial nature of LLMs makes it harder to take advantage of parallelism, splitting the model across many GPUs turns out to be both a requirement, because of the sheer number of parameters, and a benefit to latency. Operators have also found ways to pipeline operations, staggering them so that different accelerators process different layers for successive tokens.
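The staggering works roughly like the schedule below, a minimal sketch in which placeholder functions stand in for groups of layers assigned to different accelerators.

```python
# Minimal sketch of pipelined decoding: while accelerator 0 runs its group of
# layers for token t, accelerator 1 runs the next group for token t-1, and so
# on. The stage functions are placeholders, not real model layers.
def stage0(x): return x + 1
def stage1(x): return x * 2
def stage2(x): return x - 3

stages = [stage0, stage1, stage2]        # one group of layers per accelerator
tokens = [10, 20, 30, 40]                # pretend intermediate values, one per token

for step in range(len(tokens) + len(stages) - 1):
    busy = []
    for tok_idx, value in enumerate(tokens):
        stage_idx = step - tok_idx       # classic pipeline schedule
        if 0 <= stage_idx < len(stages):
            tokens[tok_idx] = stages[stage_idx](value)
            busy.append(f"stage {stage_idx} -> token {tok_idx}")
    print(f"step {step}: " + ", ".join(busy))
```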
“It will probably involve multiple server nodes as the models continue to grow,” Salvator adds. And those server nodes are now beginning to incorporate liquid cooling to cope with the intense heat they produce.
Machine architectures
One option proposed by researchers at Microsoft and the University of Washington is to use radically different machine architectures for the two main phases. Classic GPUs can run the compute-intensive operation of parsing prompts from users, while machines tuned for fast memory accesses, which need comparatively little in the way of arithmetic ability, generate the output tokens. This still leaves a gap to be filled.
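A hypothetical sketch of that split, with invented pool names, shows the hand-off: the prompt is digested in one compute-heavy pass, then a memory-optimised machine takes over the token-by-token loop.

```python
# Hypothetical sketch of the prefill/decode split, with invented pool names:
# the compute-heavy pass over the whole prompt runs on one class of machine,
# then the resulting state is handed to a memory-optimised machine to
# generate tokens. Both functions are stand-ins, not real model code.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str

def prefill_on_compute_pool(req: Request) -> dict:
    # Stand-in for the arithmetic-heavy pass that digests the prompt in one go.
    return {"kv_cache": [ord(c) % 97 for c in req.prompt]}

def decode_on_memory_pool(state: dict, max_tokens: int = 5) -> list[int]:
    # Stand-in for the memory-bound loop that emits one token at a time.
    out = []
    for _ in range(max_tokens):
        nxt = sum(state["kv_cache"]) % 256
        out.append(nxt)
        state["kv_cache"].append(nxt)
    return out

state = prefill_on_compute_pool(Request("Explain the prefill/decode split"))
print(decode_on_memory_pool(state))
```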
“Deep learning has turned out to be a great way of putting really large computers to work. And it is at a scale that would have been impossible with traditional high-performance computing technologies. The future on offer is foundation models for everything,” said Imagination Technologies chief architect Dan Wilkinson at the AI Hardware and Edge AI Summit in London in June. “Now we are being sold foundation models at the edge.”
Given the size of these models, running them wholly at the edge is not a practical reality. But there are two reasons why smaller and embedded machines could take on more of the job. One is the energy cost. Operators and customers are noticing their energy bills, and offloading more of the work to users’ own machines will help bring those bills down. The second reason lies in network congestion.
“The edge is exposed to a lot more data than is presented to the cloud,” says Sakyasingha Dasgupta, founder and CEO of EdgeCortix. “That is one reason why we need much more efficient edge processing. Customers are moving toward hybrid edge-cloud environments, where they send just [preprocessed] metadata to the cloud.”
Another split is possible and is one that Qualcomm and Samsung, among others, have explored in recent projects.
“Our approach is to reverse the paradigm: unload from cloud to the device,” says Stylianos Venieris, senior research scientist at Samsung AI.
Only some of the work may get handed back to edge devices. A server might let a client generate tokens using a much smaller “draft” model that it then submits for approval. The server can choose to accept these tokens or annul one or more of them if it finds better replacements. Even if it rejects most tokens, the overall processing rate should not be lower than just running everything on the server.
That is because the draft model presents its tokens to the server in batches, which are more easily parallelised across multiple GPUs. The approach incurs some round-trip delay as tokens pass back and forth across the internet, but the latency and throughput, even on a constrained device, can be better than passing everything to the server. Earlier this year, Qualcomm claimed to have achieved token rates approaching 20 per second for a seven-billion-parameter open-source model running standalone on its Snapdragon processor.
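A simplified sketch of the accept-or-replace loop, with toy stand-ins for both the draft and the full model and a greedy acceptance rule, looks something like this:

```python
# Simplified sketch of speculative decoding: a cheap draft model proposes a few
# tokens ahead, the large target model checks them as a batch, and the first
# disagreement is replaced. Both models here are toy stand-ins, not real networks.
def draft_next(context):                    # cheap model, e.g. on the device
    return (sum(context) * 7 + 3) % 50

def target_next(context):                   # expensive model on the server
    return (sum(context) * 7 + 3 + len(context) % 2) % 50

def speculative_step(context, k=4):
    proposed, ctx = [], list(context)
    for _ in range(k):                      # draft model runs serially but cheaply
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposed:                      # target verifies the batch in one pass
        correct = target_next(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)        # replace the first mismatch, then stop
            break
    return accepted

context = [5, 9, 2]
for _ in range(3):
    step = speculative_step(context)
    context.extend(step)
    print("accepted this round:", step)
```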
Samsung has published an alternative that uses the tendency of language to cluster words in common phrases. This provides opportunities to do more in parallel rather than waiting for successive tokens to appear.
The efficiency drive is going in the other direction as well. Cloud operators are taking onboard an optimisation technique that edge-AI accelerators now perform routinely: quantisation. This involves converting models to use integer calculations or low-precision floating-point arithmetic, all the way down to 4-bit (FP4), rather than the far more hardware-intensive 16- or 32-bit floating-point arithmetic that the training process normally generates. Those simpler operations can easily be run in parallel on comparatively low-cost accelerators.
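A minimal sketch of the idea, here for symmetric 8-bit integers rather than FP4, stores low-precision weights alongside a single scale factor and converts back on the fly:

```python
# Minimal sketch of symmetric 8-bit weight quantisation: scale floating-point
# weights into the int8 range, keep the integers plus one scale factor, and
# dequantise at inference time. FP4 follows the same idea with a 16-entry code
# space and, typically, per-block scales.
import numpy as np

def quantise_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantise_int8(w)
print("max absolute error:", np.abs(w - dequantise(q, s)).max())
```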
Training issues
Though work continues on trying to perform training using integer arithmetic, this remains largely experimental. The issue with training is that it involves tweaking coefficients down a sloping curve that is best represented with the flexibility of higher-resolution floating-point arithmetic. Low-bit integer operations introduce large steps that disrupt the gradient descent. Salvator says floating point “allows the model to converge. If it doesn’t converge, whatever speedup you’ve got is academic.”
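A toy illustration of the problem: round every gradient update to a coarse grid, as an aggressively low-precision format would, and the optimisation can stall before it gets anywhere near the minimum.

```python
# Toy illustration of why training clings to floating point: small gradient
# steps vanish when every update is rounded to a coarse, integer-like grid.
def grad(x):                       # gradient of (x - 3)^2
    return 2.0 * (x - 3.0)

def descend(round_to=None, steps=200, lr=0.01):
    x = 0.0
    for _ in range(steps):
        update = lr * grad(x)
        if round_to is not None:
            update = round(update / round_to) * round_to   # snap to a coarse grid
        x -= update
    return x

print("float updates :", descend())              # converges close to 3.0
print("coarse updates:", descend(round_to=0.25)) # stalls well short of the minimum
```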
Nvidia now offers support for FP4 operations on its latest GPUs, such as the Blackwell series. But Salvator notes that using these representations for generative AI takes a lot of software tuning compared to something like the convolutional networks used in many embedded systems that perform video and audio processing.
“This [algorithm development] is the hard work you need to do to make FP4 viable. It's one thing to claim you have FP4 support in your chip. It is another thing entirely to actually make it work in real AI applications,” Salvator argues.
“We have traditional techniques like quantisation, but we have to go beyond that with LLMs,” says Venieris, in order to make them run on constrained hardware. That could come with approximate computing, in which the hardware aborts arithmetic operations once the results are close enough. That should save power. But such work is at an early stage. “It is hard to assess the impact of approximation on accuracy of LLMs: that presents a big obstacle.”
Some researchers are looking at whether the Transformer structures used today are vital. If not, more efficient replacements could find their way into the models. Another important trend is retrieval-augmented generation (RAG), a technique that has also been explored for image generators such as Stable Diffusion. This parcels more of the data into a knowledge base from which a smaller neural network pulls useful fragments when it needs them.
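The retrieval step can be sketched as a simple vector search, here with a hash-based stand-in for a real embedding model: the query and the stored fragments are compared as vectors and only the closest matches are handed to the generator.

```python
# Hedged sketch of the retrieval step in RAG: documents and the query are
# embedded as vectors, the closest passages are pulled from the knowledge base,
# and only those fragments are passed to the (smaller) generator. The hash-based
# "embedding" below is a stand-in for a real embedding model.
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

knowledge_base = [
    "Transformers generate text one token at a time.",
    "Quantisation shrinks models to low-precision integers.",
    "Liquid cooling removes heat from dense server racks.",
]
kb_vectors = np.stack([embed(d) for d in knowledge_base])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = kb_vectors @ embed(query)   # cosine similarity (vectors are unit length)
    top = np.argsort(scores)[::-1][:k]
    return [knowledge_base[i] for i in top]

print(retrieve("How do models shrink for edge hardware?"))
```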
These choices will have ramifications for accelerators, whether they sit in servers or in edge and embedded devices. Wilkinson says RAG needs subtly different support to the matrix operations needed for LLM inferencing, such as the ability to compare vectors easily. Balancing those needs could be tricky.
“We have a big space where we can have either a very flexible processor, accelerator that can handle any type of tensor operations and so on, but with limited performance compared to some more custom hardware. That can go all the way to having a heterogeneous system with accelerators tailored for transformers, tailored for convolutions and other specific operations. Deciding between those is a big question,” Venieris concludes.