A little over 20 years ago, Intel co-founder Gordon Moore had to remind an audience of engineers that “no exponential is forever” as he sought to give them a reality check on how far silicon scaling could go. By that point, the nature of semiconductor scaling was changing.
Limits on electron mobility and electric fields in sub-micrometre processes meant the industry was already shifting to a new way of scaling transistors: no longer could designers rely on shrinking the gate length alone to cut power consumption.
A couple of years later, maximum clock speeds stalled at around 3GHz, while fabs improved mobility with strain engineering and trimmed the space between transistors rather than relying on shorter gate lengths alone to drive the improvements.
At the influential NeurIPS conference held at the end of last year, Ilya Sutskever, co-founder of AI startup Safe Superintelligence, offered a reflection similar to Moore’s as he looked back on the decade since he and colleagues at Google Brain published the paper on “autoregressive” language models that set off the revolution in foundation models behind OpenAI’s GPT, Meta’s Llama and others.
Those huge foundation models fully embraced the Bitter Lesson, a popular principle first proposed by University of Alberta computer scientist Rich Sutton: “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”
A series of influential papers showed that simply making foundation models bigger and training them for longer on larger datasets does not just yield higher accuracy: larger models seem to gain new skills. The question is what happens when you hit a limit. One such limit, Sutskever points out, is available data: “Data is the fossil fuel of AI. You could say we have achieved peak data and there will be no more.”
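The functional form those scaling-law papers fitted is worth spelling out. DeepMind’s “Chinchilla” analysis, for example, models the loss of a trained model roughly as a sum of power laws in parameter count and data volume:

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where N is the number of parameters, D the number of training tokens, E an irreducible loss, and A, B, α and β fitted constants. Once D can no longer grow, the data term stops shrinking however much compute is poured into N, which is the ceiling Sutskever is pointing to.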
War of attrition
That is not the only problem with the Bitter Lesson. Training costs and energy consumption have spiralled upwards in a war of attrition among the leading foundation-model providers in the US. It has reached the point where service providers are trying to secure guaranteed access to energy from nuclear-power stations.
Yet China’s DeepSeek showed that such intensive training programmes may be unnecessary for mainstream applications. Developed to make the most of the older GPUs to which Chinese users still have access, DeepSeek took advantage of what researchers call “test-time scaling”. Like OpenAI’s o1, this pushes more of the work into the inference phase rather than relying on ever-longer training runs. The approach uses an automated form of chain-of-thought prompting, in which a smaller model uses learned recipes to feed a series of prompts to the primary model to reach the answer to a problem.
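As a rough sketch of the idea, not of DeepSeek’s or OpenAI’s actual pipeline, such a loop can be written as a small controller model repeatedly prompting a larger primary model; both models appear here as stand-in text-in, text-out callables and every name is illustrative.

```python
from typing import Callable

def solve_with_test_time_scaling(
    question: str,
    controller: Callable[[str], str],   # small model: proposes the next step
    primary: Callable[[str], str],      # large model: does the heavy lifting
    max_steps: int = 8,
) -> str:
    """Spend extra inference-time compute by decomposing a problem into
    sub-questions chosen by a lightweight controller model."""
    scratchpad = ""
    for _ in range(max_steps):
        step = controller(
            f"Problem: {question}\nWork so far:{scratchpad}\n"
            "Suggest the next sub-question, or reply DONE."
        )
        if step.strip() == "DONE":
            break
        # the primary model answers the sub-question; the exchange is kept
        # as context for later steps
        answer = primary(f"{scratchpad}\nQ: {step}\nA:")
        scratchpad += f"\nQ: {step}\nA: {answer}"
    # final pass: compose an answer from the accumulated chain of thought
    return primary(f"{scratchpad}\nFinal answer to: {question}\n")
```

The trade-off is plain from the loop: every extra step is another full forward pass through the primary model, so the cost of an answer grows with how long the system is allowed to “think”.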
OpenAI argues such tradeoffs will be worth it for hard problems that need multiple steps to complete, and it has made some grandiose promises about where that could lead.
OpenAI research scientist Noam Brown wrote in the autumn, “o1 thinks for seconds, but we aim for future versions to think for hours, days, even weeks…What cost would you pay for a new cancer drug? For breakthrough batteries? For a proof of the Riemann Hypothesis?”
Though we should expect hard problems to take a long time to solve, an obstacle to large-scale use of a model like o1 lies in the fact that the Transformer architecture was developed to streamline training, not inference. Asked to answer a question, a GPT-like model is limited more by memory bandwidth than by floating-point processing speed. On top of that, the attention computation it performs scales quadratically with context length: the number of tokens that can be fed through the layers in one pass to obtain a response.
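The quadratic term comes straight from the attention mechanism, as a toy NumPy version of a single attention head makes clear (a sketch for illustration, not any particular model’s code):

```python
import numpy as np

def naive_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over n tokens: X is (n, m), the weight
    matrices are (m, d). The n-by-n score matrix is why cost and memory
    grow quadratically with context length."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (n, n) -- quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d)
```

Doubling the number of tokens in the window roughly quadruples the work and the memory needed for the score matrix, and that price is paid in every attention layer.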
At short context lengths, a model like Llama is at its fastest. Once the context window exceeds 2,000 tokens in total, performance can easily halve, falling to a quarter by 4,000. For this reason, a common context-window limit has been around 2,000 tokens. As a token roughly equates to a short word or a word fragment, such a limit means commercial models can cope with news articles but struggle with scientific papers or software source files.
Alternative architectures
For Sepp Hochreiter, head of the Institute of Machine Learning at Johannes Kepler University Linz, the growing performance problems facing Transformers give other architectures an opportunity to step into the spotlight, on the basis that scale should work for them as well, given the chance.
Viable options include the long short-term memory (LSTM) neural network structure he devised 30 years ago, and which underpinned some of the pre-Transformer language models. “The LSTM has stood the test of time and led to many successes,” he claimed during his keynote speech at NeurIPS last year.
Over the past few years, Hochreiter and colleagues have developed ways to overcome the training issues that saw the LSTM pushed aside in favour of Transformers. The long-range dependencies captured by an LSTM, which underpin its ability to detect patterns, made its training hard to parallelise. The xLSTM removes many of the obstacles to parallelism, not by eliminating those dependencies entirely but by avoiding the need for worker threads to stall while waiting for each other’s updates to be written. These changes led to a model with 7 billion parameters, matching the size at the low end of the Llama family of Transformer-based LLMs.
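The parallelism problem is easiest to see in the classic LSTM update itself (the textbook formulation, with biases omitted, rather than xLSTM’s reworked memory):

```python
import numpy as np

def lstm_forward(xs, Wi, Wf, Wo, Wc, Ui, Uf, Uo, Uc):
    """Classic LSTM over a sequence xs of input vectors. Each step needs
    the previous step's hidden state h and cell state c before it can
    start, so the time steps cannot simply be computed in parallel."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    d = Wi.shape[0]
    h, c = np.zeros(d), np.zeros(d)
    outputs = []
    for x in xs:                                    # strictly sequential loop
        i = sigmoid(Wi @ x + Ui @ h)                # input gate
        f = sigmoid(Wf @ x + Uf @ h)                # forget gate
        o = sigmoid(Wo @ x + Uo @ h)                # output gate
        c = f * c + i * np.tanh(Wc @ x + Uc @ h)    # cell state update
        h = o * np.tanh(c)                          # hidden state
        outputs.append(h)
    return np.stack(outputs)
```

Relaxing that step-to-step dependence during training, while keeping enough of it to preserve the model’s memory, is the balance the xLSTM work aims to strike.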
“What is important for industrialisation is the speed. Because of its quadratic increase, the Transformer is soon out of the game as the context length grows,” Hochreiter says. Unlike Llama and similar models, xLSTM maintains a constant output rate as the context grows. Tests pushed the context length for xLSTM out to 250,000 tokens, well beyond the practical maximum for Transformer models.
The improved inference speed may give xLSTM an advantage in the chain-of-thought reasoning setups that o1 and DeepSeek now use. Hochreiter argues the uniform resource requirements of xLSTM will also suit it to embedded applications, such as robotics. These systems could profit from the ability to hold and analyse very long sequences captured from sensors.
As well as forming a startup called NXAI to commercialise the work, the Linz team has released the code and other information for the 7B model as open source on the GitHub and Hugging Face repositories. It still has to contend, however, with continuing improvements to the efficiency of the Transformer, which retains most of the market momentum.
For the problem of memory usage, various groups have tried shortcuts that cut the memory and computing overhead of Transformers so that the models can run on embedded and edge-computing systems. One option is to reduce memory traffic with schemes such as FlashAttention, which reorganises the attention computation so that working data stays in fast on-chip memory instead of being shuttled to and from main memory.
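The core of the trick is a running, or “online”, softmax that processes the keys and values block by block so the full n-by-n score matrix is never held in memory at once. A single-query NumPy sketch of the principle (an illustration, not the real kernel) looks like this:

```python
import numpy as np

def blockwise_attention(q, K, V, block=64):
    """Attention for one query q of shape (d,) against K, V of shape (n, d),
    computed block by block with a running softmax -- the principle behind
    IO-aware schemes such as FlashAttention."""
    d = q.shape[0]
    m = -np.inf                      # running maximum of the scores seen so far
    s = 0.0                          # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])       # running weighted sum of values
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        scores = k_blk @ q / np.sqrt(d)              # scores for this block only
        new_m = max(m, scores.max())
        scale = np.exp(m - new_m) if np.isfinite(m) else 0.0
        w = np.exp(scores - new_m)
        s = s * scale + w.sum()                      # rescale earlier partial sums
        acc = acc * scale + w @ v_blk
        m = new_m
    return acc / s                   # same result as a full softmax over all n scores
```

Because each block’s intermediate results can live in fast on-chip memory, the scheme cuts the traffic to and from DRAM that dominates attention at long context lengths.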
Another is to trim model size by changing the training recipe: careful curation and refinement of the data can squeeze better accuracy out of less of it. This approach follows a project by Microsoft Research, which used larger foundation models to generate reams of synthetic textbooks to train its Phi-1 model, though the team did not publish its training recipe. Hugging Face experimented with various methods for its Cosmopedia project and its SmolLM series of models so that it could release them in open-source form.
Conclusion
As AI butts up against the limits of the Bitter Lesson and tries to industrialise its outputs, brute-force scaling looks likely to give way to a new wave of experiments in model architectures.
Though this could lead to new headaches for designers of hardware accelerators, the increased focus on efficiency will probably push researchers to optimise for conventional GPU instruction sets.