Addressing these conflicting goals is a challenge, and one that is pushing data centre operators to explore whether custom or general purpose solutions might meet their needs.
Microsoft researchers, along with colleagues from Bing – Microsoft's search engine – have been collaborating with industry and academia to examine data centre hardware in Project Catapult. One of the project's themes is an effort to deploy FPGAs and results suggest that performance improvements of up to 95% might be achievable.
The presence of Bing in the research team highlights the challenges being faced. Intense competition means rival search engines are looking for ways to bring results to the user even more quickly.
Microsoft researcher Doug Burger explained the motivation in a blog post. "We are addressing two problems," he said. "First, how do we keep accelerating services and reducing costs in the cloud as the performance gains from CPUs continue to flatten?
"Second, we wanted to enable Bing to run computations at a scale that was not possible in software alone, for much better results at lower cost."
Bing hardware architect Derek Chiou said, in the same post: "The factor of two throughput improvement demonstrated in the pilot means we can do the same amount of work with half the number of servers or double the amount of work with the same number of servers – or some mix of the two.
"Those kinds of numbers are especially significant at the scale of a data centre and potential benefits go beyond simple dollars. To give some examples, Bing's ranking could be further enhanced to provide an even better customer experience, power could be saved and the size of the data centres could be reduced. The strength of the pilot results have led to Bing deploying this technology in one data centre."
Mike Strickland, director of Altera's compute and storage business unit, noted three particular areas where FPGAs were being used as accelerators, rather than general purpose GPUs (GPGPU): algorithms, including search, neural and financial; networking, including virtualisation, encryption and compression; and data access, including data analytics and filtering.
"FPGAs tend to do well when it comes to power-performance," he said. "They typically consume about 20% of the power required by a GPGPU so in cloud applications, when things are packed tightly, it's an advantage. What it also shows is that FPGAs are price competitive with GPGPUs – they don't all cost $5000 each."
More recently, Microsoft has used Altera's Arria 10 FPGAs to boost performance/Watt in data centre acceleration applications based on convolutional neural network (CNN) algorithms. Such algorithms are used for image classification, image recognition and natural language processing.
"We are seeing a significant leap forward in CNN performance and power efficiency with Arria 10 engineering samples," said Burger, "and the silicon's precision hard floating point in the DSP blocks is part of the reason we are seeing compelling results in our research."
"Neural networks need lots of nodes," said Strickland. "You can put lot in FPGAs and, although the peak performance may be lower than that of a GPGPU, an FPGA has better sustained performance."
According to Strickland, one of the advantages of using FPGAs in this application is that they 'take care of multiplication'. "We've added an IEEE 754 compliant block to the parts which allows single precision filtering to be accomplished in a DSP block," he said. "At a recent conference held by the Linley Group, we showed an example of a neural algorithm running on a 28nm Stratix V device. Conventionally, this would make a lot of use of DSP blocks, but there is also the need for logic in order to handle the timing and to move things around. Not everything runs on a DSP, but a lot does.
"The design we presented at the conference needed 40% less logic, which meant the timing could be closed more quickly. Clock speed was increased and less logic in the device means lower power consumption."
Strickland contended that the move from a Stratix V to an Arria 10 could bring a fourfold increase in performance for such applications. "And Altera has announced that devices made on Intel's 14nm process will feature the Hyperflex architecture. This will bring a big jump forward in performance."
Stratix 10 devices will be built on Intel's 14nm process and Strickland believes these will be 'even more interesting' in these applications. "It will be a 10TFLOPs part – Arria 10 is a 1.3TFLOP device – and generally will be better at sustained performance."
But there are other applications in the data centre to which FPGAs are suited. "There's networking," Strickland suggested, "as well as data filtering, when you are looking for particular fields. While a Xeon processor has more cores, it runs into a bottleneck when accessing external memory. Filtering this data through an FPGA on the way in is more efficient."
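The filtering idea can be illustrated in software, although an FPGA would do the equivalent work in hardware as the data streams in. The Python sketch below is only an analogy under stated assumptions: the record layout, the field names and the `prefilter` function are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set


def prefilter(records: Iterable[Dict[str, str]],
              field: str,
              wanted: Set[str]) -> Iterator[Dict[str, str]]:
    """Pass on only the records whose `field` holds a wanted value,
    so the downstream processor never has to touch the rest."""
    for record in records:
        if record.get(field) in wanted:
            yield record


# Hypothetical incoming data; in the scenario described, this would be
# a stream arriving from storage or the network.
incoming = [
    {"id": "1", "region": "eu", "payload": "a"},
    {"id": "2", "region": "us", "payload": "b"},
    {"id": "3", "region": "eu", "payload": "c"},
    {"id": "4", "payload": "d"},  # record missing the field is dropped
]

kept = list(prefilter(incoming, "region", {"eu"}))
print([r["id"] for r in kept])
```

The point of the design is that the processor only sees the records that survive the filter, which reduces the traffic to external memory that Strickland identifies as the bottleneck.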
And there's data access. "Data needs to be brought in from storage devices," he continued. "This might be compressed using the GZIP algorithm. IBM has developed an FPGA based data analytics converter for GZIP. And there's encryption; cloud hosting companies are concerned about how to make data unavailable internally."
Microsoft's Burger noted that Catapult's work had shown 'a programmable hardware enhanced cloud, running smoothly and reliably at large scale'. "I would imagine that, a decade hence, it will be common to compile applications into a mix of programmable hardware and programmable software.
"This is a radical shift that will offer continued performance improvements past the end of Moore's Law as we move more and more of our applications and services into hardware," he concluded.