Their plight resembles in many ways the problems hardware and software developers face as they struggle to keep up with advances in the artificial intelligence (AI) arena. New applications and algorithms continue to come at them at an ever-accelerating rate, but the traditional tools they’ve been using lack the power and flexibility to cope. What’s a developer to do?
It takes almost two years to design a typical ASIC, but by the time that ASIC is ready, the market will have moved on. Who could have foreseen two years ago, when many inference engines still used 32-bit floating-point data, that eight-bit integers could be just as effective while using 30 times less power? Or that convolutional neural network (CNN) algorithms would progress from the straightforward AlexNet to the more complex GoogLeNet and the even more complex DenseNet? As a result, an ASIC that accelerated AlexNet with 32-bit floating-point arithmetic was obsolete before it even hit the market.
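As a rough illustration of that floating-point-to-integer shift, the sketch below quantises an FP32 weight array to signed INT8 values using a simple symmetric scale. The array, the scale and the NumPy implementation are illustrative assumptions for this article, not how any particular inference engine does it.

```python
import numpy as np

# Illustrative FP32 weights (random values standing in for a trained layer)
weights_fp32 = np.random.randn(1000).astype(np.float32)

# Symmetric linear quantisation: map the FP32 range onto signed INT8 [-127, 127]
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantise to estimate the error introduced by the 8-bit representation
reconstructed = weights_int8.astype(np.float32) * scale
print("max quantisation error:", np.abs(weights_fp32 - reconstructed).max())
print("storage: %d bytes -> %d bytes" % (weights_fp32.nbytes, weights_int8.nbytes))
```

In practice the accuracy loss is recovered through calibration or retraining, which is why eight-bit inference can match its 32-bit counterpart at a fraction of the power.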
At the Xilinx Developer Forum in the US, we rolled out several new FPGA-based products that will enable our customers to bring high-performance AI products to market in far less time than their old-fashioned tool chains allow.
The Alveo accelerator cards are Xilinx’s first turnkey products. There’s no need to design a custom PCB; the user just plugs the card into a PCIe backplane, adds software, and it’s ready to go. In key workloads like inferencing and video processing, Alveo can provide 90X higher performance than traditional CPUs. It’s even 4X faster than GPU-based inferencing solutions and can reduce latency by a factor of three.
A second new product, Versal, is the first member of our new Adaptive Compute Acceleration Platform (ACAP) category. Versal provides the computational horsepower needed to outperform hardwired devices and the flexibility to adapt to new software algorithms, even after a product has been deployed in the field. Xilinx has been working on this product family for over four years under the code-name "Everest." Built on TSMC’s 7-nanometre FinFET process technology, the Versal portfolio includes a series of devices uniquely architected to deliver scalability and AI inference capabilities for markets ranging from the cloud to networking, wireless communications, edge computing and endpoints.
Versions of Versal include a smorgasbord of elements: Arm Cortex-A72 cores, on-chip DRAM with ECC, DSP engines, PCIe Gen 4, DDR4 memory controllers and Ethernet MACs, all connected by a state-of-the-art network-on-chip (NoC) that delivers low-latency, multi-terabit-per-second bandwidth.
Training and inference
Most AI applications have two distinct phases – training and inferencing. The training task entails building a model and fine-tuning its parameters by exposing it to thousands (or millions) of samples of data. Training involves many iterations over large data sets and can take days, weeks or sometimes even months, so raw performance is the key metric. This phase is typically performed offline, so power and latency are less critical. GPUs from Nvidia and AMD have been employed for this task because they deliver high throughput on large sets of data.
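To make the training phase concrete, here is a minimal sketch of that iterative fine-tuning loop, using plain NumPy and a deliberately tiny logistic-regression model; the data, model and learning rate are illustrative assumptions rather than anything from a real deployment.

```python
import numpy as np

# Illustrative training data: 10,000 samples, 16 features, binary labels
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 16)).astype(np.float32)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(np.float32)

# Tiny logistic-regression "model" whose parameters are tuned iteratively
w = np.zeros(16, dtype=np.float32)
b = 0.0
lr = 0.1

# Many passes over the data set; this is the offline, throughput-bound phase
for epoch in range(50):
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))          # predictions for every sample
    grad_w = X.T @ (p - y) / len(y)       # gradient of the loss w.r.t. weights
    grad_b = float(np.mean(p - y))
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy:", np.mean((p > 0.5) == y))
```

A real network has millions of parameters and far more data, which is why this phase is measured in days or weeks rather than the seconds this toy loop takes.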
Once training is done, the model can be deployed to the application where the inferencing will happen, whether in the cloud or at the edge. Inference is where adaptable devices such as Xilinx FPGAs and the new ACAP products excel. Inferencing uses the models built during the training phase to make decisions on new data. If a model has been built to recognise the faces of thousands of terrorists, the inferencing task may be to recognise those faces in a crowd, or as people walk past security cameras. This places a different set of technical requirements on an effective solution.
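Continuing the toy example above, a deployed inference step boils down to applying the trained parameters to each new frame; the feature vector, threshold and per-frame loop below are hypothetical stand-ins for a real face-recognition pipeline.

```python
import numpy as np

def infer(frame_features, w, b, threshold=0.5):
    """Apply a trained logistic model to one frame's features and return a yes/no decision."""
    z = frame_features @ w + b
    p = 1.0 / (1.0 + np.exp(-z))
    return p > threshold              # e.g. "face of interest detected"

# Stand-in deployed parameters; in practice these come from the training phase
w = np.zeros(16, dtype=np.float32)
b = 0.0

# Hypothetical per-frame call: the latency of each infer() is what the edge device must guarantee
frame_features = np.ones(16, dtype=np.float32)   # stand-in for extracted face features
print(infer(frame_features, w, b))
```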
In inferencing, there are three critical factors. The first is low latency. Imagine you are driving a car. Or more accurately, your car is driving you. It is constantly scanning the road for hazards at a very high frame rate. The latency of the AI inferencing translates directly into stopping distance. The higher the latency, the more likely you’ll hit the hazard. Second, high throughput matters. A smart camera deployed in a "smart city" may have to scan thousands of faces a minute, and each video stream generates megabytes of data every second. That data must be decompressed, analysed and recompressed. Third, performance per watt matters, especially for so-called "edge-of-network" devices that may be battery-powered and communicate over WiFi or 4G/5G networks.
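To put the latency point into numbers, the short calculation below estimates the extra distance a vehicle covers while it waits for an inference result (distance = speed × latency); the speed and latency values are illustrative assumptions, not measured figures.

```python
# Extra distance travelled while inference is still running: distance = speed * latency
speed_kmh = 100.0                        # illustrative motorway speed
speed_ms = speed_kmh / 3.6               # ~27.8 metres per second

for latency_ms in (10, 50, 100):         # illustrative inference latencies
    extra_m = speed_ms * (latency_ms / 1000.0)
    print(f"{latency_ms:>4} ms of latency adds {extra_m:.1f} m before braking can even begin")
```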
Let’s look at how these factors play out in an actual smart city security application where a single node might contain four cameras covering different visual areas. If these nodes use a CPU and a GPU, the CPU might handle H.264 video decoding and motion detection. The CPU needs 16ms to decode each frame, and another 16ms to sense motion in that frame using OpenCV. It then hands the image off to the GPU, which uses a CNN algorithm to recognise objects in the frame. The decoding and motion analysis portions of this task take 32ms, and the CNN computations take roughly 50ms, so the entire process takes 82ms. This means the node can process 12 frames per second per camera, or 48 frames per second in all, while consuming about 75W to run both the CPU and GPU.
Had that node been configured with a CPU and a Xilinx FPGA instead, performance would increase and power consumption would fall. The CPU still takes 16ms to decode each frame, but the FPGA then needs less than 1ms to analyse the motion and slightly more than 9ms to run the CNN algorithm, for a total latency of 26.1ms. The node can therefore process four streams at 38 frames per second each, more than three times the throughput of the GPU-based solution. Not to mention that the Xilinx solution uses only 50W, one-third less than the GPU approach. Multiply the savings for this node by the thousands of nodes deployed across a city, and you’re talking about real power savings.
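The arithmetic behind those two configurations can be checked with the back-of-the-envelope sketch below; the stage timings and power figures are the ones quoted above (with the FPGA's "less than 1ms" and "slightly more than 9ms" approximated so the total matches the quoted 26.1ms), and the frame rates follow directly from the per-frame latency.

```python
import math

# Per-frame stage latencies in milliseconds, as quoted for the four-camera node
cpu_gpu  = {"decode": 16.0, "motion": 16.0, "cnn": 50.0}   # CPU decode + CPU OpenCV + GPU CNN
cpu_fpga = {"decode": 16.0, "motion": 1.0,  "cnn": 9.1}    # CPU decode + FPGA motion + FPGA CNN

for name, stages, watts in (("CPU + GPU", cpu_gpu, 75), ("CPU + FPGA", cpu_fpga, 50)):
    latency_ms = sum(stages.values())                      # end-to-end latency per frame
    fps_per_camera = math.floor(1000.0 / latency_ms)       # frames each camera can sustain
    print(f"{name}: {latency_ms:.1f} ms/frame, {fps_per_camera} fps per camera, "
          f"{4 * fps_per_camera} fps across four cameras, ~{watts} W")
```

Running it reproduces the figures above: 82ms and 12 frames per second per camera for the CPU-plus-GPU node, against 26.1ms and 38 frames per second per camera for the CPU-plus-FPGA node, at two-thirds of the power.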
These same savings apply in other inference applications as well. In 2016, Korea’s SK Telecom introduced its cloud-based NUGU Personal Assistant, which handles tasks similar to Siri or Alexa. Its initial implementation relied on GPU technology to decode, comprehend and respond to voice commands. This approach worked well when the service had tens of thousands of users, but as it gained popularity, SK Telecom saw that its data centres lacked the electrical capacity to run the number of GPUs needed to serve several million users generating tens of millions of conversations. It quickly retooled the application to use FPGAs instead of GPUs in its data centre and achieved a 16X improvement in performance per watt. Today, the system accommodates more than three million users who generate more than 1.1 billion conversations.
The use of FPGAs to accelerate AI applications is still in its infancy. Xilinx’s existing UltraScale and UltraScale+ devices have already enabled many pioneering applications to deliver useful results, but we’ve barely scratched the surface. Alveo has just hit the shelves, and Versal is on its way. As Robert Browning noted in Rabbi Ben Ezra, "The best is yet to be."
Author details: Salil Raje is Executive Vice President for Software and IP Products, Xilinx