FuriosaAI unveils new AI inference chip


FuriosaAI, an emerging AI semiconductor company, has unveiled the RNGD AI accelerator at Hot Chips 2024.


The RNGD chip is being positioned as an efficient data centre accelerator for high-performance large language model (LLM) and multimodal model inference.

Founded in 2017 by three engineers with backgrounds at AMD, Qualcomm, and Samsung, the company has pursued a strategy focused on rapid innovation and product delivery, which culminated in the fast development and unveiling of RNGD.

Furiosa completed the full bring-up of RNGD after receiving the first silicon samples from its manufacturing partner, TSMC. With its first-generation chip, introduced in 2021, Furiosa submitted its first MLPerf benchmark results within three weeks of receiving silicon, then achieved a 113% performance increase in the next submission through compiler enhancements alone.

Early testing of RNGD has shown promising results with large language models such as GPT-J and Llama 3.1. A single RNGD PCIe card delivers a throughput of 2,000 to 3,000 tokens per second (depending on context length) for models with around 10 billion parameters.

"The launch of RNGD is the result of years of innovation, leading to a one-shot silicon success and exceptionally rapid bring-up process. RNGD is a sustainable and accessible AI computing solution that meets the industry's real-world needs for inference," said June Paik, Co-Founder and CEO of FuriosaAI. "With our hardware now starting to run LLMs at high performance, we're entering an exciting phase of continuous advancement. I am incredibly proud and grateful to the team for their hard work and continuous dedication."

RNGD's key innovations include:

  • A Tensor Contraction Processor (TCP) architecture that, rather than being built around matrix multiplication, treats tensor contraction as its core primitive, balancing efficiency, programmability, and performance.
  • Programmability, through a compiler co-designed with the TCP architecture that treats entire models as single fused operations.
  • Efficiency, with a TDP of 150W compared to 1,000W+ for leading GPUs.
  • High performance, with 48GB of HBM3 memory delivering the ability to run models like Llama 3.1 8B efficiently on a single card.
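As a rough illustration of why 48GB suffices for a model of that size (a back-of-the-envelope sketch, not FuriosaAI's own accounting): an 8-billion-parameter model at 16-bit precision needs roughly 15 GiB for its weights alone, leaving most of the card's memory for the KV cache and activations.

```python
# Back-of-the-envelope estimate (illustrative only, not FuriosaAI's figures):
# memory needed just to hold model weights on a 48 GB card.

GIB = 1024**3  # bytes in one GiB

def model_memory_gib(params_billion: float, bytes_per_param: int) -> float:
    """Approximate memory footprint of the model weights in GiB."""
    return params_billion * 1e9 * bytes_per_param / GIB

# Llama 3.1 8B stored at 16-bit (2 bytes per parameter)
weights_fp16 = model_memory_gib(8, 2)
print(f"FP16 weights: ~{weights_fp16:.1f} GiB of 48 GB HBM3")
# Roughly 15 GiB for weights; the remainder is available for KV cache
# and activations, which grow with batch size and context length.
```

The same arithmetic shows why lower-precision formats (8-bit or 4-bit weights) are commonly used to fit larger models, or longer contexts, on a single accelerator.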

The chip is currently sampling to early access customers, with broader availability expected in early 2025.