Cisco pushes the boundaries with a 4 billion transistor, single-chip network processor
Cisco Systems has announced the nPower X1, the industry's first single-chip 400Gbit/s network processor. The chip is designed for the next stage in the evolution of the network, dubbed the Internet of Everything (IoE).
Some 10 billion high-end devices are estimated to be connected to the internet, and this figure is forecast to grow to 50 billion by 2020. Such a rapid increase in connectivity creates requirements beyond simply scaling the network's capacity.
The IoE not only connects people and machines, but also includes the attributes of data and process. Data is used for decision making, while process refers to the ability to deliver information to a machine or a person as required. A network processor for the IoE therefore needs not only to support greater traffic throughput, but also to have ample computation and control capabilities.
The events following a motorway accident are one example cited by Cisco to highlight the role of data and process. An accident would trigger alerts, such as GPS updates to nearby vehicles to avoid the resulting hold-up. In turn, additional transactions could be triggered, such as accessing a driver's online calendar to alert those affected by the driver's delay. "It is no longer just about bandwidth, it is also about processing information and taking action," said Sanjeev Mervana, a director of marketing at Cisco Systems.
Cisco's nPower X1 chip is the company's first network processor for the IoE and the first device in the nPower architecture. The X1 is already shipping in Cisco's latest network convergence system, the NCS 6000 platform. The NCS is a family of routers that acts as a fabric to link and scale Cisco's existing IP core and edge routers deployed in service providers' networks. The X1 will also be adopted in Cisco's CRS-X 400Gbit/s-per-slot IP core router, announced in June.
Nikhil Jayaram, vice president of silicon engineering at Cisco and head of the X1's design team, highlighted the device's throughput and I/O, the packet processing and traffic management attributes and the chip's massively parallel computation capacity.
The asic can process 400Gbit/s of packet traffic, equivalent to 300 million packets/s. "How that 400G is used and split is application dependent," he said.
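As a rough, back-of-the-envelope cross-check of those two headline figures (not a calculation supplied by Cisco), dividing the bit rate by the packet rate gives the implied average packet size:

```c
/* Illustrative arithmetic only: if 400Gbit/s and 300 million packets/s
 * describe the same traffic mix, the implied average packet size is
 * roughly 1,333 bits, or about 167 bytes. */
#include <stdio.h>

int main(void) {
    const double throughput_bps = 400e9;   /* 400 Gbit/s line rate   */
    const double packet_rate    = 300e6;   /* 300 million packets/s  */
    double avg_bits = throughput_bps / packet_rate;
    printf("average packet size: %.0f bits (%.0f bytes)\n",
           avg_bits, avg_bits / 8.0);
    return 0;
}
```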
Another 400Gbit/s network processor, Alcatel-Lucent's FP3, has been shipping for a while. However, it is implemented as a chipset comprising a packet processor, a traffic manager and a fabric interface chip (see NE, 10 July 2012).
Merchant chip company EZchip has also announced its network processor for smart networks (NPS) family that, like the X1 and the FP3, spans the processing requirements of the higher networking layers of the Open Systems Interconnection model, from layer 2 to layer 7. The first device, the NPS-400, will have a 400Gbit/s throughput and first samples of the part are due by the end of 2013 (see NE, 26 Feb 2013).
In a network processor, a packet is inspected and a look-up performed to determine the packet's class, the quality of service it requires and where it should be forwarded. The traffic manager is also told which queue the packet should be placed into. The traffic manager's role covers packet buffering, queueing and scheduling: deciding how packets should be handled, especially when traffic congestion occurs.
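A heavily simplified sketch of that packet-processor and traffic-manager split is shown below; the table, queues and function names are invented for illustration and bear no relation to the X1's internal implementation.

```c
/* Minimal, hypothetical sketch of the packet-processor / traffic-manager
 * split described above. A real NPU does this in parallel hardware with
 * far larger tables and queues. */
#include <stdint.h>
#include <stdio.h>

#define NUM_QUEUES  8
#define QUEUE_DEPTH 4

typedef struct {
    uint32_t dst_addr;          /* destination looked up in forwarding table */
    uint8_t  class_of_service;  /* decides quality of service and queue      */
    uint16_t length;
} packet_t;

typedef struct { uint32_t prefix; uint16_t egress_port; } fib_entry_t;

/* Toy forwarding table: the look-up determines where the packet is sent. */
static const fib_entry_t fib[] = { {0x0A000000, 1}, {0xC0A80000, 2} };

static uint16_t lookup_egress_port(uint32_t dst) {
    for (size_t i = 0; i < sizeof fib / sizeof fib[0]; i++)
        if (fib[i].prefix == (dst & 0xFFFF0000)) return fib[i].egress_port;
    return 0;                                    /* default route */
}

/* Traffic-manager state: per-class queues with buffering and a drop policy. */
static packet_t queues[NUM_QUEUES][QUEUE_DEPTH];
static int      fill[NUM_QUEUES];

static void traffic_manager_enqueue(const packet_t *pkt, uint8_t q) {
    if (fill[q] == QUEUE_DEPTH) {                /* congestion: tail drop     */
        printf("drop: queue %u full\n", q);
        return;
    }
    queues[q][fill[q]++] = *pkt;                 /* buffer until scheduled    */
}

int main(void) {
    packet_t pkt = { 0x0A000001, 3, 512 };
    uint16_t port = lookup_egress_port(pkt.dst_addr);    /* classify + look up   */
    uint8_t  q    = pkt.class_of_service % NUM_QUEUES;   /* QoS class picks queue */
    printf("forward to port %u via queue %u\n", port, q);
    traffic_manager_enqueue(&pkt, q);
    return 0;
}
```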
Cisco stresses that having the network processing functionality in one chip benefits system design. "You have to have a level of integration that allows industry leading capacity in systems," said Jayaram. "The fewer discrete components on that board and the lower the power consumption, the more you can put on that system."
The NCS 6000, for example, is being shipped with 1Tbit/s line cards. Each card's face plate has slots for 10 CPAKs, the CPAK being Cisco's own 100Gbit/s optical module design implemented using silicon photonics. Ten such modules give 1Tbit/s of client-side traffic and, since the same traffic also crosses the card's fabric interface, some 2Tbit/s flows through the line card in total. Cisco has therefore crammed five NPUs on to the card to handle that 2Tbit/s.
Having such key technologies in-house is seen as critical for Cisco if it is to differentiate its platforms. In addition to designing custom asics and making the CPAK pluggable modules based on silicon photonics, Cisco also has coherent optical technology that enables its routers to support long-distance transmission.
Implementing the 4bn transistor X1 – which has a die area of 598mm² – for manufacture on a 40nm cmos process is clearly a significant undertaking, and is likely to have cost the company some $100m to develop.
The nPower X1's I/O includes 10, 40 and 100Gbit Ethernet media access controllers (MACs). Cisco is not releasing further details about the I/O, beyond saying that high-speed serdes are used for the chip's line side communications and to interface to the router's fabric.
The X1 has 336 dual-threaded, general-purpose cores and these are used to perform layer 2 to layer 7 packet processing. Cisco designed the cores using Tensilica's instruction set architecture. "These cores are fully programmable in high level languages such as C or Python," said Jayaram. Certain tasks – such as packet look-ups, hash functions and IP security – are off-loaded from the cores and performed instead on the X1's hardware accelerator engines.
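As an illustration of that partitioning – and only an illustration, since Cisco has not published its programming interface – per-packet feature code can be pictured as plain C running on a core, with hash and look-up operations notionally handed to accelerator engines. The hw_* functions below are invented stand-ins, stubbed in software.

```c
/* Hypothetical core/accelerator split: per-packet logic runs as ordinary C
 * on a general-purpose core, while hashing and look-ups would be dispatched
 * to dedicated engines. The hw_* names are illustrative, not Cisco's API. */
#include <stdint.h>
#include <stdio.h>

/* Stand-in for a hardware hash engine (FNV-1a used as a software placeholder). */
static uint32_t hw_hash(const uint8_t *data, int len) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < len; i++) { h ^= data[i]; h *= 16777619u; }
    return h;
}

/* Stand-in for a hardware look-up engine. */
static uint16_t hw_route_lookup(uint32_t flow_hash) {
    return (uint16_t)(flow_hash % 64);        /* pretend there are 64 egress ports */
}

/* Per-packet feature code, written in C and run on one of the cores. */
static void process_packet(const uint8_t *hdr, int hdr_len) {
    uint32_t flow = hw_hash(hdr, hdr_len);    /* off-loaded in real silicon */
    uint16_t port = hw_route_lookup(flow);    /* off-loaded in real silicon */
    printf("flow 0x%08x -> port %u\n", (unsigned)flow, (unsigned)port);
}

int main(void) {
    uint8_t header[20] = {0x45, 0x00, 0x00, 0x54, 0x0A, 0x00, 0x00, 0x01};
    process_packet(header, sizeof header);
    return 0;
}
```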
Cisco says it has worked with more than one memory company to develop custom external dram. "This is a custom memory chip with a custom memory die [made] for Cisco," said Jayaram. "The high capacity memory has extremely high access rates and low latency, a key benefit in IP packet processing." Twenty external channels of memory are connected to the X1.
The third aspect of the X1 that Jayaram highlighted is its ability to handle the multiple transactions associated with the IoE.
"You have to do sophisticated layer 2 through layer 7 stateful processing on all this traffic," said Jayaram. "The traffic flowing through the network processor; every one of those events requires a different type of processing."
This equates to a different set of instructions being run for each networking event. This degree of computation is possible due to the 336 programmable cores and the X1's three levels of on-chip cache.
Jayaram gave the example of a single packet entering the NPU carrying a specific set of information and requiring specific treatment. "It is not a flow, it just needs one specific set of actions performed on it," he said. Up to 300 million packets can flow through the network processor each second, and each might require a different action performed on it.
"What this [architecture] allows is that every single thread on every single processor can run a completely different feature set on different packets; different events concurrently with no loss in performance," Jayaram claimed. "Every single one of these events gets precisely the service it asks for; all at 400Gbit of throughput."
How the cores are managed is less clear, but Jayaram said on-chip hardware ensures that processing tasks are allocated fairly to the cores and that they are not overburdened.
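One way to picture such a mechanism, purely as a software model rather than a description of Cisco's hardware, is a dispatcher that hands each arriving packet to the least-loaded thread:

```c
/* Illustrative only: a software model of fair work allocation across threads.
 * Cisco has not described its mechanism; this simply assigns each new packet
 * to the least-loaded of N worker threads so that no core is overburdened. */
#include <stdio.h>

#define NUM_THREADS 672               /* 336 cores x 2 hardware threads */

static int outstanding[NUM_THREADS];  /* packets currently assigned per thread */

static int pick_least_loaded(void) {
    int best = 0;
    for (int t = 1; t < NUM_THREADS; t++)
        if (outstanding[t] < outstanding[best]) best = t;
    return best;
}

int main(void) {
    for (int pkt = 0; pkt < 10; pkt++) {       /* ten arriving packets */
        int t = pick_least_loaded();
        outstanding[t]++;                      /* dispatch packet to thread t */
        printf("packet %d -> thread %d\n", pkt, t);
    }
    return 0;
}
```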
Cisco has not revealed the X1's power consumption, but says the chip consumes a quarter of the power per bit compared with its previous-generation core router silicon.