A Peek Inside a 400G Cisco Network Chip (nextplatform.com)
176 points by protomyth on Sept 14, 2017 | 32 comments



As impressive as this is, it's likely 3-4 generations back from what's currently shipping. It's a switch-on-chip (SoC) set up for 12.5 Gb/s SerDes. As pointed out elsewhere, it might have been deployed with 16 10G front-panel ports, another 160G to a backplane ASIC, and some of the remaining ports for control plane use.

Reading between the lines a bit, it might have been used in one of the Nexus 5Ks, which would put it at around 8 years old, depending on how we're counting.


Wouldn't the fact that it's using a 22nm process mean that 8 years is at least a couple years too far?


Possibly, but it's in the right neighborhood. It depends on whether we're counting from the time of design/initial fabrication or actual mass production and general sale.

The point is that 12.5 Gb/s SerDes pretty much means it'd be limited to 10/40G in practice. At least in the DC networking world, that puts it several generations back.


NPUs are not used in data centers. They don't need the expensive but totally flexible network processing capabilities. Think carrier and enterprise access networks that have lower Ethernet bandwidth requirements, but higher processing requirements per packet.


This is an NPU, so for sure not in the 5k, or any other Nexus. Those are all fixed-function ASICs, in some cases (like the 3k, or the 5k's L3 module) actually Broadcom ASICs.

The original 5k (which was a Nuova Systems product before Cisco spun them in) had a 1040G crossbar (nicknamed Altos) and a bunch of 80G line chips (nicknamed Gatos).


I'm surprised that there's even a processor (as in a Turing-complete machine) in those ASICs at all --- at those speeds, I would've expected more along the lines of lots of hardcoded switching circuitry with any programmability restricted to lookup tables and configuration registers, closer to an FPGA than a CPU in concept. Certainly, a lot of slower switch ASICs are designed as hardcoded switches; here's one example:

https://www.intel.com/content/dam/www/public/us/en/documents...


That's not because it's slower; that's because it's the low-end (non-programmable) member of its family. It was designed by a company called Fulcrum, which produced the FocalPoint FM2000/FM3000/FM4000 (1/10G) and FM5000/FM6000 (10G/40G) switching ASICs before being acquired by Intel.

I think that a comparison to the FM4000, which is the programmable series of parts in the "Monaco" family, would be more fair. Here's their datasheet: https://www.intel.com/content/dam/www/public/us/en/documents...

The FocalPoint ASICs were, as far as I know, some of the first to support (a demo-quality implementation of) OpenFlow in hardware. When Intel bought them, they released the datasheets, which is neat.

As a real-world example, these ASICs were used in Arista's 7100 series (c. 2008) switches. They published a two-part "Technical Evaluation Guide" for those switches which is, among other things, an interesting glimpse into how switches are constructed out of ASICs. Part 1 (https://local.com.ua/forum/index.php?app=core&module=attach&...) shows the topology of each switch (starting on page 13).

The 7124 is a single 24-port FM4224 with all 24 ports connected to front-panel ports. The 7148S has three FM4224 ASICs; each is connected to 16 front-panel ports and uses 4 ports (40 Gb/s) to connect to each of the other two ASICs in a ring. Intuitively, this means that it's possible that those inter-ASIC connections could cause bottlenecks (if e.g. all 16 ports connected to the first ASIC try to send 160 Gbit/s of traffic to the 16 connected to the second ASIC, they'll saturate the 40 Gbit/s of connectivity between the ASICs). Therefore, Arista also offered the 7148SX, which is non-blocking but needs six (!) FM4224s to make it happen!
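To make that oversubscription concrete, here's a rough back-of-envelope, assuming all front-panel ports are 10G and the inter-ASIC links are the 4 x 10G described above (the port counts are from the guide; the rest is just arithmetic):

    # Back-of-envelope oversubscription check for the 7148S topology: 3 ASICs
    # in a ring, 16 x 10G front-panel ports per ASIC, 4 x 10G lanes to each
    # neighbouring ASIC.
    PORT_GBPS = 10
    FRONT_PANEL_PORTS_PER_ASIC = 16
    LANES_TO_ONE_NEIGHBOUR = 4
    NEIGHBOURS = 2                    # ring of three ASICs

    ingress = FRONT_PANEL_PORTS_PER_ASIC * PORT_GBPS        # 160 Gb/s offered
    one_neighbour = LANES_TO_ONE_NEIGHBOUR * PORT_GBPS      # 40 Gb/s
    both_neighbours = one_neighbour * NEIGHBOURS            # 80 Gb/s

    print(f"worst case (everything to one neighbour): {ingress} Gb/s into "
          f"{one_neighbour} Gb/s -> {ingress // one_neighbour}:1")
    print(f"best case (split across both neighbours): {ingress} Gb/s into "
          f"{both_neighbours} Gb/s -> {ingress // both_neighbours}:1")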


The 7148SX was not truly non-blocking. It was a Clos in a box: 4 chips in front (12 ports each) and 2 in the back, with each front-panel chip wiring 6 ports to each of the rear chips as a LAG. If the hashes worked out right it was non-blocking, but taking an Ixia out of the box and clicking go on an RFC full-mesh test would give you drops. The 7148S was created after the 7148SX shipped as a cost-reduced version for those whose workload fit the throughput characteristics. The 10GBASE-T version was the same 3-chip switch; the power draw of the 10G-T PHYs was quite high at the time.
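To see why "clicking go" drops packets, here's a toy simulation; random placement stands in for the real LAG hash (which this obviously isn't), and the 12 x 10G uplinks per front chip come from the topology described above. Capacity matches on paper, but two line-rate flows landing on the same uplink means drops:

    # Toy model of the Clos-in-a-box above: 12 line-rate 10G flows must spread
    # perfectly across 12 x 10G uplinks (2 rear chips x 6-link LAG) to avoid
    # drops. Random placement stands in for the hash; the real hash is
    # deterministic per flow, but an arbitrary traffic mix lands just as unevenly.
    import random

    UPLINKS = 12
    FLOWS = 12
    TRIALS = 100_000

    clean = 0
    for _ in range(TRIALS):
        per_link = [0] * UPLINKS
        for _ in range(FLOWS):
            per_link[random.randrange(UPLINKS)] += 1   # "hash" a flow onto a link
        if max(per_link) <= 1:                         # no uplink asked to carry >10G
            clean += 1

    print(f"trials with zero drops: {clean}/{TRIALS} ({clean / TRIALS:.4%})")
    # ~0.005% -- almost every random traffic pattern oversubscribes some uplink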


The FM4000 doesn't appear to have an actual programmable CPU either; just the same "lots of lookup tables and configuration registers" (but more of them).


The most interesting thing about the FMxxxx chips was that the pipeline was asynchronous.


Many routers use programmable NPUs because it gives them a 2-year advantage in time to market for new features. Also, the feature set of high-end routers is so rich that programmable cores may be less area than hardcoded logic.


Exactly this. Plus testing/verification/bug-workarounds all become much easier/possible.


Except for the TCAM, this chip looks a lot like a GPU.


That number seems stupendous, but I don't know enough to judge. My very casual understanding was that 100G was state of the art.


100G is not quite state-of-the-art anymore, but it is what is being deployed. This chip supports 10G ports and can do 400 Gb/s of aggregate throughput, but it does not support 100G front-panel ports from what I can see.

There are field trials and pre-production switch/route chips and optical/DWDM gear that support 400G Ethernet using 56 Gb/s SerDes. This chip uses older 12.5 Gb/s SerDes; current-gen 100 Gb/s gear uses 28 Gb/s SerDes.
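As a rough sketch of the lane math (nominal payload rates only; real line rates carry 64b/66b, PAM4 and FEC overhead, which is glossed over here):

    # Nominal lanes-per-port for the SerDes generations mentioned above.
    configs = [
        # (port Gb/s, payload per lane Gb/s, SerDes generation)
        (10,  10, "12.5G SerDes (this chip)"),
        (40,  10, "12.5G SerDes (this chip)"),
        (100, 25, "28G SerDes (current 100G gear)"),
        (400, 50, "56G SerDes (pre-production 400G gear)"),
    ]

    for port, per_lane, gen in configs:
        print(f"{port:>3}G port = {port // per_lane} x ~{per_lane}G lanes on {gen}")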


100G has been around for a while but isn't that common in use outside of extreme cases; 10G and 40G are still the most widely used ports. Considering that most switches and routers have a minimum of 10-20 ports (10G-40G), that's already a stupendous amount of bandwidth - and on routers, you can always add more cards. 100G routers/switches are very expensive.

Unless you are pushing Google/Facebook/Comcast-level traffic, there are very few use cases. Apparently, Google and Facebook use their own network hardware.


This isn't really true in the market today. 100G switches have emerged at roughly the same cost per port as 40G from a couple of years ago and, indeed, can often accept both 40 and 100 gig optics. Even at list price the cost per port for 100G switching has been under $1K for more than a year.

As a result it's actually fairly common to find new DC fabrics (read: inter-switch connections, not end hosts) being built with 100G because there's no significant economic disadvantage to doing so. That said, the pricing for inter-site 100G is still high enough that it hasn't commonly made its way to smaller organizations.


Google, Apple and Facebook also use commercial network products for the points in their networks where they need to go fast, just like everyone else.


> My very casual understanding was that 100G was state of the art.

I think you're having the same conceptual issue that I had when first reading it.

The 400G in the article title is referring to the bandwidth of the chip, not the bandwidth of any particular port.

The ports, as mentioned elsewhere here, were probably either 10G or perhaps 40G.


It is. Any links higher than that are pretty much all Nx100G, though there are a few carriers with (limited) 200G waves in production.

For anything over 100G (currently) you're making some sacrifices on distance and may even need some guard bands.

(Short-haul/intra-DC may well be a different story but I doubt that it's very different.)


In the DC, 100G is pretty common now.


True for routing. For optical, 200 Gb/s DP-16QAM is commercially available.


> The processor complex on this unnamed NPU has 672 processors, each with four threads per core.

Very cool.


My favorite bit:

> L2 instruction cache that also has an on-chip interconnect that links the clusters and caches to each other as well as packet storage, accelerators, on-chip memories, and DRAM controllers together. This interconnect runs at 1 GHz and has more than 9 Tb/sec of aggregate bandwidth

I keep seeing elephant guns and experimental German artillery when I read that!
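For what it's worth, here's the raw arithmetic on that quoted figure (how the fabric is actually split into links and widths isn't stated in the article):

    # >9 Tb/s of aggregate bandwidth at a 1 GHz clock works out to roughly
    # 9,000 bits -- on the order of a kilobyte -- crossing the on-chip
    # interconnect every cycle, summed over all links.
    aggregate_bps = 9e12      # 9 Tb/s
    clock_hz = 1e9            # 1 GHz

    bits_per_cycle = aggregate_bps / clock_hz
    print(f"{bits_per_cycle:,.0f} bits per cycle "
          f"(~{bits_per_cycle / 8 / 1024:.1f} KiB per cycle, aggregate)")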


Why is Cisco still using TCAM instead of HBM/HMC like Juniper?


A (T)CAM is specialized hardware for doing content-based lookups; specifically, it's used for parallel table searches based on comparator values.
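As a minimal software model of what that means (a ternary entry is a value/mask pair; a real TCAM compares the key against every entry in parallel in a single cycle and returns the highest-priority hit, which is exactly what makes it fast but power- and area-hungry; the prefix helper below is just for the example):

    # Software model of a ternary (T)CAM lookup: an entry matches if the key
    # agrees with its value on every bit the mask cares about. Entries are
    # checked in priority order here; real hardware checks them all at once.
    def tcam_match(key, entries):
        """entries: list of (value, mask) in priority order; mask bit 1 = care."""
        for index, (value, mask) in enumerate(entries):
            if (key & mask) == (value & mask):
                return index
        return None

    # Example: an IPv4 longest-prefix-style table, most-specific entries first.
    def prefix(addr, length):
        value = int.from_bytes(bytes(int(o) for o in addr.split(".")), "big")
        mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
        return value & mask, mask

    table = [
        prefix("10.1.2.0", 24),   # index 0: most specific
        prefix("10.1.0.0", 16),   # index 1
        prefix("0.0.0.0", 0),     # index 2: catch-all
    ]

    key, _ = prefix("10.1.2.99", 32)
    print("matched entry:", tcam_match(key, table))   # -> 0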


Ok this is uncanny, I just commented on the need for something similar yesterday:

https://news.ycombinator.com/item?id=15244655

I wish I could set up a notification of some kind to know when someone gets Erlang/Elixir running on this chip. It would be a great platform for stress-testing Go concurrency as well.

Beyond that, I would really like to see Octave running on it because it's the only approachable vector programming language that I know of. The holy grail for me is to be able to use something like the MATLAB libraries at 1000 times their current speed to simulate the interesting stuff.


These aren't chips that you'd use to run your own application on (they aren't x86 or ARM or anything like that). They're custom ASICs engineered to handle extremely high amounts of network traffic.


You should instead look out for the 1024-core Epiphany. It's a streaming processor, so it may not perform as well on random memory access, and there is no cache hierarchy, but it's very close to your previous comment about "Something on the order of 256 or more 1 GHz MIPS, ARM, or even early PowerPC processors."


That's really cool, thank you! I would actually prefer a cacheless architecture like that, because I don't think a cache really has a place in streaming or message-passing paradigms like Erlang or Go (it can still be relevant within the local address space of each process, but I don't feel the gain is worth it in most cases). Plus, the problem space is still large, so it might be better to let people discover alternative approaches to data locality like map-reduce/sharding, copy-on-write, and content-addressable memory.

I spent my teens writing blitters for shareware games and found that even then, the cache mostly got in the way. Processors like the PowerPC 603e had a pretty substantial cache miss penalty that cost me on the order of 5-20% depending on the situation. It was difficult to come up with appropriate cache hints for even relatively minor random access. I tried disabling the cache, but that made it even slower than a 601. So that's where my head is at, and the Epiphany sounds perfect. Here's a quick link for anyone curious:

https://www.parallella.org/2016/10/05/epiphany-v-a-1024-core...


Cisco is unlikely to ever make chips like this available other than tightly bundled inside their products.


Well, even if they would, where are you going to get the associated 'stuff' (the board, the drivers, the operating system, etc.) around it to make it work?



