I work on both.

Intel is closing the gap with DPDK, but Cavium creams them clock-for-clock and on cost of goods.

Cavium Octeon chips currently scale to 32 cores at 1.4 GHz, with ZIP, GZIP, AES, SHA1, etc. coprocessors running at 800 MHz. All cores share a fast, coherent, unified L2.

One of the key advantages of the Octeon architecture is its hardware work scheduling unit. This is essentially a highly programmable hash engine on packet fields (with some software-only bits, so software can classify a packet and then reschedule it). The idea is to ensure that no two packets with identical hashes are ever in flight at the same time, on any core.

If programmed correctly, this work scheduling prevents data structure contention, which is particularly problematic when you scale to 32 cores (and next-gen up to 48, then I believe 64).
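To make the scheduling idea concrete, here is a rough software model of it in C. This is only a sketch, not the Cavium SDK: `flow_key`, `flow_tag`, `sched_dispatch` and `sched_complete` are hypothetical names, FNV-1a stands in for whatever hash the hardware is programmed with, and the real unit does all of this in silicon. The point it shows is the invariant: a packet is not handed to a core while another packet with the same tag is still in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CORES 32

/* Hypothetical flow key: the packet fields the hash is computed over. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
};

/* Mix the low nbytes of v into an FNV-1a hash, one byte at a time. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h;
}

/* Stand-in for the hardware hash: same flow fields, same tag. */
static uint32_t flow_tag(const struct flow_key *k)
{
    uint32_t h = 2166136261u;
    h = fnv1a_mix(h, k->src_ip, 4);
    h = fnv1a_mix(h, k->dst_ip, 4);
    h = fnv1a_mix(h, k->src_port, 2);
    h = fnv1a_mix(h, k->dst_port, 2);
    h = fnv1a_mix(h, k->proto, 1);
    return h;
}

/* Toy scheduler state: which tag, if any, each core is working on. */
struct sched {
    uint32_t tag[MAX_CORES];
    bool     busy[MAX_CORES];
};

/*
 * Try to hand a packet with this tag to a core.
 * Returns the core index, or -1 if the packet has to wait, either
 * because the same tag is already in flight or because no core is idle.
 */
static int sched_dispatch(struct sched *s, uint32_t tag)
{
    int idle = -1;
    for (int c = 0; c < MAX_CORES; c++) {
        if (s->busy[c]) {
            if (s->tag[c] == tag)
                return -1;      /* same flow already in flight */
        } else if (idle < 0) {
            idle = c;           /* remember the first idle core */
        }
    }
    if (idle >= 0) {
        s->busy[idle] = true;
        s->tag[idle] = tag;
    }
    return idle;
}

/* A core reports that it has finished its current packet. */
static void sched_complete(struct sched *s, int core)
{
    assert(core >= 0 && core < MAX_CORES);
    s->busy[core] = false;
}
```

Because two packets from the same flow never run concurrently in this model, per-flow state (connection table entries, reassembly buffers and so on) can be touched without a lock, which is where the contention savings come from.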

The chips also support direct packet transport (XAUI, SGMII, etc.), rather than requiring transport across PCIe. Each of these ports can be programmed separately, so you can use goofy switch-specific encapsulation modes (Broadcom HiGig2, Marvell DSA, etc.) to support very quick traffic <-> physical port mappings.

I should also mention that Cavium scales down very well, all the way to configurations like 2 cores at 400 MHz for PoS, SOHO usage, and such. So it can be an attractive architecture to target.

Finally, the Octeon family's MIPS64 cores have a lot of extensions to MIPS64, like branch on bit, posted atomic operations (e.g. for statistics, where you don't care about the resulting value, you just want to += 42 it), pop count, fast bitfield subfield extract, etc.
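For anyone who hasn't used these, here are portable C11 equivalents of the operations those extensions speed up. This is a sketch of what the operations mean, not Octeon intrinsics: the helper names (`stat_add`, `popcount64`, `bits_extract`, `bit_is_set`) are made up for illustration, and whether a given compiler turns any of them into the corresponding Octeon instruction is an assumption I'm not making here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * "Posted" statistics update: add to a shared counter and ignore the
 * result. In portable C11 this is a relaxed fetch-add whose return
 * value is discarded.
 */
static void stat_add(_Atomic uint64_t *counter, uint64_t delta)
{
    (void)atomic_fetch_add_explicit(counter, delta, memory_order_relaxed);
}

/* Pop count: number of set bits, clearing the lowest set bit per step. */
static unsigned popcount64(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Bitfield extract: `width` bits of x starting at bit `pos` (bit 0 = LSB). */
static uint64_t bits_extract(uint64_t x, unsigned pos, unsigned width)
{
    assert(width >= 1 && pos < 64 && width <= 64 - pos);
    uint64_t mask = (width == 64) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << width) - 1);
    return (x >> pos) & mask;
}

/* The test that a branch-on-bit instruction folds into the branch. */
static bool bit_is_set(uint64_t x, unsigned bit)
{
    assert(bit < 64);
    return ((x >> bit) & 1u) != 0;
}
```

In a packet path these show up constantly: `stat_add(&rx_packets, 1)` per packet, `bits_extract` to pull fields out of headers, and `bit_is_set` on flag words, so having each as a single instruction adds up.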