Hunting a bug in the i40e Intel driver

rzezeski · on July 29, 2021

> During those tests, we noticed the machines were randomly freezing after some time, so we decided to upgrade the firmware of the network cards,

Reminds me of the various i40e Tx freezes I debugged while at Joyent. Granted, this is the illumos driver, not Intel's, but basically there were issues with the programming guide that I had to figure out the hard way. The 700-series controllers have not been the easiest to work with.

https://smartos.org/bugview/OS-7492 [Tx freeze when b_cont chain exceeds 8 descriptors]

https://smartos.org/bugview/OS-7457 [i40e Tx freezes on zero descriptors]

drewg123 · on July 29, 2021

This 8-descriptor-per-packet limit is HORRIFIC. I debugged and fixed this issue on FreeBSD when we first moved to the new iflib-based ixl (i40e's name on FreeBSD).

They had a routine (ixl_tso_detect_sparse()) which was AFU. I wrote a userspace unit test that proved it was AFU, and then fixed it. I fed them back the routine & the unit test, and they hilariously left my commented debug prints in the routine.

https://github.com/freebsd/freebsd-src/blob/412b5e40a721430a...

And their 100GbE NIC has the same limit, which is just so sad. All these fancy features, and they cannot handle more than 8 segments per emitted packet on the wire.
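
To make the limit concrete, here is a minimal userspace sketch of the kind of check a driver needs before handing a TSO chain to the hardware. It is not the FreeBSD ixl_tso_detect_sparse() routine and not code from any real driver. The function names, the `hdr_descs` parameter, and the choice to skip zero-length buffers are all assumptions made for illustration. It cuts the payload buffers into MSS-sized wire packets and reports the worst-case number of descriptors any one of them would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Limit described in the thread: at most 8 descriptors per packet on the wire. */
#define MAX_DESCS_PER_WIRE_PKT 8

/*
 * Hypothetical helper: walk the payload buffers, carve them into mss-sized
 * wire packets, and return the largest descriptor count any single wire
 * packet needs. hdr_descs is the (assumed) number of descriptors each wire
 * packet spends on headers. Zero-length buffers are skipped, on the
 * assumption that the caller never emits a descriptor for them.
 */
size_t
max_descs_per_segment(const size_t *buf_len, size_t nbufs, size_t mss,
    size_t hdr_descs)
{
    size_t worst = 0;
    size_t i = 0;   /* index of the current buffer */
    size_t off = 0; /* bytes already consumed from buffer i */

    assert(mss > 0);

    for (;;) {
        /* Move past exhausted or empty buffers before starting a packet. */
        while (i < nbufs && off == buf_len[i]) {
            i++;
            off = 0;
        }
        if (i == nbufs)
            break;

        size_t need = mss;
        size_t descs = hdr_descs;

        while (need > 0) {
            while (i < nbufs && off == buf_len[i]) {
                i++;
                off = 0;
            }
            if (i == nbufs)
                break; /* final, short wire packet */

            size_t avail = buf_len[i] - off;
            size_t take = avail < need ? avail : need;

            off += take;
            need -= take;
            descs++; /* one data descriptor per buffer touched */
        }
        if (descs > worst)
            worst = descs;
    }
    return worst;
}

/* True if no wire packet would exceed the descriptor limit. */
bool
tso_chain_fits(const size_t *buf_len, size_t nbufs, size_t mss,
    size_t hdr_descs)
{
    return max_descs_per_segment(buf_len, nbufs, mss, hdr_descs) <=
        MAX_DESCS_PER_WIRE_PKT;
}
```

With three 1000-byte buffers, an MSS of 1500, and one header descriptor, every wire packet touches two buffers and needs three descriptors, so the chain fits. With sixteen 100-byte buffers and an MSS of 1448, the first wire packet touches fifteen buffers and needs sixteen descriptors. A driver would have to coalesce (linearize) such a chain before transmitting it.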

comex · on July 30, 2021

phab · on July 30, 2021

all fucked up

sofixa · on July 30, 2021

It was the same with Intel's drivers. We had the same issues on ESXi servers with X710 NICs, first with VMware's repackaged driver and then with Intel's original i40e driver. Both worked terribly, either panicking the kernel or just freezing the NICs. It was a fun one to debug, but thankfully we only had to wait a few months (the issue had been known for a year) for Intel to come up with a fixed driver.

The bastards at VMware kept the buggy driver on their hardware compatibility list and kept shipping it for multiple versions, probably for another year.

SteveNuts · on July 30, 2021

Was the ixgbe driver not available?

sofixa · on July 30, 2021

IIRC the options were i40e and i40en, the latter resulting in daily crashes, so i40e it was :)

userbinator · on July 30, 2021

From your first one:

> Malicious Driver Detection

My reaction upon reading that line was "WTF." I haven't touched NIC drivers beyond the classic NE2000s, common Realteks, and the Intel 8254x, but it seems strange to have some sort of... antimalware feature in a NIC? Reminds me of the old BIOSes with "boot sector antivirus".

rzezeski · on July 30, 2021

Probably more to do with the fact that everything is moving towards virtualization. Oftentimes these NICs dole out VFs directly to VMs via SR-IOV, in which case I imagine the NIC controller has some safeguards to keep the host and the rest of the guests safe from denial-of-service and other attacks from a malicious guest driver.

nn3 · on July 29, 2021

Just to save you a somewhat pointless read, they didn't really debug anything but just found the right forum to ask.

kbenson · on July 29, 2021

They debugged the system, not the driver. The way they did that was to identify and confirm it was the driver that caused the problem and in what circumstances, so they could report it to the people responsible for actually dealing with that.

That's still a form of debugging. It's all a matter of perspective. If you had a hardware device that you were interacting with directly in an application, and you found that it crashed when you used it in a specific way, so you changed how the application used it so it wouldn't crash, that would be debugging the application, even if not really debugging the hardware.

MauranKilom · on July 29, 2021

As a counterpoint, I found the journey interesting and learned a lot about various tools on the way. Only caveat is that they didn't end up pinpointing the error - understandable, given that they are not paid to fix bugs in Intel code, and Intel having fixed the bug already in a newer version anyway.

cesarb · on July 30, 2021

But it seems Intel did end up pinpointing the error. The last link in the article ("Since then, Intel has removed the faulty driver from their website.") points to https://downloadmirror.intel.com/30190/eng/635390-TA-256.pdf which says "The driver instability was caused by an incomplete backport to i40e from the upstream kernel." Frustratingly, it doesn't give any more detail than that.

nik_0_0 · on July 30, 2021

It looks like the Intel out-of-tree driver is carrying around some legacy HAVE_PAGE_COUNT_BULK_UPDATE option that is making their porting efforts difficult.

This commit in upstream ends up getting split in half:

https://github.com/torvalds/linux/commit/8ce29c679a6ecefb88d...

With only 3 lines of it getting pulled into i40e-2.13.10:

https://github.com/dmarion/i40e/blob/master/src/i40e_txrx.c#...

(Can't link git diff line for 2.13.10->2.14.13 because diff is too big, annoying!)

And the final line getting pulled into i40e-2.14.13:

https://github.com/dmarion/i40e/commit/135d6d885aa4704180e10...

  --- if (unlikely(!pagecnt_bias)) {
  +++ if (unlikely(pagecnt_bias == 1)) {

Best thing I can find in i40e_txrx.c where a single patch in Linux upstream got split across 2.13.10 and 2.14.13. Not a smoking gun exactly, still some exercise left for the reader.

AceJohnny2 · on July 29, 2021

Not entirely pointless; they did provide some useful tips (I wasn't aware of BCC), but yeah, the story ends with them not resolving the issue and just using a different version of the driver that doesn't have the bug.