Hacker News
The worst bug we faced at Antithesis (antithesis.com)
56 points by wizerno 5 months ago | 18 comments



(Disclosure: I’m an Antithesis employee.)

It’s briefly mentioned in a footnote here, but we have a lot of debugging war stories around the hypervisor protocol, many of which could themselves be blog posts. My personal favorite: during a refactor of the component on the other end of the hypervisor, we expected a certain hyperproperty related to determinism to hold, but it only held some of the time, depending on the values of some parameters that were being randomized during our testing. We dug in and figured out that, because we were round-robining protocol messages from several proposers into several pipelines, determinism held iff the number of proposers divided the number of pipelines or vice versa, and failed completely if they were coprime! If they shared a common factor greater than 1 but neither divided the other, there was “partial determinism.” We very rarely ditch a suggested test property instead of trying to make it work, but that time we were defeated by number theory.
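A toy sketch of the arithmetic (my own illustration, not our actual protocol code), assuming a single global counter that round-robins over both the proposers and the pipelines:

  # Toy model: message k goes from proposer k % P into pipeline k % N.
  # Which (proposer, pipeline) pairings ever occur is controlled by gcd(P, N).
  from math import gcd

  def reachable_pairings(P: int, N: int) -> set[tuple[int, int]]:
      """All (proposer, pipeline) pairs hit before the pattern repeats."""
      return {(k % P, k % N) for k in range(P * N)}

  for P, N in [(2, 4), (3, 5), (4, 6)]:
      pairs = reachable_pairings(P, N)
      print(f"P={P}, N={N}, gcd={gcd(P, N)}: {len(pairs)} of {P * N} pairings occur")

  # P=2, N=4 (one divides the other): each pipeline only ever hears from one
  # proposer, so per-pipeline ordering is fixed.
  # P=3, N=5 (coprime): every pairing occurs, so every pipeline interleaves
  # every proposer and the result depends on arrival order.
  # P=4, N=6 (common factor 2, neither divides): something in between --
  # the "partial determinism" case.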


If you are ever building a platform and have control over everything, one thing that can make problems like this easier to find is to not use regular intervals like 5/15/30/60 minutes everywhere.

At some point you'll have a weird problem, or a load spike that shows up at regular intervals. If all of your intervals are 5/15/30 minutes, you'll have two things running every 15 minutes and three things running every 30 minutes, and you won't necessarily know which one is causing the issue.

If you use (co)prime numbers, say, 5/7/11/13/17/19 as intervals: One, you won't have a thundering herd of tasks all running at the exact same time every few minutes, and two, when someone notices a weird issue that happens every 17 minutes, you will know exactly what the cause is.
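Here's a quick sketch of both points (my own toy numbers, nothing from the article): the "everything fires at once" minute gets pushed out to the product of the periods, and an observed symptom period points at exactly one timer instead of several.

  from math import lcm

  round_intervals = [5, 15, 30, 60]         # the usual cron-style choices
  prime_intervals = [5, 7, 11, 13, 17, 19]

  print("round intervals all coincide every", lcm(*round_intervals), "minutes")  # 60
  print("prime intervals all coincide every", lcm(*prime_intervals), "minutes")  # 1,616,615 (~3 years)

  def fires_at(period: int, intervals: list[int]) -> list[int]:
      """Which timers fire at every occurrence of a symptom recurring at this period?"""
      return [i for i in intervals if period % i == 0]

  print(fires_at(30, round_intervals))   # [5, 15, 30] -- three timers share the blame
  print(fires_at(17, prime_intervals))   # [17]        -- exactly one suspect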


Wouldn't it have made it harder to find in this case if the DHCP lease time was 34 minutes?


If it's in your control, use co-prime scheduling. If it's not in your control, um... hope they didn't use co-primes? Or a multiple of your co-primes? Er, yeah. I see the justification for doing it, but it's not exactly a cure for complexity. It'll work up until everyone else catches up with your weird trick, and then there may be more collisions than there were before.

Edit: Yeah, I guess GP said "if you control everything". Still, how often do you actually control everything? Or how long can you control everything? Everything (heh) sufficiently complex to worry about this talks with other systems at some point, right?


I think you underestimate how badly people mess this up now, with piles and piles of cron jobs and timers all configured to run at 1/5/15/60 minute intervals.

If other things did this, there would not be "more collisions", since the status quo is multiples of 5-minute intervals everywhere.


Fair point. I'll definitely consider co-primes the next time I deploy a system with complicated scheduling.


For what it's worth, it also comes in super handy when troubleshooting existing systems. Say you're running a complicated system that sometimes has weird latency spikes at a 60s interval, and the config file has something like:

  - timeout: 1m
  - expireInterval: 1m
  - updateInterval: 5m
  - etc
The first thing I do is go in and change them all to

  - timeout: 59s
  - expireInterval: 61s
  - updateInterval: 307s
  - etc
and then see what the new "problem interval" is. Either it's gone, because the spike was contention from all three jobs running every 5 minutes, or the problem now shows up at 61s intervals.
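To put numbers on that (just illustrative arithmetic): the original timers all coincide every lcm(60, 60, 300) = 300 seconds, while 59/61/307 only line up every 59 * 61 * 307 seconds, roughly 12.8 days, so any spike that survives the change has to come from a single timer.

  from math import lcm

  print(lcm(60, 60, 300))                    # 300 seconds -- all three collide every 5 minutes
  print(lcm(59, 61, 307))                    # 1104893 seconds between three-way collisions
  print(round(lcm(59, 61, 307) / 86400, 1))  # ~12.8 days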


Why?


Great read!

But...

“Can you check /var/log/messages and see if there’s messages every 30 minutes about ENA going down and then back up?”

Isn't this "sysadmin 101"? Like... the first thing to check on any server exhibiting weird behaviour? :-) A message about a NIC going up & down every 30 min would have triggered many here instantly.

Interesting journey nevertheless!


It’s probable that they did do that, but also that the network issue didn’t appear related, even though it’s suspect on its own.


Seems like the other lesson is that every time you add a 9 to your uptime by fixing a bug, the next issue takes longer to find, whether in wall-clock time or dev time.


This reads like some of the same things faced in natural science, where you don't have to be an entomologist for your primary goal to be elucidating the complex variety of creepy irregularities nature throws at you, whether they were expected or not, and whether there will ever be a solution or progress or not.

It's a good writeup overall, but it's amazing how this bit applies to challenging scientific problems that have nothing to do with code, so try to read it from that point of view:

>One of the highest-productivity things your team can do is not have any “mysterious” bugs, so any new symptom that appears will instantly stand out. That way, it can be investigated while the code changes that produced it are still fresh in your mind.

>A rare, stochastic, poorly understood machine crash would completely poison our efforts to eradicate mysterious bugs. Any time a machine crashed, we would be tempted to dismiss it with, “Oh, it’s probably that weird rare issue again.” We decided that with this bug in the background, it would be impossible to maintain the discipline of digging into every machine failure and thoroughly characterizing it. Then more and more such issues would creep into our work, slowly degrading the quality of our systems.

>There are many people who say that a “zero bugs” mindset is excessive because, for rare bugs, the cost of fixing them exceeds the cost of living with them. But I find these people are rarely considering the indirect costs of rare bugs – on team velocity, discipline, and culture.

Some industries are so risky that the "zero defects" approach goes back to well before software was involved; that attitude can be practiced on things other than code, and it can definitely be applied to advantage when coding.

In fields like experimental chemistry, where there's a growing layer of electronics, computers, and software on top, and where one of the main goals can be striving for more "9's", this is another wide opening for discrepancies.

Bugs propagate even worse in nature.


So why the 8-minute offset? I don't think they ever said.


EC2 bare metal instances take a long time to boot. The machine was probably running for 8 minutes before DHCP started up (and then it got a new response every 30 minutes after that).


Kudos. We have a similar unknown bug at work, so we’ll see how it goes as we scale. Folks aren’t currently giving the fix too high of a priority, but I suspect it will become a real problem soon enough.


If your similar unknown bug is on FreeBSD/EC2, I want to hear about it!


I'm curious what the fix was, presumably just retry?


The fix was to teach the ENA driver that "set the MTU to the value it already has" should be a no-op. With that change, the interface didn't bounce.



