> Let’s get more concrete. Let’s use this to solve a real problem. My server has...

wwilson · 2024-09-10T14:24:21 1725978261

Yes, unfortunately we have not figured out how to rewind time in the real world yet. When we do, there are a lot of choices I'm going to revisit...

abeppu · 2024-09-10T14:36:10 1725978970

... but the intro makes it sound like this system is valuable in investigating bugs that occurred in prod systems:

> I’ve been involved in too many production outages and emergencies whose aftermath felt just like that. Eventually all the alerts and alarms get resolved and the error rates creep back down. And then what? Cordon the servers off with yellow police tape? The bug that caused the outage is there in your code somewhere, but it may have taken some outrageously specific circumstances to trigger it.

So practically, if a production outage (where I think "production" means it cannot be in a simulated environment, since the customers you're serving are real) is caused by very specific circumstances, and your production system records some, but not every, attribute of its inputs and state ... how does one make use of Antithesis? Concretely, when you have a fully deterministic system that can help your investigation, but you have only a partial view of the conditions that caused the bug ... how do you proceed?

I feel like this post is over-promising, but perhaps there's something I just don't understand, since I've never worked with a tool set like this.

jackschu · 2024-09-10T18:09:30 1725991770

(I work at Antithesis)

I think you're right that the framing leans towards providing value in prod issues, but we left out how we provide value there. I think you're also right that we're just used to experiencing the value here, but it needs some explanation.

Basically, this is where guided, tree-based fuzzing comes in. If something in the real world is caused by very specific circumstances, we're well positioned to have also generated those specific circumstances. This is thanks to parallelism, intelligent exploration, fault injection, our ability to revisit interesting states in the past with fast snapshots, etc.

We've had some super notable instances of a customer finding a bug in prod, recalling that it's the weird bug they'd been ignoring since we surfaced it a month earlier, and then using this approach to debug it.

The best docs on this are probably here: https://antithesis.com/docs/introduction/how_antithesis_work...

yellow_lead · 2024-09-10T16:04:00 1725984240

This was my thinking as well. Prod environments can be extremely complicated and issues often come down to specific configuration or data issues in production. So I had a lot of trouble understanding how the premise is connected to the product here.

qarl · 2024-09-10T14:29:02 1725978542

> Yes, unfortunately we have not figured out how to rewind time in the real world yet.

10 bucks says you get complaints for not implementing the "real world" feature.