At timestamp 23:40 in the video by Alex Pshenichkin from 2024-06-10, it says data ingestion comes via VMCALL interactions. Since such a call is literal nonsense if you are not virtualized, any such call inherently means you are using a paravirtualized interface. Now maybe FreeBSD has enough standardized paravirtualized drivers, similar to virtio, that you can just link it up, but that would still be a paravirtualization solution with manual rewrites; it's just that somebody else already did the manual rewrites. Has the fundamental design changed in the last 3 months?
This is exactly a replay engine (or I guess you could say replay engines are deterministic simulators). How do you think you replay a recording except with a deterministic execution system that injects the non-deterministic inputs at precise execution points? This is literally how all replay engines work. Furthermore, how do you think recordings work except by recording the inputs? That is literally how all recording systems designed to feed replay engines work. The only distinction is what constitutes non-determinism in a given context: at the whole-hypervisor level, it is just I/O into the guest; at the process level, it is just system calls that write into the process; at the threading level, it is all writes into the process. These distinctions are somewhat interesting at an implementation level, but they do not change the fundamental character of the solution, which is that they are all replay engines, or deterministic simulators, whatever you want to call them.
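To make that concrete, here's a toy sketch of the shape I mean (my own illustration, not anyone's actual engine): a single wrapper is the choke point for non-determinism; in record mode it logs each value, and in replay mode it feeds the log back at the same execution points.

```
/* Toy record/replay sketch (illustrative only): every non-deterministic
 * value the program consumes goes through one wrapper. Record mode logs
 * the value; replay mode returns the logged value at the same point,
 * so the run is repeatable. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static FILE *log_file;
static int replaying;

/* The single choke point for non-determinism in this toy. */
static long nondet_input(void)
{
    long v;
    if (replaying) {
        if (fscanf(log_file, "%ld", &v) != 1) {
            fprintf(stderr, "input log exhausted\n");
            exit(1);
        }
    } else {
        v = random() ^ (long)time(NULL);   /* some "external" value */
        fprintf(log_file, "%ld\n", v);     /* record it at this execution point */
    }
    return v;
}

int main(int argc, char **argv)
{
    replaying = (argc > 1);                /* any argument means "replay" */
    log_file = fopen("inputs.log", replaying ? "r" : "w");
    if (!log_file) { perror("inputs.log"); return 1; }

    srandom((unsigned)getpid());           /* deliberately different every run */
    long sum = 0;
    for (int i = 0; i < 5; i++)
        sum += nondet_input();
    printf("sum = %ld\n", sum);            /* identical under replay */
    fclose(log_file);
    return 0;
}
```

Run it once to record, then again with any argument to replay, and the output is identical despite the randomness. The whole argument above is just about where that choke point sits: hypervisor I/O, syscalls, or every write into the process.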
You don’t need to reverse time if you can deterministically reproduce everything that led up to the point of interest. (In practice we save a snapshot of your system at some intermediate point and replay from there.)
Any third party service does need to be mocked or stubbed out. We have a partnership with Localstack that lets us provide very polished AWS mocks that require zero configuration on your part (https://antithesis.com/docs/using_antithesis/environment.htm...).
If you need something else, reach out and ask us about it, because we have a few of them in the pipeline.
You're right that if you tried to do something like this using record/replay, you would pay an enormous cost. Antithesis does not use record/replay, but rather a deterministic hypervisor (https://antithesis.com/blog/deterministic_hypervisor/). So all we have to remember is the set of inputs/changes to entropy that got us somewhere, not the result of every system operation.
The classic time-space tradeoff question:
If I run Antithesis for some amount of time X, say 4 hours, do you take periodic snapshots/deltas of state, so that I don't have to re-run the capture for O(4 hours) from scratch just to go back 5 seconds?
... but the intro makes it sound like this system is valuable in investigating bugs that occurred in prod systems:
> I’ve been involved in too many production outages and emergencies whose aftermath felt just like that. Eventually all the alerts and alarms get resolved and the error rates creep back down. And then what? Cordon the servers off with yellow police tape? The bug that caused the outage is there in your code somewhere, but it may have taken some outrageously specific circumstances to trigger it.
So practically, if a production outage (where I think "production" means it cannot be in a simulated environment, since the customers you're serving are real) is caused by very specific circumstances, and your production system records some, but not every attribute of its inputs and state ... how does one make use of antithesis? Concretely, when you have a fully-deterministic system that can help your investigation, but you have only a partial view of the conditions that caused the bug ... how do you proceed?
I feel like this post is over-promising but perhaps there's something I just don't understand since I've never worked with a tool set like this.
I think you're right that the framing leans towards providing value for prod issues, but we left out how we actually provide value there. We're just so used to experiencing the value here that we forgot it needs some explanation.
Basically this is where guided, tree-based fuzzing comes in. If something in the real world is caused by very specific circumstances, we're well positioned to have also generated those specific circumstances. This is thanks to parallelism, intelligent exploration, fault injection, our ability to revisit interesting states in the past with fast snapshots, etc.
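For a rough sense of what "tree-based" means here, a toy sketch (just an illustration of the general idea, not our implementation): because execution is deterministic, the input sequence that reached a state can stand in for a snapshot, and exploration keeps branching off the prefixes that uncovered something new.

```
/* Toy guided-exploration sketch (illustrative only): a "state" is just the
 * input sequence that reached it, which determinism makes a valid stand-in
 * for a snapshot. We keep a tree of interesting states and keep extending
 * randomly chosen ones. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_NODES 1024
#define MAX_LEN   64

struct node { int inputs[MAX_LEN]; int len; };
static struct node tree[MAX_NODES];
static int n_nodes = 1;                    /* node 0 is the initial state */

/* Stand-in for the system under test: returns a fake coverage signature. */
static unsigned run_target(const int *inputs, int len)
{
    unsigned sig = 0;
    for (int i = 0; i < len; i++)
        sig = sig * 31u + (unsigned)(inputs[i] % 7);   /* pretend branching */
    return sig;
}

int main(void)
{
    static unsigned seen[MAX_NODES];
    int n_seen = 0;
    srandom(42);

    for (int iter = 0; iter < 10000 && n_nodes < MAX_NODES; iter++) {
        struct node child = tree[random() % n_nodes];  /* revisit a past state */
        if (child.len >= MAX_LEN)
            continue;
        child.inputs[child.len++] = (int)(random() % 100);  /* one new input */

        unsigned sig = run_target(child.inputs, child.len);
        int novel = 1;
        for (int i = 0; i < n_seen; i++)
            if (seen[i] == sig) { novel = 0; break; }
        if (novel) {                       /* interesting: keep it in the tree */
            seen[n_seen++] = sig;
            tree[n_nodes++] = child;
        }
    }
    printf("kept %d interesting states\n", n_nodes);
    return 0;
}
```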
We've had some super notable instances where a customer finds a bug in prod, recalls that it's that weird bug they've been ignoring since we surfaced it a month ago, and then uses this approach to debug it.
This was my thinking as well. Prod environments can be extremely complicated and issues often come down to specific configuration or data issues in production. So I had a lot of trouble understanding how the premise is connected to the product here.
The simulation is a completely generic Linux system, so we can run anything (including NodeJS). If your build tool can produce Docker containers, then it will work with us.
In my opinion, generative AI pictures make a blog post feel cheaper and less truthful. Just my view; I fully accept that I'm probably in the minority here.
FWIW, I enjoyed how the pictures were adding a little theme, were consistent and broke up the reading nicely without being too "noisy" (compared to e.g. technical articles full of meme pictures).
I would like to half-seriously recommend that you overwrite a different character than the first when mangling the environment variable name. Specifically, one beyond the third, so as to stay within the LD_ namespace (not yours exactly, but at least easier to keep track of and more justifiably excludable from random applications) and to deny someone ten years from now the exciting journey of figuring out why their MD_PRELOAD environment variable is overwritten with garbage on some systems. How do you feel about LD_PRELOAF?
Also it's probably better to leave LD_PRELOAD properly unset rather than just null if it was unset before; in particular I wonder if empty-but-set might still trip some software's “someone is playing tricks” alarms.
There are probably other ways this is less than robust…
(hi, I kind of have a Thing for GNU and Linux innards sometimes)
Good suggestion on leaving LD_PRELOAD unset if it was previously unset. We will fix that.
I’m torn on whether MD_PRELOAD or LD_PRELOAF is more obnoxious to other programs.
Fun fact: A previous version of this program used an even more inscrutable `env[0][0]+=1`, which is great as a sort of multilingual C/English pun, but terrible in the way that all “clever” code is terrible.
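For the curious, the general shape of that mangling is something like this (a simplified sketch, not the actual code):

```
/* Simplified sketch of the mangling being discussed (not the real code):
 * find LD_PRELOAD in the environment and overwrite one character past the
 * "LD_" prefix in place, so the loader no longer matches it but the mangled
 * name stays inside the LD_ namespace. */
#include <string.h>

extern char **environ;

static void defang_ld_preload(void)
{
    for (char **e = environ; *e; e++) {
        if (strncmp(*e, "LD_PRELOAD=", 11) == 0) {
            (*e)[9] = 'F';   /* "LD_PRELOAD=..." becomes "LD_PRELOAF=..." */
            break;
        }
    }
}
```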
As an aside, a lot of people don't know about ldd, and introducing it to them is very cool, but it should almost always come with a warning: ldd _may_ execute arbitrary code. This is in the ldd man page, but most people never read documentation. It is unsafe to use on any binary you don't otherwise believe to be safe.
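One way around that risk is to read the dependency list out of the ELF file yourself instead of letting ldd run anything. A rough sketch of the idea, handling 64-bit ELF only and with minimal error checking:

```
/* Rough sketch: print DT_NEEDED entries by parsing the ELF file directly,
 * rather than running ldd on it. 64-bit ELF with section headers only;
 * a robust tool would also handle the program-header / PT_DYNAMIC path. */
#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <elf-file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    Elf64_Ehdr *eh = (Elf64_Ehdr *)base;
    if (memcmp(eh->e_ident, ELFMAG, SELFMAG) != 0 ||
        eh->e_ident[EI_CLASS] != ELFCLASS64) {
        fprintf(stderr, "not a 64-bit ELF file\n");
        return 1;
    }

    Elf64_Shdr *sh = (Elf64_Shdr *)(base + eh->e_shoff);
    for (int i = 0; i < eh->e_shnum; i++) {
        if (sh[i].sh_type != SHT_DYNAMIC)
            continue;
        /* sh_link of the dynamic section points at its string table */
        const char *strtab = base + sh[sh[i].sh_link].sh_offset;
        for (Elf64_Dyn *dyn = (Elf64_Dyn *)(base + sh[i].sh_offset);
             dyn->d_tag != DT_NULL; dyn++)
            if (dyn->d_tag == DT_NEEDED)
                printf("NEEDED %s\n", strtab + dyn->d_un.d_val);
    }
    return 0;
}
```

(In practice `readelf -d` or `objdump -p` will give you the same list, also without executing anything.)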
> This minimal meta-loader will totally work if you invoke it directly like `$ meta_loader.sh foo`, and it will totally not work if you hardcode its path (or a symlink to it) in the ELF headers of a binary.
why not have `foo` be a shell script which invokes the meta loader on the "real" foo? like:
```
#!/bin/sh
# file: /bin/foo
# invoke the real "foo" (renamed e.g. ".foo-wrapped" or "/libexec/foo" or anything else easy for the loader to locate but unlikely to be invoked accidentally)
exec meta_loader.sh .foo-wrapped "$@"
```
it's a common enough idiom that nixpkgs provides the `wrapProgram` function to generate these kinds of wrapper scripts during your build, and there's even an option to build a statically-linked binary wrapper instead of a shell-script wrapper (`makeBinaryWrapper`).
Love it. I came to this same insight about nix and containers being two approaches to working around dynamic linking, but via a different path: building my own little container runtime.
Feels like we are building things whose original purpose is now holding us back, but path dependence leaves us stuck wrapping abstractions in other abstractions.
Why did you want to use Nix to make impure binaries for other distros? Much of the appeal of it for me is using it to distribute software in a more reliably portable way, but that of course always means shipping a whole chunk of /nix/store, one way or another.
What made your team/company want to use Nix to build binaries and then strip them down for old-fashioned, dependency hell-ish distribution? Why not install Nix on your target systems or use Nix bundle, generate containers, etc.?
It's a shame that DLL hell was never resolved in the obvious way: deduplication of identical libraries through cryptographic hashes. Containers basically threw away any hope of sharing the bytes on disk - and more importantly _in ram_. Disk bytes are cheap, ram bytes are not, let alone TLB space, branch predictor context, and so on.
There was a middle ground possible at one point, where containers were still packaged with all of their dependencies, but a container installer would fragment that assembly into cryptographically verifiable shared dependencies. We lost that because it was hard.
The container runtimes have to cope with Dockerfiles and similar, which know nothing about packages. To get the kind of granularity you want here, you have to do actual packaging work, which is the thing Docker sold everyone on avoiding.
If you are willing to do that kind of packaging work you can get the best of both worlds today with Nix or Guix. But containers are attractive because you can chuck whatever pathological build process your developers have evolved over the decades into a Containerfile and it'll mostly work.
> deduplication of identical libraries through cryptographic hashes
Or, maybe, adding a version string to the file name, so that if you were compiled against the data structures for libFoo1 (declared in the libFoo.h provided by libFoo1-devel) you’ll link to libFoo1 and not libFoo or libFoo2.
I'd like to express that while this article is WAY outside my wheelhouse, I liked the writing style, and the AI illustrations felt like they were emotionally additive to the section rather than just a distraction (to me). Also, my head hurts still trying to understand this cursed thing: https://github.com/antithesishq/madness/blob/main/pkgs/madne...
Hi Will, I'm curious what your thoughts are about the Nix uniqueness problem, and the characterization of failures, or lack thereof, under undefined behavior's failure domains. Exception handling generally requires a defined and deterministic state, which can't be guaranteed given the design choices made to resolve DLL hell under Nix (i.e. it's a stochastic process).
I mention this since it is a similar form of the problem you mention in writing this piece of software, one that can lead to madness.
Also, operationally, the troubleshooting problem-space of keeping things running segments nicely into deterministic and non-deterministic regions, and the latter ends up costing orders of magnitude more time to resolve, since you can't perturb individual subsystems to test for correct function. Without determinism and time-invariance as system properties, testing piecemeal runs into contradictions in stochastic processes.
Hashing is, by rigorous definition, non-unique (i.e. it's like navigating a circle), and there is no proof of uniformity. So problems in this space would fall in the latter region.
While there are heuristics from cryptography suggesting that using fractional cube roots to initialize the fields brings more uniformity to the examined space than not, there is no proof of this.
When building resilient systems, engineers often try to remove any brittle features that promote failures.
Interestingly, as a side note, ldd's output injects non-determinism into the pipe by flattening empty columns non-deterministically. If you ldd the ssh client, you'll see that the null state for each input-to-output mapping has more than a single meaning/edge on the traversal depending on object type. This violates the 1:1 unique input-output state graph/map required for determinism as a property, though it won't be evident until you use it as an input to later automation that maps it problematically (e.g. grepping the output with a regex will silently fail, producing what looks like legitimate output if one doesn't look too closely).
PaX ended up forking the project with the fix because the maintainers refused to admit the problem (reported in 2016, forked in 2018); the bug remains in all current versions of ldd (to my knowledge).
While based in theory, these types of problems crop up everywhere in computation and few seem to recognize them.
Working with a system's properties, and whether they are preserved, informs whether the system can be safely and consistently used in later automated processes, as well as maintained at low cost.
Businesses generally need a supportable and defensible infrastructure.