At timestamp 23:40 in the video by Alex Pshenichkin from 2024-06-10, it says data ingestion comes via VMCALL interactions. Since such a call is literal nonsense if you are not virtualized, any such call inherently means you are using a paravirtualized interface. Now maybe FreeBSD has enough standardized paravirtualized drivers, similar to virtio, that you can just link it up, but that would still be a paravirtualization solution with manual rewrites; it's just that somebody else already did the manual rewrites. Has the fundamental design changed in the last 3 months?
This is exactly a replay engine (or I guess you could say replay engines are deterministic simulators). How do you think you replay a recording except with a deterministic execution system that injects the non-deterministic inputs at precise execution points? This is literally how all replay engines work. Furthermore, how do you think recordings work except by recording the inputs? That is literally how all recording systems designed to feed replay engines work. The only distinction is what constitutes non-determinism in a given context: at the whole-hypervisor level, it is just I/O into the guest; at the process level, it is just system calls that write into the process; at the threading level, it is all writes into the process. These distinctions are somewhat interesting at an implementation level, but they do not change the fundamental character of the solution, which is that they are all replay engines, or deterministic simulators, whatever you want to call them.
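To make that concrete, here's a toy sketch of the shape I mean (my own illustration, not anyone's actual engine): a single wrapper is the choke point for non-determinism; in record mode it logs each value, and in replay mode it feeds the log back at the same execution points.

```
/* Toy record/replay sketch (illustrative only): every non-deterministic
 * value the program consumes goes through one wrapper. Record mode logs
 * the value; replay mode returns the logged value at the same point,
 * so the run is repeatable. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static FILE *log_file;
static int replaying;

/* The single choke point for non-determinism in this toy. */
static long nondet_input(void)
{
    long v;
    if (replaying) {
        if (fscanf(log_file, "%ld", &v) != 1) {
            fprintf(stderr, "input log exhausted\n");
            exit(1);
        }
    } else {
        v = random() ^ (long)time(NULL);   /* some "external" value */
        fprintf(log_file, "%ld\n", v);     /* record it at this execution point */
    }
    return v;
}

int main(int argc, char **argv)
{
    replaying = (argc > 1);                /* any argument means "replay" */
    log_file = fopen("inputs.log", replaying ? "r" : "w");
    if (!log_file) { perror("inputs.log"); return 1; }

    srandom((unsigned)getpid());           /* deliberately different every run */
    long sum = 0;
    for (int i = 0; i < 5; i++)
        sum += nondet_input();
    printf("sum = %ld\n", sum);            /* identical under replay */
    fclose(log_file);
    return 0;
}
```

Run it once to record, then again with any argument to replay, and the output is identical despite the randomness. The whole argument above is just about where that choke point sits: hypervisor I/O, syscalls, or every write into the process.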
You don’t need to reverse time if you can deterministically reproduce everything that led up to the point of interest. (In practice we save a snapshot of your system at some intermediate point and replay from there.)
Any third party service does need to be mocked or stubbed out. We have a partnership with Localstack that lets us provide very polished AWS mocks that require zero configuration on your part (https://antithesis.com/docs/using_antithesis/environment.htm...).
If you need something else, reach out and ask us about it, because we have a few of them in the pipeline.
You're right that if you tried to do something like this using record/replay, you would pay an enormous cost. Antithesis does not use record/replay, but rather a deterministic hypervisor (https://antithesis.com/blog/deterministic_hypervisor/). So all we have to remember is the set of inputs/changes to entropy that got us somewhere, not the result of every system operation.
The classic time-space tradeoff question:
If I run Antithesis for some amount of time X, say 4 hours, do you take periodic snapshots/deltas of state, so that I don't have to re-run the capture for O(4 hours) from scratch just to go back 5 seconds?
... but the intro makes it sound like this system is valuable in investigating bugs that occurred in prod systems:
> I’ve been involved in too many production outages and emergencies whose aftermath felt just like that. Eventually all the alerts and alarms get resolved and the error rates creep back down. And then what? Cordon the servers off with yellow police tape? The bug that caused the outage is there in your code somewhere, but it may have taken some outrageously specific circumstances to trigger it.
So practically, if a production outage (where I think "production" means it cannot be in a simulated environment, since the customers you're serving are real) is caused by very specific circumstances, and your production system records some, but not every attribute of its inputs and state ... how does one make use of antithesis? Concretely, when you have a fully-deterministic system that can help your investigation, but you have only a partial view of the conditions that caused the bug ... how do you proceed?
I feel like this post is over-promising but perhaps there's something I just don't understand since I've never worked with a tool set like this.
I think you're right that the framing leans towards providing value for prod issues, but we left out how we actually provide value there. We're just so used to experiencing the value here that we forgot it needs some explanation.
Basically this is where guided, tree-based fuzzing comes in. If something in the real world is caused by very specific circumstances, we're well positioned to have also generated those specific circumstances. This is thanks to parallelism, intelligent exploration, fault injection, our ability to revisit interesting states in the past with fast snapshots, etc.
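For a rough sense of what "tree-based" means here, a toy sketch (just an illustration of the general idea, not our implementation): because execution is deterministic, the input sequence that reached a state can stand in for a snapshot, and exploration keeps branching off the prefixes that uncovered something new.

```
/* Toy guided-exploration sketch (illustrative only): a "state" is just the
 * input sequence that reached it, which determinism makes a valid stand-in
 * for a snapshot. We keep a tree of interesting states and keep extending
 * randomly chosen ones. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_NODES 1024
#define MAX_LEN   64

struct node { int inputs[MAX_LEN]; int len; };
static struct node tree[MAX_NODES];
static int n_nodes = 1;                    /* node 0 is the initial state */

/* Stand-in for the system under test: returns a fake coverage signature. */
static unsigned run_target(const int *inputs, int len)
{
    unsigned sig = 0;
    for (int i = 0; i < len; i++)
        sig = sig * 31u + (unsigned)(inputs[i] % 7);   /* pretend branching */
    return sig;
}

int main(void)
{
    static unsigned seen[MAX_NODES];
    int n_seen = 0;
    srandom(42);

    for (int iter = 0; iter < 10000 && n_nodes < MAX_NODES; iter++) {
        struct node child = tree[random() % n_nodes];  /* revisit a past state */
        if (child.len >= MAX_LEN)
            continue;
        child.inputs[child.len++] = (int)(random() % 100);  /* one new input */

        unsigned sig = run_target(child.inputs, child.len);
        int novel = 1;
        for (int i = 0; i < n_seen; i++)
            if (seen[i] == sig) { novel = 0; break; }
        if (novel) {                       /* interesting: keep it in the tree */
            seen[n_seen++] = sig;
            tree[n_nodes++] = child;
        }
    }
    printf("kept %d interesting states\n", n_nodes);
    return 0;
}
```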
We've had some super notable instances where a customer finds a bug in prod, recalls that it's that weird bug they've been ignoring since we surfaced it a month ago, and then uses this approach to debug it.
This was my thinking as well. Prod environments can be extremely complicated and issues often come down to specific configuration or data issues in production. So I had a lot of trouble understanding how the premise is connected to the product here.
The simulation is a completely generic Linux system, so we can run anything (including NodeJS). If your build tool can produce Docker containers, then it will work with us.
In my opinion, generative AI pictures make a blog post feel cheaper and less truthful. Just my view; I fully accept that I'm probably in the minority here.
FWIW, I enjoyed how the pictures were adding a little theme, were consistent and broke up the reading nicely without being too "noisy" (compared to e.g. technical articles full of meme pictures).
I would like to half-seriously recommend that you overwrite a different character than the first when mangling the environment variable name. Specifically, one beyond the third, so as to stay within the LD_ namespace (not yours exactly, but at least easier to keep track of and more justifiably excludable from random applications) and to deny someone ten years from now the exciting journey of figuring out why their MD_PRELOAD environment variable is overwritten with garbage on some systems. How do you feel about LD_PRELOAF?
Also it's probably better to leave LD_PRELOAD properly unset rather than just null if it was unset before; in particular I wonder if empty-but-set might still trip some software's “someone is playing tricks” alarms.
There are probably other ways this is less than robust…
(hi, I kind of have a Thing for GNU and Linux innards sometimes)
Good suggestion on leaving LD_PRELOAD unset if it was previously unset. We will fix that.
I’m torn on whether MD_PRELOAD or LD_PRELOAF is more obnoxious to other programs.
Fun fact: A previous version of this program used an even more inscrutable `env[0][0]+=1`, which is great as a sort of multilingual C/English pun, but terrible in the way that all “clever” code is terrible.
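For the curious, the general shape of that mangling is something like this (a simplified sketch, not the actual code):

```
/* Simplified sketch of the mangling being discussed (not the real code):
 * find LD_PRELOAD in the environment and overwrite one character past the
 * "LD_" prefix in place, so the loader no longer matches it but the mangled
 * name stays inside the LD_ namespace. */
#include <string.h>

extern char **environ;

static void defang_ld_preload(void)
{
    for (char **e = environ; *e; e++) {
        if (strncmp(*e, "LD_PRELOAD=", 11) == 0) {
            (*e)[9] = 'F';   /* "LD_PRELOAD=..." becomes "LD_PRELOAF=..." */
            break;
        }
    }
}
```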
As an aside, a lot of people don't know about ldd, and introducing it to them is very cool, but it should almost always come with a warning: ldd _may_ execute arbitrary code. This is in the ldd man page, but most people never read documentation. It is unsafe to use on any binary you don't otherwise believe to be safe.
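One way around that risk is to read the dependency list out of the ELF file yourself instead of letting ldd run anything. A rough sketch of the idea, handling 64-bit ELF only and with minimal error checking:

```
/* Rough sketch: print DT_NEEDED entries by parsing the ELF file directly,
 * rather than running ldd on it. 64-bit ELF with section headers only;
 * a robust tool would also handle the program-header / PT_DYNAMIC path. */
#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <elf-file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    Elf64_Ehdr *eh = (Elf64_Ehdr *)base;
    if (memcmp(eh->e_ident, ELFMAG, SELFMAG) != 0 ||
        eh->e_ident[EI_CLASS] != ELFCLASS64) {
        fprintf(stderr, "not a 64-bit ELF file\n");
        return 1;
    }

    Elf64_Shdr *sh = (Elf64_Shdr *)(base + eh->e_shoff);
    for (int i = 0; i < eh->e_shnum; i++) {
        if (sh[i].sh_type != SHT_DYNAMIC)
            continue;
        /* sh_link of the dynamic section points at its string table */
        const char *strtab = base + sh[sh[i].sh_link].sh_offset;
        for (Elf64_Dyn *dyn = (Elf64_Dyn *)(base + sh[i].sh_offset);
             dyn->d_tag != DT_NULL; dyn++)
            if (dyn->d_tag == DT_NEEDED)
                printf("NEEDED %s\n", strtab + dyn->d_un.d_val);
    }
    return 0;
}
```

(In practice `readelf -d` or `objdump -p` will give you the same list, also without executing anything.)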
> This minimal meta-loader will totally work if you invoke it directly like `$ meta_loader.sh foo`, and it will totally not work if you hardcode its path (or a symlink to it) in the ELF headers of a binary.
why not have `foo` be a shell script which invokes the meta loader on the "real" foo? like:
```
#!/bin/sh
# file: /bin/foo
# invoke the real "foo" (renamed e.g. ".foo-wrapped" or "/libexec/foo" or anything else easy for the loader to locate but unlikely to be invoked accidentally)
exec meta_loader.sh .foo-wrapped "$@"
```
it's a common enough idiom that nixpkgs provides the `wrapProgram` function to generate these kinds of wrapper scripts during your build, and there's even an option to build a statically-linked binary wrapper instead of a shell-script wrapper (`makeBinaryWrapper`).
Love it. I came to this same insight about nix and containers being two approaches to working around dynamic linking, but via a different path: building my own little container runtime.
Feels like we are building things whose original purpose is now holding us back, but path dependence leaves us stuck wrapping abstractions in other abstractions.
Why did you want to use Nix to make impure binaries for other distros? Much of the appeal of it for me is using it to distribute software in a more reliably portable way, but that of course always means shipping a whole chunk of /nix/store, one way or another.
What made your team/company want to use Nix to build binaries and then strip them down for old-fashioned, dependency hell-ish distribution? Why not install Nix on your target systems or use Nix bundle, generate containers, etc.?
It's a shame that DLL hell was never resolved in the obvious way: deduplication of identical libraries through cryptographic hashes. Containers basically threw away any hope of sharing the bytes on disk - and more importantly _in ram_. Disk bytes are cheap, ram bytes are not, let alone TLB space, branch predictor context, and so on.
There was a middle ground possible at one point, where containers were still packaged with all of their dependencies, but a container installer would fragment that assembly into cryptographically verifiable shared dependencies. We lost that because it was hard.
The container runtimes have to cope with Dockerfiles and similar, which know nothing about packages. To get the kind of granularity you want here, you have to do actual packaging work, which is the thing Docker sold everyone on avoiding.
If you are willing to do that kind of packaging work you can get the best of both worlds today with Nix or Guix. But containers are attractive because you can chuck whatever pathological build process your developers have evolved over the decades into a Containerfile and it'll mostly work.
> deduplication of identical libraries through cryptographic hashes
Or, maybe, adding a version string to the file name, so that if you were compiled against the data structures for libFoo1 (declared in the libFoo.h provided by libFoo1-devel) you’ll link to libFoo1 and not libFoo or libFoo2.
I'd like to express that while this article is WAY outside my wheelhouse, I liked the writing style, and the AI illustrations felt like they were emotionally additive to the section rather than just a distraction (to me). Also, my head hurts still trying to understand this cursed thing: https://github.com/antithesishq/madness/blob/main/pkgs/madne...
Hi Will, I'm curious what your thoughts are about the Nix uniqueness problem, and the characterization of failures, or lack thereof, under undefined behavior's failure domains. Exception handling generally requires a defined and deterministic state, which can't be guaranteed given the design choices made to resolve DLL hell under Nix (i.e. it's a stochastic process).
I mention this since it is a similar form of the problem you mention in writing this piece of software, one that can lead to madness.
Also, operationally, the troubleshooting problem-space of keeping things running segments nicely into deterministic and non-deterministic regions, and the latter ends up costing orders of magnitude more time to resolve, since you can't perturb individual subsystems to test for correct function. Without determinism and time-invariance as system properties, testing piecemeal runs into contradictions in stochastic processes.
Hashing is, by rigorous definition, non-unique (i.e. it's like navigating a circle), and there is no proof of uniformity. So problems in this space would fall in the latter region.
While there are heuristics from cryptography suggesting that using fractional cube roots to initialize the fields brings more uniformity to the examined space than not, there is no proof of this.
When building resilient systems, engineers often try to remove any brittle features that promote failures.
Interestingly, as a side note, ldd's output injects non-determinism into the pipe by flattening empty columns non-deterministically. If you ldd the ssh client, you'll see that the null state for each input-to-output mapping has more than a single meaning/edge on the traversal depending on object type. This violates the 1:1 unique input-output state graph/map required for determinism as a property, though it won't be evident until you use it as an input to later automation that maps it problematically (e.g. grepping the output with a regex will silently fail, producing what looks like legitimate output if one doesn't look too closely).
PaX ended up forking the project with the fix because the maintainers refused to admit the problem (reported in 2016, forked in 2018); the bug remains in all current versions of ldd (to my knowledge).
While based in theory, these types of problems crop up everywhere in computation and few seem to recognize them.
Working with a system's properties, and whether they are preserved, informs whether the system can be safely and consistently used in later automated processes, as well as maintained at low cost.
Businesses generally need a supportable and defensible infrastructure.