I've been working on exactly this problem this year. I "inherited" responsibility for one of our vendor's Windows MFC codebases, which makes gratuitous use of threads and every obfuscated/side-channel form of nonfunctional data flow known to C++. It's like a mirror-universe rendition of Erlang created by a sloppy idiot savant using nothing but Windows messages and public class member variables.
(For example, reading config values from the INI file gets its own CWinThread-derived custom message pump, so that both the initiator of the read and the handler of the result value can be different threads in different .cpp files. I found one CWinThread that was merely being used to RAII one unrelated member variable, and whose constructor constructed a third unrelated object two stars (pointer dereferences) away. Its run loop was empty (for gosh sake).)
The reason for this monstrosity was that the program was written by a fresh college graduate at the peak of the era when CS professors were saying "OMG, concurrency/threads/cores! Every processor of the 2010s is going to have 128-and-doubling weak cores, and if you don't multithread all your code you're part of the problem, not part of the solution." Then this fresh college graduate was thrown into a hardware company whose software department still parties like it's 1999 with MFC/Win32 in C++. And he created... this cosmic horror.
As you can imagine, stack traces are useless to me here. Logging is nearly useless (because the app is always doing more than one thing, even when it has < 1 thing to do). Breakpoints are useless to me (the args mean nothing; public members and global variables hold all the relevant state, but they could be anywhere).
What I ended up doing was adding SQLite to the codebase and creating a "causality log" database. It's sort of like a call tree, but not. It's sort of like a flowchart, but not. And it's sort of like a UML diagram, but not. It combines aspects of all of these; after testing a feature of the program, I run a post-processor on the log data to turn it into a Graphviz DOT file so I can render it as a graph.
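To give a concrete flavor of the post-processing step, here is a stripped-down sketch (the events table and its columns are illustrative, not my actual schema): each row is an event with an optional parent, and the exporter just walks the rows and emits DOT nodes and edges.

```cpp
// Sketch: dump a causality-log table as a Graphviz DOT graph.
// Assumes a hypothetical table: events(id INTEGER PRIMARY KEY,
// parent_id INTEGER, kind TEXT, detail TEXT).
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("causality.db", &db) != SQLITE_OK) return 1;

    std::printf("digraph causality {\n");

    // One node per logged event, labeled with its kind and detail,
    // plus an edge from each event's parent. Error handling omitted.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT id, parent_id, kind, detail FROM events", -1, &stmt, nullptr);
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        sqlite3_int64 id  = sqlite3_column_int64(stmt, 0);
        sqlite3_int64 pid = sqlite3_column_int64(stmt, 1);
        const char* kind   = (const char*)sqlite3_column_text(stmt, 2);
        const char* detail = (const char*)sqlite3_column_text(stmt, 3);
        std::printf("  n%lld [label=\"%s: %s\"];\n", (long long)id, kind, detail);
        if (pid != 0)  // root events have no parent (NULL reads back as 0)
            std::printf("  n%lld -> n%lld;\n", (long long)pid, (long long)id);
    }
    sqlite3_finalize(stmt);

    std::printf("}\n");
    sqlite3_close(db);
    return 0;
}
```

Then something like `./dump_dot > causality.dot && dot -Tsvg causality.dot -o causality.svg` renders it.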
For example, it might tell me that thread object A created thread object B, which was then given pointers to three global objects by assigning B's member variables ex post facto. Later, a Windows message came from dialog X, and B handled it in event handler B::h() and passed the data on to dialog Y.
That's the goal anyway - I have two-thirds of the above paragraph implemented; what I have left is figuring out how to pass a "causality" tracker value to shadow Windows message sends.
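One possible shape for that last piece (purely a sketch - PostTraced, WM_APP_TRACED, Envelope, and log_event are all made-up names, not something already in the codebase): post a heap-allocated envelope that carries the causality id through LPARAM, and unpack it in the handler.

```cpp
// Sketch: a PostMessage wrapper that carries a causality id alongside the
// payload. All names here are hypothetical, invented for illustration.
#include <windows.h>
#include <cstdint>

const UINT WM_APP_TRACED = WM_APP + 1;

struct Envelope {
    int64_t causality_id;  // parent row id in the causality database
    void*   payload;       // whatever the message would normally carry
};

// Stub standing in for the SQLite-backed logger; the real one inserts a row
// and returns sqlite3_last_insert_rowid().
int64_t log_event(int64_t parent_id, const char* what) {
    (void)parent_id; (void)what;
    static int64_t next_id = 1;
    return next_id++;
}

// Sender side: record the send event, then post the envelope.
void PostTraced(HWND hwnd, int64_t parent_id, void* payload) {
    Envelope* env = new Envelope;
    env->causality_id = log_event(parent_id, "posted WM_APP_TRACED");
    env->payload = payload;
    PostMessageW(hwnd, WM_APP_TRACED, 0, reinterpret_cast<LPARAM>(env));
}

// Handler side: log the receipt as a child of the send, then thread the new
// id onward as the parent_id for whatever the handler does next.
void OnTraced(LPARAM lparam) {
    Envelope* env = reinterpret_cast<Envelope*>(lparam);
    int64_t handled_id = log_event(env->causality_id, "handled WM_APP_TRACED");
    (void)handled_id;  // becomes parent_id for downstream calls
    delete env;
}
```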
The way I created this causality logging system is that my log-to-event-database functions return the id of the inserted row, and I (had to) change the code to pass these ids (parent_id) through constructor calls, function calls, etc. I find the non-local destructive assignments of public (should-be-private) class member variables and manually log those occurrences. All of these ids provide a chain of provenance for causality and reachability. Because they are database ids rather than call stack frames, they survive the return of any one call frame.
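Stripped to its essence, the primitive might look something like this (a sketch with illustrative names, not my actual code - the real thing has more event kinds and error handling):

```cpp
// Sketch of the core logging primitive: insert an event row, return its id
// so callers can thread it onward as parent_id.
#include <sqlite3.h>
#include <cstdint>

int64_t log_event(sqlite3* db, int64_t parent_id, const char* kind,
                  const char* detail) {
    sqlite3_stmt* stmt = nullptr;  // error handling omitted for brevity
    sqlite3_prepare_v2(db,
        "INSERT INTO events(parent_id, kind, detail) VALUES (?, ?, ?)",
        -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, parent_id);
    sqlite3_bind_text(stmt, 2, kind, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 3, detail, -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
    return sqlite3_last_insert_rowid(db);  // the id callers pass onward
}

// Usage: every constructor/function grows a parent_id parameter, so the
// chain survives across call-frame boundaries.
class Worker {
public:
    Worker(sqlite3* db, int64_t parent_id)
        : my_event_(log_event(db, parent_id, "ctor", "Worker created")) {}
    int64_t my_event_;  // becomes parent_id for anything this object causes
};
```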
You just reminded me of an application for SNMP communication on HP-UX that I had to maintain in 2005. The number of threads per action was so complex that we had a couple of A4 sheets glued together holding the flowchart of the actions triggering threads being started, joined, and synchronized. :\
I would be very grateful if someone could spell out how an implementation of this might look in more concrete terms. I don't parse the math very well myself.
My gut feel has been that a 'message trace' of some sort would be useful in debugging a distributed system. For example, every message contains the ID of the message which caused it to be sent, if any (or a list of message IDs giving the full causality chain). This is something I'm considering implementing in Lightbus [1] using Python's context variables.
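The shape I have in mind, roughly (sketched here in C++ for concreteness, with made-up names - not Lightbus's actual API):

```cpp
// Illustrative sketch: every message carries its own id plus the chain of
// ids that led to it, so a consumer can reconstruct the full causal history
// of any message it receives.
#include <cstdint>
#include <string>
#include <vector>

struct Message {
    uint64_t id;                      // unique id for this message
    std::vector<uint64_t> causality;  // ids of ancestor messages, oldest first
    std::string body;
};

// When handling `cause` produces a new message, extend the chain.
Message derive(const Message& cause, std::string body, uint64_t new_id) {
    Message m;
    m.id = new_id;
    m.causality = cause.causality;    // inherit the full chain...
    m.causality.push_back(cause.id);  // ...and append the direct cause
    m.body = std::move(body);
    return m;
}
```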
Is the proposed wat-provenance system here somehow different? To quote the abstract:
> Given an arbitrary state machine, wat-provenance describes why the state machine produces a particular output when given a particular input.
So is this more akin to static analysis than to runtime debugging?