> LMAX's in-memory structures are persistent across input events, so if there is an error it's important to not leave that memory in an inconsistent state. However there's no automated rollback facility. As a consequence the LMAX team puts a lot of attention into ensuring the input events are fully valid before doing any mutation of the in-memory persistent state. They have found that testing is a key tool in flushing out these kinds of problems before going into production.
I'm sorry, but this is saying "catch your bugs before they reach production", which just isn't feasible in non-critical software development (i.e., most software development). The important part that is left out here is: what happens when one such error slips in? How do you deal with it after the fact?
That being said, your system is impressive and I loved being able to read about it. Please keep up the good work, and especially keep sharing your findings! :)
Not quite! I don't think that's what they're getting at.
The idea is this: say you have a record A with fields f1, f2, f3. When an event comes in, you run a function F with steps s1, s2, s3, each of which may modify a field of record A.
Here's the issue: if s3 fails (due to "invalid input"), the modifications already made to A by s1 and s2 are incorrect, and A is now corrupt.
There are a bunch of ways to handle this but the one described here is to avoid touching data that persists between requests until you're at a stage where nothing can fail anymore.
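To make that concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not LMAX's actual code: the `Record` class, the event keys `d1`/`d2`/`d3`, and the "d3 must be non-negative" rule are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical record A with fields f1, f2, f3.
    f1: int = 0
    f2: int = 0
    f3: int = 0


def apply_event_unsafe(a: Record, event: dict) -> None:
    """Mutates as it goes: a failure in s3 leaves f1 and f2 already changed."""
    a.f1 += event["d1"]              # s1
    a.f2 += event["d2"]              # s2
    if event["d3"] < 0:              # s3 rejects the input, but too late
        raise ValueError("d3 must be non-negative")
    a.f3 += event["d3"]


def apply_event_safe(a: Record, event: dict) -> None:
    """Validates and computes everything first, then commits."""
    # Phase 1: everything that can fail works on local values only.
    new_f1 = a.f1 + event["d1"]
    new_f2 = a.f2 + event["d2"]
    if event["d3"] < 0:
        raise ValueError("d3 must be non-negative")
    new_f3 = a.f3 + event["d3"]
    # Phase 2: plain assignments, the only point where A is touched.
    a.f1, a.f2, a.f3 = new_f1, new_f2, new_f3
```

With an event like `{"d1": 1, "d2": 2, "d3": -1}`, the unsafe version raises after f1 and f2 have already changed, while the safe version raises and leaves A exactly as it was.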
Absolutely. But doing things in this style protects you from large classes of especially hard-to-reproduce bugs. Nothing's perfect, but it helps a lot!
I'd never heard it articulated before but I personally discovered this style over the years as well.
I honestly can't imagine bug-free software, even in critical software development. Luckily for me, the apps I've worked on were important, but if there was an issue there was time to trace and repair the data. That's not the case for a system that handles 6 million orders per second.