" A customer issues a transaction to machine A that decrements the inventory. After this transaction completes, a different transaction is issued to machine B that reads the current inventory. On a single machine, the second transaction would certainly read the write of the first transaction. But now that they are running on separate machines, if the system does not guarantee consistent reads across machines, it is very much possible that the second transaction would return a stale value (e.g. this could happen if replication was asynchronous and the write from the first transaction has not yet been replicated from A to B). This stale read would not be a violation of serializability. It is equivalent to a serial order of the second transaction prior to the first. But since the second transaction was submitted after the first one completed, it is certainly a violation of strict serializability."
Isn't this a problem on all DBs that don't have strict serializability, and can't it be solved by two-phase commit (2PC)?
In the vote phase, the coordinator could also get the final balance from each replica; if any balance differs, it knows one of the replicas isn't synced and aborts the transaction (the user retries).
Another idea, maybe (doesn't scale):
Mark replicas as read-only and route all writes through one master node (or a set of them) that guarantees the state is up to date?
Yes, it potentially is, but it entirely depends on the implementation details of the system. It's certainly allowed behavior by the definition of serializability.
Edited my comment to include 2PC. Wouldn't 2PC always get you synchronous replication? Async replication probably helps you scale, but the tradeoff is consistency and errors like the one they mentioned. Is that correct?
Usually 2PC is used to run transactions across multiple shards which each have a subset of the total dataset. It doesn't provide you with any extra reliability, though.
There's a tradeoff in sync vs async replication... a node which you synchronously replicate to won't serve stale reads, but then it becomes part of the critical write path. Conversely, when reading from an async node it's very possible to get a stale read. (Alternatively, systems like Cassandra allow you to read from multiple nodes and return the latest result.)
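To make the stale read concrete, here's a minimal SQL sketch (the inventory table and item id are my own placeholders, not from the article):

/* on the primary (machine A) */
begin;
update inventory set quantity = quantity - 1 where item_id = 42;
commit; /* acknowledged to the client as soon as the primary has committed */

/* moments later, on an asynchronously replicated standby (machine B) */
select quantity from inventory where item_id = 42;
/* may still return the pre-decrement value if the write hasn't been applied yet;
   a synchronous standby wouldn't serve this stale read, but it then sits on the
   critical write path for every commit */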
Note, I don't believe it's necessary for _all_ transactions to be strictly serializable, but I would make the argument that all read-write transactions should be in order to prevent the anomaly in the article. In other circumstances it's reasonable to make the tradeoff and accept potentially stale data.
> Note, I don't believe it's necessary for _all_ transactions to be strictly serializable, but I would make the argument that all read-write transactions should be in order to prevent the anomaly in the article. In other circumstances it's reasonable to make the tradeoff and accept potentially stale data.
How exactly does the "free checking" example work?
Is it two transactions, where each transaction is responsible for checking the total balance across all accounts? Or is it three transactions, where a third read-only transaction verifies the sum of all accounts?
If it's two transactions, I think the anomaly as described is prevented even by non-strict serializability, because the additional reads force the transactions to conflict. If it's three transactions, then what we're really concerned with is the read-only transaction. What does it mean to mix some strictly-serializable transactions and non-strictly-serializable ones?
It is 3 transactions. Alice deposits money into her savings account, then subsequently withdraws money from her checking account. In the meantime, some concurrent process transactionally reads the balances of both accounts.
Since there is no internal causality tying transactions 1 and 2 together, transaction 3 is allowed to arbitrarily order them without violating serializability.
I think the answer to your second question comes down to what anomalies you are trying to protect against. In order to prevent causal reverse, all read-write transactions must be strictly serializable, but it is safe to reorder read-only ones.
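To make that concrete, here's a rough sketch of the three transactions (table and column names are placeholders, not from the article):

/* T1: Alice deposits into savings */
begin;
update accounts set balance = balance + 100 where owner = 'alice' and type = 'savings';
commit;

/* T2: issued only after T1 has been acknowledged; Alice withdraws from checking */
begin;
update accounts set balance = balance - 100 where owner = 'alice' and type = 'checking';
commit;

/* T3: concurrent read-only transaction checking the combined balance */
begin;
select sum(balance) from accounts where owner = 'alice';
commit;

/* Plain serializability lets T3 be ordered after T2 but before T1, so it can observe the
   withdrawal without the deposit, even though T1 finished before T2 started in real time. */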
Their banking example is impossible under any system which guarantees serializability.
To guarantee serializability, the database system cannot assume anything about how some value is derived (unless it is something like a single stored procedure).
If I am playing the evil demon to show why their example is not valid under serializability, let us make a small modification to their example. Instead of just paying the babysitter, assume that our customer simply executes the following procedure twice, as two separate transactions.
begin transaction;
money = select money from bank where id=42;
if money % 2 == 0:
    update bank set money=money-1 where id=42;
else:
    update bank set money=money+2 where id=42;
commit;
If money = 10 initially, there is precisely one valid end state for money, money = 11. However, using the logic from the article, it is supposedly valid for the second execution to read the old value for money. Therefore, both transactions end up writing money=9 which is clearly not possible under serializability, strict or not!
The key part of this is that the database has no idea what logic we are using to compute the new value of money.
The only time I could see strict serializability potentially being stronger than normal serializability is with cases where the client is executing a stored procedure and therefore the database can know exactly what logic the client is running.
> Their banking example is impossible under any system which guarantees serializability.
>
> To guarantee serializability, the database system cannot assume anything about how some value is derived
Their example is entirely possible with a system that guarantees only serializability, especially if it's distributed. The database is not assuming anything in their example; rather, you are changing the schema and assuming a design in which serializability happens to be enough. Serializability is sufficient in many scenarios, especially on single-machine databases, which is why it was usually good enough (they mention exactly that). So it's easy to construct an example where serializability works. In fact, your example is equivalent to their first example (with the inventory), where they point out that serializability does work.
But the data in their second example is not a single value "money"; it's two disjoint values, "savings_balance" and "checking_balance". From the database's point of view, reads/writes to these are non-conflicting operations, so they may be reordered. The article explains that the transactions access disjoint data (checking balance vs savings balance), which in a distributed system may even live on different machines.
So the two transactions T1 (add money to savings) and T2 (remove money from checking) may be ordered either T1 -> T2 or T2 -> T1, because their reads/writes are non-conflicting. In the T2 -> T1 case of their example, the observed balance drops below $5,000, because the addition comes later.
That seems like a problem of not having sync acknowledgements while allowing multi-instance writes at the same time. I am not sure if that is a strict serializability issue anymore.
The constraint that all observers see T1 before T2 because T1 is acknowledged before T2 begins is the “in real time” component of a strict serializability guarantee. Mere serializability does not guarantee this, as unintuitive as that sounds.
ANSI SQL serializable is not the same as serializability in a mathematical sense. The standard defined serializability in terms of anomalies since they came up with the standard without considering things like snapshot isolation or MVCC.
The classic example, related to mine, which shows the difference is with the following two queries:
update balls set color=0 where color=1;
update balls set color=1 where color=0;
Under (mathematical) serializability, all the balls must be the same color. However, under snapshot isolation (serializable according to the SQL standard), it is also possible to just have the colors swapped in case both queries read the same version.
You can verify it yourself, but this is not any one of the typical anomalies since both transactions neither read nor write to the same rows. A phantom read does not occur since each transaction only does a single read operation.
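A quick way to see it in Postgres (REPEATABLE READ is Postgres's implementation of snapshot isolation; the table layout is just an assumption for the demo):

create table balls(id int primary key, color int);
insert into balls values (1, 0), (2, 1);

/* session 1 */
BEGIN TRANSACTION ISOLATION LEVEL repeatable read;
update balls set color=0 where color=1;

/* session 2, before session 1 commits */
BEGIN TRANSACTION ISOLATION LEVEL repeatable read;
update balls set color=1 where color=0;

/* session 1 */ COMMIT;
/* session 2 */ COMMIT;

/* Result: the two balls have simply swapped colors, even though in any serial order they
   would end up the same color. Under Postgres's SERIALIZABLE (SSI), one of the two
   transactions should get aborted instead. */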
Yes, I believe so. Since in Postgres, snapshot isolation (on top of which its serializable implementation is built) provides a stable view of the system at a point in time, and also doesn't allow stale reads, transactions _can be_ strictly serializable.
However, if multiple synchronous standby replicas are configured, then this means that stale reads are possible, since Postgres by default does not necessarily wait for all replicas to ack the write. It depends on the replication configuration and where reads are directed.
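For example (a sketch of postgresql.conf settings; the standby names are made up), this waits for only one of the two listed standbys, so the other can lag behind and serve stale reads:

synchronous_standby_names = 'ANY 1 (standby_a, standby_b)'  # quorum commit: 1 of 2 standbys
synchronous_commit = on  # waits for the WAL flush on that quorum, not on all standbys, and not for visibility there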
> However, if multiple synchronous standby replicas are configured, then this means that stale reads are possible, since Postgres by default does not necessarily wait for all replicas to ack the write. It depends on the replication configuration and where reads are directed.
It seems synchronous replication still refers to durability of a commit. Once it's stored on the replica persistently it doesn't block anymore, so no strict serializability. Macdice's comment and the link hold lots of info on this.
On a single machine, it most likely does (it wouldn't really make sense to implement something else), but across replicas it doesn't even offer non-strict serializability as SIREAD locks are not coordinated between replicas.
You can reproduce the problem in postgresql with the following:
create table counters(counter int);
insert into counters(counter) values(1);
BEGIN TRANSACTION ISOLATION LEVEL serializable;
select sum(counter) from counters; /* this will return 1 */
/* insert sum into counters. wait until committing next transaction before executing the insert */
insert into counters(counter) values(1);
COMMIT;
/* this transaction should commit before doing the insert in the above transaction and after the above transaction has calculated the sum */
BEGIN TRANSACTION ISOLATION LEVEL serializable;
insert into counters(counter) values(10);
COMMIT;
both transactions commit and the final table looks like:
1, 10, 1
which is possible if the first transaction committed first and then the second transaction committed. That is different from the order in which the transactions actually committed by the wall clock, so it is possible for another client to see the table as:
[1] (the initial state)
[1, 10] (after the second transaction committed)
[1, 1, 10] (after the first transaction committed)
which is a sequence of states that should not be possible: if you see [1] and then [1, 10], you should see [1, 10, 11] as the last state, because the first transaction would then have read a sum of 11. Hence it violates external consistency.
I agree the order in the table is irrelevant. However, the ordering observed by another entity doing 3 different read-only transactions is relevant.
The third party does query 1 and sees [1]. This is a result of the initial state.
The third party does query 2 and sees [1, 10]. This is a result of the second transaction committing. The second transaction just inserts '10' so [1,10] makes sense.
The third party does query 3 and sees [1, 1, 10]. This is a result of the first transaction committing, with postgresql effectively pretending it committed before the second transaction. The first transaction inserts the sum of what was in the table, so if the first transaction had committed first, inserting '1' would be allowed in the serial order. That doesn't break the second transaction, because the second transaction didn't read any rows. So in order to let the first transaction commit, postgresql pretends that it actually committed first.
There is no serial order the transactions could have committed in that would let the third party see these results. However, this is allowed in postgresql because postgresql doesn't prevent anomalies that can only be observed across multiple transactions.
Edit: TL;DR: No, not with replication in the picture.
As for whether replication shows fresh data, it sort of can, but not in a very practical way. synchronous_commit = on, when synchronous_standby_names has been configured, does indeed make COMMIT wait until some number of standbys has flushed the transaction to disk, but not until the transaction is visible to new transactions/snapshots on the replicas. In other words, it's just about making sure your transaction is durably stored on N servers before you tell your client about the transaction; a new transaction on the replica server after that might still not see the transaction (if the startup process hasn't got around to applying it yet). As a small step towards what you want, we added synchronous_commit = remote_apply, which makes COMMIT wait until the configured number of standbys has also applied the WAL for that transaction, so then it will be visible to new transactions. The trouble is, you either have to configure it in such a way that a dying/unreachable replica can stop all transactions on the primary (blocking the whole system), or so that it only needs some subset of replicas to apply, but then read queries don't know which replicas have fresh data.
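In configuration terms, the difference is roughly this (standby name is a placeholder):

synchronous_standby_names = 'FIRST 1 (standby_a)'
synchronous_commit = on            # COMMIT returns once the standby has flushed the WAL (durability only)
synchronous_commit = remote_apply  # COMMIT also waits until the standby has applied it, so new snapshots there can see it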
I have a proposal to improve that situation: https://commitfest.postgresql.org/22/1589/ Feel free to test it and provide feedback; it's touch and go whether it might still make it into PostgreSQL 12. It's based on a system of read leases, so that replicas only have a limited ability to hold up future write transactions, before they get kicked out of the synchronous replica set, a bit like the way failing disks get kicked out of RAID arrays. Most of the patch is concerned with edge conditions around the transitions (new replicas joining, leases being revoked). The ideas are directly from another open source RDBMS called Comdb2. In a much more general sense, read leases can be found in systems like Spanner, but this is just a Fisher-Price version since it's not multi-master.
As for whether that gets you strict serializability, well no, because PostgreSQL doesn't even support SERIALIZABLE on read-only replicas yet, and although REPEATABLE READ (which for PostgreSQL means snapshot isolation) gets you close, anomalies are possible even with read-only transactions (see the famous paper by A. Fekete; for an example, search for that name in the PostgreSQL isolation tests). Some early work has been done to try to get SERIALIZABLE (actually SERIALIZABLE READ ONLY DEFERRABLE) working on read-only replicas: https://commitfest.postgresql.org/22/1799/ but some subproblems remain unsolved.
Maybe with all of that work we'll get close to the place you want. It's complicated, and we aren't even talking about multi-master.
First, note that PostgreSQL uses SSI, an optimistic strategy for implementing SERIALIZABLE, so it needs to be able to nuke transactions that cannot be proven to be safe. This is a trade-off; other systems use (pessimistic) strict two-phase locking and similar, and we are taking the bet that the total throughput will be higher with SSI than with S2PL. Usually that optimism turns out to be right (though not for all workloads). Note nearby comments about another RDBMS whose SERIALIZABLE implementation is so slow that no one uses it in practice; that matches my experience with other non-SSI RDBMSs too.
Next, you have to understand that a read-only snapshot isolation transaction can create a serialization anomaly. Take a look at the read-only-anomaly examples here, showing that single node PostgreSQL database can detect that: https://github.com/postgres/postgres/tree/master/src/test/is... (and ../expected shows the expected results).
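For readers who don't want to dig through the test files, the anomaly is roughly this (a paraphrase of the Fekete et al. example; table names and amounts are my own):

create table accounts(name text primary key, balance int);
insert into accounts values ('checking', 0), ('savings', 0);

/* T2 starts first: withdraw 10 from checking, plus a 1-dollar fee because the total would go negative */
BEGIN TRANSACTION ISOLATION LEVEL repeatable read;
select sum(balance) from accounts; /* sees 0 */
update accounts set balance = balance - 11 where name = 'checking';

/* T1: deposit 20 into savings, committing while T2 is still open */
BEGIN TRANSACTION ISOLATION LEVEL repeatable read;
update accounts set balance = balance + 20 where name = 'savings';
COMMIT;

/* T3: read only, runs after T1's commit but before T2's */
BEGIN TRANSACTION ISOLATION LEVEL repeatable read;
select * from accounts; /* sees checking = 0, savings = 20 */
COMMIT;

/* T2 */
COMMIT;

/* T3's view is only consistent with the serial order T1 then T2, but in that order T2 would
   have seen a total of 20 and charged no fee (ending at -10, not -11). Under snapshot
   isolation all three commit anyway; single-node SERIALIZABLE detects this and aborts one of them. */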
Now, how is the primary server supposed to know about read-only transactions that are running on the standby, so it can detect that? Just using SI (REPEATABLE READ) on read replicas won't be good enough for the above example, as the above tests demonstrate.
The solution we're working on is to use DEFERRABLE, which involves waiting until a point in the WAL that is definitely safe. SERIALIZABLE READ ONLY DEFERRABLE is available on the primary server, as seen in the above test, and waits until it can begin a transaction that is guaranteed not to be killed and not to affect any other transaction. The question is whether we can make it work on the replicas. The reason this is interesting is that we think it needs only one-way communication from primary to replicas through the WAL (instead of, say, some really complicated distributed SIREAD lock scheme that I don't dare to contemplate).
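For reference, on the primary the deferrable variant looks like this today (the table name is a placeholder):

BEGIN TRANSACTION ISOLATION LEVEL serializable READ ONLY DEFERRABLE;
/* waits until it can take a snapshot that is guaranteed to be anomaly-free, then runs
   without taking SIREAD locks and without any risk of being aborted */
select sum(balance) from accounts;
COMMIT;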
It’s very unclear what level of serializability MySQL’s “serializable” level actually provides; it is seldom used in practice because the performance cost is extreme (essentially table locks everywhere for both reads and writes).
I would be surprised if either isolation level is correctly enforced across distributed replicas, even with semi-synchronous row-based replication. My MySQL operational experience was strongly characterized by fighting replication divergence.
I'd always known this as wall-clock serializability. Not a technical sounding term but very descriptive.
I'm not sure I know of any traditional RDBMSes that run in Serializable but not Strict Serializable mode. It's only the horizontally distributed ones squeezing out more performance that seem to make that trade-off. CockroachDB was the notable one that used SNAPSHOT isolation, permitting write skew: "As of CockroachDB 2.1, SNAPSHOT isolation is no longer supported, and SERIALIZABLE SNAPSHOT is now simply known as SERIALIZABLE." Good to know.
Doesn't strict serializability solve only part of the problem? If X finishes before Y starts, then X must come before Y in serial order. But if there is some overlap between X and Y in the first place, then again some serial order is picked (serializability already guarantees that), but strict serializability still does not define which of the two possible orders gets picked. So at least query-level replication would suffer from the same problems as before.
Linearizability is a property of accesses of a single key. Strict serializability is the analogous property for accesses on a multi-key system. Basically, strict serializability implies linearizability, but not the other way around.
A database with just serializable operations, but with single-key operations being linearizable, is still useful. You could imagine this being implemented as a partitioned database where operations within a single partition are strictly serializable, but with no global clock to preserve the real-time ordering of transactions on different partitions.
What do you mean by "key" here? In my world a "key" is a half of a key-value-pair, and would correspond to only one record. But then you talk about strict serializability of entire partitions and that seems like something different.