A history of the distributed transactional database (infoq.com)
105 points by evanweaver on Dec 4, 2018 | 25 comments



> This is not at all a theoretical problem. Deliberately inducing withdrawal races at ATMs is a widely reported type of fraud. Thus, databases need “external consistency”, colloquially known as ACID.

In all those fraud cases the banks were actually running ACID databases, as they always do, because ACID has nothing to do with "external consistency" unless you can run it in a human brain, which you can't. Interactions with humans can only be eventually consistent. And nobody even bothers to run ACID all the way to the web browser or the APIs, so it's limited to the application and nothing beyond that, with all the fraud and double-spending problems remaining unsolved.

Overall, the usual FaunaDB attacks on other systems, especially AP ones. Strong eventual consistency in AP systems is actually much stronger than ACID and much, much easier to verify, and AP is where the cutting edge of distributed systems research happens. But more importantly, SEC can offer the best theoretically possible performance and latency for multi-datacenter deployments, while distributed ACID transactions haven't even shown viability in multi-datacenter deployments yet, forcing vendors to degrade to weak consistency models in such setups.


I don't think it's correct to pit ACID against AP systems this way. Consistency is orthogonal. ACID is a model for reasoning about transaction guarantees in terms of the anomalies allowed at each isolation level. Serializability in ACID implies CP over AP, but weaker levels don't have the same requirement. It still comes down to tradeoffs for a specific use case of a database system.

Not sure what you're getting at when you say ACID transactions don't work in a multi-DC context. FaunaDB aside, there are many production systems (commercial or internal) that I am aware of that provide them.


> Not sure what you're getting at when you say ACID transactions don't work in a multi-DC context.

I expect they are talking about DC as _regions_, not DC as _availability zones_ (in the AWS parlance).

Planetary databases hit latency issues on cross-region ACID transactions. Whatever the sharding, a transaction modifying one piece whose leader is in Paris and another in Beijing will have at least the round-trip time between the two.
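
A back-of-the-envelope calculation illustrates the lower bound (the distance and fiber speed are rough assumptions):

    # Lower bound on commit latency for a Paris<->Beijing transaction, assuming
    # ~8,200 km great-circle distance and light in fiber at ~200,000 km/s.
    distance_km = 8200          # approximate great-circle distance
    fiber_speed_km_s = 200000   # roughly 2/3 of c in glass
    one_way_ms = distance_km / fiber_speed_km_s * 1000
    rtt_ms = 2 * one_way_ms
    print(f"one-way ~{one_way_ms:.0f} ms, round trip ~{rtt_ms:.0f} ms")
    # one-way ~41 ms, round trip ~82 ms -- and real routes are far from great
    # circles, so observed RTTs are usually well above this bound.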

It is already unforgiving at that scale, and it will become untenable at interplanetary scale, requiring some form of eventual consistency.

However, I still expect that we will retain ACID operations for local changes. Humans can tolerate the latencies that planetary ACID transactions require. The combination of the size of the Earth and the speed of light is a lucky draw.


Yes, I meant regions as well. Speed-of-light bounded latency certainly means that serializable ACID transactions are not workable in some situations, but I disagree with the OP that this proves them not viable, period, at least at planet scale as you say.

We're indeed pretty lucky as far as transactions are concerned...


For further reading, Dan Abadi's post at http://dbmsmusings.blogspot.com/2018/09/newsql-database-syst... addresses the Spanner issues more directly.

I don't think there's a great reference for the Percolator transaction architecture specifically, but there is a brief description here: https://blog.octo.com/en/my-reading-of-percolator-architectu...

One thing I learned recently is that FoundationDB is essentially the Percolator model (timestamp oracle), even though FoundationDB predated the Percolator paper.


What consistency model does FDB provide?


In my understanding, it can provide strict serializability.

The main difference from actual Percolator is that FoundationDB manages range locks through a logically separate pool of in-memory lock caches instead of writing durable locks to the replicas, but the end result is the same: external consistency for read/write transactions within a single datacenter at a time.
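
As a toy illustration of the shared model being described here (a central timestamp oracle plus a conflict check at commit) in simplified Python; this is not FoundationDB's or Percolator's actual code, and value storage, durability, and write-write conflict handling are elided:

    # Toy timestamp-oracle commit check: each transaction gets a start version,
    # records the keys it reads, and is rejected at commit if any of those keys
    # were written after its start version.
    class Oracle:
        def __init__(self):
            self.version = 0

        def next(self):
            self.version += 1
            return self.version

    class Store:
        def __init__(self, oracle):
            self.oracle = oracle
            self.last_write = {}  # key -> version of the last committed write

        def begin(self):
            return {"start": self.oracle.next(), "reads": set(), "writes": {}}

        def read(self, txn, key):
            txn["reads"].add(key)

        def write(self, txn, key, value):
            txn["writes"][key] = value

        def commit(self, txn):
            # Abort if any key we read was overwritten after our start version.
            for key in txn["reads"]:
                if self.last_write.get(key, 0) > txn["start"]:
                    return False
            commit_version = self.oracle.next()
            for key in txn["writes"]:
                self.last_write[key] = commit_version
            return True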


There is no correctness limitation on using FoundationDB in multiple datacenters, and if you do most of your writes in a single datacenter while using the others for disaster recovery, the latency is honestly pretty good too (especially with some of the new 6.0 features).

EDIT: I was wondering why this conversation was giving me deja vu, and then I remembered that you and Dave discussed this in detail last month: https://news.ycombinator.com/item?id=18258805


I forgot about that; thanks for the link. I didn't intend to imply a correctness limitation.


I was using IBM DB/2 Parallel Edition in 1995 at Trans Union for a year 2000 project to move off the mainframe. It was running on 60 RS/6000 SP2s. I was hoping the blog post would go further back in time.

https://pdfs.semanticscholar.org/9c08/4a56c1c625da116eca4742...


It's kinda sad that Tandem and similar work have been forgotten. I've wondered what we could learn from that history and never gotten a clear answer.


HP still manufactures and maintains Tandem systems under the HPE Integrity NonStop brand [1]. Just like in IBM Systems Magazine [2], there is a lot of proprietary tech described there that takes a while to seep into Linux and the open source world. You can still read the Tandem documentation, so the ideas are still there to be picked up by anyone interested in re-implementing them on Linux.

[1] https://www.hpe.com/us/en/servers/nonstop.html

[2] http://ibmsystemsmag.com/


Is there a technical writeup anywhere about the design and architecture of Tandem? One that has enough detail that a skilled database designer could understand it well enough to reimplement it?


Not in one place, and not sufficient to give anyone an implementation roadmap, but that is to be expected, as this is still a commercially active architecture. They're not about to give away their architectural advantages that easily.

But if you read between the lines, and are savvy enough to look up the background research papers, you'll pick up most of the technology. The devil, as they say, is in the engineering details, however, and that's where you'll run into a lot of special sauce.

You'll want to look at their Transaction Management Facility (TMF) in detail. They baked fault tolerance right into the bones of the architectural design, and TMF is an identifiable chunk of how that was done. You aren't going to be able to pull off the same kind of fault tolerance they're delivering at just the database level, either. Their thinking about fault tolerance goes way beyond that, down into the hardware level, though it stops short of the silicon.


https://jimgray.azurewebsites.net/papers/TEs.pdf

I don't have the background to fully understand this stuff, but it claims "The first distributed SQL -- it offers distributed data, distributed execution, and distributed transactions." while every single NoSQL/NewSQL article says things like "Legacy databases (typically the centralized RDBMS) solve this problem by being undistributed."


That claim is technically true. There's a lot of good stuff in here—partitioned locks, transaction recovery journals, a realistic expectation of WAN latency and reliability—but page 20 ("Local Autonomy") and page 37 ("NonStop Operation") give clues as to the differences from modern systems.

Rather than build consensus-based high availability into the software, NonStop SQL relies on custom hardware and redundant networks that are designed to never fail. If they do fail, some portion of the data or some types of modifications become unavailable and the operator is expected to intervene and make choices about how to proceed.

They give an example of being unable to run an ALTER TABLE statement because two datacenters that each own a shard of the table are disconnected—the statement can't be run at either datacenter during the partition.

They give other examples where queries return partial results because a specific shard is down. This is potentially a correctness violation if it is not explicit, and it is definitely a loss of availability.

Modern distributed systems are instead designed to maintain almost total levels of availability on constantly failing hardware, and heal themselves immediately and autonomously without losing correctness.


What didn't we learn about Tandem and NonStop? Pretty much everything there is to learn, we did learn.


TiDB developer here,

'This model is equivalent to the multiprocessor RDBMS—which also uses a single physical clock, because it’s a single machine—but the system bus is replaced by the network. In practice, these systems give up multi-region scale out and are confined to a single datacenter.'

I don't think so. Although Percolator has a single physical clock, it is easy to eliminate the SPOF problem by using high-availability algorithms such as Raft or Paxos (just like the design in TiDB). On the other hand, for strong consistency across multiple data centers, I think that without hardware such as TrueTime it is difficult to get low latency and strong consistency while allowing writes to happen in multiple data centers at the same time. The more realistic setup is a primary data center (where both reads and writes occur) with additional data centers providing high availability with strong consistency; we can achieve this by scheduling the Paxos or Raft leader to the primary data center.
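
A minimal sketch of that leader-scheduling idea (generic Python, not TiDB/PD's actual API; the labels and fields are illustrative):

    # Toy leader placement: given a Raft group's replicas labeled with their data
    # center, prefer a replica in the primary DC when electing or transferring
    # the leader, falling back to any healthy replica if the primary DC is down.
    PRIMARY_DC = "dc-primary"   # hypothetical label for the read/write data center

    def pick_leader(replicas):
        """replicas: list of dicts like {"id": "r1", "dc": "dc-primary", "healthy": True}"""
        candidates = [r for r in replicas if r["healthy"]]
        preferred = [r for r in candidates if r["dc"] == PRIMARY_DC]
        # The other DCs still hold full Raft copies, so the group stays available
        # even when no primary-DC replica can take leadership.
        return (preferred or candidates)[0]["id"] if candidates else None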


TiDB should be in here but trying to keep up with the latest trends is overwhelming.


TiDB uses the Percolator protocol


Why the hell does everyone use a banking example when they want to talk about transactions? BANKS RUN ON EVENTUAL CONSISTENCY, PERIOD.

Another point worth mentioning is that the single-transaction-coordinator model is undervalued in the community. The majority of systems have the following query pattern (sorted in descending order of frequency):

1. Single-key read --> you don't need to contact the coordinator.

2. Single-key write --> you don't need to contact the coordinator. OCC with a simple version number works here (see the sketch after this list).

3. Multi-key asynchronous write (updating the second and third keys via a pub/sub infrastructure) --> you don't need to contact the coordinator in this case.

4. True multi-key transactions that must happen atomically --> you need to contact the coordinator.
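
A minimal sketch of the version-number OCC mentioned in (2), assuming a store that exposes an atomic compare-and-set; the API names here are hypothetical:

    # Single-key optimistic write: read the current version, then commit only if
    # nobody else has bumped it in the meantime -- no coordinator involved.
    def occ_write(store, key, update_fn, retries=3):
        for _ in range(retries):
            value, version = store.get_with_version(key)        # hypothetical API
            new_value = update_fn(value)
            if store.compare_and_set(key, new_value, expected_version=version):
                return True                                      # committed
        return False                                             # give up after contention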

If a single coordinator can process 20-30K transactions per second of category 4, then the total throughput of the system would be around 300K queries per second, which is sufficient for many workloads.
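
The arithmetic behind that estimate, assuming category 4 makes up roughly 10% of traffic (the ratio is an assumption; only the totals are given above):

    coordinated_tps = 30_000        # category-4 throughput one coordinator can handle
    coordinated_fraction = 0.10     # assumed share of traffic that needs the coordinator
    total_qps = coordinated_tps / coordinated_fraction
    print(total_qps)                # 300,000 queries per second overall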


> BANKS RUN ON EVENTUAL CONSISTENCY, PERIOD.

Especially American banks. I can't think of any other retail banking market where banks abuse eventual consistency so badly that transfers take days to clear.


The CAP theorem states that it is impossible for a distributed data store to simultaneously provide more than two of the three guarantees: consistency, availability, and partition tolerance.

https://en.wikipedia.org/wiki/CAP_theorem


Apache Ignite is an ACID key-value CP database that does not depend on a clock.


Where does FoundationDB fit with this?



