Spanner: Becoming a SQL System [pdf] (acm.org)
162 points by elvinyung on May 14, 2017 | 38 comments



This "modern" Spanner feels very different from the one we saw in 2012 [1]. Some interesting takeaways:

* There is a native SQL interface in Spanner, rather than relying on a separate SQL layer on top, à la F1 [2]

* Spanner is no longer on top of Bigtable! Instead, the storage engine seems to be a heavily modified Bigtable with a column-oriented file format

* Data is resharded frequently and concurrently with other operations -- the shard layout is abstracted away from the query plan using the "distributed union" operator (rough sketch at the end of this comment)

* Possible explanation for why Spanner doesn't support SQL DML writes: writes are required to be the last step of a transaction, and there is currently no support for reading uncommitted writes (this is in contrast to F1, which does support DML)

* Spanner supports full-text search (!)

[1] https://research.google.com/archive/spanner.html

[2] https://research.google.com/pubs/pub41344.html
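
Here's a toy sketch (my own, not from the paper) of what the distributed union conceptually does -- enumerate the shards at execution time and union the per-shard results, so the compiled plan never bakes in a particular shard layout:

  # Illustrative only: names and structure are made up, not Spanner's.
  from concurrent.futures import ThreadPoolExecutor

  def distributed_union(shards, subplan):
      # Shards are discovered when the query runs, so resharding
      # underneath the plan doesn't invalidate it.
      with ThreadPoolExecutor() as pool:
          partials = list(pool.map(subplan, shards))
      for rows in partials:
          yield from rows

  # Each shard evaluates the pushed-down subplan; the root concatenates.
  shards = [[(1, "a"), (2, "b")], [(3, "c")]]
  scan = lambda shard: [row for row in shard if row[0] > 1]
  print(list(distributed_union(shards, scan)))  # [(2, 'b'), (3, 'c')]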


I haven't digested the paper yet. Does it provide nested transactions?

My understanding from the 2012 paper was that it doesn't support nested transactions, even at the storage layer.

Can anybody with insider knowledge say whether it was even a requested feature from Google devs internally?


Since writes are required to be the last step of a transaction, I suspect there wouldn't be much of a point in having nested transactions.


From the paper:

> Overcoming this limitation requires supporting reading the uncommitted results within an active transaction. While we have seen comparatively little demand for this feature internally, supporting such semantics improves compatibility with other SQL systems and their ecosystems and is on our long-term radar.

No internal demand, but on "long-term radar".
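
A toy model (mine, not Spanner code) of why that rules out DML: if mutations are buffered client-side and only applied at commit, a read inside the transaction can't observe the transaction's own writes, which something like UPDATE ... WHERE needs:

  # Toy model of the write-at-commit transaction the paper describes.
  class BufferedWriteTxn:
      def __init__(self, store):
          self.store = store    # committed state
          self.buffer = {}      # pending mutations, applied at commit

      def read(self, key):
          # Deliberately ignores self.buffer: no read-your-writes.
          return self.store.get(key)

      def write(self, key, value):
          self.buffer[key] = value

      def commit(self):
          self.store.update(self.buffer)

  store = {"x": 1}
  txn = BufferedWriteTxn(store)
  txn.write("x", 2)
  print(txn.read("x"))  # 1 -- the uncommitted write is invisible
  txn.commit()
  print(store["x"])     # 2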


Tangential.

For me the fascinating thing is looking at the list of authors and recognizing so many from the 2005-2012 Microsoft SQL Server team. Folks I know personally as exceptional performers. Same when I look at the Aurora papers. I see this as the result of Ballmer's famous HR initiatives and the massive brain drain that occurred at Microsoft around 2010-ish.


Can you tell some more about the HR initiatives? (I would just google it, but a bit unclear what I should be looking for.)


You can point back to the 2004 benefits overhaul [0], the famous Towels story [1], low compensation rates compared to Google, Amazon and Facebook [2]. Just go over minimsft.blogspot.com posts at the time.

Add to this the lack of vision and direction, catastrophic acquisitions, and dismal flagship product releases. At the time there were running jokes about the inbox filling up with "After 15 years, it's time to send that email" subject lines...

  [0] http://old.seattletimes.com/html/businesstechnology/2001938654_microsoft26.html  
  [1] http://www.zdnet.com/article/microsoft-brings-back-the-towels-5000148135/  
  [2] http://minimsft.blogspot.com/2006/03/internal-microsoft-compensation.html


I was at the Google Cloud conference a few months ago and spoke to a few engineers there about Spanner. When I asked them about how it would affect their other storage options (e.g., Datastore, Cloud SQL, etc.), they said that over time, all of their internal storage systems would be moved over to Spanner. One engineer's words were "you'll be using Spanner whether you know it or not."

This engineer was clearly a cheerleader for the product, so I'm dubious as to whether that will actually happen, but it's clear that they have quite a bit of confidence in it.


That sounds a lot like the announcement of Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/introductio...

> we made the service available externally to all Azure Developers in the form of Azure DocumentDB. Azure Cosmos DB is the next big leap in the evolution of DocumentDB and we are now making it available for you to use. As a part of this release of Azure Cosmos DB, DocumentDB customers (with their data) are automatically Azure Cosmos DB customers.

From my understanding, DocumentDB ≈ Dynamo, while Cosmos DB would be closer to Cloud Spanner.


Google's game is planet-scale systems. Cloud SQL is woefully inadequate, basically a single-machine DB with replication. Datastore is built on Megastore, which is the precursor to Spanner and largely confined to a handful of nearby datacenters [think US-East].

Google's internal data will be stored on / migrated to Spanner, I expect sooner rather than later. We'll be using Spanner whenever we use Google. Furthermore, it's likely we'll use Spanner whenever we use a planet-scale system built on GCP, say Spotify.


Is anyone outside of Google or Microsoft using these proprietary KV/SQL/JSON databases?

As a developer I can't bring myself to code against something I can't install on my laptop. And as far as enterprises go, they won't use any data store that can't run their financials/HR/LDAP/SharePoint stack.

So, who uses them and for what?


Your viewpoint is shared by many, but there are lots of enterprises using proprietary cloud features. They either use an abstraction layer for running on a laptop, or otherwise a CI process that kicks off dev instances and test cases on demand, forcing you to be online when you check things in. That's not terribly new. Teams have had to find solutions/stand-ins for things like AWS load balancers, lambdas, certificate servers, etc.

Cloud Spanner, though, being fairly new and unusual (SQL, but no INSERT/UPDATE), doesn't yet have a big-name customer. Jda.com and quizlet.com were their reference customers.


Is Spanner free/open source software? Can we look at the code?


This is part of a worrying new trend. Increasingly you can't buy software anymore, only rent.

Innovation is being kept from scrutiny hidden behind closed doors. The kind of thing patents were meant to prevent back when the system wasn't broken.

Google is one of the better players in this regard, at least telling the world what they're up to. Try to figure out how something like Amazon's systems work and you'll run into a deafening wall of silence.

Funny that we're so willing to trust these "clouds" when we know next to nothing about their internal workings. I don't think the honeymoon will last forever. Somebody will eventually abuse their position and within a few years everyone will be "on prem" again.


> This is part of a worrying new trend. Increasingly you can't buy software anymore, only rent.

More or less agreed, but you'd probably have to concede that papers like this one and some of Google's other Spanner publications are admirable; they're being more open about the system's design than they have to be.

Of course Google knows that the system's secret sauce is not the concept itself, but the cost of its implementation, the infrastructural harness to support it, and the resources to reliably operate it. Even with a rough understanding of how Spanner works, it's still going to be difficult to ever migrate off it for practical reasons alone — who else is going to be able to build and run an alternative?

I have huge respect for what the Spanner team is doing, but this is a reason that Citus [1] is also very interesting to me right now. You could conceivably start out with nothing but Postgres and migrate into a Citus cluster when (and if) you need to.

If a point comes where you realize that you need out, you could either (1) see if you can scale back down to simple Postgres, (2) host your own cluster on your own infrastructure with the Citus source code, or (3) migrate onto your own Postgres sharding scheme à la Instagram. At no point do you lock yourself into custom gRPC APIs which are going to be ~impossible to get off of.

GCP and AWS provide hugely useful foundational IaaS, but they're incentivized to move beyond that layer and provide more custom solutions that (1) provide better margins, and (2) lock you into their services. As people and companies building on top of these clouds, we should be looking for whatever opportunities we can to keep our stacks generic so that AWS <-> GCP <-> Azure migrations are possible, even if a last resort.

[1] https://www.citusdata.com/product/cloud


> GCP and AWS provide hugely useful foundational IaaS, but they're incentivized to move beyond that layer and provide more custom solutions that ... lock you into their services.

To Google's credit, they are actually moving to a more open cloud environment. They started with Google App Engine, which definitely had lock-in with its own custom API. But now they are pushing container services and Kubernetes, which are really easy to move to other clouds or run on your own servers.


Does this feel like rent seeking to you? The internet is not young anymore. It feels like cloud hosting is little more than the giants allowing controlled competition as long as you rent their servers. The platforms are just extensions of the massively powerful systems they use internally.

I wonder what would happen if the open source community built a viable alternative to the cloud IaaS. Like OpenStack but not a failure :). OpenFlow has shown promise and could form the core for an open IaaS. Network virtualization is the hardest part.


> I wonder what would happen if the open source community built a viable alternative to the cloud IaaS. ... Network virtualization is the hardest part.

I would think that the hardware itself is the hardest part. Companies move to the cloud because it reduces their internal ops team, and you can scale hardware up and down extremely easily.


I respectfully disagree. Every company I've been part of that moved to the cloud already had plenty of hardware. It was the ease of provisioning VMs, backups, deployment, and networking that really convinced them.

VMware comes close to offering the same thing on premises, but Oracle is too obsessed with wringing money from it to let it reach its full potential.


Interestingly, Amazon is also presenting a paper this year at SIGMOD, about Aurora: http://dl.acm.org/authorize?N37778

If my memory serves, it's the second paper they've ever published (the other one being the famous Dynamo paper[1]).

[1] http://cloudgroup.neu.edu.cn/papers/cloud%20data%20storage/d...


Google is not doing anything differently here from what they did before. They've been releasing papers instead of code since 2003 [1].

I think overall this has been a good thing. Had they open-sourced, we'd all be using their stuff, but there wouldn't have been the many competing open source big data projects. The fact that the rest of the world was forced to reimplement what Google did created a beehive of open source technologies which is going places Google never did.

[1] http://blog.mikiobraun.de/2013/02/big-data-beyond-map-reduce...


> Try to figure out how something like Amazon's systems work and you'll run into a deafening wall of silence.

It looks to me like recent history contradicts this point. Amazon is responsible for the Dynamo paper, a mandatory read if you're going to deal with open source NoSQL DBs. All the big players have released how they work internally, giving us Storm and Heron, Luigi, LevelDB, and so many others. Surely you have heard of Kafka? It comes from a big player as well.

I seriously can't understand where you get this from. The amount of internals that is shared is really interesting.


The Dynamo paper is from 2007... How much has Amazon published about their internal workings since then?


SaaS. Even if it were open sourced, as the original paper calls out, there is specialized hardware involved for keeping clocks in sync [0].

If you want something open source, CockroachDB is the closest right now.

https://www.cockroachlabs.com/blog/living-without-atomic-clo...


The bigger issue is that you need Google's incredible inter-DC networking, which in practice makes partitions very rare. Eric Brewer (author of the CAP theorem) lays out here [0] how Spanner relies on those networking guarantees to be effectively CA.

Google's inter-DC traffic flows entirely on private links rather than over the public internet, which is very hard for any other company to match on a global scale.

[0] https://static.googleusercontent.com/media/research.google.c...


If you run your code on Google Cloud Platform, you should get the same benefits, no?


Good point


Specialized, but not proprietary. Spanner presumably uses off-the-shelf Chip-Scale Atomic Clock (CSAC) modules with custom system-level integration. But you can buy CSACs yourself—mounted on PCIe cards—for your own data center. (Every site I can find is of the "request a quote" variety where prices are concerned. They were apparently $1500 apiece in 2011, and have probably come down since then.)

But you don't need an atomic clock to get Spanner's guarantees. CSACs are more convenient in terms of setup and requirements—but GPS clock-sources will do just fine for Spanner's 6ms quantum. You can buy a commercial-off-the-shelf GPS NTP appliance (https://www.microsemi.com/products/timing-synchronization-sy...) and run signal lines from it to all your machines; or cobble a similar solution together yourself, in true HAM fashion, using a GPS antenna + a Linux box + a UHF Software-Defined Radio card (i.e. a TV-tuner card) + GPSd + NTPd. (Or you could buy cheap USB GPS receivers and hook them up to each and every server in your DC—but you'd need a very thin ceiling for that to work.)

Amusingly, the practicality of that last approach also means that you could run Spanner just fine on a cluster of Android phones. :)
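
If anyone wants to play with the cobbled-together version: here's a rough sketch (assuming gpsd is running on its default localhost:2947) that reads a time fix over gpsd's JSON socket protocol and compares it to the system clock. In a real setup you'd let NTPd discipline the clock from gpsd instead of polling like this:

  # Rough sketch: compare the system clock against a gpsd time fix.
  import json, socket, datetime

  def gps_offset_seconds():
      with socket.create_connection(("localhost", 2947)) as s:
          s.sendall(b'?WATCH={"enable":true,"json":true};\n')
          for line in s.makefile():
              report = json.loads(line)
              if report.get("class") == "TPV" and "time" in report:
                  gps = datetime.datetime.fromisoformat(
                      report["time"].replace("Z", "+00:00"))
                  local = datetime.datetime.now(datetime.timezone.utc)
                  return (local - gps).total_seconds()

  print("offset vs GPS: %+.3fs" % gps_offset_seconds())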


> But you don't need an atomic clock to get Spanner's guarantees.

I always hear this from CockroachDB folks and fans, but no details. What are the downsides?

As per my understanding, if a server/replica can't keep up with respect to time (i.e. goes out of sync), it can be identified and usually marked as a temporary replica failure -- which is the maximum extent that hybrid logical clocks can help us with. The rest of the system has to deal with the consequences: a new replica must take its place to maintain the fault-tolerance level and start making a new copy -- an operation on the order of the amount of data stored in the out-of-sync replica. I assume this puts a lot of different design requirements on the storage engine; also, the making-a-copy and cancel-copying operations would eat network traffic, and thus throughput.

IIRC Google Spanner uses atomic clocks on only a few servers per datacenter because cross-datacenter latencies are much higher and more erratic (over the internet). So CockroachDB would have a much higher rate of temporary failures due to clocks going out of sync, and the associated downsides. It would be helpful if the CockroachDB guys could shed some light on this.
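
(For context, the hybrid logical clock update rule I mean is roughly the one below, per the Kulkarni et al. HLC paper; CockroachDB's actual implementation differs in the details.)

  # HLC sketch: l tracks the max physical time seen so far, and c is
  # a logical counter that breaks ties among events sharing the same l.
  import time

  class HLC:
      def __init__(self):
          self.l, self.c = 0, 0

      def now(self):  # local or send event
          pt = int(time.time() * 1e9)
          if pt > self.l:
              self.l, self.c = pt, 0
          else:
              self.c += 1
          return self.l, self.c

      def update(self, ml, mc):  # receive event carrying remote (ml, mc)
          pt = int(time.time() * 1e9)
          if pt > max(self.l, ml):
              self.l, self.c = pt, 0
          elif self.l == ml:
              self.c = max(self.c, mc) + 1
          elif ml > self.l:
              self.l, self.c = ml, mc + 1
          else:
              self.c += 1
          return self.l, self.c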


(Cockroach Labs CTO here)

> > But you don't need an atomic clock to get Spanner's guarantees.

This comment continues "...but GPS clock-sources will do just fine for Spanner's 6ms quantum". Providing Spanner's guarantees with reasonable performance requires specialized hardware, but there are more options for that specialized hardware than just atomic clocks.

Note that Spanner itself uses both atomic and GPS time sources according to Google's publications; when we talk about "atomic clocks" we're usually talking about the entire category of specialized time-keeping hardware instead of distinguishing atomic clocks from GPS clocks.

> I always hear this from CockroachDB folks and fans, but no details. What are the downsides?

As we describe in our blog post (https://www.cockroachlabs.com/blog/living-without-atomic-clo...), CockroachDB on commodity hardware provides a slightly weaker consistency model than Spanner (serializable instead of linearizable), and latency is sometimes higher as we need to account for the larger clock offsets in certain situations.

If you do have a high-quality time source available, we have an experimental option to use a Spanner-like linearizable mode.


The CockroachDB team did put up a blog post that talks about the tradeoffs.

The pertinent part was: "Spanner always waits on writes for a short interval, whereas CockroachDB sometimes waits on reads for a longer interval. How long is that interval? Well it depends on how clocks on CockroachDB nodes are being synchronized. Using NTP, it’s likely to be up to 250ms. Not great, but the kind of transaction that would restart for the full interval would have to read constantly updated values across many nodes. In practice, these kinds of use cases exist but are the exception."

https://www.cockroachlabs.com/blog/living-without-atomic-clo...
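
The write-side mechanism (Spanner's commit wait) is simple to sketch. This is an illustration of the idea, not Spanner code: the writer stalls until its chosen timestamp is guaranteed to be in the past on every clock in the system.

  # Illustration of TrueTime-style commit wait.
  import time

  CLOCK_EPS = 0.007  # assumed worst-case clock uncertainty (~7ms)

  def commit(ts):
      # TT.after(ts) holds once even the slowest clock has passed ts,
      # i.e. once local_now - CLOCK_EPS > ts.
      while time.time() - CLOCK_EPS <= ts:
          time.sleep(0.001)
      # ...apply the write; any later reader picks a timestamp > ts.

  commit(time.time())  # blocks for roughly CLOCK_EPS before returning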


The clock thing is highly dependent on workload. If your load is append-only, clocks within a few seconds would be fine. If you truly have competing writes for the same key within ms, then clocks need to be pretty good. Most big data stuff I see is closer to the former.


CSACs seem unnecessary. Within a few racks, PTP (IEEE 1588) can do a pretty decent job of getting things synchronized to tens of microseconds. Requires that your ToRs and NICs support it, but that's not a very onerous requirement, particularly if you're GoogleBookSoft.

This means you only need a few time receivers for a datacenter, along with careful monitoring and implementation of your time distribution, but that happens through your Ethernet switches.

More: http://www.ni.com/newsletter/50130/en/

But you need your datacenters to stay synced even if you lose GPS. You can use an atomic clock for this (Cs or Rb). But, for the rest of us, a good GPS-disciplined double-oven crystal oscillator (OCXO) can get you within the range needed for Spanner's time-sync requirements, IIRC. For example, this little one: https://www.microsemi.com/document-portal/doc_download/13341...

will do ±7 microseconds over 24h of holdover ("holdover" == operating when it has lost GPS).


Sounds like spanner can't tolerate much drift: "The most serious problem would be if a local clock’s drift were greater than 200us/sec: that would break assumptions made by TrueTime."[1]

[1] http://www.bluetreble.com/2015/10/time-travel/
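
Plugging in the 2012 paper's published numbers shows where the few-millisecond uncertainty bound comes from (the ~1ms floor right after a poll is the paper's figure; treat this as back-of-the-envelope):

  drift = 200e-6       # worst-case drift TrueTime assumes: 200us/sec
  poll_interval = 30   # seconds between time-master polls
  base_eps = 1e-3      # ~1ms uncertainty right after a poll

  worst_eps = base_eps + drift * poll_interval
  print("eps just before the next poll: %.0f ms" % (worst_eps * 1e3))  # ~7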



With MySQL compatibility, this sounds really promising: https://www.pingcap.com/doc-mysql-compatibility


No, but it is available as part of Google's cloud services [0]. Particularly noteworthy is the fact that they offer an SLA with 99.99% availability. Even crazier is their pending multi-region version, due later this year, which will come with a 99.999% SLA!

[0] https://cloud.google.com/spanner/


Did they just write it during CockroachDB's release?

So sketchy.




