
What is the current story regarding replication between B2 and GCS Archive Storage?

B2 currently costs 6 $/TB/month, GCS Archive Storage costs 5x less at 1.2 $/TB/month.

Do both of these guarantee that if a single data center location catches fire, your backups will not be lost?




> B2 currently costs 6 $/TB/month, GCS Archive Storage costs 5x less at 1.2 $/TB/month.

Sure, but does the GCS "archive" storage class really fit your needs? Anything colder than GCS's "nearline" storage class has retrieval costs that swamp the storage costs themselves.

If you're doing FinOps/SecOps (storing customer invoice PDFs for compliance, audit logs in case of a retroactively-discovered hack, etc) — where the likelihood of fetching any given piece of archival data approaches zero — the "archive" storage class makes a lot of sense.

But if you're doing DevOps — e.g. storing infra backups intended to be used for PITR in case someone fat-fingers a SQL DELETE, or in case a bad upgrade corrupts a machine's state, or even just to serve as a read-replica bootstrap base-image (think e.g. pgbackrest) — then the "archive" storage class is ill-suited, because you actually are quite likely to read back the data. (And in fact, if you regularly test your backup restore process, you're essentially 100% likely to read back the data!)
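
To put rough numbers on that claim, here's a back-of-the-envelope sketch in Python. The per-GB prices are approximations of GCS's list prices at the time of writing (roughly $0.0012/GB-mo storage plus $0.05/GB retrieval for "archive", $0.010/GB-mo plus $0.01/GB for "nearline"), so treat them as illustrative and check the current price sheet:

    # Monthly cost of keeping 1 TB of backups in GCS "archive" vs "nearline",
    # as a function of how much of it you read back each month.
    # Prices are approximate list prices at time of writing; verify before relying on them.
    TB = 1024  # GB

    ARCHIVE_STORE, ARCHIVE_RETRIEVE = 0.0012, 0.05   # $/GB-month, $/GB retrieval fee
    NEARLINE_STORE, NEARLINE_RETRIEVE = 0.010, 0.01  # $/GB-month, $/GB retrieval fee

    def monthly_cost(store, retrieve, gb_stored, gb_read_back):
        return store * gb_stored + retrieve * gb_read_back

    for read_fraction in (0.0, 0.25, 1.0):  # 1.0 = one full restore test per month
        a = monthly_cost(ARCHIVE_STORE, ARCHIVE_RETRIEVE, TB, TB * read_fraction)
        n = monthly_cost(NEARLINE_STORE, NEARLINE_RETRIEVE, TB, TB * read_fraction)
        print(f"read back {read_fraction:>4.0%}/mo: archive ${a:6.2f}, nearline ${n:6.2f}")

Reading back even a quarter of the data each month already makes "archive" the more expensive option, and a monthly full restore test makes it cost well over double "nearline" (and that's before egress).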

For this reason, I've personally never seen anyone use "coldline" or "archive" storage classes for DevOps infra backups. Instead, SREs seem to gravitate to the "nearline" storage class for this use-case.

(There's also the fact that anything colder than "nearline" imposes a minimum storage duration, so you pay for the data for that whole period even if you delete it sooner. The processes involved in DevOps infra backups often trigger periodic full backups, which then obviate the need to retain the previous full/incremental backups. Do you really need a 12-month-old DB backup, or only the 1-month-old and 2-month-old ones?)
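
Concretely, GCS's minimum storage durations are (per their docs at the time of writing) 30 days for "nearline", 90 for "coldline", and 365 for "archive"; deleting an object earlier just converts the remainder into an early-deletion charge. A quick sketch of what that does to a full backup you rotate out after 60 days (same caveat: illustrative prices):

    # Effective cost of a 1 TB full backup that gets rotated out (deleted) after 60 days,
    # under GCS minimum-storage-duration rules. Prices/durations approximate, at time of writing.
    GB = 1024
    days_kept = 60

    classes = {
        # name: ($/GB-month, minimum storage duration in days)
        "nearline": (0.010, 30),
        "coldline": (0.004, 90),
        "archive":  (0.0012, 365),
    }

    for name, (price_per_gb_month, min_days) in classes.items():
        billed_days = max(days_kept, min_days)              # early deletion still bills the minimum
        cost = price_per_gb_month * GB * billed_days / 30   # approximating a month as 30 days
        print(f"{name:>8}: billed for {billed_days:3d} days -> ${cost:6.2f}")

On that two-month rotation, "archive" actually comes out more expensive than "coldline", and only about 25-30% cheaper than "nearline", despite its 5x-cheaper headline rate.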

And Backblaze B2 is cheaper than GCS "nearline" storage — without any of the minimum storage lifetime requirements of GCS's colder storage classes.

(GCS Nearline vs Backblaze B2, at $10/TB-mo vs $6/TB-mo, isn't as much of a slam-dunk as comparing either of them to GCS Standard; but if your base load lives outside of GCP, adding the GCS egress costs back into the equation still makes it pretty clear that GCS won't come out the winner on a TCO basis.)
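
A similar back-of-the-envelope for that TCO point, for 1 TB stored plus 1 TB/month restored to machines outside GCP. Assumptions (approximate, at time of writing): GCS internet egress of very roughly $0.12/GB, and B2's policy of free egress up to 3x your average stored bytes; verify both before relying on this:

    # Rough monthly TCO: 1 TB stored, 1 TB/month restored to servers outside GCP.
    # All prices approximate/illustrative.
    GB = 1024

    # GCS nearline: storage + retrieval fee + internet egress
    gcs_nearline = 0.010 * GB + 0.01 * GB + 0.12 * GB

    # Backblaze B2: storage only; 1 TB of egress falls within the "3x stored" free allowance
    b2 = 0.006 * GB

    print(f"GCS nearline: ${gcs_nearline:7.2f}/mo")   # roughly $143
    print(f"Backblaze B2: ${b2:7.2f}/mo")             # roughly $6

The egress line item dominates everything else, which is why having your base load outside of GCP flips the comparison so hard.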

> Do both of these guarantee that if a single data center location catches fire, your backups will not be lost?

Hard to say, actually! Let's look specifically at the case of a fire.

GCP uses the IaaS-standard term "availability zones" to describe data centers in the same region that, along with other types of isolation, are physically distant enough that a fire in one can't spread to another. From the GCS docs:

> Cloud Storage redundantly stores objects that are written to it in at least two different availability zones before considering the write to be successful.

Backblaze, meanwhile, seems to use the term "vaults" here, apparently meaning racks/rooms of backup storage that are independent from a power/cooling/data perspective. (And possibly each with its own isolated little room and its own fire suppression? They don't mention it, but that's how I'd picture it.) They say that there are multiple vaults "per data center." But often an entire campus of physically-isolated buildings (what IaaS terminology would call separate AZs) would be considered one "data center" in traditional hosting terms, if it's all owned by one company and operated by one shared ops staff. Since Backblaze doesn't introduce anything quite like the AZ concept in their docs, it's unclear whether their data centers can guarantee data durability in the event of a fire that destroys a single "vault."

What is clear is that neither Google nor Backblaze considers this type of data durability to be the most important kind. Which makes sense: there are other types of disasters that can befall a data center, or even an entire city, that will knock out (and potentially corrupt!) all AZs in that "region." I'd mention floods as the obvious thing to picture, but nobody builds a data center on a floodplain. Instead, how about: climate change shifting the paths of tornadoes unpredictably; compression of a tectonic plate heaving and thrusting up random areas of land (potentially inside a city) by several meters; and key cities (and data centers themselves, as strategic targets) being bombed during a war.

Designing your archival storage to be resilient to these problems will of course also protect your data in the event of a smaller-scale disaster like a fire, a meteorite impact, or some spiteful person ramming a truck into one of the DC's buildings.

Therefore, both Google and Backblaze offer inter-regional replication for the purposes of "geographic data redundancy." GCS has "dual-region" and "multi-region" buckets; Backblaze has "Cloud Replication" (https://www.backblaze.com/cloud-storage/features/replication), where you essentially make a bucket in one region a streaming read-replica of a bucket in another region.

In both of these cases, enabling the feature costs money. Both providers consider the use-case for this level of durability rare enough that they don't make it a guaranteed, cross-subsidized default; instead, they just offer it to the few customers who think they really need it, or who are required by law or regulation to have it.
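
For completeness: on the GCS side, opting into geo-redundancy is just a location choice at bucket-creation time. A minimal sketch with the google-cloud-storage Python client (bucket name is a placeholder; "NAM4" is one of the predefined dual-regions, Iowa + South Carolina):

    # Sketch: create a nearline bucket in a predefined dual-region, so objects are stored
    # redundantly across two separate regions rather than just two zones of one region.
    from google.cloud import storage

    client = storage.Client()
    bucket = storage.Bucket(client, "example-infra-backups")  # placeholder name
    bucket.storage_class = "NEARLINE"
    client.create_bucket(bucket, location="NAM4")

Backblaze's equivalent is configured as a replication rule between two buckets in different regions, per the link above.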



