I wonder if the underlying technology is from their acquisition of Alooma [1] several years back. High volume change data capture from an OLTP database to a data warehouse is a problem that seems simple on its face, but has a lot of nasty edge cases. Alooma did a decent job of handling the problem at moderate scale, so I'm cautiously optimistic about this announcement.
> problem that seems simple on its face, but has a lot of nasty edge cases.
I highly recommend teams not roll their own solutions, because of all these edge cases. My boss refused to pay for a vendor solution because it was an 'easy problem' according to him. He denied the whole team promotions because we spent a lot of time solving the 'easy problem', and outright asked us 'what's so hard about it that it deserves a promotion'. Worst person I've ever worked for.
- Differences in handling ambiguous or bad data (e.g., null characters)
- Handling backfills
- Handling table schema changes
- Writing merge queries to handle deletes/updates in a cost-effective way (see the sketch after this list)
- Scrubbing the binlog of PII or other information that shouldn't make its way into the data warehouse
- Determining which tables to replicate, and which to leave behind
- Ability to replay the log from a point-in-time in case of an outage or other incident
And I'm sure there are a lot more I'm not thinking of. None of these are terribly difficult in isolation, but there's a long tail of issues like these that need to be solved.
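To make the merge-query point concrete, here's roughly the shape of statement a pipeline ends up running to fold a batch of captured changes into the target table. This is only a sketch: the dataset, table, and column names are made up, each staged row is assumed to carry an operation flag and a change sequence number, and a real version also has to worry about partition pruning and write frequency to keep costs sane.

    # Hypothetical example: fold a staging table of CDC rows into the target.
    from google.cloud import bigquery

    client = bigquery.Client()

    merge_sql = """
    MERGE `analytics.orders` AS t
    USING (
      -- keep only the latest change per primary key in this batch
      SELECT * EXCEPT (rn) FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_seq DESC) AS rn
        FROM `staging.orders_changes`
      )
      WHERE rn = 1
    ) AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'd' THEN
      DELETE
    WHEN MATCHED THEN
      UPDATE SET t.status = s.status, t.amount = s.amount
    WHEN NOT MATCHED AND s.op != 'd' THEN
      INSERT (id, status, amount) VALUES (s.id, s.status, s.amount)
    """

    client.query(merge_sql).result()  # run the merge and wait for it to finish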
Keeping track of new records without hammering the database.
Keeping track of updates without a dedicated column.
Ditto for deletes.
Schema drift.
Managing datatype differences, e.g. Postgres supports complex datatypes that other DBs do not.
Stored procedures.
Basically, any feature of a DB beyond the bare minimum necessary to call something a relational database is an edge case.
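To make the first few points concrete: the "obvious" alternative to reading the log is to poll on a timestamp column, something like the sketch below (assuming a hypothetical orders table with a maintained updated_at column; the DSN is a placeholder). It shows exactly why the list above exists: hard deletes never show up, tables without such a column can't be tracked at all, and polling too aggressively hammers the source.

    # Naive "poll on a timestamp" extractor -- the thing log-based CDC replaces.
    import time

    import psycopg2

    conn = psycopg2.connect("host=replica dbname=shop user=readonly")  # placeholder DSN
    last_seen = "1970-01-01"

    while True:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, status, amount, updated_at FROM orders "
                "WHERE updated_at > %s ORDER BY updated_at",
                (last_seen,),
            )
            rows = cur.fetchall()
        if rows:
            last_seen = max(row[3] for row in rows)
            # ship rows to the warehouse here; deleted rows never appear
        time.sleep(60)  # the interval is a trade-off between freshness and load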
I rolled a solution for Postgres -> BQ by hand and it was much more involved than I expected. I had to set up a read-only replica and completely copy over the databases each run, which is only possible because they are so tiny.
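For the curious, the brute-force version is roughly the sketch below; it assumes tiny tables, a read replica, and placeholder project/dataset/table names.

    # Full re-copy of each table from a Postgres replica into BigQuery.
    import tempfile

    import psycopg2
    from google.cloud import bigquery

    pg = psycopg2.connect("host=replica dbname=shop user=readonly")  # placeholder DSN
    bq = bigquery.Client()

    for table in ["orders", "customers"]:  # placeholder table list
        with tempfile.NamedTemporaryFile("w+", suffix=".csv") as tmp:
            with pg.cursor() as cur:
                cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV", tmp)
            tmp.flush()
            with open(tmp.name, "rb") as fh:
                job = bq.load_table_from_file(
                    fh,
                    f"my-project.analytics.{table}",
                    job_config=bigquery.LoadJobConfig(
                        source_format=bigquery.SourceFormat.CSV,
                        write_disposition="WRITE_TRUNCATE",  # full overwrite each run
                        autodetect=True,
                    ),
                )
                job.result()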
Finally! We use both Cloud SQL (Postgres) and BigQuery. Given both are managed GCP products you’d expect a simple way to replicate Postgres into BigQuery. Instead we’ve had to pay for Stitch/Fivetran/Airbyte or use some janky “BigQuery external connections” to do this. Very happy they’re finally supporting an obvious use case. Surprised it’s taken so long.
We've been using the external queries feature of BQ against Cloud SQL, then redirecting the output to a new table. It's not perfect, but it's functional, so long as your tables are very simple.
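For anyone who hasn't used that feature, the scheduled job is roughly this shape (the connection ID, dataset, and column names below are placeholders):

    # Materialize a federated Cloud SQL query into a BigQuery table.
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    CREATE OR REPLACE TABLE `analytics.orders_mirror` AS
    SELECT *
    FROM EXTERNAL_QUERY(
      'my-project.us.cloudsql-connection',
      'SELECT id, status, amount, updated_at FROM orders'
    )
    """

    client.query(sql).result()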
We currently use the Kafka Connect BigQuery connector, along with Debezium, to "stream" both the "changelog"/"transaction log" and to "mirror" our RDBMS instances into BigQuery. While this works, it's been a fair amount of effort to iron out issues over time. We've also had to work around BigQuery limits, including exceeding concurrent queries (we switched to batch mode, which has its own issues) and frequency of writes (we've had to throttle to flushing every minute, which is good enough, but we did have a use case for faster updates). We also have issues related to partitioning and clustering, and more...
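For context, the source side of that setup is just a Debezium Postgres connector registered with Kafka Connect, roughly like the sketch below (hostnames, credentials, and the table list are placeholders; topic.prefix is the Debezium 2.x property name, older versions use database.server.name).

    # Register a Debezium Postgres source connector via the Kafka Connect REST API.
    import requests

    connector = {
        "name": "shop-postgres-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "pg.internal",
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": "********",
            "database.dbname": "shop",
            "table.include.list": "public.orders,public.customers",
            "topic.prefix": "shop",
        },
    }

    resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
    resp.raise_for_status()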
So this looked appealing as a potential replacement for the Kafka Connect BigQuery connector. However, according to the docs and listed limitations (https://cloud.google.com/datastream/docs/sources-postgresql), it doesn't handle schema changes well, nor Postgres array types. Not that any of these tools handle those well, but with the open source BigQuery connector we've been able to work around this by customizing the code. Hopefully they'll continue to iterate on the product, and I'll be keeping an eye out.
Yeah, the Debezium connectors have some issues that really get in the way. I'm less familiar with BQ, but with some other DBs the data typing is really, really basic
(case when string then varchar(4000/max))
and similar. It looks like a relatively easy thing to incrementally improve.
Hey, Gunnar here from the Debezium team. I would love to learn more about those issues you encountered with Debezium. Perhaps it's something we could improve. So if you could share your feedback either here or on our mailing list (https://groups.google.com/g/debezium), that would be awesome. Thanks!
For us, Debezium has been working rather well in recent memory (there were a couple of PRs I submitted that I'm proud of, even if tiny!). Most of the issues are on the Kafka connector side, whether it's the BigQuery sink, JDBC sink, or S3 sink.
A couple of things that do pop to mind as they relate to Debezium include better capabilities around backups + restores and disaster recovery, and any further hints around schema changes. Admittedly I haven't looked at these areas for 6+ months, so they may have improved.
Well, I'm trying to find it now! Maybe it has been updated. It was essentially a case statement on the destination side which went for some wide data types, but looking now, I see where it is looking at the schemas and getting more precision.
No support for geometry or ranges, alas! "Datastream doesn't support columns of the Geometric data types or Range data types. The values in these columns will be replaced with NULL values."
We have been using RDS -> DMS -> Kinesis -> BigQuery for 3 years now. Very little maintenance, and it has worked without issues. I doubt this service will be cost-competitive with a setup like ours. Perhaps between Cloud SQL -> BigQuery it will make sense?
On a related note, Salesforce and Snowflake announced a partnership for real-time data sharing. Not exactly the same, but it's interesting to see data warehouse companies building upstream. I wonder if there's a trend going on here where data warehouse companies start natively integrating with more databases, apps, and other data sources in (near) real-time.
Yes, I think this will be the case. Google makes it very easy to integrate with customers' BigQuery instances. Have the customer supply a service account token and a target dataset, then run a query with the output table configured in that target dataset. Done.
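In client-library terms, that pattern is roughly the sketch below (the credentials file, project, dataset, and query are all placeholders):

    # Run a query with the results written into the customer's own dataset.
    from google.cloud import bigquery
    from google.oauth2 import service_account

    creds = service_account.Credentials.from_service_account_file(
        "customer-sa.json",  # supplied by the customer
        scopes=["https://www.googleapis.com/auth/bigquery"],
    )
    client = bigquery.Client(credentials=creds, project="customer-project")

    job_config = bigquery.QueryJobConfig(
        destination="customer-project.customer_dataset.daily_metrics",
        write_disposition="WRITE_TRUNCATE",
    )
    client.query(
        "SELECT CURRENT_DATE() AS day, 42 AS metric",  # placeholder query
        job_config=job_config,
    ).result()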
This seems like an ideal use case for us. I have a naive thinking of my workflow: can someone please comment if I am off track.
I am building an e-commerce product on AWS PostgreSQL. Every day, I want to be able to do analytics on order volume, new customers, etc.
- For us to track internally: we fire client and backend events into Amplitude
- For sellers to track: we query PostgreSQL directly to export data
Now with this, I am thinking of constantly streaming our SQL tables to BigQuery. Any analysis can then be done on top of this BigQuery instance, across both internal tracking and external exports.
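Concretely, I'm picturing the internal metrics becoming scheduled queries on the replicated table, something like this (dataset, table, and column names are placeholders):

    # Daily order volume and new-customer counts off the replicated table.
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    SELECT DATE(created_at) AS day,
           COUNT(*) AS orders,
           COUNT(DISTINCT customer_id) AS unique_customers
    FROM `analytics.orders`
    WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY day
    ORDER BY day
    """

    for row in client.query(sql).result():
        print(row.day, row.orders, row.unique_customers)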
As a heavy BQ user on my side projects, I find there isn't really an alternative to BQ in AWS. Redshift does not provide a lot of the functionality and ease of use that BQ provides.
That said the closest thing is Amazon Athena.
The architecture would basically be Kinesis -> S3 <- Athena where S3 is your data lake or you can do it like AWS DMS -> S3 <- Athena.
The reason you may want to use Kinesis is because you can use Flink in Kinesis Data Analytics just like you can use DataFlow in GCP to aggregate some metrics before dumping them into your data lake/warehouse.
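In boto3 terms, the two ends of that pipeline look roughly like the sketch below. It assumes something (Firehose or a Flink job) is already delivering the stream to S3, and the stream, database, and bucket names are placeholders.

    # Producer side: push change events onto the Kinesis stream.
    import json

    import boto3

    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName="orders-changes",
        Data=json.dumps({"id": 123, "status": "shipped"}).encode(),
        PartitionKey="123",
    )

    # Query side: Athena reads whatever landed in the S3 data lake.
    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) AS n FROM orders_changes GROUP BY status",
        QueryExecutionContext={"Database": "lake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )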
BQ is SaaS proper vs Redshift, where you have to pick instance sizes etc. It’s amazing, true superpower stuff in how little you have to think about it to get loads out of it.
Exactly this (using BigQuery but AWS for everything else) is pretty common. It takes a while to build a service like this; AWS spent too long going in the wrong direction (Redshift) and hasn’t been able to catch up.
At ~$2.5/GiB of changes in Europe, this is a very expensive service. Is the sync process that resource-intensive, are they increasing margins to recoup development costs, or is it just targeting (big enterprise) customers that are okay paying this much for data sync?
This is a common thing for BigQuery to claim. It's quite fast for an analytics database, but it is not a transactional database. Even for static data, the simplest queries on the tiniest data still take 1/10th of a second or so, and realistic queries run in the 1-5s range.
Basically you could use BigQuery to build tools supporting human-in-the-loop processes (BI reports, exec dashboards), and you could call those "real-time" to the humans involved. But it will not support transactional workloads, and it does not provide SLAs to support that. I don't know about this particular feature, I'm guessing seconds but maybe minutes.
> When using the default stream settings, the latency from reading the data in the source to streaming it into the destination is between 10 and 120 seconds.
The lifting-and-dropping process from transactional stores into analytics stores is usually the first step in analytics data pipelines. This appears to fulfill that initial ingestion step specifically for Google BigQuery. A lot of the time these days this ingestion step is run through some third-party service: Fivetran, Stitch, Matillion, etc. Similar to AWS's offerings for Redshift, this looks like Google's "native" ingestion tool for BigQuery.
I'd be curious to know how it handles the wrinkles Fivetran effectively solves (schema drift, logging, general performance) and if it does so at a much cheaper price.
If you use some sort of ETL to copy data from your OLTP database to a data warehouse, you might use this to stream data from your OLTP database to BigQuery, so you can use BigQuery as your data warehouse.
- Startup that's running analytics off their production Postgres DB can use this as a gateway drug to try out BigQuery. Bonus for not having to ETL the data themselves or pay Fivetran / host Airbyte to do it
- Replace existing internally maintained sync job / ETL pipelines with this, if you're somewhat all-in on BigQuery / GCP for analytics (BQ is quite popular in analytics because of the pricing structure)
[1]: https://techcrunch.com/2019/02/19/google-acquires-cloud-migr...