I wonder if the underlying technology is from their acquisition of Alooma [1] several years back. High volume change data capture from an OLTP database to a data warehouse is a problem that seems simple on its face, but has a lot of nasty edge cases. Alooma did a decent job of handling the problem at moderate scale, so I'm cautiously optimistic about this announcement.
> problem that seems simple on its face, but has a lot of nasty edge cases.
I highly recommend teams not roll their own solutions, because of all these edge cases. My boss refused to pay for a vendor solution because it was an 'easy problem' according to him. He denied the whole team promotions because we spent a lot of time solving the 'easy problem', and outright asked us 'what's so hard about it that it deserves a promotion'. Worst person I've ever worked for.
- Differences in handling ambiguous or bad data (e.g., null characters)
- Handling backfills
- Handling table schema changes
- Writing merge queries to handle deletes/updates in a cost-effective way (see the sketch after this list)
- Scrubbing the binlog of PII or other information that shouldn't make its way into the data warehouse
- Determining which tables to replicate, and which to leave behind
- Ability to replay the log from a point-in-time in case of an outage or other incident
And I'm sure there are a lot more I'm not thinking of. None of these are terribly difficult in isolation, but there's a long tail of issues like these that need to be solved.
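To make the merge-query point concrete, here's roughly the shape of statement a pipeline ends up running to fold a batch of captured changes into the target table. This is only a sketch: the dataset, table, and column names are made up, each staged row is assumed to carry an operation flag and a change sequence number, and a real version also has to worry about partition pruning and write frequency to keep costs sane.

    # Hypothetical example: fold a staging table of CDC rows into the target.
    from google.cloud import bigquery

    client = bigquery.Client()

    merge_sql = """
    MERGE `analytics.orders` AS t
    USING (
      -- keep only the latest change per primary key in this batch
      SELECT * EXCEPT (rn) FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_seq DESC) AS rn
        FROM `staging.orders_changes`
      )
      WHERE rn = 1
    ) AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'd' THEN
      DELETE
    WHEN MATCHED THEN
      UPDATE SET t.status = s.status, t.amount = s.amount
    WHEN NOT MATCHED AND s.op != 'd' THEN
      INSERT (id, status, amount) VALUES (s.id, s.status, s.amount)
    """

    client.query(merge_sql).result()  # run the merge and wait for it to finish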
Keeping track of new records without hammering the database.
Keeping track of updates without a dedicated column.
Ditto for deletes.
Schema drift.
Managing datatype differences, e.g. Postgres supports complex datatypes that other DBs do not.
Stored procedures.
Basically, any feature of a DB beyond the bare minimum necessary to call something a relational database is an edge case.
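To make the first few points concrete: the "obvious" alternative to reading the log is to poll on a timestamp column, something like the sketch below (assuming a hypothetical orders table with a maintained updated_at column; the DSN is a placeholder). It shows exactly why the list above exists: hard deletes never show up, tables without such a column can't be tracked at all, and polling too aggressively hammers the source.

    # Naive "poll on a timestamp" extractor -- the thing log-based CDC replaces.
    import time

    import psycopg2

    conn = psycopg2.connect("host=replica dbname=shop user=readonly")  # placeholder DSN
    last_seen = "1970-01-01"

    while True:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, status, amount, updated_at FROM orders "
                "WHERE updated_at > %s ORDER BY updated_at",
                (last_seen,),
            )
            rows = cur.fetchall()
        if rows:
            last_seen = max(row[3] for row in rows)
            # ship rows to the warehouse here; deleted rows never appear
        time.sleep(60)  # the interval is a trade-off between freshness and load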
I rolled a solution for Postgres -> BQ by hand and it was much more involved than I expected. I had to set up a read-only replica and completely copy over the databases each run, which is only possible because they are so tiny.
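For the curious, the brute-force version is roughly the sketch below; it assumes tiny tables, a read replica, and placeholder project/dataset/table names.

    # Full re-copy of each table from a Postgres replica into BigQuery.
    import tempfile

    import psycopg2
    from google.cloud import bigquery

    pg = psycopg2.connect("host=replica dbname=shop user=readonly")  # placeholder DSN
    bq = bigquery.Client()

    for table in ["orders", "customers"]:  # placeholder table list
        with tempfile.NamedTemporaryFile("w+", suffix=".csv") as tmp:
            with pg.cursor() as cur:
                cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV", tmp)
            tmp.flush()
            with open(tmp.name, "rb") as fh:
                job = bq.load_table_from_file(
                    fh,
                    f"my-project.analytics.{table}",
                    job_config=bigquery.LoadJobConfig(
                        source_format=bigquery.SourceFormat.CSV,
                        write_disposition="WRITE_TRUNCATE",  # full overwrite each run
                        autodetect=True,
                    ),
                )
                job.result()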
Finally! We use both Cloud SQL (Postgres) and BigQuery. Given both are managed GCP products you’d expect a simple way to replicate Postgres into BigQuery. Instead we’ve had to pay for Stitch/Fivetran/Airbyte or use some janky “BigQuery external connections” to do this. Very happy they’re finally supporting an obvious use case. Surprised it’s taken so long.
We've been using the external queries feature of BQ against Cloud SQL, then redirecting the output to a new table. It's not perfect, but it's functional, so long as your tables are very simple.
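For anyone who hasn't used that feature, the scheduled job is roughly this shape (the connection ID, dataset, and column names below are placeholders):

    # Materialize a federated Cloud SQL query into a BigQuery table.
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    CREATE OR REPLACE TABLE `analytics.orders_mirror` AS
    SELECT *
    FROM EXTERNAL_QUERY(
      'my-project.us.cloudsql-connection',
      'SELECT id, status, amount, updated_at FROM orders'
    )
    """

    client.query(sql).result()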
We currently use the Kafka Connect BigQuery connector, along with Debezium, to "stream" both the "changelog"/"transaction log" and to "mirror" our RDBMS instances into BigQuery. While this works, it's been a fair amount of effort to iron out issues over time. We've also had to work around BigQuery limits, including exceeding concurrent queries (we switched to batch mode, which has its own issues) and frequency of writes (we've had to throttle to flushing every minute, which is good enough, but we did have a use case for faster updates). We also have issues related to partitioning and clustering, and more...
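For context, the source side of that setup is just a Debezium Postgres connector registered with Kafka Connect, roughly like the sketch below (hostnames, credentials, and the table list are placeholders; topic.prefix is the Debezium 2.x property name, older versions use database.server.name).

    # Register a Debezium Postgres source connector via the Kafka Connect REST API.
    import requests

    connector = {
        "name": "shop-postgres-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "pg.internal",
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": "********",
            "database.dbname": "shop",
            "table.include.list": "public.orders,public.customers",
            "topic.prefix": "shop",
        },
    }

    resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
    resp.raise_for_status()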
So this looked appealing as a potential replacement for the Kafka Connect BigQuery connector. However, according to the docs and listed limitations (https://cloud.google.com/datastream/docs/sources-postgresql), it doesn't handle schema changes well, nor Postgres array types. Not that any of these tools handle those well, but with the open source BigQuery connector we've been able to work around this by customizing the code. Hopefully they'll continue to iterate on the product, and I'll be keeping an eye out.
Yeah, the Debezium connectors have some issues that really get in the way. I'm less familiar with BQ, but with some other DBs the data typing is really, really basic
(case when string then varchar(4000/max))
and similar. It looks like a relatively easy thing to incrementally improve.
Hey, Gunnar here from the Debezium team. I would love to learn more about those issues you encountered with Debezium. Perhaps it's something we could improve. So if you could share your feedback either here or on our mailing list (https://groups.google.com/g/debezium), that would be awesome. Thanks!
For us, Debezium has been working rather well in recent memory (there were a couple of PRs I submitted that I'm proud of, even if tiny!). Most of the issues are on the Kafka connector side, whether it's the BigQuery sink, JDBC sink, or S3 sink.
A couple of things that do pop to mind as they relate to Debezium include better capabilities around backups + restores and disaster recovery, and any further hints around schema changes. Admittedly I haven't looked at these areas for 6+ months, so they may have improved.
Well, I'm trying to find it now! Maybe it has been updated. It was essentially a case statement on the destination side which went for some wide data types, but looking now, I see where it is looking at the schemas and getting more precision.
No support for geometry or ranges, alas! "Datastream doesn't support columns of the Geometric data types or Range data types. The values in these columns will be replaced with NULL values."
We have been using RDS -> DMS -> Kinesis -> BigQuery for 3 years now. Very little maintenance, and it has worked without issues. I doubt this service will be cost-competitive with a setup like ours. Perhaps between Cloud SQL -> BigQuery it will make sense?
On a related note, Salesforce and Snowflake announced a partnership for real-time data sharing. Not exactly the same, but it's interesting to see data warehouse companies building upstream. I wonder if there's a trend going on here where data warehouse companies start natively integrating with more databases, apps, and other data sources in (near) real-time.
Yes, I think this will be the case. Google makes it very easy to integrate with customers' BigQuery instances. Have the customer supply a service account token and a target dataset, then run a query with the output table configured in that target dataset. Done.
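In client-library terms, that pattern is roughly the sketch below (the credentials file, project, dataset, and query are all placeholders):

    # Run a query with the results written into the customer's own dataset.
    from google.cloud import bigquery
    from google.oauth2 import service_account

    creds = service_account.Credentials.from_service_account_file(
        "customer-sa.json",  # supplied by the customer
        scopes=["https://www.googleapis.com/auth/bigquery"],
    )
    client = bigquery.Client(credentials=creds, project="customer-project")

    job_config = bigquery.QueryJobConfig(
        destination="customer-project.customer_dataset.daily_metrics",
        write_disposition="WRITE_TRUNCATE",
    )
    client.query(
        "SELECT CURRENT_DATE() AS day, 42 AS metric",  # placeholder query
        job_config=job_config,
    ).result()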
This seems like an ideal use case for us. I have a naive thinking of my workflow: can someone please comment if I am off track.
I am building an e-commerce product on AWS PostgreSQL. Every day, I want to be able to do analytics on order volume, new customers, etc.
- For us to track internally: we fire client and backend events into Amplitude
- For sellers to track: we query PostgreSQL directly to export data
Now with this, I am thinking of constantly streaming our SQL tables to BigQuery. Any analysis can then be done on top of this BigQuery instance, across both internal tracking and external exports.
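Concretely, I'm picturing the internal metrics becoming scheduled queries on the replicated table, something like this (dataset, table, and column names are placeholders):

    # Daily order volume and new-customer counts off the replicated table.
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    SELECT DATE(created_at) AS day,
           COUNT(*) AS orders,
           COUNT(DISTINCT customer_id) AS unique_customers
    FROM `analytics.orders`
    WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY day
    ORDER BY day
    """

    for row in client.query(sql).result():
        print(row.day, row.orders, row.unique_customers)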
As a heavy BQ user on my side projects, I find there isn't really an alternative to BQ in AWS. Redshift does not provide a lot of the functionality and ease of use that BQ provides.
That said the closest thing is Amazon Athena.
The architecture would basically be Kinesis -> S3 <- Athena where S3 is your data lake or you can do it like AWS DMS -> S3 <- Athena.
The reason you may want to use Kinesis is because you can use Flink in Kinesis Data Analytics just like you can use DataFlow in GCP to aggregate some metrics before dumping them into your data lake/warehouse.
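In boto3 terms, the two ends of that pipeline look roughly like the sketch below. It assumes something (Firehose or a Flink job) is already delivering the stream to S3, and the stream, database, and bucket names are placeholders.

    # Producer side: push change events onto the Kinesis stream.
    import json

    import boto3

    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName="orders-changes",
        Data=json.dumps({"id": 123, "status": "shipped"}).encode(),
        PartitionKey="123",
    )

    # Query side: Athena reads whatever landed in the S3 data lake.
    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) AS n FROM orders_changes GROUP BY status",
        QueryExecutionContext={"Database": "lake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )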
BQ is SaaS proper vs Redshift, where you have to pick instance sizes etc. It’s amazing, true superpower stuff in how little you have to think about it to get loads out of it.
Exactly this (using BigQuery but AWS for everything else) is pretty common. It takes a while to build a service like this; AWS spent too long going in the wrong direction (Redshift) and hasn’t been able to catch up.
At ~$2.5/GiB of changes in Europe, this is a very expensive service. Is the sync process that resource-intensive, are they increasing margins to recoup development costs, or is it just targeting (big enterprise) customers that are okay paying this much for data sync?
This is a common thing for BigQuery to claim. It's quite fast for an analytics database, but it is not a transactional database. Even for static data, the simplest queries on the tiniest data still take 1/10th of a second or so, and realistic queries run in the 1-5s range.
Basically you could use BigQuery to build tools supporting human-in-the-loop processes (BI reports, exec dashboards), and you could call those "real-time" to the humans involved. But it will not support transactional workloads, and it does not provide SLAs to support that. I don't know about this particular feature, I'm guessing seconds but maybe minutes.
> When using the default stream settings, the latency from reading the data in the source to streaming it into the destination is between 10 and 120 seconds.
The lifting-and-dropping process from transactional stores into analytics stores is usually the first step in analytics data pipelines. This appears to fulfill that initial ingestion step specifically for Google BigQuery. A lot of the time these days this ingestion step is run through some third-party service: Fivetran, Stitch, Matillion, etc. Similar to AWS's offerings for Redshift, this looks like Google's "native" ingestion tool for BigQuery.
I'd be curious to know how it handles the wrinkles Fivetran effectively solves (schema drift, logging, general performance) and if it does so at a much cheaper price.
If you use some sort of ETL to copy data from your OLTP database to a data warehouse, you might use this to stream data from your OLTP database to BigQuery, so you can use BigQuery as your data warehouse.
- Startup that's running analytics off their production Postgres DB can use this as a gateway drug to try out BigQuery. Bonus for not having to ETL the data themselves or pay Fivetran / host Airbyte to do it
- Replace existing internally maintained sync job / ETL pipelines with this, if you're somewhat all-in on BigQuery / GCP for analytics (BQ is quite popular in analytics because of the pricing structure)
[1]: https://techcrunch.com/2019/02/19/google-acquires-cloud-migr...