Launch HN: PeerDB (YC S23) – Fast, Native ETL/ELT for Postgres
261 points by saisrirampur on July 27, 2023 | 101 comments
Hi HN! I'm Sai, the co-founder and CEO of PeerDB (https://www.peerdb.io/), a Postgres-first data-movement platform that makes moving data in and out of Postgres fast and simple. PeerDB is free and open source (https://github.com/PeerDB-io/peerdb), we provide a Docker stack for users to try it out, and there's a 5-minute quickstart here: https://docs.peerdb.io/quickstart.

For the past 8 years, first at Citus Data and then at Microsoft working on Postgres on Azure, I've worked closely with customers running Postgres at the heart of their data stack, storing anywhere from 10s of GB to 10s of TB of data.

This was when I got exposed to the challenges customers faced when moving data in and out of Postgres. Usually they would try existing ETL tools, fail, and decide to build in-house solutions. Common issues with these tools included painfully slow syncs (syncing 100s of GB of data took days), flakiness and unreliability (frequent crashes, loss of data precision on the target, etc.), and limited features (lack of configurability, unsupported data types, and so on).

I remember a specific scenario where a tool didn't support something as simple as Postgres' COPY command to ingest data, which would have improved throughput by orders of magnitude. We (the customer and I) reached out to that company to request the feature. They couldn't prioritize it because it wasn't easy for them: their tech stack was designed to support 100s of connectors rather than a native Postgres feature.

After multiple such occurrences, I thought: why not build a tool specialized for Postgres and make the lives of many Postgres users easier? I reached out to my long-time buddy Kaushik, who was building operating systems at Google and had led data teams at Safegraph and Palantir. We spent a few weeks building an MVP that streamed data in real time from Postgres to BigQuery. It was 10 times faster than existing tools and maintained data freshness of less than 30 seconds. We realized that there were many Postgres-native and infrastructural optimizations we could do to provide a rich data-movement experience for Postgres users. This is when we decided to start PeerDB!

We started with two main use cases: Real-time Change Data Capture from Postgres (demo: https://docs.peerdb.io/usecases/realtime-cdc#demo) and Real-time Streaming of query results from Postgres (demo: https://docs.peerdb.io/usecases/realtime-streaming-of-query-...). The 2nd demo shows PeerDB streaming a table with 100M rows from Postgres to Snowflake.

We implement multiple optimizations to provide a fast, reliable, feature-rich experience. For performance, we can parallelize the initial load of a large table while still ensuring consistency. Syncing 100s of GB goes from days to minutes. We do this by logically partitioning the table based on internal tuple identifiers (CTID) and streaming those partitions in parallel (inspired by this DuckDB blog: https://duckdb.org/2022/09/30/postgres-scanner.html#parallel...).
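
To illustrate the idea (a rough sketch of the technique, not PeerDB's actual implementation; big_table and the page boundaries are placeholders), a consistent CTID-partitioned parallel snapshot in plain SQL looks roughly like this:

    -- Coordinator session: open a repeatable-read transaction and export its
    -- snapshot so every worker sees the same consistent view of the table.
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();   -- returns a snapshot id, e.g. '00000003-0000001B-1'

    -- Worker sessions (run several in parallel, one per CTID partition),
    -- while the coordinator transaction stays open:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
    COPY (
      SELECT * FROM big_table
      WHERE ctid >= '(0,0)'::tid AND ctid < '(131072,0)'::tid  -- heap pages 0..131071
    ) TO STDOUT WITH (FORMAT binary);
    COMMIT;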

For CDC, we don't use Debezium; instead we handle replication natively: reading the slot, replicating the changes, keeping state, etc. We made this choice mainly for flexibility. Staying native helps us use existing and future Postgres enhancements more effectively. For example, if the order of rows across tables on the target is not important, we can parallelize reading of a single slot across multiple tables and improve performance. Our architecture is designed for real-time syncs, which enables data freshness of a few 10s of seconds even at large throughputs (10k+ tps).
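
For a feel of what reading the slot natively means, here is a minimal sketch using plain SQL functions (peerdb_demo_pub, peerdb_demo_slot and the events table are made-up names; an actual consumer would read the slot over the streaming replication protocol rather than via these functions):

    -- Publish the tables to replicate and create a pgoutput-based slot.
    CREATE PUBLICATION peerdb_demo_pub FOR TABLE events;
    SELECT pg_create_logical_replication_slot('peerdb_demo_slot', 'pgoutput');

    -- Peek at pending changes without consuming them (pgoutput emits binary
    -- messages, hence the *_binary_changes variant).
    SELECT lsn, xid
    FROM pg_logical_slot_peek_binary_changes(
      'peerdb_demo_slot', NULL, 10,
      'proto_version', '1', 'publication_names', 'peerdb_demo_pub');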

We have fault-tolerance mechanisms for reliability (https://blog.peerdb.io/using-temporal-to-scale-data-synchron...) and support multiple features, including log-based (CDC) and query-based streaming, efficient syncing of tables with large (TOAST) columns, and configurable batching and parallelism to prevent OOMs and crashes.

For usability, we provide a Postgres-compatible SQL layer for data movement. This makes the life of data engineers much easier: they can develop pipelines using a framework they are familiar with, without needing to deal with custom UIs and REST APIs, and they can use Postgres' 100s of integrations to build and manage ETL. We extend Postgres' SQL grammar with a few new, intuitive SQL commands to enable real-time data streaming across stores. Because of this, we were able to add dbt integration via Dagster (in private preview) in a few hours! We expect data engineers to build similar integrations with PeerDB easily, and we plan to make this grammar richer as we evolve.

PeerDB consists of the following components to handle data replication: (1) PeerDB Server uses the pgwire protocol to mimic a PostgreSQL server and is responsible for query routing and generating gRPC requests to the Flow API; it relies on AST analysis to make informed routing decisions. (2) Flow API is an API layer that handles gRPC commands and orchestrates the data sync operations. (3) Flow Workers execute the data read/write operations from the source to the destination; built to scale horizontally, they interact with Temporal for increased resilience. The types of data replication supported include CDC streaming replication and query-based batch replication. Workers do all of the heavy lifting and have data-store-specific optimizations.

Currently we support 6 target data stores (BigQuery, Snowflake, Postgres, S3, Kafka etc) for data movement from Postgres. This doc captures the current status of the connectors: https://docs.peerdb.io/sql/commands/supported-connectors.

As we spoke to more customers, we realized that getting data into PostgreSQL at scale is equally important and hard. For example one of our customers wants to periodically sync data in multiple SQL Server instances (running on the edge) to their centralized Postgres database. Requests for Oracle to Postgres migrations are also common. So now we’re also supporting source data stores with Postgres as the target (currently SQL Server and Postgres itself, with more to come).

We are actively working with customers to onboard them to our self-hosted enterprise offering. Our fully hosted offering on the cloud is in private preview. We haven't yet decided on pricing. One common concern we've heard from customers is that existing tools are expensive and charge based on the amount of data transferred. To address this, we are considering a more transparent way of pricing, for example pricing based on provisioned hardware (CPU, memory, disk). We're open to feedback on this!

Check out our github repo - https://github.com/PeerDB-io/peerdb and go ahead and give it a spin (5-minute quickstart https://docs.peerdb.io/quickstart).

We want to provide the world’s best data-movement experience for Postgres. We would love to get your feedback on product experience, our thesis and anything else that comes to your mind. It would be super useful for us. Thank you!




How does column exclusion work? We would like to prevent PII columns from being replicated into BigQuery.


Looks awesome! Maybe I missed it in the docs but I wonder how you solved the problem of a source table's initial snapshot when performing CDC (locking a large table on a production DB can be problematic).

Did you implement something like described here? https://netflixtechblog.com/dblog-a-generic-change-data-capt...

Lastly, I wonder what plugin you're using for the Postgres logical replication?

Congrats!

Edit: re-read it and saw the CTID-based solution for the initial snapshot inspired by the DuckDB blog. Do you find querying by ctid okay performance-wise? As far as I remember it will use sequential scans.


Great question. It uses something called a Tid Scan, which essentially means reading directly from disk based on a tuple's physical address, and Tid Scans are very efficient. The snapshot approach captured in the blog I shared ensures consistency.
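
For anyone curious, on Postgres 14+ a CTID range predicate gets its own plan node instead of a sequential scan; roughly (table name is a placeholder, costs elided):

    EXPLAIN SELECT * FROM big_table
    WHERE ctid >= '(0,0)'::tid AND ctid < '(1000,0)'::tid;
    --  Tid Range Scan on big_table  (cost=... rows=... width=...)
    --    TID Cond: ((ctid >= '(0,0)'::tid) AND (ctid < '(1000,0)'::tid))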


It does involve pinning the WAL while the backfill is happening, though, which is quite different from the incremental backfill achieved by DBLog.

The former can be faster, because you’re able to walk the table in physical storage order, but your DB cannot reclaim WAL segments while this happens (disk can fill).

The latter allows for resumable backfills that can be tuned to the available spare capacity of the source DB, and doesn’t cause WAL segments to back up, but must walk tables in logical order which can sometimes be slower (for example, if you key using a random UUID).


Congratulations on the launch. Does PostgreSQL to PostgreSQL streaming using PeerDB have any benefit over just using Streaming Replication?

Could this be used as a sort of "live backup" of your data? (i.e. just making sure that data isn't lost if the server dies down completely, not thinking of HA)

Sorry if it's a bit of a stupid question, I realize it's not the main focus of PeerDB.


Postgres streaming replication is very robust and has been in Postgres for a very long time. Logical replication/decoding (which PeerDB uses) is more recent, introduced in the last decade. However, streaming replication is harder to set up and manage, and a bit restrictive: most cloud providers don't give access to the WAL, so you cannot use streaming replication to replicate data across cloud providers.

Sure, you can use PeerDB for backing up data, using either CDC-based or query-based replication, and both of these are pretty fast with PeerDB. You can have cold backups (store data to S3, blob storage, etc.) or hot backups (another Postgres database). However, note that the replication is async and there is some lag (a few 10s of seconds) on the target data store. So if you are expecting 0 data loss, this won't be the right approach for backups/HA. With streaming replication, replication can be synchronous (the synchronous_commit setting), which helps with 0 data loss.
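
For reference, a sketch of the synchronous setup mentioned above ('replica1' is a placeholder for the standby's application_name):

    -- Make physical streaming replication synchronous for zero data loss:
    ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1)';
    ALTER SYSTEM SET synchronous_commit = 'on';
    SELECT pg_reload_conf();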


I would rather soon add GIS support, like ArcGIS and GeoPackage. You could find a really good niche in the GIS world because of how powerful PostGIS is, and there aren't that many tools that support ETL for PostGIS.


A niche in the GIS world is a very good idea. 100% agreed on how powerful (and mature) PostGIS is. I've worked with many customers for whom PostGIS was critical to their app. Thanks for taking the time to share this input. Very useful! We will iterate and explore this further.


QGIS and PostGIS are the primary open-source universal tools.


I would say most orgs that have massive geo ETL problems use FME. Love it or hate it, it's probably able to do it.


Excited to give this a try. I tested almost every existing solution and as you said, nothing worked at scale.

In the end I wrote my own tool specifically optimized for the use cases. It’s been rock solid for years so I know that better solutions are possible. But it has its limitations with little flexibility and I wouldn’t want to try an initial load of a table with 1B rows.

It’s great to see someone with experience with large high throughput instances working on this.


Thank you for the above comment. This is exactly why we are building PeerDB. Let us know when you are testing PeerDB out; we would love to collaborate and help as much as possible. It will be great feedback for us too!


Good stuff! It's pretty nuts Snowflake doesn't offer an integration like this out of the box. BigQuery kind of supports this[1], but it's not easy to set up or monitor.

Good luck!

1 - https://www.youtube.com/watch?v=ZNvuobLvL6M


Thanks for the comment! Yep, Google provides Datastream. We tried it out and the experience was pretty good! However, it was very much tied to the GCP ecosystem. With PeerDB, our goal is to be open: to be community-driven rather than cloud-driven. Also, just to mention, as called out in the post there are more features apart from CDC (query-based streaming, Postgres as the target, etc.) that we will keep adding to help Postgres users.


Nice! I like the focus on Postgres. Most ETL tools end up trying to build for a larger matrix of source and targets which limits using database specific features and optimizations. Is the CDC built primarily on top of the logical replication / logical decoding infrastructure in Postgres? If so, what are the limitations in that infrastructure which you'd like to see addressed in future Postgres versions?


That is a really good question! A few of them that come to my mind:

1/ logical replication support for schema (DDL) changes

2/ a native logical replication plugin (not wal2json) that is easier to read from the client side. pgoutput is fast, but reading/parsing it from the client side is not as straightforward.

3/ improve decoding perf - I've observed pgoutput cap at 10-15k changes per second for an average use case, and this is after a good amount of tuning (e.g. logical_decoding_work_mem). Enabling larger throughput - 50k+ tps - would be great. This is important for Postgres, considering the diverse variety of workloads users are running. For example, at Citus I saw customers doing 500k rps (with COPY); I am not sure logical replication can handle those cases.

4/ logical replication slots in remote storage. One big risk with slots is that they can grow in size (if not read properly) and use up storage on the source; allowing slots to be shipped to remote storage would really help (a quick query for monitoring slot growth is sketched after this list). I think Oracle allows something like this, but I'm not 100% sure.

5/ logical decoding on standbys. It is coming in Postgres 16! We will aim to support it in PeerDB right after it is available.

I can think of many more, but sharing a few top ones that came to my mind!
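
On point 4, a quick sketch of how to keep an eye on how much WAL a lagging slot is pinning (a standard catalog query, nothing PeerDB-specific):

    SELECT slot_name, active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots
    WHERE slot_type = 'logical';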


Congratulations on the launch Sai. Having worked with him over the years, I know that Sai knows what postgres migration means. I have seen him deal with countless migrations in and out of our services. I am excited to see what they have built


Yes please! I love this. The abstraction required for more generic ETL solutions makes them a real pain for my two use cases: Postgres-to-Postgres (online instance to analytics instance) and Postgres-to-BigQuery (online WAL change data to BigQuery).

I cannot wait to try this to see if I can remove Meltano (Postgres-to-Postgres) and my custom Postgres-to-Bigquery code.


Glad that our thesis resonated with you! Let us know how the tests go and also please feel free to reach out to us anytime. We would love to collaborate with you during the implementation process and see how best we can help. Would be great feedback for us too!


Pretty cool stuff, I would use it just for mirroring the data itself. Curious if you are planning to have change events for e.g. add/update/delete to the records? I would love to get them in a stream and directly dumped into a data-store like bigquery.


Yep, change events (CDC) are already supported! PeerDB replicates any DML (insert/update/delete) efficiently to the target data store (incl. BigQuery).


Your code seems to be relying on logical events reaching you in order from the Postgres server. That doesn't always happen, and if a client restart follows on top of that, you effectively lose data. How do you deal with this?


Hi - Congratulations! In the streaming use case, does it restart from where it left off in case the target peer or source peer is down/restarts etc?


Great question. Yes it does. PeerDB keeps track of what rows have been streamed and what are yet to be streamed. During failures (restarts, crashes etc), it uses this to resume from where it left off. More details on how we do it can be found in this blog - https://blog.peerdb.io/using-temporal-to-scale-data-synchron...


Thanks. So if we do a join on two large tables, does it wait for the query to complete or can it start straight away?

Or the reverse: is it possible to have a forever-running query that can stream results as and when new data comes in?


Yes, PeerDB can stream query data continuously. You need to specify a watermark column (incremental id or timestamp column) as a part of the mirror. PeerDB uses this column to keep track of data that needs to be synced.
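
Conceptually, each batch of a watermark-based pull looks something like this (table and column names are hypothetical, not PeerDB's actual query):

    -- Pull rows newer than the last synced watermark, then record the new
    -- high-water mark for the next batch.
    SELECT *
    FROM orders
    WHERE updated_at > '2023-07-27 00:00:00+00'   -- last synced watermark value
    ORDER BY updated_at
    LIMIT 10000;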


Looks very intriguing! Tried to get something quickly going with a small db set up I have. Just ran into a `peer type not supported` error and was wondering which three databases are supported of the ones you have listed. See the attached picture. https://d.pr/i/HYIk0Z+


That is a typo, you should be able to create all those peers. For which type of peer did you run into this issue?


Congrats on the launch!

I've worked with Sai for years, so I just wanted to put in a good word for PeerDB and its founders. Sai is resourceful and relentless; his energy and optimism are contagious. Kaushik complements that with deep backend and analysis skills.

Data movement is a big pain point with different players. I think it's time that there's a Postgres-centric solution out there built by a team who gets Postgres. Best of luck!


Congrats!! We also focus on performance at CloudQuery (https://github.com/cloudquery/cloudquery) by using Golang, gRPC and still trying to be abstract enough to support different databases :)

In any case good luck!


Seems like a really useful tool. Would your system support Postgres Aurora on AWS as a source database? Or does it require some lower-level access to Postgres server?

We are currently using DMS to send data to S3 and from there to Snowflake.


PeerDB should work for Aurora PostgreSQL, for both log-based (CDC) and query-based replication. Log-based works because Aurora supports the pgoutput plugin. Curious: are you leveraging CDC to move data to S3, or is it more query (batch) based?


We use DMS in continuous replication mode, which appears to use CDC under the hood according to https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Task.C...

In our setup DMS pushes Parquet files on s3. Snowflake then loads data from there.

We’ve occasionally had to do a full table sync from scratch, which is painfully slow. We are going to have to do that in the very near future - when we are upgrading from Postgres 11 to Postgres 15.

The S3 step also seems unnecessarily complicated, since we have to expire data from the bucket.

How does PeerDB handle things like schema changes? Would the change replicate to Snowflake? (I’m sure this is in the docs, but I’m supposed to be on holiday this week ) Thanks for the quick reply.


Gotcha, that really helps. The schema changes feature is coming soon! We are actively working on it. This thread captures our thinking around it: https://news.ycombinator.com/item?id=36895220 Also, have a good holiday! :)


I think you were referring to this thread: https://news.ycombinator.com/item?id=36897010


Thank you for pointing to the right thread! :)


Congratulations on the launch. Looks great.


"Frustratingly simple"...that is a very strange adverb to use there in my opinion. When I'm assessing a tool, I don't think frustrating is a word I want to see used to describe it. Just me?


Having dealt with ETL tooling that starts out "simple" and, next thing you know, needs a few dedicated hires for what was supposed to be a simple pipeline, the phrase resonates a lot. If you've already built out a team to do this and tried multiple different tools, and then something just works, I'd be all for it; to me it would have been exactly that, frustrating, to have gone down some other path first.

As a vision statement to me it resonates, now curious to give it a try and see if it fulfills on that vision.


Good point! We got mixed feedback on this. Some people really loved it and some felt similar to what you mentioned. Transparently, we left it as is because it intrigued the audience. However, point taken; we'll take this as input for future changes, if any :)


It is cute but since it is a negative word, I would suggest not using it. Instead use something like "Amazingly Simple" or "Incredibly Simple".


Maybe "surprisingly simple"?


Very interesting. One of our colleagues/friends also proposed this one! :) Will take this as input if we change it. Thanks!


Another vote for "surprisingly". Although "frustratingly" is kind of fun because it's so weird...just too negative. If you want to keep a provocative adverb, maybe "oddly" or "weirdly" would be fun too...not normal, but not so negative.


I like it. Nothing is better than a pleasant surprise. You win this one dang!!


I agree, it's definitely something that I would notice while reading.


It makes me think of "cute aggression"


Any thoughts on type 2 SCD when loading Postgres data into Snowflake/BigQuery? We use Fivetran's "history mode" for this and it's been very useful.


So, you’ve got funding from Hacker News’s side hustle, great, but what’s the business model beyond that? Why would anyone pay you for hosting if all the code is open-source and anyone can host it on their own, in their own cloud of choice?

(Also, not having pricing when launching seems like a very strange choice, since potential buyers might pass and never come back.)


Thanks for the feedback! Many dev-tools and infra products, specifically in the Postgres space, are open source. Citus (my previous gig) is an example here. Customers still pay for these products for 2 main reasons: a/ operationalizing an open-source tool to support production workloads requires a good amount of effort (e.g. setting up HA, advanced metrics/monitoring, etc.), and they want to offload that by buying a paid offering that is more plug and play; b/ they want to work with a team that empathizes with their challenges, has expertise in the area, and helps make them successful. With PeerDB, we are expecting something similar and are committed to making our customers successful.

On the pricing side, valid feedback. We are actively working with customers and are coming up with custom (reasonable) pricing based on their use case and level of usage of the product. Through this process we are getting a ton of feedback. As mentioned in the post, a common concern we heard from customers is that existing tools are expensive (pricing is a black box): they charge based on the amount of data transferred. We are thinking of ways to make pricing more transparent (see the post for what our thinking has been so far), but haven't landed on the right strategy yet. We didn't want to rush into publishing any pricing.


Many businesses, big or small, are cheapskates. If the paid offerings don’t get you much over the free one, many companies will just take the open-source thing and make it work for them. Does the “Cloud” offering even get you any support?


We anticipate both groups of businesses. Considering we are building a product for ETL/data movement, which innately has multiple moving parts and is fragile, we anticipate a good chunk of businesses preferring to offload the effort of managing it to us!


They don't have to convert everyone, just enough to cover their costs (and pay back investors).


In the benchmark section of the homepage, it seems like the performance of "peerdb" is ranked second in every category. The way it's laid out - the title, then the green bar (which is supposed to indicate something), and then all the text in the same formatting, including the header - is, to be honest, a bit confusing to look at.


Point taken. The green bar indicates how fast peerdb is. But I understand how it can cause confusion. We received this feedback from a couple other folks too! We will fix it soon. :)


If you look at it carefully, the green bar indicates the performance of PeerDB.


Yep, that is correct. How fast peerdb is vs competition.


What happens if there's a table schema change on a mapped table on the source side? What about on the target side?


Hi there, I’m Kaushik, one of the co-founders of PeerDB. PeerDB doesn’t handle schema changes today.

For CDC, the change stream does give us events in case of schema changes; we would have to replay them on the destination. Schema changes on the destination are not supported; the general recommendation is to build new tables/views and let PeerDB manage the destination table.

For streaming the results of a query, as long as the query itself can execute (say a few columns were added or untouched columns were edited), the mirror job will continue to execute. If that is not the case, some manual intervention will be needed to account for the schema changes.

Thanks for the question, this is a requested feature and on our roadmap.


Calling out limitations like this in the documentation would go a long way in building confidence in the project. Better yet, if there's an example of how to deal with "day-2" operational concerns like this.

Simply looking at the docs on these two pages, it's unclear to me whether there's a way to update the mirror definition when a schema change occurs or if I need to drop & recreate the mirror (and what the effects of this are in the destination):

- https://docs.peerdb.io/sql/commands/create-mirror

- https://docs.peerdb.io/usecases/Streaming%20Query%20Replicat...

All-in-all, very excited to see this project and will be watching it closely!


Thanks for the feedback, and I agree on making these missing features more visible in our documentation! We did it here - https://docs.peerdb.io/usecases/Real-time%20CDC/postgres-to-... - but will make it more visible soon, i.e. in the streaming query, CDC, and CREATE MIRROR docs. We were thinking of something along the lines of ALTER MIRROR, or providing a new OPTION in CREATE MIRROR that will automatically pick up schema changes. The exact spec is not yet finalized.


Hope in the future you introduce more source and target DBs, just like the ReplicaDB open-source software.


Thanks for the comment! As we work with customers we will add more source and target DBs. A couple of things: our scope as of now is data movement in/out of Postgres, and as we add more data stores as sources/targets for Postgres, providing a high-quality experience will be a higher priority than expanding coverage.


Congrats! I'm curious where PeerDB positions itself relative to ReplicaDB. Also, with regard to sources, I'm curious whether CSV will ever be on the roadmap. Although it's antiquated, it's ubiquitous, so I was surprised it's not supported at launch. That said, I've encountered ridiculous challenges with data movement from CSV to Postgres and I'm curious if that alone was the blocker?


Thank you. Our goal is to focus on postgres and provide a fast (by native optimizations, see above post for a few examples), simple and a feature-rich data-movement experience in/out of Postgres. So adding more connectors based on customer feedback will be a part of this journey!

In regards to CSV as a connector, Postgres' COPY command should do it, right? Am I missing something? Is it CSV files in cold storage (like S3, etc.), or periodic streaming of CSV files into Postgres?
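
For a one-off load, something like this is the baseline I had in mind (table and file names are placeholders):

    -- Server-side COPY (the file must be readable by the Postgres server process):
    COPY products FROM '/data/products.csv' WITH (FORMAT csv, HEADER true);
    -- Or client-side via psql's \copy, which streams a local file:
    -- \copy products FROM 'products.csv' WITH (FORMAT csv, HEADER true)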


That's right! If it's easy, then it should be easy for your team to add; but if it's not easy, then it'd be even more useful for your team to add! Win-win.


You bring up a great point. Periodically streaming CSV files to Postgres from storage through a single SQL command (CREATE MIRROR) is indeed very helpful for customers. We will add this to our product prioritization. With the infra we have, this shouldn't be too hard to support!


Looks good! Do you have any benchmark against Debezium for CDC?


Not yet, but very soon. A few benefits of PeerDB vs Debezium include: 1/ easy to set up and work with - no dependence on Kafka, ZooKeeper, or Kafka Connect; 2/ a managed experience for CDC from PostgreSQL through our enterprise & hosted offerings; 3/ performance-wise, with the optimizations we are doing (parallelized initial loads, parallelized reading of slots, leaner signature of CDC on the target), I'm expecting PeerDB to be better, though I'm not sure by how much. Stay tuned for a future post on this :)


Congrats on your launch, great to see that much activity in the CDC field. A few comments on the comparison to Debezium (I've been its project lead for many years):

> no dependence on kafka, zookeper, kafka connect

That's not required with Debezium either, using Debezium Server (even with Kafka, ZK is obsolete nowadays anyways)

> managed experience for CDC from PostgreSQL through our enterprise & hosted offerings

There's several hosted offerings based on Debezium (one example being what we do at decodable.co)

> performance wise, with the optimization we are doing

I'd love to learn more about this. Debezium also supports parallel snapshotting, but I'm not clear what exactly you mean by parallelized reading of slots and the CDC impact on targets. Looking forward to reading your blog post :)


Thanks for the above reply! Useful feedback/inputs for us.

> Looks like Debezium Server doesn't require Kafka or ZK for setup. However, it supports only messaging queues as sinks (targets). So to stream CDC from Postgres to a DWH, one needs to a) set up/manage messaging infra as a part of their stack to capture CDC changes and b) write/manage reading from the message queue and replaying the changes to the target store (e.g. Snowflake, BQ). With PeerDB, you can skip these 2 steps: CREATE MIRROR can have targets that are queues, DWHs, or databases.

> That is true; however, the ones we tried aren't very simple to work with. For example, Confluent was super tricky: one has to set up a sink (to message queues), use another connector to move those changes to Snowflake, and use something else to normalize those changes into the final table. Overall, the number of moving parts was quite high. decodable.co might give a better experience, will give it a shot! :)

> On parallel snapshotting, very interesting; it looks like it is a recent feature Debezium added (March 2023). I missed that one. In regards to parallelized reading of the slot, Postgres enables you to read a single slot concurrently across 2 connections using 2 separate publications, where each publication filters a set of tables. Same here, we are also excited for the benchmarks. Will keep you posted! :)

Thanks again!


This looks good, we are going to give it a try. Any plans to support ClickHouse as a destination? Perhaps a ReplacingMergeTree in the end.


Do you plan to help create better ways to ingest data into Postgres too? There are notable gaps there. Or is the focus mostly egress?


Our goal is to make data movement in/out of Postgres fast and simple! Based on customer feedback, we already added SQL Server as a supported source for streaming query results to Postgres. We will indeed keep adding more connectors. Here is a reference to another relevant thread: https://news.ycombinator.com/item?id=36895220


Cool. Well I wish you every success. I may suggest a feature for you all.


Congratulations on the launch! Looks really exciting.


Moving data in and out of Postgres in a fast and reliable way is exactly what my startup needs. I am looking forward to trying PeerDB!


Is an elasticsearch sink on the roadmap?


We hadn't planned for an ES sink. However, after the HN launch there have been a few requests from customers. Will add that as input to our roadmap. Will keep you posted! Thanks for posting the question.


Any plans on supporting redshift as target?


Redshift should work as it is PostgreSQL-based: under the hood we use simple DML, DDL, and COPY commands. We haven't yet tested it, but it's worth giving it a shot! We have a user testing PeerDB with a Redshift-like database and it works.


Not as trivial, as some data types are different (jsonb, array, uuid, etc.), but will give it a try.


Gotcha, worth giving it a shot! If any data type behaves in a finicky way, let us know (via a GitHub issue); we should be able to add support quickly.


This is amazing!


Can you summarize the value prop please? I can't read this long form essay.


PeerDB syncs your Postgres data with other databases in real-time.


Thanks for chiming in! :) Adding a bit more color -

Fast and simple way to move data in and out of Postgres.

This includes moving data from multiple sources to Postgres and moving data to multiple targets from Postgres.


Must be fate... I'm CTO of a company suddenly faced with this exact problem.

Do you guys work with Aurora, and can you push the data to Redshift? We're currently looking at Airbyte but looking at other options as well.


Yes, we should. You can also benefit from a few of the performance enhancements and usability features mentioned in the above post. We would also love to collaborate with you during the evaluation process; it would be great feedback for us. You can reach out to us via Request Access on our website. Would be happy to assist you :)


I would give AWS DMS a try and if not happy go with airbyte or similar.


Can this be used with Citus or Hydra?


Both should be supported as target data-stores.

As a source, PeerDB should likely work with any Postgres-based database (like Citus). Query-based replication should work! Log-based (CDC) replication could have a few quirks, i.e. the source database should support the "pgoutput" format for change data capture. As we evolve, we do plan to enable a native data-movement experience for Postgres-based (both extensions and Postgres-compatible) databases!


I love it. We'll talk soon.


I would probably just use Kafka Connect for data migration, but this looks pretty cool.


We tested Kafka Connect and felt it was hard to work with. Had to set up 3 steps: a) stream changes to Kafka, b) sync raw changes from Kafka to the target data store, and c) normalize the raw changes on the target. Overall it wasn't that trivial and there were multiple moving parts. PeerDB should be much simpler than that: create peers and create a mirror using simple SQL commands, which takes care of all the above steps.

Also, my understanding is that Kafka Connect means you need to use Debezium. Sharing a thread which captures a few benefits of PeerDB vs Debezium: https://news.ycombinator.com/item?id=36898413


Looks really cool! Nice work.


Awesome!


Meta: These links are redirecting to https://docs.peerdb.io/introduction#demo for me:

Real-time Change Data Capture from Postgres (demo: https://docs.peerdb.io/usecases/realtime-cdc#demo)

Real-time Streaming of query results from Postgres (demo: https://docs.peerdb.io/usecases/realtime-streaming-of-query-...)


Thank you for pointing this out! Just fixed it.


On the supported connectors page[0], the link to the right of the row for Postgres/S3 is to localhost instead of the docs.

[0] https://docs.peerdb.io/sql/commands/supported-connectors#:~:...


Thanks a lot! Fixed it



