PolarDB, yet another open source database system based on PostgreSQL (github.com/alibaba)
143 points by jinqueeny on May 30, 2021 | 87 comments



Oh, what's going on here?

> OceanBase, the database of Alibaba's fintech company Ant Group, will be open-source soon, possibly as early as June 1, according to Sina Tech.

https://cntechpost.com/2021/05/27/ant-groups-in-house-databa...

Also the team published many papers about PolarDB.

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C48&q=%22...


Looks like they're different technologies. I found that OceanBase has been offered in AliCloud under the ApsaraDB umbrella: https://www.alibabacloud.com/product/oceanbase


Sorry I didn't make myself clear. I was surprised because two databases from Alibaba are open sourced or to be open sourced.


Alibaba is huge and different teams have different goals.

That one is for fintech.


These are just KPI projects, since their performance is judged by meaningless metrics. OceanBase WAS open sourced; after the news reports and publicity, Alibaba deleted all the code and replaced it with a Chinese announcement saying it would no longer be an open source project.


Looking at their code tree, this is based on 11.2 (https://www.postgresql.org/docs/11/release-11-2.html): https://github.com/alibaba/PolarDB-for-PostgreSQL/blob/maste...

So this is code from two years ago, missing stability fixes from upstream up to 11.12, at first glance.


Looks like it, yeah.

Also, Postgres 12 introduced pluggable storage, which might help to implement a shared-nothing architecture without huge changes to vanilla Postgres (I haven't looked at how large their delta is)


Citus Data enables scale out, while also being a pure extension. That means you can upgrade Postgres like normal to the latest point release using whatever normal upgrade process you want (e.g. OS packages).

It has worked for a long time without the need for Postgres 12. However, the new APIs introduced in v12 did enable us to offer columnar compression as an option, which complements a lot of scale-out use cases.

See: https://www.citusdata.com/blog/2021/03/06/citus-10-columnar-...

I believe using the extension facilities of Postgres is far superior to a fork in the medium to long term.

(Disclaimer: I work for Citus, and on columnar compression.)


> I work for Citus, and on columnar compression.

:-)

I am waiting for basic index support for columnar tables!

https://github.com/citusdata/citus/pull/4950

:-)


Are there any stability fixes you have in mind? In my recollection, there is not much instability in PG releases.



OTOH the PolarDB specific changes seem to be contained enough that if you decide to run it in production, you can probably just apply most of the changes from the v11 branch yourself.

But I agree it's not a very good look to code-drop something on a .2 release when there have been 2.5 years of fixes.


The diffs are non-trivial (EDIT: only included changes in directories where conflicts are most likely to occur):

    $ git log --oneline REL_11_2..REL_11_12 -- src/backend src/include ':!src/backend/po' | wc -l
    536
    $ git diff --stat REL_11_2..REL_11_12 -- src/backend src/include ':!src/backend/po' | tail -1
     352 files changed, 14651 insertions(+), 7078 deletions(-)

Even if the conflicts are minor, it's going to be annoying to try to work it out. If you are hitting a specific crash, there's a good chance you can backport the fix cleanly, but I doubt you can just pull in all of the fixes proactively without some knowledge of the details of the fork.

I haven't really looked at the details... perhaps PolarDB already has many (or all) of the fixes since 11.2. Also I haven't actually tried a merge, I'm just assuming the difficulty based on the number of diffs (and my experience doing minor version merges in the past).

(Disclaimer: I work for Citus Data. Citus takes the approach of a pure extension, which means it works on unmodified Postgres, and minor upgrades typically don't interfere at all.)


What are people/companies using currently to make their postgres database distributed these days?

Currently running my app + db on Heroku in the EU and would like to scale out since latency is abysmal for users in Australia and Asia.


fly.io[1] supports Postgres clusters with read replicas in different cities[2]

It looks like AWS's Aurora Postgres doesn't support cross-region read replicas, but apparently their Aurora "Global Database" offering does[3]

- [1] https://fly.io

- [2] https://fly.io/docs/reference/postgres/#scaling-horizontally...

- [3] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...


For fully managed Postgres (similar to Heroku Postgres), Crunchy Bridge [1] supports replicas across regions, including, say, EU to AU. That would actually pair really well with Fly.io for the app side.

- [1] https://www.crunchydata.com/products/crunchy-bridge/


The simple solution is to run Postgres replicas in those regions and do local reads for your app while directing the writes to the master region. This works well for read-heavy apps, and you can also put the master in a geographically central location to help with write latency.
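To make the replica setup above concrete, here is a minimal sketch of the routing decision an app could make per statement. Everything in it is an assumption for illustration: the `ReplicaRouter` class, the region keys, and the example hostnames are hypothetical, and the read/write classification is deliberately crude.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ReplicaRouter:
    """Pick a connection string per statement: writes go to the primary,
    plain reads go to the replica in the caller's region when one exists."""

    primary_dsn: str
    replica_dsns: Dict[str, str] = field(default_factory=dict)  # region -> DSN

    def dsn_for(self, sql: str, region: str) -> str:
        # Crude classification: only statements starting with SELECT and not
        # containing FOR UPDATE count as reads. Anything else, including
        # WITH queries, is sent to the primary to stay on the safe side.
        words = sql.lstrip().split(None, 1)
        first_word = words[0].upper() if words else ""
        is_read = first_word == "SELECT" and "FOR UPDATE" not in sql.upper()
        if is_read:
            # Fall back to the primary when the region has no replica.
            return self.replica_dsns.get(region, self.primary_dsn)
        return self.primary_dsn


# Hypothetical hostnames, for illustration only.
router = ReplicaRouter(
    primary_dsn="postgresql://primary.eu.example/app",
    replica_dsns={"au": "postgresql://replica.au.example/app"},
)
```

Note that replicas lag the primary, so a read issued right after a write may not see it; an app that needs read-your-writes would have to pin those reads to the primary.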

If you need cross-region multi-master then I recommend something like CockroachDB or Yugabyte that have regional distribution as a core feature.


I’m glancing over the Heroku Postgres docs and they seem to offer easy to set up read replicas called “followers”. Wouldn’t this work for you?


Heroku is only available in EU and US east. Latency from Australia is somewhat unavoidable. You can maybe reduce it by using Cloudflare or similar to terminate TLS as close to your users as possible.


Cloudflare gives different Anycast IPs for each plan, and they don't always hit the local ingestion point. Some people have shot themselves in the foot with the Free/Pro plans in certain markets (India, Australia, etc.). As of May 2021, I don't believe local ingestion (TLS termination) happens in MEL or SYD on anything less than the Business plan.


I think Citus is one option. Or possibly TimescaleDB


Those are meant for a single location, not cross-region distribution.


I don't think that's right.

Citus has a post from 2017 talking about allowing for horizontal scaling of a postgres instance. [0]

Same thing with Timescale. [1]

--------------------------------------------

[0] https://citusdata.com/blog/2017/02/16/citus61-released/

[1] https://blog.timescale.com/blog/timescaledb-2-0-a-multi-node...


It seems that Alibaba is going more open source than Amazon for their clouds services. That’s interesting and they may use this approach as a key differentiator.


Based on what metric? Amazon took over Elastic's open source server and another major one IIRC. That was huge. PostgreSQL is not about to disappear.


They did it only after ES decided to abandon the Apache license. I doubt that Amazon would have done it if they were not forced. Redshift is also Postgres based and Amazon never released something. So based on that metric Alibaba is more open.


There would be no open source Elastic had Amazon not taken it on. It was a smart move and benefits their balance sheet to do so. The point though is that move was greater than replicating existing open source work.


? A team at my previous workplace patched Elasticsearch to add functionality for years. Could they have done that if there were no open source Elastic?


I agree you can't patch software without an actively supported open source package. You may have meant to reply to someone else.


No commit history? It would have been interesting to follow the development work to see what changes they've made compared to PostgreSQL.


This is effectively required when open sourcing something that was previously internal. I've done it at a company. There could be company-internal things that leak in old revisions of code and commit messages. Even if the latest commit on master is clean, it is really hard to know that every revision throughout the whole history is clean.


Good point. In such a case, I think it would still be nice to at least try to split the total change into a few separate commits that build upon each other. That way the commit log would be a good starting point for someone who wants to understand the code base.


With the first commit being the specific commit from the PostgreSQL repo which they base their work on, i.e. that they forked from.


They are following the Gaussian approach. Show only the final result!


What other DBs are based on Postgres?


Greenplum

Hadapt

Netezza

PipelineDb

Postgres-XL

Redshift

AgensGraph

TimescaleDB

Fujitsu Enterprise Postgres

PolarDB

CrunchyBridge


A quick note on Crunchy Bridge, it is pure Postgres, not "based" on Postgres. We haven't forked or modified code at all. And some of the above (Timescale) are extensions which do not change Postgres, but hook into the extension APIs.


Yes, this is an important distinction: an extension for Postgres vs. a fork of it.

The short-sightedness of forking boggles my mind every time. Oh, you don't want a team of hundreds of literally the best Postgres programmers in the world working for you full time all the time for free? Ok dude, have fun with your fork.


Forking also happens because of the license.


Yugabyte


It has an API compatible with the postgresql wire protocol, but I don't think it fits in with the list since it doesn't share any code.


Yugabyte lifts the complete top query layer from Postgres and swaps the back end for Raft per table. It definitely qualifies.


I think "based on" is more like implemented on top of, similar to timescaledb


EdgeDB


Looks like EdgeDB hasn't done enough marketing. No one has mentioned EdgeDB except you. Well we're (I love EdgeDB) still in beta.


Greenplum, Postgres-XL and PipelineDB.


I'm still really upset that Confluent bought and ruined PipelineDB. It solved my problem perfectly and I haven't yet seen another one that fits what I needed as well.


Looks to me like an acquihire of a team that would otherwise (probably) have ended up winding up the company, sadly.



TimescaleDB.


Yellowbrick


Cockroachdb


No, this one's built from scratch in Go and based off LevelDB/RocksDB. They support Postgres wire protocol, but not based off any Postgres code.


This field is getting crowded. I wonder if there will be some type of consolidation.


There’s no need for consolidation when Postgres already exists.


Sometimes I dream about a highly available multi primary postgresql.


Do you also dream of loss of certain ACID semantics or crippling performance issues?

In the real world, it is extremely hard to provide all the same guarantees you get out of a single instance of <database vendor> if you turn around and spread it across the internet.

If you have super deep control over the physical & temporal environment around your system, you can cheat the rules a little bit (i.e. Google).


Yes, I know that the perfect database cannot exist. I think some trade-offs are possible though. Today I use PostgreSQL with a primary and replicas, together with CouchDB in the same application. They complement each other, but I think something good in between is possible.


I would recommend having a look at Yugabyte.


Why yugabyte over cockroachdb?


Licensing. CockroachDB does not allow as-a-service.


Yugabyte does allow it though? I could take it and create my own company based around providing yugabyte as a service?


Yes. The core database is fully Apache 2 licensed with no strings attached: https://docs.yugabyte.com/latest/legal/. Maybe it will change if the company behind feels they're ripped off. Who knows, but for now, yes, you could.

CockroachDB has an explicit clause in its licensing: Yes, employees and contractors can use your internal CockroachDB instance as a service, but no people outside of your organization will be able to use it without purchasing a license: https://www.cockroachlabs.com/docs/v21.1/licensing-faqs.html....


Thanks. I should try it out.


CockroachDB?


Perhaps. It's missing the 30 years of experience, the PostgreSQL reputation, an open source license, and a better name.


> a better name

NGL, I've had to explain to enough people by now that the "funny name" is because it was made by ex-Googlers with a view towards resiliency, as per the idiom that only cockroaches will survive global nuclear war.


YugabyteDB?


yea, some sort of vitessdb equivalent to postgresql would be amazing!


there used to be citus but now it's part of microsoft Azure


True, it is developed by Microsoft and available as a service on Azure. It is also open source, actively maintained and improved, and it's a PG13-compatible Postgres extension that adds both distributed database capabilities and columnar storage. :)

https://github.com/citusdata/citus

(Citus engineer)


> PG13-compatible

So not (IBM-System-)R-compatible, then?


Well, some use cases need something that is distributed, like PolarDB, YugabyteDB, CockroachDB, etc.


I'd love to try this out. What I'm missing though is at least some Docker based deployment not involving building stuff from sources and reinventing distributed architecture myself.


Can anybody comment on whether this actually works properly and does what it claims to do?


What are the differences between this and YugabyteDB?


Yugabyte is a different DB technology than Postgres but with a PG compatibility layer, which means you can use existing Postgres queries/applications; the rest is a different beast. This one is more comparable to Citus, a PG extension, meaning it runs on top of Postgres. However, according to their GitHub page they also offer a patched version of PG, which I suppose offers some tighter integration with the extension.


“high availability through Paxos based replication”. I thought Paxos is supposed to be a CP system.


HA isn't really about 100% availability. Any single system that promises such is extremely likely to be misleading you somehow. Your in-flight query is going to get interrupted no matter how fancy your clustering is, and I struggle to even come up with hypothetical use cases where this is something you can't afford to have happen, ever.

All you need is that in the event of a failure the clustered system can still recover quickly enough (to a well-defined state!) that the application layer can deal with the transient failure without significant impact on users, maintaining the illusion of availability.
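As a rough illustration of the application layer absorbing a transient failure, here is a minimal retry-with-backoff sketch. The `TransientError` class and `run_with_retry` helper are hypothetical names, not part of any driver; a real app would catch whatever its database driver raises when a connection drops during failover.

```python
import time


class TransientError(Exception):
    """Stand-in for whatever a driver raises when a connection drops mid-query."""


def run_with_retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError, back off exponentially and retry.

    The sleep function is injectable so the waiting can be faked in tests.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

This is only safe when the operation is idempotent, or when the app can tell that the interrupted transaction did not commit; blindly retrying a write can apply it twice.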


> I struggle to even come up with hypothetical use cases where this is something you can't afford to have happen, ever.

A rocket is using this query to adjust its thrusters ;)


Such a rocket would likely have two or more independent systems that would each have to agree on the adjustment, so one of them temporarily failing would not pose a problem. Though I doubt there are any rockets using database queries as part of their control system.

In those kinds of systems I suspect the approach is to enumerate every possible scenario and prove that the system behaves correctly in all of them, and if you can't do that, the system may be too complex and you need to redesign it to be simpler so that you can guarantee that it does not fail.


You have N >= 3 nodes, or N >= 2 and a non-compute arbitrator. One of them goes down, stops responding. You still have quorum, data processes continue just fine. That's high-availability in a CP system.
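The majority rule described above fits in a couple of lines. This is a generic sketch of strict-majority quorum, not PolarDB's actual Paxos implementation; the function names are made up for illustration, and an arbitrator simply counts as one more voter.

```python
def has_quorum(total_voters: int, reachable_voters: int) -> bool:
    """True when a strict majority of voters can still reach each other."""
    return reachable_voters > total_voters // 2


def failures_tolerated(total_voters: int) -> int:
    """How many voters can be lost while a strict majority remains."""
    return (total_voters - 1) // 2
```

With 3 voters (three nodes, or two nodes plus an arbitrator), losing one still leaves a majority of 2, so the cluster keeps accepting writes. Losing two does not, and the remaining node must stop serving writes rather than risk diverging, which is the "C over A" part of CP. An even voter count buys nothing extra: 4 voters tolerate one failure, the same as 3.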


[flagged]


How do you think the next dominant Database platform will start? Is your expectation that innovation requires polish before hitting the (free open source) “marketplace”?


My expectation is that the next dominant platform will offer something unique which creates enough value that I'm willing to overlook the unfinished parts that deduct value.

When Neo4j and other graph databases went big in the past decade, the value they created was: "finally! I don't have to faff around with RDBMS tables and Codd purists to store and query my non-relational object-graph!" - despite the fact that Neo4j then (and still?) didn't support running several databases concurrently, or schema enforcement, or even transactional atomicity (I think they fixed that recently?)

So in this context, what business-value does PolarDB add or create that makes it worthwhile to deal with its expected short-life?


It’s distributed Postgres. The same reasons you would use Cockroach or Citus instead of stock Postgres also justify using polar.

Given that there’s multiple companies that exist to scale out Postgres, someone sees business value.


> It extends PostgreSQL to become a share-nothing distributed database, which supports global data consistency and ACID across database nodes, distributed SQL processing, and data redundancy and high availability through Paxos based replication. PolarDB is designed to add values and new features to PostgreSQL in dimensions of high performance, scalability, high availability, and elasticity. At the same time, PolarDB remains SQL compatibility to single-node PostgreSQL with best effort.

That’s pretty compelling, if it delivers.


It is Alibaba: its life might not be so short? They are huge and have a cloud provider arm.


This! Very often the US-centric community here on HN underestimates the sheer size and expertise in Alibaba's cloud ecosystem.


> How do you think the next dominant Database platform will start?

Not this way.


At least it's open source and NOT made by Google?

Also, how do you know it's a 'buggier version of postgres'? have you tried PolarDB yourself?


You're missing open source's main power to proliferate.


Why wouldn’t I use cockroachDB?


I don't know, maybe you should.

CockroachDB seems to be a distributed database system written in Go which has implemented a Postgres query/wire protocol compatibility layer.

PolarDB is a Postgres fork actually using the Postgres codebase and extending it to a distributed database system. Maybe one day they can unfork because it's possible to implement PolarDB on top of Postgres as an extension and/or they contribute/get all their changes into Postgres core.


Maybe because you need the as-a-service aspect and the licensing does not allow it.



