Making PostgreSQL Scale Hadoop-style: Benchmark Numbers (citusdata.com)
134 points by ranvir on Oct 31, 2014 | 31 comments



I wish this were open-source. Citus could certainly still make money hosting or supporting the code.

But lack of sharing is what we get when major open-source projects do not choose the GPL.


It's Apache V2, which is a lot better than GPL for many of us.

https://github.com/citusdata/cstore_fdw/blob/master/LICENSE

They're really nice talented people, and a great example of a company giving back to the community.


It seems that only the storage portion is open source. The portion that scales Postgresql horizontally isn't. Am I wrong?


A huge amount of effort and innovation is going into BSD/MIT/Apache projects, including PostgreSQL. What is your evidence that the results are better with the GPL?


Pricing pages without prices. I hate them.


Usually means they are trying to sell to enterprise rather than small business. It's pretty much just a variant of "If you have to ask, you can't afford it". Pricing pages like this aren't about finding the price, but about initiating a potentially months- or years-long sales process.

On the other hand, I guess they could just still be working out their pricing ;)


Products with pricing pages without prices don't exist in my world. I aim to forget products like that ASAP.


"How much?"

"How much you got?"


You pay a little, you get a little scaling.

You pay a lot, you get a lotta scaling.


"How about three-fiddy?"


three-fiddy it is! (Umur from Citus here)

I hear you, and we are working on fixing that, though it might take some time. The challenge for us is that for an enterprise, alternatives could cost literally in the millions (see Oracle pricing at $100k's for just a single 8-core commodity machine). For start-ups, we have offered Citus for prices lower than $5k per node in the past, and we provide an entirely free community version as well.

Essentially, our take is to not have pricing be what stops you from using CitusDB. And if you are an enterprise, the value you get from using Citus should far exceed what you'd get from any other alternative out there.


Yes, it would suck to not capture the value you create, and I think you totally deserve to be paid very handsomely for this tech, of course. Oh, and please make sure to charge the energy, finance and healthcare sectors extra.

It's just that "please call for pricing" means a negotiation with a sales guy, which many people find uncomfortable, unless they work in corporate purchasing.


Pricing is hard, especially for a truly high-tech product. It is always sold for less than it is truly worth, a hit you take for the art.


No, that's not really true; everything is sold for less than it is "truly" worth. Price discrimination works both ways, so there is no reason to assume the seller will capture it all.


Most of the time, when garage-based uber-hackers make a product, they lose proportionally more. Not that this is bad, but higher tech doesn't mean correspondingly more profit.



(Ozgun from Citus Data)

This benchmark confuses CitusDB with PostgreSQL + cstore_fdw extension. CitusDB scales out PostgreSQL to multiple machines, and cstore is a columnar store for PostgreSQL. The author has a clarification posted at the end.

For single-node Postgres + cstore numbers on TPC-H, we found that a few simple changes notably help: 1/ running ANALYZE on the foreign tables and increasing work_mem speeds up join queries by 2-4x, and 2/ using double precision instead of the numeric type speeds up aggregate functions by 6x.
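
To give a feel for the second point, here is a rough analogy in Python, not PostgreSQL itself and not the benchmark's code. It assumes the gap comes from the same place in both systems: numeric is an exact decimal type handled in software, while double precision maps to hardware floating point. The data set and function names are made up for illustration, and no particular speedup is claimed.

```python
from decimal import Decimal
import timeit

# Hypothetical data: 100,000 prices with two decimal places.
values_float = [i / 100 for i in range(100_000)]
values_decimal = [Decimal(i) / Decimal(100) for i in range(100_000)]


def sum_float(values):
    """Aggregate with hardware binary floats (analogous to double precision)."""
    return sum(values)


def sum_decimal(values):
    """Aggregate with software exact decimals (analogous to numeric)."""
    return sum(values, Decimal(0))


if __name__ == "__main__":
    # Time 20 full aggregations of each representation.
    t_float = timeit.timeit(lambda: sum_float(values_float), number=20)
    t_decimal = timeit.timeit(lambda: sum_decimal(values_decimal), number=20)
    print(f"float sum:   {t_float:.3f}s")
    print(f"decimal sum: {t_decimal:.3f}s")
```

The trade-off is exactness: the decimal sum is exact, while the float sum is only approximately equal to it, which matters for things like money columns.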

Lastly, we agree that vectorized execution can result in notable performance wins! See https://github.com/citusdata/postgres_vectorization_test for some initial work. We hope to incorporate some of MonetDB's vectorized execution features in cstore_fdw in the future.


Instead of seeing this as a Hadoop alternative, this might be a better alternative to clunky data warehousing options like Vertica, Netezza, Greenplum, etc.

The Citus vs Hadoop comparison feels a little apples vs oranges as presented.

I worked a bit with Netezza appliances, which use an older version of Postgres that can spread queries across a BladeCenter ... I wonder how this compares.

The downside of the Netezza (besides the huge cost) is that it is not expandable at all: to get more Netezza you need to buy another multirack system.

Also, there is a bottleneck getting data in and out, as there are individual host servers that you launch jobs through (IBM x3650s, if I remember correctly).

Hadoop does a significantly better job than something like Netezza in those 2 areas.

I guess the head-to-head comparison would be Citus vs Impala/HBase? That is probably where a 'massively parallel' Postgres setup that can scale horizontally would outperform its Hadoop counterpart.


I don't know much about the practical operation of this kind of software. What is it that makes Citus a better alternative to, say, Greenplum? Both of them are PostgreSQL-derived parallel column-store databases, right? What is Citus's USP?


Netezza is fast. It solves many problems very well. It is expandable. How can it be free and expandable at the same time while also being blazingly fast?

Free hardware?


I would love to see a comparison to a cost-matched Redshift cluster, especially since this test is running on Amazon's hardware.


Neat. Postgres has always had a kick-ass I/O layer - particularly on ext4.

I think showing Q2 and Q11 numbers would've been great, because for something like Tez, this is how those plans look in Hive (before the cost-based optimizer work)

http://people.apache.org/~gopalv/tpch-plans/q2_minimum_cost_...

http://people.apache.org/~gopalv/tpch-plans/q11_important_st...

Postgres's query planner should shine for those.


You've seen better performance on ext4 than XFS? The opposite has been my experience (mainly on 1 TB of data across 100 million rows, 20,000 queries/sec). btrfs + compression was 5x faster than XFS, but btrfs has nasty kernel deadlock bugs when the disk is almost full.


I wish PostgreSQL was easy to cluster.

I tried googling for tutorials but there are none.

Are there no books on clustering or sharding PostgreSQL either? At least I haven't found any.


It is tricky. It is also hard to make a truly fault-tolerant (FT) PostgreSQL instance, as most tutorials have a single pgpool node doing the load balancing, which just shifts the single point of failure (SPOF) to the pgpool node. You can do it more or less with a virtual IP à la http://www.pgpool.net/pgpool-web/contrib_docs/watchdog_maste...

Adding sharding on top of that follows a similar tutorial, but is even more complicated.


What about Postgres-XL?

I'm genuinely asking. I haven't used it yet, but when I was researching database design, my assumption was that we would use it if we needed to scale horizontally.


So what's the difference between Citus and Greenplum?


There are some basic SparkSQL configs not discussed in the blog post; see more here: http://apache-spark-developers-list.1001551.n3.nabble.com/Su...


Great results. Kudos to the Citus team.


What about Hive?


I don't think they make monitors wide enough to show Hive results on the same graphs.



