
Hive is a Data Warehouse that runs on top of Hadoop.

Snowflake is a Data Warehouse/Lake, but it's also its own custom SQL DB.

Cassandra is a NoSQL DB.

RDS is a managed service running some standard relational DB (MySQL, PostgreSQL, etc.).

Apache Druid is a columnar analytics DB centered around realtime uses. It has its own native (JSON-based) query language, plus a SQL layer that uses Apache Calcite for parsing and planning. Can integrate with Kafka/Hadoop.

Presto is (to my understanding) a distributed SQL query engine, something like a meta-DB that can query multiple databases through connectors. Similar to an integrated Apache Calcite or Google's ZetaSQL.
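To make the "meta-DB" idea concrete, here's a toy sketch. It is only an analogy and not Presto itself: two separate in-memory SQLite databases stand in for two backends, and a single SQL statement joins across them. The names `warehouse`, `billing`, `plays`, `accounts` and `plays_per_plan` are all made up for illustration.

```python
import sqlite3

# Toy analogy only: SQLite's ATTACH lets one connection see several
# independent databases, roughly the way a federated engine exposes
# several backends. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE warehouse.plays (user_id INTEGER, title TEXT)")
conn.execute("CREATE TABLE billing.accounts (user_id INTEGER, plan TEXT)")
conn.executemany(
    "INSERT INTO warehouse.plays VALUES (?, ?)",
    [(1, "Show A"), (2, "Show B"), (1, "Show C")],
)
conn.executemany(
    "INSERT INTO billing.accounts VALUES (?, ?)",
    [(1, "premium"), (2, "basic")],
)


def plays_per_plan(connection):
    """Join across the two attached databases in a single query."""
    return connection.execute(
        """
        SELECT a.plan, COUNT(*) AS n
        FROM warehouse.plays AS p
        JOIN billing.accounts AS a ON a.user_id = p.user_id
        GROUP BY a.plan
        ORDER BY a.plan
        """
    ).fetchall()


print(plays_per_plan(conn))
```

The point is just that the caller writes one query and doesn't care where each table physically lives. A real federated engine does this across different systems rather than across SQLite files.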

There are a LOT of overlapping concerns here, which is why there's confusion.

Essentially they have 2-3 different products in several categories, all targeting generally the same use case.




AFAICT, Hive/Snowflake are the only overlapping concerns here, and I can see why they would have both (Snowflake may be the better product, but at the end of the day Netflix owns their Hive install).

Everything else solves different problems at Netflix scale. I'm spitballing here, but Cassandra could be used for metadata serving (high throughput, embarrassingly parallel reads with high uptime), RDS for their billing system (transactions, ACID, etc.), Druid for realtime OLAP, and Presto as an interface to Hive/Snowflake.
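Written out as a rough decision sketch, that guess looks something like the following. To be clear, this is my speculation restated as code, not how Netflix actually routes anything; `Workload` and `suggest_store` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of what a workload needs."""
    needs_transactions: bool = False
    realtime_analytics: bool = False
    high_volume_key_lookups: bool = False


def suggest_store(w: Workload) -> str:
    """Map a workload to a store category, following the guess above."""
    if w.needs_transactions:
        # Billing-style data: ACID guarantees matter most.
        return "relational DB (e.g. RDS)"
    if w.realtime_analytics:
        # Fresh events queried with low latency.
        return "realtime OLAP (e.g. Druid)"
    if w.high_volume_key_lookups:
        # Metadata serving: many parallel reads, high uptime.
        return "NoSQL DB (e.g. Cassandra)"
    # Everything else: batch analytics over the warehouse.
    return "warehouse (e.g. Hive/Snowflake, queried via Presto)"


print(suggest_store(Workload(needs_transactions=True)))
print(suggest_store(Workload()))
```

The ordering of the checks is arbitrary; real workloads overlap, which is part of why a large shop ends up running all of these at once.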

A smaller company wouldn't need this level of complexity (if you aren't large, you could probably serve your metadata from MySQL, and just use Snowflake as your OLAP engine).


Is there a good resource for this comparison, plus the many streaming/ETL permutations? It's a bit confusing even just working out what the various Apache products do (Beam, Kylin, etc.).



