
Why does Netflix have half a dozen different database/database-esque systems?

Snowflake, Hive, Cassandra, RDS, Druid, Presto

Is there a good reason for this that people without experience at FAANG orgs can't grasp? To the naive, it seems like mixing Postgres, MySQL, Mongo, and Oracle.

Also, why is it on Medium behind a paywall? They are worth, quite literally, nearly $200 billion =/




FAANG companies have thousands of engineers working on their main products. At that scale, you can't have that many people working in one binary or one database. So business logic and data get siloed into separate services that individual teams own, hence the rise of microservices and of different database systems for different use cases.

They also have billions to tens of billions of dollars' worth of hardware costs per year, so what can seem like a waste of time to optimize can end up saving millions of dollars' worth of hardware a year.


> Is there a good reason for this that people without experience at FAANG orgs can't grasp?

I think yes. It's about the scale of teams working on different parts of the products.

It's not like one team runs a DB that all the other teams put their data into. Each team makes its own choice of DB for its own use cases, with its own preferences and reasons. So as a whole, Netflix or any other FAANG can grow to use many DBs, many languages, many frameworks, etc.


<rant type="offtopic">

Others have already mentioned several reasons that would justify their choices. I completely understand why a company of Netflix's size has half a dozen different databases.

Recently, I ran into a somewhat similar situation that left me asking "why the hell are there five different databases running on this ONE virtual machine?".

It was one of VMware's virtual appliances running one of their products. I want to say it was either vRealize Network Insight or vRealize Operations Manager but I'm not 100% sure. It was likely one of those or something along those lines.

To stand up the smallest possible deployment of this virtual appliance, the "minimum recommendations" were something like 8 or 10 vCPUs, 32 GB of RAM, and 800 GB of storage -- and, if memory serves, this was for a single node!

Anyway, I got to looking into it a bit and discovered that there were a total of five separate database instances running on this one single virtual machine. I can't recall the specifics now, but I'm pretty sure there were two PostgreSQL instances and one MongoDB instance. The other two have escaped my memory, but I believe #4 was Cassandra. NFI now what #5 was.

On a positive note, afterwards I completely understood why the "minimum recommendations" were what they were. I would have (naively?) thought it'd be better to spread those database instances across a few VMs. But I suppose, from VMware's perspective, when your customers call up screaming because the appliance they just paid six figures for is running like crap before they've even really put it into production, 1) you've got just one VM to deal with, and 2) the easy "fix" is to simply tell the customer to give it more resources ("another 10 vCPUs and an additional 64 GB of RAM and it should be fine!").

</rant>


What you're seeing is a collective action problem. Each team makes a rational decision, but only does so locally. Solving a collective action problem requires widely distributed efforts that are valuable collectively, but which don't make sense individually.

I collected a list of these starting at my previous employer and got up to 91 examples: stuff like consistent configuration names (ssl_verify vs. verify_ssl), consistent log formats, consistent Kubernetes CRD naming schemes, etc.
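
A trivial sketch of the kind of shim these inconsistencies force everyone to write (the alias names here are just illustrative):

    # Tiny shim of the sort every team ends up writing when services
    # disagree on config key names. Alias names are illustrative.
    ALIASES = {"verify_ssl": "ssl_verify", "ssl_verification": "ssl_verify"}

    def normalize(config: dict) -> dict:
        """Map known aliases onto one canonical key name."""
        return {ALIASES.get(key, key): value for key, value in config.items()}

    print(normalize({"verify_ssl": False, "timeout": 30}))
    # -> {'ssl_verify': False, 'timeout': 30}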

It is not an easy thing to solve, especially since folks don't usually think of it in economic terms.

Disclosure: I work for VMware, but not on the products you mentioned.


Not sure about all of them, but different databases have different use cases. It's not necessarily Postgres vs. MySQL, where they are competing databases for the same use case. Snowflake, for example, is a data warehouse that (I believe) stores data in a columnar format. This makes it much better at performing aggregate queries (how many shows were watched on Netflix today?) but a lot slower for specific point queries (where in this episode is gavinray currently at?).
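
A toy illustration of the access-pattern difference (hypothetical data, just to show why layout matters):

    # Toy sketch of row vs. columnar layout. A row store keeps each
    # record together; a column store keeps each field's values
    # together, so an aggregate over one field scans one contiguous
    # array instead of touching every record.

    rows = [  # row-oriented: one record per viewing event
        {"user": "a", "show": "X", "minutes": 42},
        {"user": "b", "show": "Y", "minutes": 17},
        {"user": "c", "show": "X", "minutes": 55},
    ]

    columns = {  # column-oriented: one array per field
        "user": ["a", "b", "c"],
        "show": ["X", "Y", "X"],
        "minutes": [42, 17, 55],
    }

    # Aggregate ("total minutes watched today"): the column store
    # only touches the "minutes" array.
    total = sum(columns["minutes"])

    # Point lookup ("what is user b watching?"): the row store wins,
    # since the whole record lives in one place (ideally behind an index).
    record = next(r for r in rows if r["user"] == "b")

    print(total, record)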


Hive is a Data Warehouse that runs on top of Hadoop.

Snowflake is a Data Warehouse/Lake, but it's also its own custom SQL DB.

Cassandra is a NoSQL DB.

RDS is a managed service running (some standard relational DB).

Apache Druid is a columnar analytics DB centered on realtime use cases. It has its own native query language and uses Apache Calcite for its SQL layer. Can integrate with Kafka/Hadoop.

Presto is (to my understanding) like a meta-DB that can query multiple databases. Similar to an integrated Apache Calcite or Google's ZetaSQL.

There are a LOT of overlapping concerns here, hence the confusion.

Essentially they have 2-3 different products in several categories targeting generally the same use case.
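
To make the "meta-DB" point about Presto concrete, here's a sketch of a single query joining a Hive table with a MySQL table, using the presto-python-client package (the host, catalogs, and table names are made up):

    # Federated Presto query spanning two catalogs in one statement.
    # Requires `pip install presto-python-client`; host, catalog, and
    # table names below are hypothetical.
    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto.example.com", port=8080,
        user="analyst", catalog="hive", schema="default",
    )
    cur = conn.cursor()
    cur.execute("""
        SELECT m.plan, count(*) AS views
        FROM hive.events.playbacks p
        JOIN mysql.billing.members m ON p.member_id = m.id
        GROUP BY m.plan
    """)
    print(cur.fetchall())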


AFAICT, Hive/Snowflake are the only overlapping concerns here, and I can see why they would have both (Snowflake may be the better product, but at the end of the day Netflix owns their Hive install).

Everything else solves different problems at Netflix scale. I'm spitballing here, but Cassandra could be used for metadata serving (high throughput, embarrassingly parallel reads with high uptime), RDS for their billing system (transactions, ACID, etc.), Druid for realtime OLAP, and Presto as an interface to Hive/Snowflake.

A smaller company wouldn't need this level of complexity (if you aren't large, you could probably serve your metadata from MySQL, and just use Snowflake as your OLAP engine).


Is there a good resource for this comparison, plus the many streaming/ETL permutations? It's a bit confusing even just working out what the various Apache products do (Beam, Kylin, etc.).


Huh, even MySQL and Postgres are not apples to apples. We are just now launching a service on MySQL instead of our standard Postgres because MySQL has features that Postgres doesn't (better connection pooling, potential data partitioning).


I am sure MySQL is a great fit for your use case.

For others reading your comment though, I did want to list some things I have used with Postgres that relate to connection pooling and data partitioning:

* PgBouncer for connection pooling/sharing.

* Postgres Table Inheritance for table partitioning (https://www.postgresql.org/docs/12/ddl-inherit.html)

* pg_partman for automating the creation of partitions (https://github.com/pgpartman/pg_partman)

* Citus for low-barrier data sharding (it's a Postgres Extension like PostGIS) (https://www.citusdata.com/)
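
For what it's worth, Postgres 10+ also ships declarative partitioning, which covers many of the cases table inheritance used to. A minimal sketch via psycopg2 (the connection string and table names are made up):

    # Declarative range partitioning in Postgres 10+, run via psycopg2.
    # Connection string and table names are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE events (
                id         bigserial,
                created_at timestamptz NOT NULL,
                payload    jsonb
            ) PARTITION BY RANGE (created_at);

            CREATE TABLE events_2020_06 PARTITION OF events
                FOR VALUES FROM ('2020-06-01') TO ('2020-07-01');
        """)
    conn.close()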


I would probably say MySQL and Postgres are apples to apples, just different types of apples (Granny Smith vs. Red Delicious). Their differences are not as big as a regular relational DB vs. a columnar DB.


Supporting more choices is hugely important, even for very small companies. One of the biggest alarm bells that it’s time to leave a company is when they start strictly enforcing just one database solution, just one message queue solution, just one language, just one big data processing engine, etc.

You need extreme diversity of these tools to benefit from the inventiveness of your staff.

Company mandates to strictly limit to one way will just beat creativity out of people and burn them out.

This is especially stupid if the reason for enforcing just one way or just one tool is to make SREs' lives easier. Aside from that being the tail wagging the dog and making no business sense, you can use software abstractions to keep SREs' lives easy without limiting the freedom to use any system, even in a small company.


Choice isn't free. It imposes costs on other teams and on the company as a whole. Other teams need to deal with more idioms and performance patterns, need more libraries, need to operate a more heterogeneous network, etc. The company itself faces higher costs for integrating data, has more difficulty rotating engineers between teams and hiring, can't get cost efficiencies by investing in the optimisation of fewer platforms, and so on.

Engineering isn't art. It's engineering: a field where creativity and innovation work within economic, technical and moral constraints.


while mandates to only use one tool can be overly restrictive, allowing a free for all is equally if not more dangerous. there are a lot of benefits to using a standard set of tools, and developers are for the most part really bad at calculating the ROI of introducing a new tool where an existing option may already be good enough. https://mcfunley.com/choose-boring-technology


Probably some are older systems with a long tail of jobs that aren't worth moving. That would be my guess for why they have both Hive and Snowflake, for example.


I noticed the same for streaming. They have both Spark and Flink rather than just using Spark Streaming.


> Also, why is it on Medium behind a paywall? They are worth, quite literally, nearly $200 billion =/

Better distribution. Medium will recommend the article to their readers.


you need separate datastores for operational systems and analytics systems. operational datastores are optimized for point reads and low latency and analytical datastores are optimized for scans and high throughput. you also really don't want your analytics workloads interfering with user-facing services, isolation is very important, so typically you'd have data pipelines that replicate data from your operational systems to your analytics systems. cassandra and rds are probably both used for operational use cases and not analytics. the rest are for analytics.

of the analytics tools some of these are storage systems, some are compute engines, and some are both.

snowflake is both a storage system and a compute engine. it is very nice but it is also very expensive. there are a lot of computations or analyses that could be ROI positive but that aren't ROI positive on snowflake.

hive has two main components, hiveserver2 and the hive metastore. hiveserver2, which is the part that executes hive sql queries, is a compute engine. the hive metastore is a storage system (in combination with hdfs or s3 typically). hiveserver2 has mostly been superseded by spark sql and presto at this point although it's still used in a lot of places for legacy reasons. the hive metastore is a critical component of a data lake as spark, presto, and most other standalone compute engines rely on it to determine where the files are for a particular dataset, how to read them, and what the schema is.
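
(this is why a pyspark job can refer to a table purely by name; the metastore supplies the file locations, format, and schema. a sketch, with a made-up database/table name:)

    # Sketch: Spark resolving a table through the Hive metastore.
    # With Hive support enabled, spark.sql() asks the metastore where
    # the files live (s3/hdfs paths), how they are encoded, and what
    # the schema is; none of that appears in the query itself.
    # The database/table name below is hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("metastore-demo")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("SELECT title, count(*) FROM warehouse.playbacks GROUP BY title").show()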

presto is a compute engine that executes sql against data typically stored in s3 or hdfs in parquet or orc format and registered in the hive metastore. it is excellent for low latency querying, but all query execution state must fit in memory. for queries where the state is too large to fit in memory, spark is a better fit. presto can also read data from other datastores such as mysql, but this is a really good way to break your website (see the point above about isolation) and is best avoided.

druid is both a storage system and a compute engine, and its use cases overlap heavily with presto's. druid does two things that presto does not. the first is that druid allows you to update your data in near realtime, which is extremely challenging to do in hive/s3/hdfs based systems. the second is that druid maintains what is essentially a search index on top of your data, which in some cases can dramatically improve the performance of filtering operations. unfortunately druid stores all its data on locally attached disks and not s3, so data storage costs in druid are around 10-15x higher than in a presto/hive/s3 solution. regarding near realtime updates, apache hudi and databricks delta lake are bridging this gap for presto/hive/s3, so unless druid can separate storage from compute and be used as a standalone compute engine, i don't see it surviving in the long term.
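
(for anyone curious what querying druid looks like: alongside its native json query language it exposes a sql endpoint on the broker. a sketch; the broker url and datasource name are made up:)

    # Sketch of Druid's SQL-over-HTTP endpoint (POST /druid/v2/sql on
    # the broker). Broker URL and datasource name are hypothetical.
    import requests

    resp = requests.post(
        "http://druid-broker.example.com:8082/druid/v2/sql",
        json={"query": """
            SELECT device_type, COUNT(*) AS plays
            FROM playback_events
            WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
            GROUP BY device_type
        """},
    )
    print(resp.json())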


> Also, why is it on Medium behind a paywall? They are worth, quite literally, nearly $200 billion =/

Maybe their scripts determined them hosting on Medium was costing them too much money. /s



