Hacker News
Making Netflix's Data Infrastructure Cost-Effective (netflixtechblog.com)
132 points by el_duderino on July 10, 2020 | 46 comments



When I read something like this I realise just how immense a lead tech companies are building over 'normal' companies. There are layers of expertise in how they streamline and organise themselves that anyone without a 1,000-strong engineering team and a huge engineering budget is just never going to think about.


Having worked at some of those companies, along with small startups, the biggest difference isn't really plain old sophistication, but expenses high enough to make the engineering effort worthwhile.

If your production system is 10 instances on some random cloud, a 10% efficiency gain saves one instance, so maybe $2k a year. Taking into account the opportunity cost versus doing things that raise revenue, said startup would consider the effort a waste of time unless it took a few hours.

With the same architecture but 20K instances, that same 10% is suddenly saving 2,000 instances, and $4 million a year. Unless there's a major engineering shortage, chances are that spending over a month on $4 million in yearly savings will be completely justified, and would even be a highlight in some SRE's review.
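The arithmetic behind the two scenarios above can be sketched in a few lines. The ~$2,000-per-instance-per-year figure is an assumption back-derived from the "$2k a year for one instance" example; actual cloud prices vary widely.

```python
# Back-of-the-envelope math for the comment above: the same 10%
# efficiency win evaluated at two fleet sizes.
COST_PER_INSTANCE_PER_YEAR = 2000  # assumed, to match the "$2k a year" example

def yearly_savings(instances: int, efficiency_gain: float) -> float:
    """Dollars saved per year by shrinking the fleet by `efficiency_gain`."""
    return instances * efficiency_gain * COST_PER_INSTANCE_PER_YEAR

print(yearly_savings(10, 0.10))      # startup: $2,000/yr, not worth weeks of work
print(yearly_savings(20_000, 0.10))  # big fleet: $4,000,000/yr
```

Same code path, same optimization effort; only the multiplier changes.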

Chances are that the optimization wasn't even any harder in the big tech company: It's just that small savings on big piles of money are suddenly worth it. It's not possible to find millions of dollars in loose change between the proverbial couch cushions if you didn't have the chance to spend hundreds of millions in the first place. Heck, in a growth company, even at that size, 4 million might not be enough savings, as there might be even bigger things.

Same with dev tooling: saving a company 5% on CI times is not great with 10 developers, but if you are making five thousand developers more productive, suddenly you can hire an entire team of very specialized developers, and it's not a luxury.

This is also why copying what large companies do is often foolish when you are small: the tradeoffs are going to be completely different. What would be an unacceptable flaw in a large company is just fine in your startup.

The real trick IMO is the 200-800 range: large enough that the simple solutions for small companies have probably broken down and are causing pain, but nowhere near the staffing to hire yourself a team to, say, add types to Ruby, or build around the weaknesses of your database, or whichever other problem you have that any member of FAANG would throw $10 million in staffing costs at without batting an eye. I've seen way too many companies that stopped being able to grow at those intermediate sizes and got stuck in technical hell.


The problem is that managers at small companies with some, but insufficient, technical knowledge also read these articles and think they need to do the same. I come across overengineered stuff at small companies all the time, for things they don't need, only because it's what Google/Facebook is doing.


It's mostly engineers who do this, not managers (even less so technical managers). Managers typically either want things done yesterday and would see this as wasteful over-engineering, or would delegate and trust a senior or staff engineer who said this was truly necessary.


This is where things are heading, especially for business or corporate 'consumers':

Need web hosting? Certainly you can't deal with a 2Tbps DDoS attack, so let's route that traffic through Cloudflare.

Sending email? You can't send it out with those /23s you have; they're on some blacklist from a shady DNSBL provider. You need to pay for GSuite or Office365.

Can't protect your CEO against a targeted attack? GSuite Advanced Protection or Microsoft ATP has you covered.


Some of those have been true for quite some time. I ran my own mail server back in the 2000s to early 2010s, and incoming mail was especially hellish (effective spam filtering was basically impossible).

These are really just variations on the theme of economies of scale, aren't they?


Commercial DNS blacklists can be very effective at filtering spam, and not expensive for personal use.


I'll argue that you don't need it for a small company - just don't overcomplicate your life by using technology that's designed for big companies.

Stick to simple systems like Elastic Beanstalk or App Engine + RDS and something like Snowflake, and you get powerful, manageable, auditable projects. I'm also finding great benefit in launching each project in its own AWS account. That way you know exactly how much each project costs!


I was thinking more in terms of large but traditional companies. The kind tech companies put out of business regularly!


If it makes you feel better, normal companies have "cost accountants", whose entire job is to worry about this kind of problem. Cost accounting is a fairly well-developed field with a long history. There are several major approaches and some less well known ones.

I think the most applicable in our field is "resource cost accounting", which looks superficially like what Netflix described. You keep "resource accounts" in the units of the actual stuff consumed by your business processes, then map backwards from the final product to costs.
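The idea of resource accounts described above can be sketched in miniature: track consumption in physical units, then map backwards to dollars via per-unit costs. All the resource names and prices here are made up for illustration.

```python
# Toy resource cost accounting: consumption tracked in physical units,
# priced by a per-unit cost table, then rolled up per product.
unit_costs = {"cpu_hours": 0.05, "gb_stored": 0.02}      # $ per unit (assumed)
product_usage = {"cpu_hours": 1200, "gb_stored": 500}    # units one product consumed

# Map backwards from the product to dollars: sum(units * unit cost).
product_cost = sum(product_usage[r] * unit_costs[r] for r in product_usage)
# 1200 cpu-hours at $0.05 plus 500 GB at $0.02 comes to about $70
```

The interesting work in practice is attributing shared resources (a cluster, a pipeline) to individual products, which is roughly what the Netflix post describes at much larger scale.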

Of course I am not an accountant. Just took a course and found it enlightening.


I think the difference is that Netflix sees these problems and has the person-power to solve them and make themselves leaner. I guess it comes down to how far ahead you're looking really and tech companies are incentivised to look further ahead than your average big company. They're all planning for world domination (whether that actually happens or not)


I don't disagree, I'm just trying to convey that in some sense the forward-looking effort would be accelerated by looking sideways as well. Cost accounting is practiced widely in many industries and appears in medium-sized businesses (tens of employees) as well as large businesses (tens of thousands of employees).


Why does Netflix have half a dozen different database/database-esque systems?

Snowflake, Hive, Cassandra, RDS, Druid, Presto

Is there a good reason for this that people without experience at FAANG orgs can't grasp? To the naive, it seems like mixing Postgres, MySQL, Mongo, and Oracle.

Also, why is it on Medium behind a paywall? They are worth, quite literally, nearly $200 billion =/


FAANG companies have thousands of engineers working on their main products. At that scale, you cannot manage everything with that many people in one binary or one database. Thus, business logic and data get siloed into separate services which individual teams own, hence the rise of microservices and of different database systems for different use cases.

They also have billions to tens of billions of dollars in hardware costs per year, so what can seem like a waste of time to optimize can end up saving millions of dollars' worth of hardware a year.


> Is there a good reason for this that people without experience at FAANG orgs can't grasp?

I think yes. It's about the scale of teams working on different parts of the products.

It's not like one team runs a DB that all other teams put their data into. Multiple teams make their own choice of DB for their own use cases, with their own preferences and reasons. So as a whole, Netflix or other FAANGs can grow to use many DBs, many languages, many frameworks, etc.


<rant type="offtopic">

Others have already mentioned several reasons that would justify their choices. I completely understand why a company of Netflix's size has half a dozen different databases.

Recently, I ran into a somewhat similar situation that left me asking "why the hell are there five different databases running on this ONE virtual machine?".

It was one of VMware's virtual appliances running one of their products. I want to say it was either vRealize Network Insight or vRealize Operations Manager but I'm not 100% sure. It was likely one of those or something along those lines.

To stand up the smallest possible deployment of this virtual appliance, the "minimum recommendations" were something like 8 or 10 vCPUs, 32 GB of RAM, and 800 GB of storage -- and, if memory serves, this was for a single node!

Anyways, I got looking into it a bit and discovered that there were a total of five (5) separate database instances running on this one single virtual machine. I can't recall the specifics now but I'm pretty sure there were two PostgreSQL instances and one MongoDB instance. The other two have escaped my memory but I believe #4 was Cassandra. NFI now what #5 was.

On a positive note, afterwards I completely understood why the "minimum recommendations" were what they were. I would have (naively?) thought it'd be better to spread those database instances across a few VMs, but I suppose, from VMware's perspective, when your customers call up screaming because the appliance they just paid six figures for is running like crap before they've even really put it into production: 1) you've got just one VM to deal with, and 2) the easy "fix" is to simply tell the customer to give it more resources ("another 10 vCPUs and an additional 64 GB of RAM and it should be fine!").

</rant>


What you're seeing is a collective action problem. Each team makes a rational decision, but only does so locally. Solving a collective action problem requires widely distributed efforts that are valuable collectively, but which don't make sense individually.

I collected a list of these starting at my previous employer and got up to 91 examples. Stuff like "consistent configuration names" (ssl_verify vs verify_ssl), "consistent log format", "consistent Kubernetes CRD naming scheme" etc are all examples.
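The "consistent configuration names" example above is the kind of thing a small compatibility shim can paper over. This is a hypothetical sketch: the alias table and key names are illustrative, not from any real product.

```python
# Map legacy config keys (verify_ssl) to canonical ones (ssl_verify),
# without clobbering a value already given under the canonical name.
ALIASES = {
    "verify_ssl": "ssl_verify",
    "log_fmt": "log_format",
}

def normalize(config: dict) -> dict:
    """Return config with legacy aliases folded into canonical key names."""
    # Start with the keys that are already canonical.
    out = {k: v for k, v in config.items() if k not in ALIASES}
    # Fold aliases in; an explicit canonical value always wins.
    for key, value in config.items():
        if key in ALIASES:
            out.setdefault(ALIASES[key], value)
    return out
```

The shim is cheap per team; the collective-action problem is getting every team to agree which spelling is canonical in the first place.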

It is not an easy thing to solve. Especially since folks don't usually think of it in economics terms.

Disclosure: I work for VMware, but not on the products you mentioned.


Not sure about all of them but different databases have different use cases. It's not necessarily Postgres vs. MySQL where they are competing databases for the same use case. Snowflake, for example, is a data warehouse that (I believe) stores data in a columnar format. This makes it much better at performing aggregate queries (how many shows were watched on netflix today?) but a lot slower for specific queries (where in this episode is gavinray currently at?).
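The row-versus-columnar distinction above can be shown with a toy in-memory example: an aggregate over a column store reads one contiguous array, while a row store must visit every record (and drag its other fields along). The data is invented for illustration.

```python
# Row-oriented layout: one dict per record.
rows = [
    {"show": "A", "minutes": 42},
    {"show": "B", "minutes": 17},
    {"show": "A", "minutes": 8},
]
# The aggregate has to walk every record.
total_row = sum(r["minutes"] for r in rows)

# Column-oriented layout: the same data pivoted into per-field arrays.
columns = {"show": ["A", "B", "A"], "minutes": [42, 17, 8]}
# The aggregate touches only the one column it needs.
total_col = sum(columns["minutes"])

assert total_row == total_col == 67
```

Real columnar engines add compression and vectorized execution on top, but the locality argument is the same.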


Hive is a Data Warehouse that runs on top of Hadoop.

Snowflake is a Data Warehouse/Lake, but it's also its own custom SQL DB.

Cassandra is a NoSQL DB

RDS is running (some standard relational DB)

Apache Druid is a columnar analytics DB centered around realtime use cases. It has its own query language and delegates to Apache Calcite for specific DB/datasource drivers. Can integrate with Kafka/Hadoop.

Presto is (to my understanding) like a meta-DB that can query multiple databases. Similar to an integrated Apache Calcite or Google's ZetaSQL.

There are a LOT of overlapping concerns here, hence the confusion.

Essentially they have 2-3 different products in several categories targeting generally the same use case.


AFAICT, Hive/Snowflake are the only overlapping concerns here, and I can see why they would have both (Snowflake may be the better product, but at the end of the day Netflix owns their Hive install).

Everything else solves different problems at Netflix scale. I'm spitballing here, but Cassandra could be used for metadata serving (high throughput, embarrassingly parallel reads with high uptime), RDS for their billing system (transactions, ACID, etc.), Druid for realtime OLAP, and Presto as an interface to Hive/Snowflake.

A smaller company wouldn't need this level of complexity (if you aren't large, you could probably serve your metadata from MySQL, and just use Snowflake as your OLAP engine).


Is there a good resource for this comparison, plus the many streaming/ETL permutations? It's a bit confusing even just figuring out what the various Apache products do (Beam, Kylin, etc.).


Huh, even MySQL and Postgres are not apples to apples - we are just now launching a service on MySQL instead of our standard Postgres because MySQL has features that Postgres doesn't (better connection pooling, potential data partitioning).


I am sure MySQL is a great fit for your use case.

For others reading your comment though, I did want to list some things I have used with Postgres that relate to connection pooling and data partitioning:

* PGBouncer for connection pooling/sharing.

* Postgres Table Inheritance for table partitioning (https://www.postgresql.org/docs/12/ddl-inherit.html)

* PGPartman for automating the creation of partitions (https://github.com/pgpartman/pg_partman)

* Citus for low-barrier data sharding (it's a Postgres Extension like PostGIS) (https://www.citusdata.com/)
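The first item on that list, connection pooling, is conceptually simple: many clients time-share a small set of expensive server connections. This sketch is a toy stand-in for what PgBouncer does, not its actual implementation or API; the `Connection` class here is a placeholder, not a real Postgres client.

```python
import queue

class Connection:
    """Stand-in for an expensive server-side database connection."""
    def __init__(self, conn_id: int):
        self.conn_id = conn_id

class Pool:
    """Fixed-size pool: clients block until a connection is free."""
    def __init__(self, size: int):
        self._free: queue.Queue[Connection] = queue.Queue()
        for i in range(size):
            self._free.put(Connection(i))

    def acquire(self) -> Connection:
        return self._free.get()   # blocks when the pool is exhausted

    def release(self, conn: Connection) -> None:
        self._free.put(conn)

pool = Pool(size=2)
c = pool.acquire()
pool.release(c)   # returned connections are reused, not re-established
```

The payoff is that thousands of short-lived client sessions map onto a handful of long-lived Postgres backends, which is what makes per-connection overhead tolerable.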


I would probably say MySQL and Postgres are apples to apples, just different types of apples (Granny Smith vs. Red Delicious). Their differences are not as big as a regular relational DB vs. a columnar DB.


Supporting more choices is hugely important even for very small companies. One of the biggest alarm bells that it’s time to leave a company is when they start deeply enforcing just one database solution, just one message queue solution, just one language, just one big data processing engine, etc.

You need extreme diversity of these tools to benefit from inventiveness of your staff.

Company mandates to strictly limit to one way will just beat creativity out of people and burn them out.

This is especially stupid if the reason for enforcing just one way or just one tool is to make SRE lives easier. Aside from that being the tail wagging the dog and making no business sense, you have the power to use software abstractions to still make SRE lives easier even without limiting freedom to use any system, even in a small company.


Choice isn't free. It imposes costs on other teams and the company as a whole. Other teams need to deal with more idioms and performance patterns, need more libraries, need to operate a more heterogeneous network etc. The company itself faces higher costs for integrating data, higher difficulty in rotating engineers between teams, has more difficulty hiring, can't get cost efficiencies by investing in optimisation of fewer platforms and so on.

Engineering isn't art. It's engineering, a field where creativity and innovation work within economic, technical and moral constraints.


while mandates to only use one tool can be overly restrictive, allowing a free for all is equally if not more dangerous. there are a lot of benefits to using a standard set of tools, and developers are for the most part really bad at calculating the ROI of introducing a new tool where an existing option may already be good enough. https://mcfunley.com/choose-boring-technology


Probably some are older systems that have some trailing number of jobs that are not worth moving. That would be my guess for why they have both Hive and Snowflake, for example.


I noticed the same for streaming. They have Spark and Flink rather than use Spark Streaming.


> Also, why is it on Medium behind a paywall? They are worth, quite literally, nearly $200 billion =/

Better distribution. Medium will recommend the article to their readers.


you need separate datastores for operational systems and analytics systems. operational datastores are optimized for point reads and low latency and analytical datastores are optimized for scans and high throughput. you also really don't want your analytics workloads interfering with user-facing services, isolation is very important, so typically you'd have data pipelines that replicate data from your operational systems to your analytics systems. cassandra and rds are probably both used for operational use cases and not analytics. the rest are for analytics.

of the analytics tools some of these are storage systems, some are compute engines, and some are both.

snowflake is both a storage system and a compute engine. it is very nice but it is also very expensive. there are a lot of computations or analyses that could be ROI positive but that aren't ROI positive on snowflake.

hive has two main components, hiveserver2 and the hive metastore. hiveserver2, which is the part that executes hive sql queries, is a compute engine. the hive metastore is a storage system (in combination with hdfs or s3 typically). hiveserver2 has mostly been superseded by spark sql and presto at this point although it's still used in a lot of places for legacy reasons. the hive metastore is a critical component of a data lake as spark, presto, and most other standalone compute engines rely on it to determine where the files are for a particular dataset, how to read them, and what the schema is.

presto is a compute engine that executes sql against data typically stored in s3 or hdfs in parquet or orc format and registered in the hive metastore. it is excellent for low latency querying but all query execution state must be able to be held in memory. for queries where the state is too large to fit in memory, spark is a better fit. presto can also read data from other datastores such as mysql but this is a really good way to break your website (see point above about isolation) and is best avoided.

druid is both a storage system and a compute engine and use cases overlap heavily with presto. druid does two things that presto does not. the first is that druid allows you to update your data in near realtime. this is extremely challenging to do in hive/s3/hdfs based systems. the second is that druid maintains what is essentially a search index on top of your data which in some cases can dramatically improve performance of filtering operations. unfortunately druid stores all its data on locally attached disks and not s3, so data storage costs in druid are around 10-15x higher than a presto/hive/s3 solution. regarding near realtime updates, apache hudi and databricks delta lake are bridging this gap for presto/hive/s3, so unless druid can separate storage from compute and be used as a standalone compute engine i don't see it surviving in the long term.


> Also, why is it on Medium behind a paywall? They are worth, quite literally, nearly $200 billion =/

Maybe their scripts determined them hosting on Medium was costing them too much money. /s


Hard sign-up walled :<

https://i.imgur.com/mSk4uiI.png


nope; opened just fine in an anon chrome window with ublock origin


I wonder if Netflix is spending a lot of money and effort on things that consumers don't really care about.

I don't really care for personalized recommendations. As long as the movies and TV Shows have some type of keyword or grouping, I am happy. Given the relatively fixed size of the catalog, and that most movies and TV shows are already classified (for example as Action or Romance or Comedy etc) this doesn't seem like something where a lot of data would do a lot to help.

For me, most streaming movie companies are pretty equivalent in their experience. The only data I really need them to keep track of about me is what I have watched so I know if I have already watched something before, and where I am in the movie/tv series, so I can pick back up when I log in again. Everyone pretty much does it. As for streaming quality, basically everyone knows how to transcode movies and put them on a CDN. The size and quality of the catalog is the biggest differentiator for me.


"I personally don't use feature X"


Why on earth would Netflix want their blog behind a paywall?

Absurd.


You can open it in incognito mode to avoid the paywall.


Problem solved, absurdity not solved.


I didn't get any paywall


I see "You have 2 free stories left this month." at the top.

It's a vanity URL for Medium. If they've got their own domain why would they want to drive any traffic to medium? That doesn't make any sense to me.


Probably red tape and sysadmins. It's easier to just use a different service than to convince the sysadmins to host your blog software and maintain it.


Makes me wonder what made them choose Medium for their engineering blog versus your friendly neighborhood static website host.


Am I getting it right that Netflix's own hardware is negligible or non-existent and all their infrastructure is deployed to AWS?


Netflix runs its own CDN system. That's a significant amount of hardware to own & manage, but their operating costs are probably less than what a conventional data center would have, because it's in the ISP's interest to host a local cache (to minimize bandwidth use).

https://datacenterfrontier.com/mapping-netflix-content-deliv... [2016] https://openconnect.netflix.com/


Wow, so a very large portion of their spend is on Cassandra. I bet it's heavily modified for their needs, but I wonder if ScyllaDB would be able to replace some of those clusters for a considerable saving in maintenance time and money.



