Adopting Microservices at Netflix: Lessons for Architectural Design (nginx.com)
207 points by davidkellis on Feb 25, 2015 | hide | past | favorite | 74 comments



My comment, slightly edited, from the previous posting of this, at https://news.ycombinator.com/item?id=9106813

> One kind of coupling that people tend to overlook as they transition to a microservices architecture is database coupling, where all services talk to the same database and updating a service means changing the schema. You need to split the database up and denormalize it.

That sounds like a decision you wouldn't want to take lightly; the kind of thing you might do once your company is already big. I wouldn't want to start out that way though, it sounds like a recipe for a mess.


Yes, I just came in here to write a comment along those lines. I mean, surely you're opening a whole can of worms in terms of consistency, etc. I get the feeling Netflix happens not to have use cases where strong consistency is a requirement. I'd be interested in more detail about how they went about the transition, even just pointers to more on the metadata-management tools they use.


The consistency problem is an open question in my mind. I definitely don't like the idea of having some data synchronization tool to fix the inconsistent data across services problem. I wonder what the best practice is for maintaining data consistency across services.

Does anyone know?


Ideally you don't have to sync the data because one service owns that data. Other services request that data via api. In a RESTful world those api requests are cacheable.


But what about the situation where you have an entity service that owns the data for one piece of the domain, for example a People service, and then other services, like the Address service and the Billing service, reference a particular person. In that scenario, I can imagine the Address service and the Billing service would have a foreign key referencing a person in the People service. Then, what happens if the Person gets deleted? In that case, we've got a consistency problem, even though each service owned its data.

Is the best practice to not use entity services?


The People service can also store Addresses. Call it the Identity service. Include People, Businesses, Relationships, and Addresses.

The Billing service can then reference People or Businesses (and if Businesses, then sub-People), bill to an Address, etc.

No one's saying every object should be a service; you need to find the correct lines to divide across.

In our system (which has been service-oriented for five years), we don't do deletes. We do 'inactive' (UPDATE table SET ACTIVE=0…), but never deletes.

Especially in the case of your billing example, you never want to delete a person or address, because that's historical data you need to retain. But we just keep everything: if it goes in the database, it's because we want to keep it forever.


Do you have EU customers? And if so, how do you deal with data protection?


You could have a service bus, where you publish a "PersonDeleted" message that the other services would subscribe to. It decouples the Person service from all the other related entity services.

You'd have to allow for propagation delay. Plus the possibility of a message storm if you delete something fairly fundamental.
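The publish/subscribe shape being described can be sketched in a few lines. This is a minimal in-process illustration (a real deployment would sit on a broker such as RabbitMQ or Kafka; all names here are invented):

```python
from collections import defaultdict

class ServiceBus:
    """Topic-based publish/subscribe; each subscriber handler is called per event."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # The publisher doesn't know or care who is listening, so the
        # Person service stays decoupled from Address, Billing, etc.
        for handler in self._subscribers[topic]:
            handler(payload)

bus = ServiceBus()
deleted = []
# The Address service (say) cleans up its own records on PersonDeleted:
bus.subscribe("PersonDeleted", lambda event: deleted.append(event["person_id"]))
bus.publish("PersonDeleted", {"person_id": 42})
```

The propagation-delay caveat is exactly the gap between `publish` returning and every subscriber having finished running.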


> You could have a service bus, where you publish a "PersonDeleted" message that the other services would subscribe to. It decouples the Person service from all the other related entity services.

You're still screwed if you complete a transaction on the deleted person's still-existing account, now that your system is no longer transactional...


Your objection is hypothetical/abstract; when you ground it in specific use cases, plenty of patterns emerge for dealing with the inconsistent state: just-in-time/read reconciliation, batch remediation, actually making some subset of actions transactional/consistent and accepting lower availability there, and so on.


I'm not saying there are no solutions, but saying "just add an event bus" is unlikely to be sufficient. Whatever you do, you're going to pay additional costs in terms of complexity.


Yeah, if you're an ACID person, this approach is going to present conceptual challenges. The propagation delay is a mostly-solved problem, which I know because lots of high-scale sites work. Getting a summary of their design decisions around this would be a huge time-saver, but I don't know of one.


You can set up distributed transactions, e.g. using ZooKeeper, Consul, Redis, etc.

It adds complexity, of course, but that's the trade-off when you use microservices.


So the problem you've identified is real. I used to have some bootleg footage of some private amazon tech talks where the speaker emphasized that in distributed systems it was generally a terrible idea to have transactions span entities.

I think you basically have to learn to live in an eventually consistent world. In the case of people being deleted I would imagine that the user service exposes a pub/sub interface where address and billing services subscribe to "delete" events.


Hardly need "private bootleg" footage to discover this reality. Pat Helland (at the time, working at Amazon) wrote a paper about it maybe 10 years ago.

http://adrianmarriott.net/logosroot/papers/LifeBeyondTxns.pd...


You don't HAVE to live in an eventually consistent world. If you use something like ZeroMQ or use REST then you can "notify" other services of a "person deleted" event in a synchronous manner.


That assumes the network is always good and services are up. Welcome back to eventual consistency (or none at all).


If the network is bad then your monolithic app wouldn't work either.

The problem of services being up/down has been solved with service discovery e.g. Consul, Etcd, Zookeeper.


That has nothing to do with the fact that if your systems are distributed, you will have eventual consistency.

If System A needs to tell System B about an event in order for A and B to remain consistent, but B is down, you've got eventual consistency, because B can't become consistent with A until it's back up and has performed whatever recovery is necessary to process that event. Service discovery does nothing to solve that problem.
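A tiny sketch of that situation, with invented names: A queues the event in an outbox and retries delivery until B acknowledges it. Until B comes back up, the two systems simply disagree:

```python
import collections

class SystemA:
    """Owns Person data; queues events for asynchronous delivery to B."""

    def __init__(self):
        self.outbox = collections.deque()

    def delete_person(self, person_id):
        # Record the event locally; delivery happens later.
        self.outbox.append({"type": "PersonDeleted", "id": person_id})

    def flush(self, system_b):
        # Deliver pending events in order; stop at the first failure
        # and retry on the next flush.
        while self.outbox:
            if not system_b.receive(self.outbox[0]):
                return  # B is unreachable; A and B stay inconsistent for now
            self.outbox.popleft()

class SystemB:
    def __init__(self):
        self.up = False
        self.deleted = []

    def receive(self, event):
        if not self.up:
            return False  # simulates B being down
        self.deleted.append(event["id"])
        return True

a, b = SystemA(), SystemB()
a.delete_person(7)
a.flush(b)   # B is down: the event stays queued, the systems disagree
b.up = True
a.flush(b)   # B recovers: the event finally lands and B catches up
```

The window between the two `flush` calls is the eventual consistency being described; service discovery shortens it at best.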


What @saryant said.

In addition, the network isn't just up or down. It's varying shades (dare I say, 50 shades?) of down or broken. A single machine might not be accessible due to a switch issue. An entire rack or aisle might be compromised by a bad router or faulty routing table. A network cable might be flaky. The truth is you just don't know, and that's all inside a single LAN.

Your service discovery system might be able to see services A, B, and C while service A can't talk to B or C due to network issues. It happens.

http://www.rgoarchitects.com/Files/fallacies.pdf


The "each service owns its data" scenario shouldn't take precedence over cases like this, where consistency is aligned with obvious business rules. If Person gets deleted, then "on delete cascade" should take care of that Person's Address and Billing records.

For updates and maybe reads, it's a different story.


In my experience, you only expose APIs that are either standalone and transactionally independent (change address, delete address, etc. in your example) or composite services (say, a People service) that manage the distributed transaction. How transactions are managed under the hood varies by implementation.

One may argue that in this case the People service doesn't fit the description of a microservice as given in the article. But we need to understand that services get called in some context, and there has to be someone there to do the plumbing. That someone can be a DB query, some code in the service, or the application calling the services. And generally you would prefer service code over the other two, hence a composite service.

IMO it may also be okay to have People, Address, and Billing under one schema if service granularity and context allow it.


cgh, I've hit the reply limit, but I wanted to ask how you'd implement the cascade deletes thing over services? Would the People service have to emit events describing that a person was deleted that the Address and Billing services would be expected to subscribe to in order to handle that the person was deleted?


Sorry, I should have been clearer. I'm assuming a shared database, so cascading deletes would be defined in the table's schema. Let's pretend we're using PostgreSQL:

\d Person

[A bunch of table schema stuff]

Referenced by:

TABLE "Billing" CONSTRAINT "billing_person_id_fkey" FOREIGN KEY (person_id) REFERENCES Person(id) ON DELETE CASCADE

(I typed that off the top of my head, so it might not be quite correct.)
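For a runnable illustration of the same cascade, here's the equivalent behaviour using Python's built-in sqlite3 (table and column names invented; note SQLite only enforces foreign keys when asked):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only if asked
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE billing (
        id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(id) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO person (id) VALUES (1)")
conn.execute("INSERT INTO billing (id, person_id) VALUES (10, 1)")

conn.execute("DELETE FROM person WHERE id = 1")
# The dependent billing row was deleted along with the person:
rows = conn.execute("SELECT count(*) FROM billing").fetchone()[0]
```

The point being that inside one database, the engine keeps the invariant for you; split the tables across services and you have to rebuild that guarantee yourself.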


The original position being argued though was "... where all services talk to the same database... You need to split the database up and denormalize it."

So the basic premise is that there is no shared database, and thus having the database enforce cascading deletes is not an option.


Right, thanks for the clarification. Sorry for the slight derailment.


Yeah, events: message queues for most things, API calls if it really needs to be done synchronously.


You wouldn't so much delete them as deactivate them (mark them inactive but keep them around for retrieval). The consuming service would react differently to an inactive person than to an active one.


Look into Event Sourcing and CQRS.

It explains how to manage situations like this.
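For anyone unfamiliar, a toy sketch of the event-sourcing half of that suggestion: state is never mutated directly but derived by replaying an event log, so a "delete" is just another recorded event and history is never lost (all names invented):

```python
def apply(state, event):
    # Apply one event to the current state.
    kind, payload = event
    if kind == "PersonCreated":
        state[payload["id"]] = {"active": True}
    elif kind == "PersonDeleted":
        state[payload["id"]]["active"] = False  # soft delete, history kept
    return state

def replay(events):
    # Current state is a pure function of the event log.
    state = {}
    for event in events:
        apply(state, event)
    return state

events = [
    ("PersonCreated", {"id": 1}),
    ("PersonDeleted", {"id": 1}),
]
current = replay(events)  # {1: {"active": False}}
```

CQRS is the complementary half: reads come from views projected out of this log, while writes only ever append events.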


Probably, a Person shouldn't be deleted if it might be referenced elsewhere. If it's no longer a valid customer, it should be updated to reflect that.


If the two (the database schema and the microservices API) are designed and maintained separately, this assertion (that "updating a service means changing the schema") is not necessarily true. They have separate responsibilities: the database persists the data and the services provide the business logic. While they might need to change together, this is not always the case.


Database-as-an-integration-layer is a well known anti-pattern.

It's tempting because it's easy, and at first glance it seems to solve lots of problems (consistency, communication etc).

It's a big mistake.

If multiple services are reading and writing from the same DB, it rapidly becomes impossible to change things.

Things like input validation changes suddenly have to be implemented in multiple places (which is hard), and done simultaneously (which often becomes hard enough to stop work being done).

In a microservice-based approach, the data flows through shared services, so changes can be made in fewer places.

(Note that read-only, reporting-style databases are separate. I think there is a good case for these being shared)


Multiple services reading from the same DB can -- and should -- be using different accounts that have access to different objects that abstract the underlying data model from the various clients and keep them loosely coupled. This is almost as old of a recommendation as relational DBs themselves.

You can create a tightly coupled design with a shared DB, but nothing inherent in shared-DB integration requires that.


This seems like it's just taking the idea of decoupling your services and talking to them via APIs a little further: decouple them at an even smaller granularity?

This has always been the generally accepted way to scale out software services. Is there a novel idea being discussed here, or just that they've been doing this at Netflix?


Yes, a lot of organizations already design their backends in this way without necessarily thinking to use a brand new term to describe it.

Cockcroft defines a microservices architecture as a service-oriented architecture composed of loosely coupled elements that have bounded contexts.

To me this just reads as "service-oriented architecture in a sensible way".

No one thinks to intentionally build a SOA with tightly coupled components with poor boundaries.


When it comes down to it, n-tier, SOA, and microservices are all expressions of the same basic ideas.


Exactly. Even in the late 90s "3 tier" development pushed the idea of having a separate layer and sticking to APIs. This is just saying that instead of making big layers, make smaller ones. Which isn't bad, it's just that most of the time you don't need this. Add a new service, like "GetRecommendationFromFacebook?" OK, go ahead and deploy, because you haven't changed the rest of the services in your layer, like "LoginUser".

And there was a whole hubbub of discovery what with UDDI and DISCO and all that jazz I never really understood.

It'd be nice if they gave some solid examples of the "micro" part to distinguish it from the general SOA idea.


If your app depends on lots of them, it's only going to run as fast as the slowest dependency. A 1% chance of poor performance isn't too bad, but the joint distribution of 20 microservices, each with a 1% chance, gets pretty ugly. In the normal case everything is great, but the failure modes of each service become a much bigger deal.

It's a great architecture, but fan out of dependencies is a real risk.
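To put a number on that fan-out risk: assuming the 20 dependencies fail independently, the chance that one request hits at least one slow service is already substantial:

```python
# Each of 20 independent dependencies is slow 1% of the time; how often
# does a request that fans out to all 20 hit at least one slow call?
p_slow_one = 0.01
p_any_slow = 1 - (1 - p_slow_one) ** 20
# roughly 18% of requests see at least one slow dependency
```

So a per-service 1% tail becomes roughly an 18% tail for the request as a whole.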


This is solved because you can scale each service differently. Too many DB queries from your messaging service? Upgrade the DB, add another read slave. Too much CPU load on your image processing service? Add more image processing nodes.

Breaking your system out into multiple dependencies means you can not only scale your infrastructure, but you can scale individual parts of your infrastructure based on demand, bottlenecks, usage, etc.

Netflix has talked about in the past how, because their systems are broken apart, they don't have to deal with these issues. Rating service having problems? Don't show user ratings. Search service offline for updates? Disable search. If Netflix was one giant (Rails? Django? Node?) app, it would be very difficult to cut out poorly-performing parts temporarily.


> Rating service having problems? Don't show user ratings. Search service offline for updates? Disable search

As an example of a (probably?) bad way to organize services, I worked on a project that had factored a role-based access control system into its own service. Every single web request hit this service, which made it a single point of failure, performance critical, impossible to temporarily disable, etc.


One alternative to centralized role servers is to use client certificates. I've used x509 certs for this purpose. They are pretty hairy, but so is rolling your own authentication/authorization/token system.


Another alternative is JSON Web Tokens. Many of the benefits of Client Certificates while avoiding many of the hardships.


A low percent of poor performance is much easier to achieve if your service is simple, often so simple its answers can be cached. Even if it's outright down, some requests still can be served from the cache.

If you weave together 20 services to produce one mega-service, it's much harder to optimize for performance and even just keep the implementation correct. Caching of complicated multi-factor answers is less frequently possible.

Also, 20 microservices may together feed, say, 5 large "end-user" services in various combinations. If one of the microservices is slow, only the end-user services that actually use it are affected.

Monolithic large services are harder to combine, so the risk that an unrelated remote service somehow gets called in the process and slows things down is higher.


How would you do it?

Bringing those dependencies in under one roof doesn't eliminate their risk. Though it does increase the chance that an error in one of these small dependencies brings down the whole system.

With microservices if one of the services runs into trouble, you can ignore it and still serve the other 90-99% of your site without it. You can also deploy updates to services without having to deploy your entire site.


So you have short timeouts and retries that are load balanced to different nodes. But ideally your services are fast even in their 99 percentiles so this isn't an issue. This is much easier to achieve in a small service than a huge complex one.


Um, maybe. Five machines behind a load balancer. Normal case, load is even: 100 requests to each server. One server starts running into trouble, exceeding timeouts. Your load is now ~125 per server, because each client retries frequently. Is 125 enough to push the others over a "slow" threshold? That will further magnify the load.

The load balancer will spin up more machines, so now you have 10 machines leaning on whatever the back end is.

Yes, your approach is great, but you really have to understand the failure modes. If you're living on the edge, you could have a pretty un-fun cascading failure.


That's what circuit breakers are for: preventing cascading errors.
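For readers who haven't met the pattern, a minimal circuit-breaker sketch (thresholds and names are arbitrary): after a few consecutive failures it stops calling the downstream service and returns a fallback, instead of piling retry load onto a struggling node:

```python
import time

class CircuitBreaker:
    """After max_failures consecutive errors, fail fast for reset_after
    seconds instead of adding retries to a struggling service."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback  # circuit open: shed load, return a default
            self.opened_at = None  # half-open: let one call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
        self.failures = 0  # success resets the count
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)

def flaky():
    raise RuntimeError("downstream service is struggling")

first = breaker.call(flaky, fallback="degraded")   # failure 1
second = breaker.call(flaky, fallback="degraded")  # failure 2: breaker trips
# The circuit is now open: this call never reaches the service at all.
third = breaker.call(lambda: "ok", fallback="degraded")
```

The fallback is where "rating service down? don't show ratings" plugs in.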


I don't find that performance behaves like a percentage chance. When things perform poorly, they're often consistently poor. With this architecture you can rapidly iterate on and scale your poorly performing core services, while the non-core services can run slowly without causing a big deal.

The usual alternative is redeploying and scaling the entire app to iterate on performance, which is a much slower process. If performance is a concern, microservices should be a big win.


All APIs have performance SLAs that are not consistent. You may have 50% of requests finish in 200ms or less, 90% finish in 300ms or less, and 99% finish in 600ms or less. You can do work to narrow the performance variation but performance is a % chance.


My team recently added a microservice to support our fairly monolithic backend service. The big challenge we found was that it takes a lot of effort to stand up a new (micro)service. We needed to create a system of alarms (instead of relying on existing catch-all defaults). We needed its own test environment, and we needed to find ways to send traffic to pre-prod. We needed to figure out how to bootstrap the new service into the company's infrastructure. We needed to think about how the dependent service would authenticate against the new service. All of that on top of the core feature work.

All these things are good. You want isolated, focused test environments. You want tightly defined alarms. However, we underestimated how long creating a new service would take. In the end we pushed features out when they were ready but before the operational work was complete. Unsurprisingly, we saw the issues we knew we wanted to protect against.

Better microservice frameworks that match the company's infrastructure would be helpful. Make building microservices cheap by building tools that speed up the process.


I've had the same experience lately. Coordinating a number of staging environments for an evolving SOA backend has been challenging for both developers and ops. In addition to the services, each one can have many other things to worry about: monitoring, error collection, which version is deployed to which environment, replicating the complexity locally to work on it...


Front-end teams tend not to like microservices; there is too much overhead in getting too little data. As an example, we integrate with one microservice where we get back a boolean and a date. We have the overhead of an HTTP call and all the error handling that goes with it for two pieces of data that would be better aggregated into another service. We story-point an integration with a new service as an 8, but adding a new field (or two) to an existing API data structure is a 1.

I hope microservices are not just a new fashion in software and are actually useful ten years from now.


It is by no means a new fashion. In an interview from 2006, Werner Vogels (CTO & VP of Amazon) talks about it. http://queue.acm.org/detail.cfm?id=1142065


A few questions:

a) How do you prevent technical debt? It seems to be more difficult due to APIs which shouldn't have breaking changes. In theory you could always version up the APIs and serve both versions or just add a new API for a breaking change, but these solutions seems awkward.

b) How do you start developing multiple microservices at the same time? I would expect APIs to change a lot in the beginning, which would mean that updating one microservice would break another. Perhaps that is acceptable before the first "stable release" of a microservice.


This is basically it. Designing a game? Build the Game service, with all your game logic. Need users and authentication now? Start writing an Identity service, and so on.

The only difference is that instead of starting to write some Identity class and use it in your Game service, you write some Identity class and expose it via a REST API, and then provide an interface library that interfaces with that REST API. Call it IdentityInterface or libidentity or something. Pydentity, whatever. It makes an HTTP request, gets a serialized object, unserializes it, and returns it.

For simplicity, put all your public models in that library, and it gets shared by both the Identity service and the Game service. Those models represent an object and what you can do with it. In the Identity service is where all of that actually happens.

This is also how you solve the 'multiple microservices at the same time' problem; your interface library provides the public interface, and the backend REST API is the 'private' API used by the public interface. You make changes to the backend API and the public library and no one notices, or you make incompatible changes to the public service and fix everything before you deploy; ideally, you add new APIs, migrate services over, then deprecate the old ones.

In the end, each service sees the world as fundamentally the same; there's a library with classes and functionality, and you use that to do things. If you design it right, it's never obvious from your code that you're accessing a different service elsewhere in your infrastructure.
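A bare-bones sketch of such an interface library, with a made-up Identity service and endpoint; the caller only ever sees typed objects, and the HTTP transport is injectable so tests never touch the network:

```python
import io
import json
import urllib.request
from dataclasses import dataclass

@dataclass
class Identity:
    user_id: int
    name: str

class IdentityClient:
    """Callers see plain objects; HTTP details stay inside the library."""

    def __init__(self, base_url, opener=urllib.request.urlopen):
        self.base_url = base_url
        self._open = opener  # injectable, so tests can stub the transport

    def get(self, user_id):
        # GET /identities/<id>, deserialize the JSON, return a typed object.
        with self._open(f"{self.base_url}/identities/{user_id}") as resp:
            data = json.loads(resp.read())
        return Identity(user_id=data["user_id"], name=data["name"])

# A fake opener standing in for a real Identity service:
fake = lambda url: io.BytesIO(b'{"user_id": 1, "name": "Ada"}')
client = IdentityClient("http://identity.internal", opener=fake)
user = client.get(1)
```

From the Game service's point of view, `IdentityClient.get` looks no different from any other library call, which is the whole idea.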


The problem is that microservices are not objects. They leak reality into your problem domain in a way that simply cannot be made to go away.

If regular object oriented programming languages had method calls that randomly failed, were delayed, sent multiple copies of a response, changed how they behaved without warning, sent half-formed responses ... then yes it would be the same.

Distributed systems are hard, because you cannot change things in two places simultaneously. All synchronisation is limited by the bits you can push down a channel up to, but not exceeding, the speed of light. In a single computer system this problem can be hidden from the programmer. In a distributed system, it cannot.

Probably the most devastating critique of the position that "it's just OO modeling!" came in A Note on Distributed Computing, published in 1994 by Waldo, Wyant, Wollrath and Kendall[0]:

"We look at a number of distributed systems that have attempted to paper over the distinction between local and remote objects, and show that such systems fail to support basic requirements of robustness and reliability. These failures have been masked in the past by the small size of the distributed systems that have been built. In the enterprise-wide distributed systems foreseen in the near future, however, such a masking will be impossible."

[0] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7...


One of the things I've heard about Erlang is that its processes can be distributed very easily. The paper you cite was written four years before Erlang was open-sourced; I wonder if Erlang/OTP would hold up to their analysis.


If you look, one of the three strategies they examine is "treat all calls as remote", which is the approach taken in Erlang.


Honestly, I've been following these trends a lot, and it seems like the answers to (a) and (b) come down to team size. If you have a bigger team with more development inertia, microservices can seem amazing and the trade-offs are worth it (repeated work vs. development tempo).

For a small team that doesn't have the inertia problems microservices solve, they seem nice in theory but carry too much overhead to supplant monolithic approaches.


> How do you start developing multiple microservices at the same time?

Same as any other project: Develop from the outside in.

In practice, trying to develop in the "optimal order" leads to speculative development that will be wasted.


Nothing prevents you from having multiple microservices sharing the same codebase.

That said, the "it's just like the web" model doesn't sound fantastic to me. It sounds like your app now depends on contracts which are only enforced by good practices, not by something strongly typed you can check at compile time, unless you use something like protocol buffers to generate the boilerplate.


This is where the test-driven world, which in my experience is strongest on the dynamic language side of programming, has come back around full circle.

In microservices, everything is dynamically typed.

There is no single binary produced by a single compiler performing whole-program checks of consistency. Even tools like protobufs don't help when code bases drift, or someone introduces a foreign tool, or someone upgrades versions and introduces a subtle mismatch, or someone doesn't know you call their service and shuts it down...

Turns out that driving from tests, and starting those tests from the outermost consumer, is a fairly well-proved way of coping with such conditions.


> There is no single binary produced by a single compiler performing whole-program checks of consistency. Even tools like protobufs don't help when code bases drift, or someone introduces a foreign tool, or someone upgrades versions and introduces a subtle mismatch, or someone doesn't know you call their service and shuts it down...

Static typing is not a panacea, but large codebase plus dynamic typing everywhere sounds like a recipe for disaster. No matter the amount of testing.

> Turns out that driving from tests, and starting those tests from the outermost consumer, is a fairly well-proved way of coping with such conditions.

You need tests no matter what. However, static typing means a much greater confidence in your codebase.


As soon as you distribute your system, you have dynamic typing, whether you like it or not.

At runtime you are inspecting incoming messages and then routing them to code. It doesn't matter what language the code is written in, it will need to route and validate the messages at runtime.

The type system cannot provide compile-time assurances of behaviour, because it cannot create a single consistent binary which enforces the guarantees.

Your only remaining tool is to drive code from tests and only from tests.
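What that runtime routing and validation looks like in miniature (a hand-rolled shape check here; real systems would use JSON Schema, protobufs, or similar, and all names are invented):

```python
# The expected shape of one message type.
EXPECTED = {"type": str, "person_id": int}

def validate(message):
    # Every incoming message is checked on arrival, at runtime, no matter
    # how statically typed the sender's codebase is.
    for field, ftype in EXPECTED.items():
        if field not in message or not isinstance(message[field], ftype):
            raise ValueError(f"bad message: {message!r}")
    return message

# Routing table: message type -> handler.
handlers = {"PersonDeleted": lambda m: f"deleted person {m['person_id']}"}

def route(message):
    validate(message)
    return handlers[message["type"]](message)
```

The compiler never sees the wire, so this check has to exist somewhere, whatever language either side is written in.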


> As soon as you distribute your system, you have dynamic typing, whether you like it or not.

You have serialization/deserialization issues. You can still type your messages.

> At runtime you are inspecting incoming messages and then routing them to code. It doesn't matter what language the code is written in, it will need to route and validate the messages at runtime.

Of course.

> The type system cannot provide compile-time assurances of behaviour, because it cannot create a single consistent binary which enforces the guarantees.

If you make the assumption that you deploy up-to-date binaries, then knowing at compile time that your producer and consumer use the same data structure for the messages they exchange would give me much better confidence than "it looks like the API conforms to what's written on the wiki".


> You can still type your messages.

You can hope that they respect the type. For a robust distributed system, you will have to check everything at runtime.

> If you make the assumption that you deploy up-to-date binaries, then knowing at compile time that your producer and consumer use the same data structure for the messages they exchange would give me much better confidence than "it looks like the API conforms to what's written on the wiki".

My reading is that we agree that running code is the only source of truth, we disagree on what guarantees distribution deprives us of.


If you cannot ensure that your producer receives messages following a certain schema, even though you enforce it statically in your codebase, you also cannot ensure that your running code passes your tests.


Which is why I start from integration testing of the whole system, with frenemy tests for any foreign services that I must rely on.

You're right that tests don't make Byzantine failures go away. But neither do static types. My point that distribution turns all systems into analogies for dynamic language programming remains, and so the emphasis on tool support changes along with it.


This reminded me of the AngularJS team deciding to go with (optional) runtime type checking over compile-time checking (which is what TypeScript has done to JavaScript). Their reasoning was that you can use runtime checking for REST responses, which can be argued to somewhat reduce the need for writing tests.


Most systems I've seen these days don't have any compile-time type checking, since they're all written in Ruby, Python, or Node.js.

In the normal case of development, you tend to have a broken-out system. For a game for example:

    + Game code
    --+ User authentication classes/functionality (which accesses DB)
    --+ Messaging classes/functionality (which accesses DB)
    --+ User metrics classes/functionality (which accesses DB)
In the new design you'd have this:

    + Game code
    --+ User authentication classes/functionality (which accesses REST service)
    --+ Messaging classes/functionality (which accesses REST service)
    --+ User metrics classes/functionality (which accesses REST service)
In other words, in a clean design, your Game code is accessing a library which provides user Authentication functionality, one which provides Messaging functionality, and one which provides Metrics functionality.

In this new design, you have exactly the same thing - a library which abstracts the details of communicating with the service, encoding data, etc. A person making changes to those libraries, which other services use, is responsible for either not making backwards-incompatible changes, or, when that isn't possible, working with other teams to ensure a clean upgrade path (or doing it themselves, if your lines are sufficiently blurred).


> In this new design, you have exactly the same thing - a library which abstracts the details of communicating with the service, encoding data, etc. A person making changes to those libraries, which other services use, is responsible for either not making backwards-incompatible changes, or, when that isn't possible, working with other teams to ensure a clean upgrade path (or doing it themselves, if your lines are sufficiently blurred).

The new design trades a "modular but monolithic" design for complexity and brittleness, IMHO. The ability to spin up new instances of a given service on demand is interesting, but it sure sounds like reinventing Erlang without Erlang's tooling.



I really wish the team I'm on could understand this!


Building and rebuilding and maintaining applications is hard enough. Now we have whitepapers from a company that wants to sell us on a new paradigm which (cough, cough) they just happen to have software to support.



