Service mesh use cases (2020) (lucperkins.dev)
135 points by biggestlou on Feb 11, 2023 | 48 comments



I’ve only read about service meshes; my impression is that they add an awful lot of processes and complexity just to make developers’ lives slightly easier.

Maybe I’m wrong but it almost feels like busy work for DevOps. Is my first impression wrong? Is this the right way to architect systems in some use cases, and if so what are they?


>slightly easier

As a company grows sooner or later most of these features become pretty desirable from an operations perspective. Feature developers likely don't and shouldn't need to care. It probably starts with things like Auth and basic load balancing. As the company grows to dozens of teams and services then you'll start feeling pain around service discovery and wish you didn't need to implement yet another custom auth scheme to integrate with another department's service.

After a few retry storm outages people will start paying more attention to load shedding, autoscaling, circuit breakers, rate limiting.
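
For a sense of what that looks like when it lives in each client instead of the mesh, here's a minimal Go sketch of capped exponential backoff with full jitter; names and constants are illustrative, not from any particular library:

  package retry

  import (
    "fmt"
    "math/rand"
    "time"
  )

  // DoWithBackoff retries fn with capped exponential backoff plus full jitter,
  // so a fleet of clients doesn't hammer a recovering service in lockstep
  // (the classic retry storm). Callers are expected to pass a small maxAttempts.
  func DoWithBackoff(fn func() error, maxAttempts int) error {
    const (
      base       = 100 * time.Millisecond
      maxBackoff = 5 * time.Second
    )
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
      if err = fn(); err == nil {
        return nil
      }
      backoff := base << attempt
      if backoff > maxBackoff {
        backoff = maxBackoff
      }
      // Full jitter: sleep a random duration in [0, backoff).
      time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
    }
    return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
  }

A mesh (or a shared library) just moves this sort of logic, plus circuit breaking and rate limiting, out of every individual codebase.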

More mature companies or ones with compliance obligations start thinking about zero-trust, TLS everywhere, auditing, and centralized telemetry.

Is there complexity? Absolutely. Is it worth it? That depends where your company is in its lifecycle. Sometimes yes, other times you're probably better off just building things and living with the fact that your load shedding strategy is "just tip over".


We’re in the process of moving all of our services over to a service mesh and while the growing pains are definitely there, the payoff is huge.

Even aside from a lot of the more hyped up features of service mesh, the biggest thing Istio solves is TLS everywhere and cloud-agnostic workload identity. All of our pods get new TLS certs every 24 hours and nobody needs an API key to call anything.

Our security team is thrilled that applications running with an Istio sidecar literally have no way to leak credentials. There are no API keys to accidentally log. Once we have databases set up to support mTLS authentication, we won’t need database passwords anymore.


Some of the functionality you mentioned above is possible without a service mesh.


All of the functionality of Kubernetes can be implemented independently. It’s still a useful set of abstractions, not least because it’s understood by a large portion of the industry.


It’s 100% a question of scale. And I don’t mean throughput, I mean domain and business logic complexity that requires an army of engineers.

Just as it’s foolish to create dozens of services if you have a 10-person team, you don’t really get much out of a service mesh if you only have a handful of services and aren’t feeling the pain with your traditional tooling.

But once you get to large scale with convoluted business logic that is hard to reason about because so many teams are involved, the search for scalable abstractions begins. A service mesh then becomes useful because it is completely orthogonal to biz logic: you can now add engineers 100% focused on tooling and operations, and product engineers can think a lot less about certain classes of reliability and security concerns.

Of course, in today’s era of resume-driven development and the huge comp paid by FAANGs, you are going to get a ton of young devs pushing for a service mesh way before it makes sense. I can’t say I blame them, but keep your wits about you!


If you can convince your business folks to run shit on the command line then there is basically no need for services ever. I know it sounds insane but it’s how it was done in the old days and there really is only a false barrier to doing it again.


A place I worked had support staff copy-pasting Mongo queries from Google Docs -- it worked in the early days, but eventually you have to start building an admin interface for more complicated processes.

When it was just Mongo, setup was easy since support staff only needed a Mongo desktop client.


Terminal can handle auth.


Many of the use cases described in the post are solved by service meshes.

So, in my opinion, the questions are introspective:

- “Do I have enough context to know what problem those solutions are solving, and to at least appreciate the problem space to understand why someone may solve it like this?”

- “Do I have, or perceive, those problems impacting my infrastructure/applications?”

- “Does the solution offered by the use cases described appeal to me?”

If the answer to all of these is yes, then one potential implementation is a service mesh.

A lot of these are solved out-of-the-box with Hashicorp’s Nomad/Consul/Vault pairing, for example!


It is true that a lot of those use cases are covered by "basic" Kubernetes (or Nomad) without the addition of Istio or similar, e.g. service discovery, load-balancing, circuit-breaking, autoscaling, blue-green, isolation, health checking...

Adding a service mesh onto Kubernetes seems to bring a lot of complexity for a few benefits (80% of the effort for the last 20% sort of deal).


> Adding a service mesh onto Kubernetes seems to bring a lot of complexity for a few benefits

I think the benefits are magnified in larger organizations or where operators and devs are not the same people. And the complexity is relative to which solution you pick. If you're already on Kubernetes, linkerd2 is relatively easy to install and manage; is that worth it? To me it has been in the past.


I like how you frame the questions. How many times do people pick a technology without answering them, or without even having some knowledge of the problem space?

I am wondering: does Nomad/Consul continue to scale past a certain level?


I don't know about Consul, but Nomad has been scaled to 2,000,000 containers on >6000 hosts

https://www.hashicorp.com/c2m


It's a "big company" thing. In my opinion, the best way to add mTLS to your stack is to just adjust your application code to verify the certificate on the other end of the connection. But if the "dev team" has the mandate "add features X, Y, and Z", and the "devops team" has the mandate "implement mTLS by the end of Q1", you can see why "bolt on a bunch of sidecars" becomes the selected solution. The two teams don't have to talk with each other, but they both accomplish their goals. The cost is less understanding and debuggability, plus the cost of the service mesh product itself. But, from both teams' perspective, it looks like the best option.
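
For context, the "verify the certificate on the other end" approach is roughly a tls.Config on each side. A minimal Go sketch of the server half, assuming an internal CA; the paths are illustrative:

  package main

  import (
    "crypto/tls"
    "crypto/x509"
    "log"
    "net/http"
    "os"
  )

  func main() {
    // Trust only the internal CA, not the system roots.
    caPEM, err := os.ReadFile("/etc/certs/ca.pem")
    if err != nil {
      log.Fatal(err)
    }
    pool := x509.NewCertPool()
    pool.AppendCertsFromPEM(caPEM)

    // This service's own certificate and key.
    cert, err := tls.LoadX509KeyPair("/etc/certs/tls.crt", "/etc/certs/tls.key")
    if err != nil {
      log.Fatal(err)
    }

    srv := &http.Server{
      Addr: ":8443",
      Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("hello, authenticated peer\n"))
      }),
      TLSConfig: &tls.Config{
        Certificates: []tls.Certificate{cert},
        ClientCAs:    pool,
        // Reject any peer that can't present a cert signed by the internal CA.
        ClientAuth: tls.RequireAndVerifyClientCert,
      },
    }
    log.Fatal(srv.ListenAndServeTLS("", ""))
  }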

I'm not a big fan of this approach; the two teams need to have a meeting and need to have a shared goal to implement the business's selected security requirements together. But sometimes fixing the org is too hard, so there is a Plan B.


I very much disagree with the sentiment that adding mTLS is just “verifying the certificate on the other end of the connection”. You ignore the process of distribution and rotation of certificates, which is non-trivial to implement application-side.


I honestly thought about covering a few ideas in the post, but decided it was off topic. The service meshes do include some rudimentary key generation and distribution code, which is nice to not have to build yourself. The simplest thing, if you're deployed in k8s or similar, is cert-manager + a CA + code that reloads keys when the secret is updated (pretty easy to write). This has downsides (good luck when your CA expires!) but it is easy and does keep itself functional. Cloud providers also have a service like this, which protects the root key with their own IAM (and presumably dedicated hardware); it's definitely a route you'll want to look into.
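
To make "reloads keys when the secret is updated" concrete, here's a rough sketch of that reload code, assuming the key pair is mounted into the pod from a Secret; the paths and the one-minute interval are illustrative:

  package main

  import (
    "crypto/tls"
    "sync"
    "time"
  )

  // reloader re-reads the key pair from disk at most once a minute, so a
  // rotated Secret (re-projected into the pod's volume by the kubelet) gets
  // picked up without restarting the process.
  type reloader struct {
    mu       sync.Mutex
    cert     *tls.Certificate
    loadedAt time.Time
  }

  func (r *reloader) getCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.cert == nil || time.Since(r.loadedAt) > time.Minute {
      c, err := tls.LoadX509KeyPair("/etc/certs/tls.crt", "/etc/certs/tls.key")
      if err != nil && r.cert == nil {
        return nil, err // nothing cached to fall back on
      }
      if err == nil {
        r.cert, r.loadedAt = &c, time.Now()
      }
    }
    return r.cert, nil
  }

  // Wire it up with: &tls.Config{GetCertificate: (&reloader{}).getCertificate}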

What's missing are a bunch of things you probably want to check before issuing keys; was the release approved, was all the code reviewed before release, is the code reading the foo-service key actually foo-service? That involves some input from your orchestration layer; i.e. an admission controller that checks all these things against your policies, and only then injects a key that the application can read. (Picking up rotated keys becomes more difficult, but this might be a good thing. "If you don't re-deploy your code for 90 days, it stops being able to talk to other services" doesn't seem like the worst policy I can think of in a world where Dependabot opens up 8 PRs a day against your project.)

This all has the downside that it doesn't really prevent untrusted applications from ruining the security; a dump_keys endpoint that prints the secret key to a log, nefarious code checked into source control but approved (perhaps due to a compromised developer workstation), etc. Fixing those problems is well outside the scope of a service mesh, but something you have to have a plan for. CircleCI didn't! Now you read 3 blog posts a day about how they got hacked.

Anyway, not sure where I was going with this, but application teams need to consider their threat model and protect against it. Security isn't a checkbox that can be checked by someone that didn't write the code. Sure, you can get all sorts of certifications this way that look nice on your marketing page, but the certifications really only cover "did they do the bare minimum to look kind of competent if it was 10 years ago". If you have sophisticated adversaries, you're going to need a sophisticated security team.


Can’t each service just have a job that calls the Let’s Encrypt api once a day to get a new cert?


Most of my programming peers want to focus on solving product-related problems rather than authn, authz, TLS config, failover, throttling, discovery…

We want to automate everything not related to the code we want to write. Service meshes sound like a good way to do that.


Right - but why not use something like an API gateway then?


API gateways are primarily used for HTTP traffic coming from clients external to your backend services, e.g. an iOS device (hence the term 'gateway' vs. 'mesh'). I don't think they support Thrift or gRPC (at least AWS doesn't, not sure about other providers). https://aws.amazon.com/api-gateway/


Google Cloud supports gRPC on their API gateway: https://cloud.google.com/api-gateway/docs/grpc-overview


That can work, but it means you've simply outsourced the problem to AWS. It's not a bad idea per se, but it means your service needs to speak HTTP in some form.

You could use the service mesh offering from AWS, along with Cognito JWTs, for authentication and authorization.


You can easily self host your own proxy. I bet API gateway is just Nginx, Traefik or HAProxy under the hood anyway.


I suspect if a Service Mesh is ultimately shown to have broad value, one will make its way into the K8S core.

To me, it's a fairly big decision to layer something that's complex in its own right on top of something else that's also complex.


> I suspect if a Service Mesh is ultimately shown to have broad value, one will make its way into the K8S core

I'm not so sure. I suspect it'll follow the same roadmap as the Gateway API, which it already kind of has with the Service Mesh Interface (https://smi-spec.io/)


Indeed, all major service mesh solutions for Kubernetes implement (at least some part of) the SMI specification. There is a group composed of these players actively working on making the spec a standard.

Understanding these few CRDs gives great insight into what to expect from a service mesh and how things are typically articulated.


I just wrote something extremely similar, but it's only internal right now.

I personally find that the service mesh value-prop is hard to justify for a serverless stack (mostly Cloud Run, but AWS Lambda too probably), and in situations where your services are mostly all in the same language and you can bake the features into libraries that are much easier to import.

Observability is a great example of this. In serverless-land, you're already getting the standard HTTP metrics (e.g. request count, response codes, latency, etc.), tracing, and standard HTTP request logging "for free."


> I personally find that the service mesh value-prop is hard to justify for a serverless stack (mostly Cloud Run, but AWS Lambda too probably), and in situations where your services are mostly all in the same language and you can bake the features into libraries that are much easier to import.

If you’re running serverless you already have 90% of what you’d get from a service mesh.

I will tell you that having seen what happens in big companies, baking distributed concerns into libraries always ends in disaster long after you’re gone.

When you have a piece of code deployed in 200 separate apps, every change requires tons of project management.


Now imagine you have something that has the complexity and change volume of a distributed control plane bringing together load-balancing, service advertisement, public key infrastructure, and software defined networking, and then try to imagine running it at the same reliability as your DNS.

Also: proxies, proxies everywhere, as far as the eye can see.


And in addition to that, all of those immediately becoming the same centralized single point of failure. What could possibly go wrong (on high load)? ;p


In most implementations this is not the case. Service Meshes tend to either follow a sidecar or a DaemonSet approach. You don't have a single proxy, people usually complain about the exact opposite.


This is the production-readiness part that people usually discover later, the hard way...


Thanks for this.

I have never deployed a service mesh or used one, but I am designing something similar at the code layer. It is designed to route between server components - that is, between threads in a multithreaded system.

The problem I want to solve is that I want architecture to be trivially easy to change with minimal code changes. This is the promise and allure of enterprise service buses and messaging queues and probably Spring.

I have managed RabbitMQ and I didn't enjoy it.

I want a system that can scale up and down, where multiples of any system object can be introduced or removed without drastic rewrites.

I would like to decouple bottlenecks from code and turn them into runtime configuration.

My understanding of things such as Traefik and Istio is that they are frustrating to set up.

Specifically I am working on designing interthread communication patterns for multithreaded software.

How do you design an architecture that is easy to change, scales and is flexible?

I am thinking of a message routing definition format that is extremely flexible and allows any topology to be created.

https://github.com/samsquire/ideas4#526-multiplexing-setting...

I think there is application of the same pattern to the network layer too.

Each communication event has associated with it an environment of keyvalues that look similar to this:

  petsserver1
  container1
  thread3
  socket5
  user563
  ingestionthread1
  
  
  
These can be used to route keyspace ranges (such as particular users to tenant shards, or for load balancing) to other components. For example, users 1-1000 are handled by petsserver1 and socket5 is associated with thread3.
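
A toy illustration of that kind of range-based routing; the types and names below are invented for the example, not taken from the linked spec:

  package route

  import "fmt"

  // Rule maps a keyspace range for one environment key to a destination
  // component, e.g. user 1-1000 -> petsserver1.
  type Rule struct {
    Key    string // e.g. "user"
    Lo, Hi int    // inclusive keyspace range
    Dest   string // e.g. "petsserver1"
  }

  type Router struct{ Rules []Rule }

  // Route picks a destination from an event's environment of keyvalues, so
  // changing the topology is a matter of editing Rules, not code.
  func (rt Router) Route(env map[string]int) (string, error) {
    for _, rule := range rt.Rules {
      if v, ok := env[rule.Key]; ok && v >= rule.Lo && v <= rule.Hi {
        return rule.Dest, nil
      }
    }
    return "", fmt.Errorf("no rule matches %v", env)
  }

  // Router{Rules: []Rule{{"user", 1, 1000, "petsserver1"}}}.Route(map[string]int{"user": 563})
  // returns "petsserver1".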

In other words: changing the RabbitMQ routing settings doesn't change the architecture of your software. You need to change the architecture of the software to match the routing configuration. But what if you changed the routing configuration and the application architecture changed to match?


>> But what if you changed the routing configuration and the application architecture changed to match?

If there were 3 ways to categorise scaling (there are more than this in reality) then they might be vertical, horizontal, then distributed.

You’re describing an architecture that’s in the horizontal scaling world view.

You’re not in vertical because you’re using higher-powered (but slower) strategies like active routing for comms between components, whereas in vertical you’d have configurable queues but no routing layer.

You’re not in distributed scaling mode because your routing is assuming consistent latency and consistent bandwidth behaviours.

I don’t think one architecture to rule them all is a solvable problem. I’d heartily and very gratefully welcome being proven wrong on this.


Thanks for your comment. It's definitely food for thought.

You remind me of the fallacies of distributed computing by mentioning consistent latency and bandwidth.

https://en.m.wikipedia.org/wiki/Fallacies_of_distributed_com...

I'm still at the design stage.

As for the architectures you describe, I am hoping there is a representation that can capture many of them. There are probably architectures I have yet to think of that are unrepresentable in my format.

Going from 1 to N, or removing or adding a layer, should be automatable. That's my hope anyway.

I want everything to wire itself automatically.

I am trying to come up with a data structure that can represent architecture.

I am trying to do what inversion of control containers do per request, but for architecture. In inversion of control containers you specify the scope for which an object is instantiated, such as a request or a session. I want that for architecture.


It’s such a fundamental problem space, with such a rich diversity of possible solutions, that at a minimum you’re going to create something seriously useful for a subset of application types. But it’d be transformational for computing if you cracked the whole problem. I hope you do.

I do like your idea of outsourcing the wiring (an error prone, detail heavy task) away from humans.


I’d say most of these patterns are supported by NATS: it can do pub/sub, but it also has excellent support for RPC, and in the latest iteration it has a KV store baked in as well. I’ve been using it for a few pet projects so far and it has never been the weakest link.
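
For a sense of the API surface, RPC-style request/reply over NATS in Go looks roughly like this (the subject name is made up):

  package main

  import (
    "fmt"
    "log"
    "time"

    "github.com/nats-io/nats.go"
  )

  func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
      log.Fatal(err)
    }
    defer nc.Drain()

    // "Service" side: reply to requests on a subject.
    nc.Subscribe("pets.get", func(m *nats.Msg) {
      m.Respond([]byte("fido"))
    })

    // "Client" side: RPC-style request with a timeout.
    resp, err := nc.Request("pets.get", []byte("id=1"), time.Second)
    if err != nil {
      log.Fatal(err)
    }
    fmt.Println(string(resp.Data))
  }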


I keep hearing about NATS but have yet to use it, either for fun or for work.

Thanks for the recommendation :-)


Service meshes make it easier to roll out advanced load management/reliability features such as prioritized load shedding, which would otherwise need to be implemented within each language/framework.

For instance, the Aperture[0] open-source flow control system is built on service meshes.

[0]: https://github.com/fluxninja/aperture

[1]: https://docs.fluxninja.com
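
For contrast, the kind of thing you'd otherwise hand-roll per language is a simple concurrency-limit shedder; a minimal Go sketch, with no notion of priority (which is exactly the gap a mesh-integrated system like Aperture aims to fill):

  package main

  import "net/http"

  // shed returns 503 once more than maxInFlight requests are in progress -
  // a crude, per-process form of load shedding.
  func shed(maxInFlight int, next http.Handler) http.Handler {
    sem := make(chan struct{}, maxInFlight)
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
      select {
      case sem <- struct{}{}:
        defer func() { <-sem }()
        next.ServeHTTP(w, r)
      default:
        http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
      }
    })
  }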


I really did not enjoy dealing with our service mesh at the last place I worked.


Out of curiosity, was it something built internally? Or were you relying on a public solution?


We were using Envoy Proxy


I thought the main use case of a service mesh was to reduce time to production delivery, allowing hotfixes to be much more reactive. Am I totally wrong?


It's not about fast delivery, at least not in this way. Arguably, if you need mTLS, traffic shaping, cross-service observability, service discovery... then yes, it's much faster to use an existing solution than to build it yourself. But it won't make your hotfixes ship faster.

Service mesh is nothing new; people just called it by different names back then. The key features it brings are:

- Traffic shaping: mirroring, canary, blue-green...
- Cross-service observability
- End-to-end encryption
- Service discovery


I think it is named Blue-green deployments in the article


Article is from 2020. Please add to title.


Added. Thanks!





