Istio Observability with Go, gRPC, and Protocol Buffers-Based Microservices (itnext.io)
98 points by Terretta on April 21, 2019 | 56 comments



Also published (free access) here: https://wp.me/p1RD28-6k1


Have you considered doing a similar writeup for Linkerd? (https://linkerd.io)


I found this interesting and an excellent general overview of today's monitoring practices when using microservices.

I have to say, though, that I am always perplexed by these toy microservice architectures, because they are not solving a problem but creating one. For example, we would not need sophisticated distributed tracing if we used fewer moving parts; we would not need highly optimized (but hard to debug) protocols like gRPC and protobuf if we did not have to rely on a massive graph of calls for each service; and so on.

Of course, there are a bunch of lovely use cases for microservices, mostly around "how do we make 1k developers collaborate over our codebase?" (answer: enlarge the codebase!), which always sounded like a self-fulfilling prophecy to me, but then again, I loathe working for huge companies.

If that is your case, then these tools will make your developer life a little more bearable. If you are playing with microservices because they are fresh and new, please consider a more conservative architecture first.


These solutions didn't appear out of thin air because people got bored one day. Microservices solve tangible, real-world problems.

gRPC isn't necessarily hard to debug, and it brings a whole host of improvements over whatever ad-hoc mess you used before.


Improvements like...?


Speed, schemas, code generation, built-in backwards compatibility, extension points, public service definitions, and introspection, to name a few.
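
For illustration, here's roughly what the schema and code-generation parts buy you on the Go side. A minimal sketch, assuming a hypothetical protoc-generated package pb with a UserService; the names are made up:

    package main

    import (
        "context"
        "log"

        "google.golang.org/grpc"

        pb "example.com/gen/userpb" // hypothetical generated stubs
    )

    func main() {
        conn, err := grpc.Dial("users:50051", grpc.WithInsecure())
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()

        // The client, request, and response types all come from the .proto
        // schema, so this call is checked at compile time rather than being
        // ad-hoc JSON over HTTP.
        client := pb.NewUserServiceClient(conn)
        resp, err := client.GetUser(context.Background(), &pb.GetUserRequest{Id: "42"})
        if err != nil {
            log.Fatalf("GetUser: %v", err)
        }
        log.Println(resp.GetName())
    }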


You forgot real bi-directional streaming but otherwise I agree :)
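
For anyone who hasn't seen it, the server side of a bidirectional stream in Go looks roughly like the sketch below (imports omitted). ChatService and ChatMessage are hypothetical generated types; Recv and Send can be interleaved freely while both directions stay open:

    // Handler for a hypothetical `rpc Chat(stream ChatMessage) returns (stream ChatMessage)`.
    func (s *chatServer) Chat(stream pb.ChatService_ChatServer) error {
        for {
            msg, err := stream.Recv() // next message from the client
            if err == io.EOF {
                return nil // client closed its send side
            }
            if err != nil {
                return err
            }
            // Reply on the same stream; no request/response pairing required.
            if err := stream.Send(&pb.ChatMessage{Text: "ack: " + msg.GetText()}); err != nil {
                return err
            }
        }
    }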


Performance, I guess. Also, a schema. E.g. JSON has schemas (JSON Schema), but most APIs I know don't use them.


> These solutions didn't appear out of thin air because people got bored one day.

I certainly did not say these architectures appeared out of thin air. They were invented to solve the problems of internet-scale companies.

> Microservices solve tangible, real-world problems.

Incorrect. Microservices solve tangible, real-world problems when they are applied to the correct problems. Microservices might also create tangible, real-world problems whether they are applied correctly or not. In fact, that's one of the points of the article, if you read between the lines.

> gRPC isn't necessarily hard to debug, and it brings a whole host of improvements over whatever ad-hoc mess you used before.

I don't know what you are referring to. If your microservices are well done, good for you. You still need to optimize the transport, which is a cost you pay only because of your choice of architecture. If you used thicker services or a monolith you would have different problems, but not that one.


As a developer who uses go microservice architecture and is in the process of containerizing our services at work, this is very valuable. Thank you!

As an elixir/erlang enthusiast, haven't we just reinvented BEAM and observer?


Yep. No small number of Erlang/Elixir programmers are kind of amused at how K8s is trying to invent OTP for nodes.


I wish we could somehow convince the undoubtedly very smart people working on K8s to just get on the BEAM and go from there? You know, work on Dialyzer, fix my code highlighting issues in VSCode, make Elixir faster, somehow make it easy to create cross-platform binaries, and give me a platform-native Observer ;).


I suppose it's not easy to part with the illusion that containers should be ephemeral.

I mean, that's probably easy to do for microservices; you fire off a function, it does something useful, and it returns quickly. Having K8s do home-grown lambda hosting for you is useful there.

But the moment you introduce any state, the whole idea of ephemeral containers is much more of a hindrance than something actually useful. A single BEAM VM might run (realistically) tens of thousands of processes that do a lot of useful work. It's just not practical to kill off such a container and hand-wave away the huge costs of spinning it back up.

IMO K8s and Erlang/Elixir are mutually exclusive at this point, sadly.

There are projects in progress that attempt to tackle the distributed-supervisor idea -- Swarm, Firenest, Horde -- but even without that, any Erlang/Elixir app can go a long way before needing distributed coordination.


The complexity and amount of code involved is staggering. I'm not sure how we can possibly justify the Kubernetes, Istio, Docker, Envoy, Prometheus, Grafana, Jaeger, Kiali, Helm stack.

What problem are we solving that we were not solving 15 years ago, again? How much time and effort are being saved (by organizations smaller than Google)?


It attempts to solve the problem of managing elastic container infrastructure, in which everything is changing all the time.

The industry is in a big transition. First we ran VMs in datacenters, then we ran them in cloud providers. Then we started running containers and realized that their ephemeral nature made it possible to treat deployment like code, and Kubernetes is now the standard for doing that.

So, Kubernetes gives us easy abstractions for deployment. But having lots of little ephemeral containers that are constantly changing creates problems of security and visibility and routing. Service mesh is an attempt to solve that problem.

I would point out that Kubernetes has become an industry standard at this point. A few years ago it was reasonable to think Mesos would become a standard, or that multiple solutions would coexist. In fact, Kubernetes has destroyed its competition and Mesos is basically dead. It would currently be insane to adopt one of Kubernetes's competitors or try to roll your own.

If your organization standardized on Mesos a few years ago, you probably regret that decision today, and you are probably forced to plan a transition to Kubernetes.

The service mesh market is very immature at the moment, and the barrier to entry is still low. Witness the recent arrival of AWS App Mesh, which is probably going to do quite well. I would be very wary of committing to a particular mesh until the dust settles a bit and we see a clear winner. Otherwise you run the risk of choosing the Mesos of the mesh world.


Overall I agree, but Mesos is still big for the frameworks around big data and stateful services. Kubernetes doesn't have the same level of good operators/controllers for heavily stateful services yet, and the ones that do exist aren't very mature.

With time I believe this will get better, but the lack of mature Kubernetes client libraries in Java is probably a good reason why this is currently the case. Most of these stacks are all Java based, and the people who write software that uses Mesos, or work on Mesos itself, don't have Kubernetes libraries with the same maturity as the Go-based ones. So it's either wait for (or fix) the Java libraries, or write (and maintain?) an operator in a language that isn't what you regularly use.


As a small counterpoint of anecdata, I work at one of those companies that standardized on Mesos (with Apache Aurora for scheduling, Kerberos+LDAP for access control). While our devs have small axes to grind with Aurora, I have never heard our SREs complain about Mesos. The only complaint one might plausibly raise is that it's not Kubernetes (!), in the sense that 5 years from now it might be hard to find an SRE with Mesos experience vs K8s experience.

Mesos’s “batteries not included” approach means it does not now and will not ever have feature parity with K8s, but we’ve managed to cobble enough batteries together to make it operationally sufficient for our needs. We certainly aren’t planning a transition away as far as I can tell. Anyways, just my 2¢! :)


I expect that you will eventually be forced to transition in the next few years. Mesos will fall further and further behind Kubernetes due to lack of development and lack of integrations with new service meshes, telemetry systems, etc. Most new projects boast Kubernetes integration, and omit Mesos integration. Although you may want to hold out with Mesos so you can wait and see if Kubernetes gets supplanted by a newer model.

Even Docker, which competes directly with Kubernetes via Swarm and Compose, has felt the need to include a Kubernetes cluster in Docker for Desktop. That's how comprehensively Kubernetes has taken over the container orchestration market.

In any case I was not specifically criticizing Mesos, but rather using the Mesos vs Kubernetes example to point out that committing to a particular service mesh this early in the evolution of the concept is probably unwise.


I think we are in the middle of a transition. The last transition was dedicated servers to VMs; the current one is VMs to containers. Everything that is going on in this space is just part of that transition. Once the dust settles, Kubernetes or some form of it will become the standard cloud-provider interface, so instead of using AWS or GCP APIs, developers will just use the k8s APIs irrespective of where they run. This will greatly benefit Google and, to some extent, Azure, but most of all it'll make developers' lives a lot easier, as they won't have to manage clusters themselves and can rely on cloud providers' hosted container-management solutions, which most likely will be k8s.

While we as developers have taken on a lot more complexity today, soon we'll only care about our apps, as most of these components will standardize and be available on all clouds. At least I hope so.


How about 5,000 engineers each committing and deploying daily?


How about 100k engineers? Amazon has more and it builds and deploys packages in a way not too dissimilar from Linux distros.


Amazon doesn’t have a comparable tool in each of these categories?

No deployment automation, no service discovery/load balancing, no tracing, no time series aggregation or visualization.

The ones they sell to the public aren’t needed at all internally?


If that's the real problem this is the wrong solution.

I don't care how good your hiring practices are, there is zero chance you get 5000 devs using this stack correctly.

Less is more at this scale.


Devs by and large aren't doing this stuff, it's abstracted by infrastructure teams behind application frameworks and a "Deploy" button UI.

Much of this stuff was developed by such infrastructure teams to solve their own problems; people realized they were common across the industry and started collaborating on open source. The alternative is typically half-baked, homegrown deployment automation, service discovery, etc. not a fundamentally simpler architecture.


this is not true. the promise of k8s led many companies to believe that individual dev teams could now manage their own infra, and it's been a giant mess. almost no one had the expertise or patience to properly manage their clusters. my company has pivoted back to a central cluster, managed by real infra people, that teams can run on, with dev teams leaning more heavily on PaaS/serverless systems that they can actually wrap their heads around.


Managing infra != managing your own cluster. Don't be silly, that was never Kubernetes' value proposition. It's always been shaky to self-host and manage that yourself, even if you are a seasoned infrastructure guy.

The idea was that they could manage the infrastructure their services need - ingresses, secrets, certificates, pods, load balancing, service connections, etc as code. And it’s been a wild success.


im not trying to be “silly”. i’ve seen this happen first hand. you can argue that that wasn’t the original idea but i’ve seen enough “k8s solved all our problems!” type conference talks and blog posts to see how people can be fooled.


I too have seen technical car crashes, where people have chosen a technology and completely missed whatever value proposition it has, misusing it in the process. E.g. "let our developers manage a cluster"?!

I wouldn't draw many conclusions about the technology itself from that though.


Do you have a few examples of these?


> I don't care how good your hiring practices are, there is zero chance you get 5000 devs using this stack correctly.

Google is able to pull it off. Obviously people make mistakes but there are systems and processes in place to take care of that. It actually does work pretty well.


Google, Amazon and Facebook pull it off without using the tools listed.


But with their own, older implementations of the same constructs, which inform the design of the listed tools.


Kubernetes and co. is an effective way to aggregate tens to thousands of machines, promote best practices through primitives (autoscaling, rollouts, disruption budgets, traffic shifting, monitoring, etc.), and promote the use of immutable infra in orgs.

15 years ago doing this stuff well took at least 10x the engineers it does today and you would have had to cut yourself on every sharp corner that these tools make smooth.

At my last company (a startup) I (1 person) bootstrapped infra for a continuously deployed web app that distributed scientific workloads across 10k+ cores of compute in 6 months with no prior experience running large clusters. 15 years ago that would have been impossible. Frankly it wasn't really difficult enough to be fun with these tools.


We are all doing things more complicated than we were 15 years ago. 15 years ago, it was routine for your completely static website to go down when one of the up-and-coming social media sites (think Slashdot) happened to link to you. Now, our entire computing lives are globally distributed. Write a document? Move to another computer and continue editing it right where you left off, and share it with your coworkers for feedback. Video chat with your doctor instead of driving to their office after making an appointment on the phone for a week from now. Read hundreds of global newspapers while you're underground on the subway commuting to work. As much as people on Hacker News hate technology, things are actually pretty good. It's so good that we just take it for granted.

How do you justify these technologies versus what we had 15 years ago?

Kubernetes: 15 years ago, you waited 3 months for Dell to ship you a new server, then you went to your datacenter on Saturday to install it in your rack. Hmm, the air conditioner seems broken. File a support ticket with the datacenter, pay $10,000, then spend next weekend migrating your application to the new server. Now? Edit the line that says "replicas: 1" to say "replicas: 2" and kubectl apply it. Enjoy the rest of your weekend with your family. Now, your customers can purchase your products on Black Friday, meaning extra money for your company and extra salary for you. "Come back later." They never do.

Istio: Istio exists because of flaws in the design of Kubernetes's load balancing. It thinks that "1 TCP connection = 1 transaction" but in the world of persistent connections (gRPC, HTTP/2), that is untrue, so Istio exists to bring that sort of abstraction back. 15 years ago, you're right, you didn't need it. Your website just said "MySQL connection limit exceeded" and you hoped that your customers would come back when you got around to fixing it.

Docker: Instead of manually installing Linux on a bunch of computers, you have a scripting language to set up your production environment. The result is the ability to run your complicated application on hundreds of cloud providers or your own infrastructure, with no manual work. You reduce the attack surface, protecting your users' data, and you ensure that bugs don't get out of control by limiting each application's resource usage. 15 years ago, you spent hours configuring each machine your software ran on, crossing your fingers and praying that your machine never died and that the new version of Red Hat didn't break your app.

Envoy: 15 years ago you used Apache. Now you use Envoy. From transcoding protocols (HTTP/1.1 to HTTP/2, gRPC to gRPC-Web or HTTP+JSON) to centralizing the access and error logs across thousands of applications to providing observability for anything using the network, it's the Swiss Army knife of HTTP. It's light, it's fast, it's configurable, and it does what it claims to do extremely well. Maybe you don't need it, but SOMETHING has to terminate your SSL connections and provide your backend application servers with vhosts. Might as well be Envoy. It's the best.

Prometheus: 15 years ago, you waited for your users to report bugs in your software. Now you can monitor them 24/7, and get an alert in Slack before your users notice that things are going south. I am not sure how you argue against monitoring and metrics. Maybe you like reading the log files or waiting for your coworkers to swarm your desk saying they can't work because your software blew up. I hate that. Prometheus lets me see inside my applications so I never have to wonder whether or not it's working.
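
To make that concrete, instrumenting a Go service for Prometheus is only a few lines with the official client library. A minimal sketch; the metric name and endpoints are made up:

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // A made-up counter; Prometheus scrapes it from /metrics below.
    var requests = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "myapp_http_requests_total",
            Help: "HTTP requests handled, by path.",
        },
        []string{"path"},
    )

    func main() {
        prometheus.MustRegister(requests)

        http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
            requests.WithLabelValues(r.URL.Path).Inc()
            w.Write([]byte("hello\n"))
        })

        // Prometheus pulls metrics from this endpoint on its own schedule.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }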

Grafana: A nice UI for looking at Prometheus metrics and annoying me when they are not good. Clean code, nice UI, great featureset... could not live without.

Jaeger: Jaeger exists in a world that's moved past "our app" to "our cluster". Maybe you hate microservices, it's a pretty popular thing to hate. But if you are using them, you need to know they are communicating, and Jaeger is that. Another service I couldn't live without. (At Google, we had a shitty version of Jaeger called "Dapper". It was indispensable. Jaeger is just a version of that that works better and you can use outside of Google.)
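
For what it's worth, the application-side part of tracing is small. With the OpenTracing API (which Jaeger implements), a unit of work just opens a span, roughly like this sketch (Jaeger tracer setup and imports omitted; the operation and tag names are made up):

    // With no real tracer registered this is a no-op; in production you'd
    // install a Jaeger tracer as the global tracer at startup.
    func fetchUser(ctx context.Context, id string) error {
        span, ctx := opentracing.StartSpanFromContext(ctx, "fetchUser")
        defer span.Finish()
        span.SetTag("user.id", id)

        // Pass ctx to downstream calls so the trace propagates
        // across service boundaries.
        _ = ctx
        return nil
    }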

Kiali: Never used it. I imagine it's good when you have a production environment shared by multiple teams, and you want to keep an eye on unexpected dependencies.

Helm: Pretty awful, use kustomize instead. 15 years ago, though, you just had 100 random files in /etc/ and /var/lib/cgi-bin that Helm attempts to replace. Now you get backups, source control, code reviews, and guaranteed consistency between machines. You never had an outage 15 years ago because someone edited some random file in production? Lucky you, because I sure did. Helm attempts to make configuration less "interesting" and "fun". I think it's a bad design, but it's way better than what we did 15 years ago.

Hope this helps.


it solves the cloud provider problems.

if everyone makes their platform out of those small and generic blocks, it will be very easy for them to provide standard services around the blocks. while if you had an efficient and sane application with two of the basic concepts those blocks provide coded in, the cloud providers would have to provide custom bindings for all the things they wrap around the small blocks, making their lives much more difficult.

that's why you mostly see larger companies advertising the sexiness of containers. because they are either the cloud provider, or they are already bound to a cloud provider, or they work internally in departments that in the end act much like cloud providers anyway.


The cloud provider problem is lock-in using proprietary services - but you don't have to do it that way either, pretty much everyone can run a virtual machine running whatever flavor of Linux you like...


but only the cloud providers get rich by convincing you to do all the work to make your application horizontally scalable so they can sell you the convenience of scaling horizontally!


Is Istio/Envoy supported on a non-k8s deployment?

We're deploying an application that's in containers but not running on k8s, and has:

* an Angular front-end, using grpc-web

* a gRPC-web proxy (improbable-eng; golang)

* an nginx proxy (for Angular, and routing grpc traffic)

* an authentication manager (grpc+rust)

* an audit logging service (grpc+rust)

* a database service (grpc+kotlin)

* a 'logic' engine thing (grpc+kotlin)

* jaeger for distributed tracing

It's been a joy to develop, but I wouldn't mind reducing the 2 proxies to 1 with maybe Envoy. I tried rewriting grpc-web proxy in Rust, but gave up after struggling and not having enough time to complete it.


It is a goal of Istio to be portable and not tied to Kubernetes, but I don't think it's there yet.

I have looked for people who have successfully run Istio in production outside Kubernetes and I cannot find any. Most of the documentation and examples you will find online are for Kubernetes.


Envoy assumes nothing about k8s AFAIK. It should be able to run just fine outside it but if you were using k8s, then using ambassador instead of pure Envoy would be a better choice. http://getambassador.io/


How is Rust support for gRPC? Last time I checked it was very slow / broken compared to the officially supported languages. Also, I don't quite understand using 3 different languages for a tech stack; why don't you stick to one language?


My understanding is that the Linkerd folks have a production gRPC client and server.


Yes, I'm using Tower-grpc from them. Stable and has the features that I need.


Do you not need bidirectional streaming, or do they support it now? There were no examples of how to do it last I checked.


It should be. I was at a talk where they said Envoy was, and I believe Istio is coming soon. From what I remember, Envoy is what you need to plug in to bridge your infrastructure.


Can't read due to Medium's free-access limit policy.


Also published on my own blog: https://wp.me/p1RD28-6k1


Things I don’t get:

1. Why are they choosing gRPC over REST if the application is entirely web based? With REST you have a standardized system of verbs rather than application-specific functions (GET /user/id, 200 OK <data> vs. MyGetUserFoo(ID) with some return data unique to the application). With RPC, both sides need very specific knowledge of the functions each other have and their arguments at all times, service discovery seems harder, and an update to the application seems to nearly always imply an update to the web client.

2. What is the model supposed to be for typical protobuf schema sharing? I like PBs, but it seems a little harder than JSON or CBOR or MessagePack or other schemaless serializers, in that the proto file has the same sync issue among all of your endpoints.

3. Wouldn't it be really swell if serializers were supported by cloud providers and microservices a little more? I have a project right now where every time a message goes from our services to an endpoint it needs encoding and decoding. This is kind of a headache now for every microservice we have; it would be nice to not have the option of making so many mistakes.


> With RPC both sides need very specific knowledge of the functions each other have and their arguments all the time

Some might say that is a good thing. It kind of makes your API "type-safe". You can also auto-generate client libraries for the grpc services so your JS code would literally import a function and call it with args instead of dealing with XHR stuff.

> 2. What is the model supposed to be for typical protobuf schema sharing? I like PBs, but it seems a little harder than JSON or CBOR or MessagePak or other schemaless serializers in that the proto file has the same sync issue among all of your endpoints.

I don't think the "sync" issue is really a problem. You already have sync issues even if you just use REST: you need to make sure your client code is up to date to handle any changes in your API. This is more or less something engineers need to sync manually; maybe write additional tests, maybe annotate APIs with versions, etc. Something like gRPC makes this explicit, and since it can generate code, it allows us to build tooling that can automate most of the "sync issues".


REST is seriously more expensive traffic-wise. Traffic may be cheap, but latency of receiving a large message can't be helped.

Protobufs can be and in normal practice are made backwards-compatible. For serious breaking changes, you want to version your endpoints anyway.
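
Concretely, the wire format is what makes this work: new fields get new tag numbers, and old readers skip tags they don't know about. A sketch in Go, with hypothetical generated packages pbv1 (old schema: id=1, name=2) and pbv2 (new schema that added email=3); imports omitted:

    // Bytes produced by a "new" service...
    newMsg := &pbv2.User{Id: "42", Name: "Ada", Email: "ada@example.com"}
    raw, err := proto.Marshal(newMsg)
    if err != nil {
        log.Fatal(err)
    }

    // ...still decode fine in an "old" client that only knows the v1 schema:
    // the unknown email field (tag 3) is simply skipped.
    oldMsg := &pbv1.User{}
    if err := proto.Unmarshal(raw, oldMsg); err != nil {
        log.Fatal(err)
    }
    log.Println(oldMsg.GetName()) // "Ada"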


gRPC’s responses are backwards compatible, so you aren’t mandated to update all clients at the same time.


You're still blocked from using later features, potentially even reaching a dependency-chain deadlock, until you update.

Systems like this work far better when you just work to not break backwards compatibility full stop.

Similarly, they work better when you don't tangle up dependencies in a bunch of shared code, generated or otherwise.


I'm unclear what problem you're describing.

You're always blocked from using later features until an update because, well, you won't have application code that can take advantage of those features until you update.


That's missing the point. Yes, the goal is to get to a state where all the services are running the new code, but the hard part is that transition. You can manually handle it using versioning and such, but gRPC handles that for you.


JSON is as well.


JSON is agnostic about compatibility of any kind. If you want your service to be backwards compatible, and it is serving JSON, I think you're only going to achieve your goal through rigorous 'integration' tests.



