As a maintainer and end-user, my answer to this is...yes and no. It's important to clarify that stability - something mentioned in the article - has several major definitions:
- Stability in the specification
- Stability in semantic conventions
- Stability in the protocol representation
- Stability in SDKs that can generate data
- Stability in the Collector that can receive, process, and export that data
Unfortunately, many people interpret "stable" in one of those categories as "stable for everything", and then get really annoyed when they find their language doesn't actually have stable support (or any support!) for that concept.
What I'm most proud of in 2023 is all of the little things we made progress on with components that engineers have to materially deal with. On the website, we documented what feels like a million little things and clarified tons of concepts that people told us were confusing. Across all the SDKs, we fixed tons of little bugs, added more and more instrumentations, and completed the unsexy work to make metrics generation stable across most of our 11+ languages. The Collector added oodles and oodles of support for different data sources, and OTTL went from a neat component to a rock-solid general-purpose data transformation tool.
There's so much more work to do, but I'm really happy about the progress.
The biggest issue with OpenTelemetry is how aggressively it's being pushed despite not being mature enough. The AWS X-Ray team frequently suggests switching to OTel on bug reports and feature requests, but the performance and resource overhead of the OTel collector for Lambda is just awful right now. It doesn't make sense for any performance-sensitive workload.
Beyond that, it gives off an "over-engineered" vibe. It's probably not, and the complexity of being a unified standard that can work across so many different variations is inherently going to need a lot of abstractions, but it feels so much more difficult to go through OpenTelemetry compared to an opinionated observability SaaS.
I agree that it gives off an over-engineered vibe. I think part of that is that a lot of the "Getting Started" docs don't give you a feel for how to use the framework. It's more along the lines of "install this package and create this esoteric config file and this very particular telemetry logging will work". That's not a particularly useful walkthrough unless I have that exact same use case and want nothing more.
OTEL is a great concept with lots of potential. But in the project's current state, just reading about it makes me dizzy.
There are too many layers of indirection across the documentation, the specification, instrumentation libraries, collectors, protocols, backends, github issues, enhancement proposals, distributions, api vs sdk. Try to file an issue and you will be bounced between at least three of them.
I get it, the idea is to be vendor agnostic. But the central hub needs to be a lot more refined. In particular, the language instrumentation APIs don't need to pretend they are independent of the central project.
I mean, sure, you can improve performance a bit by increasing the RAM/compute capacity on the Lambda. But it always adds a pretty steep overhead right now, no matter how much capacity you throw at it.
Depending on the language it definitely has some "over engineering" issues. I love the API/SDK split and I think folks undervalue that.
I dislike:
1. Context/Scope: super complicated and too much abstraction/generalization here.
2. Span->activate returns a scope that needs to be deactivated manually, which is different from ending the Span (see the sketch after this comment).
It gets very complicated, and I assume it exists to support folks swapping parts of the SDK out within a scope down the call stack. I'm curious how much that is actually used and whether it justifies its conceptual weight.
Again, some of this stuff is a language-specific problem. In Java, it's mostly fine. In PHP (and, I assume, other environment-per-request dynamic languages), it's deep overkill.
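To make the Scope complaint concrete, here is a minimal Python sketch of the same split (the Java and PHP APIs differ in names but not in shape); the tracer name and the work are made up:

```python
from opentelemetry import context, trace

tracer = trace.get_tracer("example")  # hypothetical instrumentation name

# Explicit form: activating a span returns a token/scope that must be
# detached separately from ending the span itself.
span = tracer.start_span("work")
token = context.attach(trace.set_span_in_context(span))
try:
    pass  # ... do the work ...
finally:
    context.detach(token)  # closes the scope (pops the current context)
    span.end()             # ends the span; a separate, easy-to-forget step

# Idiomatic form: the context manager handles both steps for you.
with tracer.start_as_current_span("work"):
    pass  # ... do the work ...
```

The explicit form is what the complaint is about: two lifetimes (scope and span) that look like one.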
This is less likely an issue with OTel itself but just that AWS still doesn't natively support it. The x-ray daemon is similar to the collector, but it's run for free in the background while the OTel collector is in the user's lambda resources.
It was fun having the freedom to work on OTel within X-Ray to provide better instrumentation for users, but it was always frustrating that whenever I pushed for more native support within the team, such as ingesting OTLP directly, the answer was always no since it meant losing control. Note that I don't think the actual reason is to overcharge users (otherwise why invest in things like Graviton?), though the result does end up being that.
Ours ends up in Prometheus and I wish I had just used that instead.
But one of our coworkers was gung ho about tracing, so I obliged, not knowing what I had gotten myself into. And the moment I turned tracing on, that enthusiasm got quenched, because in a mature app the amount of tracing you want to do is about ten times the limit for messages per second per sender. It's a toy, and that functionality is now turned off, though the call graph changes to support it are still there and make our stack traces ugly.
Tracing is pretty life-changing when it works well though. It's worth spending some engineering resources to try and get it if you can, even if it's at a lower sampled rate.
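Worth noting that head sampling in the Python SDK is a one-line configuration (tail sampling, by contrast, lives in the collector); a minimal sketch assuming a 10% rate:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of root traces and follow the parent's decision otherwise;
# a cheap way to cap span volume at the source.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
```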
I'm definitely curious how data volume was a big concern for traces. I work with several customers with enormous trace volume, and while they absolutely need to do tail sampling so their bill doesn't explode, the actual export of spans from their services isn't typically the problem to solve.
We don’t have a monorepo so I can’t tell you how many lines of code we have anymore, but if it’s less than 300k lines I’ll eat my hat. It’s gotta be half a million. It’s a beast. No insane fanout, but unless I very carefully cherry-picked a couple of points in the code instead of piggybacking on our already pretty good telemetry choke points, I’d hit the limit even in preproduction.
We don’t historically do a lot of things well, but everything does correlationID propagation properly and some have pretty good telemetry, and have done since before I got here. Most other things in the “well” column were heavy lifting by myself and a handful of other instigators, some of whom gave up and left.
That’s probably what upsets me so much about OTEL. It made one of our strengths into another thing to complain about.
I get “designed by Java developers” vibes, which is a kind of overengineering to be sure, but a particularly nasty one, and one none of us should have to deal with in 2024.
We moved a project from statsd to otel last year, and I really wish we had spent that time on something else. I really wish I had gotten to spend my time in something else. Statsd makes it the aggregator’s problem to deal with stats, so application lifetime (you mention lambda) is not a problem. They can fire and forget, and at a much higher data rate than we achieved with OTEL.
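For contrast, the statsd model described here is about as small as instrumentation gets; a sketch using the common pystatsd client (host, port, and metric names are made up):

```python
import statsd  # the common pystatsd client

# Fire-and-forget: each call is one UDP datagram, no ack, no local state;
# aggregation is entirely the statsd server's problem.
client = statsd.StatsClient("localhost", 8125, prefix="myapp")  # assumed address
client.incr("checkout.requests")
client.timing("checkout.latency_ms", 42)
```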
The main feature of OTEL is the tagging, but Amazon charges you for using it. So much so that we had to impoverish our tags until they provided only a small factor of improvement in observability that was ultimately not worth the cost of migration. If I had straight ported, we would have generated about 40x as much traffic as we got to in the end by cutting corners.
OTEL is actively hostile to programming languages with a global interpreter lock. Languages where you run processes proportional to the number of cores on the machine need to tag each process separately. If you don’t, then the stats interfere with each other. And that brings us back to pricing by tag, because now you have 16-128x as many combinations of tags, because each process has a unique tag per machine.
We ended up having to put our own aggregator sidecar in place that could merge the counters and histograms from the same machine. If you cross your eyes it just looks like one of the statsd forks that adds tags. Which would have been so much easier for us to do.
And each process remembers every stat it has ever seen, otherwise it will self clobber. So again, 16-128x as much memory for bookkeeping unless you send them to a bookkeeping process.
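A sketch of what "tag each process separately" looks like with the OTel Python SDK; the service name and counter are made up, and in practice you would also attach a metric reader/exporter:

```python
import os
import socket

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

# Give every worker process its own identity; without this, counters from
# sibling processes clobber each other at the backend (last writer wins).
resource = Resource.create({
    "service.name": "my-app",  # hypothetical
    "service.instance.id": f"{socket.gethostname()}-{os.getpid()}",
})
provider = MeterProvider(resource=resource)
meter = provider.get_meter("my-app")

requests_total = meter.create_counter("requests_total")
requests_total.add(1, {"route": "/checkout"})  # each attribute set is a new series
```

Every distinct `service.instance.id` is, of course, exactly the extra tag dimension being billed for in the complaint above.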
They had weird memory leaks that only got sorted out in the summer. There’s a workaround, but it nearly caused a production outage for us and we violated our SLAs, which costs us money.
We also had total stats loss because the JavaScript implementation is TypeScript, and their TypeScript implementation did not assert that numeric inputs were numbers instead of numeric strings. That led to number + string arithmetic bugs, which led to giant numbers that OTLP choked on and dropped.
It also took a lot of work to get our sidecar to consume input from 30+ processes without dropping any. Most of our boxes run about cpucount + 1-2. We don’t have an obscene amount of telemetry, but it’s a mature app with years of “we need to track X” conversations. It wasn’t until September that I was confident we could ingest from 64 cores at once, and I have no idea how we’d handle 128. Because again, OTEL does not like more than one process that thinks it’s the “same” app, so you have to tag or centralize to disambiguate.
And the thing I hate the most about OpenTelemetry: it has One Bad Apple Syndrome. Because the stats are accumulated and sent in bulk, if it does not like one value in the update, it drops the entire message. One poison pill stat causes 100% loss of telemetry from that machine. That is a stupid fucking design decision and I want the author of that particular level of hell to feel my anger about this jackass decision with every fiber of my being. Fuck you, sir. You have no business working in standards track software. Get out.
> The main feature of OTEL is the tagging, but Amazon charges you for using it. So much so that we had to impoverish our tags until they provided only a small factor of improvement in observability that was ultimately not worth the cost of migration. If I had straight ported, we would have generated about 40x as much traffic as we got to in the end by cutting corners.
This bit is really unfortunate. Hopefully not too unhelpful, but there are several other vendors out there that don't limit you like this. Other tradeoffs to be sure (no such thing as a perfect o11y tool), but I think most dedicated observability backends have moved away from this kind of pricing structure.
We are also collecting on 30 second intervals instead of 10, which makes evidence of incidents a little more ambiguous. Not impossible, just slightly more stressful to handle a production incident. Like that guy you don’t hate, but are bummed he showed up to your party.
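For reference, in the OTel Python SDK that cadence is just a reader setting; a sketch with a console exporter standing in for a real backend:

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# 30-second collection interval, mirroring the cadence described above.
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(),
    export_interval_millis=30_000,
)
provider = MeterProvider(metric_readers=[reader])
```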
The way billing is done in observability offerings everywhere is so frustrating. CloudWatch metrics is such a footgun that I'm terrified when any engineer on our team submits a PR that includes them. It's so easy to accidentally 100-1000x your bill.
I feel like you're always left to choose between obscene pricing models and AbstractSingletonProxyFactoryBeanProvider level "enterprise" configuration.
Hmmm, I wouldn't say this is everywhere. There are tools that aren't metrics-based and don't charge based on attributes and/or their cardinality. I'm avoiding shilling the place I work for since some of the messaging around "observability 1.0 vs. 2.0" comes from us, but that's kind of the gist of it.
If an attribute is important for debugging something later you shouldn't have to pay 100x the cost to be able to use it. Unfortunately, when you're using a "1.0" type tool such as CloudWatch or DD Metrics you end up needing to guess the economic cost of data you include and measure it against perceived economic value later down the line, which is a terrible experience.
(Switching observability tools is no joke though, so I won't say "just switch tools!" -- if the current pains are high it may be worth it, but there's no simple way to switch that I've seen)
The collector or protocol doesn't care that you have duplicate entries (from different processes). It's your telemetry/storage backend that drops the data. We have the same issue with Datadog; fuck them for silently dropping data.
I’m pretty sure otelcollector is where it’s getting dropped. It sees data with the same tags coming from a few dozen processes, doesn’t know how to aggregate them, and so just does last writer wins about once a second. Makes for some very peculiar rate charts.
It wasn't dropped or not dropped that was the problem, it was clobbering.
If one connection said it had seen 200 events, and another 190, it ping ponged between them instead of deciding 200+190 = 390. What keeps them from clobbering is distinct tags per connection. If you're running Ruby or Python or Node, that's one connection per thread, and that's ridiculously expensive.
> my god are the reference implementations lacking.
Can you share some of your experience, what do you mean by that? Are there edge cases causing problems, or major missing features? Easy or difficult to use?
As an example, Exemplars are part of the metrics spec [1]. The official python library says metrics status is 'stable' [2]. But there's an approximately 2-year old issue with no work on it, titled 'Metrics: Add support for exemplars', where the latest update is that no work has begun [3]. Nothing at a top-level of the opentelemetry-python project indicates that the project does not implement everything in the metrics spec, so if you wanted to use that capability, you are apt to discover it relatively late.
In the Elixir library, we don't even have metrics. I went down the OTel rabbit hole for two days trying to understand how it's better than Prometheus, just to learn it doesn't even do the basic thing, just traces.
I've mentally decided to just go Prometheus and ignore OpenTelemetry for the foreseeable future.
It's one of those things big players are hyping to preemptively lock you in their solution, but it's actually just alpha-quality new tech and "boring" "old" tech like Prometheus or statsd are simply more functional and better supported in the wild.
Elixir opentelemetry works quite well with Tempo. Tempo does the metric generation [1] and writes it to Prometheus. Tempo also does Service Graphs which works great with context propagation [2].
Btw, metric generation is not enabled in Tempo by default.
Metrics are implemented in the `opentelemetry_experimental` application. Last time I tried them, they were still a bit buggy but working (not complete, though).
> Can you share some of your experience, what do you mean by that? Are there edge cases causing problems, or major missing features? Easy or difficult to use?
Just the general problems you get with big, slow-moving OSS projects like this. Mostly docs that aren't current, and a massive delta between certain languages; a feature is `stable` for some languages but not others, which makes it hard to push for a consistent otel rollout in a mixed-language environment.
Some other "misc" points:
- Google how to do $thing and you might find the proposed spec which gives example code ... that isn't what actually got implemented. That's a different link further down on your google results.
- Python auto-instrumentation is ... fragile at best.
It's not super clear if instrumentation is supported only with well known frameworks or just ... in general.
I'd sure love some docs that explain how it works, too (a rough per-framework sketch follows this list).
- certain things require that the collector use gRPC, others work with gRPC or HTTP... and I only found this out after googling an obscure error and reading through a _very_ long GH issue thread.
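On the Python point above: the instrumentation libraries are per-framework packages that the auto-instrumentation agent tries to detect and wire up for you. Calling one directly looks roughly like this, assuming the opentelemetry-instrumentation-flask package and a made-up app:

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# The same hook the auto-instrumentation agent applies when it detects
# Flask; each request served by `app` gets a server span.
FlaskInstrumentor().instrument_app(app)

@app.route("/")
def index():
    return "hello"
```

For frameworks without such a package, you are back to manual spans; that distinction is what the docs leave unclear.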
The example presented seems to also log; they just annotated the logs with span data.
> What we ended up implementing was a little tee inside the o11y library. As well as sending events to Honeycomb, we also converted them to JSON, and wrote them to stdout. That way, after sending to stdout, we then pumped off to our standard log aggregation system. This way, we've got a fallback. If Honeycomb is not working, we can just see our logs normally. We could also send these off to S3 or some other long term storage system if we wanted.
I'd like to go a step further, and say that in addition to being worried about honeycomb being down, sometimes you just want to check with kubectl to get an idea what is going on.
Our current projects are very log light because of the heavy tracing instrumentation, but it'd be nice to integrate this with the otel paradigms as they were originally intended
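For what it's worth, the tee described in the quote falls out of the SDK's processor model; a minimal Python sketch, assuming the OTLP gRPC exporter package and a made-up collector endpoint:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

provider = TracerProvider()

# Primary path: ship spans to the vendor/collector over OTLP.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))  # assumed endpoint
)

# Fallback "tee": also dump each span as JSON on stdout so normal log
# aggregation (or kubectl logs) still has something to show.
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

trace.set_tracer_provider(provider)
```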
Agree, there being an open standard for instrumentation is a big win. Lots of work still needs to be done on showing more examples and making it more accessible to users & implementors.
One other key area is resources, which can help engineers/implementors get organizational buy-in.
Meh, I work in metrics observability and there's very little support for otel. Most new open source products are still based on Prometheus, which has much better SDK support than otel.
I think it's a mistake for Otel to do its own thing instead of just building on top of Prometheus.
I don’t agree with the communication patterns of either Prometheus or OpenTelemetry, but I’ll pick Prometheus next time I have to do telemetry. Unless there’s some fork of StatsD with tags that makes a resurgence.
But OTLP is still hot garbage right now. If you send otlp to Prometheus it might get there, or it might all end up being dropped by a parsing error, because otelcollector is dumber than a box of hammers that have been through a rock tumbler.
OpenTelemetry is a great concept, but in my experience not quite there yet. Docs especially fall into the common trap of handling the happy path hello world quickstarts, then become increasingly useless as you want to get beyond that to real life use cases. Given the inherent tradeoff of complexity that comes from trying to unify different approaches around one standard, sometimes it seems like things that should be simple are more difficult than they should be. I'm sure it will keep improving.
> Docs especially fall into the common trap of handling the happy path hello world quickstarts, then become increasingly useless as you want to get beyond that to real life use cases.
Yeah, Java is what I'm most familiar with. The "Getting Started" shows how to do some basic manual instrumentation and collect the output with curl. Then the "Next Steps" are just random things with no guidance about why I would or wouldn't choose any of them for my next step.
But, ok, I choose "Automatic Instrumentation", that sounds promising. And it actually is really easy to set up auto instrumentation. But then at the end it says
> After you have automatic instrumentation configured for your app or service, you might want to annotate selected methods or add manual instrumentation to collect custom telemetry data.
Uh... no... after I have automatic instrumentation enabled I want to do something with the output
The two major flaws in the docs seem to be
1. The common failure of docs to explain to users why they might choose one thing or another. "If you want to do x.. If you want to do y.." what if I don't know?
2. Because otel is agnostic to the consumer of the output, there's very little in the way of explaining how to get value out of what otel produces. To connect the dots, you really need to use the docs of your observability tool. Which I understand, but then most of them have their own setup directions because they want some extra fields included in the data, or they have their own fork, so not everything in the otel docs is actually usable.
I'm not sure what the answer is. It's not like I expect otel to document how to build a dashboard in Grafana. And a lot of frustration I've experienced has been with the observability tools themselves. But at the same time, I always feel like the otel docs just don't get you anywhere close to getting value out of the library. Which is a shame, because turning on auto-instrumentation and seeing all your traces with literally no extra work is a magical moment.
> 1. The common failure of docs to explain to users why they might choose one thing or another. "If you want to do x.. If you want to do y.." what if I don't know?
Observability docs in general struggle with this. So many data sources can emit so many types of metrics in so many formats, and every tool makes this impossible promise of consolidating it all into one space seamlessly. But tools like Grafana pride themselves so much on visualizing _anything_ that they paint themselves into a corner where they can't be prescriptive about common uses or methods without excluding or confusing others.
So a lot of the prescriptive answers to "what if I don't know?" gets chucked onto account and support teams of commercial vendors, because the docs can't anticipate every possible context in which an observability tool will get deployed. Each solution ends up being custom tailored and poorly portable to anyone else's, often not even to other customers with the same data sources and goals at the same scale due to wacky labelling differences or legacy requirements or some internal stakeholder demand.
More narrowly focused tools don't have as many of these problems, but not many organizations want narrowly focused observability tools. (Lots of _people_ do, but orgs don't want to pay out deals to multiple vendors for what looks like different flavors of the same result. And hey look it's Grafana Cloud or Datadog or whatever, it can do _anything_, so you devs and also bizops and SRE and IT and hey sales wants a dashboard too and so does the company cafeteria, why not, you all can just use this one tool and we just deal with one bill with a volume discount, right? Right??)
Smarter tools avoid some of these problems by papering over the docs' limitations: they're better able to anticipate or surface connections between data sources, metrics, logs, traces, events, etc., and they do so with better interfaces. But especially for high-cardinality data, the usability of those tools either seems to fall apart or their companies charge Datadog-sized invoices.
Are there narrowly focused tools in the observability space even?
I was shopping for one after being outside of this field for a while, and they all do the 101 features and the kitchen sink model, which adds onto the complexity. DataDog, Grafana, but also the open source ones like SigNoz itself.
Ages ago it was all about metrics, today it's metrics traces logs APM alerting exceptions and a dozen other acronyms, on top of the protocols (statsd, Prometheus, OpenTelemetry), paired with crazy complicated yet unwieldy graph building UIs. Let's not even talk about pricing models. The entire business model is based around having one more checkmark in the feature list than the competition. The wire format (OpenTelemetry) has never been the pain point in this space.
For a moment, I seriously considered just going back to the 2000s and using RRDtool.
Most new observability tools start narrow but every economic incentive is to expand. Which makes sense, really, because most production systems people have are complicated as all hell and have tons of different needs. Some tools are better than others at containing the chaos -- I will humbly submit that the one I work for, Honeycomb, is one of the best at doing this -- but support for several telemetry signals, visualization tools, alerting systems, dashboarding systems, etc. are all what people eventually ask for as they roll out observability to more of their production systems.
Put differently, when you have sufficient observability of your entire system, you now have a complete abstraction of that system represented in some other UI and data streams. There's just no way out of the fact that for larger systems, this will be complicated, and the tools that can represent this reality must also be complex.
Hmmm... Yeah, I set up OpenTelemetry for a couple of personal projects this year and was pleased with the ease of setup, but by and large I knew what I was doing: specifically, I had my application, I had Grafana, and I wanted to get traces from A to B.
Relooking at the docs through the eyes of a newcomer, if you don't already have a destination in mind they don't really help you. It's a little tricky because my setup with Grafana will be somewhat different (but similar) from someone using Honeycomb or SigNoz or what have you, but even just having a "want to visualize your data? Check out the list of compatible vendors" link pointing that direction would probably go a long way.
By comparison, I wanted to use opentelemetry for a series of projects, but could find absolutely no useful documentation on how to do anything else other than "send data from a webapp to a server / other cloud service that some vendor wants to sell you".
All I wanted to do was instrument an application and write its telemetry data to a file in a standard way, and have some story regarding combining metrics, traces, and logs as necessary. Ideally this would use minimal system resources when idle. That's it.
It doesn't read from files unfortunately, but https://openobserve.ai/ is very easy to set up locally (single binary) and send otel logs/metrics/traces to.
Also linked from that README is an Ansible playbook to start OpenObserve as a system service on a Linux VM.
Alternatively, see the shovel codebase I linked above for a "stdout" TracerProvider. You could do something like that to save to a file, and then use a tool to prettify the JSON. I have a small script to format json logs at https://github.com/bbkane/dotfiles/blob/2df9af5a9bbb40f2e101...
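A minimal sketch of the "write spans to a file" variant with the Python SDK; the path is made up, and the formatter keeps each finished span on one JSON line:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Point the "console" exporter at a file handle instead of stdout.
trace_file = open("traces.jsonl", "a")  # hypothetical path
exporter = ConsoleSpanExporter(
    out=trace_file,
    formatter=lambda span: span.to_json(indent=None) + "\n",
)

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("offline-demo")
with tracer.start_as_current_span("do-work"):
    pass
```

For the intermittent-connectivity case, a batching processor plus periodic file rotation would be closer to production-shaped, but this is the smallest thing that works.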
That's actually a neat little analysis platform, thanks!
Amusingly I can run my application, if I generate custom formatted .json and write it to a file, I can bulk ingest it... which is pretty much what I do now without the fancy visualization app. I think this speaks to my point that the OpenTelemetry part of the pipeline wouldn't be doing much of anything in this case. (The reason I care about files is that applications run in places where internet connectivity is intermittent, so generating and exporting telemetry from an application/process needs to be independent from the task of transferring the collected data to another host.)
For that use-case, you almost want the file to be rotated daily and just ... never sent ... at least until a customer has an issue, or you're investigating that hardware.
Maybe part of the issue is that all the vendors working on it usually have time limits for ingesting data into their backends (like timestamps must be no more than -18/+2h from submission time), so they don't really care about it.
The major tracing library in Rust suggests a consumer that prints to stdout, but it's at the end of the introductory documentation; https://docs.rs/tracing/latest/tracing/
EDIT: it's what I've used when bridging between "this is a CLI app for maybe 3 people" and "this will need to be monitored"
Time and again I ran into two or three examples in different docs, with search engines sending me to the nonfunctional or ambiguous ones; I'd complain about it and have someone send me to a whole other doc I'd never seen, which is 3 clicks away from the overview doc while the broken ones are 0-2 clicks away.
Give me something that isn't based on protobufs at the wire/request level. CBOR with CDDL would be a fully standards-based approach that can work at any size of the stack.
I used protobufs for a short while and came to the realization that they’re just Go’s opinionated idioms forced on other languages via awkward SDKs. Particularly did not like having to use codegen, rely on Google’s annotations for basic functionality or deal with field masks that are a sort of poor man’s GraphQL.
I get it, Google made trade offs that work for them, and I agree with their position - but for someone at a smaller company working in a non-Go/Java/C programming language it was just a ton of friction for no benefit.
Annotations are somewhat cursed, agreed. The code generation part does not have to be painful, though.
In fact, the tooling in Go isn't even an example of the easiest way to do it, and it requires more steps than, for example, .NET, where getting server boilerplate, a fully working client, or just generated POCO contracts from a .proto boils down to:
dotnet add package Grpc.Tools
<Protobuf Include="MyApiContracts.proto" /> (in .csproj)
So many issues, primarily the non-idiomatic (even in golang!) code generation and the requirement for special tooling to do any troubleshooting, and after all that you still don't end up with anything particularly interoperable.
At best it works tolerably in a monorepo with tightly controlled deployment scenarios and great tooling.
But if you don't have a Google-like operations environment, it's a lot of extra overhead for a mostly meaningless benefit.
The first issue is that protobufs aren't a standard. That inherently limits anything built on top of them to not be a standard either, and that limits their applicability.
Also, depending on the environment you run in, code size bloat vs. alternatives can matter.
You mean like an IETF standard? That is true, although the specification is quite simple to implement. It is certainly a de-facto standard, even if it hasn’t been standardized by the IETF or IEEE or ANSI or ECMA.
> inherently limits anything built on top of them to not be a standard either
I’ve had several projects that ran on wimpy Cortex M0 processors and printf() has generally taken more code space in flash than NanoPB. This is generally with the same device doing both encoding and decoding.
If you’re only encoding, the amount of code required to encode a given structure into a PB is very close to trivial. If I recall it can also be done in a streaming fashion so you don’t even need a RAM buffer necessarily to handle the encoded output.
Do I love protobufs? Not really. There’s often some issue with protoc when running it in a new environment. The APIs sometimes bother me, especially the callback structure in NanoPB. But it’s been a workhorse for probably 15 years now and as a straightforward TLV encoding it works pretty darned well.
Sounds like Stockholm Syndrome. I've worked mostly with JSON/CSV/Thrift in the last 10 years, and XML/SOAP before that, and just recently started interacting with protobuf, so I'd disagree that it is a "de-facto standard."
My largest complaint: observability. With almost literally any other protocol, if you can mitm on the wire, your human brain can parse it. You can just take a glance at it and see any issues. With grpc/pbuf ... nope. not happening.
Also, I really don't like how it tries to shim data into bitmasks. Going back to debugging two systems talking to each other, I'm not a computer. Needing special tooling just to figure out what two systems are saying to each other to shave a quarter of a packet is barely worth it, if at all, IMHO.
> You can just take a glance at it and see any issues. With grpc/pbuf ... nope. not happening.
Sure, but on the other hand, the number of times I’ve needed to do this, compared to JSON/string/untyped/etc. systems, is precisely zero. There's a whole class of failures that are just non-issues with typed setups like protobufs. Protobuf still has plenty of flaws and annoying-Google-isms, but not being human-readable isn't one of them IMO.
I haven’t worked with protobufs, but I’m old enough to know why people thought they needed protobufs (because hand writing terse wire protocols is painfully dumb).
Be careful about “need”. When people are avoiding doing something painful they invent all sorts of rationalizations to try to avoid cognitive dissonance. You don’t reach for the tool that hurts to pick up. You reach for something else, and most do it subconsciously.
Nobody is going to try to read protobuf data. Doesn’t mean they don’t need to understand why the wire protocol fucked up.
Nothing is preventing a system from sending you an un-deserializable message disguised as a protobuf, just like with any other encoding. In these cases, you need to diagnose the issue, no? Having something human-readable simply makes that straightforward.
If you haven't needed to do this, perhaps you aren't working on big enough systems? I've primarily needed to do this when dealing with hundreds of millions of disparate clients, not so much on smaller systems.
> Nothing is preventing a system from sending you an un-deserializable message disguised as a protobuf,
I guess it depends on where you come down on Postel’s law. If you’re an adherent, and are prepared to be flexible in what you accept, then yeah, you will have extra work on your hands.
Personally, I’m not a fan of Postel's law, and I’m camp “send correct data, that deserializes and upholds invariants, or get your request rejected”. I’ve played enough games with systems that weren’t strict enough about their boundaries and it’s just not worth the pain.
When you have hundreds of millions of clients, there’s a good chance the client thinks it’s sending you the right data (especially when the customer says everything looks right on their end). You need to figure out if there is packet corruption (misbehaving switch), an outdated client, an egress load balancer screwing with packet contents, an office proxy trying to be smart, etc.
This requires looking at what is going over the wire in a lot of cases. It has nothing to do with Postel’s Law, but more to telling a customer what is wrong and making your own software more resilient in the face of the world.
Implement a human readable protocol, then use standardized streaming compression on the wire to get your message size down. Something LZ family because there are tools everywhere that speak them. And consider turning off transport encoding for local development.
Being able to scan data saves so much time on triage. And using zgrep and friends on production data is almost as easy. You will spend tons of effort trying to make something 10% more efficient than zlib or for certain zstd, and the cost is externalized onto your team.
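A sketch of that approach under the obvious assumptions (newline-delimited JSON, gzip on disk so zgrep/zcat work out of the box; the payload is made up):

```python
import gzip
import json

# Write newline-delimited JSON through a standard streaming compressor.
event = {"name": "checkout", "duration_ms": 42}  # hypothetical payload
with gzip.open("telemetry.ndjson.gz", "at", encoding="utf-8") as f:
    f.write(json.dumps(event) + "\n")

# Triage stays plain-text tooling: zgrep/zcat on the box, or the same
# few lines in reverse here.
with gzip.open("telemetry.ndjson.gz", "rt", encoding="utf-8") as f:
    for line in f:
        print(json.loads(line)["name"])
```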
This is true with any non-self-describing format. Which includes the vast majority of JSON depending on who you ask - if you aren't specifying a schema in the request, what does `name` really mean?
Self-describing formats come with rather large costs over a compact format in essentially all cases; there are lots of good reasons to prefer the compact one, particularly in internal infrastructure, which is what telemetry tends to involve.
I think the issue is that the ergonomics of protobuf kinda suck and the ergonomics of gRPC really suck and having to interface with a gRPC API is likely everyone's introduction to protobufs.
Protobufs are a really great idea that's hampered by heinously subpar tooling for everything but Go.
As a relative outsider to the observability space, I have always wondered this:
Is observability/telemetry only about engineering-related issues (performance, downtimes, bottlenecks etc.) or does it include the "phone-home" type of telemetry (user usage statistics, user journeys)? Looking through the websites of most of the observability SaaSes it seems to only talk about the first. Then how do people solve the second? Is it with manual logging to the server from the client?
I think usage statistics tend to require more retention time to discover user behavior and understand how to optimize revenue. In the general case people probably won't care much whether their widget was running at X% CPU on Dec 5th 2019 but they might care more about what percent of users did Y action on that date. When I worked on an observability team (not as an expert but as a general swe) we had two metrics pipelines; one was strictly usage statistics which came from the client, the other was purely server metrics but a subset of them were considered usage metrics which were aggregated and sent down with the client metrics for the folks upstairs.
It is sometimes the second. Apollo (the GraphQL one) uses OpenTelemetry for tracing and monitoring reasons but also for usage tracking. When was a field last used, what frequency is it included in queries, etc.
I would think anyone trying to put an OTEL source on an embedded device (IoT) was out of his goddamned mind. OTEL assumes data sources have ample hardware and particularly memory. It periodically summarizes all traffic since startup instead of streaming things as they happen.