
The biggest issue with OpenTelemetry is how aggressively it's being pushed despite not being mature enough. The AWS X-Ray team frequently suggests switching to OTel on bug reports and feature requests, but the performance and resource overhead of the OTel Collector for Lambda is just awful right now. It doesn't make sense for any performance-sensitive workload.

Beyond that, it gives off an "over-engineered" vibe. It probably isn't, and being a unified standard that works across so many different variations inherently requires a lot of abstraction, but it feels so much more difficult to work through OpenTelemetry than an opinionated observability SaaS.




I agree that it gives off an over-engineered vibe. I think part of that is that a lot of the "Getting Started" docs don't give you a feel for how to use the framework. It's more along the lines of "install this package and create this esoteric config file and this very particular telemetry logging will work". That's not a particularly useful walkthrough unless I have that exact same use case and want nothing more.


OTEL is a great concept with lots of potential. But in the project's current state, just reading about it makes me dizzy.

There are too many layers of indirection across the documentation, the specification, instrumentation libraries, collectors, protocols, backends, GitHub issues, enhancement proposals, distributions, API vs. SDK. Try to file an issue and you will be bounced between at least three of them.

I get it, the idea is to be vendor agnostic. But the central hub needs to be a lot more refined. In particular, the language instrumentation APIs don't need to pretend they are independent of the central project.


> performance and resource overhead of OTel collector for Lambda is just awful right now

Presumably this means it's costly, which would be a reason for them to recommend it.


I mean, sure, you can improve performance a bit by increasing the RAM/compute capacity on the Lambda. But it always adds a pretty steep overhead right now, no matter how much capacity you throw at it.

https://github.com/open-telemetry/opentelemetry-lambda/issue...

https://github.com/aws-observability/aws-otel-lambda/issues/...


I resent that you are right.


Depending on the language, it definitely has some "over-engineering" issues. I love the API/SDK split and I think folks undervalue that.

I dislike: 1. Context/Scope: super complicated, with too much abstraction/generalization. 2. Span->activate returns a scope that needs to be deactivated manually, which is different from ending the Span.

It gets very complicated, and I assume it exists to support folks swapping parts of the SDK out within a scope down the call stack. I'm curious how much that is actually used and whether it justifies its conceptual weight.
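To make that split concrete: here's roughly how it reads through the JS API (just a sketch using @opentelemetry/api; in Java/PHP the "current" mechanism is an explicit Scope object you have to detach). Making the span current and ending the span are two separate lifecycles you have to juggle.

    import { trace, context } from "@opentelemetry/api";

    const tracer = trace.getTracer("example-tracer"); // tracer name is just for illustration
    const span = tracer.startSpan("do-work");

    // Lifecycle 1: making the span "current" for the code inside the callback.
    context.with(trace.setSpan(context.active(), span), () => {
      // anything here sees `span` as the active span
    });

    // Lifecycle 2: ending the span, a separate step you must not forget.
    span.end();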

Again, some of this is language-specific. In Java it's mostly fine. In PHP (and, I assume, other dynamic languages with a per-request environment) it's deep overkill.


This is less an issue with OTel itself than that AWS still doesn't natively support it. The X-Ray daemon is similar to the Collector, but it runs for free in the background, while the OTel Collector runs inside the user's Lambda resources.

It was fun having the freedom to work on OTel within X-Ray to provide better instrumentation for users, but it was always frustrating that whenever I pushed for more native support within the team, such as ingesting OTLP directly, the answer was always no since it meant losing control. Note that I don't think the actual reason is to overcharge users (otherwise why invest in things like Graviton?), though the result does end up being that.


Ours ends up in Prometheus and I wish I had just used that instead.

But one of our coworkers was gung ho about tracing, so I obliged, not knowing what I had gotten myself into. The moment I turned tracing on, it got quenched, because in a mature app the amount of tracing you want to do is about ten times the limit for messages per second per sender. It's a toy, and that functionality is now turned off, though the call-graph changes to support it are still there and make our stack traces ugly.


Tracing is pretty life-changing when it works well though. It's worth spending some engineering resources to try and get it if you can, even if it's at a lower sampled rate.


I'm definitely curious how data volume was a big concern for traces. I work with several customers with enormous trace volume, and while they absolutely need to do tail sampling so their bill doesn't explode, the actual export of spans from their services isn't typically the problem to solve.


We don't have a monorepo so I can't tell you how many lines of code we have anymore, but if it's less than 300k lines I'll eat my hat. It's gotta be half a million. It's a beast. No insane fanout, but unless I very carefully cherry-picked a couple of points in the code instead of piggybacking on our already pretty good telemetry choke points, I'd hit the limit even in preproduction.

We don’t historically do a lot of things well, but everything does correlationID propagation properly and some have pretty good telemetry, and have done since before I got here. Most other things in the “well” column were heavy lifting by myself and a handful of other instigators, some of whom gave up and left.

That’s probably what upsets me so much about OTEL. It made one of our strengths into another thing to complain about.


I get “designed by Java developers” vibes, which is a kind of overengineering to be sure, but a particularly nasty one, and one none of us should have to deal with in 2024.

We moved a project from statsd to OTel last year, and I really wish we had spent that time on something else. I really wish I had gotten to spend my time on something else. Statsd makes dealing with stats the aggregator's problem, so application lifetime (you mention Lambda) is not an issue. Applications can fire and forget, and at a much higher data rate than we achieved with OTEL.
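For contrast, the whole statsd contract is basically one UDP datagram per measurement; something like this sketch (the metric name is made up):

    import dgram from "node:dgram";

    const sock = dgram.createSocket("udp4");

    // Fire-and-forget: no connection, no ack, no in-process aggregation.
    // Rollups are the statsd server's problem, so a short-lived process
    // (a Lambda invocation, say) can emit and exit without flushing anything.
    sock.send("checkout.requests:1|c", 8125, "localhost");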

The main feature of OTEL is the tagging, but Amazon charges you for using it. So much so that we had to impoverish our tags until they provided only a small factor of improvement in observability that was ultimately not worth the cost of migration. If I had straight ported, we would have generated about 40x as much traffic as we got to in the end by cutting corners.

OTEL is actively hostile to programming languages with a global interpreter lock. Languages where you run processes proportional to the number of cores on the machine need to tag each process separately; if you don't, the stats interfere with each other. And that brings us back to pricing by tag, because now you have 16-128x as many combinations of tags, since each process needs a unique tag per machine.
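Concretely, every worker ends up needing its own instance attribute on its resource, something like this sketch (the service name is hypothetical, and how you attach the attributes depends on your SDK version):

    import os from "node:os";

    // One attribute that is unique per worker process. Without it, N workers on
    // one box all report as the "same" series and clobber each other; with it,
    // every tag combination you pay for is multiplied by (workers per machine).
    const perProcessResourceAttributes = {
      "service.name": "checkout",                               // hypothetical service
      "service.instance.id": `${os.hostname()}-${process.pid}`, // e.g. "web-3-48151"
    };
    // ...passed as the resource when the SDK / MeterProvider is constructed.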

We ended up having to put our own aggregator sidecar in place that could merge the counters and histograms from the same machine. If you cross your eyes, it looks just like one of the statsd forks that adds tags, which would have been so much easier for us to do in the first place.
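The guts of that sidecar are basically this (a simplified sketch of the idea, not our actual code): strip the per-process tag, re-key on what's left, and sum.

    type Counter = { name: string; attrs: Record<string, string>; value: number };

    // Merge counters reported by all local workers into one series per attribute
    // set, dropping the per-process attribute so the backend sees a single writer.
    // (Histograms are the same idea, but you merge bucket counts element-wise.)
    function mergeLocal(counters: Counter[]): Counter[] {
      const merged = new Map<string, Counter>();
      for (const c of counters) {
        const { "service.instance.id": _perProcess, ...attrs } = c.attrs;
        const key = c.name + JSON.stringify(Object.entries(attrs).sort());
        const existing = merged.get(key);
        if (existing) existing.value += c.value;
        else merged.set(key, { name: c.name, attrs, value: c.value });
      }
      return [...merged.values()];
    }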

And each process remembers every stat it has ever seen, otherwise it will self clobber. So again, 16-128x as much memory for bookkeeping unless you send them to a bookkeeping process.

They had weird memory leaks that only got sorted out in the summer. There’s a workaround, but it nearly caused a production outage for us and we violated our SLAs, which costs us money.

We also had total stats loss because the JavaScript implementation is TypeScript, and it did not assert that numeric inputs were numbers rather than numeric strings. That led to number + string arithmetic bugs, which led to giant numbers that OTEL choked on and dropped.
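The failure mode, roughly (an illustrative sketch, not the actual OTel JS code):

    // A counter that trusts its caller to have passed a number.
    function add(current: number, delta: number): number {
      return current + delta;
    }

    // A caller that pulled the value out of JSON without parsing it, so the
    // compile-time types lie and nothing checks at runtime:
    const fromPayload: any = JSON.parse('{"requests": "100"}').requests;
    console.log(add(fromPayload, 5)); // "1005" -- string concatenation, and the
                                      // "number" snowballs until the exporter
                                      // chokes on it and drops the stats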

It also took a lot of work to get our sidecar to consume input from 30+ processes without dropping any. Most of our boxes run about cpucount + 1-2 processes. We don't have an obscene amount of telemetry, but it's a mature app with years of "we need to track X" conversations. It wasn't until September that I was confident we could ingest from 64 cores at once, and I have no idea how we'd handle 128. Because, again, OTEL does not like more than one process that thinks it's the "same" app, so you have to tag or centralize to disambiguate.

And the thing I hate the most about OpenTelemetry: it has One Bad Apple Syndrome. Because the stats are accumulated and sent in bulk, if it does not like one value in the update, it drops the entire message. One poison pill stat causes 100% loss of telemetry from that machine. That is a stupid fucking design decision and I want the author of that particular level of hell to feel my anger about this jackass decision with every fiber of my being. Fuck you, sir. You have no business working in standards track software. Get out.
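In effect the export path behaves like this (a sketch of the behavior we observed, not the real implementation):

    type DataPoint = { name: string; value: number };

    // Stats are accumulated and shipped as one payload per interval.
    function acceptBatch(batch: DataPoint[]): DataPoint[] | null {
      // One poison-pill value anywhere in the payload...
      if (batch.some((p) => !Number.isFinite(p.value))) {
        return null; // ...and the entire interval from that machine is dropped.
      }
      return batch;
    }
    // The per-record alternative: batch.filter((p) => Number.isFinite(p.value))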


> The main feature of OTEL is the tagging, but Amazon charges you for using it. So much so that we had to impoverish our tags until they provided only a small factor of improvement in observability that was ultimately not worth the cost of migration. If I had straight ported, we would have generated about 40x as much traffic as we got to in the end by cutting corners.

This bit is really unfortunate. Hopefully not too unhelpful, but there are several other vendors out there that don't limit you like this. Other tradeoffs to be sure (no such thing as a perfect o11y tool), but I think most dedicated observability backends have moved away from this kind of pricing structure.


We are also collecting on 30-second intervals instead of 10, which makes evidence of incidents a little more ambiguous. Not impossible to read, just slightly more stressful when handling a production incident. Like that guy you don't hate, but are bummed showed up to your party.


The way billing is done in observability offerings everywhere is so frustrating. CloudWatch metrics is such a footgun that I'm terrified when any engineer on our team submits a PR that includes them. It's so easy to accidentally 100-1000x your bill.

I feel like you're always left to choose between obscene pricing models and AbstractSingletonProxyFactoryBeanProvider level "enterprise" configuration.


Hmmm, I wouldn't say this is everywhere. There are tools that aren't metrics-based and don't charge based on attributes and/or their cardinality. I'm avoiding shilling the place I work for, since some of the messaging around "observability 1.0 vs. 2.0" comes from us, but that's kind of the gist of it.

If an attribute is important for debugging something later, you shouldn't have to pay 100x the cost to be able to use it. Unfortunately, when you're using a "1.0"-type tool such as CloudWatch or DD Metrics, you end up having to guess the economic cost of the data you include and weigh it against its perceived economic value later down the line, which is a terrible experience.

(Switching observability tools is no joke though, so I won't say "just switch tools!" -- if the current pains are high it may be worth it, but there's no simple way to switch that I've seen)


Neither the collector nor the protocol cares that you have duplicate entries (from different processes). It's your telemetry/storage backend that drops the data; we have the same issue with Datadog, fuck them for silently dropping data.


I'm pretty sure the OTel Collector is where it's getting dropped. It sees data with the same tags coming from a few dozen processes, doesn't know how to aggregate them, and so just does last-writer-wins about once a second. Makes for some very peculiar rate charts.


Might depend on your exporter, but for all the ones I cared about, I was inspecting the network traffic and it definitely sent everything.


It wasn't dropping or not dropping that was the problem, it was clobbering.

If one connection said it had seen 200 events and another said 190, it ping-ponged between them instead of deciding 200 + 190 = 390. What keeps them from clobbering is distinct tags per connection. If you're running Ruby or Python or Node, that's one connection per thread, and that's ridiculously expensive.
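A toy version of the two behaviors:

    // Two workers report cumulative counts for what looks like the same series.
    const reports = [
      { key: "checkout.requests{region=us-east-1}", value: 200 }, // worker A
      { key: "checkout.requests{region=us-east-1}", value: 190 }, // worker B
    ];

    // Last-writer-wins: the stored value just ping-pongs between 200 and 190.
    const clobbered = new Map<string, number>();
    for (const r of reports) clobbered.set(r.key, r.value); // ends at 190

    // What you actually want is 200 + 190 = 390, which requires either summing
    // by some per-worker identity or giving each connection distinct tags.
    const summed = new Map<string, number>();
    for (const r of reports) summed.set(r.key, (summed.get(r.key) ?? 0) + r.value); // 390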





