Grafana Tempo, a scalable distributed tracing system (grafana.com)
233 points by Sodman on Oct 28, 2020 | 56 comments



I really wish Grafana would spend more time on fixing the 2.7k bug reports that are open for their core product...


I disagree with the idea that they can't work on both, but I have to agree that this is my #1 wish for Grafana. The frontend is filled with an incredible number of UX paper cuts, performance issues, and silent breakage that make me resent having to edit dashboards at all. It's a perpetual state of being scared my next click will trigger "a script is not responding" or randomly break something I won't know how to fix.

It feels like Grafana Labs is stepping into the same "if it doesn't help sales tick checkboxes, it's not getting development time" trap so many B2B companies hit as they grow, and it's a shame. I wish there was something else I could switch to.


Yeah, I too would prefer that they fixed their bugs and focused on the UX for the dashboards. There are a lot of tickets in the backlog on GitHub that have been there for years and aren't being prioritized... I can't imagine wanting to associate distributed tracing with Grafana, which is a data visualizer...


Which version are you using? And if you could tell us which datasources, that would help us as well. Narrowing this down helps us prioritise bugs.


I/We are using the latest 7.3.0 with Prometheus; however, most of the issues are more general. I think they'd be hard to prioritize as they are so minor (outside of perhaps a "minor bugs only" sprint or similar, as some projects do). But they do add up very quickly.

To illustrate, here are some of the paper cuts that instantly come to mind: mod+s not working in panel editor, gradient fill incompatible with series overrides, various missing tooltips, missing query error reporting and syntax highlighting in variable editor, variable editor needing many clicks for most common actions, hidden variables can only be changed by repeatedly opening and closing the settings, some dashboard settings pages being easy to accidentally close without saving, g+[key] keyboard shortcuts randomly not working, random scroll jumps with multiple queries in a panel, switching between cursor modes with mod+o is super slow, Prometheus autocomplete not working when editing existing queries, clicking the dashboard name filters by folder while clicking the folder name shows all dashboards, adhoc filters silently don't work, etc.

The "not responding" one specifically happens a lot while editing queries. It'll randomly decide to evaluate the query while I'm typing, get thousands of results and lock up.


Oh man "random scroll jumps with multiple queries in a panel" that just hit me so hard, it drives me absolutely insane. I've now learned to stop using the mouse when navigating through the query text. Also annoying with the graph panel, when you apply a series override and want to change it later, you have to navigate the whole way to the setting again to change it. Makes you lose so much time just to try out some new colors for example.


Can you report an issue on GitHub for this, with steps to reproduce? We have no reports of this scroll jumping; it sounds super annoying and like something we want to fix ASAP.


Upgraded to 7.2.2 today and it hasn't happened since, so fingers crossed.


No way to set a minimum axis extents: https://github.com/grafana/grafana/issues/979

Open since 2014, hitting it on every other panel...


There are certainly some real bugs in there, but saying there are 2.7k "bug reports" is a little disingenuous. These are "GitHub Issues", many of which are feature requests, and many more are new users trying out this free open source software for the first time. Many of these "bugs" are from users asking why their setup isn't working, and then failing to produce any kind of technical debug info when asked by the maintainers.


There are 461 tagged with `type/bug` and from a quick eyeball they seem mostly legitimate issues (of varying severity).


PRs welcome!


You just raised $50 million....



Not sure why this is being downvoted... Hiring people to fix things is a lot more reasonable (and a good idea!) than asking for free labour when you have $50 million in the bank.


Bug fixes appreciated


Oh dear... this seems like a move in the exact opposite direction of what one would hope for. It's basically shoving structured data into what is at best semi-structured log data, storing all of it, and then using that for things a tracing system would be better suited to.

I can't help but feel they've learned the wrong lessons from their challenges with tracing.


The semi-structured nature of logs works to Tempo's advantage, because as developers we have the flexibility to log _anything_: high-cardinality values like cust-id, request latency, gobble-de-gook... the equivalent of span tags. Instead of indexing these as tags, we get advanced search features through a powerful query language landing in Loki (LogQLv2).


But the data starts out structured. It becomes semi-structured when you log it.

I'm telling you, from first hand experience, this does not end well.

There's no reason that your tracing system should not be indexing your tags in an engine that provides advanced search features through a powerful query language.


I agree, if anything the eventual goal should be to invert it. In applications I work on right now, trace tags contain the richest and best-described request metadata. Tags are indexed differently depending on their cardinality, and there is no cardinality limit.

Tempo's implementation seems pragmatic as a short to medium term solution though. Log engines still have a lot more investment and maturity than trace engines. In my work, even though the trace tags contain the best data quality, the tracing system is currently worse at answering a good deal of my questions. It's simply that Splunk has many tools that work well, and the tracing system is behind.


But Jaeger, as an example, will let you choose what back-end engine you want to store your traces in. There is no need to reinvent the wheel just for tracing. You can just leverage what is already out there.


Slightly off-topic, but how are users of Grafana and other monitoring tools justifying the investment? And I'm not necessarily talking about the monetary amount; it's also the people cost.

What features are you looking for and how do you rank them?


We run Prometheus with 60d retention (2TB), and an Elasticsearch cluster for logs (also 60d, about 30TB cluster) with Kibana on top.

Prometheus is by far the cheapest, both for infrastructure spend and for human cost. The only time we spend really working on Prometheus is configuring our service discovery, which is the shipping component you’d have with any monitoring tool. I estimate we spend about 1 developer day a month on Prometheus upkeep.

Logging is much more fiddly, and required a large up-front investment. Now that it’s running, though, we get a lot of value from our setup. It costs much more, though: think about $10k/month in infrastructure spend and about 3 developer days a month to maintain.

Grafana is effortless to run, really dead simple. You can consider this to be ‘free’, both for infra and maintenance.

Hope that gives you a sense of this? I think we come out equivalent to managed services for cost when you account for infra and human time, but have far more flexibility in how we use the tools and develop the skills to properly leverage each product along the way.


Same setup here.

My company uses ELK, and I personally really dislike it. It's okay when it works — filtering logs with queries is the primary use case, and it's decent at it when it's not returning 503s. But it's also based on Elasticsearch, inheriting all its warts. I really wish someone would build a better competitor. Loki/Grafana isn't anywhere close yet.

One of my pet peeves with ELK is how the indexer assumes all log entries have the same schema. So if one app has the "error" field as a string, and another logs it as an object, then ELK will reject the second one. It boggles my mind that someone thought this was a good technical design. It could easily suffix internal field names with their schema (type, analyzer, etc.) and then unify the fields at the UI level. But no, instead it discards log data.
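
To make the mapping clash concrete, here's a tiny hypothetical illustration (app names and fields made up) of two log shapes that can't coexist in one dynamically mapped index:

    package main

    import "fmt"

    func main() {
        // Two services logging JSON where the "error" field has different types.
        // With dynamic mapping, Elasticsearch fixes the type of "error" based on
        // whichever document arrives first; documents with the other shape are
        // then rejected with a mapping exception and the log entry is lost.
        fmt.Println(`{"app":"checkout","error":"connection refused"}`)               // "error" is a string
        fmt.Println(`{"app":"payments","error":{"code":500,"msg":"upstream down"}}`) // "error" is an object
    }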

Sluggish performance (ELK requires enormous machine resources for no particular reason, though the JVM is a major driver) and lack of support for tailing are two other pet peeves. (I have many more.)


Yep, as you point out, the idea of Elasticsearch having a uniform type for each field can get problematic.

That can be fixed by:

- logging to different indexes

- preprocessing your logs so the keys have their own schema prefix, as you mention

The way we've tackled this is to have an official company-wide logging schema. It's just a GitHub repo at gocardless/logging that has an exhaustive list of logging keys, with an explanation of what they should contain.

This has the benefit of encouraging consistent logging practices across many teams, as well as improving the chance your logs will get indexed correctly. If the field type doesn't match, we won't index that field, but it will appear in the _log field, where you can do a full-text search as a fallback if you really need to find your log.

It's not perfect though, and I still hate the dynamic type assignment.


We log probably from 30+ services, plus third party apps. Standardizing on a schema is to some extent possible (but hard to actually enforce and test; many of our services are written in JS/TS, not Go), but not for third-party apps.

Could one set up Filebeat or whatever to ingest into separate indexes based on some source label? Most of our logs are streamed from Kubernetes, so there is an application label we could use. Elasticsearch is pretty good at unioning a single query over multiple indexes.

But it'd have to be an automatic mapping — I wouldn't want the ops team to have to maintain a mapping between apps and indexes that could (would!) get out of sync.


Yes, you can do this using Filebeat or in Elasticsearch itself using an Ingest node script processor.

Filebeat: https://www.elastic.co/guide/en/beats/filebeat/current/elast... (You can incorporate variables in the destination index; variables may refer to labels and other fields.)

Ingest node: https://www.elastic.co/guide/en/elasticsearch/reference/curr... (search for "_index")


Thanks, great to see some numbers as well.

My initial draft of this stated that we are using Prometheus/Grafana + ELK across the company, although it's still being standardised and turned from individuals creating their own deployments into a properly managed, strategic system with SLAs/documentation.

We're looking at the feature lists of SaaS offerings like Datadog and Logz.io, but I think we're too big to get a sensible price. That said, I don't think we're big enough to justify something like Thanos, which turns Prometheus from a really simple application into a big time investment.

Out of interest, are you storing 2TB of Prometheus data in a single instance? Or spread across multiple?


We run about 10 Kubernetes clusters, and each cluster comes with our company's 'kernel'. That kernel includes an HA Prometheus deployment of two replicas and all the monitoring components that go with it.

The 2TB of data is what we have stored across all those Prometheus instances. We use Thanos as the entry point that Grafana speaks to, so you get aggregated results.

Thanos as a querier is very simple to set up, and is very low maintenance. We have intended to configure long-term storage using the GCS backend for years now, but sadly this project always ended up losing to other (genuinely!) more impactful work.

We hope to do this within the next 6 months though, and reckon the project will take about 2 weeks of our team's time.

For the monitoring angle then, I can recommend Prometheus and Thanos as very easy systems to configure. Even for a small team with no prior experience, you'll probably have a good time.

The one to watch out for is Elasticsearch, as that is a fundamentally more complex system in which you plan to store much more data. Loki looks much easier to set up and benefits from the Grafana ecosystem integration, if you're looking for a shorter/cheaper/less featureful option.


Interesting that you find Thanos simple to set up and low maintenance. One vendor we spoke to suggested that their other customers found Thanos complex. I'll make a note to keep it in consideration.

Your comment on standardising the logging schema is a great idea as well. I'll circulate that around.

Thanks for your help!


This also mirrors my experience. Both Prometheus & Grafana require very little effort. We also have a similar scale to you.

I think ELK is the most challenging bit as you outlined yourself but is still very doable.


I use a Telegraf, InfluxDB, Grafana stack. I find the cost in maintenance and initial setup quite low. Telegraf is super easy: just uncomment the things you want it to collect in the config file. InfluxDB just needed the correct users and a database; I've never touched it since. Grafana can be a time sink if you want to bikeshed your dashboards, but there are a lot of pre-made ones that handle common use cases.

It's really nice to have some metrics when, for instance, a service goes down. It's super easy to spot an OOM situation or other vertical scaling issues.


Thanks. I completely agree about the bikeshedding. We're playing with Prometheus/Grafana + ELK, and with the ability to visualise all this data, it's hard to work out what is useful and what is just fun to play with.

I'm now wondering where the line is between necessary monitoring and bikeshedding. I could look at introducing distributed tracing, but will it actually add any meaningful value?


I run Prometheus + Grafana on a handful of machines we manage, with different software and setups. The cost of setting up a Prometheus instance and monitoring something is not too high, but things around it depend on your actual setup:

- Service discovery can get tricky. In our case we manage the physical machines ourselves, so at the beginning we just had to write down static configs, although by now we have automatic discovery (it took a few days of development).

- Depending on what you want to monitor, you might need to write your own exporters for Prometheus (a minimal hand-rolled exporter sketch follows this list). However, mtail [1] has been really useful for creating metrics from logs without too much work. In any case, you'll have to put time into deploying and configuring those exporters.

- Dashboards and alerts. There are dashboards for a lot of exporters, and there are collections of alerts too [2], but you will need to put time and effort into modifying/creating dashboards and writing alerts. However, it's a productive effort, because it helps you build a better understanding of which metrics are important and how they relate to the workings of the software you use. Also, PromQL is a pretty nice query language for the purposes of Prometheus.

- Notification integrations. In my case we had to put in some time to properly configure a Microsoft Teams integration and a deadman-switch channel, but in most cases it will be pretty straightforward.
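
To make the exporter point from the list above concrete, here's a minimal sketch of a hand-rolled exporter using the official Go client library (the metric name and port are hypothetical):

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // jobsProcessed is a made-up example metric; expose whatever your
    // software actually needs to report.
    var jobsProcessed = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "myapp_jobs_processed_total",
            Help: "Number of jobs processed, partitioned by outcome.",
        },
        []string{"outcome"},
    )

    func main() {
        prometheus.MustRegister(jobsProcessed)
        jobsProcessed.WithLabelValues("success").Inc()

        // Prometheus scrapes this endpoint; point a scrape config at :9100/metrics.
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9100", nil))
    }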

All in all, you'll need to invest some time in the integrations, but those are things that you need to do in any case. Prometheus itself is pretty easy to set up and maintain, and doesn't stand in your way. No tweaks, no undocumented settings, no bugs. I'm pretty happy in that regard; once you get it running you don't have to worry about it. Storage usage is pretty low even with a high number of exporters and metrics per node, maybe around 10GB for 60 days of data for a single node? (I'm not sure, because Prometheus does some compression and it's not exactly linear with time or the number of nodes.)

And that relatively low investment pays off quickly. The machines we manage use various tools and programs to deal with quite a lot of data at high bandwidths, so performance problems and bugs can be difficult to debug. The Prometheus + Grafana setup has made debugging issues and performance problems several times easier, the alerting system helps us prevent outages, and we have even discovered issues that were unknown to us. For me, the moment you manage machines with even just a little bit of complexity in the software or setup, it's already worth it to look into monitoring.

1: https://github.com/google/mtail
2: https://awesome-prometheus-alerts.grep.to/


Ingestion/storage of the volume is one thing, but what about the additional overhead for every request?


Good question! It depends on the instrumentation libraries for sure, but in our Go services we've been using the Jaeger protocol for our traces, "sampling" at 100%, with negligible impact on request/response times.

We spit out some 30k+ spans per second, FWIW. :-)

Edit: Disclaimer, we're not using Tempo.
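
For reference, here's a minimal sketch (not our production code) of what 100% "const" sampling looks like with the jaeger-client-go configuration API; the service name and agent address are hypothetical:

    package main

    import (
        "io"
        "log"

        "github.com/opentracing/opentracing-go"
        "github.com/uber/jaeger-client-go/config"
    )

    // initTracer wires up a Jaeger tracer that samples every trace.
    func initTracer() (io.Closer, error) {
        cfg := config.Configuration{
            ServiceName: "my-service", // hypothetical
            Sampler: &config.SamplerConfig{
                Type:  "const", // constant decision for every trace...
                Param: 1,       // ...and that decision is "sample it" (100%)
            },
            Reporter: &config.ReporterConfig{
                LocalAgentHostPort: "jaeger-agent:6831", // hypothetical agent address
            },
        }
        tracer, closer, err := cfg.NewTracer()
        if err != nil {
            return nil, err
        }
        opentracing.SetGlobalTracer(tracer)
        return closer, nil
    }

    func main() {
        closer, err := initTracer()
        if err != nil {
            log.Fatalf("init tracer: %v", err)
        }
        defer closer.Close()
        // Start spans via opentracing.GlobalTracer() in your handlers.
    }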


We are ingesting 170k+ spans per second with Tempo. That is 100% of our read/query path.

Disclaimer: I am using Tempo :) (and from Grafana)


Even with batching from Tempo, wouldn't that cost many thousands per month in S3 PUT costs alone?


We batch up traces in a block and write a block at a time. Internally we are currently configured to write 100k traces in one batch.
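
Not Tempo's actual code, but a rough sketch of the general pattern: buffer serialized traces in memory and hand a full block off to object storage asynchronously, so one PUT covers many traces:

    package main

    import (
        "bytes"
        "context"
        "log"
        "sync"
    )

    // BlockWriter accumulates serialized traces and flushes them as a single
    // object (e.g. one S3 PUT) once maxTraces have been buffered. Illustrative only.
    type BlockWriter struct {
        mu        sync.Mutex
        buf       [][]byte
        maxTraces int
        flush     func(ctx context.Context, block []byte) error // e.g. an S3 PUT
    }

    // Append adds one trace and, when the block is full, flushes it in a
    // goroutine so callers never wait on object storage.
    func (w *BlockWriter) Append(trace []byte) {
        w.mu.Lock()
        w.buf = append(w.buf, trace)
        var block [][]byte
        if len(w.buf) >= w.maxTraces {
            block, w.buf = w.buf, nil
        }
        w.mu.Unlock()

        if block == nil {
            return
        }
        go func() {
            var b bytes.Buffer
            for _, t := range block {
                b.Write(t)
            }
            if err := w.flush(context.Background(), b.Bytes()); err != nil {
                log.Printf("block flush failed: %v", err)
            }
        }()
    }

    func main() {
        w := &BlockWriter{
            maxTraces: 100000, // matches the "100k traces per block" figure above
            flush: func(ctx context.Context, block []byte) error {
                log.Printf("would PUT a %d-byte block to object storage", len(block))
                return nil
            },
        }
        w.Append([]byte(`{"traceID":"abc123"}`)) // hypothetical serialized trace
    }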


Doesn't this cause explosive memory usage? What happens if there's some congestion? Is there a circuit breaker to start dumping (discarding) log entries past a certain limit?

I was testing Google Pub/Sub's Go client for publishing internal API event data for later ingest to BigQuery, and it turns out Pub/Sub publishing is not that much faster than writing directly to BigQuery. The buffer sizes we'd need to avoid adding latency to our APIs would have to be ridiculously high; the Pub/Sub client buffers and submits batches in the background (its default buffer size is 100MB!). I don't like the idea of having huge buffers that increase with the request rate.

Conversely, pushing the data to NATS more recently, without any buffering or batching, turned out to be fast enough not to add any latency. You have to be able to receive messages very fast on the consumer side (as NATS will start dropping messages if consumers can't keep up), but you can simply run a few big, horizontally autoscaled ingest processes that sit there ingesting as fast as they can, which never impacts API latency at all.
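
For what it's worth, here's a minimal sketch of that unbuffered publish path using the nats.go client (subject name and payload are hypothetical):

    package main

    import (
        "log"

        "github.com/nats-io/nats.go"
    )

    func main() {
        // nats.DefaultURL is nats://127.0.0.1:4222; use your own server URL.
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatalf("connect: %v", err)
        }
        defer nc.Close()

        // Publish is fire-and-forget: the event goes into the connection's
        // outgoing buffer and is flushed in the background, so the API request
        // path doesn't block waiting on downstream consumers.
        event := []byte(`{"endpoint":"/v1/orders","customer_id":"cust-123"}`)
        if err := nc.Publish("api.events", event); err != nil {
            log.Printf("publish failed: %v", err)
        }
    }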


S3 PUTs are $0.005 per 1,000 requests. If you're writing twice a second, that's roughly 5.2 million PUTs a month, which comes out to about $26/month.


Yeah, but the person I'm responding to is suggesting 170k+ spans per second, so how is twice per second relevant?


Thanks for the reply!

I think I must be missing something, but it seems like the big difference between Tempo and a traditional tracing system is the storage indexing & database (ES/C* vs object store, and indexing all fields vs key/value lookup by ID). I vaguely remember reading that latency even from EC2 -> S3 can be around 200-300ms. Wouldn't this cause the overhead to rise?

Feel free to point me to any documentation that clears this up!


Writes are batched up and committed asynchronously to S3 - this shouldn't add much, if any, latency to your services.


Is this a good fit for storing every network request we make (so that we can trace all requests made to an API on behalf of a specific customer for debugging) or is Loki better for that?


It depends on your needs. If a single log line is sufficient to capture all the information you need to do your debugging, then Loki is a great choice.

If you need to see the full request as it passed through your system, then distributed tracing/Tempo is a great fit!


Honeycomb also stores traces in S3 and supports searching them via AWS Lambda. I wonder if a model like that would be useful for Tempo, which doesn't seem to support search at all right now.

https://www.honeycomb.io/blog/secondary-storage-to-just-stor...


I'd love to see how this compares to Jaeger.


The headline would be:

Jaeger supports native search, but requires Elastic or Cassandra.

Tempo relies on discovery from logs/exemplars, but puts everything in object storage (s3/gcs).

Tempo is cheaper and easier to operate but lacks native search.


I'm hopeful that, with it effectively being a KV store pointed at a "simple" object store (S3, GCS, etc.), it will be dramatically simpler to manage than Jaeger, and much more performant.

Backing Jaeger with Elasticsearch or Cassandra was a nightmare. :-|


See some more details around the motivations and history here: https://gouthamve.dev/tempo-a-game-of-trade-offs/

It will be easier and cheaper to manage compared to Jaeger, but it doesn't yet have any built-in ability to search (which is one of the reasons Jaeger is expensive).


That’s our motivation and experience over the last 6 months running it internally at Grafana Labs. Much cheaper and easier to operate.

Not to bash Jaeger though; it's more powerful than Tempo in that it allows you to search for traces. Tempo is about integrating with Grafana, Loki and Prometheus for finding traces.


So if you're not running Loki, you're not going to be able to find traces using Tempo? We use Grafana and Prometheus, but not Loki.


Since Tempo is a k/v store that can retrieve traces given a traceID, we need either a metric system that can store traceIDs in exemplars OR any logging framework to log traceIDs that can be copied over to the Tempo Query UI.
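
As a rough illustration (not official Tempo docs), the logging side can be as simple as pulling the trace ID off the active span and writing it into the log line, so a log store like Loki becomes the "index" that Tempo itself doesn't provide. This sketch assumes jaeger-client-go and a registered global OpenTracing tracer; names are hypothetical:

    package main

    import (
        "log"
        "net/http"

        "github.com/opentracing/opentracing-go"
        "github.com/uber/jaeger-client-go"
    )

    func handler(w http.ResponseWriter, r *http.Request) {
        span, _ := opentracing.StartSpanFromContext(r.Context(), "handle-request")
        defer span.Finish()

        if sc, ok := span.Context().(jaeger.SpanContext); ok {
            // The trace ID logged here is what you copy into Grafana/Tempo
            // (or jump to via a Loki derived field) to pull up the full trace.
            log.Printf("handled request traceID=%s path=%s", sc.TraceID(), r.URL.Path)
        }
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }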


Jaeger has a local storage option (Badger) that is OK for "ephemeral" tracing. I use it for development environments and in production (where each app can have its own Jaeger instance). The data gets wiped periodically; Prometheus metrics are used for trending.


> Backing Jaeger with Elasticsearch or Cassandra was a nightmare.

Unless you're already running one of these, in which case deploying Jaeger is very easy (at least, that was my experience with our Elasticsearch backend).

> much more performant.

You expect an object store to be more performant than Elasticsearch?



