The Way(tm) that I was taught/experienced is as follows:
o Logs are there for ignoring. You need them precisely twice: once when you are developing, and once when the thing's gone to shit, but they are never verbose enough when you need them.
o Using logs to derive metrics is a fool's errand pushed by Splunk and the cloud equivalents (i.e. CloudWatch and the like). It's slow, inaccurate and horrendously expensive.
o Using logs for monitoring is a fool's errand. It's always too slow, and really, really fucking brittle.
o Metrics are king.
o Pull-model metrics are an anti-pattern.
o Graphite + Grafana is still actually quite good, although the time resolution isn't there.
o You need to RAID your metrics stores.
o We had a bunch of metrics servers in a RAID 1, which were then in a RAID 0 for performance, all behind load balancers and DNS CNAMEs with a really low TTL.
o CloudWatch metrics are utterly shite.
o CloudWatch is actually entirely shit.
o Tracing is great, and brilliant for performance monitoring.
o X-Ray from AWS is good, but only really for Lambdas.
o Tracing is fragile and doesn't really plug and play end to end, unless you have the engineering discipline to enforce "the one true" tracing system everywhere.
but what do you monitor?
http://widgetsandshit.com/teddziuba/2011/03/monitoring-theor... this is still canonical.
In short, everything should have a minimum set of graphs: CPU, memory, connections, upstream service response times, hits per second and query time.
You can then aggregate those metrics into a "service health" gauge, where you set a minimum level of service (i.e. no response time greater than 600ms, and no 5xx/4xx errors, or similar): red == the service isn't performing within spec, yellow == it's close to being outside spec, green == it's inside spec.
If you are running a monolith, then each subsection needs to have a gauge; for microservice people, every microservice gets one. You can aggregate all those gauges into "business services" to make a dashboard that even CEOs can understand.
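To make the gauge idea concrete, here is a rough sketch in Go; the thresholds, type names and the worst-of aggregation are illustrative, not a prescription:

    package health

    import "time"

    type Status int

    const (
        Green  Status = iota // inside spec
        Yellow               // close to being outside spec
        Red                  // outside spec
    )

    // Readings is whatever you roll up per service; the fields are examples.
    type Readings struct {
        P99Latency time.Duration // e.g. upstream response time, 99th percentile
        ErrorRate  float64       // fraction of responses that were 4xx/5xx
    }

    // Gauge applies a "600ms and no errors" spec, with a warning band below it.
    func Gauge(r Readings) Status {
        switch {
        case r.P99Latency > 600*time.Millisecond || r.ErrorRate > 0:
            return Red
        case r.P99Latency > 500*time.Millisecond: // within 100ms of the limit
            return Yellow
        default:
            return Green
        }
    }

    // A "business service" gauge is simply the worst of its component gauges.
    func Aggregate(parts ...Status) Status {
        worst := Green
        for _, s := range parts {
            if s > worst {
                worst = s
            }
        }
        return worst
    }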
I think it hasn't really been settled whether pushing or pulling metrics is the anti-pattern; which one is currently hot seems to change every 5 to 10 years.
There is no need to choose between push and pull, since both methods have their pros and cons [1]. Just use a monitoring system which supports both methods [2].
I use VM (VictoriaMetrics) both at home and at work for tens of millions of active time series and it's great.
It just runs, and the default config is sustainable, unlike some of the other solutions.
Pull vs. push mostly doesn't matter other than config; the real problem is the sheer number of metrics and series in the Prometheus ecosystem, and being able to handle them without OOMing down to zero availability.
And sadly, somehow the whole Prometheus/cloud ecosystem is built on the idea of pulling GET /metrics.
I personally also think it's an anti-pattern, yet that design is dominant.
Streaming telemetry via gRPC is a rarity.
Pulling/polling isn't suitable for high throughput (e.g. network flows, application profiling, any sub-second sampling frequency), but it's totally fine for 99% of observability use cases. In fact, I would argue pushed metrics are an anti-pattern for most environments, where the performance upsides are not worth the added complexity & reduced flexibility.
There is real value to the observability system being observable by default. It is so nice to be able to GET the /metrics endpoint from curl and see real-time metrics in a human-readable format.
Pull by default, and consciously upgrade to push if you need more throughput.
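For reference, the pull side really is just an HTTP handler; a minimal sketch with the official Go client library, where the metric name and port are made up:

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // A counter the scraper picks up on its next pull.
    var requests = promauto.NewCounter(prometheus.CounterOpts{
        Name: "myapp_requests_total",
        Help: "Requests handled by the hypothetical myapp service.",
    })

    func main() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            requests.Inc()
            w.Write([]byte("ok"))
        })
        // The human-readable text format mentioned above:
        //   curl http://localhost:8080/metrics
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }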
I think the issue for me is that "pull" requires me to open up lots of services/hosts/sidecars to allow inbound connections. That's a lot more things to monitor and test to see if they're broken.
Having a single DNS record that I can route based on location/traffic/load/uptime autonomously is, I think, super convenient.
For example, if I want to have a single metrics config for a global service I can say "server: metrics-host", and depending on the DNS search path it'll either get the test, regional or global metrics server (i.e. .local, .us-east-1 or *.company.tld).
However, for most people it's a single DNS record with a load balancer. When a host stops pushing metrics, you check the host's aliveness score and alert.
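The push side of that can be as small as a fire-and-forget UDP datagram to that single DNS name; a sketch using the statsd line format, where "metrics-host" and the metric name are placeholders:

    package main

    import (
        "fmt"
        "net"
    )

    // pushCounter emits one statsd-style counter increment and forgets about it.
    // The DNS search path decides whether "metrics-host" resolves to the local,
    // regional or global collector, exactly as described above.
    func pushCounter(name string, value int) error {
        conn, err := net.Dial("udp", "metrics-host:8125")
        if err != nil {
            return err
        }
        defer conn.Close()
        _, err = fmt.Fprintf(conn, "%s:%d|c", name, value)
        return err
    }

    func main() {
        _ = pushCounter("myapp.requests", 1) // best effort; drops are acceptable
    }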
I'd still argue that it's easier to scale pulling than it is a distributed push. It's kind of why Prometheus went that route in the first place.
Back in the days of Puppet or Nagios, which would take requests rather than pull, it was very common to hear about them cascading and causing huge denial-of-service issues, and even massive outages, for the simple fact that it's way harder to control thousands of servers sending out data on a schedule than a set of infrastructure designed to query them.
If I recall correctly, Facebook in the early days had a full-on datacenter meltdown due to their Puppet cluster pushing a bad release causing every host to check in. They were offline for a full day, I think; they couldn't update the thousands of hosts because things were so saturated.
However, in the case of polling, you dictate the pace from the monitoring servers themselves; you can control that without causing sprawling outages and calls from everything.
> Puppet cluster pushing a bad release causing every host to check in
It was probably Chef, but yeah, I can totally see that happening.
In terms of scaling, nowadays everything either shards or can sit behind a load balancer, so partitioning is much simpler than it used to be.
For network layout though, having hosts that can get access to a large number of machines is something I really don't like. Traditional monitoring, where you have an agent running as root that can execute a bunch of functions, is also a massive security risk, and the industry has largely moved to other forms of monitoring.
Most environments for which pushing is an anti-pattern (I do agree those are the majority) should also avoid complex monitoring tools, complex cloud architectures, and most of the troublemakers in this entire discussion.
So, if you need to architect your metrics, the odds are much higher that you are one of the exceptions that also needs to think about pulling or pushing them. (Or you are doing resume-driven architecture, and will ignore all the advice here anyway.)
I don't know a lot about monitoring; why do you say pulling is an anti-pattern?
My understanding is that with the push pattern you submit a metric when it's available, while with the pull pattern you make metrics available via an interface.
I've read the former is not as performant, as it leads to submitting lots of metrics. Although I can think of an alternative, which is storing them and pushing batches periodically?
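That batching alternative looks roughly like the sketch below; the flush endpoint, interval and wire format are all invented for illustration:

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"
        "time"
    )

    type sample struct {
        name  string
        value float64
        ts    time.Time
    }

    // batcher buffers samples in memory and pushes them in one request per tick,
    // trading a little freshness for far fewer writes on the receiving end.
    func batcher(in <-chan sample, endpoint string, every time.Duration) {
        ticker := time.NewTicker(every)
        defer ticker.Stop()
        var buf []sample
        for {
            select {
            case s := <-in:
                buf = append(buf, s)
            case <-ticker.C:
                if len(buf) == 0 {
                    continue
                }
                var body bytes.Buffer
                for _, s := range buf {
                    fmt.Fprintf(&body, "%s %f %d\n", s.name, s.value, s.ts.Unix())
                }
                if resp, err := http.Post(endpoint, "text/plain", &body); err != nil {
                    log.Printf("push failed, dropping %d samples: %v", len(buf), err)
                } else {
                    resp.Body.Close()
                }
                buf = buf[:0] // best effort either way
            }
        }
    }

    func main() {
        in := make(chan sample, 1024)
        go batcher(in, "http://metrics-host:2003/ingest", 10*time.Second) // placeholder endpoint
        in <- sample{name: "myapp.requests", value: 1, ts: time.Now()}
        time.Sleep(12 * time.Second) // give the toy example one flush
    }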
Pulling metrics from your service is easy and simple. Pulling metrics from thousands and thousands of services, spread out over many clouds/regions/environments, puts a HUGE strain on the single pulling server. Especially if it then evaluates its list of rules for each pull, etc.
But for pushing, if you don't get a message from service X in the last hour, is it down, or is it just not being used? So things like heartbeat intervals need to be configured, etc.
> Pulling metrics from your service is easy and simple. Pulling metrics from thousands and thousands of services, spread out over many clouds/regions/environments, puts a HUGE strain on the single pulling server. Especially if it then evaluates its list of rules for each pull, etc.
Oh god if you're trying to do that with a single pulling server you're doing something terribly wrong.
> But for pushing, if you don't get a message from service X in the last hour, is it down, or is it just not being used? So things like heartbeat intervals need to be configured, etc.
The inverse is true in this situation: if a server is attempting to accept traffic from thousands of services, it's got the potential to get DoS'd pretty damn quick, and you can't even control that flow easily. At least with a pulling server you can dictate that on one host, or scale that out and aggregate in another layer (Thanos/Cortex).
I'll always prefer pulling versus pushing because you gain more control and prevent yourself from blasting a server into space with a thousand calls it can't refuse.
I'm not sure we are talking about the same thing, because I see no need at all to backfill monitoring data. In fact, it's one of my guidelines to decide if something is monitoring or logging; logs can not have holes.
You push as a best effort. It's up to the receiving party to react to a lack of data.
The generals' problem is about trust and authentication.
However, to your point about not being sure if the metrics ever get there, there is another way.
If you just have metrics, then it's a single point of failure, so you need another basic "alive" check. For web services, most people already have some sort of /__health or other check to allow load balancers to work. Marrying up the two sources of data allows you to work out whether it's the service, the metrics or the load balancer that's not working.
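A minimal sketch of that second signal for a Go web service; the path matches the /__health convention above, and the check itself is a stub:

    package main

    import (
        "log"
        "net/http"
    )

    // healthy is whatever cheap self-check makes sense for the service;
    // this stub stands in for "can I reach my database / downstreams?".
    func healthy() bool {
        return true // placeholder
    }

    func main() {
        // The load balancer polls this, while the metrics system scrapes or
        // receives /metrics; disagreement between the two tells you whether
        // the service, the metrics pipeline or the load balancer is at fault.
        http.HandleFunc("/__health", func(w http.ResponseWriter, r *http.Request) {
            if healthy() {
                w.Write([]byte("ok")) // implicit 200
                return
            }
            w.WriteHeader(http.StatusServiceUnavailable)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }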
This matches my experience in Azure. I'd add a few more:
o Log program inputs, outputs, and state changes. If you log something like "We are in function X", that's useless without a stack trace showing how we got to that state.
o Assume your developers are going to forgo logging "the right thing" and instrument the ** out of your application infrastructure (request/response processing, DB accesses, 3rd-party API calls, process lifecycle events, etc...).
o Make logging easy. I may get flamed, but I think dependency-injecting loggers is an anti-pattern. Logging is such a cross-cutting concern that it's easier to have a static Logger or Metric object that is configured at startup. You do need some magic to prevent context from spilling from one request to another, but at least C# has the language features for it (AsyncLocal<T>) and I assume others do too (see the sketch after this list).
o Your monitoring alert configuration should be in source control.
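For what it's worth, a rough Go equivalent of that AsyncLocal<T> trick: the logger stays a package-level object configured once at startup, and per-request fields ride along in the context (the names here are illustrative):

    package logging

    import (
        "context"
        "log/slog"
        "os"
    )

    // Configured once at startup; everything imports this package and just logs.
    var base = slog.New(slog.NewJSONHandler(os.Stdout, nil))

    type ctxKey struct{}

    // WithRequest attaches per-request fields to the context so they don't
    // spill into other requests, similar in spirit to AsyncLocal<T> in C#.
    func WithRequest(ctx context.Context, requestID string) context.Context {
        return context.WithValue(ctx, ctxKey{}, base.With("request_id", requestID))
    }

    // From returns the request-scoped logger, or the base logger outside a request.
    func From(ctx context.Context) *slog.Logger {
        if l, ok := ctx.Value(ctxKey{}).(*slog.Logger); ok {
            return l
        }
        return base
    }

Handlers then call logging.From(ctx).Info("handled request") from anywhere, without threading a logger through every constructor.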
> Logs are there for ignoring. You need them precisely twice: once when you are developing, and once when the thing's gone to shit
Good logs are very valuable both when things are still working and when there is an outage.
E.g. with nginx, even on loaded servers, the error.log is small and it is possible to skim it from time to time to ensure there are no unexpected problems. The access log is useful for ops tasks too; when possible I use a tab-separated access log, which can be queried using clickhouse-local.
Sometimes software writes bad (useless) logs, but this is not a problem with logs in general.
Also, it may be useful to collect metrics for the number of messages at different levels (info/warn/error/crit); a sudden change in the rate of errors is something that should be investigated.
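In the Prometheus ecosystem that's one counter with a level label; a sketch, with a made-up metric name:

    package logmetrics

    import (
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
    )

    // One time series per level; alert on sudden jumps in the "error" rate.
    var logLines = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "myapp_log_messages_total",
        Help: "Log messages emitted, by level.",
    }, []string{"level"})

    // Count is called from the logging hook alongside writing the actual line.
    func Count(level string) {
        logLines.WithLabelValues(level).Inc()
    }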
While I agree generally, I think it's useful to distinguish between freeform logs and structured logs. Freeform logs are those typical logs developers form by string concatenation. Structured logs have a schema, and are generally stored in a proper database (maybe even a SQL database to allow easy processing). Imagine that each request in an RPC system results in a row in Postgres. Those are very useful and you can derive metrics from them reasonably well.
Structured logging has saved my ass so many times. Make a call to an API? Log it. Receive a call to an API? Log it. Someone starts pointing fingers, pull out the logs.
I love to attach logs to other rows in the database: (tableName, rowID, logID, actionID?). Something goes wrong in the middle of the night? Here are all the rows affected.
Log retention is easy. Use foreign keys with cascade delete: removing a row from the Log table removes all the log data. If you need to keep a log, add a boolean flag. Need to be sure? Use a rule to block deletion if the flag is set. (You're using Postgres, right?)
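Roughly what that schema looks like, sketched as Go driving the DDL against Postgres; the table and column names, driver choice and 30-day window are all made up for illustration:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // any Postgres driver works; this one is just an example
    )

    const schema = `
    CREATE TABLE IF NOT EXISTS log (
        log_id     bigserial PRIMARY KEY,
        created_at timestamptz NOT NULL DEFAULT now(),
        keep       boolean NOT NULL DEFAULT false,   -- set to exempt from retention
        message    text NOT NULL
    );
    -- Rows touched by the logged action; they go away with the log row.
    CREATE TABLE IF NOT EXISTS log_row (
        log_id     bigint NOT NULL REFERENCES log(log_id) ON DELETE CASCADE,
        table_name text   NOT NULL,
        row_id     bigint NOT NULL,
        action_id  bigint
    );`

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable") // placeholder DSN
        if err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(schema); err != nil {
            log.Fatal(err)
        }
        // Retention is one statement: the cascade cleans up log_row automatically.
        if _, err := db.Exec(`DELETE FROM log WHERE NOT keep AND created_at < now() - interval '30 days'`); err != nil {
            log.Fatal(err)
        }
    }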
That's all fine until you are running at reasonably large volume. Logging using an RDBMS breaks down very quickly at scale; it's a very expensive tool for the job.
Agreed that storing structured logs into a relational database can be very expensive. But you can store structured logs into analytical databases such as ClickHouse. When properly configured, it may efficiently store and query trillions of log entries per node. Google for Cloudflare and Uber cases for storing logs in ClickHouse.
There are also specialized databases for structured logs, which efficiently index all the fields for all the ingested logs, and then allow fast full-text search over all the ingested data. For example, VictoriaLogs [1] is built on top of architecture ideas from ClickHouse for achieving high performance and high compression rate.
Sure, but that's not what the OP was saying. A pipeline of evented/ETL'd/UDP'd logs to an analytical DB like ClickHouse is a fairly standard and reasonable thing to do at scale. Not putting it in Postgres.
I strongly disagree with the link you've provided saying that queue-based systems are hard to measure the health of. I think the queue size is a good way to monitor the health of things, assuming you have a not-super-pathological workload. If your queues have a bounded size, then alerting at various points towards that upper bound makes perfect sense.
I've found alerting on queue length to be a path to false alarms; what most people care about is queue latency. How long does a message sit in the queue before being processed?
For some jobs the upper bound for that can be measured in days, for others it's milliseconds; either way, no one on the business side cares whether you have zero or a million jobs in the queue so long as they leave the queue quickly enough.
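One way to get that number is to stamp messages at enqueue time and observe their age at dequeue into a histogram, then alert on a percentile of it; a sketch with a made-up metric name and buckets:

    package queuemetrics

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
    )

    // Message carries its enqueue timestamp so the consumer can measure latency.
    type Message struct {
        EnqueuedAt time.Time
        Body       []byte
    }

    // Buckets run from ~1ms to roughly an hour; widen them for day-scale jobs.
    var queueLatency = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "myapp_queue_latency_seconds",
        Help:    "Time a message spent waiting in the queue before processing.",
        Buckets: prometheus.ExponentialBuckets(0.001, 4, 12),
    })

    // Consume records how long the message waited, then processes it.
    func Consume(m Message) {
        queueLatency.Observe(time.Since(m.EnqueuedAt).Seconds())
        // ... process m.Body ...
    }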
I agree with your statement on logs, and I see where you're coming from with CloudWatch. I think if you're coming from Graphite and Grafana then it makes total sense why you would dislike CloudWatch. CloudWatch requires a bit more planning than the intuitive metrics you can typically throw at, and query from, Graphite. It also definitely does feel weird to have to ship multiple dimensions of seemingly the same metric to do the queries that you really want. However, once you design the rollups of metrics, you get everything you need, plus you don't have to worry about operating a mission-critical monitoring stack. And you can still build your dashboards in Grafana and alert as you like.
I've never really found tracing to be useful. I've used pprof and an internally developed version which uploads profiles from production instances and shows you a heatmap, for stuff that requires deeper digging than metrics.
- not scalable, even at a moderately sized level; at one job we had Graphite workers trying to flush-write the data for dozens of hours on a regular basis, and numerous engineers spent a lot of time trying to optimize that with no good outcome
- no labels, so you must flatten them into a metric name
Yeah, sharding it is a pain. We got it to a million active metrics (i.e. a metric that was updated in the last 5 minutes) on a single instance, with the Whisper backend.
I suspect the newer DB-based backends are better scalability-wise.
> no labels, so you must flatten them into a metric name
I _personally_ don't like labels; I believe that metrics should be one-dimensional. I know that puts me at odds with others. I can see the appeal, as it allows you to more easily group instance metrics into service-level ones.
For some reason junior devs/SREs always go nuts putting things into labels. I guess it's the same insecurity and/or brain rot as only-one-class-per-file or three-package-deep source hierarchies for a 1k-line project.
The rule of thumb I give them: if it doesn't make dimensional sense to average across them, never use a label.
If a service which provides metrics supports multiple tenants, wouldn't those metrics at the very least need a label to indicate which tenant they are for?
The temporal resolution is unpredictable (as in, the reduction of resolution with age is separate from the view resolution).
It's slow as shit, as in it used to take a good while to update with the current time bucket's value, and that value seems to be able to change wildly (more than just an average).
The min/max/average/count/$other parameters are poorly documented and inconsistently applied across services, so what chance do you have of implementing them properly?
It's expensive (well, it can be).
Deletion is different from hiding; it's less of a problem with Grafana/custom renderers, but it's nasty.
The lack of functions for slicing/dicing and combining annoys me (again, that might have changed in the last year or so).