The Way(tm) that I was taught/experienced is as follows:
o Logs are there for ignoring. You need them precisely twice: once when you are developing, and once when the thing's gone to shit, but they are never verbose enough when you need them.
o Using logs to derive metrics is a fool's errand pushed by Splunk and the cloud equivalents (i.e. CloudWatch and the like). It's slow, inaccurate and horrendously expensive.
o Using logs for monitoring is a fool's errand. It's always too slow, and really, really fucking brittle.
o Metrics are king.
o Pull-model metrics are an anti-pattern.
o Graphite + Grafana is still actually quite good, although the time resolution isn't there.
o You need to RAID your metrics stores.
o We had a bunch of metrics servers in a RAID 1, which were then in a RAID 0 for performance, all behind load balancers and DNS CNAMEs with a really low TTL.
o CloudWatch metrics are utterly shite.
o CloudWatch is actually entirely shit.
o Tracing is great, and brilliant for performance monitoring.
o X-Ray from AWS is good, but only really for Lambdas.
o Tracing is fragile and doesn't really plug and play end to end, unless you have the engineering discipline to enforce "the one true" tracing system everywhere.
but what do you monitor?
http://widgetsandshit.com/teddziuba/2011/03/monitoring-theor... this is still canonical.
In short, everything should have a minimum set of graphs: CPU, memory, connections, upstream service response times, hits per second and query time.
You can then aggregate those metrics into a "service health" gauge, where you set a minimum level of service (i.e. no response time greater than 600ms, and no 5xx/4xx errors, or similar): red == the service isn't performing within spec, yellow == it's close to being outside spec, green == it's inside spec.
If you are running a monolith, then each subsection needs to have a gauge; for microservice people, every microservice gets one. You can aggregate all those gauges into "business services" to make a dashboard that even CEOs can understand.
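To make the gauge idea concrete, here is a rough sketch in Go; the thresholds, type names and the worst-of aggregation are illustrative, not a prescription:

    package health

    import "time"

    type Status int

    const (
        Green  Status = iota // inside spec
        Yellow               // close to being outside spec
        Red                  // outside spec
    )

    // Readings is whatever you roll up per service; the fields are examples.
    type Readings struct {
        P99Latency time.Duration // e.g. upstream response time, 99th percentile
        ErrorRate  float64       // fraction of responses that were 4xx/5xx
    }

    // Gauge applies a "600ms and no errors" spec, with a warning band below it.
    func Gauge(r Readings) Status {
        switch {
        case r.P99Latency > 600*time.Millisecond || r.ErrorRate > 0:
            return Red
        case r.P99Latency > 500*time.Millisecond: // within 100ms of the limit
            return Yellow
        default:
            return Green
        }
    }

    // A "business service" gauge is simply the worst of its component gauges.
    func Aggregate(parts ...Status) Status {
        worst := Green
        for _, s := range parts {
            if s > worst {
                worst = s
            }
        }
        return worst
    }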
I think it hasn't really been settled whether pushing or pulling metrics is the anti-pattern; which one is currently hot seems to change every 5 to 10 years.
There is no need to choose between push and pull, since both methods have their pros and cons [1]. Just use a monitoring system which supports both methods [2].
I use VM (VictoriaMetrics) both at home and at work for tens of millions of active time series and it's great.
It just runs, and the default config is sustainable, unlike some of the other solutions.
Pull vs. push mostly doesn't matter other than config; the real problem is the sheer number of metrics and series in the Prometheus ecosystem, and being able to handle them without OOMing down to zero availability.
And sadly, somehow the whole Prometheus/cloud ecosystem is built on the idea of pulling GET /metrics.
I personally also think it's an anti-pattern, yet that design is dominant.
Streaming telemetry via gRPC is a rarity.
Pulling/polling isn't suitable for high throughput (e.g. network flows, application profiling, any sub-second sampling frequency), but it's totally fine for 99% of observability use cases. In fact, I would argue pushed metrics are an anti-pattern for most environments, where the performance upsides are not worth the added complexity & reduced flexibility.
There is real value to the observability system being observable by default. It is so nice to be able to GET the /metrics endpoint from curl and see real-time metrics in a human-readable format.
Pull by default, and consciously upgrade to push if you need more throughput.
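For reference, the pull side really is just an HTTP handler; a minimal sketch with the official Go client library, where the metric name and port are made up:

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // A counter the scraper picks up on its next pull.
    var requests = promauto.NewCounter(prometheus.CounterOpts{
        Name: "myapp_requests_total",
        Help: "Requests handled by the hypothetical myapp service.",
    })

    func main() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            requests.Inc()
            w.Write([]byte("ok"))
        })
        // The human-readable text format mentioned above:
        //   curl http://localhost:8080/metrics
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }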
I think the issue for me is that "pull" requires me to open up lots of services/hosts/sidecars to allow inbound connections. That's a lot more things to monitor and test to see if they're broken.
Having a single DNS record that I can route based on location/traffic/load/uptime autonomously is, I think, super convenient.
For example, if I want to have a single metrics config for a global service I can say "server: metrics-host", and depending on the DNS search path it'll either get the test, regional or global metrics server (i.e. .local, .us-east-1 or *.company.tld).
However, for most people it's a single DNS record with a load balancer. When a host stops pushing metrics, you check the host's aliveness score and alert.
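The push side of that can be as small as a fire-and-forget UDP datagram to that single DNS name; a sketch using the statsd line format, where "metrics-host" and the metric name are placeholders:

    package main

    import (
        "fmt"
        "net"
    )

    // pushCounter emits one statsd-style counter increment and forgets about it.
    // The DNS search path decides whether "metrics-host" resolves to the local,
    // regional or global collector, exactly as described above.
    func pushCounter(name string, value int) error {
        conn, err := net.Dial("udp", "metrics-host:8125")
        if err != nil {
            return err
        }
        defer conn.Close()
        _, err = fmt.Fprintf(conn, "%s:%d|c", name, value)
        return err
    }

    func main() {
        _ = pushCounter("myapp.requests", 1) // best effort; drops are acceptable
    }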
I'd still argue that it's easier to scale pulling than it is a distributed push. It's kind of why Prometheus went that route in the first place.
Back in the days of Puppet or Nagios, which would take requests rather than pull, it was very common to hear about them cascading and causing huge denial-of-service issues, and even massive outages, for the simple fact that it's way harder to control thousands of servers sending out data on a schedule than a set of infrastructure designed to query them.
If I recall correctly, Facebook in the early days had a full-on datacenter meltdown due to their Puppet cluster pushing a bad release causing every host to check in. They were offline for a full day, I think; they couldn't update the thousands of hosts because things were so saturated.
However, in the case of polling, you dictate the pace from the monitoring servers themselves; you can control that without causing sprawling outages and calls from everything.
> Puppet cluster pushing a bad release causing every host to check in
It was probably Chef, but yeah, I can totally see that happening.
In terms of scaling, nowadays everything either shards or can sit behind a load balancer, so partitioning is much simpler than it used to be.
For network layout though, having hosts that can get access to a large number of machines is something I really don't like. Traditional monitoring, where you have an agent running as root that can execute a bunch of functions, is also a massive security risk, and the industry has largely moved to other forms of monitoring.
Most environments for which pushing is an anti-pattern (I do agree those are the majority) should also avoid complex monitoring tools, complex cloud architectures, and most of the troublemakers in this entire discussion.
So, if you need to architect your metrics, the odds are much higher that you are one of the exceptions that also needs to think about pulling or pushing them. (Or you are doing resume-driven architecture, and will ignore all the advice here anyway.)
I don't know a lot about monitoring; why do you say pulling is an anti-pattern?
My understanding is that with the push pattern you submit a metric when it's available, while with the pull pattern you make metrics available via an interface.
I've read the former is not as performant, as it leads to submitting lots of metrics. Although I can think of an alternative, which is storing them and pushing batches periodically?
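That batching alternative looks roughly like the sketch below; the flush endpoint, interval and wire format are all invented for illustration:

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"
        "time"
    )

    type sample struct {
        name  string
        value float64
        ts    time.Time
    }

    // batcher buffers samples in memory and pushes them in one request per tick,
    // trading a little freshness for far fewer writes on the receiving end.
    func batcher(in <-chan sample, endpoint string, every time.Duration) {
        ticker := time.NewTicker(every)
        defer ticker.Stop()
        var buf []sample
        for {
            select {
            case s := <-in:
                buf = append(buf, s)
            case <-ticker.C:
                if len(buf) == 0 {
                    continue
                }
                var body bytes.Buffer
                for _, s := range buf {
                    fmt.Fprintf(&body, "%s %f %d\n", s.name, s.value, s.ts.Unix())
                }
                if resp, err := http.Post(endpoint, "text/plain", &body); err != nil {
                    log.Printf("push failed, dropping %d samples: %v", len(buf), err)
                } else {
                    resp.Body.Close()
                }
                buf = buf[:0] // best effort either way
            }
        }
    }

    func main() {
        in := make(chan sample, 1024)
        go batcher(in, "http://metrics-host:2003/ingest", 10*time.Second) // placeholder endpoint
        in <- sample{name: "myapp.requests", value: 1, ts: time.Now()}
        time.Sleep(12 * time.Second) // give the toy example one flush
    }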
Pulling metrics from your service is easy and simple. Pulling metrics from thousands and thousands of services, spread out over many clouds/regions/environments, puts a HUGE strain on the single pulling server. Especially if it then evaluates its list of rules for each pull, etc.
But for pushing, if you don't get a message from service X in the last hour, is it down, or is it just not being used? So things like heartbeat intervals need to be configured, etc.
> Pulling metrics from your service is easy and simple. Pulling metrics from thousands and thousands of services, spread out over many clouds/regions/environments, puts a HUGE strain on the single pulling server. Especially if it then evaluates its list of rules for each pull, etc.
Oh god if you're trying to do that with a single pulling server you're doing something terribly wrong.
> But for pushing, if you don't get a message from service X in the last hour, is it down, or is it just not being used? So things like heartbeat intervals need to be configured, etc.
The inverse is true in this situation: if a server is attempting to accept traffic from thousands of services, it's got the potential to get DoS'd pretty damn quick, and you can't even control that flow easily. At least with a pulling server you can dictate that on one host, or scale that out and aggregate in another layer (Thanos/Cortex).
I'll always prefer pulling versus pushing because you gain more control and prevent yourself from blasting a server into space with a thousand calls it can't refuse.
I'm not sure we are talking about the same thing, because I see no need at all to backfill monitoring data. In fact, it's one of my guidelines to decide if something is monitoring or logging; logs can not have holes.
You push as a best effort. It's up to the receiving party to react to a lack of data.
The generals' problem is about trust and authentication.
However, to your point about not being sure if the metrics ever get there, there is another way.
If you just have metrics, then it's a single point of failure, so you need another basic "alive" check. For web services, most people already have some sort of /__health or other check to allow load balancers to work. Marrying up the two sources of data allows you to work out whether it's the service, the metrics or the load balancer that's not working.
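A minimal sketch of that second signal for a Go web service; the path matches the /__health convention above, and the check itself is a stub:

    package main

    import (
        "log"
        "net/http"
    )

    // healthy is whatever cheap self-check makes sense for the service;
    // this stub stands in for "can I reach my database / downstreams?".
    func healthy() bool {
        return true // placeholder
    }

    func main() {
        // The load balancer polls this, while the metrics system scrapes or
        // receives /metrics; disagreement between the two tells you whether
        // the service, the metrics pipeline or the load balancer is at fault.
        http.HandleFunc("/__health", func(w http.ResponseWriter, r *http.Request) {
            if healthy() {
                w.Write([]byte("ok")) // implicit 200
                return
            }
            w.WriteHeader(http.StatusServiceUnavailable)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }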
This matches my experience in Azure. I'd add a few more:
o Log program inputs, outputs, and state changes. If you log something like "We are in function X", that's useless without a stack trace showing how we got to that state.
o Assume your developers are going to forgo logging "the right thing" and instrument the ** out of your application infrastructure (request/response processing, DB accesses, 3rd-party API calls, process lifecycle events, etc...).
o Make logging easy. I may get flamed, but I think dependency-injecting loggers is an anti-pattern. Logging is such a cross-cutting concern that it's easier to have a static Logger or Metric object that is configured at startup. You do need some magic to prevent context from spilling from one request to another, but at least C# has the language features for it (AsyncLocal<T>) and I assume others do too (see the sketch after this list).
o Your monitoring alert configuration should be in source control.
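For what it's worth, a rough Go equivalent of that AsyncLocal<T> trick: the logger stays a package-level object configured once at startup, and per-request fields ride along in the context (the names here are illustrative):

    package logging

    import (
        "context"
        "log/slog"
        "os"
    )

    // Configured once at startup; everything imports this package and just logs.
    var base = slog.New(slog.NewJSONHandler(os.Stdout, nil))

    type ctxKey struct{}

    // WithRequest attaches per-request fields to the context so they don't
    // spill into other requests, similar in spirit to AsyncLocal<T> in C#.
    func WithRequest(ctx context.Context, requestID string) context.Context {
        return context.WithValue(ctx, ctxKey{}, base.With("request_id", requestID))
    }

    // From returns the request-scoped logger, or the base logger outside a request.
    func From(ctx context.Context) *slog.Logger {
        if l, ok := ctx.Value(ctxKey{}).(*slog.Logger); ok {
            return l
        }
        return base
    }

Handlers then call logging.From(ctx).Info("handled request") from anywhere, without threading a logger through every constructor.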
> Logs are there for ignoring. You need them precisely twice: once when you are developing, and once when the thing's gone to shit
Good logs are very valuable both when things are still working and when there is an outage.
E.g. with nginx, even on loaded servers, the error.log is small and it is possible to skim it from time to time to ensure there are no unexpected problems. The access log is useful for ops tasks too; when possible I use a tab-separated access log, which can be queried using clickhouse-local.
Sometimes software writes bad (useless) logs, but this is not a problem with logs in general.
Also, it may be useful to collect metrics for the number of messages at different levels (info/warn/error/crit); a sudden change in the rate of errors is something that should be investigated.
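In the Prometheus ecosystem that's one counter with a level label; a sketch, with a made-up metric name:

    package logmetrics

    import (
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
    )

    // One time series per level; alert on sudden jumps in the "error" rate.
    var logLines = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "myapp_log_messages_total",
        Help: "Log messages emitted, by level.",
    }, []string{"level"})

    // Count is called from the logging hook alongside writing the actual line.
    func Count(level string) {
        logLines.WithLabelValues(level).Inc()
    }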
While I agree generally, I think it's useful to distinguish between freeform logs and structured logs. Freeform logs are those typical logs developers form by string concatenation. Structured logs have a schema, and are generally stored in a proper database (maybe even a SQL database to allow easy processing). Imagine that each request in an RPC system results in a row in Postgres. Those are very useful and you can derive metrics from them reasonably well.
Structured logging has saved my ass so many times. Make a call to an API? Log it. Receive a call to an API? Log it. Someone starts pointing fingers, pull out the logs.
I love to attach logs to other rows in the database: (tableName, rowID, logID, actionID?). Something goes wrong in the middle of the night? Here are all the rows affected.
Log retention is easy. Use foreign keys with cascade delete: removing a row from the Log table removes all the log data. If you need to keep a log, add a boolean flag. Need to be sure? Use a rule to block deletion if the flag is set. (You're using Postgres, right?)
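Roughly what that schema looks like, sketched as Go driving the DDL against Postgres; the table and column names, driver choice and 30-day window are all made up for illustration:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // any Postgres driver works; this one is just an example
    )

    const schema = `
    CREATE TABLE IF NOT EXISTS log (
        log_id     bigserial PRIMARY KEY,
        created_at timestamptz NOT NULL DEFAULT now(),
        keep       boolean NOT NULL DEFAULT false,   -- set to exempt from retention
        message    text NOT NULL
    );
    -- Rows touched by the logged action; they go away with the log row.
    CREATE TABLE IF NOT EXISTS log_row (
        log_id     bigint NOT NULL REFERENCES log(log_id) ON DELETE CASCADE,
        table_name text   NOT NULL,
        row_id     bigint NOT NULL,
        action_id  bigint
    );`

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable") // placeholder DSN
        if err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(schema); err != nil {
            log.Fatal(err)
        }
        // Retention is one statement: the cascade cleans up log_row automatically.
        if _, err := db.Exec(`DELETE FROM log WHERE NOT keep AND created_at < now() - interval '30 days'`); err != nil {
            log.Fatal(err)
        }
    }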
That's all fine until you are running at reasonably large volume. Logging using an RDBMS breaks down very quickly at scale; it's a very expensive tool for the job.
Agreed that storing structured logs into a relational database can be very expensive. But you can store structured logs into analytical databases such as ClickHouse. When properly configured, it may efficiently store and query trillions of log entries per node. Google for Cloudflare and Uber cases for storing logs in ClickHouse.
There are also specialized databases for structured logs, which efficiently index all the fields for all the ingested logs, and then allow fast full-text search over all the ingested data. For example, VictoriaLogs [1] is built on top of architecture ideas from ClickHouse for achieving high performance and high compression rate.
Sure, but that's not what the OP was saying. A pipeline of evented/ETL'd/UDP'd logs to an analytical DB like ClickHouse is a fairly standard and reasonable thing to do at scale. Not putting it in Postgres.
I strongly disagree with the link you've provided saying that queue-based systems are hard to measure the health of. I think the queue size is a good way to monitor the health of things, assuming you have a not-super-pathological workload. If your queues have a bounded size, then alerting at various points towards that upper bound makes perfect sense.
I've found alerting on queue length to be a path to false alarms; what most people care about is queue latency. How long does a message sit in the queue before being processed?
For some jobs the upper bound for that can be measured in days, for others it's milliseconds; either way, no one on the business side cares whether you have zero or a million jobs in the queue so long as they leave the queue quickly enough.
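One way to get that number is to stamp messages at enqueue time and observe their age at dequeue into a histogram, then alert on a percentile of it; a sketch with a made-up metric name and buckets:

    package queuemetrics

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
    )

    // Message carries its enqueue timestamp so the consumer can measure latency.
    type Message struct {
        EnqueuedAt time.Time
        Body       []byte
    }

    // Buckets run from ~1ms to roughly an hour; widen them for day-scale jobs.
    var queueLatency = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "myapp_queue_latency_seconds",
        Help:    "Time a message spent waiting in the queue before processing.",
        Buckets: prometheus.ExponentialBuckets(0.001, 4, 12),
    })

    // Consume records how long the message waited, then processes it.
    func Consume(m Message) {
        queueLatency.Observe(time.Since(m.EnqueuedAt).Seconds())
        // ... process m.Body ...
    }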
I agree with your statement on logs, and I see where you're coming from with CloudWatch. I think if you're coming from Graphite and Grafana then it makes total sense why you would dislike CloudWatch. CloudWatch requires a bit more planning than the intuitive metrics you can typically throw at, and query from, Graphite. It also definitely does feel weird to have to ship multiple dimensions of seemingly the same metric to do the queries that you really want. However, once you design the rollups of metrics, you get everything you need, plus you don't have to worry about operating a mission-critical monitoring stack. And you can still build your dashboards in Grafana and alert as you like.
I've never really found tracing to be useful. I've used pprof and an internally developed version which uploads profiles from production instances and shows you a heatmap, for stuff that requires deeper digging than metrics.
- not scalable, even at a moderately sized level; at one job we had Graphite workers trying to flush-write the data for dozens of hours on a regular basis, and numerous engineers spent a lot of time trying to optimize that with no good outcome
- no labels, so you must flatten them into a metric name
Yeah, sharding it is a pain. We got it to a million active metrics (i.e. a metric that was updated in the last 5 minutes) on a single instance, with the Whisper backend.
I suspect the newer DB-based backends are better scalability-wise.
> no labels, so you must flatten them into a metric name
I _personally_ don't like labels; I believe that metrics should be one-dimensional. I know that puts me at odds with others. I can see the appeal, as it allows you to more easily group instance metrics into service-level ones.
For some reason junior devs/SREs always go nuts putting things into labels. I guess it's the same insecurity and/or brain rot as only-one-class-per-file or three-package-deep source hierarchies for a 1k-line project.
The rule of thumb I give them: if it doesn't make dimensional sense to average across them, never use a label.
If a service which provides metrics supports multiple tenants, wouldn't those metrics at the very least need a label to indicate which tenant they are for?
The temporal resolution is unpredictable (as in, the reduction of resolution with age is separate from the view resolution).
It's slow as shit, as in it used to take a good while to update with the current time bucket's value, and that value seems to be able to change wildly (more than just an average).
The min/max/average/count/$other parameters are poorly documented and inconsistently applied across services, so what chance do you have of implementing them properly?
It's expensive (well, it can be).
Deletion is different from hiding; it's less of a problem with Grafana/custom renderers, but it's nasty.
The lack of functions for slicing/dicing and combining annoys me (again, that might have changed in the last year or so).