Postgres Observability (pgstats.dev)
420 points by fforflo on Nov 11, 2020 | 45 comments



I don’t like the devaluation of the word observability. A chart like this used to be called something like ‘tools to monitor and understand postgres’. Observability is the ability to derive the internal state of a system from the outside. Usually done through traces.

It would be great if Postgres could emit a trace per query, showing in real-time which internal components were hit by this query. A sort of continuous query explain service.

Combine these traces with those from database clients and other front-end services, and you'll be able to point to the front-end service version that causes cache misses deep inside Postgres.


https://www.postgresql.org/docs/current/dynamic-trace.html:

“PostgreSQL provides facilities to support dynamic tracing of the database server. This allows an external utility to be called at specific points in the code and thereby trace execution.

A number of probes or trace points are already inserted into the source code. These probes are intended to be used by database developers and administrators. By default the probes are not compiled into PostgreSQL; the user needs to explicitly tell the configure script to make the probes available.

Currently, the DTrace utility is supported, which, at the time of this writing, is available on Solaris, macOS, FreeBSD, NetBSD, and Oracle Linux. The SystemTap project for Linux provides a DTrace equivalent and can also be used. Supporting other dynamic tracing utilities is theoretically possible by changing the definitions for the macros in src/include/utils/probes.h.”


Observability is not just about traces. This great article[1] by @copyconstruct describes it in detail, and lists monitoring, alerting/visualization, tracing, and log aggregation/analytics as key components of an observability stack.

[1] https://copyconstruct.medium.com/monitoring-and-observabilit...


What kind of information would you like to see in that trace?

Not planning to work on it directly, but I am planning to do some larger executor work as one of the next bigger projects, and having ideas about what kind of information people would like to see could make it easier to implement them later.


A trace should answer why a query is fast or slow. It should provide enough information for a DBA (which I'm not) to determine in which domain the most likely optimization lies, i.e. the query itself, the DB configuration and tuning, or the underlying infra.


Unfortunately some of the data for that isn't cheap to collect. Postgres' EXPLAIN (ANALYZE, SETTINGS) <query>, which shows a good bit of what you want, can slow queries down substantially in some workloads, primarily because of timestamp overhead.
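
For reference, the invocation looks something like this (the table and predicate here are hypothetical; the SETTINGS option needs Postgres 12+):

  EXPLAIN (ANALYZE, SETTINGS)
  SELECT * FROM orders WHERE customer_id = 42;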

Always collecting that information just in case it may get accessed thus isn't really feasible. It'd be good to make it possible to query cheaper information on demand though (e.g. asking for the EXPLAIN of a query running in another session, without ANALYZE, should be doable with some effort).

You can already set things up in a way that allows correlating connections / queries with distributed tracing, but it's more work than it should be. Postgres' pg_stat_activity shows queries, and the connection's 'application_name' can carry information that allows the correlation.
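
A minimal sketch of that correlation, with a hypothetical service name and trace ID:

  -- on the client's connection
  SET application_name = 'checkout-svc/trace-34c1a9f38fb44cad';
  -- then, from a monitoring session
  SELECT pid, application_name, state, query
  FROM pg_stat_activity
  WHERE application_name LIKE '%trace-%';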


The point the OP was trying to make, I believe, is not that it should do all of that, but that if it doesn't, a different name than observability should be chosen, such as monitoring.


"It would be great if Postgres could emit a trace per query, showing in real-time which internal components were hit by this query. A sort of continuous query explain service."


I think you can achieve this with pg_stat_statements and auto_explain.
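
A rough sketch of that combination, assuming pg_stat_statements is in shared_preload_libraries and auto_explain is loaded; the 100ms threshold is arbitrary, and the *_exec_time columns are named per Postgres 13 (older versions use total_time / mean_time):

  -- aggregate statistics per normalized query
  CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
  SELECT query, calls, total_exec_time, mean_exec_time
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10;
  -- log the plan of any query slower than 100ms
  ALTER SYSTEM SET auto_explain.log_min_duration = '100ms';
  SELECT pg_reload_conf();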


Not super familiar with PG but I was under the impression it did have traces - https://www.postgresql.org/docs/9.1/dynamic-trace.html


For those interested in a similar chart, but for Linux processes: http://www.brendangregg.com/linuxperf.html

Amazing stuff.


I would love to see PostgreSQL integrate with something like Jaeger, specifically to have spans for triggers. Database triggers are terrible to introspect and debug.


I added Zipkin integration via OpenCensus some time ago to a project I work on. It has been running in production for a while now: https://github.com/Netflix/titus-executor/tree/master/vpc/se...

You need to:

1. Have an Opencensus tracing context

2. Have query logging set up on your PG server. There are a bunch of ways to do this with minimal overhead: you can log slow queries (say, queries >5ms), log queries that fail, or sample queries.
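
The slow-query variant, for example, is just a couple of settings (a sketch; the 5ms threshold mirrors the one above):

  -- log every statement that takes longer than 5ms
  ALTER SYSTEM SET log_min_duration_statement = '5ms';
  SELECT pg_reload_conf();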

Your query / log ends up getting something like:

  2020-11-11 21:00:00 UTC:titusvpcservice@titusvpcservice:[60294]:HINT: The transaction might succeed if retried.
  2020-11-11 21:00:00 UTC:titusvpcservice@titusvpcservice:[60294]:STATEMENT: /* md: {"spanID":"34c1a9f38fb44cad"} */ INSERT INTO assignments(branch_eni_association, assignment_id) VALUES ($1, $2) RETURNING id
You can then look at Zipkin and use the value within md (spanID) to get the trace. I did this originally because I wanted to transparently wrap the PG SQL driver for Zipkin; Postgres can be oblivious to the fact there's "Zipkin Inside", because it's a terminal node. You can get the span ID from the BEGIN TRANSACTION / first query, tie that to the pid and timestamp, and then use those to go back and look through things with standard(ish) Postgres introspection, since almost all of it has the pg_backend_pid + timestamp in it.


Have you considered adding the span ID to the application name instead? Combined with including the application name in log_line_prefix, that will make the logs easier to inspect in quite a few cases. It also makes querying pg_stat_activity etc. easier, since you can see the current application_name for each connection there.
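
A sketch of that prefix (%m is the timestamp, %p the pid, %a the application name):

  ALTER SYSTEM SET log_line_prefix = '%m [%p] %a ';
  SELECT pg_reload_conf();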


Isn’t the application name set at connect time? Or can you change that after startup? (Re)connecting to the database and not having a pool is kind of a nightmare.


Can be changed using SET application_name = ...
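
For example (the value is hypothetical); SET LOCAL scopes the change to the current transaction, which plays nicely with a pool:

  SET application_name = 'myapp-span-34c1a9f38fb44cad';
  -- or, scoped to a single transaction:
  BEGIN;
  SET LOCAL application_name = 'myapp-span-34c1a9f38fb44cad';
  COMMIT;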


Did you check https://github.com/open-telemetry/opentelemetry-js-contrib/t... ?

I'm not sure if it's the only one but opentelemetry.io does list PostgreSQL.


Correct me if I'm wrong, but this just seems like using the opentelemetry library in node to send query info in a span. This doesn't seem like a full PostgreSQL integration?


Yeah, it looks like it runs on the client side, right? I was expecting a proper plugin running on the PostgreSQL side.


Why would you need any "server side" logic in PG?


They want to trace the goings-on inside PG, e.g. if you have a query that runs in 3 seconds and joins five different tables, it might be interesting to see information like "spent 0.467 seconds joining tables B and F". Like what EXPLAIN ANALYZE does, but live.


yeah -- I wonder how DBs will change over the next few years. Their performance advantage is partially code quality, but partially that they can run complex searches 'close to the data'. You could do the latter with a stateful REST server too, and it would support a lot of modern tooling.


I find the UX kind of jarring; the fade-out/in feels too fast for me, but that may be pretty subjective.


Yeah that was my one element of constructive feedback. The flashing literally made my eyes uncomfortable, to the point that I don't really want to explore too much.


I'm not a professional UX designer, just a Postgres DBA ))) I'm not promising a night mode, but I'll try to experiment with the colors and make the animation smoother.


The problem for me is that by changing what I'm not looking at, it draws me to look at what was taken away; we are drawn to look at movement. It may be better to leave the not-relevant stuff alone and instead highlight what the user is currently hovering over. Something as simple as making the relevant text bold, increasing the thickness of the lines, and maybe making the text box a darker shade may help.


Too much contrast is part of the problem, I think.


This is a handy tool; big thanks to the dev who put this together. It's going to be an excellent educational tool. My only suggestion is to add links to the docs for the detail page's relevant function.


InfluxDB has a template to aid with monitoring a Postgres instance [0]. I wonder if the Telegraf plugin [1] could be enhanced to use more of these metrics, and then the template could be enhanced as well?

[0] https://github.com/influxdata/community-templates/tree/maste... [1] https://github.com/influxdata/telegraf/tree/master/plugins/i...


pgmetrics [1] is an open source tool that can collect information from most of the views/functions described here. It can emit that information in text or JSON format.

[1] https://pgmetrics.io


If anyone is using Flask I wrote an extension that helps get at some of these stats by adding a bunch of Flask CLI commands to your project: https://github.com/nickjj/flask-pg-extras

It's super inspired (read: mostly copy / pasted) from Heroku's "PG Extras" tool, except in this case it all works without Heroku. You just need to be using SQLAlchemy.


Are these usually used for monitoring (like a read-only api)? Or do people build on top of this? Like say, you only need the write ahead log, so you build out a key value store, ignoring all the other stuff postgres built on top of the write ahead log.

I guess I'm asking if you can take postgres and "turn it inside out" by cherry-picking parts of it to build other storage software with it.


> Are these usually used for monitoring (like a read-only api)?

Yes.

> I guess I'm asking if you can take postgres and "turn it inside out" by cherry-picking parts of it to build other storage software with it.

Unfortunately many subsystems of postgres are too interdependent to easily be used independently. Including the WAL.

Part of that is historical / unintentional. But there are also plenty of places where generalizing subsystems would make the whole slower and/or more complicated.

There definitely are bits that I'd like to have used outside of postgres many times. If you have to write C, something like postgres' memory context API makes it so much more convenient (and often more performant too!).



Should have dug out the link earlier... The relevant README is https://github.com/postgres/postgres/blob/master/src/backend...


Lovely and interesting tool. If submitted by the author: one small probable bug detected. When you mouse over "Stats Collector", it doesn't highlight any packages.


Unfortunately, it's not a bug )). There aren't any views or functions for inspecting the stats collector.


This is very interesting. Another valuable piece of information, albeit one involving quite a bit of work, would be to show the parameters associated with each specific area.


Does anyone know which JS framework can be used to create diagrams like this one? Or is it all manually coded?


Very useful, thanks for sharing.


What am I looking at here?


Postgres views and functions that contain info you might want to know about a Postgres instance.

You can click on them and it will tell you more.
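
For example, one of the views on the chart, queried directly (column list trimmed):

  SELECT relname, seq_scan, idx_scan, n_live_tup, n_dead_tup
  FROM pg_stat_user_tables
  ORDER BY seq_scan DESC
  LIMIT 5;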


I was searching for something like this. Thanks!


I'm curious whether all the metrics mentioned here are exported by postgres_exporter [1]. Nowadays Prometheus is becoming the de-facto standard for monitoring, so it would be great if postgres_exporter could provide all these metrics to Prometheus. Moreover, there are other monitoring solutions, such as VictoriaMetrics [2] and vmagent [3], which can scrape metrics from Prometheus-compatible exporters.

[1] https://github.com/wrouesnel/postgres_exporter

[2] https://github.com/VictoriaMetrics/VictoriaMetrics/

[3] https://victoriametrics.github.io/vmagent.html


As a side note, .dev is such a great domain — great to see its popularity increasing!



