PostgreSQL Features to Try (pgdash.io)
378 points by rodmena on April 27, 2019 | 77 comments



There's full text search, but that only matches exact (stemmed) words. To make it more versatile, take a look at trigrams, which forgive you if you don't spell something exactly right. They break everything into three-letter segments and search with those.

https://www.postgresql.org/docs/current/pgtrgm.html
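A quick illustration of what it does (assuming the pg_trgm extension is installed; the sample words are arbitrary):

    SELECT show_trgm('postgres');
    -- the padded three-letter chunks: "  p", " po", "pos", "ost", "stg", ...
    SELECT similarity('postgres', 'postgers');
    -- a similarity score between 0 and 1, tolerant of the transposed letters
    SELECT 'postgres' % 'postgers';
    -- true: "similar enough" under the default similarity threshold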


I used this a while back, combined with a small autocomplete engine built on top of Redis. It worked amazingly well. Users type three words, we fetch the entire list of completions from Redis, then pass that through to PostgreSQL. We even built synonyms into the Redis cache, so that when a user searches for "car" and we only have "vehicle" in the DB, it still works. If for whatever reason the Redis instance is down (less than 0.1% of the time), the query falls back to PG's full text search backed by the trigrams. Beautiful.

Granted, for this to work the DB needs to be somewhat stable in size. For us it was a list of products that rarely changes, so the Redis cache can be built at server bootstrap.


Or, the best of both worlds: LIKE or ILIKE using a trigram index:

    CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- provides gin_trgm_ops
    CREATE INDEX trigram_index ON words_table USING GIN (sentence gin_trgm_ops);


Yeah that index is amazing if you are doing a lot of sliding LIKE queries (i.e. % placeholders at start and end). Speeds them up massively without having to switch to a full text index or change your queries at all.
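For instance (sketching with the table from the comment above), a double-ended wildcard query like this can use that index:

    -- With the trigram GIN index in place, EXPLAIN will typically show a
    -- bitmap index scan on trigram_index instead of a sequential scan.
    SELECT * FROM words_table WHERE sentence ILIKE '%postgres%';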


I remember reading fairly recently that DuckDuckGo still uses Postgres for their backend. Couldn't find any source for it, though; Google didn't give me anything useful.


I had a case where I needed to build some denormalized, redundant models to handle certain queries. I also had a write-heavy, append-only event log portion of my data model. I figured it would scale very differently, and could be partitioned very differently than the rest of the data, so I wanted to try putting it in a separate DB and using dblink/foreign data wrappers. I use DB triggers and dblink to produce the denormalized data from the event stream, and I consume the denormalized data from my application using pg_notify. So, even if I do an insert or update with a SQL statement directly against the DB, it will produce the denormalized data without relying on application logic/hooks.

It's definitely airtight, and nice knowing that no version of my data can currently be inconsistent, but there are definitely costs. For one, it's just slow. Bear in mind, I'm on commodity hardware, like t2, but it still seems slow, as some writes take up to 1-2s. It's also not great that it's basically magic; I can't really make alterations to change or improve what it's doing. Lastly, it's hard to version control. I have the triggers checked in with my application code, but it would be easy to not even be aware of their existence. On application startup, I override existing triggers with the ones from the code, just to make sure they're in sync, and I explicitly log this process, but I could still see this whole thing happening without it being at all clear to another developer.
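For concreteness, the trigger/dblink part of that setup looks roughly like this (a sketch only; database, table, and column names are made up, and EXECUTE FUNCTION is PG 11+ syntax):

    CREATE EXTENSION IF NOT EXISTS dblink;

    -- On insert into the event log, push a denormalized row to the other
    -- database over dblink, then tell the application via pg_notify.
    CREATE OR REPLACE FUNCTION denormalize_event() RETURNS trigger AS $$
    BEGIN
      PERFORM dblink_exec(
        'dbname=read_model',
        format(
          'INSERT INTO order_totals (order_id, total) VALUES (%L, %L)
           ON CONFLICT (order_id) DO UPDATE SET total = EXCLUDED.total',
          NEW.order_id, NEW.amount
        )
      );
      PERFORM pg_notify('read_model_updated', NEW.order_id::text);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER events_denormalize
      AFTER INSERT ON events
      FOR EACH ROW EXECUTE FUNCTION denormalize_event();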

Would I use these features again? Probably not.


You could use pglogical[1][2] and sync to an updatable view - it ought to be faster and it should be easier to maintain (the updatable view would act as an abstraction layer).

[1] https://www.2ndquadrant.com/en/resources/pglogical/ [2] https://github.com/2ndQuadrant/pglogical


> some writes take up to 1-2s

How big are the inserts? I put my event-sourcing log in a regular table that fully duplicates the input rows as JSON. It has never been that slow for me.


Missing from the list:

---

1. json functions: json_build_object, json_agg, etc.

Ever want to produce a hierarchical result set?

   {
     "groups": [
       {
         "name": "Beatles",
         "users": [
           {
             "name": "John"
           },
           {
             "name": "Paul"
           }
         ]
       },
       {
         "name": "Stooges",
         "users": [
           {
             "name": "Moe"
           }
         ]
       }
     ]
   }
Then

   SELECT json_build_object(
     'groups', json_agg(
       json_build_object(
         'name', g.name,
         'users', coalesce(u.users, '[]'::json)
       )
     )
   ) AS value
   FROM "group" g
   LEFT JOIN LATERAL (
     SELECT json_agg(
       json_build_object('name', u.name)
     ) AS users
     FROM "user" u
     WHERE g.id = u.group_id
   ) u ON true;
Produce your entire JSON result in SQL, with the full power of aggregations, joins, etc., extensible to any nesting structure.

At the extreme, you could use PostgREST to serve an entire REST JSON API with just PostgreSQL. [1]

---

2. With PostgreSQL's row-level security, end users can get their own connections to the database.

You can even have an instant GraphQL server via Postgraphile. [2]

[1] https://github.com/PostgREST/postgrest

[2] https://www.graphile.org/postgraphile/


I love this way of getting data out of DBs. Before the JSON functions existed, I was doing a similar thing with the SQL/XML functions on both Postgres and Oracle (which had practically the same set of functions thanks to the SQL/XML ISO standard). A single round trip to grab all the data needed is awesome.


A nice list but I'd also add row-level security (RLS): https://info.crunchydata.com/blog/a-postgresql-row-level-sec...

It's amazing how much effort we put into restricting access on the server, but ignore user roles on the data layer.


IME at scale you need a connection pool. Pools usually cache by connection credentials. And maintaining a large variety of roles with disparate access levels is cumbersome. (I've tried.)

The end result is that column-level, much less row-level, controls end up unused in most cases.

Very security conscious places may just use stored procedures for everything, and limit access to those.


Note that you don't have to use the connection credentials to make use of row level security in PG. The security policy can be based on session variables (or anything, really), which can be changed during the connection.
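A minimal sketch of the session-variable approach (table, column, and setting names are made up):

    ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

    -- Rows are only visible when tenant_id matches a per-session setting,
    -- not the connection's login role.
    CREATE POLICY tenant_isolation ON orders
      USING (tenant_id = current_setting('app.current_tenant')::int);

    -- The application sets the variable per transaction, even on a pooled
    -- connection:
    BEGIN;
    SET LOCAL app.current_tenant = '42';
    SELECT * FROM orders;  -- only tenant 42's rows
    COMMIT;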

Edit: it’s not the same thing but fwiw pg row level security is completely decoupled from the dbms access control.


> It's amazing how much effort we put into restricting access on the server, but ignore user roles on the data layer.

That's one of the reasons PostgREST seems promising to me. http://postgrest.org


I've also been a happy user of Sandman: https://github.com/jeffknupp/sandman2




I've looked into it as well and it looks really interesting. The biggest downside for me was that at some point I still needed a backend server to hold secrets or run business logic. At that point, PostgREST was just another server / dependency that I couldn't justify, since all of the nifty RLS was done inside Postgres anyway.

It definitely introduced me to RLS though.


Couldn’t agree more ... row level security is brilliantly simple too. Access defined by query.



Yep, in conjunction with TimescaleDB, lateral joins make joining time-series in the database very powerful: https://docs.timescale.com/v1.2/using-timescaledb/reading-da...
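The classic example of the pattern in plain Postgres (table and column names hypothetical): latest reading per device via a lateral subquery.

    SELECT d.id, r.ts, r.temperature
    FROM devices d
    LEFT JOIN LATERAL (
      SELECT ts, temperature
      FROM readings
      WHERE device_id = d.id
      ORDER BY ts DESC
      LIMIT 1
    ) r ON true;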


dude, so cool



It’s funny... I’ve been a Postgres user for a long time, and two years ago I had to work with MySQL and Oracle databases for a project with a client. I was absolutely horrified that they couldn’t do transactional DDL! I’d just been using that feature for so long that I assumed everything done in a transaction was atomic.

That is such a huge feature!


FYI, Postgresql also has DDL statements that are not transaction safe. The list is quite small compared to software like MySQL, but they still exist. I lost the list I had gathered from a handful of their developers in IRC once, so unfortunately I can't share. Just know that Postgres isn't 100% safe with DDL statements.


To me, one of the neatest features was the ability to send notifications from triggers:

   PERFORM pg_notify('send_sms', <RECORD>::text);
Amazing.
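Spelled out a little more, such a trigger might look like this (a sketch; table, channel, and function names are made up; EXECUTE FUNCTION is PG 11+ syntax):

    CREATE OR REPLACE FUNCTION notify_send_sms() RETURNS trigger AS $$
    BEGIN
      -- send the whole new row as a JSON payload on the "send_sms" channel
      PERFORM pg_notify('send_sms', row_to_json(NEW)::text);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER messages_notify_sms
      AFTER INSERT ON messages
      FOR EACH ROW EXECUTE FUNCTION notify_send_sms();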


Putting it here because it isn't well known and bit me once.

NOTIFY payloads are deduped over the course of a transaction, with an O(n^2) algorithm if all payloads are different.

If you do a NOTIFY on delete or update and hit a few million rows, the query can go from seconds to hours.

The solution is to write your payloads to a temp table and send them in batches when the transaction ends. Notifying in chunks of 100 makes the process 10,000 times faster.
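One way to get that chunking is a statement-level trigger over a transition table (PG 10+), which is a slight variation on the temp-table idea; a sketch with made-up names:

    CREATE OR REPLACE FUNCTION notify_deleted_in_chunks() RETURNS trigger AS $$
    DECLARE
      chunk text;
    BEGIN
      -- one notification per 100 deleted ids instead of one per row
      FOR chunk IN
        SELECT json_agg(id)::text
        FROM (SELECT id, row_number() OVER () AS rn FROM old_rows) t
        GROUP BY (rn - 1) / 100
      LOOP
        PERFORM pg_notify('rows_deleted', chunk);
      END LOOP;
      RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER items_deleted_notify
      AFTER DELETE ON items
      REFERENCING OLD TABLE AS old_rows
      FOR EACH STATEMENT EXECUTE FUNCTION notify_deleted_in_chunks();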


The best part of this is that the notifications are transactional! If other work in the transaction causes a rollback then the notification is never sent.


re Inheritance -- used it very well to solve a "Pretty Big Data" problem, making 60 child tables with two fields to differentiate. Some craft is required when building indexes and in loading the data. Overall this inherited-tables solution works very well and has had many small positive side effects over time. As always, PG is rock-solid, every time.


Did anyone ever do something really interesting with table inheritance? This is one feature that looks like I might have use cases for all the time, but they never quite fit or offer a good benefit.

Foreign data wrappers, on the other hand, I use all the time: you can have a "common" stuff database and just inject it into all kinds of related ones (like per-tenant setups etc.). I've run into some issues with joins and performance, but if need be you can just materialize the foreign tables locally and work from that.
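Roughly, that setup looks like this with postgres_fdw (a sketch; server, credential, and table names are all made up):

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    -- Point at the shared "common" database...
    CREATE SERVER common_srv FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'common-db.internal', dbname 'common');
    CREATE USER MAPPING FOR CURRENT_USER SERVER common_srv
      OPTIONS (user 'app', password 'secret');

    -- ...and pull its tables into a local schema.
    CREATE SCHEMA common;
    IMPORT FOREIGN SCHEMA public FROM SERVER common_srv INTO common;

    -- If joins against the foreign tables get slow, materialize locally:
    CREATE MATERIALIZED VIEW countries_local AS SELECT * FROM common.countries;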


That is the one feature that you shouldn't use.

At $work we use table inheritance - each client gets their own schema with their own copies of the table. The idea is cute, but the problem is that the structure of indexing doesn't cooperate with it. Therefore even the simplest query against a single parent table turns into thousands of index lookups on thousands of different tables. As you can imagine, this is not exactly a scalable situation.

If they fixed the index structure, if need be only for a few index types, it would make a huge difference. But it doesn't sound like that is easy internally or will happen any time soon.


I looked into it for a bit, but it always felt like there were caveats that I didn't like. I think in the end, avoiding explicit hierarchies in modeling is better.


I could see it being useful for a lot of situations where people might otherwise be looking for a 'schemaless' solution.


How about SKIP LOCKED, which is useful for building advanced queuing setups?
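A minimal job-queue sketch (table and column names are made up): each worker claims one pending job without blocking on rows other workers have already locked.

    BEGIN;

    SELECT id, payload
    FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED;

    -- ...process the job in application code, mark it done
    -- (UPDATE jobs SET status = 'done' WHERE id = ...), then COMMIT;
    -- the row lock is held until the transaction ends.
    COMMIT;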



The one thing I wish NOTIFY had is the ability to do NOTIFY [N], or at least NOTIFY 1. I was writing a very, very basic job pool previously and had to try to overcome the thundering herd problem, but in the end couldn't. I know it's a niche case and maybe there is a clever way of dealing with it, but the only way I could find to avoid re-doing work and/or trying to select jobs that weren't taken was to allow each listener to be notified and lock the row, where each query for a job was roughly SELECT * FROM jobs WHERE ... FOR UPDATE SKIP LOCKED, and then bail out if I couldn't find any jobs. I could have been using NOWAIT instead; it's been a few years. We weren't looking at a huge number of async jobs, but needed reliability and auditing. Ultimately I believe we scrapped this whole idea, so it's all moot, but it definitely would have made prototyping a bit easier.


A possible workaround: have multiple channels so that each channel only has a small number of listeners, and NOTIFY a random channel.


Beware of using FDWs with extensions in Postgres (like PostGIS), since they might be written under the assumption that data is available locally and not over the network (i.e. full table pulls).


I wrote a little post on doing fast text search in postgres:

https://austingwalters.com/fast-full-text-search-in-postgres...

I use it regularly and it may help people. Makes use of triggers and some of the other items mentioned.


What I'd like is a Postgres-specific migration scheme that lets me take advantage of a lot of the cool (and unique) features Postgres has, without fighting with general-purpose ORMs or having to manually edit migration files.


This was discussed here recently: https://github.com/djrobstep/migra

I haven't used it, but it fits your request of a postgresql specific migration tool where you don't have to futz with manually writing migration scripts.


Thanks!

I did look at that; I think I was a participant in the last discussion, but somehow it got lost in the bookmarks.

It doesn't have rollbacks (that I can see), but maybe it is good enough (aka better than Alembic migrations). It appears to deal with actual SQL rather than models, which is nice. Will check it out.


I wrote a simple solution for this which is used in production at several businesses now: https://github.com/purcell/postgresql-migrations/


I like the idea of this approach (for anyone else interested in the topic).

https://news.ycombinator.com/item?id=19105159

But it does seem like a lot more work when rapidly iterating schema.


Have you looked at Sqitch? It’s the most flexible, but also low level.


Had not. Thank you!


Very neat list. If I had the know-how I would do 2 things to Postgres to make some of these items more useful:

- Foreign key support for inherited tables (i.e. a foreign key referencing a table with inheritance; the other way round works just fine). Table inheritance is super useful, but the lack of full foreign key support means we have to give up some integrity (or replace the foreign keys with hand-written constraints).

- Bloom indexes on arrays and bigint (int8) types - would make it feasible to (almost) 'index all things' when working mostly with numeric data.


Foreign keys for partitions are coming in Postgres 12: https://www.depesz.com/2019/04/24/waiting-for-postgresql-12-...


Table inheritance sounds neat, but the limitations are so severe I've never found a case where it's better than just adding a JSON column.

The manual says "These deficiencies [with table inheritance] will probably be fixed in some future release" -- and has since at least 2002, so I think it's safe to assume the situation will never improve.


I have looked into table inheritance, and came to the same conclusion as you.

Later I learned that table inheritance wasn't meant to be used for polymorphism, but rather to solve performance problems. I guess that sharded tables now fill that niche in a better way.


The table inheritance one is something I wasn't aware of. I can think of so many great uses for it that it makes me kind of wary. There has to be some downside; it's too good.


Well, there’s this blurb in the documentation:

“Although inheritance is frequently useful, it has not been integrated with unique constraints or foreign keys, which limits its usefulness.”


Isn't it just sugar for what you would normally do, which is add another table with the new columns and a FK into the base table? I assume it adds some protection around type exclusivity, i.e. a cat row cannot be a dog row, whereas you do not have that protection with cat and dog tables that both reference the animal table.

Seems nice but usually you actually would prefer composition over inheritance.
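For reference, the basic mechanics (a sketch, using the cats/animals example):

    CREATE TABLE animals (
      id   bigserial PRIMARY KEY,
      name text NOT NULL
    );

    -- cats gets all of animals' columns plus its own
    CREATE TABLE cats (
      lives_left int DEFAULT 9
    ) INHERITS (animals);

    INSERT INTO cats (name) VALUES ('Felix');

    SELECT * FROM animals;        -- includes rows from cats
    SELECT * FROM ONLY animals;   -- parent table only
    -- Caveat from the docs quoted above: PRIMARY KEY/UNIQUE and foreign
    -- keys are not enforced across the hierarchy.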


It would simplify the reflection of the schema in code I think. Also not having to do a JOIN, or make a FK, or worry about the cardinality of the FK, or worry about ON DELETE CASCADE. It seems like it wraps it up pretty succinctly.


Haha, subtle marketing bite. Did not see it coming.

One thing is not like the others.

The pub/sub is interesting. Anyone tried it? Is it OK for production loads?


NOTIFY and LISTEN appeal a lot, but my understanding is LISTEN needs to be on the same connection, which isn't realistic with connection pooling etc.? Anyone got any experience with this feature?


On the same connection as the NOTIFY-er? I probably misunderstand something. Can you point to the part of the documentation that states this? That would defeat the purpose of it entirely.

Edit: In fact the second sentence in the documentation is "[...] NOTIFY channel is invoked, either by this session or another one connected to the same database, all the sessions currently listening on that notification channel are notified, and each will in turn notify its connected client application." [0]

[0] https://www.postgresql.org/docs/9.4/sql-listen.html


The LISTEN and NOTIFY don't need to be (and shouldn't be) on the same connection. In fact, many libraries require a separate dedicated connection to LISTEN on.

One feature of PubSub that is also a caveat is that it is transactional. That is great for production code, but makes transactional tests impossible.


Why not just set up multiple test machines and have each machine run a test, or write the tests such that you use multiple transactions?


I use logical replication with a modified version of Debezium (https://debezium.io) - I prefer it to NOTIFY/LISTEN as it means I can have my consumers completely decoupled.


Debezium is an awesome project. I use it to maintain an audit log and use it as the event source for downstream microservices. It's also handy for building data analytics or warehousing pipelines.

The best thing is that it can also track deletes and updates unlike simple polling of the database using select queries.


They do not need to be on the same connection, but LISTEN will need its own dedicated connection that is kept alive; it shouldn't be drawn from the connection pool. Very useful if you want messages sent only, and exactly, when a transaction is committed.
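In plain SQL terms (channel name and payload are made up):

    -- Session A: a long-lived, dedicated connection
    LISTEN cache_invalidation;

    -- Session B: any ordinary (pooled) connection
    NOTIFY cache_invalidation, 'users:42';
    -- Session A's driver receives the payload asynchronously, but only
    -- after session B's transaction commits.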


Let's say you have apps that cache rows with the pk as the cache key. You could notify your app servers that you changed a row and they should drop it from their cache... for instance.


Or tell the web server to push out the updated version of the record to browsers via web sockets. Very useful feature.


It works pretty well. I use NOTIFY in triggers and have various Node.js processes listen for changes. It's a lot easier to leverage the feature when you use plv8.



Should also add advisory locks to this list. They are a little known but incredibly useful feature.


What for? A lock doesn't sound that useful if it's only advisory.


I suggest you go look at https://www.postgresql.org/docs/9.4/explicit-locking.html.

TLDR as you didn't even bother to Google: Allows you to create application defined locks. Useful for building job queues and other systems where row locks aren't ideal.
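A quick sketch (the key 42 is arbitrary; any bigint, or a pair of int4s, identifies the lock):

    -- Session-level: held until explicitly unlocked or the session ends.
    SELECT pg_try_advisory_lock(42);  -- true for one session, false for the rest
    SELECT pg_advisory_unlock(42);

    -- Transaction-level: released automatically at commit/rollback.
    BEGIN;
    SELECT pg_advisory_xact_lock(42);
    -- ...do the work only one session at a time should do...
    COMMIT;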


> application defined locks

Ah, that makes more sense. Thanks!


I've seen 3 or 4 of these posts over the last few months, and they all seem to share about half their items. At this point do we need more posts calling attention to FDW and Table Inheritance as things nobody's using/heard of? They're certainly nifty, but I'd like more deep dives on PG internals and tuning to gain a bit more publicity on HN.


Now I'm curious: how far can you get in writing a simple REST API using just PostgreSQL?



How about UPDATE LIMIT 1 instead of fancy edge case features?


I'm curious why you need this? As a safety mechanism?

You could just do a select with the limit then update that row from the primary key...
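That workaround looks roughly like this (a sketch; table and column names are made up), since Postgres has no UPDATE ... LIMIT:

    -- Pick one matching row in a subquery, then update it by its key.
    UPDATE jobs
    SET status = 'claimed'
    WHERE id = (
      SELECT id
      FROM jobs
      WHERE status = 'pending'
      LIMIT 1
    );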


Sure, with a Turing-complete system everything is a "you could just". It seems a lot easier to UPDATE ... LIMIT 1 than to involve a subquery. If that is what GP meant it for, of course.


UPDATE LIMIT 1 is an edge case



