The many faces of DISTINCT in Postgres (2017) (hakibenita.com)
187 points by rsecora on May 22, 2023 | 70 comments



> A classic job interview question is finding the employee with the highest salary in each department.

Here’s a cheat code, in case you need to write a database query and don’t remember all the fancy join tricks: Problems like this often have a very straightforward solution with subselects. In this case, the main select gets the departments, and a subselect (with limit 1) fetches the top employee for each department.

That’s a very natural, compositional way of thinking. Granted, it’s not the most optimized way to do it, but more often than not, the resulting query plan is perfectly fine.
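For example, a minimal sketch (assuming hypothetical emp(name, salary, department) and dep(name) tables; a scalar subselect can only return one column):

    SELECT d.name AS department,
           (SELECT e.name
            FROM emp e
            WHERE e.department = d.name
            ORDER BY e.salary DESC
            LIMIT 1) AS top_earner
    FROM dep d;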


> It’s not the most optimized way to do it, but more often than not, the query plan is perfectly fine.

Depending on the details, the query plan might be exactly the same. I remember once (in SQL Server 2000, so a long time ago) writing the same query in 3 completely different ways, only to see the DB generate the same plan for each one.

It would surprise most people what sophisticated query transformations databases can do.


I had one case, with the same SQL Server, where rewriting to a join made the query much faster. That was a data-changing statement though, and I think the semantics demanded it rerun the subselect.

More recently, I've found that MySQL sometimes behaves poorly with subselects, so I tend to avoid them out of habit.


> Granted, it’s not the most optimized way to do it, but more often than not, the resulting query plan is perfectly fine.

It usually is exactly the same. (MySQL has a potato for a query optimizer tho.)

The biggest case where it's suboptimal is when you need to produce multiple fields from the sub-select, because then you need multiple sub-selects.


Also easy to do this with a lateral join (and a limit 1 in the subquery).
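Roughly like this (a sketch, assuming hypothetical emp(name, salary, department) and dep(name) tables):

    SELECT d.name AS department, t.name, t.salary
    FROM dep d
    CROSS JOIN LATERAL (
        SELECT name, salary
        FROM emp
        WHERE emp.department = d.name
        ORDER BY salary DESC
        LIMIT 1
    ) t;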


If you can do a lateral join off the top of your head this isn't the kind of question you need to have a strategy for anyway.


I think everyone tends to learn their own subset of SQL. I was unaware of some of the more 'obvious' solutions in the article (e.g. I had only a very vague idea of how window functions work), but I use lateral joins all the time.

The nice thing about lateral joins is that they're very easy to understand conceptually once you get over the weird syntax.


How would you explain them to someone who knows the syntax but still finds them counterintuitive?


To me a lateral join is the simplest construct you could have, like a for loop in a programming language.

For each row in my query so far, do this other thing.

Since I learned them, everything went from being a puzzle on its own ("I know what I want to do, but how do I express it in SQL?") to simply writing things out naturally.


TIL lateral joins exist, and I struggled to understand what exactly they did because for some reason all the blog posts and whatnot were so convoluted. Then I found this[1] SO answer, which lit my bulb.

We're using such subqueries a lot, often for many columns in the same child table, and we're moving to MSSQL, so we will definitely change to using lateral joins (or CROSS APPLY, as MSSQL calls it).

[1]: https://stackoverflow.com/a/28550962


Sadly, it seems that CROSS APPLY in MSSQL can be slow. Not sure why, but it was ~3x slower for typical queries we have. Oh well...


I just think of it as performing a subquery for every row of the main query.


Fwir, BigQuery doesn't yet support lateral joins, though they do use the same concept in their UNNEST and array expansion features.

It would be nice to have this feature; it's a handy hammer for the right type of nail.


> Granted, it’s not the most optimized way to do it

I've discovered a small number of queries that actually do become faster when switching a column to a subselect.


Sorry, how would you write this query just with LIMIT 1 and not window functions or MAX or a join?


100% agree. If it's a one off SQL query I'll use subselects 99% of the time.

If it's production code and important for performance, I'll write a prototype with subselects and then rewrite it with the proper joins.

It's just so much faster and more natural to think in subselects (IMHO).


The author mentions having to get over the lack of upsert when moving from Oracle. But readers might like to know this isn’t a problem anymore since Postgres got “INSERT … ON CONFLICT … DO UPDATE”.
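For example, a sketch assuming a hypothetical emp table with a unique constraint on name:

    INSERT INTO emp (name, department, salary)
    VALUES ('Jane', 'Support', 60000)
    ON CONFLICT (name)
    DO UPDATE SET department = excluded.department,
                  salary     = excluded.salary;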


Oh I'll add to the pile: the Postgres ON CONFLICT/MERGE stuff doesn't capture the "select or insert if row doesn't exist" behavior some folks might want to use an upsert type thing for.

Why? You cannot access the row id or any values within the row as a return value unless the row changes. The top stackexchange answer goes into some detail about why you really don't want to simply make a dummy update in Postgres.

You can access these old row values in some scope for the no-update case, and this allows you to leverage CTEs instead of dummy updates. However… you can't use CTEs with prepared statements (and it's a non-obvious fail for the uninitiated). So yeah there are still lots of little sharp edges.


It's true.

Fortunately the records are locked, so a second query will pick them up. Still sad tho.

Prepared statements work fine with CTEs.


  Prepared statements work fine with CTEs.
My experience was that this was not the case. Basically the CTE was materialized and the variables were resolved as part of PREPARE, so each invocation got the same values. I couldn't find the blurb I'd read previously but I did a little digging today and it looks like maybe NOT MATERIALIZED would fix the problem?


CTEs used to be an optimization fence, with or without prepared statements.

PG 12 changed that.


> "select or insert if row doesn't exist" behavior

I've hit this many, many times. It's such a common need, and yet I keep having to write it out as multiple queries: select the row and, if there's no result, then do the insert query.
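Something like this, with a hypothetical tags(id, name) table where name is unique:

    INSERT INTO tags (name)
    VALUES ('postgres')
    ON CONFLICT (name) DO NOTHING
    RETURNING id;

    -- if that returned no row (the tag already existed), follow up with:
    SELECT id FROM tags WHERE name = 'postgres';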


I am a huge Postgres fan, but “ON CONFLICT UPDATE” does not always cut the mustard. You can’t target conflicts against multiple constraints, so if there are multiple constraints that could cause conflicts you’re stuck either figuring out whether you can drop a constraint or doing something clever in your client.

Needing that is often a smell, but it’s still a little inflexible.


In addition to this, `ON CONFLICT` is still susceptible to race conditions if you have a highly concurrent application. This may be obvious to some; however, a lot of people expect upsert to be atomic and get bitten by this.

I also dislike that `ON CONFLICT ... DO NOTHING` increments numerical primary keys. I understand why it happens but it seems counter-intuitive given the name "DO NOTHING". (And, yes, relying on the values of primary keys to never change is an anti-pattern. However, if you have a table of 10 values and the primary keys have massive gaps like 1, 500_211, 2_521_241, 15_631_121, etc., it feels weird nonetheless.)
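A quick illustration with a hypothetical table:

    CREATE TABLE t (id bigserial PRIMARY KEY, name text UNIQUE);
    INSERT INTO t (name) VALUES ('a') ON CONFLICT (name) DO NOTHING;  -- inserts id 1
    INSERT INTO t (name) VALUES ('a') ON CONFLICT (name) DO NOTHING;  -- no row, but nextval() already ran
    INSERT INTO t (name) VALUES ('b') ON CONFLICT (name) DO NOTHING;  -- inserts id 3, not 2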


From the Postgres docs,

ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — “UPDATE or INSERT”.

https://www.postgresql.org/docs/current/sql-insert.html

What are you referring to?


If I recall correctly (and it has been a while, so I'm not saying I am), the issue was with concurrent transactions inserting a record into tableA and another into tableB which has a foreign key constraint to tableA. The issue was likely specific to `ON CONFLICT DO NOTHING` and not `ON CONFLICT DO UPDATE`.

For example, let's say you're building an index of open source packages and have two tables: package_type(id, name) and package(id, type_id, namespace, name).

If you receive two concurrent requests for `maven://log4j:log4j` and `maven://io.quarkus:quarkus`, a naive implementation to insert both "maven" and the packages if they don't exist might look something like this:

   WITH type AS (
     INSERT INTO package_type (name)
     VALUES (:type)
     ON CONFLICT DO NOTHING
     RETURNING id
   )
   INSERT INTO package (type_id, namespace, name)
   SELECT id, :namespace, :name
   FROM type
   ON CONFLICT DO NOTHING;
However, one or both inserts can fail intermittently because the primary key for `package_type` will be auto-incremented and thus the foreign key won't be valid. Also, as mentioned in another comment[0], this won't work if `maven` already exists in the `package_type` table.

[0]: https://news.ycombinator.com/item?id=36031790


Atomicity doesn't mean "doesn't fail", it means "either fails or succeeds, but does not succeed halfway".

There is nothing about what you are describing that is different from the behavior you'd get from a regular insert or update. If two transactions conflict, a rollback will occur. That isn't violating atomicity. In fact, it is the way by which atomicity is guaranteed.

The behavior of sequence values getting incremented and not committed, resulting in gaps in the sequence, is a separate matter, not specific to Postgres or to upsert.


I don't think this has to do with concurrency; the query is fundamentally broken, in that the CTE won't return anything if the package_type already exists.

You have two options:

(1) ON CONFLICT DO UPDATE, with dummy update:

    WITH
      type AS (
        INSERT INTO package_type (name)
        VALUES ($1)
        ON CONFLICT (name) DO UPDATE SET name = excluded.name
        RETURNING id
      )
    INSERT INTO package (type_id, namespace, name)
    SELECT id, $2, $3
    FROM type
    ON CONFLICT (type_id, namespace, name) DO UPDATE SET name = excluded.name
    RETURNING id;
(2) Separate statements with ON CONFLICT DO NOTHING (could be in a UDF if desired):

    INSERT INTO package_type (name)
    VALUES ($1)
    ON CONFLICT DO NOTHING;

    INSERT INTO package (type_id, namespace, name)
    SELECT id, $2, $3
    FROM package_type
    WHERE name = $1
    ON CONFLICT DO NOTHING;

    SELECT p.id
    FROM package p
    JOIN package_type t ON t.id = p.type_id
    WHERE (t.name, p.namespace, p.name) = ($1, $2, $3);


I think the actual root of the problem, as other folks have noted, is that `ON CONFLICT DO NOTHING` means the RETURNING clause returns no rows if there is a conflict, which in my experience is rarely what people want. So instead people do `ON CONFLICT DO UPDATE` with a no-op update, which has performance/locking implications, or they need to do a complicated query (search Stack Overflow).

I wish that postgres would add some sort of backwards compatible option like `ON CONFLICT DO NOOP` or `ON CONFLICT DO RETURN` so that you got the semantics of `DO NOTHING` except that the conflicted rows are returned.
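For reference, the usual CTE workaround looks something like this (a sketch with a hypothetical tags(id, name) table, name unique; it can still miss rows under some concurrent interleavings):

    WITH ins AS (
        INSERT INTO tags (name)
        VALUES ('postgres')
        ON CONFLICT (name) DO NOTHING
        RETURNING id
    )
    SELECT id FROM ins
    UNION ALL
    SELECT id FROM tags WHERE name = 'postgres'
    LIMIT 1;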


It increments the sequence associated with a "serial" or "bigserial" field, usually used for primary keys. People are often surprised by this because they expect their primary keys to be sequential, and sequences are allowed to leave gaps in order to avoid a lot of locking.


That just seems like a fundamental misunderstanding of sequences to me. They are guaranteed to be increasing but not necessarily sequential.


I think it's more confusion of how `ON CONFLICT DO NOTHING` works. Incrementing sequences is not "doing nothing", even if it's valid and sensible behaviour, which leads to confusion.


In fairness to Postgres, “DO NOTHING” is accurate within the SQL domain if you squint. It’s reasonable for queries to have side-effects on the underlying implementation as long as the invariants defined by SQL are met. Surprising, yes, but the alternative would be leaking implementation details in the syntax.

On the other hand, it’s so easy to use and rely on autoincrement IDs and to assume they are monotonic and predictable based on naive testing. If we could do it all again I’d fuzz the IDs even during normal operation (occasionally increment by a random extra amount) so that they don’t become such a foot gun for developers.


It's as resilient to race conditions as it could be. Does exactly what you'd expect, no?

That is annoying about sequences, I agree.


> It's as resilient to race conditions *as it could be*. Does exactly what you'd expect, no?

While that may be true, based on the hours I've spent scouring Stack Overflow and GitHub issues it seems that many people don't realize that it's only 99.999% resilient.


It is 100% resilient short of your hard drive going through a microwave. It is almost definitely an application logic error if you are observing non-atomicity. What would non-atomicity even look like here? Only some columns being written?


Could you describe what you mean? What phenomenon? For the default transaction isolation level, or a different one?


> You can’t target conflicts against multiple constraints,

I've never needed that, and it's not clear why I would.

Like, my new record is a duplicate of record A according to one constraint and record B according to another constraint, so it updates both A and B?

One "upsert" but two resulting records?

Can someone fill me in on when this would crop up?


In my experience this tends to be an annoyance when you want to upsert against a table that has multiple unique constraints.

For example, let's say you're tracking GitHub repositories and have a table `repository(id, gh_id, gh_node_id, owner, name, description)` where `gh_id` and `gh_node_id` are both unique.

If you want to insert or update a repository you might want to do something like below; however, this is not valid syntax, as you need to define a separate `DO UPDATE` for `gh_id` and `gh_node_id`:

    INSERT INTO repository (gh_id, gh_node_id, owner, name, description)
    VALUES (:id, :node_id, :owner, :name, :description)
    ON CONFLICT DO UPDATE
       SET name = excluded.name, 
           description = excluded.description;

------

To my knowledge there's no way to define a single constraint `UNIQUE(gh_id || gh_node_id)` instead of `UNIQUE(gh_id && gh_node_id)`.


So... what are gh_id and gh_node_id?

Like, you have

    gh_id | gh_node_id
    ------------------
      1   |     2    
      2   |     1
And then you want to "upsert" (1, 1).

You want both records to be updated?

I'm still struggling to see why you would want to do a upsert relative to multiple unique constraints at once.


Ah, I'd run into the lack of multi constraint ON CONFLICT in postgres yesterday and was wondering about the omission. I'd not considered this case and GP's code smell comment makes sense now.

In my specific case, business logic dictates uniqueness of both columns is tied together i.e. if an incoming tuple has a value (1, 2) today, all subsequent expected tuples with `gh_id` 1 will also have `gh_node_id` of 2. What's the best practice to model this constraint?


Then upsert based on gh_id.

I'm not saying that multiple unique identifiers is a code smell; I am claiming that an upsert is only sensibly done in the context of one constraint.


Yes, that's been a thing since 9.5, which was released in early 2016.


And MERGE


I've never gotten into DISTINCT ON and it's always confused me; I'd rather see the query with MAX and GROUP BY if I'm looking for "the highest X in groups of Y". For the same reason I don't prefer RSA-in-three-lines.
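i.e., something like this (a sketch against a hypothetical emp(name, salary, department) table; note ties return multiple rows):

    SELECT e.department, e.name, e.salary
    FROM emp e
    JOIN (
        SELECT department, MAX(salary) AS salary
        FROM emp
        GROUP BY department
    ) m ON (e.department, e.salary) = (m.department, m.salary);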


I’ve never truly felt comfortable using DISTINCT. It always felt like using it revealed a design smell in my query. (Been doing SQL for 30 years and still!)


Yes, that's exactly the way I feel. It might be the right answer, but oftentimes it's a smell.

There are very few cases where you shouldn't be doing a GROUP BY with aggregates instead. (Window functions count here.)


I only learned about the ranked query approach last week thanks to ChatGPT. It helped me solve a hairy query that rolled up activity events grouped by time periods. Before that I was struggling with DISTINCT (and it was slow).

I’ve avoided ChatGPT until recently, and at least for SQL refactoring, it’s great. The interesting part is that the ranked example ChatGPT gave me was almost identical to the one in this post. I wonder if they’re training it on technical blog posts.
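For reference, the ranked pattern looks roughly like this (a sketch against a hypothetical emp(name, salary, department) table):

    SELECT department, name, salary
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY department
                                  ORDER BY salary DESC) AS rn
        FROM emp
    ) ranked
    WHERE rn = 1;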


I throw it lines of SQL logs from Papertrail and ask it to extract the SQL statements. Then I give it the EXPLAIN ANALYZE output. Sometimes ChatGPT will give me great advice on optimizing.


Dang, that's a good idea (as long as it's not production data).

It seems like eventually database automation is going to be self-optimizing (more than the query planner already is).


"(easy CS students, I know it's not normalized…)"

Sure it is. As long as you're not storing any other information on the department in the employee table.


And as long as you never rename the departments.


That has no bearing on normalization.


It's an almost verbatim example of getting to 1NF, the first and most basic normalization. The value (department name) is repeating and should be extracted and given its own ID.


1NF bans relation-valued attributes, not repetition of attribute value across tuples in a relation. Mainstream SQL databases don't support relation-valued attributes, so any table you make in a relational database is 1NF.

You can push back on this a little - for instance maybe you consider an array-valued attribute to be enough like a relation to argue array-valued attributes violate 1NF. But if you do that you must also explain what makes arrays different from strings, since strings are pretty similar to arrays of characters and can be treated the same way in most respects (for instance characters in a string can be addressed by index or split into substrings).


Arguably it’s a structured vs unstructured data issue


> The value (department name) is repeating and should be extracted

Then the id would be repeating. Furthermore, the department name would make a fine primary or alternate key for the new relation you're proposing.

Also, that's not what 1NF is. 1NF means there should be no table-valued attributes. And no column here is list-valued, nor does any subset of columns form a subtable.

The other normal forms talk about functional dependencies and there aren't any.

The only possible violation of 1NF could be not splitting the name in given name and family name. Other than that, the table is normalized.


> Then the id would be repeating.

Yes, but it's an ID, not a value. No problems there.

> Furthermore, the department name would make a fine primary or alternate key for the new relation you're proposing.

They're called "natural keys" and there are a lot of problems with them not actually mapping to identity. For one example, if a department's name is changed, it isn't just a change to a database column; you have to update all associated code as well, which is why you should have an ID and use that in the code anyway.

> Also, that's not what 1NF is. 1NF means there should be no table-valued attributes. And neither is any column list-valued nor does any subset of columns form a subtable.

If there's only one such column and you switch perspective to the inner table, the transformation is the same. That's why I said "almost" - it's not exactly the same, there are additional conditions, but if you hit them it's the same thing and the result is extracting the duplicate values into their own table, linking them with an ID instead.


> Yes, but it's an ID, not a value.

In normalization theory and relational algebra an ID is just a value. DBMSs make no distinction between this column and any other primary key column(s), and they make no distinction between PK indexes over strings, numbers, or any other supported data type.

You're just cargo culting here instead of applying database theory or actually looking at implementations.


Only if there's a separate Departments table, which for this very simple example, there isn't


Even if there were a separate Departments table, who is to say that the Department Name is not its primary key? The Department name certainly is A key.


The author means it should be a department_id, with a table mapping that ID to a name, to be normalized the way a student is taught.


Microsoft SQL Server now also has IS [NOT] DISTINCT FROM. https://learn.microsoft.com/en-us/sql/t-sql/queries/is-disti...
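Handy for NULL-safe comparisons, e.g. (a sketch with hypothetical tables a and b):

    -- plain `=` yields NULL when either side is NULL;
    -- IS NOT DISTINCT FROM treats two NULLs as equal
    SELECT *
    FROM a
    JOIN b ON a.val IS NOT DISTINCT FROM b.val;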


MSSQL does have something closer to array agg.

https://www.mssqltips.com/sqlservertip/5542/using-for-xml-pa...
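The classic pre-STRING_AGG idiom from that link looks roughly like this (a sketch with hypothetical emp/dep tables; XML-escaping caveats apply):

    SELECT d.name,
           STUFF((SELECT ',' + e.name
                  FROM emp e
                  WHERE e.department = d.name
                  FOR XML PATH('')), 1, 1, '') AS employees
    FROM dep d;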



> We can immediately see that everyone in the support department are making the same salary.

Bonnie Robertson is a thriving 10X support high earner


2017


I was thinking to myself "there is no way Haki Benita has just discovered `distinct on`".


And still helpful as ever.


Oracle's LISTAGG does support DISTINCT, since 19c.
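e.g., a sketch against a hypothetical emp table:

    SELECT department,
           LISTAGG(DISTINCT name, ', ') WITHIN GROUP (ORDER BY name) AS names
    FROM emp
    GROUP BY department;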



