The many faces of DISTINCT in Postgres (2017) (hakibenita.com)
187 points by rsecora on May 22, 2023 | 70 comments



> A classic job interview question is finding the employee with the highest salary in each department.

Here’s a cheat code, in case you need to write a database query and don’t remember all the fancy join tricks: Problems like this often have a very straightforward solution with subselects. In this case, the main select gets the departments, and a subselect (with limit 1) fetches the top employee for each department.

That’s a very natural, compositional way of thinking. Granted, it’s not the most optimized way to do it, but more often than not, the resulting query plan is perfectly fine.
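For example, a minimal sketch (assuming hypothetical emp(name, salary, department) and dep(name) tables; a scalar subselect can only return one column):

    SELECT d.name AS department,
           (SELECT e.name
            FROM emp e
            WHERE e.department = d.name
            ORDER BY e.salary DESC
            LIMIT 1) AS top_earner
    FROM dep d;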


> It’s not the most optimized way to do it, but more often than not, the query plan is perfectly fine.

Depending on the details, the query plan might be exactly the same. I remember once (in SQL Server 2000, so a long time ago) writing the same query in 3 completely different ways, only to see the DB generate the same plan for each one.

It would surprise most people what sophisticated query transformations databases can do.


I had one case, with the same SQL Server, where rewriting to a join made the query much faster. That was a data-changing statement though, and I think the semantics demanded it rerun the subselect.

More recently, I've found that MySQL sometimes behaves poorly with subselects, so I tend to avoid them out of habit.


> Granted, it’s not the most optimized way to do it, but more often than not, the resulting query plan is perfectly fine.

It usually is exactly the same. (MySQL has a potato for a query optimizer tho.)

The biggest case where it's suboptimal is when you need to produce multiple fields from the sub-select, because then you need multiple sub-selects.


Also easy to do this with a lateral join (and a limit 1 in the subquery).
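Roughly like this (a sketch, assuming hypothetical emp(name, salary, department) and dep(name) tables):

    SELECT d.name AS department, t.name, t.salary
    FROM dep d
    CROSS JOIN LATERAL (
        SELECT name, salary
        FROM emp
        WHERE emp.department = d.name
        ORDER BY salary DESC
        LIMIT 1
    ) t;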


If you can do a lateral join off the top of your head this isn't the kind of question you need to have a strategy for anyway.


I think everyone tends to learn their own subset of SQL. I was unaware of some of the more 'obvious' solutions in the article (e.g. I had only a very vague idea of how window functions work), but I use lateral joins all the time.

The nice thing about lateral joins is that they're very easy to understand conceptually once you get over the weird syntax.


How would you explain them to someone who knows the syntax but still finds them counterintuitive?


To me a lateral join is the simplest construct you could have, like a for loop in a programming language.

For each row in my query so far, do this other thing.

Since I learned them, everything went from being a puzzle on its own ("I know what I want to do, but how do I express it in SQL?") to simply writing things out naturally.


TIL lateral joins exist, and I struggled to understand what exactly they did because for some reason all the blog posts and whatnot were so convoluted. Then I found this[1] SO answer, which lit my bulb.

We're using such subqueries a lot, often for many columns in the same child table, and we're moving to MSSQL, so we will definitely change to using lateral joins (or CROSS APPLY, as MSSQL calls it).

[1]: https://stackoverflow.com/a/28550962


Sadly, it seems that CROSS APPLY in MSSQL can be slow. Not sure why, but it was ~3x slower for typical queries we have. Oh well...


I just think of it as performing a subquery for every row of the main query.


Fwir, BigQuery doesn't yet support lateral joins, though they do use the same concept in their UNNEST and array expansion features.

It would be nice to have this feature; it's a handy hammer for the right type of nail.


> Granted, it’s not the most optimized way to do it

I've discovered a small number of queries that actually do become faster when switching a column to a subselect.


Sorry, how would you write this query just with LIMIT 1 and not window functions or MAX or a join?


100% agree. If it's a one off SQL query I'll use subselects 99% of the time.

If it's production code and important for performance, I'll write a prototype with subselects and then rewrite it with the proper joins.

It's just so much faster and more natural to think in subselects (IMHO).


The author mentions having to get over the lack of upsert when moving from Oracle. But readers might like to know this isn’t a problem anymore since Postgres got “INSERT … ON CONFLICT … DO UPDATE”.
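For example, a sketch assuming a hypothetical emp table with a unique constraint on name:

    INSERT INTO emp (name, department, salary)
    VALUES ('Jane', 'Support', 60000)
    ON CONFLICT (name)
    DO UPDATE SET department = excluded.department,
                  salary     = excluded.salary;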


Oh I'll add to the pile: the Postgres ON CONFLICT/MERGE stuff doesn't capture the "select or insert if row doesn't exist" behavior some folks might want to use an upsert type thing for.

Why? You cannot access the row id or any values within the row as a return value unless the row changes. The top stackexchange answer goes into some detail about why you really don't want to simply make a dummy update in Postgres.

You can access these old row values in some scope for the no-update case, and this allows you to leverage CTEs instead of dummy updates. However… you can't use CTEs with prepared statements (and it's a non-obvious fail for the uninitiated). So yeah there are still lots of little sharp edges.


It's true.

Fortunately the records are locked, so a second query will pick them up. Still sad tho.

Prepared statements work fine with CTEs.


  Prepared statements work fine with CTEs.
My experience was that this was not the case. Basically the CTE was materialized and the variables were resolved as part of PREPARE, so each invocation got the same values. I couldn't find the blurb I'd read previously but I did a little digging today and it looks like maybe NOT MATERIALIZED would fix the problem?


CTEs used to be an optimization fence, with or without prepared statements.

PG 12 changed that.


> "select or insert if row doesn't exist" behavior

I've hit this many, many times. It's such a common need, and yet I keep having to write it out as multiple queries: select the row and, if there's no result, then do the insert query.
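Something like this, with a hypothetical tags(id, name) table where name is unique:

    INSERT INTO tags (name)
    VALUES ('postgres')
    ON CONFLICT (name) DO NOTHING
    RETURNING id;

    -- if that returned no row (the tag already existed), follow up with:
    SELECT id FROM tags WHERE name = 'postgres';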


I am a huge Postgres fan, but “ON CONFLICT UPDATE” does not always cut the mustard. You can’t target conflicts against multiple constraints, so if there are multiple constraints that could cause conflicts you’re stuck either figuring out whether you can drop a constraint or doing something clever in your client.

Needing that is often a smell, but it’s still a little inflexible.


In addition to this, `ON CONFLICT` is still susceptible to race conditions if you have a highly concurrent application. This may be obvious to some; however, a lot of people expect upsert to be atomic and get bitten by this.

I also dislike that `ON CONFLICT ... DO NOTHING` increments numerical primary keys. I understand why it happens but it seems counter-intuitive given the name "DO NOTHING". (And, yes, relying on the values of primary keys to never change is an anti-pattern. However, if you have a table of 10 values and the primary keys have massive gaps like 1, 500_211, 2_521_241, 15_631_121, etc., it feels weird nonetheless.)
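A quick illustration with a hypothetical table:

    CREATE TABLE t (id bigserial PRIMARY KEY, name text UNIQUE);
    INSERT INTO t (name) VALUES ('a') ON CONFLICT (name) DO NOTHING;  -- inserts id 1
    INSERT INTO t (name) VALUES ('a') ON CONFLICT (name) DO NOTHING;  -- no row, but nextval() already ran
    INSERT INTO t (name) VALUES ('b') ON CONFLICT (name) DO NOTHING;  -- inserts id 3, not 2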


From the Postgres docs,

ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This is also known as UPSERT — “UPDATE or INSERT”.

https://www.postgresql.org/docs/current/sql-insert.html

What are you referring to?


If I recall correctly (and it has been a while, so I'm not saying I am), the issue was with concurrent transactions inserting a record into tableA and another into tableB which has a foreign key constraint to tableA. The issue was likely specific to `ON CONFLICT DO NOTHING` and not `ON CONFLICT DO UPDATE`.

For example, let's say you're building an index of open source packages and have two tables: package_type(id, name) and package(id, type_id, namespace, name).

If you receive two concurrent requests for `maven://log4j:log4j` and `maven://io.quarkus:quarkus`, a naive implementation to insert both "maven" and the packages if they don't exist might look something like this:

   WITH type AS (
     INSERT INTO package_type (name)
     VALUES (:type)
     ON CONFLICT DO NOTHING
     RETURNING id
   )
   INSERT INTO package (type_id, namespace, name)
   SELECT id, :namespace, :name
   FROM type
   ON CONFLICT DO NOTHING;
However, one or both inserts can fail intermittently because the primary key for `package_type` will be auto-incremented and thus the foreign key won't be valid. Also, as mentioned in another comment[0], this won't work if `maven` already exists in the `package_type` table.

[0]: https://news.ycombinator.com/item?id=36031790


Atomicity doesn't mean "doesn't fail", it means "either fails or succeeds, but does not succeed halfway".

There is nothing about what you are describing that is different from the behavior you'd get from a regular insert or update. If two transactions conflict, a rollback will occur. That isn't violating atomicity. In fact, it is the way by which atomicity is guaranteed.

The behavior of sequence values getting incremented and not committed, resulting in gaps in the sequence, is a separate matter, not specific to Postgres or to upsert.


I don't think this has to do with concurrency; the query is fundamentally broken, in that the CTE won't return anything if the package_type already exists.

You have two options:

(1) ON CONFLICT DO UPDATE, with dummy update:

    WITH
      type AS (
        INSERT INTO package_type (name)
        VALUES ($1)
        ON CONFLICT (name) DO UPDATE SET name = excluded.name
        RETURNING id
      )
    INSERT INTO package (type_id, namespace, name)
    SELECT id, $2, $3
    FROM type
    ON CONFLICT (type_id, namespace, name) DO UPDATE SET name = excluded.name
    RETURNING id;
(2) Separate statements with ON CONFLICT DO NOTHING (could be in a UDF if desired):

    INSERT INTO package_type (name)
    VALUES ($1)
    ON CONFLICT DO NOTHING;

    INSERT INTO package (type_id, namespace, name)
    SELECT id, $2, $3
    FROM package_type
    WHERE name = $1
    ON CONFLICT DO NOTHING;

    SELECT p.id
    FROM package p
    JOIN package_type t ON t.id = p.type_id
    WHERE (t.name, p.namespace, p.name) = ($1, $2, $3);


I think the actual root of the problem, as other folks have noted, is that `ON CONFLICT DO NOTHING` means the RETURNING clause returns no rows if there is a conflict, which in my experience is rarely what people want. So instead people do `ON CONFLICT DO UPDATE` with a no-op update, which has performance/locking implications, or they need to do a complicated query (search Stack Overflow).

I wish that postgres would add some sort of backwards compatible option like `ON CONFLICT DO NOOP` or `ON CONFLICT DO RETURN` so that you got the semantics of `DO NOTHING` except that the conflicted rows are returned.
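For reference, the usual CTE workaround looks something like this (a sketch with a hypothetical tags(id, name) table, name unique; it can still miss rows under some concurrent interleavings):

    WITH ins AS (
        INSERT INTO tags (name)
        VALUES ('postgres')
        ON CONFLICT (name) DO NOTHING
        RETURNING id
    )
    SELECT id FROM ins
    UNION ALL
    SELECT id FROM tags WHERE name = 'postgres'
    LIMIT 1;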


It increments the sequence associated with a "serial" or "bigserial" field, usually used for primary keys. People are often surprised by this because they expect their primary keys to be sequential, and sequences are allowed to leave gaps in order to avoid a lot of locking.


That just seems like a fundamental misunderstanding of sequences to me. They are guaranteed to be increasing but not necessarily sequential.


I think it's more confusion of how `ON CONFLICT DO NOTHING` works. Incrementing sequences is not "doing nothing", even if it's valid and sensible behaviour, which leads to confusion.


In fairness to Postgres, “DO NOTHING” is accurate within the SQL domain if you squint. It’s reasonable for queries to have side-effects on the underlying implementation as long as the invariants defined by SQL are met. Surprising, yes, but the alternative would be leaking implementation details in the syntax.

On the other hand, it’s so easy to use and rely on autoincrement IDs and to assume they are monotonic and predictable based on naive testing. If we could do it all again I’d fuzz the IDs even during normal operation (occasionally increment by a random extra amount) so that they don’t become such a foot gun for developers.


It's as resilient to race conditions as it could be. Does exactly what you'd expect, no?

That is annoying about sequences, I agree.


> It's as resilient to race conditions *as it could be*. Does exactly what you'd expect, no?

While that may be true, based on the hours I've spent scouring Stack Overflow and GitHub issues it seems that many people don't realize that it's only 99.999% resilient.


It is 100% resilient short of your hard drive going through a microwave. It is almost definitely an application logic error if you are observing non-atomicity. What would non-atomicity even look like here? Only some columns being written?


Could you describe what you mean? What phenomenon? For the default transaction isolation level, or a different one?


> You can’t target conflicts against multiple constraints,

I've never needed that, and it's not clear why I would.

Like, my new record is a duplicate of record A according to one constraint and record B according to another constraint, so it updates both A and B?

One "upsert" but two resulting records?

Can someone fill me in on when this would crop up?


In my experience this tends to be an annoyance when you want to upsert against a table that has multiple unique constraints.

For example, let's say you're tracking GitHub repositories and have a table `repository(id, gh_id, gh_node_id, owner, name, description)` where `gh_id` and `gh_node_id` are both unique.

If you want to insert or update a repository you might want to do something like below; however, this is not valid syntax, as you need to define a separate `DO UPDATE` for `gh_id` and `gh_node_id`:

    INSERT INTO repository (gh_id, gh_node_id, owner, name, description)
    VALUES (:id, :node_id, :owner, :name, :description)
    ON CONFLICT DO UPDATE
       SET name = excluded.name, 
           description = excluded.description;

------

To my knowledge there's no way to define a single constraint `UNIQUE(gh_id || gh_node_id)` instead of `UNIQUE(gh_id && gh_node_id)`.


So... what are gh_id and gh_node_id?

Like, you have

    gh_id | gh_node_id
    ------------------
      1   |     2    
      2   |     1
And then you want to "upsert" (1, 1).

You want both records to be updated?

I'm still struggling to see why you would want to do a upsert relative to multiple unique constraints at once.


Ah, I'd run into the lack of multi constraint ON CONFLICT in postgres yesterday and was wondering about the omission. I'd not considered this case and GP's code smell comment makes sense now.

In my specific case, business logic dictates uniqueness of both columns is tied together i.e. if an incoming tuple has a value (1, 2) today, all subsequent expected tuples with `gh_id` 1 will also have `gh_node_id` of 2. What's the best practice to model this constraint?


Then upsert based on gh_id.

I'm not saying that multiple unique identifiers is a code smell; I am claiming that an upsert is only sensibly done in the context of one constraint.


Yes, that's been a thing since 9.5, which was released in early 2016.


And MERGE


I've never gotten into DISTINCT ON and it's always confused me; I'd rather see the query with MAX and GROUP BY if I'm looking for "the highest X in groups of Y". For the same reason I don't prefer RSA-in-three-lines.
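i.e., something like this (a sketch against a hypothetical emp(name, salary, department) table; note ties return multiple rows):

    SELECT e.department, e.name, e.salary
    FROM emp e
    JOIN (
        SELECT department, MAX(salary) AS salary
        FROM emp
        GROUP BY department
    ) m ON (e.department, e.salary) = (m.department, m.salary);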


I’ve never truly felt comfortable using DISTINCT. It always felt like using it revealed a design smell in my query. (Been doing SQL for 30 years and still!)


Yes, that's exactly the way I feel. It might be the right answer, but oftentimes it's a smell.

There are very few cases where you shouldn't be doing a GROUP BY with aggregates instead. (Window functions count here.)


I only learned about the ranked query approach last week thanks to ChatGPT. It helped me solve a hairy query that rolled up activity events grouped by time periods. Before that I was struggling with DISTINCT (and it was slow).

I’ve avoided ChatGPT until recently, and at least for SQL refactoring, it’s great. The interesting part is that the ranked example ChatGPT gave me was almost identical to the one in this post. I wonder if they’re training it on technical blog posts.
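For reference, the ranked pattern looks roughly like this (a sketch against a hypothetical emp(name, salary, department) table):

    SELECT department, name, salary
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY department
                                  ORDER BY salary DESC) AS rn
        FROM emp
    ) ranked
    WHERE rn = 1;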


I throw it lines of SQL logs from Papertrail and ask it to extract the SQL statements. Then I give it the EXPLAIN ANALYZE output. Sometimes ChatGPT will give me great advice on optimizing.


Dang, that's a good idea (as long as it's not production data).

It seems like eventually database automation is going to be self-optimizing (more than the query planner already is).


"(easy CS students, I know it's not normalized…)"

Sure it is. As long as you're not storing any other information on the department in the employee table.


And as long as you never rename the departments.


That has no bearing on normalization.


It's an almost verbatim example of getting to 1NF, the first and most basic normalization. The value (department name) is repeating and should be extracted and given its own ID.


1NF bans relation-valued attributes, not repetition of attribute value across tuples in a relation. Mainstream SQL databases don't support relation-valued attributes, so any table you make in a relational database is 1NF.

You can push back on this a little - for instance maybe you consider an array-valued attribute to be enough like a relation to argue array-valued attributes violate 1NF. But if you do that you must also explain what makes arrays different from strings, since strings are pretty similar to arrays of characters and can be treated the same way in most respects (for instance characters in a string can be addressed by index or split into substrings).


Arguably it’s a structured vs unstructured data issue


> The value (department name) is repeating and should be extracted

Then the id would be repeating. Furthermore, the department name would make a fine primary or alternate key for the new relation you're proposing.

Also, that's not what 1NF is. 1NF means there should be no table-valued attributes. And no column here is list-valued, nor does any subset of columns form a subtable.

The other normal forms talk about functional dependencies and there aren't any.

The only possible violation of 1NF could be not splitting the name in given name and family name. Other than that, the table is normalized.


> Then the id would be repeating.

Yes, but it's an ID, not a value. No problems there.

> Furthermore, the department name would make a fine primary or alternate key for the new relation you're proposing.

They're called "natural keys" and there are a lot of problems with them not actually mapping to identity. For one example, if a department's name is changed, it isn't just a change to a database column; you have to update all associated code as well, which is why you should have an ID and use that in the code anyway.

> Also, that's not what 1NF is. 1NF means there should be no table-valued attributes. And neither is any column list-valued nor does any subset of columns form a subtable.

If there's only one such column and you switch perspective to the inner table, the transformation is the same. That's why I said "almost" - it's not exactly the same, there are additional conditions, but if you hit them it's the same thing and the result is extracting the duplicate values into their own table, linking them with an ID instead.


> Yes, but it's an ID, not a value.

In normalization theory and relational algebra an ID is just a value. DBMSs make no distinction between this column and any other primary key column(s), and they make no distinction between PK indexes over strings, numbers, or any other supported data type.

You're just cargo culting here instead of applying database theory or actually looking at implementations.


Only if there's a separate Departments table, which for this very simple example, there isn't


Even if there were a separate Departments table, who is to say that the Department Name is not its primary key? The Department name certainly is A key.


The author means it should be a department_id, with a table mapping that ID to a name, to be normalized the way a student is taught.


Microsoft SQL Server now also has IS [NOT] DISTINCT FROM. https://learn.microsoft.com/en-us/sql/t-sql/queries/is-disti...
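Handy for NULL-safe comparisons, e.g. (a sketch with hypothetical tables a and b):

    -- plain `=` yields NULL when either side is NULL;
    -- IS NOT DISTINCT FROM treats two NULLs as equal
    SELECT *
    FROM a
    JOIN b ON a.val IS NOT DISTINCT FROM b.val;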


MSSQL does have something closer to array agg.

https://www.mssqltips.com/sqlservertip/5542/using-for-xml-pa...
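The classic pre-STRING_AGG idiom from that link looks roughly like this (a sketch with hypothetical emp/dep tables; XML-escaping caveats apply):

    SELECT d.name,
           STUFF((SELECT ',' + e.name
                  FROM emp e
                  WHERE e.department = d.name
                  FOR XML PATH('')), 1, 1, '') AS employees
    FROM dep d;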



> We can immediately see that everyone in the support department are making the same salary.

Bonnie Robertson is a thriving 10X support high earner


2017


I was thinking to myself "there is no way Haki Benita has just discovered `distinct on`".


And still helpful as ever.


Oracle's LISTAGG does support DISTINCT, since 19c.
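e.g., a sketch against a hypothetical emp table:

    SELECT department,
           LISTAGG(DISTINCT name, ', ') WITHIN GROUP (ORDER BY name) AS names
    FROM emp
    GROUP BY department;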



