> There was another interesting use case I wrote about a few years ago that showed how changing DISTINCT to GROUP BY – even though it carries the same semantics and produces the same results – can help SQL Server filter out duplicates earlier and have a serious impact on performance.
I recently learned this is also what Amazon recommends when querying Spectrum. [1]
This is such a pervasive issue with the relational SQL query model! The fact that joins always create brand new tables is powerful, but it's rarely what is needed.
The example in this article can manage to do without a join because it's only using the other tables to filter the customer table. The proposed solution breaks down as soon as a column needs to be brought in from the other tables (for example, "the total number of products purchased from the Beauty category"). A notion as simple as "I want to return a list of customers with additional properties" does not map to any dedicated SQL construct! You can JOIN then GROUP BY, but making sure that the result only has exactly one value per customer is not something you can explicitly say in your SQL, it's a consequence of how you set up your JOIN and GROUP BY keys, and it's both easy to make a mistake and hard to understand the intent.
> A notion as simple as "I want to return a list of customers with additional properties" does not map to any dedicated SQL construct! You can JOIN then GROUP BY, but making sure that the result only has exactly one value per customer is not something you can explicitly say in your SQL.
You can: you say it by grouping only on columns from the customers table, with those columns including a candidate key. Necessarily and equivalently, this means everything not from the customers table must only be referenced in an aggregate expression, but that's trivially what “I want a single row per customer, with some data not from the customers table” is asking.
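A minimal sketch of that shape (table and column names are illustrative, not taken from the article):

SELECT Customer.CustomerId, Customer.Name, SUM(Sales.Quantity) AS TotalQuantity
FROM Customer
INNER JOIN Sales ON Sales.CustomerId = Customer.CustomerId
GROUP BY Customer.CustomerId, Customer.Name
-- grouping on a candidate key of Customer guarantees at most one row per customer;
-- everything from Sales is only referenced inside an aggregate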
> it's a consequence of how you set up your JOIN and GROUP BY keys,
Well, it's a consequence of the GROUP BY, which is literally “what do you want one row for each of”, so... it's kind of weird to complain that it should be something else. GROUP BY is the dedicated construct in SQL that specifies the thing you are looking for.
SELECT Customer.Name, sum(Sales.Quantity)
FROM Customer
INNER JOIN Sales ON Customer.A = Sales.A
GROUP BY Customer.B
Can you tell whether this will return exactly one line for each customer? If INNER JOIN + GROUP BY were the dedicated construct to do so, then the answer would be "yes", because by definition it is the dedicated construct to do so. That's what a dedicated construct does: it is dedicated to doing that thing.
But both INNER JOIN and GROUP BY are much more versatile than that. In order to return exactly one line per customer, the following must be true: 1. column Customer.B must be a unique key of the Customer table (otherwise you'll get duplicate lines), and 2. each value of column Customer.A must also appear in column Sales.A (otherwise you'll have missing lines). Neither of these properties can be ascertained by looking at the query alone.
A dedicated construct would be something like (imaginary syntax):
SELECT Customer.Name, sum(Sales.Quantity)
FROM Customer
INNER JOIN Sales ON Customer.A = Sales.A
GROUP INTO Customer
> Can you tell whether this will return exactly one line for each customer?
I can tell it will return one row for each Customer.B.
If Customer.B is a candidate key of Customer, that will also be one row per Customer.
> If INNER JOIN + GROUP BY was the dedicated construct to do so
The dedicated construct to say what you want one row per is GROUP BY. Yes, it operates by columns, not tables, so what it means in table terms is schema-dependent.
> each value of column Customer.A must also appear in column Sales.A
Well, yes, that's what INNER JOIN means. The dedicated construct to ensure that every row from the first source (but not necessarily the second) is included in the result set before filtering by WHERE is LEFT [OUTER] JOIN.
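A sketch of the outer-join form of the same example (same hypothetical columns as above):

SELECT Customer.Name, SUM(Sales.Quantity) AS TotalQuantity
FROM Customer
LEFT OUTER JOIN Sales ON Customer.A = Sales.A
GROUP BY Customer.B, Customer.Name
-- customers with no matching Sales rows are kept, with TotalQuantity = NULL;
-- an INNER JOIN would drop them entirely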
As I read what you wrote, my first thought is that you don't have any such thing as a "customer" in that data model. You can ask for "one line for each customer.B" (which is what you're doing). But the idea that you can ask for "one line per customer" relies on some amount of non-db, domain knowledge.
If you can't define what a "customer" is via the information in your database alone, then you can't query based on "a customer". And if the answer is "each row in Customer with a unique B", then that's part of the definition and reasonable to use as knowledge in getting "one row per customer".
I didn't explain that well, I think... but that's the general thought that was running through my head as I read your writing.
I agree! This creates a situation where the SQL query does not represent the domain knowledge, but makes assumptions about it and cannot be understood without it. And while this will always be true for the more unusual parts of the domain, it is quite disappointing to be unable to properly represent as simple a concept as "what is a customer?" in SQL.
It's the same as a language forcing you to use `c & ~0x20` because it doesn't have a `Char.toUpper(c)` function. The code works (under the right assumptions) and produces the same result, but it does not convey the concept of converting a letter to uppercase.
What makes it so frustrating is that the DDL portion of SQL spends significant effort on representing such concepts in the database schema! I can create a Customers table, with its primary key, and its foreign keys into and from other tables, and so on. I can represent "these are all the customers" in DDL, I can represent "every sale must be associated to a customer", and so on. But after the first join, I'm no longer using the Customers table, I'm using a new in-memory relation with no primary or foreign keys, and the concept that "this is the customers table, but with extra fields" is something I need to keep track of in my head, instead of in the language.
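A sketch of the kind of DDL meant here (names illustrative):

CREATE TABLE Customers (
    CustomerID INT NOT NULL PRIMARY KEY,
    Name       VARCHAR(255) NOT NULL
);

CREATE TABLE Orders (
    OrderID    INT NOT NULL PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES Customers (CustomerID),
    OrderDate  DATE NOT NULL
);
-- "every order belongs to exactly one customer" is stated declaratively here, but the
-- relation produced by Customers JOIN Orders carries none of these declared keys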
In some sense, what you're asking for is straightforward with the tools we already have. It suffices to set a convention that each table always has a column named `id` which is the primary key for the table.
The harder part is how to enforce that within an organization. It sounds like you'd like technology to enforce it.
Existing tools already do the math part: you can set a constraint on a table so that the database maintains the primary key property and throws an error if a transaction would change the table in a way that violates the property.
What you're left to do is get everybody on board with the "every table has a primary key column named id" plan. Some syntactic sugar like GROUP INTO might help with that.
Looking at this again, I think the actual complaint isn't so much about base tables (though those were used in the illustration) but about intermediate derived relations created in deeply nested queries (or even regular views). Even though there may in effect be primary/unique keys on those, they aren't declared, and recognizing them depends on tacit knowledge. And because the functional dependencies aren't recognized by the DB engine, they can't be leveraged in GROUP BY to omit redundant non-key columns, so a GROUP BY needs to specify all the non-aggregate columns, with the domain understanding remaining opaque.
A primary key convention for base tables doesn't help with this; I also don't think the proposed GROUP INTO solves it, though it requires it to be solved first in order to work (i.e., unless you are only using it to GROUP INTO base tables rather than intermediate tables formed by arbitrary joins, it requires first having the engine infer keys for those intermediate tables, or providing a way of declaring such keys and having the engine validate them).
Honestly, I'm not sure what all this means. Maybe an example would help?
It sounds like there's an interest in the database inferring something subtle and making some kind of automated decisions based on that. Business stakeholders often make this kind of request - "can't an AI just figure all this out?" kind of thing. It often doesn't go anywhere because it's too far removed from the level of detail needed for a machine to automatically solve a problem.
First of all, all code requires domain knowledge to understand. Some domains are just simple. Business domains never are. Even with your upper case example, if you don’t know what upper case letters are, you are in the same position.
Second, data can be organized in infinite permutations and SQL has to accommodate that. People have been complaining about SQL since the dawn of time, but all proposed solutions only fix a subset of problems.
MySQL aside, your first query will end up as an error if Customer.Name is not functionally dependent on Customer.B. This highlights that there could be different customers with the same name and that it was a poor query to start with.
For number 2, that's exactly what inner join means; otherwise, use a left join. The first question someone should ask themselves is whether they want all customers or only the ones that have had sales.
Agreed - I think SQL is great, but the JOIN operator has been problematic for me with analytic queries and queries where you want to enrich data across different tables.
I think the problem is fundamentally that if you take a record, and split it across 2 tables, then there's no easy operation to merge them and guarantee distinctness.
Separately, even though there are some %-breakdowns in the article, I don't think the `DISTINCT` sort is so much of a problem in most cases. Usually the actual joining is the problem for any truly large dataset, so while I trust the diagrams shown, I think those %'s change dramatically based on the underlying data.
> I don't think the `DISTINCT` sort is so much of a problem in most cases
I train people new to SQL at work, and I can tell you with complete certainty that you only get this impression because people learn very quickly to never use `distinct`.
It's by far the most common cause of performance problems for people new to SQL.
But the example is exactly the type of query where I often see DISTINCT used where it shouldn't: they just want to know the customer with at least one order or the author who wrote at least one book or the user who upvoted at least one comment. I think that was the point of the post (not to never use DISTINCT or that this is the only example where anyone ever needs to join). DISTINCT isn't likely to be used in the more elaborate example you mention (total products from some category) because DISTINCT doesn't fix a problem that the necessary GROUP BY to get the answer has already prevented appearing anyway.
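For reference, a sketch of the rewrite being discussed, with illustrative names (the article's actual schema may differ):

SELECT c.CustomerID, c.CustomerName
FROM Customers AS c
WHERE EXISTS (
    SELECT 1
    FROM Orders AS o
    WHERE o.CustomerID = c.CustomerID
)
-- one row per customer with at least one order, and no duplicate-elimination step is needed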
If you don’t need the result from the joined table, a semijoin is often the fastest way. In the given example, if there was more data in the tables, it’s likely the planner would choose that. Not as familiar with MSSQL as MySQL and Postgres so I’m not positive.
Also, the good news is the planners are increasingly better at converting your desired end state into queries utilizing features like this, without you explicitly telling them to.
Well, SQL Server (seemingly the author's area of expertise) doesn't have a lateral join. You can simulate this to some extent using CROSS APPLY, but that wouldn't really be any better than EXISTS in the case like the example, where you don't actually need any data from the secondary table.
Nobody actually answered your question helpfully, so here goes:
When you write `SELECT DISTINCT a, b` you are asking for every unique pair (a, b) in the result. Specifically, the same value of `a` could appear multiple times in the result paired with different values of `b`.
The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
As a human, we look at this and think "Oh, but `a` is the primary key (in this case CustomerID) and `b` is a column in the same table (CustomerName) so we only need to be distinct over `a`, which means the query optimizer should be able to "push down" the DISTINCT clause into the scan of the joined tables". The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
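To make the shape concrete, the query under discussion presumably looks something like this (names illustrative):

SELECT DISTINCT c.CustomerID, c.CustomerName
FROM Customers AS c
INNER JOIN Orders AS o ON o.CustomerID = c.CustomerID
-- DISTINCT is over the pair (CustomerID, CustomerName); only the knowledge that
-- CustomerID is the primary key would let an optimizer reduce it to CustomerID alone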
At this level you really start getting into the nitty gritty of what optimizations a particular database has implemented though: the solution presented in the article may be the best way to write this for SQL Server, but this "correlated subquery" form could be catastrophically slow in other database engines...
> The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
This isn't quite right; there are two general ways. One is a hash table as you say, the other one is sorting. The second is quite relevant once you have a reasonable framework for interesting orders (e.g., scanning along an index, or doing sortahead on a smaller table with a foreign key that guarantees uniqueness).
> The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
There are absolutely query optimizers that know this (try such a query in MySQL, for instance).
Hum... If you know `a` is a key of table `X` and `b` is a key of table `Y`, you can replace a `select distinct` with a `join ... on exists ()` with complete certainty.
And this solves almost all of the problematic cases of this in practice. If the thing you are querying isn't a key, you will naturally think a bit further and not write the distinct.
I guess that kind of optimization just goes counter to the philosophy of SQL optimizers. I know that I would be deeply surprised if I found a database doing it, and not in a good way.
You're overthinking it - the optimization is absolutely possible as you've described. The reason databases haven't done it is not some question of design philosophy; it's simply that nobody bothered to implement it.
> I know that I would be deeply surprised if I found a database doing it, and not in a good way.
SQL engines do way more surprising optimizations than this. SQL is a "4th gen" language after all, meaning it's absolutely within the spirit that the database engine does whatever it wants to get the result.
There's no good reason why you couldn't rewrite DISTINCT to GROUP BY, but remember that it wouldn't work in all cases; you could have both. E.g. SELECT DISTINCT COUNT(*) FROM t1 GROUP BY a;. Similarly, HAVING and window functions happen after GROUP BY but before DISTINCT, so a simple move won't always be correct and you'd have to account for that.
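For instance, a mechanical rewrite of that example needs an extra query level; roughly (a sketch, not a general recipe):

SELECT cnt
FROM (
    SELECT COUNT(*) AS cnt
    FROM t1
    GROUP BY a
) AS grouped
GROUP BY cnt
-- equivalent to SELECT DISTINCT COUNT(*) FROM t1 GROUP BY a: the inner GROUP BY
-- produces one count per value of a, the outer one deduplicates those counts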
Tangentially related: IN and ANY can produce wildly different results [0] due to type casting. I have personally seen this, and at least for my data, the fastest query / fewest buffers scanned came from shifting the matching ids out to a CTE with VALUES and joining with IN. It wasn't a minor speedup, either; it was literally twice as fast.
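Roughly the shape described, with hypothetical table names and ids:

WITH matching_ids (id) AS (
    VALUES (101), (205), (307)  -- the ids of interest, inlined as a constant table
)
SELECT t.*
FROM some_table AS t
WHERE t.id IN (SELECT id FROM matching_ids)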
Has nothing to do with the language. 1 (in its isolated presentation) exists outside of my set of data and that's not something that gives me comfortable feelings. It forces me to think (less think more good) why some random value needs to float around my data and maybe needs to be appended or summed or whatever.. because reasons. As far as I'm concerned you might as well make it 200 or 1648194. And why would those ever be needed in any context around my data?
No need to be a dismissive dickhole about it, friend.
Maybe I'm not wording it right then? Like I said in other posts, I copy paste SQL statements all the time. If I were to copy/paste that statement then all I'd get is a bunch of ones and that's useless to me. The SQL itself is data to me in the same way that when I view a lazy list comprehension, I view it (and SQL) as something that's just waiting to be run. Maybe not now, and maybe not in the current SQL, but there's a non-zero chance that I'll copy/paste it. So in that context, a bunch of ones is useless to me and IS data because the ones are literally the output of the SQL statement. Better to generalize my code writing process in a way so that I can copy/paste a "select *" or "select rownum" because those are more useful down the line.
You've seriously missed the point if this is what you're still saying. Take a step back, breathe, and try to consider that you've missed something. Whether that's a point I've made, or a lack of perspective, I don't know. For example, we almost definitely work in different fields with different practices and reasons for doing things differently. And that's fine. But your continued dismissiveness isn't helpful. Like another post said, "select *" is helpful in data analytics work. If it's not helpful in your field, that's also fine. But for me and my colleagues, it is. And I promise you're wrong in your thinking of why I copy paste SQL. What a weird fucking conversation.
Also, calling someone obtuse is name calling, so pot kettle black and all that, ya obtuse weirdo.
Yes, and all those constants would be treated the same, just like in many languages things are conventions - you don't need to use _ to name a variable that you throw away, but you do, because conventions.
There's so many people that want to complain about the conventions of SQL but praise the conventions of other languages and I think its often just "which did you see first?"
You're missing the point that I don't really care about how they're treated. I want to glance at it and know what it does. For me, the 1 gets in the way. Maybe it doesn't get in the way for you because you don't use it the same way as I do, as a practice.
...but... this is literally how you write an EXISTS clause in SQL.
> you might as well make it 200 or 1648194
Yes, you can put anything there.
There's lots of funky inconsistent syntax in SQL because it evolved organically over literally the past half century. It has everything to do with the language.
Would it make you feel better about it to think of it as (or write it as) "select true"? Because that's essentially what it's communicating, that there is something to return and that's what we're checking for.
Nope, for the same exact reason. Select * makes the most sense because * is in the context of my data, not something I'm working into it. Pretty sure it's the same speed.
Also mind you, I use a lot of CTEs, so this would look weird in that context -- hence why using row number sometimes makes more sense and achieves the same thing.
Filter out everything, Project only what you need, Transform it as lightly as possible.
In any context I understand, a row number would never "make sense" if a constant of 1 would produce the same output; it would be a lot more code that does... nothing?
Any code using select * just breaks in the future with any new columns being added, no thanks.
Select * is pretty standard for wide table queries like they are used in Data Analytics teams regularly. Because you really want ALL columns, even new ones without going back to fix all 300+ dashboards.
you're getting downvoted because it's not really a preference, it's pretty widely known to be bad practice and unhygienic in production queries. select * would get your pr rejected and chewed out by dba at every place i've ever worked at. so you kinda just look like you don't know what you're talking about.
In an exists query select * is harmless, select 1 and select * result in the same execution plan at least in MS SQL.
In a query that returns result rows, it could break the query as soon as you add columns with names that already exist in other tables you joined in the query.
yes, it's harmless in this position but it provides no additional benefits to the select 1 idiom and is suggestive of poor query discipline. it's far easier to say just don't ever use select * in queries.
1. Given that "select *" is considered something to avoid except when necessary in edge cases
2. And "select 1" will accomplish the same goal
Anyone reading the "select *" version of the code will have to stop and consider whether it is using "select *" for a reason, because "select 1" would be the normal choice. Using "select *" is assumed to be conveying some intent (that isn't there) _because_ it's not the expected way to do it.
Sure, they do the same thing... but you have to stop and look at the second one to make sure you're understanding what it does and if there's some reason its weird.
I haven't downvoted anyone, but have followed this argument with interest as an intermediate SQL user.
If I was to guess why someone would downvote you, it wouldn't be for disagreeing with you, but more because you've subtly shifted from quite a strong objective stance ("this is not readable") to a subjective one ("this is not how I prefer to write it"), without really conceding anyone else's points.
I think my point makes more sense when you consider that I 1. don't work with production code (more analysis, ad hoc code in an investigatory capacity) and 2. that when I mention someone is "new" what I mean is someone actively learning and not from a technical background. IME, folk like that have a difficult time with that floating 1. So while yes it's a standard that programmers are familiar with, it's not something that someone new will be very comfortable with. Lots of people I work with come from a pandas-only background.
Not really conceding because, as far as I can see, everybody is coming from a position of familiarity.
That's totally fair. Perhaps the confusion could have been avoided by qualifying in your initial comment that you're referring to a specific situation i.e. not-too-technical analysts writing ad hoc code.
I didn't downvote you, but consider this - I work with SQL a lot, like a lot a lot. Something that's your code today is probably my code tomorrow.
So when you say "my flow is X" and your flow is inimical to maintaining it and extending it, people might get a bit irritated at the last dev that did the exact same thing.
Any situation where a new or otherwise unknown attribute breaks your code; binding is a big one.
* Say you are joining two tables and one now has a conflicting/duplicate name: surprise, you now have broken code, it literally will not execute with ambiguous duplicate references (see the sketch after this list).
* By the same token, downstream views can break for the same/similar reason.
* In some engines views are not going to actually include your columns until they are "refreshed"(SQL Server) so that one day that's out of band of your deployment your views will suddenly change.
* Say you have a report with specific headers - tada, it's now got whatever people add to the table - and sourcing it can be a pain because it's unclear where the source is in the query, requiring schema.
* Performance expectations can change if the data type is much larger, up to the point of actually breaking the client or consuming tens of billions of times more resources.
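A minimal sketch of that first failure mode, with hypothetical tables:

-- Works while only Orders has a Notes column:
SELECT o.OrderID, Notes
FROM Orders AS o
INNER JOIN Shipments AS s ON s.OrderID = o.OrderID
-- The day someone adds a Notes column to Shipments as well, the same statement
-- stops executing: "Ambiguous column name 'Notes'"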
Usually code refers to columns by name and additional columns are just ignored; the only case I know of is when you add a column to a joined table and the column name already exists in the other table, resulting in ambiguous column names.
I am talking about 'select *' at the top level; they can be harmless in exists, sub-selects and CTEs. The number of columns sent to the client (the program) will change when you add a column to the database. If you don't remember to change all of the places in your code where you used a 'select *', your program is likely going to fail or have unexpected results.
The relevant thing for exists is whether the set of rows returned by the select is empty or not.
The value in the columns is irrelevant. Using * creates the incorrect impression that the value is relevant and that the relevance is tied to the data.
Sounds like what you want is "bang" from Max, which is kind of like the opposite of null (the billion dollar mistake) but more existential, or dang (the hn moderator) but more explosive:
User:
Explain Max/MSP's use of bang.
ChatGPT:
Max/MSP, often simply referred to as "Max", is a visual programming language for music and multimedia. In Max, the fundamental building blocks are "objects" that are connected with "patch cords". These objects can process, generate, or manipulate various types of data, including audio, MIDI, and other control messages.
One of the most basic and essential objects in Max is the "bang". Visually, a bang is represented by a small button in the Max patcher window. Conceptually, a bang is an event trigger. When a bang is received by an object, it tells that object to do something, often to execute its primary function.
Here's a breakdown of how the bang is used in Max:
Triggering Actions: Many objects in Max perform an action when they receive a bang. For example, the metro object (a metronome) sends out bangs at regular intervals. When connected to other objects, these bangs can trigger various actions in a patch.
Flow Control: Bangs can be used to control the order of operations in a patch. The trigger object, often abbreviated as t, is particularly useful for this. It can output a sequence of bangs (and other data types) in a specific order.
User Interaction: The bang object can also be used as an interactive button in a user interface. When you click on it, it sends out a bang.
Feedback: Bangs can be used to provide visual feedback. For instance, when a certain condition is met in your patch, you might use a bang to flash, indicating that the event has occurred.
Conversion: Sometimes, you might receive data, but you only want to know when the data has arrived and not necessarily what the data is. In such cases, you can convert any incoming data into a bang. For example, if you connect a MIDI note input to a bang, the bang will flash every time a note is played, regardless of the note's pitch or velocity.
In essence, the bang is a fundamental concept in Max/MSP, representing an event or a moment in time. It's a way to initiate actions, control the flow of operations, interact with the user, and more.
User:
How about a special value "dang" for propagating mistakes?
ChatGPT:
The concept of a special value like "dang" to propagate mistakes is an interesting and humorous idea. Let's explore this hypothetical scenario:
Purpose: "Dang" could be used in a system to indicate that a mistake or an error has occurred and has been propagated. Instead of failing silently or producing unpredictable results, the system would clearly mark the affected data or process with "dang".
Usage:
In a programming context, "dang" could be a special return value indicating that a function has encountered an error that wasn't caught but isn't catastrophic enough to halt execution.
In a database, "dang" could indicate corrupted or suspicious data that resulted from a known error but hasn't been cleaned or addressed.
Advantages:
Transparency: Instead of silent failures or cryptic error messages, "dang" provides a clear and human-readable indication of a problem.
Debugging: It could make debugging easier, as developers can quickly identify and trace back the source of "dang" values.
User Experience: For end-users, seeing a "dang" value could be a more friendly or humorous way of being informed that something went wrong, as opposed to a generic error message.
Challenges:
Overuse: If used too liberally, "dang" could become a catch-all for various issues, making it harder to pinpoint specific problems.
Misunderstanding: Without proper documentation or understanding, developers or users might be confused about what "dang" actually represents.
Standardization: For "dang" to be widely adopted, it would need to be standardized across platforms and tools, which is a significant challenge.
In essence, while "dang" as a concept introduces a fun and novel way to handle and propagate errors, its practical implementation would require careful consideration to ensure it adds value to the development and user experience.
Eh, anytime I need to read SQL that includes that I have to sit down and think about it. It's not glance-and-understand and you have to maintain that state in your head. Not the sort of SQL I want to rely on as a practice.
Mind you, most of what I write are ad hoc SQL queries where I might write 100-200 queries in a single day. So that sort of thing really adds up. And it's in a strange place where performance isn't dire, since I can work on other stuff while queries chug.
With most databases (PostgreSQL is an exception), SELECT must select something, even if you only care about the record existing in general, and you're not interested in reading anything from it. SELECT 1 means "select the first column", whatever it is. Specifying a column number also works with ORDER BY and other clauses. Of course any natural number works, as long as the record has at least that many fields.
SQL queries return not columns from table, but a result of relational operations with new columns specified by expressions. What a downvoted user itt cannot accept is that 1 is as meaningful as * or <colname> in general and is an idiom for “columns don’t matter here, only [the number of] rows”. It’s like [[1]].length vs [[id, …]].length. Saying “it’s not my data” is misunderstanding what that data is.
In SQL Server at least, no, it literally means select the integer 1. In the ORDER BY clause, it does mean to order by ordinal position, but that's not a great thing to glorify, since ordinal position is not necessarily stable. I think other dialects like MySQL might allow GROUP BY 1, but that's not a great thing to glorify either.
I'm pretty sure you still have to select something in postgres... or are you saying there's a different operator to do the same thing?
> SELECT 1 means "select the first column"
interesting.. in postgres this actually gives you a single-column value of 1, in an unnamed column... I haven't used other DBMSes much but expected similar behaviour
Indeed, I would be surprised if there exists any RDBMS where the content of the select list in an EXISTS subquery matters. Postgres's SQL dialect even lets you use an empty select list (`... WHERE EXISTS (SELECT FROM related_table WHERE related_id = id_from_outer_query)`). In T-SQL, however, a non-empty select list is required, and in my experience, developers writing T-SQL tend to prefer a constant value like `1` or `NULL` over `*` -- I suspect there's some superstition there related to both the common wisdom that `SELECT *` queries are to be avoided (which is true -- you really should only ever select the columns of interest) and a lack of truly understanding that EXISTS only reads enough of the table to ascertain whether rows exist that satisfy the subquery's predicate, and returns a boolean value, not a result set.
As a non-sql-expert (I've used an exists query approximately once in my life) I prefer `select 1` over `select column_name` or `select *` because it doesn't look like it might be meaningful.
Postgres's dialect seems like it made the right choice here.
It works fine for me.. when I'm thinking in sqlese. The bigger travesty here is using SELECT for the tasks which are not selecting (returning data) anything.
Both Oracle and SQL Server have the advice to use a constant value on their documentation. Postgres used to advise the use of `*`, but looks like they improved their optimizer so it doesn't matter anymore.
It's not superstition. It's people that know deeply how a complex system works picking the option with the best set of side-effects.
You could. But `SELECT 1` is a very common pattern in SQL Server, intended exactly to declare intent to the reader that "this subquery doesn't return any real data" - in the event the `EXISTS` that surrounds it isn't enough of a clue. You could use `SELECT NULL` or `SELECT 0` or `SELECT *` or `SELECT 12/0` and the behavior would be identical, but most of those would likely raise even more questions to an inexperienced reader than `SELECT 1`.
Just be sure to check the resulting plans. The DB we use usually has sub-optimal plans when using IN compared to EXISTS, often quite drastically so.
You always need to check the resulting plans if you care about execution performance, that's just how life is.
I'm offhand a bit surprised IN does worse than EXISTS; I can understand NOT IN being slow, because it has very surprising NULL handling that is hard to optimize for.
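For anyone who hasn't hit it, the NULL trap looks roughly like this (hypothetical names):

-- If Orders.CustomerID is nullable and contains even one NULL, this returns no rows
-- at all, because "x NOT IN (..., NULL)" is never true:
SELECT c.CustomerID
FROM Customers AS c
WHERE c.CustomerID NOT IN (SELECT o.CustomerID FROM Orders AS o)

-- NOT EXISTS sidesteps the NULL issue and is usually easier to optimize:
SELECT c.CustomerID
FROM Customers AS c
WHERE NOT EXISTS (SELECT 1 FROM Orders AS o WHERE o.CustomerID = c.CustomerID)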
Explicit semi- and antijoin operators would be a nice addition to the language, but do note that they present some additional challenges for the optimizer. In particular, when the join condition becomes explicit, it can take on more forms than before, which can limit the reorderings allowed or allow new ones. E.g. you now need to deal with this structure:
(a SEMIJOIN b ON a.x=b.y) JOIN c ON b.z=c.z
This was an impossible structure before, since IN and EXISTS both hide b's columns (all semi- and antijoins effectively come last), and your optimizer will now need to know whether e.g. this associative rewrite is allowed or not:
a SEMIJOIN (b JOIN c ON b.z=c.z) ON a.x=b.y
Also, you'll need to deal with LATERAL semijoins, which you didn't before…
None of this is impossible, but there's more to it than just a small syntactic change.
I don't get it. "SELECT 1" is significantly more informative than "SELECT red_herring". It's a self-documenting and automatically enforced comment telling you that the information requested by the SELECT statement is not used by the query.
I copy paste SQL all the time, especially with CTE after CTE. The row number is a simple thing to copy paste that fills multiple functions with zero modification.
And that's different from 1 in what way other than being more confusing?
Note that in the case you're advocating for, the row number is called "OrderId", which you might have trouble with if you insist on copying a query from somewhere else and using it without modification. Wouldn't you prefer "1"?
It's helpful within CTEs as a unique identifier. This whole conversation makes me wonder if not many people minmax this sort of meta programming thing.
> although it would seem to be so by design, however design is only as robust as
It only seems so if you assume auto-increment is used to populate the order_id and that it always increases with time. That latter assumption is quite unsafe:
* Systems could have been merged with a bulk import of old orders into this one from elsewhere (assuming order_id is a surrogate key and there is a separate order code or such that is used to identify the orders externally).
* In fact, a simple insert of several records in the same statement will not necessarily get auto-increment values in the order you expect (in practice they usually do - but the DB engines do not guarantee this, it is an accident of other factors in their design rather than a defined behaviour).
* Because of optimisations for concurrency in the way auto-increment is handled, it is possible that long-running transactions could cause ordering discrepancies. In theory at least; in practice, unless you've explicitly opted out of ACID-preserving locking semantics for those transactions, I suspect these protections will stop this un-ordering happening by blocking the concurrency. This sort of issue is why you occasionally see unexpected gaps in auto-increment values.
* I have seen an example where an incrementing signed-int ID was getting too close to MAXINT for comfort, and as a temporary measure ahead of changing that ID to be a longer type the increment was reset to restart at MININT and head back towards 0 from there! This was with a 16-bit integer (I'm old enough to have been around when it was common to use them to save space, where we generally default to 32-bit these days) but the same could happen to larger types.
If you have an auto-increment and DO NOT have some logic around draft orders, max could be the one. Either way, you can max on the submission date then
Hmm… my hunch is that this doesn’t do what you think it does.
I expect the order by in the window function is effectively lost because max operated over the whole window. (And you happen to get the most recent, because in many implementations, order_id will be a sequence.)
But I might be wrong. And I might only now be learning that order by with max() and over substitutes how the “value” of the order_id is understood.
Here's a pretty simple/normalish way to handle the edge cases. This one (without distinct) is far more consistent (wall-clock-wise, doesn't depend on caches): http://sqlfiddle.com/#!15/7eb3a/9
Note that order 2 is after order 4 in the example schema.
Whenever I see [select distinct ... group by], I know I'm in for some BS. The original code creator had some duplication/cartesians but never properly addressed it.
I've done it a few times in the app I'm working with and had no idea DISTINCT was not the proper way to address it. I mean, it's a standard feature, named "distinct", easy to use, and it works. It's on me that I didn't run an explain on these queries, but it seems like an easy mistake to make.
Anyway I’ll happily fix these in my code on Monday.
Seeing distinct being abused in application code is pretty rare though. What gets me is seeing it in some ETL or data analyst's queries. That's the sign that you're about to spend the next hour unfucking a query with like 50 joins and no CTEs for organization, written by someone who has no understanding of the underlying schema.
Any recommended resources (books, courses, etc) which go deeper on sql topics like this? Comparing and contrasting different ways of writing the same query, with respect to performance.
The one book I’ve found and really enjoyed was “SQL Performance Explained”.
> why aren't these tables normalized better anyway? None of them should have columns for the PK of another table.
> The relationships between them should be represented by bridge tables. Really there's only one relationship here which is product orders.
What?
There appears to be a mistake in the Products table, which only allows a product to belong to one category.
But the many-to-many relationship between orders and products is already represented in a bridge table, OrderDetails, and the one-to-many relationship between customers and orders is correct. Why would you have a bridge table between customers and orders instead of recording the same information in a column of the Orders table?
And... how is your bridge table going to operate without having columns for the PKs of multiple other tables? That's what OrderDetails is doing. What would you change?
I don't think the one category is a mistake, merely a simplification. Why introduce another table that complicates the example and adds nothing? The article doesn't seem to be about "here's the absolute perfect design for orders and products" and dwelling on those details, well, doesn't really add anything either. :-)
> the one-to-many relationship between customers and orders is correct. Why would you have a bridge table between customers and orders instead of recording the same information in a column of the Orders table?
Because as you said the relationship is one customer to many orders.
> the many-to-many relationship between orders and products is already represented in a bridge table, OrderDetails
The relationship between Orders and Products is one-to-many, not many-to-many. Also, OrderDetails is not a bridge table because it has the quantity of Products; bridge tables should only have the IDs. The table I'm proposing is not going to have one row per order nor per customer. It's going to have n rows, where n is the number of categories a product in an order belongs to, multiplied by the number of products in an order, multiplied by the number of orders made by a customer, multiplied by the number of customers who have made orders. It's a lot of rows, but it compresses very well and makes the queries simpler by flattening all these relationships into one. The only relationship that matters for this query is customers and product category, but you can query for any relationship without joins in this bridge table I'm proposing.
> That's what OrderDetails is doing. What would you change?
OrderDetails is messed up in other ways too. What is "LineItemID"? LineItem should have the quantity and price? That table isn't even defined in the example. OrderDetails is a bad table that needs to go. Your interpretation of what I'm saying is correct. I would add an actual bridge table (not OrderDetails) that relates the customers to the orders.
You're correct that OrderDetails, as written, does need to relate an order to all its products, but my point was that you don't need to join to query for all the customers who have made at least one order of a given category. You just need to select where the category ID matches.
I'd kind of like to respond, but nothing you just said makes any sense.
> OrderDetails is messed up in other ways too. What is "LineItemID"? LineItem should have the quantity and price? That table isn't even defined in the example.
You could have answered these questions by just reading the example schema. LineItemId is part of the two-column primary key defined for the OrderDetails table. It's not a reference to anything; presumably its purpose is to define the order in which line items appear in an Order.
> OrderDetails is a bad table that needs to go.
I hope you don't do a lot of code review. Or much coding, for that matter.
> I hope you don't do a lot of code review. Or much coding, for that matter.
With responses like yours, I feel the same back at you. I always find it hilarious how passionate people can get about SQL just because it's hard to discuss without a full working example.
You should re-read and try out what I said. The schema in the example is contrived to create the very issue they're whining about, but bad schemas are at the root of all bad queries.
>>> The relationship between Orders and Products is one-to-many, not many-to-many.
Let's deal with this one first. Suppose customer 12 orders 10 pounds of gravel and a potted cactus while customer 15 orders 30 pounds of gravel and a cherry sapling.
I claim that this illustrates that it's possible for one order to be associated with multiple products (customer 12 placed an order for two different things), at the same time that it's also possible for one product to be associated with multiple orders (customers 12 and 15 each ordered gravel).
But you appear to know better than me. Please explain to me why it wasn't possible for the same product to appear in several orders.
> I always find it hilarious how passionate people can get about SQL just because it's hard to discuss without a full working example.
I actually don't have much in the way of opinions on SQL. But I do have strong opinions about self-righteous stupidity. You have repeatedly made it plain that you don't understand what the words you're using mean. Take this claim of yours:
> why aren't these tables normalized better anyway? None of them should have columns for the PK of another table.
The tables are already fully normalized. It would be possible to render them differently, so that (for example) the Orders table didn't have a column for CustomerID and instead a separate table existed with records of the form (OrderID, CustomerID) -- but this would not affect the normalization of the tables. All this would do is cause you to perform extra queries whenever you had one piece of information and you wanted the related piece of information.
Or this one:
> Also, OrderDetails is not a bridge table because it has the quantity of Products.
Annotating the relationship it embodies won't make OrderDetails not a bridge table. It's a bridge table because it records the many-to-many relationship between Orders and Products. Here ( https://www.youtube.com/watch?v=uKd01Ouw_b8 ) is a university lecture in which one of these tables is explicitly identified as a "bridge table" at the same time that it records such supplemental information. (In that case, what's happening is that the "AssignmentGrades" table bridges between "Assignments" and "Enrollments" while recording the grade each student earned on each assignment. This is fundamentally identical to OrderDetails bridging between Orders and Products while recording the quantity of each product purchased in each order.)
The bridge tables will lead to a lot of extra joins in the rest of the application though. Like the typical case of a grid with the order items which has product name and category as well.
Unless you mean something very different in which case some concrete examples would be nice.
It's an arbitrary but common one used in SQL Server. If the example used 128, you'd ask "why 128?" If it was 64, 50, 100, or any other number, you'd still ask. What is the right number? String lengths should come from business rules and the data domain but, when those don't exist (or someone is creating a fake table just to show an example), they borrow from other examples or existing schemas. It's clearly not the point of the article to come up with the absolute best and universally satisfying table design for orders.
As a sibling comment said, optimizing for size. However, since in utf8mb4 (the standard for Postgres, and modern MySQL) each char may take up to 4 bytes to store, the updated version of this would be VARCHAR(63).
If you can guarantee you’ll only store ASCII, you can safely use 255.
This is also probably a premature optimization for practically anyone to be fair. Even at 1 billion rows, you’ve only added 1 GiB of overhead, and at that scale that’s a small percentage of the overall size.
At least generate some sizeable mock data before showing the query plans. The optimisations applied can be very different depending on the number of rows in each table.
I lost interest once I came to the words "SQL Server", which I haven't used for a couple of decades. Unfortunately, those words didn't crop up until the last sentence of the article...
There is no reason to mention SQL Server for most of the article, the issue being discussed is relevant to all (or at least all common) SQL implementations. The query pattern is flawed in a fundamental way, or the table structure necessitating it is, rather than this being a SQL Server specific trap.
The only reason the product is specifically mentioned in that last part is because it mentions an optimisation that is possibly specific to SQL Server's query planner.
If you have such an allergy to the product that you can't process general advice from someone who uses it, there are clues that could have triggered your anaphylaxis earlier and saved you time: T-SQL is mentioned in the site's side-bars, as is “Learn SQL Server” for that matter.
I'm not allergic to SQL Server. I didn't study the example code; I was just curious about why DISTINCT was a problem. TL;DR: for some query planners, DISTINCT requires a table sort, which requires a full table scan, which can be avoided by using a construct (GROUP BY) that gives the planner hints.
But I assumed that was dependent on the query planner; Oracle SQL, as I recall, has/had specific constructs for giving hints to the query planner.
> But I assumed that was dependent on the query planner;
Problems that can be “fixed” with DISTINCT but really shouldn't be, an accidental cross-join being one I've seen many times, are common to many (all?) SQL implementations, and I don't think they are treated much better by any of them (they are generally mistakes, so prioritising optimising for them is not a good use of the query planner engineering team's time).
> Oracle SQL, as I recall, has/had specific constructs for giving hints to the query planner.
SQL Server has index hints much like Oracle's, along with locking hints (which I think Oracle has similarly), and so forth, which can tweak the query planner's behaviour. They are best avoided if other options are available because:
• they often only work well for a given data pattern, so you are just pushing the problem into the future when your data size has grown noticeably (in this context they are valuable as a temporary workaround, just make sure you treat them as such and work to remove the need)
• they potentially block optimisations that future versions of the query planner/engine can perform by forcing it a different way around
• sometimes they make explicit reference to objects (indexes usually), tightly coupling their names to things that others may not expect, making accidental breaking changes more likely
Postgres doesn't have the same sort of index hints IIRC, but does have options that constrain the query planner similarly, to try to make it prefer a particular path and/or avoid another.
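For example (a sketch of the Postgres knobs meant here; they affect the whole session or transaction, not a single join):

SET enable_seqscan = off;   -- heavily penalise sequential scans so the planner avoids them where it can
SET enable_nestloop = off;  -- likewise for nested-loop joins
-- blunt instruments compared to per-query hints, which is roughly the trade-off being described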
Yeah, thanks. I get all that; avoid proprietary language constructs if possible. Try to tune the query and the schema so that no sane optimizer would do the wrong thing.
But the last time I tangled with Oracle, I was working for a bank; schema changes took months to approve, and the DBAs were GHODZ. And there was no chance of the bank changing suppliers anytime soon. And anyway, EVERYTHING was proprietary. They had their own proprietary stored-procedure language, and it wasn't dispensable - I believe people bought Oracle because of their stored procedure setup.
To be clear, I'm not in any sense advocating hinted SQL!
I don't. I put distinct on everything by default, because it makes SQL more like relational algebra.
What's the argument for avoiding distinct? Is it performance? I've often wondered if the (sometimes redundant) distincts slow things down. Perhaps it does, but just not enough that I noticed.
Performance implications exist, but they are secondary.
Primary reason:
distinct on every select shows either lack of knowledge of schema, in particular which columns make rows unique, or unfortunate schema design.
(Apart from niche cases, schema should be somewhat normal. I.e. column parent_name belongs in the table parent, not in the table student)
Select a from x where myuniquekey=1; -- guaranteed to return one or zero rows, if myuniquekey is actually unique.
Select a from x join y on x.parent_id = y.y_id -- guaranteed to return the same number of rows as exist in x, never more, never duplicating x rows. (N-to-1 relation)
If distinct is used in any of above, then question “why?” naturally arises.
In more severe case, leads to bugs:
Select distinct student.student_name, parent.parent_name from student join parent on student.parent_id = parent.parent_id -- silently discards rows where, by accident, the student/parent name combo matches several times.
Technically SQL allows comparing unrelated columns (colour=last_name), but in the vast majority of cases, when joining, one of the sides should be joined using its unique key, and the other side should be using its foreign key, which ensures that duplicates don't appear randomly, and thus distinct is not needed.
> If distinct is used in any of above, then question “why?” naturally arises.
Not if distinct is the default.
> Select distinct student.student_name, parent.parent_name from student join parent on student.parent_id = parent.parent_id -- silently discards rows where, by accident, the student/parent name combo matches several times.
Either with or without distinct can be a bug depending on what you are doing it for.
There are actually 4 variations on what you might want, and you can get all of them with distinct:
select distinct student.student_id, student.student_name, parent.parent_id, parent.parent_name from ...
select distinct student.student_name, parent.parent_id, parent.parent_name from ...
select distinct student.student_id, student.student_name, parent.parent_name from ...
select distinct student.student_name, parent.parent_name from ...
Our main application at work is essentially a CRUD application, and I've worked on it for over 10 years now. I'm fairly confident I can count on one hand the number of cases where a join returned unexpected duplicates which DISTINCT would "fix".
Sometimes I wonder if we're just weird, somehow avoiding this issue.
These examples reminded me of one more issue: a change in column selection might change the number of rows,
which means column addition/removal is so much riskier, AFAIR.
> Not if distinct is the default.
If that works for you, great, but let’s agree to disagree here.
Your mental model, if you will forgive the straw man, is that SELECT over multiple tables is conceptually equivalent to nested for-loops over each table, and the WHERE condition is an if-statement.
My mental model is that I'm working with sets. If yesterday I asked for the set of CITY,COUNTRY, and today I've changed that to the set of COUNTRY, then obviously the result set today is going to be much smaller. This is not a risk to me -- asking for a different set gives me a different set, I can't imagine being surprised by that.
[1] https://aws.amazon.com/blogs/big-data/10-best-practices-for-...