> There was another interesting use case I wrote about a few years ago that showed how changing DISTINCT to GROUP BY – even though it carries the same semantics and produces the same results – can help SQL Server filter out duplicates earlier and have a serious impact on performance.
I recently learned this is also what Amazon recommends when querying Spectrum. [1]
This is such a pervasive issue with the relational SQL query model! The fact that joins always create brand new tables is powerful, but it's rarely what is needed.
The example in this article can manage to do without a join because it's only using the other tables to filter the customer table. The proposed solution breaks down as soon as a column needs to be brought in from the other tables (for example, "the total number of products purchased from the Beauty category"). A notion as simple as "I want to return a list of customers with additional properties" does not map to any dedicated SQL construct! You can JOIN then GROUP BY, but making sure that the result only has exactly one value per customer is not something you can explicitly say in your SQL, it's a consequence of how you set up your JOIN and GROUP BY keys, and it's both easy to make a mistake and hard to understand the intent.
> A notion as simple as "I want to return a list of customers with additional properties" does not map to any dedicated SQL construct! You can JOIN then GROUP BY, but making sure that the result only has exactly one value per customer is not something you can explicitly say in your SQL.
You can: you say it by grouping only on columns from the customers table, with those columns including a candidate key. Necessarily and equivalently, this means everything not from the customers table must only be referenced in an aggregate expression, but that's trivially what “I want a single row per customer, with some data not from the customers table” is asking.
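A minimal sketch of that shape (table and column names are illustrative, not taken from the article):

SELECT Customer.CustomerId, Customer.Name, SUM(Sales.Quantity) AS TotalQuantity
FROM Customer
INNER JOIN Sales ON Sales.CustomerId = Customer.CustomerId
GROUP BY Customer.CustomerId, Customer.Name
-- grouping on a candidate key of Customer guarantees at most one row per customer;
-- everything from Sales is only referenced inside an aggregate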
> it's a consequence of how you set up your JOIN and GROUP BY keys,
Well, it's a consequence of the GROUP BY, which is literally “what do you want one row for each of”, so... it's kind of weird to complain that it should be something else. GROUP BY is the dedicated construct in SQL that specifies the thing you are looking for.
SELECT Customer.Name, sum(Sales.Quantity)
FROM Customer
INNER JOIN Sales ON Customer.A = Sales.A
GROUP BY Customer.B
Can you tell whether this will return exactly one line for each customer? If INNER JOIN + GROUP BY were the dedicated construct to do so, then the answer would be "yes", because by definition it is the dedicated construct to do so. That's what a dedicated construct does: it is dedicated to doing that thing.
But both INNER JOIN and GROUP BY are much more versatile than that. In order to return exactly one line per customer, the following must be true: 1. column Customer.B must be a unique key of the Customer table (otherwise you'll get duplicate lines), and 2. each value of column Customer.A must also appear in column Sales.A (otherwise you'll have missing lines). Neither of these properties can be ascertained by looking at the query alone.
A dedicated construct would be something like (imaginary syntax):
SELECT Customer.Name, sum(Sales.Quantity)
FROM Customer
INNER JOIN Sales ON Customer.A = Sales.A
GROUP INTO Customer
> Can you tell whether this will return exactly one line for each customer?
I can tell it will return one row for each Customer.B.
If Customer.B is a candidate key of Customer, that will also be one row per Customer.
> If INNER JOIN + GROUP BY was the dedicated construct to do so
The dedicated construct to say what you want one row per is GROUP BY. Yes, it operates by columns, not tables, so what it means in table terms is schema-dependent.
> each value of column Customer.A must also appear in column Sales.A
Well, yes, that's what INNER JOIN means. The dedicated construct to ensure that every row from the first source (but not necessarily the second) is included in the result set before filtering by WHERE is LEFT [OUTER] JOIN.
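A sketch of the outer-join form of the same example (same hypothetical columns as above):

SELECT Customer.Name, SUM(Sales.Quantity) AS TotalQuantity
FROM Customer
LEFT OUTER JOIN Sales ON Customer.A = Sales.A
GROUP BY Customer.B, Customer.Name
-- customers with no matching Sales rows are kept, with TotalQuantity = NULL;
-- an INNER JOIN would drop them entirely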
As I read what you wrote, my first thought is that you don't have any such thing as a "customer" in that data model. You can ask for "one line for each customer.B" (which is what you're doing). But the idea that you can ask for "one line per customer" relies on some amount of non-db, domain knowledge.
If you can't define what a "customer" is via the information in your database alone, then you can't query based on "a customer". And if the answer is "each row in Customer with a unique B", then that's part of the definition and reasonable to use as knowledge in getting "one row per customer".
I didn't explain that well, I think... but that's the general thought that was running through my head as I read your writing.
I agree! This creates a situation where the SQL query does not represent the domain knowledge, but makes assumptions about it and cannot be understood without it. And while this will always be true for the more unusual parts of the domain, it is quite disappointing to be unable to properly represent as simple a concept as "what is a customer?" in SQL.
It's the same as a language forcing you to use `c & ~0x20` because it doesn't have a `Char.toUpper(c)` function. The code works (under the right assumptions) and produces the same result, but it does not convey the concept of converting a letter to uppercase.
What makes it so frustrating is that the DDL portion of SQL spends significant effort on representing such concepts in the database schema! I can create a Customers table, with its primary key, and its foreign keys into and from other tables, and so on. I can represent "these are all the customers" in DDL, I can represent "every sale must be associated to a customer", and so on. But after the first join, I'm no longer using the Customers table, I'm using a new in-memory relation with no primary or foreign keys, and the concept that "this is the customers table, but with extra fields" is something I need to keep track of in my head, instead of in the language.
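A sketch of the kind of DDL meant here (names illustrative):

CREATE TABLE Customers (
    CustomerID INT NOT NULL PRIMARY KEY,
    Name       VARCHAR(255) NOT NULL
);

CREATE TABLE Orders (
    OrderID    INT NOT NULL PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES Customers (CustomerID),
    OrderDate  DATE NOT NULL
);
-- "every order belongs to exactly one customer" is stated declaratively here, but the
-- relation produced by Customers JOIN Orders carries none of these declared keys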
In some sense, what you're asking for is straightforward with the tools we already have. It suffices to set a convention that each table always has a column named `id` which is the primary key for the table.
The harder part is how to enforce that within an organization. It sounds like you'd like technology to enforce it.
Existing tools already do the math part: you can set a constraint on a table so that the database maintains the primary key property and throws an error if a transaction would change the table in a way that violates the property.
What you're left to do is get everybody on board with the "every table has a primary key column named id" plan. Some syntactic sugar like GROUP INTO might help with that.
Looking at this again, I think the actual complaint isn't so much about base tables (though those were used in the illustration) but about intermediate derived relations created in deeply nested queries (or even regular views). Even though there may in effect be primary/unique keys on those, they aren't declared, and recognizing them depends on tacit knowledge. And because the functional dependencies aren't recognized by the DB engine, they can't be leveraged in GROUP BY to omit redundant non-key columns, so a GROUP BY needs to specify all the non-aggregate columns, with the domain understanding remaining opaque.
A primary key convention for base tables doesn't help with this; I also don't think the proposed GROUP INTO solves it, though it requires it to be solved first in order to work (i.e., unless you are only using it to GROUP INTO base tables rather than intermediate tables formed by arbitrary joins, it requires first having the engine infer keys for those intermediate tables, or providing a way of declaring such keys and having the engine validate them).
Honestly, I'm not sure what all this means. Maybe an example would help?
It sounds like there's an interest in the database inferring something subtle and making some kind of automated decisions based on that. Business stakeholders often make this kind of request - "can't an AI just figure all this out?" kind of thing. It often doesn't go anywhere because it's too far removed from the level of detail needed for a machine to automatically solve a problem.
First of all, all code requires domain knowledge to understand. Some domains are just simple. Business domains never are. Even with your upper case example, if you don’t know what upper case letters are, you are in the same position.
Second, data can be organized in infinite permutations and SQL has to accommodate that. People have been complaining about SQL since the dawn of time, but all proposed solutions only fix a subset of problems.
MySQL aside, your first query will end up as an error if Customer.Name is not functionally dependent on Customer.B. This highlights that there could be different customers with the same name and that it was a poor query to start with.
For number 2, that's exactly what inner join means; otherwise, use a left join. The first question someone should ask themselves is whether they want all customers or only the ones that have had sales.
Agreed - I think SQL is great, but the JOIN operator has been problematic for me with analytic queries and queries where you want to enrich data across different tables.
I think the problem is fundamentally that if you take a record, and split it across 2 tables, then there's no easy operation to merge them and guarantee distinctness.
Separately, even though there are some %-breakdowns in the article, I don't think the `DISTINCT` sort is so much of a problem in most cases. Usually the actual joining is the problem for any truly large dataset, so while I trust the diagrams shown, I think those %'s change dramatically based on the underlying data.
> I don't think the `DISTINCT` sort is so much of a problem in most cases
I train people new to SQL at work, and I can tell you with complete certainty that you only get this impression because people learn very quickly to never use `distinct`.
It's by far the most common cause of performance problems for people new to SQL.
But the example is exactly the type of query where I often see DISTINCT used where it shouldn't: they just want to know the customer with at least one order or the author who wrote at least one book or the user who upvoted at least one comment. I think that was the point of the post (not to never use DISTINCT or that this is the only example where anyone ever needs to join). DISTINCT isn't likely to be used in the more elaborate example you mention (total products from some category) because DISTINCT doesn't fix a problem that the necessary GROUP BY to get the answer has already prevented appearing anyway.
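For reference, a sketch of the rewrite being discussed, with illustrative names (the article's actual schema may differ):

SELECT c.CustomerID, c.CustomerName
FROM Customers AS c
WHERE EXISTS (
    SELECT 1
    FROM Orders AS o
    WHERE o.CustomerID = c.CustomerID
)
-- one row per customer with at least one order, and no duplicate-elimination step is needed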
If you don’t need the result from the joined table, a semijoin is often the fastest way. In the given example, if there was more data in the tables, it’s likely the planner would choose that. Not as familiar with MSSQL as MySQL and Postgres so I’m not positive.
Also, the good news is the planners are increasingly better at converting your desired end state into queries utilizing features like this, without you explicitly telling them to.
Well, SQL Server (seemingly the author's area of expertise) doesn't have a lateral join. You can simulate this to some extent using CROSS APPLY, but that wouldn't really be any better than EXISTS in the case like the example, where you don't actually need any data from the secondary table.
Nobody actually answered your question helpfully, so here goes:
When you write `SELECT DISTINCT a, b` you are asking for every unique pair (a, b) in the result. Specifically, the same value of `a` could appear multiple times in the result paired with different values of `b`.
The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
As a human, we look at this and think "Oh, but `a` is the primary key (in this case CustomerID) and `b` is a column in the same table (CustomerName) so we only need to be distinct over `a`, which means the query optimizer should be able to "push down" the DISTINCT clause into the scan of the joined tables". The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
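To make the shape concrete, the query under discussion presumably looks something like this (names illustrative):

SELECT DISTINCT c.CustomerID, c.CustomerName
FROM Customers AS c
INNER JOIN Orders AS o ON o.CustomerID = c.CustomerID
-- DISTINCT is over the pair (CustomerID, CustomerName); only the knowledge that
-- CustomerID is the primary key would let an optimizer reduce it to CustomerID alone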
At this level you really start getting into the nitty gritty of what optimizations a particular database has implemented though: the solution presented in the article may be the best way to write this for SQL Server, but this "correlated subquery" form could be catastrophically slow in other database engines...
> The only way to implement that "generally" is to visit all of the rows in the result-set, and skip over duplicates using a hash table of "already seen" pairs, or by sorting the result. The problem is that this intermediate result-set is much bigger than we really want.
This isn't quite right; there are two general ways. One is a hash table as you say, the other one is sorting. The second is quite relevant once you have a reasonable framework for interesting orders (e.g., scanning along an index, or doing sortahead on a smaller table with a foreign key that guarantees uniqueness).
> The problem is that the query optimizer has not been endowed with this kind of knowledge about the specific relationship between `a` and `b`, and so cannot do this optimization. In principle, a query optimizer could implement this kind of optimization though.
There are absolutely query optimizers that know this (try such a query in MySQL, for instance).
Hum... If you know `a` is a key of table `X` and `b` is a key of table `Y`, you can replace a `select distinct` with a `join ... on exists ()` with complete certainty.
And this solves almost all of the problematic cases of this in practice. If the thing you are querying isn't a key, you will naturally think a bit further and not write the distinct.
I guess that kind of optimization just goes counter to the philosophy of SQL optimizers. I know that I would be deeply surprised if I found a database doing it, and not in a good way.
You're overthinking it - the optimization is absolutely possible as you've described. The reason databases haven't done it is not some question of design philosophy; it's simply that nobody bothered to implement it.
> I know that I would be deeply surprised if I found a database doing it, and not in a good way.
SQL engines do way more surprising optimizations than this. SQL is a "4th gen" language after all, meaning it's absolutely within the spirit that the database engine does whatever it wants to get the result.
There's no good reason why you couldn't rewrite DISTINCT to GROUP BY, but remember that it wouldn't work in all cases; you could have both. E.g. SELECT DISTINCT COUNT(*) FROM t1 GROUP BY a;. Similarly, HAVING and window functions happen after GROUP BY but before DISTINCT, so a simple move won't always be correct and you'd have to account for that.
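For instance, a mechanical rewrite of that example needs an extra query level; roughly (a sketch, not a general recipe):

SELECT cnt
FROM (
    SELECT COUNT(*) AS cnt
    FROM t1
    GROUP BY a
) AS grouped
GROUP BY cnt
-- equivalent to SELECT DISTINCT COUNT(*) FROM t1 GROUP BY a: the inner GROUP BY
-- produces one count per value of a, the outer one deduplicates those counts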
Tangentially related: IN and ANY can produce wildly different results [0] due to type casting. I have personally seen this, and at least for my data, the fastest query / fewest buffers scanned came from shifting the matching ids out to a CTE with VALUES and joining with IN. It wasn't a minor speedup, either; it was literally twice as fast.
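Roughly the shape described, with hypothetical table names and ids:

WITH matching_ids (id) AS (
    VALUES (101), (205), (307)  -- the ids of interest, inlined as a constant table
)
SELECT t.*
FROM some_table AS t
WHERE t.id IN (SELECT id FROM matching_ids)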
Has nothing to do with the language. 1 (in its isolated presentation) exists outside of my set of data and that's not something that gives me comfortable feelings. It forces me to think (less think more good) why some random value needs to float around my data and maybe needs to be appended or summed or whatever.. because reasons. As far as I'm concerned you might as well make it 200 or 1648194. And why would those ever be needed in any context around my data?
No need to be a dismissive dickhole about it, friend.
Maybe I'm not wording it right then? Like I said in other posts, I copy paste SQL statements all the time. If I were to copy/paste that statement then all I'd get is a bunch of ones and that's useless to me. The SQL itself is data to me in the same way that when I view a lazy list comprehension, I view it (and SQL) as something that's just waiting to be run. Maybe not now, and maybe not in the current SQL, but there's a non-zero chance that I'll copy/paste it. So in that context, a bunch of ones is useless to me and IS data because the ones are literally the output of the SQL statement. Better to generalize my code writing process in a way so that I can copy/paste a "select *" or "select rownum" because those are more useful down the line.
You've seriously missed the point if this is what you're still saying. Take a step back, breathe, and try to consider that you've missed something. Whether that's a point I've made, or a lack of perspective, I don't know. For example, we almost definitely work in different fields with different practices and reasons for doing things differently. And that's fine. But your continued dismissiveness isn't helpful. Like another post said, "select *" is helpful in data analytics work. If it's not helpful in your field, that's also fine. But for me and my colleagues, it is. And I promise you're wrong in your thinking of why I copy paste SQL. What a weird fucking conversation.
Also, calling someone obtuse is name calling, so pot kettle black and all that, ya obtuse weirdo.
Yes, and all those constants would be treated the same, just like in many languages things are conventions - you don't need to use _ to name a variable that you throw away, but you do, because conventions.
There's so many people that want to complain about the conventions of SQL but praise the conventions of other languages and I think its often just "which did you see first?"
You're missing the point that I don't really care about how they're treated. I want to glance at it and know what it does. For me, the 1 gets in the way. Maybe it doesn't get in the way for you because you don't use it the same way as I do, as a practice.
...but... this is literally how you write an EXISTS clause in SQL.
> you might as well make it 200 or 1648194
Yes, you can put anything there.
There's lots of funky inconsistent syntax in SQL because it evolved organically over literally the past half century. It has everything to do with the language.
Would it make you feel better about it to think of it as (or write it as) "select true"? Because that's essentially what it's communicating, that there is something to return and that's what we're checking for.
Nope, for the same exact reason. Select * makes the most sense because * is in the context of my data, not something I'm working into it. Pretty sure it's the same speed.
Also mind you, I use a lot of CTEs, so this would look weird in that context -- hence why using row number sometimes makes more sense and achieves the same thing.
Filter out everything, Project only what you need, Transform it as lightly as possible.
In any context I understand, a row number would never "make sense" if a constant of 1 would produce the same output; it would be a lot more code that does... nothing?
Any code using select * just breaks in the future with any new columns being added, no thanks.
Select * is pretty standard for wide table queries like they are used in Data Analytics teams regularly. Because you really want ALL columns, even new ones without going back to fix all 300+ dashboards.
you're getting downvoted because it's not really a preference, it's pretty widely known to be bad practice and unhygienic in production queries. select * would get your pr rejected and chewed out by dba at every place i've ever worked at. so you kinda just look like you don't know what you're talking about.
In an exists query select * is harmless, select 1 and select * result in the same execution plan at least in MS SQL.
In a query that returns result rows, it could break the query as soon as you add columns with names that already exist in other tables you joined in the query.
yes, it's harmless in this position but it provides no additional benefits to the select 1 idiom and is suggestive of poor query discipline. it's far easier to say just don't ever use select * in queries.
1. Given that "select *" is considered something to avoid except when necessary in edge cases
2. And "select 1" will accomplish the same goal
Anyone reading the "select *" version of the code will have to stop and consider whether it is using "select *" for a reason, because "select 1" would be the normal choice. Using "select *" is assumed to be conveying some intent (that isn't there) _because_ it's not the expected way to do it.
Sure, they do the same thing... but you have to stop and look at the second one to make sure you're understanding what it does and if there's some reason its weird.
I haven't downvoted anyone, but have followed this argument with interest as an intermediate SQL user.
If I was to guess why someone would downvote you, it wouldn't be for disagreeing with you, but more because you've subtly shifted from quite a strong objective stance ("this is not readable") to a subjective one ("this is not how I prefer to write it"), without really conceding anyone else's points.
I think my point makes more sense when you consider that I 1. don't work with production code (more analysis, ad hoc code in an investigatory capacity) and 2. that when I mention someone is "new" what I mean is someone actively learning and not from a technical background. IME, folk like that have a difficult time with that floating 1. So while yes it's a standard that programmers are familiar with, it's not something that someone new will be very comfortable with. Lots of people I work with come from a pandas-only background.
Not really conceding because, as far as I can see, everybody is coming from a position of familiarity.
That's totally fair. Perhaps the confusion could have been avoided by qualifying in your initial comment that you're referring to a specific situation i.e. not-too-technical analysts writing ad hoc code.
I didn't downvote you, but consider this - I work with SQL a lot, like a lot a lot. Something that's your code today is probably my code tomorrow.
So when you say "my flow is X" and your flow is inimical to maintaining it and extending it, people might get a bit irritated at the last dev that did the exact same thing.
Any situation where a new or otherwise unknown attribute breaks your code; binding is a big one.
* Say you are joining two tables and one now has a conflicting/duplicate name: surprise, you now have broken code, it literally will not execute with ambiguous duplicate references (see the sketch after this list).
* By the same token, downstream views can break for the same/similar reason.
* In some engines views are not going to actually include your columns until they are "refreshed"(SQL Server) so that one day that's out of band of your deployment your views will suddenly change.
* Say you have a report with specific headers - tada, it's now got whatever people add to the table - and sourcing it can be a pain because it's unclear where the source is in the query, requiring schema.
* Performance expectations can change if the data type is much larger, up to the point of actually breaking the client or consuming tens of billions of times more resources.
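A minimal sketch of that first failure mode, with hypothetical tables:

-- Works while only Orders has a Notes column:
SELECT o.OrderID, Notes
FROM Orders AS o
INNER JOIN Shipments AS s ON s.OrderID = o.OrderID
-- The day someone adds a Notes column to Shipments as well, the same statement
-- stops executing: "Ambiguous column name 'Notes'"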
Usually code refers to columns by name and additional columns are just ignored; the only case I know of is when you add a column to a joined table and the column name already exists in the other table, resulting in ambiguous column names.
I am talking about 'select *' at the top level; they can be harmless in exists, sub-selects and CTEs. The number of columns sent to the client (the program) will change when you add a column to the database. If you don't remember to change all of the places in your code where you used a 'select *', your program is likely going to fail or have unexpected results.
The relevant thing for exists is whether the set of rows returned by the select is empty or not.
The value in the columns is irrelevant. Using * creates the incorrect impression that the value is relevant and that the relevance is tied to the data.
Sounds like what you want is "bang" from Max, which is kind of like the opposite of null (the billion dollar mistake) but more existential, or dang (the hn moderator) but more explosive:
User:
Explain Max/MSP's use of bang.
ChatGPT:
Max/MSP, often simply referred to as "Max", is a visual programming language for music and multimedia. In Max, the fundamental building blocks are "objects" that are connected with "patch cords". These objects can process, generate, or manipulate various types of data, including audio, MIDI, and other control messages.
One of the most basic and essential objects in Max is the "bang". Visually, a bang is represented by a small button in the Max patcher window. Conceptually, a bang is an event trigger. When a bang is received by an object, it tells that object to do something, often to execute its primary function.
Here's a breakdown of how the bang is used in Max:
Triggering Actions: Many objects in Max perform an action when they receive a bang. For example, the metro object (a metronome) sends out bangs at regular intervals. When connected to other objects, these bangs can trigger various actions in a patch.
Flow Control: Bangs can be used to control the order of operations in a patch. The trigger object, often abbreviated as t, is particularly useful for this. It can output a sequence of bangs (and other data types) in a specific order.
User Interaction: The bang object can also be used as an interactive button in a user interface. When you click on it, it sends out a bang.
Feedback: Bangs can be used to provide visual feedback. For instance, when a certain condition is met in your patch, you might use a bang to flash, indicating that the event has occurred.
Conversion: Sometimes, you might receive data, but you only want to know when the data has arrived and not necessarily what the data is. In such cases, you can convert any incoming data into a bang. For example, if you connect a MIDI note input to a bang, the bang will flash every time a note is played, regardless of the note's pitch or velocity.
In essence, the bang is a fundamental concept in Max/MSP, representing an event or a moment in time. It's a way to initiate actions, control the flow of operations, interact with the user, and more.
User:
How about a special value "dang" for propagating mistakes?
ChatGPT:
The concept of a special value like "dang" to propagate mistakes is an interesting and humorous idea. Let's explore this hypothetical scenario:
Purpose: "Dang" could be used in a system to indicate that a mistake or an error has occurred and has been propagated. Instead of failing silently or producing unpredictable results, the system would clearly mark the affected data or process with "dang".
Usage:
In a programming context, "dang" could be a special return value indicating that a function has encountered an error that wasn't caught but isn't catastrophic enough to halt execution.
In a database, "dang" could indicate corrupted or suspicious data that resulted from a known error but hasn't been cleaned or addressed.
Advantages:
Transparency: Instead of silent failures or cryptic error messages, "dang" provides a clear and human-readable indication of a problem.
Debugging: It could make debugging easier, as developers can quickly identify and trace back the source of "dang" values.
User Experience: For end-users, seeing a "dang" value could be a more friendly or humorous way of being informed that something went wrong, as opposed to a generic error message.
Challenges:
Overuse: If used too liberally, "dang" could become a catch-all for various issues, making it harder to pinpoint specific problems.
Misunderstanding: Without proper documentation or understanding, developers or users might be confused about what "dang" actually represents.
Standardization: For "dang" to be widely adopted, it would need to be standardized across platforms and tools, which is a significant challenge.
In essence, while "dang" as a concept introduces a fun and novel way to handle and propagate errors, its practical implementation would require careful consideration to ensure it adds value to the development and user experience.
Eh, anytime I need to read SQL that includes that I have to sit down and think about it. It's not glance-and-understand and you have to maintain that state in your head. Not the sort of SQL I want to rely on as a practice.
Mind you, most of what I write are ad hoc SQL queries where I might write 100-200 queries in a single day. So that sort of thing really adds up. And it's in a strange place where performance isn't dire, since I can work on other stuff while queries chug.
With most databases (PostgreSQL is an exception), SELECT must select something, even if you only care about the record existing in general, and you're not interested in reading anything from it. SELECT 1 means "select the first column", whatever it is. Specifying a column number also works with ORDER BY and other clauses. Of course any natural number works, as long as the record has at least that many fields.
SQL queries return not columns from table, but a result of relational operations with new columns specified by expressions. What a downvoted user itt cannot accept is that 1 is as meaningful as * or <colname> in general and is an idiom for “columns don’t matter here, only [the number of] rows”. It’s like [[1]].length vs [[id, …]].length. Saying “it’s not my data” is misunderstanding what that data is.
In SQL Server at least, no, it literally means select the integer 1. In the ORDER BY clause, it does mean to order by ordinal position, but that's not a great thing to glorify, since ordinal position is not necessarily stable. I think other dialects like MySQL might allow GROUP BY 1, but that's not a great thing to glorify either.
I'm pretty sure you still have to select something in postgres... or are you saying there's a different operator to do the same thing?
> SELECT 1 means "select the first column"
interesting.. in postgres this actually gives you a single-column value of 1, in an unnamed column... I haven't used other DBMSes much but expected similar behaviour
Indeed, I would be surprised if there exists any RDBMS where the content of the select list in an EXISTS subquery matters. Postgres's SQL dialect even lets you use an empty select list (`... WHERE EXISTS (SELECT FROM related_table WHERE related_id = id_from_outer_query)`). In T-SQL, however, a non-empty select list is required, and in my experience, developers writing T-SQL tend to prefer a constant value like `1` or `NULL` over `*` -- I suspect there's some superstition there related to both the common wisdom that `SELECT *` queries are to be avoided (which is true -- you really should only ever select the columns of interest) and a lack of truly understanding that EXISTS only reads enough of the table to ascertain whether rows exist that satisfy the subquery's predicate, and returns a boolean value, not a result set.
As a non-sql-expert (I've used an exists query approximately once in my life) I prefer `select 1` over `select column_name` or `select *` because it doesn't look like it might be meaningful.
Postgres's dialect seems like it made the right choice here.
It works fine for me.. when I'm thinking in sqlese. The bigger travesty here is using SELECT for the tasks which are not selecting (returning data) anything.
Both Oracle and SQL Server have the advice to use a constant value on their documentation. Postgres used to advise the use of `*`, but looks like they improved their optimizer so it doesn't matter anymore.
It's not superstition. It's people that know deeply how a complex system works picking the option with the best set of side-effects.
You could. But `SELECT 1` is a very common pattern in SQL Server, intended exactly to declare intent to the reader that "this subquery doesn't return any real data" - in the event the `EXISTS` that surrounds it isn't enough of a clue. You could use `SELECT NULL` or `SELECT 0` or `SELECT *` or `SELECT 12/0` and the behavior would be identical, but most of those would likely raise even more questions to an inexperienced reader than `SELECT 1`.
Just be sure to check the resulting plans. The DB we use usually has sub-optimal plans when using IN compared to EXISTS, often quite drastically so.
You always need to check the resulting plans if you care about execution performance, that's just how life is.
I'm offhand a bit surprised IN does worse than EXISTS; I can understand NOT IN being slow, because it has very surprising NULL handling that is hard to optimize for.
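For anyone who hasn't hit it, the NULL trap looks roughly like this (hypothetical names):

-- If Orders.CustomerID is nullable and contains even one NULL, this returns no rows
-- at all, because "x NOT IN (..., NULL)" is never true:
SELECT c.CustomerID
FROM Customers AS c
WHERE c.CustomerID NOT IN (SELECT o.CustomerID FROM Orders AS o)

-- NOT EXISTS sidesteps the NULL issue and is usually easier to optimize:
SELECT c.CustomerID
FROM Customers AS c
WHERE NOT EXISTS (SELECT 1 FROM Orders AS o WHERE o.CustomerID = c.CustomerID)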
Explicit semi- and antijoin operators would be a nice addition to the language, but do note that they present some additional challenges for the optimizer. In particular, when the join condition becomes explicit, it can take on more forms than before, which can limit the reorderings allowed or allow new ones. E.g. you now need to deal with this structure:
(a SEMIJOIN b ON a.x=b.y) JOIN c ON b.z=c.z
This was an impossible structure before, since IN and EXISTS both hide b's columns (all semi- and antijoins effectively come last), and your optimizer will now need to know whether e.g. this associative rewrite is allowed or not:
a SEMIJOIN (b JOIN c ON b.z=c.z) ON a.x=b.y
Also, you'll need to deal with LATERAL semijoins, which you didn't before…
None of this is impossible, but there's more to it than just a small syntactic change.
I don't get it. "SELECT 1" is significantly more informative than "SELECT red_herring". It's a self-documenting and automatically enforced comment telling you that the information requested by the SELECT statement is not used by the query.
I copy paste SQL all the time, especially with CTE after CTE. The row number is a simple thing to copy paste that fills multiple functions with zero modification.
And that's different from 1 in what way other than being more confusing?
Note that in the case you're advocating for, the row number is called "OrderId", which you might have trouble with if you insist on copying a query from somewhere else and using it without modification. Wouldn't you prefer "1"?
It's helpful within CTEs as a unique identifier. This whole conversation makes me wonder if not many people minmax this sort of meta programming thing.
> although it would seem to be so by design, however design is only as robust as
It only seems so if you assume auto-increment is used to populate the order_id and that it always increases with time. That latter assumption is quite unsafe:
* Systems could have been merged with a bulk import of old orders into this one from elsewhere (assuming order_id is a surrogate key and there is a separate order code or such that is used to identify the orders externally).
* In fact, a simple insert of several records in the same statement will not necessarily get auto-increment values in the order you expect (in practice they usually do - but the DB engines do not guarantee this, it is an accident of other factors in their design rather than a defined behaviour).
* Because of optimisations for concurrency in the way auto-increment is handled, it is possible that long-running transactions could cause ordering discrepancies. In theory at least; in practice, unless you've explicitly opted out of ACID-preserving locking semantics for those transactions, I suspect these protections will stop this un-ordering happening by blocking the concurrency. This sort of issue is why you occasionally see unexpected gaps in auto-increment values.
* I have seen an example where an incrementing signed-int ID was getting too close to MAXINT for comfort, and as a temporary measure ahead of changing that ID to be a longer type the increment was reset to restart at MININT and head back towards 0 from there! This was with a 16-bit integer (I'm old enough to have been around when it was common to use them to save space, where we generally default to 32-bit these days) but the same could happen to larger types.
If you have an auto-increment and DO NOT have some logic around draft orders, max could be the one. Either way, you can max on the submission date then
Hmm… my hunch is that this doesn’t do what you think it does.
I expect the order by in the window function is effectively lost because max operated over the whole window. (And you happen to get the most recent, because in many implementations, order_id will be a sequence.)
But I might be wrong. And I might only now be learning that order by with max() and over substitutes how the “value” of the order_id is understood.
Here's a pretty simple/normalish way to handle the edge cases. This one (without distinct) is far more consistent (wall-clock-wise, doesn't depend on caches): http://sqlfiddle.com/#!15/7eb3a/9
Note that order 2 is after order 4 in the example schema.
Whenever I see [select distinct ... group by], I know I'm in for some BS. The original code creator had some duplication/cartesians but never properly addressed it.
I've done it a few times in the app I'm working with and had no idea DISTINCT was not the proper way to address it. I mean, it's a standard feature, named "distinct", easy to use, and it works. It's on me that I didn't run an explain on these queries, but it seems like an easy mistake to make.
Anyway I’ll happily fix these in my code on Monday.
Seeing distinct being abused in application code is pretty rare though. What gets me is seeing it in some ETL or data analyst's queries. That's the sign that you're about to spend the next hour unfucking a query with like 50 joins and no CTEs for organization, written by someone who has no understanding of the underlying schema.
Any recommended resources (books, courses, etc) which go deeper on sql topics like this? Comparing and contrasting different ways of writing the same query, with respect to performance.
The one book I’ve found and really enjoyed was “SQL Performance Explained”.
> why aren't these tables normalized better anyway? None of them should have columns for the PK of another table.
> The relationships between them should be represented by bridge tables. Really there's only one relationship here which is product orders.
What?
There appears to be a mistake in the Products table, which only allows a product to belong to one category.
But the many-to-many relationship between orders and products is already represented in a bridge table, OrderDetails, and the one-to-many relationship between customers and orders is correct. Why would you have a bridge table between customers and orders instead of recording the same information in a column of the Orders table?
And... how is your bridge table going to operate without having columns for the PKs of multiple other tables? That's what OrderDetails is doing. What would you change?
I don't think the one category is a mistake, merely a simplification. Why introduce another table that complicates the example and adds nothing? The article doesn't seem to be about "here's the absolute perfect design for orders and products" and dwelling on those details, well, doesn't really add anything either. :-)
> the one-to-many relationship between customers and orders is correct. Why would you have a bridge table between customers and orders instead of recording the same information in a column of the Orders table?
Because as you said the relationship is one customer to many orders.
> the many-to-many relationship between orders and products is already represented in a bridge table, OrderDetails
The relationship between Orders and Products is one-to-many, not many-to-many. Also, OrderDetails is not a bridge table because it has the quantity of Products; bridge tables should only have the IDs. The table I'm proposing is not going to have one row per order nor per customer. It's going to have n rows, where n is the number of categories a product in an order belongs to, multiplied by the number of products in an order, multiplied by the number of orders made by a customer, multiplied by the number of customers who have made orders. It's a lot of rows, but it compresses very well and makes the queries simpler by flattening all these relationships into one. The only relationship that matters for this query is customers and product category, but you can query for any relationship without joins in this bridge table I'm proposing.
> That's what OrderDetails is doing. What would you change?
OrderDetails is messed up in other ways too. What is "LineItemID"? LineItem should have the quantity and price? That table isn't even defined in the example. OrderDetails is a bad table that needs to go. Your interpretation of what I'm saying is correct. I would add an actual bridge table (not OrderDetails) that relates the customers to the orders.
You're correct that OrderDetails, as written, does need to relate an order to all its products, but my point was that you don't need to join to query for all the customers who have made at least one order of a given category. You just need to select where the category ID matches.
I'd kind of like to respond, but nothing you just said makes any sense.
> OrderDetails is messed up in other ways too. What is "LineItemID"? LineItem should have the quantity and price? That table isn't even defined in the example.
You could have answered these questions by just reading the example schema. LineItemId is part of the two-column primary key defined for the OrderDetails table. It's not a reference to anything; presumably its purpose is to define the order in which line items appear in an Order.
> OrderDetails is a bad table that needs to go.
I hope you don't do a lot of code review. Or much coding, for that matter.
> I hope you don't do a lot of code review. Or much coding, for that matter.
With responses like yours, I feel the same back at you. I always find it hilarious how passionate people can get about SQL just because it's hard to discuss without a full working example.
You should re-read and try out what I said. The schema in the example is contrived to create the very issue they're whining about, but bad schemas are at the root of all bad queries.
>>> The relationship between Orders and Products is one-to-many, not many-to-many.
Let's deal with this one first. Suppose customer 12 orders 10 pounds of gravel and a potted cactus while customer 15 orders 30 pounds of gravel and a cherry sapling.
I claim that this illustrates that it's possible for one order to be associated with multiple products (customer 12 placed an order for two different things), at the same time that it's also possible for one product to be associated with multiple orders (customers 12 and 15 each ordered gravel).
But you appear to know better than me. Please explain to me why it wasn't possible for the same product to appear in several orders.
> I always find it hilarious how passionate people can get about SQL just because it's hard to discuss without a full working example.
I actually don't have much in the way of opinions on SQL. But I do have strong opinions about self-righteous stupidity. You have repeatedly made it plain that you don't understand what the words you're using mean. Take this claim of yours:
> why aren't these tables normalized better anyway? None of them should have columns for the PK of another table.
The tables are already fully normalized. It would be possible to render them differently, so that (for example) the Orders table didn't have a column for CustomerID and instead a separate table existed with records of the form (OrderID, CustomerID) -- but this would not affect the normalization of the tables. All this would do is cause you to perform extra queries whenever you had one piece of information and you wanted the related piece of information.
Or this one:
> Also, OrderDetails is not a bridge table because it has the quantity of Products.
Annotating the relationship it embodies won't make OrderDetails not a bridge table. It's a bridge table because it records the many-to-many relationship between Orders and Products. Here ( https://www.youtube.com/watch?v=uKd01Ouw_b8 ) is a university lecture in which one of these tables is explicitly identified as a "bridge table" at the same time that it records such supplemental information. (In that case, what's happening is that the "AssignmentGrades" table bridges between "Assignments" and "Enrollments" while recording the grade each student earned on each assignment. This is fundamentally identical to OrderDetails bridging between Orders and Products while recording the quantity of each product purchased in each order.)
The bridge tables will lead to a lot of extra joins in the rest of the application though. Like the typical case of a grid with the order items which has product name and category as well.
Unless you mean something very different in which case some concrete examples would be nice.
It's an arbitrary but common one used in SQL Server. If the example used 128, you'd ask "why 128?" If it was 64, 50, 100, or any other number, you'd still ask. What is the right number? String lengths should come from business rules and the data domain but, when those don't exist (or someone is creating a fake table just to show an example), they borrow from other examples or existing schemas. It's clearly not the point of the article to come up with the absolute best and universally satisfying table design for orders.
As a sibling comment said, optimizing for size. However, since in utf8mb4 (the standard for Postgres, and modern MySQL) each char may take up to 4 bytes to store, the updated version of this would be VARCHAR(63).
If you can guarantee you’ll only store ASCII, you can safely use 255.
This is also probably a premature optimization for practically anyone to be fair. Even at 1 billion rows, you’ve only added 1 GiB of overhead, and at that scale that’s a small percentage of the overall size.
At least generate some sizeable mock data before showing the query plans. The optimisations applied can be very different depending on the number of rows in each table.
I lost interest once I came to the words "SQL Server", which I haven't used for a couple of decades. Unfortunately, those words didn't crop up until the last sentence of the article...
There is no reason to mention SQL Server for most of the article, the issue being discussed is relevant to all (or at least all common) SQL implementations. The query pattern is flawed in a fundamental way, or the table structure necessitating it is, rather than this being a SQL Server specific trap.
The only reason the product is specifically mentioned in that last part is because it mentions an optimisation that is possibly specific to SQL Server's query planner.
If you have such an allergy to the product that you can't process general advice from someone who uses it, there are clues that could have triggered your anaphylaxis earlier and saved you time: T-SQL is mentioned in the site's side-bars, as is “Learn SQL Server” for that matter.
I'm not allergic to SQL Server. I didn't study the example code; I was just curious about why DISTINCT was a problem. TL;DR: for some query planners, DISTINCT requires a table sort, which requires a full table scan, which can be avoided by using a construct (GROUP BY) that gives the planner hints.
But I assumed that was dependent on the query planner; Oracle SQL, as I recall, has/had specific constructs for giving hints to the query planner.
> But I assumed that was dependent on the query planner;
Problems that can be “fixed” with DISTINCT but really shouldn't be, an accidental cross-join being one I've seen many times, are common to many (all?) SQL implementations, and I don't think they are treated much better by any of them (they are generally mistakes, so prioritising optimising for them is not a good use of the query planner engineering team's time).
> Oracle SQL, as I recall, has/had specific constructs for giving hints to the query planner.
SQL Server has index hints much like Oracle's, along with locking hints (which I think Oracle has similarly), and so forth, which can tweak the query planner's behaviour. They are best avoided if other options are available because:
• they often only work well for a given data pattern, so you are just pushing the problem into the future when your data size has grown noticeably (in this context they are valuable as a temporary workaround, just make sure you treat them as such and work to remove the need)
• they potentially block optimisations that future versions of the query planner/engine can perform by forcing it a different way around
• sometimes they make explicit reference to objects (indexes usually), tightly coupling their names to things that others may not expect, making accidental breaking changes more likely
Postgres doesn't have the same sort of index hints IIRC, but does have options that constrain the query planner similarly, to try to make it prefer a particular path and/or avoid another.
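For example (a sketch of the Postgres knobs meant here; they affect the whole session or transaction, not a single join):

SET enable_seqscan = off;   -- heavily penalise sequential scans so the planner avoids them where it can
SET enable_nestloop = off;  -- likewise for nested-loop joins
-- blunt instruments compared to per-query hints, which is roughly the trade-off being described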
Yeah, thanks. I get all that; avoid proprietary language constructs if possible. Try to tune the query and the schema so that no sane optimizer would do the wrong thing.
But the last time I tangled with Oracle, I was working for a bank; schema changes took months to approve, and the DBAs were GHODZ. And there was no chance of the bank changing suppliers anytime soon. And anyway, EVERYTHING was proprietary. They had their own proprietary stored-procedure language, and it wasn't dispensable - I believe people bought Oracle because of their stored procedure setup.
To be clear, I'm not in any sense advocating hinted SQL!
I don't. I put distinct on everything by default, because it makes SQL more like relational algebra.
What's the argument for avoiding distinct? Is it performance? I've often wondered if the (sometimes redundant) distincts slow things down. Perhaps it does, but just not enough that I noticed.
Performance implications exist, but they are secondary.
Primary reason:
distinct on every select shows either lack of knowledge of schema, in particular which columns make rows unique, or unfortunate schema design.
(Apart from niche cases, schema should be somewhat normal. I.e. column parent_name belongs in the table parent, not in the table student)
Select a from x where myuniquekey=1; -- guaranteed to return one or zero rows, if myuniquekey is actually unique.
Select a from x join y on x.parent_id = y.y_id -- guaranteed to return the same number of rows as exist in x, never more, never duplicating x rows. (N-to-1 relation)
If distinct is used in any of above, then question “why?” naturally arises.
In more severe case, leads to bugs:
Select distinct student.student_name, parent.parent_name from student join parent on student.parent_id = parent.parent_id -- silently discards rows where, by accident, the student/parent name combo matches several times.
Technically SQL allows comparing unrelated columns (colour=last_name), but in the vast majority of cases, when joining, one of the sides should be joined using its unique key, and the other side should be using its foreign key, which ensures that duplicates don't appear randomly, and thus distinct is not needed.
> If distinct is used in any of above, then question “why?” naturally arises.
Not if distinct is the default.
> Select distinct student.student_name, parent.parent_name from student join parent on student.parent_id = parent.parent_id -- silently discards rows where, by accident, the student/parent name combo matches several times.
Either with or without distinct can be a bug depending on what you are doing it for.
There are actually 4 variations on what you might want, and you can get all of them with distinct:
select distinct student.student_id, student.student_name, parent.parent_id, parent.parent_name from ...
select distinct student.student_name, parent.parent_id, parent.parent_name from ...
select distinct student.student_id, student.student_name, parent.parent_name from ...
select distinct student.student_name, parent.parent_name from ...
Our main application at work is essentially a CRUD application, and I've worked on it for over 10 years now. I'm fairly confident I can count on one hand the number of cases where a join returned unexpected duplicates which DISTINCT would "fix".
Sometimes I wonder if we're just weird, somehow avoiding this issue.
These examples reminded me of one more issue: a change in column selection might change the number of rows,
which means column addition/removal is so much riskier, AFAIR.
> Not if distinct is the default.
If that works for you, great, but let’s agree to disagree here.
Your mental model, if you will forgive the straw man, is that SELECT over multiple tables is conceptually equivalent to nested for-loops over each table, and the WHERE condition is an if-statement.
My mental model is that I'm working with sets. If yesterday I asked for the set of CITY,COUNTRY, and today I've changed that to the set of COUNTRY, then obviously the result set today is going to be much smaller. This is not a risk to me -- asking for a different set gives me a different set, I can't imagine being surprised by that.
[1] https://aws.amazon.com/blogs/big-data/10-best-practices-for-...