The question to be asked is whether your coding language contains non-essential ...

corey_moncure · on Nov 12, 2019

I just recently learned that the ordering of fields in your GROUP BY clause has meaningful consequences for the output and can differ from what's in your top SELECT. Try it with something like SQL Server's `FOR JSON AUTO` and you'll immediately see why `GROUP BY` demands the explicit order from you.

dspillett · on Nov 12, 2019

This is the case for the older equivalent syntax for XML also. Though it isn't quite SQL Server automatically doing anything, i.e. automatically extended the GROUP BY clause as needed. It is simply a short-hand for "use your standard heuristic to guess my schema instead of me giving it to you" and you explicitly tell it you want that heuristic to be followed, it isn't just deciding to because the other things you've told it are ambiguous. Also note that it has not always been version safe: I had trouble between SQL 2005 and 2008 with the way some FOR XML AUTO queries were interpreted having changed.

Maybe a similar syntax, "GROUP BY AUTO", could be useful, but I suspect it would quickly become a code smell - some people would slap "AUTO" in when they get an error in much the same way they slap "DISTINCT" in to fix an accidental cartesian product without actually thinking about the problem.

As irritating as having to list many columns in the GROUP cause of a statement with a wide output, I like that it has to be explicit: if I have the column lists wrong I've made an error and I don't want SQL Server to guess how to fix that error. Perhaps instead of missing a column in the grouping list I've instead messed up in the projection clause. And if the statement is getting inconvenient enough that the grouping clause is a significant irritation, perhaps it needs to be refactored more generally.

One further thing to note is that some DBs do automagically decide what to do when columns are neither in an aggregate or the grouping list: mySQL for instance will just return an arbitrary value from those that exist in the current group for such columns (IIRC the first value hit as the data is read). This leads to situations where the query seems to work fine until something changes in either the data balance or the available indexes (or some change is made to the query itself) that makes the processing happen a different way around so a different arbitrary value starts to get selected for the same group in the same query.

corey_moncure · on Nov 12, 2019

>This leads to situations where the query seems to work fine until something changes in either the data balance or the available indexes (or some change is made to the query itself) that makes the processing happen a different way around so a different arbitrary value starts to get selected for the same group in the same query.

That's a great point

paulddraper · on Nov 12, 2019

FOR JSON AUTO is a non-standard SQL extension.

And as the name suggests its somewhat magic/implicit in its behavior.

No ANSI SQL result is altered by changing the order of GROUP BY.

jimktrains2 · on Nov 12, 2019

> IMO, in an aggregated query, SQL ought to implicitly add all non-aggregated fields to the GROUP BY

I have literally never, ever wanted this. Why in the world would you group by any field not appearing in an aggregate?

Also, the group by can specify cubes and rollups, more over you may not want to group by any field, but rather an expression.

paulddraper · on Nov 12, 2019

You virtually always want this.

   SELECT count(*), favorite_color, favorite_song
   FROM person
   GROUP BY ...

The "..." is of course favorite_color, favorite_song.

Whether you want this or not, ANSI SQL requires that you group by at least these.

You could group by more more fields, such as favorite_pet, though that would be somewhat odd since you are did not include it in the result set.

jimktrains2 · on Nov 12, 2019

Ah, you meant in the select, not the table. I guess I'm just much more in favour of being explicit, but I can understand the usefulness of what you said now that I fully understand it.

sanderjd · on Nov 12, 2019

This would definitely lead to me being very confused when things I'm selecting that I thought were aggregates were actually being implicitly grouped, or when one thing I'm selecting that I didn't realize is an aggregation results in grouping by all the other fields. There is no way for the query engine to know which direction I made a mistake in, so it just tells me something is wrong and let's me figure out how to fix it. Failing fast is much better than trying to guess at what I'm trying to do.

paulddraper · on Nov 12, 2019

To summarize: "Writing things twice allows you to know where you went wrong and correct it."

That's true.

I still prefer DRY, but I can appreciate the WET perspective.

sanderjd · on Nov 13, 2019

Your summary is wrong. When you write "select field" you are saying which field to return from the query. When you write "group by field", you are saying that you want the query to group by that field when generating query results. You are not writing anything twice, you are writing two different things.

paulddraper · on Nov 13, 2019

When I write "select field" I am saying which field I want to return from the query and which field I want to group by. There's no difference. And the DBMS enforces that there is no difference. *

* Except for the special case when it's non-aggregated and I want duplicate rows in the result set. SQL allowing duplicate rows is a departure from the underlying relational paradigm.

jeltz · on Nov 12, 2019

The reason it needs to be explicit is to prevent mistakes, which is why it is a good thing that most databases try to infer which columns are functionally dependent on the GROUP BY columns so that you can skip them for the GROUP BY clause.