I had never heard of GROUP BY CUBE either! It looks like it's part of a family of special GROUP BY operators—GROUPING SETS, CUBE, and ROLLUP—that basically issue the same query multiple times with different GROUP BY expressions and UNION the results together.
Using GROUP BY CUBE(a, b, c, ...) creates GROUP BY expressions for every element in the power set of {a, b, c, ...}, so GROUP BY CUBE(a, b) does separate GROUP BYs for (a, b), (a), (b) and ().
It's like SQL's version of a pivot table, returning aggregations of data filtered along multiple dimensions, and then also the aggregations of those aggregations.
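Roughly, with a hypothetical sales(region, product, amount) table, the expansion looks something like this:

SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY CUBE (region, product);

-- ...which behaves like the union of the four separate groupings:
SELECT region, product, SUM(amount) FROM sales GROUP BY region, product
UNION ALL
SELECT region, NULL, SUM(amount) FROM sales GROUP BY region
UNION ALL
SELECT NULL, product, SUM(amount) FROM sales GROUP BY product
UNION ALL
SELECT NULL, NULL, SUM(amount) FROM sales;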
It seems like it's well supported by Postgres [1], SQL Server [2] and Oracle [3], but MySQL only has partial support for ROLLUP with a different syntax [4].
I would gladly buy a book of "SQL Recipes" ranging from beginner-level to advanced stuff that uses features like this, ideally with coverage of at least a few popular database systems, but at minimum Postgres.
Not a pivot table equivalent. Most useful for calculating multiple related aggregates at once for reporting purposes, but ROLLUP doesn't substitute values for columns, i.e. it doesn't pivot results on an axis.
For folks just learning about ROLLUP et al, I highly recommend this comparison chart for an overview of major features offered by modern relational databases.
https://www.sql-workbench.eu/dbms_comparison.html
There's a whole constellation of advanced features out there that arguably most application developers are largely unaware of. (Which explains why most app devs still treat relational databases like dumb bit buckets at the far end of their ORMs.)
I had a situation recently where I had a huge amount of data stored in a MariaDB database and I wanted to create a dashboard where users could interactively filter subsets and view the data. The naive solution of computing the aggregate statistics directly from the users' filter parameters was too slow; most of the aggregation needed to be done ahead of time and cached. The website's backend code was a spaghetti house of horrors, so I wanted to do as much as possible in the DB. (The first time in my career I chose to write more SQL rather than code.)
If I had a fancy DB I could use CUBE or GROUPING SETS and MATERIALIZED VIEWs to easily pre-calculate statistics for every combination of filter parameters, automatically updated when the source data changes. But I had MariaDB, so I made do. I ended up with something like this:
SELECT ... SUM(ABS(r.ilength)) AS distance, COUNT(*) AS intervals FROM r
GROUP BY average_retro_bucket, customer, `year`, lane_type, material_type, state, county, district WITH ROLLUP
HAVING average_retro_bucket IS NOT NULL AND customer IS NOT NULL;
"The WITH ROLLUP modifier adds extra rows to the resultset that represent super-aggregate summaries. The super-aggregated column is represented by a NULL value. Multiple aggregates over different columns will be added if there are multiple GROUP BY columns."
So you can query like this to get stats for all districts in CA->Mendocino county:
SELECT * FROM stats_table WHERE state = 'CA' AND county = 'Mendocino' AND district IS NULL
or like this to get a single aggregate of all the counties in CA put together:
SELECT * FROM stats_table WHERE state = 'CA' AND county IS NULL AND district IS NULL
However unlike CUBE, WITH ROLLUP doesn't create aggregate result sets for each combination of grouping columns. If one grouping column is a NULL aggregate, all the following ones are too. So if you want to query all the years put together but only in CA, you can't do:
SELECT * FROM stats_table WHERE year IS NULL AND state = 'CA'
If `year` is null, all the following columns are as well. The solution was to manually implement wildcards before the last filtered group column by combining the rows together in the backend.
I worked around not having materialized views by creating an EVENT that would re-create the stats tables every night. The stats don't really need to be real-time. Re-writing the multiple-GB statistics tables every night will wear out the SSDs in 20 years or so, oh well.
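For the curious, the nightly rebuild can be sketched roughly like this (names simplified, columns borrowed from the query above; the event scheduler has to be enabled with SET GLOBAL event_scheduler = ON):

DELIMITER //
CREATE EVENT rebuild_stats_nightly
ON SCHEDULE EVERY 1 DAY
DO
BEGIN
  -- rebuild into a scratch table, then swap it in
  DROP TABLE IF EXISTS stats_table_new;
  CREATE TABLE stats_table_new AS
    SELECT state, county, district,
           SUM(ABS(ilength)) AS distance,
           COUNT(*) AS intervals
    FROM r
    GROUP BY state, county, district WITH ROLLUP;
  DROP TABLE IF EXISTS stats_table_old;
  RENAME TABLE stats_table TO stats_table_old,
               stats_table_new TO stats_table;
  DROP TABLE stats_table_old;
END //
DELIMITER ;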
SQL opens up when used with OLAP schemas. Most devs are experienced in querying "object mapped" schemas where cube, roll up, etc. are not useful. Nothing bad per se, but it can give an impression that SQL is a bad language, when actually it clicks well with a proper data schema.
Indeed. I think your mind can really be opened by having to answer complex business questions with an expansive and well designed data warehouse schema. It's a shame it's such a relatively niche and unknown topic, especially in the startup world.
This is why the data engineers get paid the big bucks, and also why having a good data engineer is a lot more important than a good data scientist in the early stages of a company.
I've never used `cube` in any context, but if I may I'd suggest you're parsing this wrongly:
`group by cube`/`group by coalesce` aren't special advanced features, they're just `group by`. You can group on 'anything', e.g. maybe you want to group on a name regardless of case or extraneous whitespace - you can use functions like `lower` and `trim` in the `group by` no problem, it's not something to learn separately for every function.
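For instance, on a hypothetical customers(name) table:

SELECT lower(trim(name)) AS normalized_name, count(*) AS n
FROM customers
GROUP BY lower(trim(name));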
Cube gets all possible combinations of grouping sets. It's like showing all subtotals in a pivot table. That's different than just grouping on the lowest level without totals.
In the context of `group by` it's treated as grouping sets, but that's not its only use. (Though that does seem to be special cased in terms of parsing, since afaict - I can't find the full query BNF on mobile - `grouping sets` is not optional.)
Not a huge fan of Microsoft but the TSQL documentation is solid. If you're not using CROSS APPLY to tear apart things and put them back together you've not lived.
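For example, a sketch of pulling the latest few child rows per parent row, with made-up table names:

SELECT o.order_id, o.customer_id, t.item_id, t.created_at
FROM dbo.orders AS o
CROSS APPLY (
    SELECT TOP (3) oi.item_id, oi.created_at
    FROM dbo.order_items AS oi
    WHERE oi.order_id = o.order_id
    ORDER BY oi.created_at DESC
) AS t;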
it's pretty useful when working with hierarchical data, but you do need to put in some check for cyclical relations; I have seen those take an application down :D.
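With recursive CTEs in particular, a crude depth cap is a cheap guard against a cycle running away (a sketch with a hypothetical departments table; SQL Server omits the RECURSIVE keyword):

WITH RECURSIVE org AS (
    SELECT id, parent_id, name, 1 AS depth
    FROM departments
    WHERE parent_id IS NULL
  UNION ALL
    SELECT d.id, d.parent_id, d.name, org.depth + 1
    FROM departments AS d
    JOIN org ON d.parent_id = org.id
    WHERE org.depth < 20  -- stop runaway recursion if the data contains a cycle
)
SELECT * FROM org;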
If you aren't careful you can cause that without infinite recursion. If the query optimiser can't push relevant predicates down to the root level where needed for best performance, or they are not sargable anyway, or the query optimiser simply can't do that (CTEs were an optimisation barrier in Postgres until v12 added inlining), then you end up scanning whole tables (or at least whole indexes) multiple times, where a few seeks might be all that is really needed. In fact, you don't even need recursion for this to have a bad effect.
CTEs are a great feature for readability, but when using them be careful to test your work on data at least as large as you expect to see in production over the life of the statements you are working on, in all DBMSs you support if your project is multi-platform.
HAVING is less understood? There’s nothing strange about HAVING, it’s just like WHERE, but it applies after GROUP BY has grouped the rows, and can use the grouped row values. (Obviously, this is only useful if you actually have a GROUP BY clause.) If HAVING did not exist, you could just as well do the same thing using a subselect (i.e. doing SELECT * FROM (SELECT * FROM … WHERE … GROUP BY …) WHERE …; is, IIUC, equivalent to SELECT * FROM … WHERE … GROUP BY … HAVING …;)
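Schematically, with a hypothetical orders table:

SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 100;

-- ...returns the same thing as:
SELECT *
FROM (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
) AS t
WHERE t.total > 100;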
FWIW I just tested a somewhat complex query using HAVING vs a sub-select as you indicated and Postgres generated the same query plan (and naturally, results) for both.
But what happens if you want to filter on the result of a WINDOW? Sorry, time to write a nested query and bemoan the non-composability of SQL.
Snowflake adds QUALIFY, which is executed after WINDOW and before DISTINCT. Therefore you can write some pretty interesting tidy queries, like this one I've been using at work:
SELECT
*,
row_number() OVER (
PARTITION BY quux.record_time
ORDER BY quux.projection_time DESC
) AS record_version
FROM foo.bar.quux AS quux
WHERE (NOT quux._bad OR quux._bad IS NULL)
QUALIFY record_version = 1
Without QUALIFY, you'd have to nest queries (ugh):
SELECT *
FROM (
SELECT
*,
row_number() OVER (
PARTITION BY quux.record_time
ORDER BY quux.projection_time DESC
) AS record_version
FROM foo.bar.quux AS quux
WHERE (NOT quux._bad OR quux._bad IS NULL)
) AS quux_versioned
WHERE quux_versioned.record_version = 1
or use a CTE (depending on whether your database inlines CTEs).
I definitely pine for some kind of optimizing Blub-to-SQL compiler that would let me write my SQL in a more composable style.
But HAVING can also act on aggregate results. In fact, the example in the article is not the most important use for HAVING. Subselects can't directly do something like this:
SELECT year, COUNT(*) AS sales, SUM(price) AS income
FROM sales
GROUP BY year
HAVING COUNT(*) > 10
AND SUM(price) < 1000;
For more complicated examples, HAVING can produce easier to read/understand/maintain statements.
There may also be performance differences depending on the rest of the query, but for simple examples like this exactly the same plan will be used, so the performance will be identical.
Unfortunately, the simplest examples do not always illustrate the potential benefits of less common syntax.
What I never understood is why HAVING and WHERE are different clauses. AFAIU, there are no cases where both could be used, so why can’t one simply use WHERE after a GROUP BY?
(I know that I am probably missing some important technical points, I would like to learn about them)
It doesn't improve the power of SQL, it's just syntactic sugar. Because of how the SQL syntax works, you'd have to do something like:
WITH A AS (
SELECT x, sum(y) AS z
FROM SomeTable
GROUP BY x
)
SELECT * FROM A WHERE z > 10
With a HAVING clause you can instead just tuck it after the GROUP BY clause.
Also, although it's not an issue these days given how good query planners are (any decent engine will produce exactly the same query plan with a subquery or a HAVING clause; it's indexes that fuck up stuff), you're signaling that the filter happens "at the end".
It's like having both "while" and "for" in a programming language. Technically you don't need it, but it's for humans and not for compilers.
HAVING is a hack. Because SQL has such an inflexible pipeline we needed a second WHERE clause for after GROUP BY.
It would be nice if we could have WHERE statements anywhere. I would like to put WHERE before joins for example. The optimizer can figure out that the filter should happen before the join but it is nice to have that clear when looking at the code and it often makes it easier to read the query because you have to hold less state in your head. You can quickly be sure that this query only deals with "users WHERE deleted" before worrying about what data gets joined in.
I like using HAVING just to have conditions that reference expressions from the SELECT columns.
e.g. rather than having to do
SELECT
COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
...
FROM ...
WHERE COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) IS NOT NULL
just do
SELECT
COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
...
FROM ...
HAVING district IS NOT NULL
Hopefully the optimizer understands that these are equivalent, I haven't checked.
It doesn't irk me anything like as much as 'a Docker' (a Docker what? Usually container) for some reason.
Although perhaps that shouldn't annoy me anyway, it ought to be better than being specific but inaccurate (which you could probably expect from someone unfamiliar enough to say 'a Docker') - mixing up image/container as people do.
Forget HAVING -- in my opinion RANK is the most useful SQL function that barely anyone knows about. Has gotten me out of more sticky query issues than I can count.
RANK is a good example of a useful less known window function. Please also consider dense_rank and row_number. All 3 have their advantages and disadvantages... (Rank leaves holes after rows with the same order, dense_rank gives consecutive numbers, row_number just numbers the rows.)
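For example, on a hypothetical scores(player, points) table:

SELECT player, points,
       RANK()       OVER (ORDER BY points DESC) AS rnk,        -- ties share a rank, leaving gaps: 1, 2, 2, 4
       DENSE_RANK() OVER (ORDER BY points DESC) AS dense_rnk,  -- ties share a rank, no gaps: 1, 2, 2, 3
       ROW_NUMBER() OVER (ORDER BY points DESC) AS row_num     -- just numbers the rows: 1, 2, 3, 4
FROM scores;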
Well, I used SQL professionally for at least 7 years in different capacities and “having” was a new construct to me. You might say that I suck at SQL or you might realize that there’s more than one way to achieve a result in SQL. I usually prefer “with” statements. I hope this is not the only criterion you use to filter candidates.
In my mind a better question would be to state the problem and see if the candidate can come up with a working SQL statement.
I would ask you more questions about window functions, CTEs, joins, etc., or how you would address certain problems, and tbh I would be really curious how you know SQL well but haven’t heard of HAVING.
In my hiring experience it has always acted as my first SQL impression of candidates and a 100% reliable filter for people who don’t actually know SQL beyond some basic inner joins and aggregations. There are always exceptions i guess. :-)
Can anyone explain why the query without having needs 14 separate queries? That seemed insane to me.
It seems like the author is using one query per country, when it seems like you’d just group by country and year, with country <> 'USA'.
You would need some unions to bolt on the additional aggregations, but it’s more like 4 queries, not 14.
E.g.
select c.ctry_name, i.year_nbr,
sum(i.item_cnt) as tot_cnt,
sum(i.invoice_amt) as tot_amt
from country c
inner join invoice i on (i.ctry_code = c.ctry_code)
where c.ctry_name <> 'USA'
group by c.ctry_name, i.year_nbr
The generous reading, which I'm inclined to, is that it's illustrating OLAP techniques which one would actually apply to a relational database of high normalization.
It's tedious to construct an example database of, say, fifth normal form, which would show the actual utility of this kind of technique. So we're left with a highly detailed query, with some redundancies which wouldn't be redundant with more tables.
My SQL knowledge is very limited - I had heard of HAVING but not GROUP BY CUBE or COALESCE - but one thing stood out: "The rewritten sql ... ran in a few seconds compared to over half an hour for the original query."
I know there were four million rows in the dataset, but is 30 minutes the kind of run-time you would expect for a query like this?
> I know there were four million rows in the dataset, but is 30 minutes the kind of run-time you would expect for a query like this?
Clearly not since they got it down to a few seconds ;).
But tongue in cheek aside, it's incredibly dependent on what is in those 4M rows, how big they are, if they are indexed, whether you join in weird ways, whether the query planner does something unexpected.
SQL tooling is by and large pretty barbaric compared to most other tools these days. IntelliJ (and its various spin-off language-specific IDEs) has the best tooling I've seen in that area, but even then it's primitive.
I'd have to disagree with you on the tooling. Mature (especially commercial) SQL databases have a lot of great tooling built around them that in many areas surpasses most of the language tools. For example, Extended Events, which comes with SQL Server by default, allows you to trace/profile/filter hundreds of different events in real time, going from simple things like query execution tracing all the way to profiling locks, spinlocks, IO events and much more. There's another built-in tool called Query Store that allows you to track performance regressions, aggregate performance statistics etc. And then there's a whole infrastructure of 3rd-party tools for things like execution plan analysis, capacity planning etc. Oracle has a similarly rich set of tools. Postgres is lacking some of those, but it's getting better.
IDE support in JetBrains products is not that far away from, say, the Java experience, but from a pure coding perspective it's a bit behind.
I think that when doing these kinds of report-style queries, the cubed/rolled-up columns become labels and lose their data type. So it makes sense to convert+coalesce the year column as well, to get 'All years' instead of the hard-to-interpret "null year".
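Something like this Postgres-style sketch (hypothetical invoice table), using GROUPING() to tell a real NULL apart from a super-aggregate row:

SELECT
  CASE WHEN GROUPING(year_nbr) = 1 THEN 'All years'
       ELSE CAST(year_nbr AS varchar) END AS year_label,
  SUM(invoice_amt) AS tot_amt
FROM invoice
GROUP BY CUBE (year_nbr);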
How would one use the query result in the example? A result set where every row can sit at a different grouping level seems hard to use in any capacity other than a human reading it and interpreting it directly.
Well, I consider myself somewhat fluent with SQL, and for some reason left joins are the ones that occasionally get me really confused - so much so that I actively try to avoid them. Trouble is not in the vanilla cases, but when you start throwing multiple tables and multiple where clauses in the same query, then there is something about left joins and nulls that is really unintuitive to my brain. Maybe I should spend some time and study the left join more...
I think left join is a bad name. Probably something like OPTIONAL JOIN or TRY JOIN would be more obvious. Of course the problem is then what do you call a right join? REVERSE OPTIONAL JOIN? But that is getting pretty confusing now. But maybe worth it because in my experience left is far more common.
It's best to pretend that RIGHT JOIN doesn't exist, imnsho.
For one thing, in SQLite, it doesn't. Which is a weak argument for not using it on supported systems. The other weak argument is that a RIGHT JOIN is just the b, a version of a LEFT JOIN a, b.
When you add them up it's an extra concept, SQL execution flow is already somewhat unintuitive, and a policy of using one of the two ways of saying "everything from a and matches from b" makes for a more consistent codebase.
I would hope a blue-sky relational query language wouldn't support two syntaxes for an operation which is non-commutative; when order is important, it can and should be indicated by order.
It'll be a long time before one can use it in portable SQLite queries, for those cases where that matters. I'll continue to eschew the right join for the clarity of only thinking about that relation in one way, but we statically link SQLite for several good reasons, including being able to use new features as they arrive.
You are right that a LEFT join coupled with a filter in the WHERE clause sometimes makes it effectively a normal INNER join. Just when this is the case is difficult for me to work out. This is often a trick question in some SQL assessments I have seen. I say this as someone who has used SQL almost daily for over 20 years. I avoid such scenarios by using CTEs.
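The classic case, with hypothetical users/orders tables: a filter on the right-hand table in the WHERE clause throws away the NULL-extended rows, while the same condition in the ON clause keeps them.

SELECT u.id, o.total
FROM users AS u
LEFT JOIN orders AS o ON o.user_id = u.id
WHERE o.status = 'paid';   -- rows where o.* is NULL fail the filter, so this acts like an INNER JOIN

SELECT u.id, o.total
FROM users AS u
LEFT JOIN orders AS o ON o.user_id = u.id AND o.status = 'paid';   -- still keeps users with no paid orders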
After 20+ years of SQL usage (as an ordinary dev, not business/reporting/heavy SQL dev), I learned about `group by cube` from this article...
"group by cube/coalesce" is much more complicated thing in this article than "having" (that could be explained as where but for group by)