Hacker News new | past | comments | ask | show | jobs | submit login
Having, a less understood SQL clause (smallthingssql.com)
169 points by vpanyam on Aug 6, 2022 | hide | past | favorite | 81 comments



Having is less understood?

After 20+ years of SQL usage (as an ordinary dev, not business/reporting/heavy SQL dev), I learned about `group by cube` from this article...

"group by cube/coalesce" is much more complicated thing in this article than "having" (that could be explained as where but for group by)


I had never heard of GROUP BY CUBE either! It looks like it's part of a family of special GROUP BY operators—GROUPING SETS, CUBE, and ROLLUP—that basically issue the same query multiple times with different GROUP BY expressions and UNION the results together.

Using GROUP BY CUBE(a, b, c, ...) creates GROUP BY expressions for every element in the power set of {a, b, c, ...}, so GROUP BY CUBE(a, b) does separate GROUP BYs for (a, b), (a), (b) and ().

It's like SQL's version of a pivot table, returning aggregations of data filtered along multiple dimensions, and then also the aggregations of those aggregations.

It seems like it's well supported by Postgres [1], SQL Server [2] and Oracle [3], but MySQL only has partial support for ROLLUP with a different syntax [4].

[1]: https://www.postgresql.org/docs/current/queries-table-expres...

[2]: https://docs.microsoft.com/en-us/sql/t-sql/queries/select-gr...

[3]: https://oracle-base.com/articles/misc/rollup-cube-grouping-f...

[4]: https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.h...


I would gladly buy a book of "SQL Recipes" ranging from beginner-level to advanced stuff that uses features like this, ideally with coverage of at least a few popular database systems, but at minimum Postgres.

Is there such a book?


In fact, Yugabyte is giving it away for free - https://downloads.yugabyte.com/marketing-assets/O-Reilly-SQL...


This is great! Thanks for the link. :)


Joe Celko has a number of SQL for Smarties books I've been meaning to look through.


These are phenomenonal books. Your understanding of sql as both a language and as a tool will be transformed.

A bit dated in places but that also is helpful as you learn how a technique works instead of just relying on a vendor specific extension


Yes, PostgreSQL (9 and later), SQL Server (2008 and later), Oracle, SQL Anywhere and DB2 support grouping sets. All had it for longer than a decade.

Some also support the MySQL/MariaDB with rollup syntax introduced in 2018.


Not a pivot table equivalent. Most useful for calculating multiple related aggregates at once for reporting purposes, but ROLLUP doesn't substitute values for columns, ie. it doesn't pivot results on an axis.

MS SQL Server has native pivot support, and Postgres has a limited emulation of it through the crosstab(...) function. https://stackoverflow.com/a/11751905 https://www.postgresql.org/docs/current/tablefunc.html

For folks just learning about ROLLUP et al, I highly recommend this comparison chart for an overview of major features offered by modern relational databases. https://www.sql-workbench.eu/dbms_comparison.html

There's a whole constellation of advanced features out there that arguably most application developers are largely unaware of. (Which explains why most app devs still treat relational databases like dumb bit buckets at the far end of their ORMs.)


I should amend for clarity: GROUPING SETS are really useful for speeding up complex report generation.


I had a situation recently where I had a huge amount of data stored in a MariaDB database and I wanted to create a dashboard where users could interactively filter subsets and view the data. The naive solution of computing the aggregate statistics directly based on the users' filter parameters was too slow, most of the aggregation needed to be done ahead of time and cached. The website's backend code was a spaghetti house of horrors so I wanted to do as much as possible in the DB. (The first time in my career I chose to write more SQL rather than code)

If I had a fancy DB I could use CUBE or GROUPING SETS and MATERIALIZED VIEWs to easily pre-calculate statistics for every combination of filter parameters that automatically get updated when the source data changed. But I had MariaDB so I made do. I ended up with something like this:

    SELECT ... SUM(ABS(r.ilength)) AS distance, COUNT(*) AS intervals FROM r
    GROUP BY average_retro_bucket, customer, `year`, lane_type, material_type, state, county, district WITH ROLLUP
    HAVING average_retro_bucket IS NOT NULL AND customer IS NOT NULL;
"The WITH ROLLUP modifier adds extra rows to the resultset that represent super-aggregate summaries. The super-aggregated column is represented by a NULL value. Multiple aggregates over different columns will be added if there are multiple GROUP BY columns."

So you can query like this to get stats for all districts in CA->Mendocino county:

    SELECT * FROM stats_table WHERE state = 'CA' AND county = 'Mendocino' AND district IS NULL
or like this to get a single aggregate of all the counties in CA put together:

    SELECT * FROM stats_table WHERE state = 'CA' AND county IS NULL AND district IS NULL
However unlike CUBE, WITH ROLLUP doesn't create aggregate result sets for each combination of grouping columns. If one grouping column is a NULL aggregate, all the following ones are too. So if you want to query all the years put together but only in CA, you can't do:

    SELECT * FROM stats_table WHERE year IS NULL AND state = 'CA'
If `year` is null, all the following columns are as well. The solution was to manually implement wildcards before the last filtered group column by combining the rows together in the backend.

I worked around not having materialized views by creating an EVENT that would re-create the stats tables every night. The stats don't really need to be real-time. Re-writing the multiple-GB statistics tables every night will wear out the SSDs in 20 years or so, oh well.


SQL opens up when used with OLAP schemas. Most devs are experienced in querying "object mapped" schemas where cube, roll up, etc. are not useful. Nothing bad per se, but it can give an impression that SQL is a bad language, when actually it clicks well with a proper data schema.


Indeed. I think your mind can really be opened by having to answer complex business questions with an expansive and well designed data warehouse schema. It's a shame it's such a relatively niche and unknown topic, especially in the startup world.


This is why the data engineers get paid the big bucks, and also why having a good data engineer is a lot more important than a good data scientist in the early stages of a company.


I've never used `cube` in any context, but if I may I'd suggest you're parsing this wrongly:

`group by cube`/`group by coalesce` aren't special advanced features, they're just `group by`. You can group on 'anything', e.g. maybe you want to group on a name regardless of case or extraneous whitespace - you can use functions like `lower` and `strip` in the `group by` no problem, it's not something to learn separately for every function.


Cube gets all possible combinations of grouping sets. It´s like showing all subtotals in a pivot table. That´s different than just grouping on the lowest level without totals.


Your suggestion is incorrect. CUBE is not part of the expression.


Well, I did say I wasn't familiar with it, but it's correct for `coalesce` and I don't think `cube` is different: https://www.postgresql.org/docs/current/cube.html

In the context of `group by` it's treated as grouping sets, but that's not its only use. (Though that does seem to be special cased in terms of parsing, since afaict - I can't find the full query BNF on mobile - `grouping sets` is not optional.)


https://www.postgresql.org/docs/14/sql-select.html

    GROUP BY [ ALL | DISTINCT ] grouping_element [, ...]

    grouping_element can be one of:

        ( )
        expression
        ( expression [, ...] )
        ROLLUP ( { expression | ( expression [, ...] ) } [, ...] )
        CUBE ( { expression | ( expression [, ...] ) } [, ...] )
        GROUPING SETS ( grouping_element [, ...] )
You're right about COALESCE of course.


Exactly the same for me. It makes me think I need to start reading the TSQL documentation cover to cover.


Not a huge fan of Microsoft but the TSQL documentation is solid. If you're not using CROSS APPLY to tear apart things and put them back together you've not lived.


Heard of group by cube before, even tried it in production, but it was a total dog of a query to run, so I filed it away to try again in a few years.


Same!

I’ve been a developer for 15 years, consider myself pretty good with SQL, but I’m only now learning about group by cube.


The only “standard” feature id rather try not to understand is recursive CTEs lol


it's pretty useful when working with hierarchical data, but you do not to put some check for cyclical relations, I have seen those take an application down :D.


If you aren't careful you can cause that without infinite recursion. If the query optimiser an't see to push relevant predicates down to the root level where needed for best performance, or they are not sargable anyway, or the query optimiser simply can't do that (until v14 CTEs were an optimisation barrier in posrges), then you end up scanning whole tables (or at least whole indexes) multiple times, where a few seeks might be all that is really needed. In fact, you don't even need recursion for this to have a bad effect.

CTEs are a great feature for readability, but when using them be careful to test your work on data at least as large as you expect to see in production over the life of the statements you are working on, in all DBMSs you support if your project is multi-platform.


they’re just self joins on views of the query.


HAVING is less understood? There’s nothing strange about HAVING, it’s just like WHERE, but it applies after GROUP BY has grouped the rows, and can use the grouped row values. (Obviously, this is only useful if you actually have a GROUP BY clause.) If HAVING did not exist, you could just as well do the same thing using a subselect (i.e. doing SELECT * FROM (SELECT * FROM … WHERE … GROUP BY …) WHERE …; is, IIUC, equivalent to SELECT * FROM … WHERE … GROUP BY … HAVING …;)


FWIW I just tested a somewhat complex query using HAVING vs a sub-select as you indicated and Postgres generated the same query plan (and naturally, results) for both.


A DB wizard I used to work for showed me HAVING after looking at this nasty sub-sub select I did, with some app-layer-loop after.

Caesar if you see this thanks for being a great mentor.


Snowflake SQL also has the interesting feature QUALIFY.

Their docs go into more detail (https://docs.snowflake.com/en/sql-reference/constructs/quali...), but the short version is that typically SELECT is evaluated in the order FROM, WHERE, GROUP BY, HAVING, WINDOW, DISTINCT, ORDER BY, LIMIT.

But what happens if you want to filter on the result of a WINDOW? Sorry, time to write a nested query and bemoan the non-composability of SQL.

Snowflake adds QUALIFY, which is executed after WINDOW and before DISTINCT. Therefore you can write some pretty interesting tidy queries, like this one I've been using at work:

    SELECT
      *,
      row_number() OVER (
        PARTITION BY quux.record_time
        ORDER BY quux.projection_time DESC
      ) AS record_version
    FROM foo.bar.quux AS quux
    WHERE (NOT quux._bad OR quux._bad IS NULL)
    QUALIFY record_version = 1
Without QUALIFY, you'd have to nest queries (ugh):

    SELECT *
    FROM (
      SELECT
        *,
        row_number() OVER (
          PARTITION BY quux.record_time
          ORDER BY quux.projection_time DESC
        ) AS record_version
      FROM foo.bar.quux AS quux
      WHERE (NOT quux._bad OR quux._bad IS NULL)
    ) AS quux_versioned
    WHERE quux_versioned.record_version = 1
or use a CTE (depending on whether your database inlines CTEs).

I definitely pine for some kind of optimizing Blub-to-SQL compiler that would let me write my SQL like this instead:

    (query (from foo.bar.quux :as quux)
           (select *)
           (where (or (not quux._bad)
                      (null quux._bad))
           (select (row-number :over (:partition-by quux.record-time
                                      :order-by (:desc quux.projection-time))
                               :as record-version)
           (where (= 1 record-version)))


> I definitely pine for some kind of optimizing Blub-to-SQL compiler

I've been playing with malloy[1] that lets you pipeline/nest queries like you are describing here.

  source: quux as from_sql(..) {
    where: record_version = 1 
    where: _bad != null  
  }
1. https://looker-open-source.github.io/malloy/documentation/la...


But HAVING can also act on aggregate results. In fact, the example in the article is not the most important use for HAVING. Subselects can't do something like.

SELECT year, COUNT(*) sales, SUM(price) income FROM sales HAVING sales > 10 AND income < 1000;


Why wouldn't

    SELECT *
    FROM (SELECT year, COUNT(*) as nb_sales, SUM(price) as income
          FROM sales)
    WHERE nb_sales > 10 AND income < 1000;
work just like your example?


For more complicated examples, HAVING can produce easier to read/understand/maintain statements.

There may also be performance differences depending on the rest of the query, but for simple examples like this exactly the same plan will be used, so the performance will be identical.

Unfortunately, the simplest examples do not always illustrate the potential benefits of less common syntax.


I totally agree with that, but I was responding to this statement:

> Subselects can't do something like: (…)

which is wrong.


This should work, right? Just a bit more unnecessary text.

SELECT year, sales, income FROM ( SELECT year, COUNT(*) sales, SUM(price) income FROM sales ) AS innerquery WHERE sales > 10 AND income < 1000;


Having is the where-clause for Group By. It's easier to understand by thinking the SQL query as a pipeline.

Stage 1: From returns the whole world of rows.

Stage 2: Where filters down to the desired set of rows.

Stage 3: Group By aggregates the filtered rows.

Stage 4: Having filters again on the aggregated result.

Stage 5: Select picks out the columns.


What I never understood is why HAVING and WHERE are different clauses. AFAIU, there are no cases where both could be used, so why can’t one simply use WHERE after a GROUP BY?

(I know that I am probably missing some important technical points, I would like to learn about them)


It doesn't improve the power of SQL, it's just syntactic sugar. Because of how the SQL syntax works, you'd have to do something like:

   WITH A AS (
        SELECT x, sum(y) AS z
          FROM SomeTable
      GROUP BY x
   )

   SELECT * FROM A WHERE z > 10
With an HAVING clause you can instead just tuck it after the GROUP BY clause.

Also, although it's not an issue these days given how good query planners are (any decent engine will produce exactly the same query plan with a subquery or an having clause, it's indexes that fuck up stuff), but you're signaling that the filter happens "at the end".

It's like having both "while" and "for" in a programming language. Technically you don't need it, but it's for humans and not for compilers.


    select product
         , sum(price) as price 
      from table
     where price<1000
    having price>10000

You can refer to the aliased price column before or after aggregation using where or having. Depending on the sql engine.


I didn’t know one could still refer to the value before aggregation after it is aliased.

Is there an implicit group by in this query?


This is non-standard and highly dependent on the sql engine. If you believe in portability, your where clauses should (sadly) use the long form.

If that's not a requirement, this approach can add some clarity


There's plenty of cases - remove the people who have attribute, then aggregate them, then filter the aggregate.

Find me the list of non-deleted users who have more than 50 dollars worth of transactions in the transaction table.

Technically you can always subquery and use a where instead of a having but its nice to ... have.


I have pretty often cases where I use where and having. Where filters the results before the group and having afterwards.



HAVING is a hack. Because SQL has such an inflexible pipeline we needed a second WHERE clause for after GROUP BY.

It would be nice if we could have WHERE statements anywhere. I would like to put WHERE before joins for example. The optimizer can figure out that the filter should happen before the join but it is nice to have that clear when looking at the code and it often makes it easier to read the query because you have to hold less state in your head. You can quickly be sure that this query only deals with "users WHERE deleted" before worrying about what data gets joined in.


You can put WHERE before joins using nested queries, i.e. "FROM (SELECT * FROM users WHERE deleted") INNER JOIN ...".


I like using HAVING just to have conditions that reference expressions from the SELECT columns.

e.g. rather than having to do

    SELECT
        COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
        ...
    FROM ...
    WHERE COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) IS NOT NULL
just do

    SELECT
        COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
        ...
    FROM ...
    HAVING district IS NOT NULL
Hopefully the optimizer understands that these are equivalent, I haven't checked.


This should work with WHERE too at least as long as the name given using AS in the SELECT is given to a row and not to an aggregate.


I use this all the time to find dups:

Select pk columns + count(column that might have dups) From table Group by pk Having count(*) > 1


Calling each select "a sql" is really cute


It doesn't irk me anything like as much as 'a Docker' (a Docker what? Usually container) for some reason.

Although perhaps that shouldn't annoy me anyway, it ought to be better than being specific but inaccurate (which you could probably expect from someone unfamiliar enough to say 'a Docker') - mixing up image/container as people do.


Agree. Let me create a quick Jira for that.


Thanks for your HN, appreciate it.


> Agree. Let me create a quick Jira for that.

Yeah, that really irritates me for some reason.


Forget HAVING -- in my opinion RANK is the most useful SQL function that barely anyone knows about. Has gotten me out of more sticky query issues than I can count.


The entire family of Windowing Functions which are mostly ubiquitous on all modern SQL seem to be glossed over by most.


RANK is a good example of a useful less known window function. Please also consider dense_rank and row_number. All 3 have their advantages and disadvantages... (Rank leaves holes after rows with the same order, dense_rank gives consecutive numbers, row_number just numbers the rows.)


Having vs where is my first question to filter candidates who have less experience in sql than they claim.


Well, I used SQL professionally for at least 7 years in different capacities and “having” was a new construct to me. You might say that I suck at SQL or you might realize that there’s more than one way to achieve a result in SQL. I usually prefer “with” statements. I hope this is not the only criteria you use to filter candidates. In my mind a better question would be to state the problem and see if the candidate can come up with a working SQL statement.


I would ask you more questions like window functions, cte, joins, etc. or how would you address certain questions, and tbh i would be really curious that you know good sql but haven’t heard of having.

In my hiring experience it always acted as my first sql impression of candidates and 100% filter of people who don’t actually know sql besides some basic inner joins and aggregations. There are always exceptions i guess. :-)


It’s also a hint that the candidate is not curious and doesn’t thoroughly read the docs which is a minus.


Can anyone explain why the query without having needs 14 separate queries? That seemed insane to me.

It seems like the author is using one query per country. Where it seems like you’d just group by country and year, where country <> US

You would need some unions to bolt on the additional aggregations, but it’s more like 4 queues, not 14

Eg

select c.ctry_name, i.year_nbr, sum(i.item_cnt) as tot_cnt, sum(i.invoice_amt) as tot_amt from country c inner join invoice i on (i.ctry_code = c.ctry_code) where c.ctry_name <> 'USA' group by c.ctry_name, i.year_nbr


> Can anyone explain why the query without having needs 14 separate queries? That seemed insane to me.

Yeah, four queries for the four requirements seems the most straightforward, and easier to maintain than the final version.

But still a very nice illustrating of group by cube.


The generous reading, which I'm inclined to, is that it's illustrating OLAP techniques which one would actually apply to a relational database of high normalization.

It's tedious to construct an example database of, say, fifth normal form, which would show the actual utility of this kind of technique. So we're left with a highly detailed query, with some redundancies which wouldn't be redundant with more tables.


My SQL knowledge is very limited - I had heard of HAVING but not GROUP BY CUBE or COALESCE - but one thing stood out: "The rewritten sql ... ran in a few seconds compared to over half an hour for the original query." I know there were four million rows in the dataset, but is 30 minutes the kind of run-time you would expect for a query like this?


> I know there were four million rows in the dataset, but is 30 minutes the kind of run-time you would expect for a query like this?.

Clearly not since they got it down to a few seconds ;).

But tongue in cheek aside, it's incredibly dependent on what is in those 4M rows, how big they are, if they are indexed, whether you join in weird ways, whether the query planner does something unexpected.

SQL tooling is by and large pretty barbaric compared to most other tool these days, Intellij (and various spin off language specific IDE's) have the best tooling I've seen in that area but even then it's primitive.


I'd have to disagree with you on the tooling. Mature (especially commercial) SQL databases have a lot of great tooling built around them that inany areas surpass most of the language tools. For example, Extended events, which comes with SQL Server by default, allows you to trace/profile/filter hundreds of different events in real time, going from simple things like query execution tracing and going all the way to profiling locks, spinlocks, IO events and much more. There's (also built-in) another tool called Query Store that allows to track performance regressions, aggregate performance statistics etc. And then there's whole infrastructure of 3d party tools for things like execution plan analysis, capacity planning etc etc. Oracle has similar rich set of tools. Postgres is lacking some of those, but it's getting better. IDE support in JetBrains products is not that far away from, say, java experience, but from a pure coding perspective it's a bit behind.


I think that when doing these kind of report style queries cubed/rolled up columns becomes labels and loses their data type. So it makes sense to convert+coalesce the year column as well to get 'All years' instead of the hard to interpret "null year".


How would one use the query result in the example? Every rows having multiple grouping levels seems like a hard result set to use in any capacity other than a human reading it and interpreting it directly


Well, I consider myself somewhat fluent with SQL, and for some reason left joins are the ones that occasionally get me really confused - so much so that I actively try to avoid them. Trouble is not in the vanilla cases, but when you start throwing multiple tables and multiple where clauses in the same query, then there is something about left joins and nulls that is really unintuitive to my brain. Maybe I should spend some time and study the left join more...


I think left join is a bad name. Probably something like OPTIONAL JOIN or TRY JOIN would be more obvious. Of course the problem is then what do you call a right join? REVERSE OPTIONAL JOIN? But that is getting pretty confusing now. But maybe worth it because in my experience left is far more common.


It's best to pretend that RIGHT JOIN doesn't exist, imnsho.

For one thing, in SQLite, it doesn't. Which is a weak argument for not using it on supported systems. The other weak argument is that a RIGHT JOIN is just the b, a version of a LEFT JOIN a, b.

When you add them up it's an extra concept, SQL execution flow is already somewhat unintuitive, and a policy of using one of the two ways of saying "everything from a and matches from b" makes for a more consistent codebase.

I would hope a blue-sky relational query language wouldn't support two syntaxes for an operation which is non-commutative, when order is important it can and should be indicted by order.


> For one thing, in SQLite, it doesn't.

It does now, since the latest release 3.39, along with full join.


That's great news!

It'll be a long time before one can use it in portable SQLite queries, for those cases where that matters. I'll continue to eschew the right join for the clarity of only thinking about that relation in one way, but we statically link SQLite for several good reasons, including being able to use new features as they arrive.

Full join is certainly a welcome addition.


It's actually pretty obvious name. You just need to visualize a Venn diagram and left, right and inner become obvious.


You are right about the LEFT join coupled with a filter in the WHERE clause sometimes makes the join a normal INNER join. Just when this is the case is difficult for me to work out. This is often a trick question in some SQL assessments I have seen. I say this as someone with over 20 years of SQL almost on a daily basis. I avoid such scenarios by using CTEs.


WHERE filters before the aggregate. HAVING filters after the aggregate. What’s not to understand here?


øøhhhh...this post was a poor write-up on something that have nothing to do with having.

I also dislike code that does not have all lines included. The UNION clauses here are missing which i simply find irritating.


Having is where for after the collation.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: