SQL Window Functions (helenanderson.co.nz)
151 points by helenanders26 on Sept 3, 2019 | 43 comments



I interview people quite often for data engineer positions where good knowledge of SQL is also required. I'll usually have the applicant give me a presentation of some system or data pipeline they have worked on earlier (any business or client secrets can of course be obfuscated or omitted). I'll ask questions about that system or process "like I'm 12 years old", and then I'll make up, on the fly, some problem statement - I'd like such-and-such results from the data, how would you do it? Very often I include problems that are easily solvable with window functions, and more often than not those who know them get hired :)

IMHO it is really helpful to know these.


Yeah, knowledge of window functions lets you solve quite complex data manipulation problems fairly easily.

For some data-engineering work, such as sessionization, doing the job without window functions would make the task really complicated.
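
For example, a minimal sessionization sketch (Postgres-flavoured; the events table, its columns and the 30-minute inactivity cutoff are all made up for illustration): a row starts a new session when it comes more than 30 minutes after the previous event, and a running sum of those flags numbers the sessions.

    WITH flagged AS (
        SELECT user_id, event_time,
               CASE WHEN event_time
                         - LAG(event_time) OVER (PARTITION BY user_id
                                                 ORDER BY event_time)
                         > INTERVAL '30 minutes'
                    THEN 1 ELSE 0 END AS new_session
        FROM events
    )
    SELECT user_id, event_time,
           SUM(new_session) OVER (PARTITION BY user_id
                                  ORDER BY event_time) AS session_id
    FROM flagged;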

On some platforms, such as MySQL, there are alternatives such as correlated subqueries that also allow extensive data manipulation, but at quite a performance cost.

In my experience, people who know window functions are already quite well versed in SQL, so knowing them can serve as a good proxy for overall experience in analytical SQL.


I need to pick them up myself. I'm decent in SQL, but still have a ways to go and there are a few times a year where they are critical.


The existence of LAG and LEAD has solved some seriously iffy requirements twice in the past year now. One time it was for determining, from two consecutive GPS points, some details about speed, route and distance (omitting some details here); LAG made it possible to see and compare the previously recorded points, and is IMO absolutely vital for PostGIS.

The second time was for calculating changes between two rows and combining that with some business rules; the existence of LAG was absolutely crucial for keeping it all in the database.
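
As a sketch of that second case (table and column names made up; Postgres-style syntax):

    SELECT sensor_id, reading_time, value,
           value - LAG(value) OVER (PARTITION BY sensor_id
                                    ORDER BY reading_time) AS change_from_prev
    FROM readings;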

RANK / PARTITION BY is something that, when you need it, you really, really need it. Like if you join a bunch of rows to a single column but would like to keep just the first or last of those rows in the results and disregard all the others? RANK (or ROW_NUMBER) with an ORDER BY inside the window does exactly that.
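
Roughly like this sketch (hypothetical orders table; it keeps only each customer's most recent order):

    SELECT *
    FROM (
        SELECT o.*,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY order_date DESC) AS rn
        FROM orders o
    ) ranked
    WHERE rn = 1;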


I would say you need _to know about the existence of_ window functions. Knowing they exist, and generally what they are used for, means that some time in the future I can identify a problem I am trying to solve, go read up on the documentation for window functions, use them in that place, and then probably forget the specifics again.


"...there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don't know we don't know." - Donald Rumsfeld

Basically, if you know that something exists but don't know the details, you can still go learn it and apply it to solve a problem.

But if you don't know it exists, most likely you'll try to reinvent the wheel and spend way too much time doing it.


It always annoys me when that phrase is attributed to Donald Rumsfeld. The concept is much older than him, and the attribution to him gives him an air of wisdom that he doesn't deserve.


Then find someone else to credit it to? It seems futile to combat this attribution without a replacement.


Does it need to be attributed to someone? It has long been a well known concept in the intelligence community.


I guess it depends on the nature of your work. If you're doing any kind of analytics, window functions are indispensable for their ability to have a set act like a list.


Yep. When you need a window function, you _need_ a window function. The alternative isn't pretty.


In my opinion this is great advice for pretty much any programming concept - and one of the reasons I regret not getting a formal computer science education.


I have no formal CS education other than basic programming and computer security (I was an HVAC engineer), and find that I outsmart the “fast hamsters in their wheel with a CS degree” because I check with them regularly and try to find all my ‘unknown unknowns’.

Reading up on an overview of all the SQL functions is a good start.


The author explicitly states that she is using these in SQL Server, but I did a bit of digging and they are part of ANSI SQL:2003/2008, which is implemented by all the major vendors (MySQL, PostgreSQL, Oracle, DB2, SQL Server).

Wikipedia has some data on support: https://en.wikipedia.org/wiki/Select_(SQL)#Window_function_s...


MySQL only as of 8.0, which also added CTEs, more JSON features, and index features relevant to indexing JSON.


Dates are hard. The example says

> BETWEEN '2018-01-01 00:00:00:000' and '2018-12-31 00:00:00:000'

So the last day of December will be missing.


BETWEEN is inclusive, so it will match '2018-12-31 00:00:00:000' but not '2018-12-31 00:00:00:001'. So if the date column only stores dates (not times) then it will actually work correctly - whereas using '2019-01-01 00:00:00:000' would certainly give wrong results.

It would be safer, and clearer, to do the half-open comparison explicitly. But I would argue it's BETWEEN that's hard (because it's closed, rather than half-open as you'd expect) rather than dates, at least in this case.

    orderdate >= '2018-01-01 00:00:00:000' AND orderdate < '2019-01-01 00:00:00:000'


In Postgres I would use tsrange to make it very explicit whether I want to include the edges or not:

    '[2018-01-01 00:00:00, 2019-01-01 00:00:00)'::tsrange @> '2018-01-01 00:00:00'::timestamp


I didn't realise that Postgres has ranges, thanks for that. But, even though I have a background as a pure mathematician, I strongly disagree that "[a, b) includes x" is any more explicit than "x >= a AND x < b". Given that those ranges aren't standard SQL and are much less commonly used than simple inequality operators, I'd stick with inequalities for clarity.


You don't have to put a full datetime inside the comparison string. Writing `BETWEEN '2018-01-01' AND '2019-01-01'` works correctly and is pretty readable.

However, I agree that making `BETWEEN` use closed intervals is one of the many design-level mistakes in SQL.


`BETWEEN '2018-01-01' AND '2019-01-01'` does not work correctly. It will match '2019-01-01 00:00:00.000'. Assuming that this column only contains dates (i.e. with the time always set to 00:00:00.000), this means that 2019-01-01 will always be matched - a whole extra day beyond what is required.

`BETWEEN '2018-01-01' AND '2018-12-31'`, which as you say is just shorthand for what the article says, will correctly match '2018-12-31' (again, assuming the time in all values is precisely 00:00:00.000) and will not match '2019-01-01', which is the correct behaviour.


If the dates are stored as strings, then you are wrong on both counts.

'2019-01-01' < '2019-01-01 00:00:00.0000', so that time won't match the date. '2018-12-31' < '2018-12-31 00:00:00.000', so it will not match that either.


> If the dates are stored in strings

The rest of your comment is fairly academic because, surely, the dates are stored as a native date/time type. (Either `date` itself or `timestamp without time zone` - probably the latter, given that the literals in the article mention times.) Using strings would be both inefficient and a source of numerous correctness problems, including, as you just said, the fact that comparison with '2018-12-31' would give different results to comparison with '2018-12-31 00:00:00.000'. There's no reason to assume that the article had chosen this insane path.

Just to humour the possibility that someone had, foolishly, used a string for this: your most recent comment is correct that as strings '2018-12-31' < '2018-12-31 00:00:00.000', so the BETWEEN in my previous comment would not have worked (although it depends, of course, on what exact strings were stored in the column!). But the comparison you suggested in your previous comment would also be incorrect, for the same reason I already objected to: `BETWEEN '2018-01-01' AND '2019-01-01'` will match (the string) '2019-01-01'.


Not to tack on too much hyperbole, but window functions are amazing and can save you from doing a lot of self joins.
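
A typical sketch (made-up salaries table): comparing each row to its group's average, without joining the table back onto a GROUP BY of itself:

    SELECT employee, department, salary,
           salary - AVG(salary) OVER (PARTITION BY department)
               AS diff_from_dept_avg
    FROM salaries;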


Same with CTEs. In the last year I've worked with both and they really change the game for nontrivial queries.


I don't know that that's hyperbole. Many folks do feel that window functions were no less than a revolution.


I have used window functions in SQLite. You don't need them often, but there are some uses for them. However, I don't know what the EXCLUDE clause is good for; it doesn't seem so useful.


Yeah, the addition of window functions to SQLite last year was quite a good feature, enabling more data use cases with SQLite.

The EXCLUDE clause is part of the window frame definition, which controls which rows get processed. With it you can, for instance, define a range spanning the preceding and following rows and check whether your data point looks abnormal.
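
A sketch of that idea (made-up table; EXCLUDE needs SQLite 3.28+): each value is compared against the average of its six nearest neighbours, leaving itself out.

    SELECT ts, value,
           AVG(value) OVER (ORDER BY ts
                            ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
                            EXCLUDE CURRENT ROW) AS neighbour_avg
    FROM readings;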


I remember using window functions to create complex data outputs for data transformation pipelines. I didn't even know what they were, but Postgres had just started supporting them and they looked like the right solution at the time.

I don't find a whole lot of use for them now, but I believe that simply being aware they exist, and having an idea of what they do, makes them easy enough to research and apply to any project you are working on. This is from an analyst and dev perspective.


Anyone have an example of using window functions on a non-time-series data set? There's kind of a flag in my head that if I'm asked to do some analysis with time series data (depending on the question), a window function might be a good tool. Reading this you can tell I'm not particularly fluent in them. We use them at work and I was asked about them during my interview (did not know, but got the job anyway).


Say you have to grab a bunch of files a few times a day, zip them up and drop them on an endpoint somewhere. But the endpoint has a max file size for a single upload.

If you have a table of metadata about your files (zipped_size and location, for example), you can take a windowed running sum of zipped_size to collect however many files will fit under the file size limit.
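
Something like this sketch (the table, columns and 100 MB limit are all invented for illustration):

    SELECT location
    FROM (
        SELECT location,
               SUM(zipped_size) OVER (ORDER BY location
                                      ROWS UNBOUNDED PRECEDING) AS running_total
        FROM file_metadata
    ) t
    WHERE running_total <= 100 * 1024 * 1024;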


Thank you


Window functions (strictly, an ordered-set aggregate in Postgres) can be used to compute the median:

    select percentile_disc(0.5) within group (order by things.value)
    from things


Thank you


As well as the basics, it really helps to understand all the different options for window frames, which let you look only backwards or forwards a certain amount, skip the current row, etc. - with more recent support in Postgres, at least.
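
For example, a trailing seven-row moving average via the frame clause (hypothetical table):

    SELECT sale_day, revenue,
           AVG(revenue) OVER (ORDER BY sale_day
                              ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
               AS moving_avg_7
    FROM daily_sales;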

Also, if you want to get really fancy, it's always worth remembering that you can write your own aggregates in Postgres, which can of course work as window functions.


In BigQuery I've found array_agg() to be preferable in most use cases.
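
For instance, the usual BigQuery idiom for latest-row-per-group without a window function (table and column names made up):

    SELECT latest.*
    FROM (
        SELECT ARRAY_AGG(t ORDER BY updated_at DESC LIMIT 1)[OFFSET(0)] AS latest
        FROM events t
        GROUP BY user_id
    );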

Also, in Hive I used to use custom reducers for many things where now I have to make do with window functions.

BigQuery is blazingly fast though so it seems a reasonable sacrifice.


I use pretty simple window functions in almost every BigQuery query I write; I can't imagine going back to a database that doesn't support them now.


Yeah - I've never had to use a db that didn't support them.

It's almost unimaginable for me!


Why, in the second example with RANK(), do the values for 1/2 go from rank 12 to 15 and then 6, 7?

Am I missing something obvious?


It's because the query has an ORDER BY clause that sorts everything by sale_date, whereas RANK() is calculated from the subtotal order.
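
i.e. the query has roughly this shape (names guessed to match the article), where the final ORDER BY controls display order independently of the window's ORDER BY:

    SELECT sale_date, subtotal,
           RANK() OVER (ORDER BY subtotal DESC) AS sale_rank
    FROM sales
    ORDER BY sale_date;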


How is this different from LIMIT and OFFSET options?


Very.


...but how?



