Optimizer nuances aside, I agree with this and think it's helpful for anyone writing SQL. I've trained a lot of non-developers (and new-developers) to use SQL, and one of my favorite tips is, "Start with FROM."
Even sooner than "start with FROM" is "What does each row of the result mean?" Is it a sale? A month of sales? A customer? If you can answer what you want, then it's usually easy to start there. Especially if your answer corresponds to an existing table, put that in the FROM, and then add whatever joins you need. You can even think about the joins "imperatively".
Thinking imperatively about JOINs is also a helpful way of understanding a mistake I call "too many outer joins", which I wrote about here: https://illuminatedcomputing.com/posts/2015/02/too_many_oute... If you don't know about this mistake, it's easy to get the wrong answer from aggregate functions like AVG or COUNT. (EDIT: I should probably call this "too many joins" instead, since it's not really specific to outer joins. But in practice I find outer joins are more often used in these sort of situations.)
For updates, and deletes on the other hand I try to start with the where clause. Learnt the lesson after running a update query once before i typed the where to cause something like 40k worth of damage
I've forced the habit upon myself to first compose a SELECT statement and verify that it will give me exactly the dataset that I want to run my operation on. Then I wrap that statement into a CTE and run the DELETE / UPDATE on the CTE.
This is the one I use too. Tiny difference - I write
select * -- delete
from sales
where customer = 1
so it's even harder to accidentally highlight the 'delete' part.
Although I've switched to DBeaver for a year now, and it automatically pops up a warning when it detects a DELETE query without a WHERE, which is very nice.
DBeaver has something even better, albeit sometimes annoying: you can define a connection as "Production" (highlighted red) and "Test" (green). On Production systems, DBeaver will ask you for confirmation for every write statement.
This saves you only if you delete directories. Better start with `ls`. The one with git force push is spot on though. This is one of the reasons to prefer tools that let you write options at the end of the argument list instead of after the command name.
Yeah i've made a whoops like this before. I usually start a n update with a select statement to make sure my assumptions about the intent of the change is on point.
For destructive statements I almost always do a select with the WHERE statement that I will use in the delete first so I can see I am not doing anything stupid.
Thanks to both of you for these articles. I'm very fond of outer joins as in most cases (in my day to day work) they're exactly what I need. While I don't use them often with aggregate queries, I didn't realize that multiple OUTER JOINS will break aggregates. I'll need to review my query logs to make sure I'm not doing this anywhere.
Unless I have the whole query in my head by the time I've typed "$db->query(", I will indeed start with the FROM by just typing the placeholder SELECT * and then revisiting the asterisk when I'm done with (most of) the query. Somehow that makes it easier, especially with complex queries.
If you type just a table name into an empty cell it will autocomplete with a full SELECT statement. e.g. if you typed "pubuser" and selected "public.users" it would autocomplete with:
> Especially if your answer corresponds to an existing table, put that in the FROM, and then add whatever joins you need.
This is absolutely right. In fact, my ideal query language would require the programmer to be explicit if they want to increase the cardinality of the result set. Meaning, if I start a query with "from Employee ..." the query language should not let me accidentally add a join that will cause the result set to contain the same Employee more than once. (It should be possible, but only if I make it clear that it is intentional and not an accident.)
This is one of the things MS Entity Framework gets right - it enforces the "start with FROM". It actually ends with SELECT, which makes so much more sense.
Seriously, between navigational properties, group-with-children, EF to me hammers home how bad SQL is. EF is a square peg in a round hole but it does a great job at makign some trivial improvements over SQL, while still remaining true to the principles of relational databases.
Of course, the framework is a hairy mess for other reasons (lazy loading, untranslateable methods, ugly generated queries) but the alternative to SQL it presents is lovely.
What do you mean SQL is bad? SQL is very good and some dialects can do some amazing things. I don't think EF is much better, and EF (older Linq2Sql as well) has some gotchas because it converts your expression into SQL, except for when it doesn't and throws a runtime error that it couldn't. And there are some edge cases where you think a C# expression would convert to SQL but alas, it wont.
The foundation theory is the relational model of data, and the use of operators from relational algebra. So, we use relations to represent the data in the application domain, and use the operators of relational algebra to take relational inputs and produce relational outputs. It's an extensible and composable model.
When we solve problems using relations, we benefit by thinking in terms of relations.
We can ask: How well does SQL allow us to express an algorithm in relational terms? (Let's ignore that that SQL introduces some non-relational impurities that may have practical value - eg column order, row order, duplicate rows.)
When I examine an SQL expression, and then visualize the relational operators involved, I find a very poor correspondence between the SQL expression and the relational algebra. I need to basically read the entire statement to understand the parts. (Contast to the LINQ examples, or perhaps a shell pipline.) In a way, this is the substance of the original article.
Thanks. Even on just that point, I disagree. I think they are equally readable. Just like Python and C# are equally readable to the respective cohorts that are used to programming in those languages.
Yes, EF queries are more readable to many developers who know C# well but don't want to get used to SQL. But the problem with that is DB apathy + EF leads to some real performance problems as systems scale that grinds the application to a halt with SQL queries longer than dissertations, to which the developers need a quick hack to fix. The query plan for those beasts is intractable. Which of the 10 table scans or index scans to tackle first? And how do you tackle it?
They discover "indexed views" and all is well until you start to have locking problems, so more hacks are needed. etc. By writing the damn SQL to begin with they'd have to tame the beast earlier on and address the design flaws in their schema and/or application.
Ok this is a bit of a strawman (although it is based on experience) but I think saying SQL is less readable is a red herring. It is more readable for a lot of use cases. It's a DSL designed for the job, you really can't get much better (except for Elastic Search use cases, etc.). It's slightly old fashioned but that's just fashion.
This isn't about ORMs or database performance. It's just a comparison of query languages, with LINQ being more natural for data models with lots of joins and nested records that are common today (especially since that's how people work with objects inside their programming language).
SQL is far from perfect. It would be nice to see some modern SQL dialects that follow the LINQ syntax and are natively supported by RDBMS.
> "Yes, EF queries are more readable to many developers who know C# well but don't want to get used to SQL."
My main field is databases. Been working with SQL for ~15 years. My knowledge of LINQ is much less than SQL (partly also because I was writing plenty of SQL before LINQ existed).
I also think SQL is a big big mess full of leaky abstractions that forces you to do things in unituitive ways. And yet it is incredibly powerful. But don't listen to me. Read any book by C.J.Date (especially this one http://www.thethirdmanifesto.com/), for all the reasons SQL is poorly designed.
> "But the problem with that is DB apathy.....leads to some real performance problems....."
I completely agree. Databases are very powerful and more people should know how to properly use them so that the rest of us would have fewer headaches.
> " It's a DSL designed for the job, you really can't get much better"
The problem is that due to many reasons it is the only such DSL in wide spread use. Any alternatives are just footnotes.
I do prefer this structure, but it differing for this one product is more of a cost than I'd like to pay for it. SQL is wonderful because so much of it is platform independent - nearly everything can be expressed in a common SQL manner and, in practice, I tend to avoid dialect specific SQL whenever possible - this just ups the comprehension immensely.
There are a bunch of tech that's tried to reinvent the wheel with SQL like Redis and Splunk... both of those systems are just a pain to interact with - their variants have strengths, sure, but they lose the ease of declaration and universality of SQL. I'd rather see a SQL variant emerge (if it were to) as a logically compatible declarative language that various DB vendors can add support for.
The thing is that LINQ's purpose is not to reinvent SQL. While EF has an adapter which converts LINQ expressions to SQL, you can use LINQ against objects and other sources for which it is incredibly useful. In fact, it's better suited to those tasks as it doesn't map 1:1 with SQL. But as someone who uses EF and LINQ on a daily basis, it is possible to learn how the mapping is performed and get the results you want consistently even in some pretty advanced scenarios.
Side note, I fell in love with the Mongo LINQ driver. There is no impedance mismatch. While I never used LINQ to avoid learning sql, I would create my complex Mongo queries in LINQ first.
Would there be any conceptual barrier to a particular DB engine getting an updated parser that could handle "FROM... SELECT..." as well as "SELECT... FROM..."?
It's my primary tool for querying against databases, because like you, I really prefer writing LINQ queries to SQL. It won't let you do most DB admin tasks, but that's what SSMS is for.
> Of course, the framework is a hairy mess for other reasons (lazy loading, untranslateable methods, ugly generated queries) but the alternative to SQL it presents is lovely.
What is wrong with lazy loading enabled by default?
Hidden side effects. Iterating over a list of something in one layer shouldn't trigger database queries in another.
Ideally, you should know ahead of time what data you need and how to pre-load it without relying on something that could cause multiple N+1 issues based on the depth of lazy loading.
>This is one of the things MS Entity Framework gets right - it enforces the "start with FROM"
It gets it "right" (right being a very loosely defined term here) by happenstance, because you're operating off of an object. Not because it was some higher level design choice or discussion.
The author has been posting some pretty awesome SQL content on Twitter recently. I'm a data scientist who uses SQL pretty frequently, and I'm learning lots of stuff - from random tidbits like this that clarify something I think I intuitively knew to things that immediately became useful to me.
Unless I'm writing 'select *', I always do the select portion of a SQL query last. Starting with from enables helps with autocomplete on most editors too. I actually with 'select' was at the bottom. The only reason it would be useful at the top would be if it allowed you to use the new/computed column names as aliases later in the query, which doesn't work with most databases.
Which you probably shouldnt, because you are just writing the subquery each time - a temp table in MSSQL is a much better solution than CTEs 99/100 times.
It depends - generally you wouldnt ever drop a temp table because when it goes out of scope it dies.
If you wanted the CTE/temp table to re-evaluate the query for each select, yeah, you'd effectively have to drop/truncate - but my read on your code is that the subquery would turn into a temp table, be re-used across all your selects, then go out of scope, and disappear.
Correct, an inlined view - which again if you are referencing the view multiple times in the query, I would again offer a temp table as a likely performance improvement.
edit: I say this from experience fixing hundreds of CTEs that came down to "I want to simplify the downstream query so I am going to package a heck ton of relations in this cte and combine it with this other one to hide the complexity as I do some other stuff!"
I really like jvns's work but I have to say I found this diagram misleading when I first saw it! SQL queries "happen" in whatever order the database optimiser decides.
I'm really glad the accompanying article has a bunch of qualifications of the form "Database engines don’t actually literally run queries in this order" but a lot of the beauty of these diagrams is that they work on their own.
I wish the title of the picture said something more like "SQL queries are interpreted in this order" to make it clear we're talking about semantics and not what actually takes place.
She has a whole section called "queries aren’t actually run in this order (optimizations!)"
SQL and other declarative languages can be frustrating to talk about because what you tell it you want and _how_ it does the work isn't supposed to matter. So there are two questions that you might answer with either jvsn's answer (how should you imagine it works to understand how to use SQL?) and your answer (how should you imagine it works to figure out why it's slow?). Someone learning SQL is probably asking the first question, somebody debugging performance is probably asking the second one.
So this is a weird criticism because neither of you is incorrect, it's just that different audiences require different kinds of detail. If she instead wrote the answer to your question then someone else would inevitably come along with a comment nearly identical to yours except it would say "well that's just implementation details but you should instead imagine that...".
> She has a whole section called "queries aren’t actually run in this order (optimizations!)"
Yep! Article is a big improvement; first I saw the image was earlier today when it was tweeted on its own here: https://twitter.com/b0rk/status/1179449535938076673?s=20 and her images are often awesome because they capture nuance in a tiny poster
She’s pretty clear that she means the order things happen semantically. Just like in any programming languages we talk of things happening in some order, but under the ”as if” rule the compiler and processor can reorder things however they want.
Reason: I started dealing with Oracle when it was at v8 and I think up to including v9 when I specified a hint it was usually followed. Then later with v10 and v11 it progressively became more difficult and nowadays with v12 I'm having a really hard time (I usually have to use some "tricks" that don't change the logic of the SQL but do change its technical execution, like placing somewhere a useless but tactical "distinct", to confuse it enough so that my hints are more or less followed/used).
Btw., if you want to state something like "using hints is wrong" then in general I agree (with some exceptions for special cases) but currently I'm taking care of a very old app that uses often quite big SQLs (involving up to ~15 tables) and as the maintenance budget is limited and the app will anyway be decommissioned in 1-2 years I cannot start rewriting half of the application to break down and improve the statements => when once per quarter the data distribution/quantity/etc... change and the optimizer thinks that it has a new brilliant idea about how to exec some SQL and then of course the SQL hangs then I usually just try to spend 1-2 hours trying to find some hint(s) that will bring back the old execution plan, but since we upgraded to Oracle12 I often see no change in the exec plan unless I do what I mentioned above.
> Btw., if you want to state something like "using hints is wrong" then in general I agree (with some exceptions for special cases) but currently I'm taking care of a very old app that uses often quite big SQLs (involving up to ~15 tables) and as the maintenance budget is limited
This is a very common truth, I've been in similar situations where budgets are tight and things are working so don't touch a thing. I think it's actually just a specific example of a very common problem pattern in tech - the way I usually push back on it is "The stuff that is is and I'm not going to waste time fixing it - but any stuff I'm adding a feature to or fixing a bug in, it's not worth the company's time to fix it the wrong way because it'll end up costing more when another bug or feature requires an adjustment near this in a year." All that said, especially within MySQL (and I haven't worked in Oracle so maybe there too) the query planner is a bit dumb, so sometime you really need to help it along.
> especially within MySQL (and I haven't worked in Oracle so maybe there too) the query planner is a bit dumb, so sometime you really need to help it along.
Yeah, I've seen it as well last year when working with MariaDB (but I'd guess that MySQL might be a bit more clever as maybe Oracle might have improved it a bit).
This is somewhat intentional I think to guide you toward buying Enterprise edition, which since 11G includes plan baselines ( see https://docs.oracle.com/cd/B28359_01/server.111/b28274/optpl... ) - which allows you to lock a SQL queryID to a specific plan.
At the company we're using EE, and I've heard about the plan locking functionality, but I never dared to use it.
Does it survive DB-restarts? Additionally we have a setup of an active/primary cluster that is replicated to a passive/secondary one in our secondary datacenter (which then becomes leading in case of a disaster in the primary datacenter) => I don't think that a locked plan is replicated to the secondary cluster (which, in a case of a disaster would become a 2nd disaster as many SQL all of a sudden would stop working).
It really is "whatever the optimizer feels like". Pretty sure that's the first law of SQL performance optimization - "The optimizer is always your best friend and worst enemy".
I've worked at a place where running time-sensitive batch file submission was the company's main line of business. It is always fun when the optimizer heuristic suddenly makes a 3-minute query take over 3 hours! And there's basically no way to defend against this while still using SQL, short of moving everything to incremental generation.
Not having query hints isn't the "SQL at fault" - it's very much the optimizer, or you wouldn't need to override it with said hint. I've seen it where one index hit an arbitrary number of rows and then the query flipped to joining on another, non-indexed column, increasing the time to run massively - that sort of 3m to 3 hours magnitude. Had to throw on a hint but that's only because the optimizer decided it knew better.
And that's why I listed a bunch of other options as well - simplifying code is probably the best one, but there will always be gaps in their heuristics.
In my experience, you have one of three scenarios:
You know ahead of time exactly the problem you are going to solve and so probably shouldnt choose SQL from a performance perspective.
Your problem size changes and you get to pick a simple system that you understand well and has predictable performance but might be slower than other options.
Or a complex system that's less well understood, and generally has great performance and productivity up to a tipping point, where the abstraction has leaked too much.
In your case - your query sounds exactly like those heuristics worked up to a point, and then finally failed. Is that the optimizers "fault"? It's mostly a trade off between the time spent making the plan and the time spent getting your data back - and again, sometimes those tradeoffs are wrong too.
A slightly more charitable view of this situation is that an optimization which was previously triggering to make the query fast suddenly no longer applied.
I think that might be too optimistic - it went from a table with thousands of records to hundreds of millions of records. If I were to speak technically the oracle optimizer "did a derp".
So... stale statistics? I.e. the optimizer making the query plan assuming that the table has thousands of records (because that's what it had earlier) and now it has hundreds of millions of records? A sudden order of magnitude change in table sizes isn't normal operation, for most DB platforms that requires running some "everything's different now, please recalculate statistics" command to maintain proper functioning.
Depending on why, static / manual statistics on tables (also normally a no-no) may also help.
(This is assuming whatever applicable regular maintenance (stats, index or table reorgs if needed) is done prior to the time-sensitive batch... and understanding sometimes there's just not enough hours in a day to run batch AND maintenance :| )
In general, I love SQL. It's one of my favourite languages (or rather,
language families) to play and tinker with. But the fact that SELECT
comes before FROM has always ground my gears. Thanks to this article, I
now finally have a better explanation than “Dunno, feels weird” :-)
A long time ago, I might have forgiven such rigidness. But, yeesh, that was 30 years ago. In 2019 if I've already tokenized it for you with FROMs and GROUP BYs, you're a computer, go figure it out. Many years of writing SQL prevents me from but the most rare ordering mistake, but it kinda grinds my gears just a little that it's still an issue.
I see your point, but FROM-first SQL would provide better support for
autocomplete. Right now in e.g. psql when you write "SELECT " and press
<Tab>, it can't help you with the columns. But if you could write "FROM
users SELECT " and then press <Tab>, then the DB client could tell that
you probably want a column of the "users" table.
You can use the WITH clause if you prefer that style.
WITH
employee AS (SELECT * FROM EMPLOYEES),
department AS (SELECT * FROM DEPARTMENTS)
SELECT employee.name, department.name
FROM employee
INNER JOIN department ON employee.department_id = department.department_id
How does that help at all, though? When it comes time to write the final select, you still haven't established the meanings of the things you are selecting from until you write the from and join clauses.
Your IDE should be able to complete the columns at that point, so when it comes to typing SELECT employee.name, the IDE knows that name is a column of the EMPLOYEES table.
This works with IntelliJ and Datagrip, I don't know about other IDEs.
It shouldn't, though, because there is no table bound to that name to draw from at that point. You haven't added a from-list that is actually relevant to the current scope yet.
I totally believe that it gives the suggestions as you describe, what I'm saying is that in order for it do that, the completion must be making faulty assumptions about the way that scoping works in SQL.
SELECT as the first keyword also makes sense as an opening indicator of the SQL statement's type. "Start a new task that SELECTs data" (then "what [data] do I want" then "given these rules") instead of INSERT, UPDATE, etc.
From a source code structure perspective "SELECT foo ...;" sort of works like a function header: "function foo() { ... };"
+1. One of the few programming languages that allows you to "work backwards" from your desired result to the data / computation needed for that result.
I'm a big SQL fan too. Like the author, I kind of intuitively got the order after years and years of writing sql but if the syntax had evolved into FROM then SELECT I could get how it would make more sense. As it stands most of my queries start out:
1. "SELECT *" because habits
2. Start writing the FROM/JOINs and table aliases
3. Go back and refine the SELECT to tableAlias.ColumnName
4. (OPTIONAL) Aggregates
5. Forget the WHERE clause and burn the optimizer's cached query plans to the ground! (j/k, add WHERE or just throw a TOP 500 into the SELECT)
I wish it looked like the underlying relational algebra expression that it represents, but unfortunately, that's too much for your average joe with no compsci background.
I pretty much always start typing SQL queries with “SELECT *” because I know I’ll come back and figure that part out later.
It’s similar to how I write Python list comprehensions. I almost always type “x for x in” from muscle memory, then get the collection part at the end correct, then go back and fix the beginning. My brain just won’t think in the other direction.
Oh yeah I agree - the forward reference to attributes before the source is stated does feel backwards when writing the expression.
Compare scala's for-comprehension that feels more natural: `for (x <- xs) yield x + 1;`
Perhaps there's an argument that the shape of the result is defined up front, so it might be easier to comprehend the result if not the implementation.
If you type just a table name into an empty cell it will autocomplete with a full SELECT statement. e.g. if you typed "pubuser" and selected "public.users" it would autocomplete with:
SELECT pu.*
FROM public.users AS pu
WHERE 1=1
It will even take a stab at aliasing the table for you.
I've actually found myself more and more disappointed that FROM isn't the first clause in the query, it'd be really nice to express those queries as WITH, FROM, GROUP BY, HAVING, WHERE, SELECT, ORDER BY, LIMIT as that's usually the way I consider queries - granting HAVING is sort of a free card here, I don't use it enough to place it accurately, but being near WHERE or GROUP BY makes a lot of sense.
A professor I took a databases course from in college was convinced XML was the next big thing, so he made us also learn the XQuery language [1], which does things in roughly the order you described.
I don't remember the specifics, just that the acronym was something like "flower" and you started with your from clause and ended with what you would return.
Or rather, why have the language limit you on the order of the operations? Why can't you for example do a LIMIT and then a WHERE? This can currently only be accomplished by using subqueries. I've been thinking of developing a different query language that transpiles to SQL that allows you to more-or-less reorder these operations arbitrarily. You can easily do that in any regular programming language so why not SQL?
I learned this and more earlier from https://blog.jooq.org/2016/03/17/10-easy-steps-to-a-complete...
my goto guide and recommendation for people with practical SQL experience who want to improve their understanding. (I first read it via Hadley Wickham's tidy data writeup.)
If you type just a table name into an empty cell it will autocomplete with a full SELECT statement. e.g. if you typed "pubuser" and selected "public.users" it would autocomplete with:
The posts, tutorials and zines of Julia Evans are amazing. She has a knack for explaining things. This one is also quite great, isn't it? It doesn't talk about relational algebras, sets, cartesian products etc but explains things in a very easy to understand fashion while pointing out common problems and pitfalls.
I wish she did a monad one, you bet it wouldn't be "it's just a monoid in the category of endofunctors, what's the problem".
It's pretty straightforward, but it does require knowing haskell style functions (which I struggled with for a while). The chatty tone is irritating but perhaps it does make it less intimidating. I read it before and think it's pretty good.
Maybe showing my age, but I'm incapable of reading the phrase "If you want something visual" without hearing it in Tim Curry's fruity Sloane Square tones.
The reason why is historical. SQL (structured query language) originally came from SEQUEL (structured english query language) and was designed to emulate a normal English sentence as a query.
'SELECT columns FROM table' makes sense when you're asking it literally. However SQL and relational DB's have evolved since then and the English grammar is now awkward. It would be nice to see newer dialects accept the inverse order to help with tooling and structure.
I'm a native English speaker and both "from the top shelf select a book" and "select a book from the top shelf" are perfectly valid syntax, even if the second one is more common.
This is why I like Linq in C#. The syntax order corresponds to the logical order. This also makes queries more composable.
And a very practical advantage is that the editor can support syntax completion. SQL syntax cannot support that since you write the projection before the from-clause.
Yeah, the lack of composeability in raw SQL is pure agony for me. The very fact that "UDFs area performance disaster" is considered acceptable for a SQL in 2019 is revoltingly laughable.
it's a constant source of pain for us DBAs too. i've just always heard that code reuse isnt an imperative for sql server, not to the same degree as it is on the application side.
I've found that CTEs help ameliorate this quite a bit, and it's especially exciting that pg will soon be removing the mandatory performance implications of CTEs in pg12 (fencing via CTEs will remain optional via a keyword switch, but in pg<12 CTE fencing is impossible to disable).
CTEs seems like a very nice middle ground for code reuse - enshrining them in views is nice for somethings but keeping them raw and composable is to my preference since application developers aren't losing any visibility of logic occurring within the composed in blocks... this is definitely not a simple problem to solve though and is a constant balancing act so there is some clear room for improvement.
If code reuse isn't imperative, then at least there ought to be an official recommended preprocessor that could do macros. Something comparable to the C-preprocessor.
I want a null-handling-equals macro, for example (god I hate three-value-logic so much).
Have you checked out the LINQ expression support in F#? They cover all (I'm pretty sure) LINQ operators within an expression style (AKA not the method-call syntax).
This looks much better than the way it is done in C#.
I absolutely hate the SQL-ish LINQ expressions in C# code. It's really awkward when you make a query and need to materialize it. You either store the Queryable<> in a temporary variable and then convert it to a list or an array or whatever, or you have to glom parentheses around a big LINQ expression.
I prefer the approach of LINQ expressions over LINQ method chains, but in C# very few LINQ operators are exposed to the LINQ expression syntax. So if you start with expressions you often end up with a terrible hybrid of both.
Maybe I’m just showing my age here, but has use of SQL really fallen so far out of favour that very few developers know it well? I’m honestly surprised.
I don't think its an age think at all. I think that Tableau and other BI tools killed the bulk of rudimentary SQL work for thousands of developers. And good riddance. Business people need to be able to have as much access to the numbers as is feasible, and should have to wait in line to get them.
I think that the fast majority of movement away from relational databases is pure folly. With JSON columns and horizontal scaling, relational DBs can handle virtually every workload, and give devs wonderful things like joins, aggregates, window functions, views, foreign data wrappers, user defined functions and all of this other stuff the have to build themselves on top of their "easy" NoSQL DB. There are many, many good use cases for document DBs, where they win on latency and scale. I've just seen a lot of NoSQL by default designs, or NeverSQL attitude, and those end up being very expensive to actually develop and own.
I've wrote CRUD apps for 5 years very long time ago. Then I was spending weeks debugging business logic in stored procedures and views, fine-tuning big query execution plans, you name it.
At my current work nobody talks about SQL, literally every discussion starts with "We will create this new Rails model". Indexes are not created deliberately - we are adding them only when the need is obvious, because some functionality is plain slow. No one talks about SQL, plain SQL queries in code are considered really bad practice - in cases where the query generated by Rails is complex and slow and a plain SQL query is PR-ed, the code reviews discussions are pretty long.
I am not saying this is good or bad, in fact all looks OK, but I was really really surprised at the beginning.
I once worked with a guy who would generate SQL queries via a series of multiple regex transformations. Even now that's still up there as one of the most terrifying things I've seen a developer do.
Why terrifying? Sounds like he saw that the structure of much of what he was doing was boilerplate and created a bespoke code generator to save himself time. As long as the resulting queries are readable and effective I don't see a problem. Unless I'm missing something, understanding his code generator is not (necessarily) a pre-requisite for understanding the resulting code.
Because you're leaving an unmaintainable mess. Nobody else can rationalise the code aside himself. It's harder to debug because even he cannot account for every edge case his regex might throw up -- which also makes it less secure. Also regex is comparatively slow so transforming an SQL query using multiple regex search and replace patterns is not the way to write hot code hit by literally millions of visitors each hour.
> understanding his code generator is not (necessarily) a pre-requisite for understanding the resulting code.
Oh it absolutely is if you want maintainable and secure code. I get some situations call for complexity but when you're generating SQL on a public facing web application with a centralised database backend, you want to be damn sure you can rationalise the SQL being generated. The best ways of doing that is to either not to try to be too clever with your generation (KISS) or to put your confidence into an established and well tested existing ORM or equivalent.
> Indexes are not created deliberately - we are adding them only when the need is obvious, because some functionality is plain slow.
Is there anything wrong with this? My thinking is that it could be seen as a premature optimization if an index is created without knowing if it will be used.
In my world, if you don't know whether an index is likely to be used, it indicates a lack of understanding of what your users are likely to be doing, and/or the nature of the data set. Yes, there are marginal use-cases, but overall, you should understand what most of your users/consumers are doing and optimise accordingly.
In my experience the execution order isn’t obvious to most sql users. You’ll find a lot of people using sql who don't understand why something from SELECT can be used in ORDER BY but not WHERE or HAVING for example. It takes a slightly more advanced SQL understanding to be able to explain these kinds of errors in thinking. Thats why a good ORM can often be the best choice for users.
In Google Bigquery you can use them in GROUP BY and ORDER BY but in Hive you can only use them in ORDER BY.
As far as I know the standard only defines that you cannot use column aliases in the WHERE clause. I'm sure someone else can chime in with what the standard says about column aliases in ORDER BY, GROUP BY and HAVING.
Actually I've got bad news. The where clause should be run before the select but for reasons of, I guess cheating at the benchmarks, oracle doesn't do that and MSSQL followed that.
It is particulary nasty as it mostly works, so this (CTE to not clutter your schema)
with data as
(
select *
from
(
values(1), (2), (0), (4)
) as abc(d)
)
select 1 / d
from data
-- where d <> 0
gives divide by zero. If you uncomment the where it then seems to work correctly.
Well it doesn't. You cannot rely on this for MSSQL2005 or later. When the expressin grows more complex it blows up - I am saying this from losing days of work by finding out the hard way.
I'll repeat myself, the expressions in the select and that in the where can be evaluated in parallel. Thank you Microsoft.
Microsoft's solution to this is to expect the programmer to deal with it by skipping the evaluation of the expression
with data as
(
select *
from
(
values(1), (2), (0), (4)
) as abc(d)
)
select case when d <> 0 then 1 / d else null end
from data
where d <> 0;
The case prevents any possible division by zero.
I've had this behaviour confirmed by Microsoft, and the supposed fix was their solution.
Not coincidentally I'm looking at learning postgres. For this and other reasons I don't want to work with MSSQL any more.
I'm a bit of surprised by how shallow people's understanding of database is. Especially for someone have literally wrote thousands of sql queries. The query engine implementation is/should be covered by some extent in any of the database courses, thus I assume anyone with a CS degree shouldn't be surprised that a queries doesn't execute from SELECT(how's that even possible). Perhaps nowadays the major users of SQL don't really have a CS background.
We live in a large N-Dimensional world of knowledge. Everyone at some point is guaranteed to have missed something "they should know", both according to themselves and according to others.
Many times you encounter such knowledge when you are not ready for it, so you forget having encountered it. A good way to convince yourself of this is to start opening up some of your old college text books and look at the chapter introductions and conclusions. You will find some insights that you simply missed regardless of how you performed in the classes.
In my development as a software engineer, there was a noticeable "level up" after 3-4 years where I had accumulated enough knowledge across the various abstraction levels to the point where new concepts integrated cleanly into my mental model of software systems.
During the first few years of exposure to computer science, I definitely lacked the necessary context to understand certain concepts when I encountered them for the first time. For example, I had a theory of computation course that was all about regular expressions, regular languages, context free grammars, etc. At the same time as I took this course, I was also building my first iPhone app, and I often thought "none of this seems applicable". And then two years later I wrote a compiler to implement a subset of Java and I massively reassessed my opinion on the value of understanding the theory of computation and programming languages.
In fairness, you don't need to be a DBA to have written thousands of SQL queries and if you're in your 30s or older it's quite possible you might have forgotten some of the theory since you don't need that level understanding for your day to day.
It's getting on for 20 years since I did CS and I'm certain there's stuff I've forgotten about (I've definitely forgotten half the stuff about network topologies but most of that stuff was about coax networks rather than ethernet so it's not stuff I've much since)
It's fair to forget OSI models since most of programmers don't need to deal with layers below IP, but the fact that there are 3 more hidden layers shouldn't surprise you. As for someone who writes SQL queries for work, I expect some level of understanding of how your underlying system work. Not in great detail, but at least a general picture. I usually consider the desire of understanding an blackbox as an indicator of whether the person will become a great programmer.
I agree with the "blackbox" metaphor 100% but you have to bare in mind that we all have blind-spots. Sometimes they're due to the path we took into our profession, eg different education/training routes; and sometimes due to the way we've specialised, eg the example I'd given about how specialised knowledge that isn't recalled frequently can be forgotten over time. Sometimes it's just because there's so much information to take in that our leaky brains missed that specific detail in lessons. However the fact that the author literally did research that "blackbox" -- hence the article being written -- should be an indicator that she is actually a pretty decent engineer.
I find criticising people for having blind spots that we might consider obvious isn't a great way to share knowledge. Not only does it make individuals less willing to come forward with questions but it also means they're less likely to correct your errors (we are all fallible) for fear of being chastised again.
There are a lot of semi "power users" with little CS background that write simple SQL queries for their work.
I think the main difference is in the level of curiosity/caring the person exhibits when they get a result you don't expect. When they do something that they think should work and it doesn't, do they carefully figure out the system behind it and why?
The users that do will write better queries faster, and the others will will just keep getting frustrated and foist the work onto others.
I got slightly shocked while I do appreciate his curiosity of the db engine and made it to an informative article.
To me, data analytics jobs will be more exposed to non-CS people so SQL will be much more written by non experts. The autonomous database might be the right product towards this future.
I know a lot of healthcare analysts who don't have a proper CS background but MUST use SQL to get data out of govt databases. There's a LOT of inefficient querying going on, and I was definitely guilty of that as a beginner.
But the syntax is in the wrong order for autocomplete to help you.
If from was first, autocomplete could help you with the names of columns, like RStudio does with dplyr:
data_frame_name %>% select(column_name)
Because you start by piping the data reference into your select function, RStudio can inspect the data and autocomplete the column names in the select statement, completely opposite of what is possible inSQL
With a free order, you'd be able to start SQL queries with `FROM table_name SELECT ...` and have columns autocompleted.
Sorry, I didn't mean that in defense of SELECT being first. I agree with the article that SELECT must come toward the end. And the first thing I tell people I teach SQL to is that they have to read SELECT last.
Query languages that came after it that weren't concerned with compatibility realized this. XQuery has "for" first, which is to a first approximation FROM, and "return" last, which is like a more powerful SELECT. SQL++ and its brethren also had it this way very early on.
However there's so many applications and users that are accustomed to it the original way, that it is not usually possible to fight against the inertia.
Agreed... it feels a bit more natural. I actually like using the extension methods for LINQ in C# in a similar order (not the LINQ syntax itself though).
edit:
CTEs can make it closer to the syntax in question though.
It's surprising to me that someone can run into window functions earlier than subqueries. I guess my background with MySQL which only recently supported window functions affected my perspective there.
Her question that starts the post is about doing it in a single query (no subqueries.) I kinda had this misunderstanding on Twitter too when responding to an earlier diagram of hers https://twitter.com/firasd/status/1172874054002307073
I've had use cases for subqueries long before I knew about window functions. This was just the example from the article where subqueries provide a solution.
I'm always surprised how apparently not-deterministic SQL Server appears to be when it comes to performance.
We use for our apps Kendo grids at work which allows you to create any report by dynamically choosing the filters and columns you want. On the server side a query is composed from the parameters the user chooses, including only the columns that are necessary. The data source for the query is a inline table valued function (TVF).
At one point we had trouble with performance after going live, and it could not be explained by changes to the TVF. Even when we reverted the released TVF changes the issues kept happening.
A day or so later I was with a database expert one day long optimising the query (may I plug the excellent free Sentry One tool here?). Eventually we had optimised the query. However, when we put the query back into the TVF performance was down again. The slightest things that changed affected the performance.
This is something that affects every RDBMS that exposes SQL. The SQL language describes what and not how, how specifically to retrieve your data is left to the implementation.
The most common causes for this in MySQL (and I expect SQL Server too) is that often your query can be resolved on multiple indexes, and picking which one is a matter of a heuristic. On MySQL changing table statistics can change which index is used, which can drastically affect query performance.
So a common poweruser optimization strategy (which should be used very carefully) is FORCE INDEX, also known as "I know better which index to use". More often than not I've found uses of FORCE INDEX that were more harmful than not.
But an ever more insidious version of this is queries that retrieve cached data being faster. This is something you can generally only observe in production because what's in cache is a matter of other traffic on the server.
So you can easily get cases where something looks much worse under EXPLAIN, but is faster. E.g. it can be faster to execute a "worse" query with no indexes than one with, if the one with no indexes happens to need data that's already in memory v.s. sitting on disk.
Of course they don't start with SELECT, which is also why SQL is a terrible language.
When you type the fields you want to select, there is zero context as to what those fields apply to, meaning that there is zero contextual help that you could get from know the table and group by ahead of typing select.
Relational database queries are supposed to be _declarative_. As a user, you're not supposed to think about the mechanics of execution because the database system is supposed to be able to decide how to execute your query using whatever magic it wants, as long as it satisfies the contract that it gives you the right answer.
It's absolutely useful as a debugging tool to build up some semantic understanding of what the query means, and I encourage every database user to learn to use EXPLAIN, but relying on a mental execution model is borderline dangerous.
The author makes it crystal clear that she is not advocating this as a mental model for the execution of the query:
Database engines don’t actually literally run queries in this order because they implement a bunch of optimizations to make queries run faster – we’ll get to that a little later in the post.
So:
- you can use this diagram when you just want to understand which queries are valid and how to reason about what results of a given query will be
- you shouldn’t use this diagram to reason about query performance or anything involving indexes, that’s a much more complicated thing with a lot more variables
She claims this is a useful tool to understand the denotational semantics of the query and which kinds of queries are allowed vs. not allowed—and she's absolutely right.
I'm confused, this doesn't seem right. This article, and several of the comments here, suggest that you cannot reference things calculated during SELECT in the HAVING field. But you can.
As a quick demo, pick your favorite table and run:
SELECT 2+2 as num FROM <table> HAVING num > 3;
this will return as many 4s as there are rows in the table
Long story short, relying on HAVING to run after SELECT is not a standard SQL behavior it is dependent on the implementation and what the optimizer thinks is best (more discussion below: https://news.ycombinator.com/item?id=21153870).
The diagram in the OP is a pretty good general/semantic understanding of SQL execution order for writing queries.
In my undergrad class, I was thought that the order is "FWGHOS" (pronounced, "fwog hose"), i.e., FROM, WHERE, GROUP BY, HAVING, ORDER BY, SELECT. (LIMIT was omitted from this.)
This acronym has "ORDER BY" happen before "SELECT", but this article have them the other way around. Does this difference ever matter?
"SQL queries don't start [optimizing] with SELECT"
The title is missing a word. Interestingly, part of the industry is moving towards databases that optimize themselves (ex Snowflake & Oracle Autonomous Database), plus projects that do it for you (Apache Calcite).
With a bit of an exception for the era where we were all hard wiring indexes in MySQL using hints... that's what databases should be doing. DB interactions are supposed to be done using a purely declarative language - postgres has stuck by this the strongest and has been able to make the biggest optimizations in (IMO) a large part due to how little of the internals they've exposed. With pg12 they're making the optimization fences in WITH clauses an accepted give on this point and the discussion on whether to do this or not went back and forth quite a bit.
I came to this realization by accident shortly after using various ORMs and database mapping tools.
Being able to break down and compose a query in your own defined order while leveraging a library to transform it into a valid query is quite nice for that way of thinking.
For bonus confusion points, DISTINCT happens after SELECT and before ORDER BY (and DISTINCT affects the usage rules for ORDER BY).
Personally keep a printed copy of the actual logical order of operations, it comes in handy.
SQL is the most WTF notation/language I've ever seen... How the hell did someone start from the beauty of relational algebra and end up with the this misordered confusing and verbose abomination that we've all standardized upon?!
...even Mongo's JSON-based query language has more elegance to it. SQL seems designed on purpose to be annoying to write, hard to read, and never look like anything, not the math logic, not any query operations that could ever be executed. I may be superficially similar to English, but English is horrible, why would you model a notation after that?!
Everytime I see someone yelling that Haskell or Lisp syntax is weird and unintuitive I think they need to have their face bashed into a printout of a complex SQL query!
I suspect you're asking more figuratively, but I saw this PDF from Stonebraker on the developer of databases and query languages that literally answers your question[0].
I've always done data heavy work, but as a recently hired data engineer, I'm really diving deeply into this layer that previously I had taken for granted. In doing so, I find SQL highly expressive and beautiful for data work; much moreso than Python, my other primary language.
Which languages do you feel manipulate data beautifully?
...that would match rel algebra notation nicely, and extend to complex cases more like:
( employees e *{e.id = a.employee_id} addresses a )
| filter{a.city.name = "London"}
| [e.name, a.city.name]
...you can replace operators with words of course. Or you could represent this as a JSON structure, even better imo. But the point would be to have a notation highlighting the idea of a "product", of restricting and specializing that product or specifying how it's made, and of indexing into its fields. The way I see it in my mind is like taking "a (maybe generalized) cartesian product" of different things, then filtering this, then zooming on each dot/point/object and filtering its fields etc. And obviously focusing on piping/chaining stuff.
Maybe the notation I suggest is not a good one, but I'd care more about making the concepts obvious and readable and composable.
hah. famous last words. SQL is beautiful. I allows non-developers to actually work with data in ways more flexible and extensible that most notations/languages out there.
Who? I've done so many "can you write a query that gives me all the foo bars where so an so" for managers, cs people, or even data analysts, that it makes my skin crawl. Sometime they know how to load that data into Excel and do their magic there, but with SQL, anything that's more complicated than a simple select from a single table, forget about it.
I have designed/given internal SQL classes to business analysts, product managers and similar.
From simple selects using some common functions and basic types, to filtering, to aggregations and grouping and finally joins. That gets them about 80% there for the kind of information they want ad-hoc.
Only with joins did the people really struggle for some time.
In my opinion the distinguishing factor wasn't the language, but the peoples willingness to learn and ability to apply it in practice on a somewhat regular basis.
I have long wondered why we can write "ORDER BY 1" (meaning to order by the values in the first column), but it's illegal to say "GROUP BY 1". This article shows why that is.
`group by 1` is actually allowed in several SQL implementations, and there's no reason it can't be -- the same as with column aliases the db engine can just rewrite
would defeat something MySQL was getting wrong in a query that already wasn't ordered by anything. A coworker put it as "waving a dead chicken that doesn't even exist".
To be honest I don't think the article highlights very well where this applies. I am a SQL Developer with SQL Server right now and in the past using Teradata I vaguely remember an aliased column could be referenced in the where clause if you didn't use the AS keyword.
One blog post getting 500+ points is about as much “proof” that developers don’t know sql as a comment that cites that as evidence is proof that HN users don’t understand statistics....
Every database. It's because of logical order of execution, as explained in article. Eg. you cannot use `having` before `where` even if syntax of SQL is allowing you to do so.
I wish SQL would die as a query language. This is one of the reasons why, writing a query in the syntactic order is counter to how you should reason about getting and shaping your data.
Well, processing doesn't but the beauty of the language is that it starts with what the user wants as the output and the details of how it is pulled follows later.
Title is misleading as far as I can tell. The author was just surprised to realize that the syntax for SQL queries did not mirror the execution order of the various elements of the query.
It is a bit of a "well, duh" thing, but I've had my fair share of those with SQL through the years so I won't hold it against the author.
One of my more recent random discoveries was that I could join a subquery, ie
select x.a, y.b, y.c
from tableX x
join (
select id, sum(b) as b, count(b) as c
from tableY group by id
) as y on y.id = x.id
Again nothing special, just hadn't had the need for it before so didn't even think about the possibility.
That’s one of the insights that Everyone Should Have about RDBs and SQL. Relational databases are about, well, relations, in particular the sort of discrete relations that can be represented as a set of tuples. It doesn’t matter where those sets of tuples come from. From the disk? Great. Computed on the fly? That’s fine too. And by manipulating relations you get… new relations! What RDBs and SQL are actually about is composing relations in the way that, say, Haskell is about composing functions. And indeed the theory behind RDBs is called relational algebra.
I’m pretty sure the author was aware that the execution order was not the same.
What the author was not aware of was what the actual execution order is. This article is about them actually digging in to find out what that order was.
You start explaining what you need and work backwards, on the other end, they work in reverse, because that's how it's stored.
Conceptually, it's like piping in bash, "find | grep | cut". Start with finding the files(tables), grep for the row data (where), cut the fields you want returned (select).
I didn't drop a schema, but I did, um, accidentally get over aggressive with a merge-purge duplicate process. By which I mean that I merged every single record, all of them, into a single record.
I was very fortunate, we had backups and restoring went smoothly, and my boss very calmly asked me one thing: "did you learn from this?" I assured him that I had.
Perhaps a “fake table” (e.g. a view) that is the result of two tables being joined but presents itself as a normal table with the expected columns? That would be a very declarative way of querying joined tables, but yeah, you’d still have to create the views or have some means of declaring the appropriate relationships between tables. ORMs sort of let you do this, like in Django where you declare that a User has many Permissions and the ORM knows how to generate that JOIN query for you.
I'm just saying they are not declarative since you have to
explicitly describe how each table is joined to the next, the operation goes beyond the query itself.
I think we have different definitions of "declarative", since I consider SQL's join syntax to be pretty darn declarative. You're declaring the relationship between two tables. You're not telling the database how to spool through each table and match up values; you're just telling it which values need to match up by which constraints.
Even sooner than "start with FROM" is "What does each row of the result mean?" Is it a sale? A month of sales? A customer? If you can answer what you want, then it's usually easy to start there. Especially if your answer corresponds to an existing table, put that in the FROM, and then add whatever joins you need. You can even think about the joins "imperatively".
Thinking imperatively about JOINs is also a helpful way of understanding a mistake I call "too many outer joins", which I wrote about here: https://illuminatedcomputing.com/posts/2015/02/too_many_oute... If you don't know about this mistake, it's easy to get the wrong answer from aggregate functions like AVG or COUNT. (EDIT: I should probably call this "too many joins" instead, since it's not really specific to outer joins. But in practice I find outer joins are more often used in these sort of situations.)