It's critical to have a good understanding of SQL, but once you do, ORMs with a functional syntax solve this problem for many of us. I know it's not possible for some people to use ORMs as they can't risk the ORM making a performance mistake, but I think the solution is to improve ORMs to the point that writing raw SQL is akin to writing assembly instead of using a higher level language.
That said, we're not there yet. Using an ORM in 2017 without understanding what SQL you want it to produce IS NOT OKAY. Please learn SQL if you're going to use an ORM, and understand the SQL you are producing with the ORM.
Summed up, if you can, use an ORM to solve this problem.
I disagree about SQL being like assembly. It's a fairly high-level language, because underneath, the engine is figuring out how to use indexes and actually run your query. If you want to draw parallels, ORMs are more like writing code in a template language that's implemented in the language you're actually working in.
I've spent far too much time fighting with ORMs trying to get the SQL I want generated. Additionally, for complex queries I often develop in a SQL manager interface, because it's so much more direct. When I'm done, I have a working SQL query - I can paste it into my code and parameterize it, so why should I spend even more time fiddling with an ORM?
I like "micro ORMs" that just map rows to objects. Generating INSERT and UPDATE statements is ok, and single table selects with a single where clause are sometimes ok. Beyond that, I'd rather just skip the middleman.
That said, for small databases (a GB or two) that are reasonably well designed (e.g. 3NF), the speed of current I/O and optimization engines in databases is such that even the worst SQL will generally run fine - so if you find it faster to use ORM and understand the future pitfalls, go for it.
Actually, more often than not ORMs are going to provide better performance than having people inexperienced with SQL write it themselves. Especially when it comes to batching, pagination, joins, and the intelligent use of in-memory caching rather than doing everything in the database.
And the biggest advantage of an ORM is that you can trivially switch between databases which is often required for running automated tests. SQLite or H2 for development and then Oracle, SQL Server or Teradata for production is a very common pattern I've seen at many companies.
Running your tests against a different DB than what you're using in production is an anti-pattern IMO. You won't be able to trust the results, and you'll be limited to the lowest common denominator of the databases you're using.
I've found that views and parameterized views/table-valued functions can be a nice compromise: they keep the SQL relatively simple, and thus ORM-friendly, without sacrificing performance.
This can be dangerous if overused. It can be very tempting to quickly pull in a 'left join blah' to grab some fields you may need. But that blah view could be hiding some massive performance-sapping beast.
When it comes to optimizing a slow-running query, having to step through layer after layer of nested views can be incredibly challenging.
I can't speak for other engines but at least SQL Server and Oracle will eliminate unnecessary joins for inlined table-valued functions, e.g.:
create function thing (@a int, @b int) returns table as return (
    select t1.col_a, t1.col_b, t2.extra_col
    from table_1 t1
    left join table_2 t2 on t2.id = t1.t2_id);  -- t2.id assumed unique (PK)

select col_a, col_b from thing(1, 2)

...and table_2 is not accessed.
Predicates from the topmost query will be pushed down into the function's query as well, so the function's performance is generally equivalent to the ad hoc version.
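For example, with the hypothetical function above:

    select col_a from thing(1, 2) where col_a = 5;
    -- planned as though "where col_a = 5" were written inside the function's select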
On the whole I find this strategy to be very effective at reducing the complexity of ad hoc queries without performance penalties. In the case of indexed views it can greatly improve performance.
Worse, someone joins against the view (which itself joins several tables) just to get one extra field that a simple single join would have provided -- they didn't bother to check where it came from, because that view gets used 95% of the time and no one questions it. I've seen it happen more than once.
Unless you're fetching all the unneeded fields as well, in any decent DBMS both approaches would result in the exact same execution plan with the exact same performance, the unneeded parts/joins of the view wouldn't be executed.
I don't quite understand it. Your DBMS isn't some generic data store with simple universal semantics.
I'm not just talking about SQL features like complex subselects or CTEs, but index hints, lock order, transactional visibility, non-trivial column constraints, index locks...
These matter in appreciably sized public-facing (e.g. web) systems. Granted, with enough transactions and roundtrips to application logic, an ORM can mimic these.
Now, I do know YAGNI; you might never get to a scale where an ORM poses a limitation.
I just suppose I don't see large enough benefits of learning and using an ORM in the early stages, to outweigh the cost of switching to SQL in the (admittedly hypothetical) later stages.
I think your post makes a lot of assumptions. I do not believe scale to be a reason not to use an ORM. If you pay attention to what you're doing you can scale a system that uses an ORM just fine.
The ORM is usually not the problem; it's usually that people don't know what they're doing and introduce serious performance issues.
My DB issues for the last ten years have frequently been a mix of bad query plans and terrible ORM layers. Doctrine in Symfony rarely yields good queries in MySQL for us.
We've been moving away from the ORM where possible because we don't need to support multiple databases. Further, if I switch my DB to Postgres I'm changing out everything else anyway to remove PHP from the codebase.
I didn't write the original code, but my team has to maintain it. In the end, the ORM actually makes that harder than straight SQL would.
As I understand it, the point of an ORM is to make relational data look and feel more like the objects in my program; I shouldn't have to think about the consequences of my data living on the other side of a network connection. Even if that isn't the official point, it's at least a widely-held misunderstanding. If you're going to write code that uses the ORM this way, you're going to have a bad time, but if you write your code intelligently, you end up using your ORM as a glorified query builder.
I think an ORM makes more sense in a functional language since the interfaces are more likely to be declarative than imperative. For example, an in-memory List and a SQLQuery can both implement a filter() method that takes a function and applies it in the way most efficient to that data structure; in traditional OO languages, we usually loop over our in-memory structures explicitly, and doing that with a database is prohibitive.
While I prefer vanilla SQL to the current generation of ORMs, I think ORMs are trying to address a real problem--namely that SQL (despite being high level) isn't very abstraction-friendly. Abstractions in SQL can be expensive and often unpredictably so, despite how clever databases are today. I think part of the problem is the mismatch between the largely imperative OO paradigm and the largely declarative relational paradigm, but I suspect that there's some semantic impedance between SQL and a truly relational model which contributes to the problem.
I advocate ORM use for guaranteed type safety. So long as SQL queries are processed as strings inside your application, there's always the risk of column names and types not matching your entity classes; by using an ORM that generates SQL based on your classes (e.g. Entity Framework with LINQ), mismatches are promoted to compile-time errors. This alone makes it worthwhile.
You either generate a model based on your database, or generate your database from your model.
That way the database and model are always in sync and mismatches show up as errors in the code. In Entity Framework this is called either "Code first" or "Database first".
The reason for me to use ORM rather than direct SQL queries is so I can keep everything written in a single language, so I can develop with one IDE. Having everything under one roof is a huge boon to productivity, rather than having to edit the logic in the IDE, then edit the database queries in a different editor, and the two knowing nothing about each other. It lets me directly map my source language's type system to the database's schema, avoiding discrepancies.
Don't most IDEs do SQL in the mix with everything else, or am I just spoiled by Visual Studio?
My opinion is that one uses an ORM for different situations than they would use SQL. Many applications can use an ORM and nothing else. Some can use an ORM for basic CRUD tasks and then raw SQL to do analysis or reporting.
you (and I) are absolutely spoiled by Visual Studio w/SSDT. there is nothing even remotely close to that level of integration for database development in other stacks.
the level of integration with SSDT is way beyond that - the database structure and stored procs live side-by-side with your code, are source-controlled together, and are built and deployed together. there are compile-time checks for stored procedures, code analysis, etc. DataGrip is just an alternative to SQL Developer/Management Studio; SSDT is much more.
For me, that's Scala JVM on the server, ScalaTags/ScalaCSS/ScalaJS in the web browser, Slick for the ORM, and SBT for the build system. The bliss of learn once, program anywhere.
You can have functional SQL query languages that map directly to SQL without object-oriented row mapping. Everyone can use these tools without performance issues.
Korma seems to do exactly that, N+1 lazy-loaded queries included. SQLAlchemy, Esqueleto, Slick, Quill, and the like all generate precisely what you tell them to (i.e. queries semantically the same as what you'd write by hand).
This is not to disparage Korma, it looks pretty cool, I'd certainly give it a look if I worked with Clojure.
Is that really a disclaimer, or just the source for your claim?
I'm sure you know this, but there are actually several "query builder" tools out there for Python, including SQLAlchemy Core. I'd love to see a side-by-side comparison.
The overlords of HN have deemed my post too old for an "edit" button, so, yes, my use of "disclaimer" is a case of auditory familiarity overruling accurate terminology.
Agreed. The ORM vs raw SQL debate is endless, but I prefer using some kind of ORM-y thing when I need to interact with a relational database from a program.
There are still cases where you might build a collection of raw queries (think business analysis using Hive/Impala/Spark SQL). I think it's important to approach it the same way you would a normal program: how can I make my intention clear, how can I verify it works as I expect, how can I reuse without reducing clarity.
Query builders are a sort of middle ground between raw SQL and ORMs. They give you database agnosticism and help to prevent syntax errors, etc, especially if your language is statically typed. But they also allow you control over the query performance.
For anybody on Rails, you can call .to_sql on any ActiveRecord::Relation and it will return the exact SQL string that your code is producing. Really handy for troubleshooting, especially for associations.
I love Django -- but what it does is mostly magic to me. That being said... I'm fairly effective in Django, although not at advanced and efficient querying.
I want to get strong in SQL -- where/how do I start?
First thing to do would be to start printing the queries generated by the ORM to see what it is doing (it produces some fairly verbose SQL, but it's usually easy enough to understand).
Then when you are asked to get some one off numbers out of the database try doing it in SQL. In one query.
It's usually possible, but it takes a different way of thinking - in sets. Build up queries gradually. Start with the main table. Join in the next table and see the results. Add some conditions in your where clause or join conditions. See what the result is and how it changes.
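To sketch what that build-up looks like (table and column names here are made up):

    -- start with the main table
    select * from orders;
    -- join in the next table and eyeball the results
    select * from orders o join customers c on c.id = o.customer_id;
    -- then add conditions and aggregate
    select c.country, count(*)
    from orders o
    join customers c on c.id = o.customer_id
    where o.created_at > current_date - 30
    group by c.country;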
As for thinking in sets, that is easily said but difficult to translate into words.
I guess describing the set of data that you want back from the database is a good start. My previous team leader would always describe the results in terms of "if x then y" (imperative thinking). When you find yourself doing that, instead try to describe the data without the "if" statements: describe it as "the set where conditions x and y are met". That will get you halfway there. Once you start thinking like that, you will start to see the beauty in the relational model.
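A contrived example of the shift, with a hypothetical schema: instead of "if the customer is active and has an order this month, include them", describe the set directly:

    select c.*
    from customers c
    where c.active
      and exists (select 1
                  from orders o
                  where o.customer_id = c.id
                    and o.created_at >= date_trunc('month', current_date));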
I was a good few years into my career before I started thinking this way - now I try to do as much work in the database as possible. It avoids whole categories of bugs, usually keeps your code shorter, and is one of the easiest ways to improve performance.
[self plug] Take a look at https://pgexercises.com/ . It's a learn-by-doing set of SQL exercises. Focused on Postgres, but the large majority is standard SQL.
Write raw SQL queries. My feeling on the Django ORM is that it's useful for defining models and managing migrations, but I prefer to write raw SQL for all queries.
Meh. My feeling is that Django and other web frameworks are too magical -- I always worry they might be doing something inefficient. That's why I write all my web applications in hand-tuned assembly. It even has reusable abstractions for when I need them, and nice hand-formatted assembly is super readable!
In both the "never use an ORM" and "always use assembly" cases, the argument is based entirely on having full control of what's going on and not trusting any intermediate layer to get it right or eke out every last femtosecond worth of performance.
The ORM takes a whole lot of pain out of updates as well.
Filtering also works fairly nicely for simple stuff, as putting conditions in a dictionary and passing it to filter(**kwargs) is a lot nicer than messing about with SQL strings.
I agree that the ORM is limited if you want moderately complex queries though.
If you're doing this you're missing out on the biggest advantage of the ORM, which is that it allows you to compose queries.
Using Q objects, query expressions, and custom QuerySet classes, you can filter objects by pretty much every imaginable criterion out there. This would be horrendously difficult and error-prone writing plain SQL.
The Django ORM, despite having limitations if you want to run analytics (and that's really a use case where plain SQL excels), generally writes out exactly the same SQL I'd write by hand.
> This would be horrendously difficult and error-prone writing plain SQL.
I do it all the time :) I love looking at long SQL queries that are formatted nicely and use good naming conventions.
It's also part of a more general philosophy I have of reducing dependencies whenever possible. With Django it makes sense to use the ORM wherever it suits you, since it's already built in, but there are times when you need to know raw SQL anyway (e.g. try populating a large database without COPY, only using the Django ORM).
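For reference, the COPY route is a one-liner (hypothetical table and file path), versus millions of individual INSERTs issued through the ORM:

    copy big_table (col_a, col_b) from '/tmp/data.csv' with (format csv, header);
    -- (or \copy in psql, for a client-side file)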
> I want to get strong in SQL -- where/how do I start?
1. I suggest the PostgreSQL documentation itself (https://www.postgresql.org/docs/current/static/). It's well-written, but still wordier than it needs to be, and if you try to read it from start to finish, you will probably die. However, the beginning chapters are overviews of SQL. Once you start to feel bogged down, you are probably in very specific, technical chapters. You can just skim these, jump to ones that interest you at the moment, etc.
2. There is also a separate wiki (https://wiki.postgresql.org/wiki/Main_Page). I haven't gone there much, but it is a nice complement. It has practical summaries of things that the main documentation makes unwieldy (like setting up replication). It also has some examples of how to do really advanced things in SQL.
3. A well-reviewed, short book. The first book I read was SQL Demystified. It's a short and easy intro to SQL in general. Find something like that. I have tried hard to find good books on SQL, but most are huge tomes that will crush your soul.
4. Practice. For the past decade at my job I have been forced to learn complex SQL for various business reports, across a variety of tables. And I still feel like an intermediate.
If you can get an employer to pay for it, the intro course that Oracle does is fairly good; it certainly got me up to speed enough to work on an Oracle project.
I think the solution is to upgrade SQL the language. There's a very low complexity threshold after which you have to trade readability/composability for decent performance (in other words, there are a lot of cases where a human can notice a permissible optimization, but a database can't because it might break correctness for some odd edge case). We're solving that problem with programming languages (notably Rust); I'm not convinced that there's something inherently different about relational databases.
Project M36 is a database that implements a relational model and corresponding query language (TutorialD) based on (among other things) the ideas from Date and Darwen's Third Manifesto. TutorialD solves many of the problems present in SQL, especially around composability.
I've heard of Tutorial D before, but I haven't taken the time to learn it, as I'm of the impression it isn't widely deployed. :( I'd love to hear any first-hand experiences though!
Even if you work with an ORM instead of writing raw SQL you should know when single row processing is OK and when bulk load / data set processing is a must.
My last employer lost 5M+ euros (not revenue but profit) every month because of excessive use of PL/SQL single-row processing.
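The difference, sketched with a made-up table in Oracle-style PL/SQL:

    -- row-by-row processing:
    begin
      for r in (select id from big_table) loop
        update big_table set status = 'processed' where id = r.id;
      end loop;
    end;
    -- versus one set-based statement:
    update big_table set status = 'processed';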
Some ORM patterns are great, others suck. For instance, the ActiveRecord lifecycle callbacks can be very difficult to maintain and reason about, particularly after_commit hooks and transactions that create associations.
I agree, but all too often composable queries are for people that don't know who their users are or what they want. If you build workflow-centric UIs, then the need for composable queries generally goes away.
I wish CTEs ("common table expression", i.e. a "with clause") had the same performance as a subquery in PSQL. I always assumed they'd be implemented as a kind of macro that expanded to a subquery. Is there a good reason for this distinction?
For that matter, having some kind of SQL-oriented macro/preprocessor language would be fantastic. I guess GPP (General Preprocessor, https://logological.org/gpp) is always an option.
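To make the distinction concrete (hypothetical tables), these two queries are logically equivalent, but Postgres currently plans the CTE as an optimization fence:

    -- CTE form: the filter on user_id is not pushed into the scan
    with recent as (
        select * from events where created_at > now() - interval '1 day')
    select * from recent where user_id = 42;

    -- subquery form: the planner can push user_id = 42 down
    select *
    from (select * from events where created_at > now() - interval '1 day') e
    where e.user_id = 42;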
Even in databases without the same performance issues as PSQL, like SQL Server, "with clauses" can cause performance issues on their own.
I recently refactored a long SQL query with a half-dozen with-expressions to a single query and increased the performance by an order of magnitude. They had used "with" to build up a query from a bunch of independent sets and then union them together and, while perfectly logical, it was a performance nightmare. In the end, I just took all the conditions that made up each query and combined it into one with the appropriate joins. I'd even argue that ultimately the finished product was easier to understand.
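Roughly, the shape of the change (made-up schema):

    -- before: independent sets built with WITH, then unioned
    with overdue as (select * from invoices where due_date < current_date),
         large as (select * from invoices where amount > 10000)
    select * from overdue
    union
    select * from large;

    -- after: one pass over the table with combined conditions
    select * from invoices
    where due_date < current_date or amount > 10000;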
The trade-off is modularity. During development, working with CTEs paves the way for establishing views later, possibly highlighting recurring requirements or gaps in the schema, and the logical separation of general-purpose sets (i.e. functions) from the specifics of a particular query is valuable.
Like in everything, a balance is ideal. For reporting/analysis/testing purposes, I think nesting views and CTEs is more productive in the long run, while production application code is probably best kept as straightforward as possible.
The reasoning was that it was easier to implement that way. CTEs can contain state-changing behavior, and for that reason they are treated as optimization fences in order to be as safe as possible.
I believe Tom Lane has since said that there isn't any reason it should stay that way and the community has never received any guarantee that this was going to stay the same. It just hasn't been implemented. In other words, they're seeking contributions. I'd do it myself if I were even a remotely capable C programmer, but I'm not.
This is how CTEs work in BigQuery (using standard SQL), for example. There are advantages and disadvantages both ways, but the benefit of making CTEs behave like subqueries is that the query engine can push down filters into table scans.
Last I looked, this was coming in either Postgres 10 or possibly the release after. I'm on mobile so I don't have a link, but I believe a new keyword is being added that will change the behaviour of "with".
Edit: Can't find anything. I could have sworn they said there were patches being worked on, however.
Regarding postgres `with`: I thought the same as you, that it'd be implemented as a macro, and only just found out that it's not the case thanks to your comment. I'd love to hear more about this decision.
Love the literate SQL style. A lot of times we have to use temp tables instead of with statements, though. In Vertica, all the withs of a query have to fit in memory together. Using temp tables instead to create the logical sequence, you only need enough memory for each temp table and the final query on their own. The cognitive effect is still the same - define A, define B based on A, define C based on B, etc. - but sequentially rather than recursively. When the queries are smaller, withs are more direct, a little less verbose, and usually more performant. However, with the datasets I'm working with, the queries are usually too big, or I just don't know how big they will be as I start out writing. So I typically just use temp tables from the outset, rather than writing with withs only to have to rewrite everything once I get "too big to fit in memory" errors.
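In sketch form (made-up names; Vertica temp-table syntax from memory, so check the docs):

    create local temp table a on commit preserve rows as
        select user_id, sum(amount) as total from payments group by user_id;
    create local temp table b on commit preserve rows as
        select a.user_id, a.total, u.country from a join users u on u.id = a.user_id;
    select country, avg(total) from b group by country;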
We do the same thing with Redshift. Using temp tables also gives us more direct control over the query plan, since we can specify distribution and sort keys. On top of that, we can run unit tests on the temporary table (explained in another comment) and fail the query if tests fail.
I see that this is about a way of writing SQL, but as Literate Programming is cited in the title, I wanted to point out that you can already use the true Literate Programming tool noweb to write SQL with or without "with" statements.
The advantage is that you'll be able to write full documentation amongst the SQL, present it in any order, and reuse chunks.
The disadvantage is that it outputs to stdout, so if that's no good for your task then it's no help.
Well, stdout can be redirected in any sane environment, so that's not much of a disadvantage. However, thanks for making the point that what they're describing (good variable-naming practices, and the use of 'with' to prefix them) is hardly what Knuth was talking about when he coined "Literate Programming".
It took me a long time to learn SQL well enough to appreciate its beauty.
Simple queries, you know, "SELECT foo, bar FROM baz WHERE lastChange > CURRENT_DATE" are easy.
But if you are facing the database of your ERP software (as I often am) whose vendor is very reluctant to tell you about its internal structure and how that interfaces with the ERP system (my gut feeling, though, is that we're lucky - SAP and Oracle are probably much less friendly to people poking around in their databases to create custom reports, hehe), using SQL and its interactive nature to explore the database is a lot of fun as long as the database design is relatively sane. Thank God our ERP vendor's programmers were not creative enough to do insane things.
(Well, they did one crazy thing - there are NO foreign keys to be found anywhere in that database; instead it is all faked with triggers. I think that's how people used SQLite before it supported foreign keys. But we're talking about Microsoft freaking SQL Server here; being derived from another enterprise-y RDBMS, I find it hard to believe that it would at some point have lacked foreign keys. Since the triggers DO check for referential integrity, why on earth did they not just use foreign keys? What were they thinking?)
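For anyone who hasn't seen the pattern, it looks roughly like this (hypothetical tables, not their actual schema):

    create trigger trg_orders_fk on orders after insert, update as
    if exists (select 1 from inserted i
               where not exists (select 1 from customers c where c.id = i.customer_id))
    begin
        raiserror('invalid customer_id', 16, 1);
        rollback transaction;
    end;

...when a plain "alter table orders add foreign key (customer_id) references customers(id);" would have done the job.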
It gets a little mind-bending at times, but in a good way.
But explaining SQL queries of the non-trivial kind to somebody is intimidating. One of our accountants at one point expressed interest in learning SQL, because she would bug me with questions that I answered by running a few carefully worded queries. For some reason I find SQL relatively easy to understand but really, really hard to explain.
So was the WITH clause until someone decided it should be valid and added it.
Every improvement to every language ever started out with someone saying, "Hey, here's an idea..." And none of those ideas were ever "valid X" at the time they were proposed.
I came to this page looking for ways to parameterize and re-use raw SQL. WITH (and CREATE VIEW, for multiple queries) fulfills the case where the unit of reuse is a filtered view over one or more tables.
To make progress on this problem, I wrote a light SQL preprocessor to support INCLUDE statements. It enables me to write code like this:
WITH frequently_bought_together AS (
INCLUDE frequently_bought_together.sql
)
SELECT ...
This allows way better isolation and reuse of business logic than before. In Redshift, I combine this with an assert user-defined function to enable writing unit tests in raw SQL.
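The assert function itself can be a tiny scalar UDF; a minimal sketch in Redshift, assuming a Python UDF (which may differ from my actual implementation details):

    create or replace function f_assert(cond boolean, msg varchar)
    returns boolean stable as $$
        if not cond:
            raise AssertionError(msg)  # raising here fails the whole query
        return True
    $$ language plpythonu;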
With all that together, I can trust analysts to update complex data assets and I can ask them to take any data issue investigation they've done and turn it into a re-usable test. Tests end up looking like:
CREATE TEMPORARY TABLE frequently_bought_together AS
INCLUDE frequently_bought_together.sql
;
SELECT f_assert(COUNT(*) > 0, 'Table is empty') FROM frequently_bought_together;
SELECT f_assert(COUNT(DISTINCT item_bought || item_recommended) = COUNT(*), 'Table is fanned out') FROM frequently_bought_together;
...
It has made a huge difference in how we write SQL.
It took me a second, but I can see why this would be handy. Whether you just use it in a with, or create a view from it, you have a single definition of the query that anyone can use. And you can test the INCLUDEd query. Neat.
It warms my heart to see people in the wild doing smart things like this (after getting laughed at by moron managers who wouldn't authorize any good ideas like this for years).
With SQL Server you can create TVFs (table-valued functions). So long as they are 'inline' (lacking BEGIN and END) they will be inlined into any query that uses them. If your SQL fu is up to scratch they are extremely powerful - e.g. CROSS APPLYing them is one of the best ways I've seen to do reusable row-level filtering.
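A minimal sketch of that reusable-filter pattern (invented names):

    create function dbo.recent_orders (@customer_id int) returns table as return (
        select top (5) o.order_id, o.order_date
        from orders o
        where o.customer_id = @customer_id
        order by o.order_date desc);

    select c.name, r.order_id, r.order_date
    from customers c
    cross apply dbo.recent_orders(c.id) r;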
It's great to see highly legible SQL in a tutorial. These days it seems as though everyone on the internet is secretly running a SQL obfuscation competition.