In Oracle land you have no choice but to use a long list of ORs because IN has a low limit (1,000 items per list). So if you need, e.g., 1.1K rows, even via a fast indexed PK, your choices are OR or pain. The workarounds are horrible, like splitting your IN list and UNION ALLing two copies of the same query (which runs twice), or using a global temp table (which is awkward as heck to set up and tear down on each execution, particularly with IN lists that could range from 1 to 1K+ values).
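Roughly what the split ends up looking like, as a sketch with an invented table and IDs (in practice each list would carry up to 1,000 values):

-- Oracle caps a single IN list at 1,000 expressions, so the key set gets
-- chopped into chunks that are OR'd together (hypothetical table and ids):
SELECT *
  FROM orders
 WHERE order_id IN (1, 2, 3)       -- first chunk, up to 1,000 ids
    OR order_id IN (1001, 1002);   -- next chunk, and so on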
Postgres has no practical limit; there are examples of people using 100K+ values in an IN clause (not that I recommend it). People keep telling me that Oracle is "enterprise grade" and that Postgres is not. To be honest, all I see is that Postgres is flexible and modern, while Oracle is kludgy and outmoded. I'm forced to use Oracle, but I'd never voluntarily start a new project on it; I'm all in on Postgres from here on out.
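For comparison, the usual Postgres idiom is to bind the whole key set as one array rather than spell out a literal list; a sketch (table name invented):

-- Postgres: IN (...) and = ANY(ARRAY[...]) are handled the same way by the
-- planner, so the key set can be passed as a single array value.
SELECT *
  FROM orders
 WHERE order_id = ANY (ARRAY[1, 2, 3, 1001, 1002]);
-- With a driver you'd typically write  WHERE order_id = ANY ($1)
-- and send the array as one bind parameter.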
After being on Oracle for 15 years, we recently finished moving to Postgres. It has been amazing. Postgres does so many things better than Oracle and this IN issue is just one of them. I wish we had moved off of Oracle years ago. I don't think there is any reason to choose Oracle over Postgres these days. I will certainly never use it again.
Of course the issue for many companies is that they hit a size and usage of Oracle where moving off becomes a herculean effort. There's no direct replacement for the way people use RAC, for example, which means changing the way you approach certain HA requirements. Add to that the fact that Oracle PL/SQL is not fully portable to Postgres PL/pgSQL. They're close enough that you think it's a simple task, but it's still significant.
Since many people have many things built on top of a single DB, it also generally means a single massive switch that will take a long time and is difficult to test piecemeal.
I'm only saying this so that people go into a switch with their eyes wide open (though, I'm sure you would anyway).
Oh, for sure. It took us 2 years to make the switch. But despite the time and effort it was totally worth it. I would hope that anyone starting a new project would not even consider choosing Oracle. In my opinion and experience there is no point anymore, especially considering the high cost of an Oracle license.
Actually, I am curious about that question: is Oracle being picked by any new businesses? On HN we probably get a skewed view, but my guess is that even allowing for that, Oracle is not being chosen for anything new at this point.
No, enterprise grade means they have a sales team that can take people out to lunch and find roundabout ways to give kickbacks to decision makers. Being kludgy and outmoded is a common side effect, but not a requirement.
I think enterprise grade means, among other things, having the means to scale support to meet demands. A large enterprise runs databases in 100s or even 1000s of different applications, developed by maybe 100 different teams. It helps to know that the vendor is able to meet support needs, not just for the use by one team where you know the circumstances, but generally, even when you don't know them.
Used Oracle for a long time but have never had the need to do this. Can I ask what your use case is? I can’t imagine one which isn’t solved by a subquery or a built-in Oracle temp table which automatically restricts the contents of the temp table to one transaction, with no need to setup/teardown per execution (as long as you’re using sensible transactions).
We have a problem at work where one of our tables is 5 billion rows and we get a CSV (or maybe a few CSVs ;) ) with a total of 21 million primary keys. We now need to find which of these 21 million keys are already in our table and do some task on those that are not.
Originally, we batched 1k records at a time and used an IN. It was taking hours, even though each query finished in 1.5 seconds. Then we moved to a global temp table. It was around 30% faster iirc. Still not good enough (open to suggestions!!!) but way better. Wish Oracle would just do this for us internally.
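For reference, once the keys are staged the whole check collapses to one anti-join; a sketch with invented names (big_table is the 5-billion-row table, staged_keys a global temporary table loaded from the CSVs):

-- Find the staged keys that are NOT already present in the big table.
SELECT s.id
  FROM staged_keys s
 WHERE NOT EXISTS (SELECT 1
                     FROM big_table b
                    WHERE b.id = s.id);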
How about this:
21 million primary keys are easily kept in memory. Sort them.
Then do "SELECT id FROM five_billion_rows ORDER BY id" and iterate over that with the CSV ids in lockstep.
Cursor management is pretty good in Oracle, so this is likely much better than doing your own paging with 1k IN values at a time. I would assume it would be I/O-bound on the id column's index.
This also might be an interesting question for "Ask HN".
Just as a heads up for anyone considering using Oracle's built-in support for temporary tables, there are a few restrictions [1], no support for distributed transactions being the main one I ran into at a previous client.
Taking into consideration the fact that you'll want to avoid hard parses [2][3] whenever possible, this leaves you the option of mimicking a temporary table manually (which will probably include having to manually fix statistics on your fake temporary table), or maybe using some string manipulation in your query to split a delimited list (which is itself limited to 4000 characters [4]).
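For anyone going the string-splitting route, the usual Oracle idiom looks roughly like this (a sketch only; :csv_ids is a bind variable holding a comma-separated list of numeric ids, and the 4000-character limit above still applies):

-- Turn a comma-separated bind variable into one row per value,
-- so it can be joined to or used in an IN subquery (illustrative only).
SELECT TO_NUMBER(REGEXP_SUBSTR(:csv_ids, '[^,]+', 1, LEVEL)) AS id
  FROM dual
CONNECT BY REGEXP_SUBSTR(:csv_ids, '[^,]+', 1, LEVEL) IS NOT NULL;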
That being said, I've been told that distributed transactions should themselves be avoided, so maybe look into that before getting mad at Oracle for what seems to be an arbitrary limitation on temporary tables.
I think saying "I can’t imagine [a usage] which isn’t solved by [workarounds]" kind of misses the core issue. It is a common use case (selecting a list of rows by PK), and you're having to work around internal structures of the database, rather than it exposing a simple primitive and doing the work behind the scenes to make it efficient.
If building a temp table for that many IN values is more efficient, absolutely do it. But don't expose this ugliness to the outside world, particularly as the problem space and solution are well defined, common, and old.
The fact that Oracle continues to ignore basic quality of life improvements and its defenders point to convoluted workarounds that push additional work onto its users is half the problem.
Both Oracle and Postgres have a way of creating an inline table from a list of constants. I wonder if doing that in the query and joining to it would be more efficient.
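In Postgres that would be a VALUES list joined as an inline table; a sketch with invented names (Oracle traditionally spells the same thing as a UNION ALL of SELECTs FROM dual):

-- Join against an inline table of constants instead of a long IN/OR list.
SELECT t.*
  FROM orders t
  JOIN (VALUES (42), (4711), (9001)) AS keys(id)
    ON t.order_id = keys.id;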
It's surprising to me that "where id = 42 or id = 4711" behaves differently than "where id in (42, 4711)".
I would have guessed the database optimizes this away. It seems like an easy to implement and safe optimization. (Look at abstract syntax tree, find "or" expressions, find equality test expressions under there, check if two or more compare the same sub-expression against constants, then combine.)
Did I misunderstand what the article is saying PostgreSQL does, or does it actually not do this optimization for you?
> I would have guessed the database optimizes this away
And one thing about this type of "beneath the covers" optimization is that, over time, the tool ends up doing it for you, and applying these sorts of manual rewrites can end up hurting more than helping - so by all means learn it, but keep an eye on it as versions progress and make sure you're not fighting the optimizer.
Kind of an issue for optimizers in any situation. You're relying on implicit behavior which may change.
Unless you decide to document it and guarantee the optimization will be applied if certain conditions are met. Though that has its own downsides because now the system is locked into providing that guarantee. (Real-world example of a guarantee: in C, the "&&" and "||" operators are guaranteed to be lazy. Writing "0 && (1/0)" should be safe; you won't divide by zero because the spec guarantees it's optimized away.)
Or I guess you could try other approaches like having the database detect that it could be optimized, but then tell you to rewrite your own code instead of rewriting it for you.
I think the simpler takeaway is that OR consistently represents a branch in the query planning, where the planner now always has to do two things and merge back instead of one. It makes sense for the OR operations to always preserve this inefficiency instead of leaving the user guessing when it will or won’t.
I have long since resigned myself to the fact that "declarative language" in practice only means that you don't have to manually write the for-loops anymore. The concept of declarative languages seems like it has so much potential, but I haven't seen any that have even come close to it.
In MySQL, discovering the UNION hack a couple years ago was like a revelation. It’s something that I’ve since been preaching to everyone who’ll listen.
It’s extremely common to want to OR a small set of items. It feels like the query planner would be smart enough to use an index (presuming one exists matching the searched fields), but it doesn’t. UNION makes it use the index.
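As a generic illustration (not from the article; table and index names invented), the rewrite looks like this - each UNION branch can use its own index, and UNION, unlike UNION ALL, also removes rows that match both predicates:

-- OR across two differently indexed columns; the optimizer may fall back
-- to scanning the whole table (hypothetical table with indexes on
-- customer_id and status).
SELECT * FROM orders WHERE customer_id = 42 OR status = 'pending';

-- Rewritten so each branch can use its own index:
SELECT * FROM orders WHERE customer_id = 42
UNION
SELECT * FROM orders WHERE status = 'pending';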
IDK about the profiler... but I have a hard time believing that the union of 2 joins will be faster than filtering first and then joining the results :thinking:
Yeah, seeing "sort" in the union-based plan also threw me for a moment about why that would be any faster. Normally sorts kill performance, especially compared to the merge join in the "ugly" version.
I think the key is in the description afterwards, where each subquery only returns 1 row. Which just reinforces: measure everything, this advice may not be generally applicable.
explain (analyze, timing) SELECT id FROM a
WHERE a_val LIKE 'something%'
OR a_val LIKE 'other%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on a  (cost=8.87..12.88 rows=20 width=4) (actual time=0.012..0.012 rows=0 loops=1)
   Recheck Cond: ((a_val ~~ 'something%'::text) OR (a_val ~~ 'other%'::text))
   Filter: ((a_val ~~ 'something%'::text) OR (a_val ~~ 'other%'::text))
   ->  BitmapOr  (cost=8.87..8.87 rows=1 width=0) (actual time=0.011..0.011 rows=0 loops=1)
         ->  Bitmap Index Scan on a_val_idx  (cost=0.00..4.43 rows=1 width=0) (actual time=0.009..0.009 rows=0 loops=1)
               Index Cond: ((a_val ~>=~ 'something'::text) AND (a_val ~<~ 'somethinh'::text))
         ->  Bitmap Index Scan on a_val_idx  (cost=0.00..4.43 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
               Index Cond: ((a_val ~>=~ 'other'::text) AND (a_val ~<~ 'othes'::text))
 Planning Time: 0.088 ms
 Execution Time: 0.033 ms
(10 rows)
using union:
explain (analyze, timing)
select id from a
where a_val like 'something%'
union
select id from a
where a_val like 'other%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=17.23..17.43 rows=20 width=4) (actual time=0.135..0.135 rows=0 loops=1)
   Group Key: a.id
   ->  Append  (cost=0.42..17.18 rows=20 width=4) (actual time=0.133..0.133 rows=0 loops=1)
         ->  Index Scan using a_val_idx on a  (cost=0.42..8.44 rows=10 width=4) (actual time=0.128..0.128 rows=0 loops=1)
               Index Cond: ((a_val ~>=~ 'something'::text) AND (a_val ~<~ 'somethinh'::text))
               Filter: (a_val ~~ 'something%'::text)
         ->  Index Scan using a_val_idx on a a_1  (cost=0.42..8.44 rows=10 width=4) (actual time=0.004..0.004 rows=0 loops=1)
               Index Cond: ((a_val ~>=~ 'other'::text) AND (a_val ~<~ 'othes'::text))
               Filter: (a_val ~~ 'other%'::text)
 Planning Time: 0.124 ms
 Execution Time: 0.164 ms
(11 rows)
using a trigram GIN index:
EXPLAIN (analyze, timing)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on a  (cost=176.15..247.11 rows=20 width=4) (actual time=0.036..0.036 rows=0 loops=1)
   Recheck Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
   ->  Bitmap Index Scan on a_val_trgm_idx  (cost=0.00..176.15 rows=20 width=0) (actual time=0.034..0.034 rows=0 loops=1)
         Index Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
 Planning Time: 0.091 ms
 Execution Time: 0.193 ms
(6 rows)
Well, fill a table with a few million rows and compare the results. If they both come back with roughly the same times, then you've learned they're both equally good :)
Sometimes it's good to ask people "Is this true of XX?", but if you're curious it often takes less time to explore and test than to wait for somebody to give you an answer - an answer that might actually be wrong. (Some things change across DB releases, as I've learned to my cost. If I'm ever unsure whether one approach is faster than the other IN MY SPECIFIC SITUATION, I test. Random benchmarks are often not as useful as real-world tests.)
I've always said SQL is a bad abstraction. An artifact from the past that we have to deal with, like CSS and JavaScript.
This article solidifies my point. Seriously, ORs are bad so UNION two queries as a cheaper alternative? There's no rhyme or reason to it at the API level. You have to look under the hood to see why this hack works. Hence why SQL is such a bad abstraction. We program and optimize SQL not with logic but with memorization of hacks.
> Seriously, ORs are bad so UNION two queries as a cheaper alternative?
They should be equivalent, letting you choose the notation that expresses the problem cleanly. If relations were sets of tuples, boolean expressions could be rewritten safely as straightforward set arithmetic and vice versa.
But SQL relations (tables) are bags of tuples (rows), which breaks that equivalence.
That's why the abstraction is bad. The abstraction is not zero cost.
Instead a high level API should be exposed that mirrors the underlying implementation with a bijection. Databases are the bottleneck of most web development and we need to optimize "bags of tuples" not "sets" in order to gain the performance we need. The problem is, the API (SQL) exposes things as if the underlying implementation was sets. Hence the need to memorize ugly hacks like this in order to optimize SQL.
SQL is a language designed for business users, not a language designed to optimize data retrieval speed. We need an API that can do the latter because web devs are doing all these optimizations in the business language.
I'm not saying there's a better solution out there. I'm exposing a problem. The problem is, SQL is not a zero cost abstraction. We need a "Rust" for databases: a high level query language where the optimization paths are isomorphic to the optimization paths in the underlying implementation.
Maybe at first glance you don't agree with me. But think of it this way. If an optimization path in PostgreSQL involves rewriting a query with an OR clause as a UNION of two queries, and all SQL implementations are loaded with strange hacks like this... is there a problem? And what is the solution?
>There's no rhyme or reason to it at the API level. You have to look under the hood...
Every abstraction is leaky. I have found that the best engineers are those who understand how things work under the hood. They can solve hard problems when they come up. People who rely upon a mountain of libraries are stumped and begin trying random things until something works.
Honestly, what is the problem with understanding how your tools work? I don't know of any other engineering profession that thinks this way.
>Honestly, what is the problem with understanding how your tools work? I don't know of any other engineering profession that thinks this way.
There is actually a huge problem with this. Human intelligence is limited, and therefore much of the art and science of computer programming is targeted towards managing complexity. The fundamental concept that humans use to reduce complexity in everything from math, physics, language, and chemistry is abstraction. When you look at a wall you are looking at an abstraction. To truly understand the nature of the wall you must understand that it is composed of plant cells (wood), which are composed of carbon-based molecules, which at their center are all serialized in a single hereditary molecule called DNA, which is itself composed of atoms governed by complex differential equations, and on and on and on...
You don't need to know any of this to understand the wall because of abstraction. You encapsulate all that complexity into a concept called a wall. Programming is largely the same thing, with abstraction made explicit. To move forward, to produce things of greater complexity we have to move in the direction of good, non-leaky abstractions. Limitations in our intelligence will inevitably force us to become ignorant of the underlying implementations of many technologies so the abstractions around these things must be excellent. It is the only way forward and it is also arguably the reason why humans are different from other animals. We have the ability to abstract.
This concept is everywhere in computer science. The web programmer doesn't need to understand assembly language even though it's the underlying implementation of all his tools. But you are right in the sense that it doesn't mean a programmer can't learn assembly. Still, this illustrates my point: humans have a limited ability to handle complexity and therefore we must use abstraction to handle it.
In fact, there is an area of computer science where this has become startlingly obvious: machine learning. Take a vision algorithm that can recognize objects. Currently the de facto way of doing this is training a black box of neural networks. The complexity is so high that we have no idea how to program such algorithms by hand, and the only way forward is to rely on these black box abstractions where we "train" the system. Even if the programmer pries open the neural network, understanding what each neuron is doing in the context of the other neurons is impossible.
In short. To build systems of greater complexity developers must use tools to abstract complexity.
SQL has a different, if not tangential, problem. It's not leaking enough complexity. It's hiding too much, and producing an interface that is expensive in terms of performance. I'm proposing an abstraction that is one to one: where, if I learn the language and understand how to optimize in it, a bijection lets me optimize the underlying implementation as well without understanding it. To borrow the term from Rust and C++: zero cost abstractions.
A database is not on the same level as e.g. the CPU. Almost none of us could write assembler for a modern chip without a lot of initial research. That's ok because we have compilers, the OS, and other layers to handle that for us.
However, even for your bog standard web dev, a database is up at the top of the list of systems you need to understand in order to do your job. If you have no clue how to scale it out, how to optimize queries, hell, even how to write non-trivial queries (!), you're not doing your job. You sit there and write an N+1 on every page via your favorite ORM and no one knows why everything is so slow.
A carpenter doesn't have to be a blacksmith, but they do need to know how a saw works. Not knowing the basics of how a DB works makes you disposable, just another JS/HTML/CSS monkey. Can't solve the hard problems, but has no problem bloating every page they create beyond reason.
>That's ok because we have compilers, the OS, and other layers to handle that for us.
This is the essence of my argument. Why can't the database API occupy a layer similar to the OS or compilers?
If I don't understand the underlying implementation of compilers and OSes, why do I need to understand the underlying implementation of my database? Why is the OS abstraction different from the DB SQL abstraction? Can't we write abstractions that are zero cost?
The answer is we can; however, we haven't, and therefore the only solution available to us is to use SQL and understand the lower level details of the database.
>However, even for your bog standard web dev, a database is up at the top of the list of systems you need to understand in order to do your job.
I'm not talking about your responsibilities on the job. I'm questioning the current state of tech. A web developer uses an API exposed by an OS and doesn't have to understand the OS. A web developer uses an API exposed by the compiler and doesn't have to understand Assembly. A web developer uses an API exposed by a database but he MUST understand the underlying features of a database. See the difference? I'm pointing out an issue that could be improved not what you need to know for your job.
>A carpenter doesn't have to be a blacksmith, but they do need to know how a saw works.
SQL is a saw that breaks in half on every use. In order to use such a saw a carpenter must have the knowledge of a blacksmith to reforge the metal and combine the two halves of the saw into a single metal blade. A leaky abstraction. That is a proper analogy for SQL.
>This is the essence of my argument. Why can't the database API occupy a layer similar to the OS or compilers?
>If I don't understand the underlying implementation of compilers and OSes, why do I need to understand the underlying implementation of my database? Why is the OS abstraction different from the DB SQL abstraction? Can't we write abstractions that are zero cost?
Because it's closer to your application. An OS doesn't need to know how e.g. a Customer relates to a Business in order to read a file. Some of us do need to know the ins and outs of the CPU and OS because our application requires it. Yours doesn't, but it does require knowledge of how the DB works.
>I'm not talking about your responsibilities on the job. I'm questioning the current state of tech. A web developer uses an API exposed by an OS and doesn't have to understand the OS. A web developer uses an API exposed by the compiler and doesn't have to understand Assembly. A web developer uses an API exposed by a database but he MUST understand the underlying features of a database. See the difference?
Not really, no. As a web dev you are about 100 layers away from the OS most of the time, but your application logic and architecture is very much dependent upon the information store. I spent years working in digital imaging and automation and had to know what my compiler was doing and how my OS implemented all sorts of things because my application depended on it.
>SQL is a saw that breaks in half on every use. In order to use such a saw a carpenter must have the knowledge of a blacksmith to reforge the metal and combine the two halves of the saw into a single metal blade. A leaky abstraction. That is a proper analogy for SQL.
This is what irks me. I hear stuff like this all of the time from people who solve relatively trivial problems day to day, but for whatever reason think things should be better than they are. They, of course, have no idea what that would look like, but damn it if they don't have a strong opinion on the matter.
I'm not saying that's you necessarily, but if you hadn't noticed, these things are hard. This is the best we have come up with so far. Planes still crash, cars don't drive themselves, and my laptop crashed literally 30 minutes ago.
In my experience, the people who complain most loudly about poor abstractions, those who harp on you for not following "modern design patterns", and generally complain about everything write the worst code. I eagerly await your proposal for a better data store and application protocol that solves all of the problems SQL currently does without any of the warts.
>Because it's closer to your application. An OS doesn't need to know how e.g. a Customer relates to a Business in order to read a file. Some of us do need to know the ins and outs of the CPU and OS because our application requires it. Yours doesn't, but it does require knowledge of how the DB works.
SQL is 100% trying to achieve the opposite of what you say. Look at the query language. The literal intention of SQL is to hide everything from the user. EXPLAIN ANALYZE was tacked on later, when people started using SQL in the wrong way, causing the godawful and impossible-to-remove mess we have today.
I'm arguing for a language that's one to one with the implementation. Zero cost abstractions like C, C++ and Rust.
>Not really, no. As a web dev you are about 100 layers away from the OS most of the time, but your application logic and architecture is very much dependent upon the information store. I spent years working in digital imaging and automation and had to know what my compiler was doing and how my OS implemented all sorts of things because my application depended on it.
No to you, dude. You're not 100 layers away from the OS. All the languages that you use to develop a web app have system calls; you are essentially 1 layer away from the OS. The reason why we can use high level languages like Ruby or Python is that the database is a bottleneck and it doesn't really matter how slow the app is... But even these languages expose OS-level system calls, so it's essentially the same experience as a "low level language", sans memory management.
Also, the fact that you needed to understand what your compiler was doing is important. Did you have to do SQL-level hacks in order to get your compiler to do what you wanted it to do? I'm sure your compiler was DESIGNED to be low level so that you didn't see BS like UNION vs. OR.
>Because it's closer to your application. An OS doesn't need to know how e.g. a Customer relates to a Business in order to read a file.
You realize the entire initial purpose of SQL was to let laymen ask business questions. The things we do with it nowadays are mostly optimizations via hacks and query string construction via metaprogramming. I'm not saying don't understand your toolset's implementation (even though in an ideal world you shouldn't need to); I'm saying there's a problem when an optimization involves a high-level API hack. Literally a week ago I wrote a SQL optimization hack that improved speed by at least 3x; it involved serializing a data structure into a string and doing string manipulation to change the data structure into something easier to deal with... before casting it back down... What other API in the universe requires you to do this? A bad one.
>In my experience, the people who complain most loudly about poor abstractions, those who harp on you for not following "modern design patterns", and generally complain about everything write the worst code. I eagerly await your proposal for a better data store and application protocol that solves all of the problems SQL currently does without any of the warts.
I have no proposal; I'm not trying to design a solution, I'm saying there's a problem. I'm not a database developer. Are you trying to imply that I shouldn't talk unless I'm a database developer with a solution in mind? Are you a database developer?
Also I hate design patterns, I avoid design patterns like the plague.
>This is what irks me. I hear stuff like this all of the time from people who solve relatively trivial problems day to day, but for whatever reason think things should be better than they are. They, of course, have no idea what that would look like, but damn it if they don't have a strong opinion on the matter.
What's your definition of a "trivial problem"? I'm literally doing that every day. I'm working on building SQL views for a database littered with technical debt and JSON-typed columns. Literally all I'm doing is writing SQL views to cover all this crap up so customers can query our database as if everything were properly sequestered into tables rather than JSON blobs. Day in and day out I poke at a query, trying to coax more speed out of it using non-intuitive and illogical hacks. All I do is SQL (currently). Trivial as hell.
What do you do? Must be some super crazy dream project for solving some ultra hard problem. You talk as if you're elite.
>What do you do? Must be some super crazy dream project for solving some ultra hard problem. You talk as if you're elite.
I don't have any more time for this at the moment, but I promise you that your views are not nearly as complex as designing and maintaining a database engine. I'm sorry if that hurts your feelings. Remember; I'm not the one bemoaning "poor abstractions" and offering exactly zero ideas as to how it could be improved.
>I don't have any more time for this at the moment, but I promise you that your views are not nearly as complex as designing and maintaining a database engine. I'm sorry if that hurts your feelings. Remember; I'm not the one bemoaning "poor abstractions" and offering exactly zero ideas as to how it could be improved.
No time or more like nothing left to say?
Why be sorry? My feelings aren't hurt when you're completely and utterly wrong. You should be the one with your feelings hurt, now that you've been shown how off-base you are.
Additionally, who said I'm talking about a database engine?? I'm talking about the API, man. I never mentioned anything about the engine itself. I'm talking about how a USER accesses the lower level operations that conduct JOINS and SCANS. We don't have explicit control over the individual steps you see in an EXPLAIN; we can only influence them indirectly via a SQL query. I'm literally only referring to SQL.
Man, when you started speculating about what I do with "non-trivial problems," that's when you took this argument into no man's land.
The very idea that you have to be an expert in something to comment on a problem is literal mental stupidity. I don't restrict myself and neither do you, because clearly YOU aren't a database developer either.
Think about it this way: I'm not an expert, but am I allowed to comment on the environment? About global warming? Am I not allowed to support either of these causes without a PhD in environmental science? Of course I'm allowed. As are you.
You're even allowed to speculate on my coding practices, patterns and career... "trivial problems" as you've already implied. But it's a dirty low blow when you do stuff like that, I tend to avoid it.
I think the main issue is that to replace SQL, you'll need to build a database which you then need some method of interfacing with.
You could expose the underlying structures and operations with some API of course, but so much work has gone into optimizing the high-level CRUD/join operations that you'll force your users to reinvent the wheel.
You're not wrong. Just note that in almost every other area of the stack, language and technology transitions happen all the time. SQL is one of the few parts of the stack that has remained steadfast. Many argue it's for good reason, and I would agree, but I would also agree that there is much room for improvement.
What's strange to me is why developers welcome change anywhere else in the stack but vehemently dislike suggestions for improvement in the relational database API.