Used Oracle for a long time but have never had the need to do this. Can I ask wh...

jzoch · on Nov 11, 2019

We have a problem at work where one of our tables is 5 billion rows and we get a csv (or maybe a few csv's ;) ) with a total of 21 million primary keys. We now need to find which of these 21 million keys are already in our table and do some task on those that are not.

Originally, we batched 1k records at a time and used an IN. It was taking hours. Each query finished in 1.5 seconds. Then we moved to a global temp table. It was around 30% faster iirc. Still not good enough (open to suggestions!!!) but way better. Wish oracle would just do this for us internally.

bhaak · on Nov 14, 2019

How about this: 21 million primary keys are easily kept in memory. Sort them.

Then do "SELECT id FROM five_billion_rows ORDER BY id" and iterate over that with the CSV ids in lockstep.

Cursor management is pretty good in Oracle, this is likely much better than your own paging with 1k of IN values. I would assume that it would be IO bound regarding the index of the id column.

This also might be an interesting question for "Ask HN".

Tarq0n · on Nov 15, 2019

Is an anti join not feasible in this scenario?

psYchotic · on Nov 11, 2019

Just as a heads up for anyone considering using Oracle's built-in support for temporary tables, there are a few restrictions [1], no support for distributed transactions being the main one I ran into at a previous client.

Taking into consideration the fact that you'll want to avoid hard parses [2][3] whenever possible, this leaves you the option of mimicking a temporary table manually (which will probably include having to manually fix statistics on your fake temporary table), or maybe use some string manipulation in your query to split a delimited list (which is itself limited to 4000 characters [4]).

That being said, I've been told that distributed transactions should themselves be avoided, so maybe look into that before getting mad at Oracle for what seems to be an arbitrary limitation on temporary tables.

[1]: https://docs.oracle.com/database/121/SQLRF/statements_7002.h...

[2]: https://docs.oracle.com/database/121/TGSQL/tgsql_sqlproc.htm...

[3]: https://blogs.oracle.com/sql/improve-sql-query-performance-b...

[4]: https://docs.oracle.com/database/121/SQLRF/sql_elements001.h...

Someone1234 · on Nov 11, 2019

Generating reports.

I think saying you "I can’t imagine [a usage] which isn’t solved by [workarounds]" kind of misses the core issue. It is a common use-case (to select a list of rows via PKs), and you're having to work around internal structures of the database, rather than it exposing a simple primitive and the database doing work behind the scenes to make it efficient.

If building a temp table for that many IN clauses is more efficient absolutely do it. But don't expose this ugly to the outside world, particularly as the problem space and solution are well defined, common, and old.

The fact that Oracle continues to ignore basic quality of life improvements and its defenders point to convoluted workarounds that push additional work onto its users is half the problem.