I have suggested MERGE as a last resort (due to the main table having circa 40(!) columns), not as the preferred option – please refer to the last paragraph in my comment. Yes, it is indeed difficult to recommend something more specific without knowing the exact details.
MERGE does have a number of its own peculiarities that are database engine specific and the MERGE behaviour is not portable across RDBMS's.
For instance, Oracle simply does not care how wide the transaction context around a MERGE is: the transaction will either succeed or fail no matter how long it takes. MS SQL Server, on the other hand, is very touchy and will kill off the transaction if, say, an index update caused by a MERGE is taking too long (from the SQL Server perspective). So for MS SQL Server it is better to minimise the transaction context containing the MERGE and to carefully assess the index update performance. I would expect Postgres to behave closer to Oracle here, but I have not looked into it.
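To illustrate what "minimising the transaction context" means in practice, here is a rough T-SQL sketch with invented table and column names (a sketch only, not a recommendation for any concrete schema): any slow preparation work stays outside the transaction, and the transaction wraps the MERGE alone.

    -- Invented names; the point is only the shape of the transaction.
    -- Slow preparation (loading a staging table) happens outside it.
    INSERT INTO dbo.staging (col1, col2, col3)
    SELECT col1, col2, col3 FROM dbo.incoming_feed;

    BEGIN TRANSACTION;
    MERGE dbo.target AS t
    USING dbo.staging AS s
       ON t.col1 = s.col1
    WHEN MATCHED THEN
      UPDATE SET col3 = s.col3
    WHEN NOT MATCHED THEN
      INSERT (col1, col2, col3) VALUES (s.col1, s.col2, s.col3);
    COMMIT TRANSACTION;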
More generally speaking (and if we ignore MERGE implementation specific details for a moment), remember that the UPSERT pattern has come about as a workaround for the missing MERGE functionality which was added into the specification very late, and vendors were slow to implement it. MERGE covers a few very useful use cases, especially for complex UPSERT scenarios, but it does not obviate UPSERT's in simple scenarios. Both are useful.
My main gripe is the idea that MERGE is somehow better than INSERT ON CONFLICT for upserts, which I don't think is ever true. I don't think I agree that it was added as a workaround - it was added rather as a proper solution, but for a more limited use case which is popular in OLTP workloads: simple thread-safe upsert.
MERGE definitely has its place in complex ETL-style workloads, where you may also want to delete stuff, or change data in the source table as well as in the target table.
Regarding SQL Server - it's not true that SQL Server will kill transactions based on some timeout (unless a deadlock is detected). I have worked with SQL Server extensively, including using MERGE for upserts (by the way, you have to use serializable isolation to make MERGE thread-safe, so it's not that easy to get rid of transactions altogether). SQL Server doesn't kill transactions except when there is a deadlock.
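To be concrete, the usual way to get that serializable behaviour in SQL Server is a table hint scoped to the single MERGE statement rather than a session-wide isolation change. A minimal sketch, with made-up table and column names:

    -- HOLDLOCK is equivalent to a SERIALIZABLE hint, scoped to this statement.
    DECLARE @id int = 1, @val nvarchar(100) = N'example';

    MERGE dbo.target WITH (HOLDLOCK) AS t
    USING (SELECT @id AS id, @val AS val) AS s
       ON t.id = s.id
    WHEN MATCHED THEN
      UPDATE SET val = s.val
    WHEN NOT MATCHED THEN
      INSERT (id, val) VALUES (s.id, s.val);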
FWIW, in the SQL Server community there's a well-known notion of MERGE being buggy (even though it's been supported for some 15 years!) and hard to reason about (especially when there are triggers on the tables involved), see for example: https://www.mssqltips.com/sqlservertip/3074/use-caution-with...
I agree with you that Postgres' MERGE is most likely modeled after Oracle, but haven't looked into it either.
> My main gripe is the idea that MERGE is somehow better than INSERT ON CONFLICT for upserts […]
I, for one, find the MERGE syntax to be easier to read, more flexible and versatile and, most importantly, standardised. It is, effectively, UPSERT++ if you like, as it also allows one to delete rows from a table if there is a condition match. Compare
    MERGE target
    USING source
       ON target.col1 = source.col1 AND
          target.col2 = source.col2 AND
          …
    WHEN MATCHED THEN
      UPDATE SET -- or DELETE
        target.col3 = source.col3,
        …
    WHEN NOT MATCHED BY TARGET THEN
      INSERT (…) VALUES (…)
with
    INSERT INTO target (col1, col2, col3, …)
    SELECT col1, col2, col3, …
    FROM source
    ON CONFLICT (col1)
    DO UPDATE
      SET col3 = EXCLUDED.col3;
Personally, I prefer the MERGE option, but your mileage may vary.
I also deem MERGE, as a technical term, to be more concise and semantically much closer to the intent it describes than INSERT ON CONFLICT. Rows are routinely updated in a database (that is its job, after all), and a row update is not inherently a conflicting update. Conflicts (semantically) connote exceptional situations that have to be dealt with, well, in exceptional ways. But I digress, as this is more of a linguistic subject.
> […] it was added rather as a proper solution, but for a more limited use case which is popular in OLTP workloads: simple thread-safe upsert.
As far as SQL (the language and the standard) is concerned, it is unaware of threads or thread safety; SQL is concerned with transactions and transactional integrity. The database provides and ensures the ACID behaviour and guarantees, and it may or may not use threads to accomplish that (as a matter of fact, we know that most do, but that is an implementation detail).
And MERGE is perfectly suitable for a variety of processing scenarios, OLTP included. In one of my past projects from a few years back, replacing a series of hand-rolled UPSERT's with a single, carefully written MERGE yielded a 1500% performance increase for real-time freight item scan events flowing into the main table (granted, the inability to improve the shoddily designed schema was a hard constraint, so a certain level of creativity was necessary). That was at a scale of 10+ million processed events per hour, which is not huge but substantial.
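The gist of that rewrite, in very rough strokes and with invented names (this is a sketch, not the actual project code), was going from one upsert call per scan event to a single set-based MERGE over each incoming batch:

    -- Each batch of events is bulk-loaded into a staging table first,
    -- then folded into the main table in one statement.
    MERGE dbo.scan_events AS t
    USING dbo.scan_events_staging AS s
       ON t.item_id = s.item_id AND t.scanned_at = s.scanned_at
    WHEN MATCHED THEN
      UPDATE SET status = s.status, location = s.location
    WHEN NOT MATCHED THEN
      INSERT (item_id, scanned_at, status, location)
      VALUES (s.item_id, s.scanned_at, s.status, s.location);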
> MERGE definitely has its place in complex ETL-style workloads
As an aside, ETL workloads are not expected to modify a database in place, which would otherwise make them one of the most vile and reviled integration anti-patterns. The (E) and (L) steps imply a distinct source and a distinct target, and the (T)ransform step takes place outside the DB, either in the integration layer or in a specialised ETL tool. That aside, I fail to see what is so ETL-specific about MERGE; if there is a complex UPSERT scenario, a MERGE could be a good candidate (or not) – the onus is on the engineer to carry out the analysis and make the right choice.
> Regarding SQL Server - it's not true that SQL Server will kill transactions based on some timeout (unless there's a deadlock detected).
It is true (or it was in SQL Server 2012). In the project I mentioned earlier, I inherited a design and an implementation that were nothing short of an unmitigated dumpster fire, which also included «let's add another few random compound indices because what could possibly go wrong?». That had to be scrapped and re-engineered from scratch. It turned out that SQL Server had a penchant for allotting a specific time window for a transaction to complete (irrespective of whether the transaction was serialised or not), and if an index update took longer than the window allotted to the transaction, the DB engine would kill the transaction off. This behaviour was so unexpected that it caused multiple catastrophic cascading failures during the first production deployment attempt, and the deployment had to be aborted and rolled back. I had to enlist a DBA to work on the post-mortem and the subsequent redesign to understand the root cause; it turned out to be documented SQL Server transaction engine behaviour. A painstaking, lengthy low-level index performance analysis and a subsequent meticulous, complete index redesign solved the problem in the end.
> As far as SQL (the language and the standard) is concerned, SQL is unaware of threads or the thread safety; SQL is concerned with transactions and with the transactional integrity. The database provides and ensures the ACID behaviour and guarantees
When I say thread-safety, I just mean the general notion that running a certain SQL statement or procedure concurrently from multiple processes/threads/users/connections will result in correct execution without race conditions.
SQL is definitely concerned with something very close to the notion of thread-safety, namely transaction isolation (the I in ACID). It's the property that controls concurrent execution of queries in the database, and it deals with what the SQL literature calls "phenomena". Phenomena are essentially a set of specific types of race conditions which occur under different isolation levels.
And because the default isolation level in most popular DBs is "Read Committed" - that is, a very relaxed isolation level allowing a lot of race conditions - some pretty basic operations such as upsert/merge are not thread-safe (or, if you dislike the term, you may say they "have race conditions" or "do not avoid certain phenomena").
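The textbook illustration is the check-then-act upsert. A minimal sketch with an invented table: under Read Committed, two concurrent sessions can both see that the row is missing and both attempt the INSERT, so one of them fails (or you get duplicates if there is no unique constraint); the atomic form lets the database resolve the race itself.

    -- Naive upsert: racy under Read Committed.
    UPDATE counters SET hits = hits + 1 WHERE name = 'page';
    -- ...and if no row was updated:
    INSERT INTO counters (name, hits) VALUES ('page', 1);

    -- Atomic form (Postgres): no race.
    INSERT INTO counters (name, hits) VALUES ('page', 1)
    ON CONFLICT (name) DO UPDATE SET hits = counters.hits + 1;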
> It turned out to be the documented SQL Server transaction engine behaviour.
Would appreciate the link - it's either something that I haven't seen, or we're just using different terms for something deadlock-related.
>(T)ransform step takes place outside the DB
In a perfect world, probably yes, but in reality there are plenty of cases where you have staging tables that are then merged into production tables - that's where MERGE is a good candidate, as it can handle all three of insert, update and delete.
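Roughly like this - a sketch with invented names; the BY TARGET/BY SOURCE branches shown here are the SQL Server flavour of the syntax:

    MERGE dbo.products AS t
    USING dbo.products_staging AS s
       ON t.product_id = s.product_id
    WHEN MATCHED THEN
      UPDATE SET name = s.name, price = s.price
    WHEN NOT MATCHED BY TARGET THEN
      INSERT (product_id, name, price) VALUES (s.product_id, s.name, s.price)
    WHEN NOT MATCHED BY SOURCE THEN
      DELETE;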