> For KiWi, storing deletes separately could be interesting
The deletes in different list is super useful in row-major mode if the data bandwidth is an issue and the delete lists can entirely remove the row from the IO.
Apache Hive switched to splitting up the deletes and insert files a while back to get some row-read savings when parallelizing reads - this is pretty much standard tombstoning (but with an immutable fs), but the LSM style merge sucks when you have a scan project/SELECT + schema evolution across different files.
The delete records always have a fixed schema and a single insert sequence is compacted/inserted in the same schema, which prevents the need to switch schema between rows when being read.
Even if the schema rarely changes, but just assuming it can happen and checking for it is slow across row versions of the same row.
assuming that data is being inserted continuously at the same rate, periodic full compaction of an lsm tree with an upper bound on time between full compactions is O(n^2). depending on the quantity and distribution of deletes, you can potentially do much better with incremental compaction strategies while still maintaining an upper bound on the delay in processing a delete.
For KiWi, storing deletes separately could be interesting but I’m not sure about its exact disk layout and how the resorting works.