What about storing the data, and thus the indexes, in Kafka? Would it make sense? Currently I'm storing SirixDB resources in files, but instead of offsets into a file, the index pages could optionally be stored in Kafka (or Pulsar...). Is Kafka too slow for this, or only for specific row-tuples? We could build a combined storage that caches the pages locally or also stores them in the file system and asynchronously writes them to Kafka, S3, or whatever.
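A minimal sketch of that combined-storage idea in Java (all interface and class names here are hypothetical, not SirixDB's actual API): the local write stays synchronous and makes the commit durable, while the upload to Kafka/S3 happens asynchronously.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface PageBackend {
  void write(long pageKey, ByteBuffer page);          // durable, synchronous
}

interface RemotePageBackend {
  CompletableFuture<Void> upload(long pageKey, ByteBuffer page);
}

final class CombinedStorage implements PageBackend {
  private final PageBackend local;                    // e.g. the existing file storage
  private final RemotePageBackend remote;             // e.g. a Kafka or S3 writer
  private final ExecutorService uploader = Executors.newSingleThreadExecutor();

  CombinedStorage(PageBackend local, RemotePageBackend remote) {
    this.local = local;
    this.remote = remote;
  }

  @Override
  public void write(long pageKey, ByteBuffer page) {
    local.write(pageKey, page.duplicate());           // commit is durable once this returns
    uploader.submit(() -> remote.upload(pageKey, page.duplicate()));  // ship remotely in the background
  }
}
```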
It's fundamentally how SirixDB approaches this (it basically also stores checksums), as also mentioned in another reply :-)
Every commit directly syncs the binary data to durable storage (currently a file) and incrementally adds data. Furthermore, it optionally stores the changes (type of change, ctx node, updatePosition...) in JSON files. For instance, lately I've implemented a simple copy mechanism based on this: copy a given revision and optionally apply all changes with intermediate commits, so as to also copy the full history up to the most recent revision. However, the main idea is to use the change tracking also for diff visualizations... maybe even stream these via web sockets.
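Roughly, each tracked change could be modeled like this (the fields mirror the terms above, but this is an illustration, not the exact JSON schema SirixDB writes):

```java
// Illustrative shape of one tracked change entry; field names are hypothetical.
enum ChangeType { INSERT, UPDATE, DELETE, REPLACE }

record Change(ChangeType type,      // type of change
              long ctxNodeKey,      // the context node the change was applied to
              long updatePosition)  // where relative to the context node
{}
```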
It basically turns updates into appends and is based on a persistent tree structure (only the header with a reference to the (revision) root page has to be swapped atomically). Other than that, the revision indexes for new data are always appended. In order to reduce copy-on-write overhead for updated page fragments, a sliding snapshot algorithm is applied to the data pages.
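A sketch of that append-then-swap commit pattern (hypothetical names, not SirixDB's internal API): the new page data and the serialized revision root page are appended and fsynced first, and only afterwards is the small header pointer overwritten atomically to publish the revision.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class AppendOnlyCommit {
  private final FileChannel data;    // append-only data file
  private final FileChannel header;  // small, fixed-size header file

  AppendOnlyCommit(Path dataFile, Path headerFile) throws IOException {
    data = FileChannel.open(dataFile, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    header = FileChannel.open(headerFile, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE);
  }

  /** Appends the serialized revision root page and returns its file offset. */
  long appendRevisionRoot(ByteBuffer revisionRootPage) throws IOException {
    long offset = data.size();
    data.write(revisionRootPage);
    data.force(true);                // make the appended data durable first
    return offset;
  }

  /** Atomically publishes the new revision by overwriting the 8-byte header pointer. */
  void swapHeader(long revisionRootOffset) throws IOException {
    ByteBuffer ptr = ByteBuffer.allocate(Long.BYTES);
    ptr.putLong(revisionRootOffset).flip();
    header.write(ptr, 0);
    header.force(true);
  }
}
```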
Naturally, unchanged pages are simply referenced (e.g. through offsets into the file, thus sharing unchanged pages between revisions).
What's also special is a path summary of all unordered paths in a resource, which enables user-defined, tailored (and thus smaller) secondary indexes as well as other query rewrites :-)
(I ask because AFAIK LMDB and its derivatives do a similar-sounding thing: pages are updated within a write transaction by first allocating freelist pages and writing out new copies of those pages with the changes included; because the pages store a B-tree, these changes recurse upward until a modified copy of the root page is made; a commit-log pointer is updated to point to the new root page; and then the old, rewritten pages are put into the freelist.)
Basically, it retains all revisions. Furthermore, the main document index is a keyed trie, much like hash array mapped tries. That is, it stores an array as a tree and uses compact page layouts (bitmaps, pages with 4 references, full pages) to reduce the page size if pages are not full. However, Sirix assigns monotonically increasing, immutable, unique node identifiers, so most inner pages are full with references to the next-level pages (checksums of the child pages are also stored along with the references, as in ZFS). The height of the tree increases dynamically. Currently every inner page stores at most 1024 references, so it's a very wide tree, but we should experiment with other sizes.
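With a fan-out of 1024, descending the trie boils down to splitting the node key into 10-bit digits, one per level. A small sketch of that idea (constants and names are illustrative, not SirixDB internals):

```java
final class TriePath {
  static final int FANOUT_BITS = 10;        // 2^10 = 1024 references per inner page
  static final int FANOUT = 1 << FANOUT_BITS;

  /** Returns the reference slot to follow at each level, from root (index 0) down to the leaf level. */
  static int[] pathFor(long nodeKey, int height) {
    int[] offsets = new int[height];
    for (int level = height - 1; level >= 0; level--) {
      offsets[level] = (int) (nodeKey & (FANOUT - 1));  // lowest 10 bits select the slot at this level
      nodeKey >>>= FANOUT_BITS;                          // next 10 bits belong to the level above
    }
    return offsets;
  }
}
```

For example, `TriePath.pathFor(1_234_567L, 3)` yields the three slot indices to follow from the root page down to the leaf page holding that node.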
The leaf pages of the trie store either the data itself (the nodes), nodes of the path summary, nodes of the secondary indexes...
Thus, we have the main document index, but a RevisionRootPage also has references to the tries which store the secondary indexes. The secondary indexes are read into main memory / reconstructed from the leaf pages of the tries (they are usually small), as is the small path summary.
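Very roughly, a revision root page therefore references something like the following (field names are purely illustrative; the real RevisionRootPage in SirixDB differs):

```java
// Simplified sketch of what a revision root page points to.
record PageReference(long fileOffset, byte[] checksum) {}

record RevisionRootPage(int revisionNumber,
                        long commitTimestamp,
                        PageReference documentTrie,          // main document index
                        PageReference pathSummary,           // summary of all unordered paths
                        PageReference[] secondaryIndexTries) // one trie per secondary index
{}
```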
The data pages are not simply copied... only nodes which changed or which fall out of a sliding window are written. Thus, a page may have to be reconstructed in memory from at most a small number N of page fragments in the worst case. It therefore needs a device which is suitable for fast random, small-sized parallel reads and sequential writes.
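A sketch of that reconstruction step (types and names are hypothetical): the fragments are fetched in parallel and merged newest-first, so each record slot comes from the most recent fragment that contains it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

record PageFragment(int revision, Map<Integer, byte[]> records) {}

final class PageReconstructor {
  /** fragmentFutures must be ordered newest fragment first; the reads were issued in parallel. */
  static Map<Integer, byte[]> reconstruct(List<CompletableFuture<PageFragment>> fragmentFutures) {
    Map<Integer, byte[]> page = new HashMap<>();
    for (CompletableFuture<PageFragment> future : fragmentFutures) {
      PageFragment fragment = future.join();
      fragment.records().forEach(page::putIfAbsent);  // slots already filled by a newer fragment win
    }
    return page;
  }
}
```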
Currently you have to copy a resource starting from a given revision and apply all updates up to the most recent revision with intermediate commits in order to get rid of old revisions, as it uses only one data file per resource (a resource is roughly equivalent to a table in a relational system). Thus, the data files are basically logs. Another file simply stores offsets and timestamps, which are read into memory to retrieve a given revision.
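That offsets/timestamps file makes point-in-time lookups cheap: load the entries once and binary-search for the latest revision committed at or before the requested instant. A sketch with illustrative names:

```java
import java.time.Instant;
import java.util.List;

record RevisionEntry(Instant commitTimestamp, long revisionRootOffset) {}

final class RevisionLookup {
  /** entries are sorted by commit timestamp, oldest first; returns null if nothing matches. */
  static RevisionEntry findRevision(List<RevisionEntry> entries, Instant pointInTime) {
    int lo = 0, hi = entries.size() - 1, best = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (!entries.get(mid).commitTimestamp().isAfter(pointInTime)) {
        best = mid;         // candidate: committed at or before the requested time
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return best >= 0 ? entries.get(best) : null;
  }
}
```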
Note that if your updates are much smaller than a page, you're gonna have a bad time with LMDB. Optimizations like WAL and group commit exist for a reason.
I think one of the main problems was that the marketing was especially terrible: they branded both consumer SSDs with an Optane cache and the "real" Optane DC persistent memory, which goes into the appropriate DIMM slots, as Optane.
Another issue may be that Optane DC persistent memory simply was not fast enough to replace non-persistent RAM.
Still, I hope that another technology will arise which is byte-addressable and persistent/durable. I think it could radically change the design of database systems again. You wouldn't need a page cache / buffer manager which retrieves same-sized blocks (or multiples of a block), for instance. You probably wouldn't even need serialization/deserialization to disk.
It would be great if it were possible to, for instance, read 512-byte blocks from disk with current SSDs, but I guess the per-block metadata overhead might be too big.