lichtenberger's comments

At least they are already discussing the next version.


What about storing the data, and thus the indexes, in Kafka? Would it make sense? Currently I'm storing SirixDB resources in files, but instead of offsets into a file, the index pages could optionally be stored in Kafka (or Pulsar...). Is Kafka too slow for this, or only for specific row-tuples? We could build a combined storage that caches the pages locally or also stores them in the file system, and asynchronously stores them in Kafka, S3 or whatever.
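
Something like a tiered page store (just a sketch; the interface and names are made up, not SirixDB's actual storage SPI):

    import java.util.concurrent.CompletableFuture;

    // Hypothetical pluggable page store: a page is written once and later
    // resolved via an opaque reference (file offset, Kafka offset, S3 key...).
    interface PageStore {
      CompletableFuture<PageReference> write(byte[] pageBytes);
      CompletableFuture<byte[]> read(PageReference ref);
    }

    record PageReference(String location, long offsetOrKey) {}

    // Combined storage: write locally first, replicate to Kafka/S3 asynchronously.
    final class TieredPageStore implements PageStore {
      private final PageStore local;   // e.g. file-backed
      private final PageStore remote;  // e.g. Kafka-, Pulsar- or S3-backed

      TieredPageStore(PageStore local, PageStore remote) {
        this.local = local;
        this.remote = remote;
      }

      @Override
      public CompletableFuture<PageReference> write(byte[] pageBytes) {
        return local.write(pageBytes)
                    .thenApply(ref -> { remote.write(pageBytes); return ref; });
      }

      @Override
      public CompletableFuture<byte[]> read(PageReference ref) {
        return local.read(ref); // a real implementation would fall back to remote on a miss
      }
    }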


Well, it may be a B-tree, or an LSM-tree, a trie or whatever index structure suits...

Also, of course you may have covering indexes.


"Database Design and Implementation" by Edward Sciore is also a very great read with a lot of examples written in Java (actually a small DBS).

For actual research, I'd recommend anything by Andy (Pavlo), Viktor Leis, Torsten Grust, Thomas Neumann...


That's fundamentally how SirixDB approaches this (it basically also stores checksums), as mentioned in another reply :-)

Every commit directly syncs the binary data to durable storage (currently a file) and incrementally adds data. Furthermore, it optionally stores the changes (type of change / context node / update position...) in JSON files. For instance, lately I've implemented a simple copy mechanism based on this: copy a given revision and optionally apply all changes with intermediate commits, in order to also copy the full history up to the most recent revision. However, the main idea is to use the change tracking also for diff visualizations... maybe even stream these via web sockets.
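
The copy mechanism is roughly this (a sketch with hypothetical names, not the actual API):

    import java.util.List;

    // Hypothetical stand-in for a SirixDB resource.
    interface Resource {
      int mostRecentRevision();
      byte[] snapshotOf(int revision);       // serialized state of one revision
      List<String> changesOf(int revision);  // tracked changes (JSON) of one revision
      void importSnapshot(byte[] snapshot);
      void apply(String jsonChange);         // replay one change (type / node / position)
      void commit();                         // persist as a new revision
    }

    final class ResourceCopier {
      // Copy `source` starting at `fromRevision` and replay every later revision's
      // change log with intermediate commits, so the copy retains the history.
      static void copyWithHistory(Resource source, Resource target, int fromRevision) {
        target.importSnapshot(source.snapshotOf(fromRevision));
        target.commit();
        for (int rev = fromRevision + 1; rev <= source.mostRecentRevision(); rev++) {
          for (String change : source.changesOf(rev)) {
            target.apply(change);
          }
          target.commit(); // one commit per source revision -> same revision granularity
        }
      }
    }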

BTW, a production-ready system along these lines may be Datomic.

And it also reminds me of this paper: https://dl.acm.org/doi/abs/10.5555/3275366.3284969


Have a look at my DB project: https://sirix.io | https://github.com/sirixdb/sirix

https://sirix.io/docs/concepts.html and the in-progress tutorial https://sirix.io/docs/jsoniq-tutorial.html may be especially helpful.

It basically turns updates into appends and is based on a persistent tree structure (only the header with a reference to the (revision) root page has to be swapped atomically; other than that, the revision indexes for new data are always appended). In order to reduce copy-on-write overhead for updated page fragments, a sliding snapshot is applied to the data pages.

Naturally, unchanged pages are simply referenced (e.g. through offsets into the file, thus sharing unchanged pages between revisions).
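
Roughly, a commit then looks like this (a minimal sketch under those assumptions; names and signatures are made up, not the actual implementation):

    import java.util.Map;

    final class CowCommit {
      record PageRef(long offset) {}

      interface AppendOnlyFile {
        long append(byte[] bytes);        // returns the offset the bytes were written at
        void swapHeader(long rootOffset); // atomic, fsync'ed header swap
      }

      interface RevisionRootBuilder {
        void put(long pageKey, PageRef ref);
        byte[] serialize();
      }

      static PageRef commit(AppendOnlyFile file,
                            Map<Long, byte[]> dirtyPages,      // pageKey -> new page fragment
                            Map<Long, PageRef> unchangedPages, // pageKey -> existing offset
                            RevisionRootBuilder rootBuilder) {
        // 1. Append only the changed page fragments; unchanged pages are shared.
        dirtyPages.forEach((key, bytes) -> rootBuilder.put(key, new PageRef(file.append(bytes))));
        unchangedPages.forEach(rootBuilder::put);

        // 2. Append the new revision root page referencing both old and new pages.
        PageRef root = new PageRef(file.append(rootBuilder.serialize()));

        // 3. Atomically swap the header pointer to the new revision root.
        file.swapHeader(root.offset());
        return root;
      }
    }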

What's also special is a path summary of all unordered paths in a resource, which enables smaller, user-defined, tailored secondary indexes and other query rewrites :-)


How does Sirix compare to LMDB (esp. MDBX)?

(I ask because AFAIK LMDB derivatives do a similar-sounding thing: within a write-transaction, pages are updated by first allocating freelist pages into which new copies of those pages, with the changes included, are written; these changes recurse upward, because the pages store a B-tree, until a modified copy of the root page is made; a commit-log pointer is updated to point to the new root page; and then the old, rewritten pages are put into the freelist.)


Basically, it retains all revisions. Furthermore, the main document index is a keyed trie, much like a hash array mapped trie. That is, it stores an array as a tree and uses compact page layouts (bitmap pages, pages with four references, full pages) to reduce the page sizes if they are not full. However, Sirix assigns monotonically increasing, immutable, unique node identifiers, thus most inner pages are full with references to the next-level pages (checksums of the child pages are also stored along with the references, as in ZFS). The height of the tree grows dynamically. Currently every inner page stores at most 1024 references, thus it's a very wide tree, but we should experiment with other sizes.
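
For illustration, descending the trie by node key looks roughly like this (a sketch assuming 1024 references per inner page, i.e. 10 bits of the node key per level; the types and methods are hypothetical):

    // Navigate the keyed trie by node identifier. With a fanout of 1024,
    // each trie level consumes 10 bits of the key.
    final class TrieLookup {
      static final int FANOUT_BITS = 10;                  // 2^10 = 1024 references per inner page
      static final int FANOUT_MASK = (1 << FANOUT_BITS) - 1;

      interface Page {
        boolean isLeaf();
        Page child(int slot);          // follow a reference (and verify its checksum)
        byte[] record(long nodeKey);   // fetch the node from a leaf page
      }

      static byte[] lookup(Page revisionRoot, long nodeKey, int height) {
        Page page = revisionRoot;
        for (int level = height - 1; level >= 0 && !page.isLeaf(); level--) {
          int slot = (int) ((nodeKey >>> (level * FANOUT_BITS)) & FANOUT_MASK);
          page = page.child(slot);
        }
        return page.record(nodeKey);
      }
    }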

The leaf pages of the trie store either the data itself (the nodes), nodes of the path summary, or nodes of the secondary indexes...

Thus, we have the main document index, but a RevisionRootPage also has references to the tries that store the secondary indexes. The secondary indexes are read into main memory / reconstructed from the leaf pages of the tries (usually small), as is a small path summary.

The data pages are not simply copied... only the nodes which changed or which fall out of a sliding window are. Thus, a page may have to be reconstructed in-memory from at most a small number N of page fragments in the worst case, and it needs a device suitable for fast, random, small-sized parallel reads and sequential writes.
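
Reconstructing a page from its fragments could look like this (a minimal sketch, assuming the newest fragment wins per node; the names are hypothetical):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Rebuild a full data page in memory from up to N page fragments.
    // Fragments are merged newest-first, so each node slot is taken from
    // the most recent fragment that contains it.
    final class PageReconstruction {
      record Fragment(int revision, Map<Long, byte[]> nodes) {} // nodeKey -> serialized node

      static Map<Long, byte[]> reconstruct(List<Fragment> fragments) {
        Map<Long, byte[]> page = new HashMap<>();
        fragments.stream()
                 .sorted((a, b) -> Integer.compare(b.revision(), a.revision())) // newest first
                 .forEach(frag -> frag.nodes().forEach(page::putIfAbsent));     // keep newest version
        return page;
      }
    }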

Currently, to get rid of old revisions, you have to copy a resource starting from a given revision and apply all updates up to the most recent revision with intermediate commits, as SirixDB only uses one data file per resource (a resource is equivalent to a table in a relational system). Thus, the data files are basically logs. Another file simply stores offsets and timestamps, read into memory, to retrieve a given revision.

https://sirix.io/docs/concepts.html

and

https://sirix.io/docs/jsoniq-tutorial.html

should probably help to get a deeper understanding.

HTH and let me know if you're interested in more details :-)

Thanks for asking


Note that if your updates are much smaller than a page, you're gonna have a bad time with LMDB. Optimizations like WAL and group commit exist for a reason.


For a 4KB pagesize, the breakeven point in write amplification vs WALs is around 768 bytes. http://www.lmdb.tech/bench/ondisk/

Above that record size, LMDB is more efficient.

For OpenLDAP, with LDAP entries typically being at least 1KB, the LMDB design is ideal. If it's not ideal for other applications, oh well.


Because it has to copy and write entire pages instead of only forcing a flush of log records to a WAL?


Oh, it seems it's also because of the random, in-place writes of the B-tree.


This thread is also a good short summary of the key points: https://twitter.com/jensdittrich/status/1734142079012323480?...


Here's the project website: https://www.tornadovm.org/


I think one of the main problems was that the marketing was especially terrible: they branded consumer SSDs with an Optane cache as "Optane", as well as the "real" Optane DC persistent memory, which goes into the appropriate DIMM slots.

Another issue may be that Optane DC persistent memory simply was not fast enough to replace non-persistent RAM.

Still, I hope that another technology will arise which is byte-addressable and persistent/durable. I think it could radically change the design of database systems again. You wouldn't have to have a page cache / buffer manager which retrieves same-sized blocks (or multiples of blocks), for instance. You probably wouldn't even need serialization/deserialization to disk.
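
For instance, a toy sketch with Java's foreign memory API (Java 22+) against a DAX-mapped file; the path and the single-counter layout are made up:

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // With byte-addressable persistent memory, a value can be read and updated
    // in place: no page-granular I/O and no explicit (de)serialization step.
    final class PmemSketch {
      public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Path.of("/mnt/pmem/counter"),
                 StandardOpenOption.READ, StandardOpenOption.WRITE, StandardOpenOption.CREATE);
             Arena arena = Arena.ofConfined()) {
          MemorySegment seg = ch.map(FileChannel.MapMode.READ_WRITE, 0, 8, arena);
          long counter = seg.get(ValueLayout.JAVA_LONG, 0); // read the value in place
          seg.set(ValueLayout.JAVA_LONG, 0, counter + 1);   // update it in place
          seg.force();                                      // flush to the persistent medium
        }
      }
    }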

It would be great if it were possible, for instance, to read 512-byte blocks from disk with current SSDs, but I guess the per-block metadata overhead might be too big.


What do you think of KV SSDs?


What I'm missing the most is byte-granular reading (in the case of Optane it was 256 bytes because of checksums, but that's fine-grained enough).

It means it would be possible to read/write data in much more fine-grained chunks (potentially saving a lot of storage space in some cases).

