This is one area where Rust, a modern systems language, has disappointed me. You can't allocate data structures inside mmap'ed areas, and expect them to work when you load them again (i.e., the mmap'ed area's base address might have changed). I hope that future languages take this usecase into account.
I'm not sure I see the issue. This approach (putting raw binary data into files) is filled with footguns. What if you add, remove or reorder fields? What if your file was externally modified and now doesn't match the expected layout? What if the data contains things like file descriptors or pointers that can't meaningfully be mapped that way? Even changing the compilation flags can produce binary incompatibilities.
I'm not saying that it's not sometimes very useful but it's tricky and low level enough that some unsafe low level plumbing is, I think, warranted. You have to know what you're doing if you decide to go down that route, otherwise you're much better off using something like Serde to explicitly handle serialization. There's some overhead of course, but 99% of the time it's the right thing to do.
I had a use case recently for serializing C data structures in Rust (i.e., being compatible with an existing protocol defined as "compile this C header, and send the structs down a UNIX socket"), and I was a little surprised that the straightforward way to do it is to unsafely cast a #[repr(C)] structure to a byte-slice, and there isn't a Serde serializer for C layouts. (Which would even let you serialize C layouts for a different platform!)
I think you could also do something Serde-ish that handles the original use case where you can derive something on a structure as long as it contains only plain data types (no pointers) or nested such structures. Then it would be safe to "serialize" and "deserialize" the structure by just translating it into memory (via either mmap or direct reads/writes), without going through a copy step.
The other complication here is multiple readers - you might want your accessor functions to be atomic operations, and you might want to figure out some way for multiple processes accessing the same file to coordinate ordering updates.
I kind of wonder what Rust's capnproto and Arrow bindings do, now....
It's likely that the "safe transmute" working group[1] will help facilitate this sort of thing. They have an RFC[2]. See also the bytemuck[3] and zerocopy[4] crates which predate the RFC, where at least the latter has 'derive' functionality.
The footguns can be solved in part by the type-system (preventing certain types from being stored), and (if necessary) by cooperation with the OS (e.g. to guarantee that a file is not modified between runs).
How else would you lazy-load a database of (say) 32GB into memory, almost instantly?
And why require everybody to write serialization code when just allocating the data inside a mmap'ed file is so much easier? We should be focusing on new problems rather than reinventing the wheel all the time. Persistence has been an issue in computing since the start, and it's about time we put it behind us.
>How else would you lazy-load a database of (say) 32GB into memory, almost instantly?
By using an existing database engine that will do it for me. If you need to deal with that amount of data and performance is really important you have a lot more to worry about than having to use unsafe blocks to map your data structures.
Maybe we just have different experiences and work on different types of projects but I feel like being able to seamlessly dump and restore binary data transparently is both very difficult to implement reliably and quite niche.
Note in particular that machine representation is not necessarily the most optimal way to store data. For instance any kind of Vec or String in rust will use 3 usize to store length, capacity and the data pointer which on 64 bit architectures is 24 bytes. If you store many small strings and vectors it adds up to a huge amount of waste. Enum variants are also 64 bits on 64 bit architectures if I recall correctly.
For instance I use bincode with serde to serialize data between instances of my application, bincode maps almost 1:1 the objects with their binary representation. I noticed that by implementing a trivial RLE encoding scheme on top of bincode for running zeroes I can divide the average message size by a factor 2 to 3. And bincode only encodes length, not capacity.
My point being that I'm not sure that 32GB of memory-mapped data would necessarily load faster than <16GB of lightly serialized data. Of course in some cases it might, but that's sort of my point, you really need to know what you're doing if you decide to do this.
> How else would you lazy-load a database of (say) 32GB into memory, almost instantly?
That's what the fst crate[1] does. It's likely working at a lower level of abstraction than you intend. But the point is that it works, is portable and doesn't require any cooperation from the OS other than the ability to memory map files. My imdb-rename tool[2] uses this technique to build an on-disk database for instantaneous searching. And then there is the regex-automata crate[3] that permits deserializing a regex instantaneously from any kind of slice of bytes.[4]
I think you should maybe provide some examples of what you're suggesting to make it more concrete.
You can't do that in C++ or any language. You need to do your own relocations and remember enough information to do them. You can't count on any particular virtual address being available on a modern system, not if you want to take advantage of ASLR.
The trouble is that we have to mark relocated pages dirty because the kernel isn't smart enough to understand that it can demand fault and relocate on its own. Well, either that, or do the relocation anew on each access.
Indeed. I don't know if there's a plan for the standard type to move to offset-ptr, or if there's even a std::offset_ptr, but it would be great if there was.
For us, some of the 'different data type' pain was alleviated with transparent comparators. YMMV.
Edit: It seems C++11 has added some form of support for it... 'fancy pointers'
I don't know enough about Rust to say. If it doesn't have the concept of a 'fancy pointer' then I assume no, you'd have to essentially reproduce what boost::interprocess does.
I'm still learning Rust, but iiuc you could do this by creating an OffsetPtr type that implements the Deref trait (https://doc.rust-lang.org/std/ops/trait.Deref.html). This is exactly a "fancy pointer" as you describe.
By that logic you can do it in unsafe Rust as well then. Obviously in safe Rust having potentially dangling "pointers and references to things outside of the mmap'd area" is a big no-no.
And note that even intra-area pointers would have to be offset if the base address changes. Unless you go through the trouble of only storing relative offsets to begin with, but the performance overhead might be significant.
Well, one approach is to parameterize your data-types such that they are fast in the usual case, but become perhaps slightly slower (but still on par with hand-written code) in the more versatile case.