This is one area where Rust, a modern systems language, has disappointed me. You...

simias · on Jan 23, 2021

I'm not sure I see the issue. This approach (putting raw binary data into files) is filled with footguns. What if you add, remove or reorder fields? What if your file was externally modified and now doesn't match the expected layout? What if the data contains things like file descriptors or pointers that can't meaningfully be mapped that way? Even changing the compilation flags can produce binary incompatibilities.

I'm not saying that it's not sometimes very useful but it's tricky and low level enough that some unsafe low level plumbing is, I think, warranted. You have to know what you're doing if you decide to go down that route, otherwise you're much better off using something like Serde to explicitly handle serialization. There's some overhead of course, but 99% of the time it's the right thing to do.

geofft · on Jan 23, 2021

I had a use case recently for serializing C data structures in Rust (i.e., being compatible with an existing protocol defined as "compile this C header, and send the structs down a UNIX socket"), and I was a little surprised that the straightforward way to do it is to unsafely cast a #[repr(C)] structure to a byte-slice, and there isn't a Serde serializer for C layouts. (Which would even let you serialize C layouts for a different platform!)
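A minimal sketch of the cast described above, assuming a made-up `Header` struct in place of one from a real C header. The struct, its fields, and the helpers `as_bytes` and `from_bytes` are illustrative names, not part of any existing protocol or crate. The fields are ordered so that the C layout has no padding, which is what makes viewing it as bytes defensible.

```rust
use std::mem::size_of;

/// Hypothetical wire struct standing in for one declared in a C header.
/// 4 + 4 + 8 bytes, so `repr(C)` inserts no padding.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Header {
    pub kind: u32,
    pub len: u32,
    pub id: u64,
}

/// View the struct as raw bytes in its C layout (native endianness).
pub fn as_bytes(h: &Header) -> &[u8] {
    // SAFETY: `Header` is `repr(C)`, holds only integers and has no padding,
    // so every one of its bytes is initialized.
    unsafe { std::slice::from_raw_parts(h as *const Header as *const u8, size_of::<Header>()) }
}

/// Rebuild a struct from bytes received from the peer.
pub fn from_bytes(bytes: &[u8]) -> Option<Header> {
    if bytes.len() != size_of::<Header>() {
        return None;
    }
    // SAFETY: the length was checked, any bit pattern is a valid integer,
    // and `read_unaligned` tolerates a buffer that is not 8-byte aligned.
    Some(unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const Header) })
}
```

The bytes are only meaningful to a peer with the same endianness and struct layout, which is the limitation a layout-aware serializer would remove.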

I think you could also do something Serde-ish that handles the original use case where you can derive something on a structure as long as it contains only plain data types (no pointers) or nested such structures. Then it would be safe to "serialize" and "deserialize" the structure by just translating it into memory (via either mmap or direct reads/writes), without going through a copy step.
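One way this could look, as a sketch only: an unsafe marker trait that a derive macro might implement after checking the fields, plus generic helpers that reinterpret memory in place. `PlainData`, `Record`, `bytes_of` and `view` are hypothetical names written by hand here, and the trait is implemented manually rather than derived.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical marker trait. Implementors promise a fixed layout, no
/// padding, no pointers or references, and that every bit pattern is valid.
pub unsafe trait PlainData: Copy + 'static {}

unsafe impl PlainData for u32 {}
unsafe impl PlainData for u64 {}

#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Record {
    pub key: u64,
    pub value: u32,
    pub flags: u32,
}
// SAFETY: repr(C), integers only, 8 + 4 + 4 bytes with no padding.
unsafe impl PlainData for Record {}

/// "Serialize": borrow the value's own memory as bytes, without copying.
pub fn bytes_of<T: PlainData>(value: &T) -> &[u8] {
    // SAFETY: `PlainData` rules out padding, so all bytes are initialized.
    unsafe { std::slice::from_raw_parts(value as *const T as *const u8, size_of::<T>()) }
}

/// "Deserialize": borrow a `T` out of a buffer (for example a mapped file).
/// Returns `None` if the buffer is too short or misaligned for `T`.
pub fn view<T: PlainData>(bytes: &[u8]) -> Option<&T> {
    let aligned = bytes.as_ptr() as usize % align_of::<T>() == 0;
    if bytes.len() < size_of::<T>() || !aligned {
        return None;
    }
    // SAFETY: size and alignment were checked, and `PlainData` guarantees
    // that any bit pattern is a valid `T`.
    Some(unsafe { &*(bytes.as_ptr() as *const T) })
}
```

The safety burden moves into the `unsafe impl` lines, which is the part a derive would be expected to verify.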

The other complication here is multiple readers - you might want your accessor functions to be atomic operations, and you might want to figure out some way for multiple processes accessing the same file to coordinate ordering updates.

I kind of wonder what Rust's capnproto and Arrow bindings do, now....

burntsushi · on Jan 23, 2021

It's likely that the "safe transmute" working group[1] will help facilitate this sort of thing. They have an RFC[2]. See also the bytemuck[3] and zerocopy[4] crates which predate the RFC, where at least the latter has 'derive' functionality.

[1] - https://github.com/rust-lang/project-safe-transmute

[2] - https://github.com/jswrenn/project-safe-transmute/blob/rfc/r...

[3] - https://docs.rs/bytemuck/1.5.0/bytemuck/

[4] - https://docs.rs/zerocopy/0.3.0/zerocopy/index.html

amelius · on Jan 23, 2021

The footguns can be solved in part by the type system (preventing certain types from being stored), and (if necessary) by cooperation with the OS (e.g. to guarantee that a file is not modified between runs).

How else would you lazy-load a database of (say) 32GB into memory, almost instantly?

And why require everybody to write serialization code when just allocating the data inside a mmap'ed file is so much easier? We should be focusing on new problems rather than reinventing the wheel all the time. Persistence has been an issue in computing since the start, and it's about time we put it behind us.

simias · on Jan 23, 2021

> How else would you lazy-load a database of (say) 32GB into memory, almost instantly?

By using an existing database engine that will do it for me. If you need to deal with that amount of data and performance is really important you have a lot more to worry about than having to use unsafe blocks to map your data structures.

Maybe we just have different experiences and work on different types of projects but I feel like being able to seamlessly dump and restore binary data transparently is both very difficult to implement reliably and quite niche.

Note in particular that machine representation is not necessarily the optimal way to store data. For instance, any Vec or String in Rust uses three usize values to store the length, capacity and data pointer, which on 64-bit architectures is 24 bytes. If you store many small strings and vectors, it adds up to a huge amount of waste. Enum variants are also 64 bits on 64-bit architectures, if I recall correctly.
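These figures are easy to measure rather than recall. The sketch below reports the size of each handle with `std::mem::size_of`. The enums `Small` and `Tagged` and both function names are made up for illustration, and the results reflect current compiler behavior rather than a layout guarantee. Enum sizes in particular depend on the payload, so they are worth checking case by case.

```rust
use std::mem::size_of;

/// A fieldless enum and a payload-carrying one, both hypothetical.
#[allow(dead_code)]
pub enum Small { A, B, C }
#[allow(dead_code)]
pub enum Tagged { Int(u64), Other(u64) }

/// In-memory footprint of each handle, excluding any heap contents.
pub fn handle_sizes() -> [(&'static str, usize); 5] {
    [
        ("usize", size_of::<usize>()),
        ("String", size_of::<String>()),
        ("Vec<u8>", size_of::<Vec<u8>>()),
        ("Small", size_of::<Small>()),
        ("Tagged", size_of::<Tagged>()),
    ]
}

/// Bytes spent on handles alone when holding `count` separate strings.
pub fn string_handle_overhead(count: usize) -> usize {
    count * size_of::<String>()
}
```

On a typical 64-bit target the first three entries come out as 8, 24 and 24 bytes, so a million short strings cost about 24 MB in handles before any character data is stored.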

For instance, I use bincode with serde to serialize data between instances of my application; bincode maps objects almost 1:1 to their binary representation. I noticed that by implementing a trivial RLE encoding scheme on top of bincode for runs of zeroes, I can reduce the average message size by a factor of 2 to 3. And bincode only encodes length, not capacity.
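The exact scheme is not given here, so the following is an assumed one, shown only to make the idea concrete: each run of zero bytes is replaced by a `0` marker followed by the run length (1 to 255), and every other byte is copied through unchanged. The function names are hypothetical, and the input would be whatever byte buffer the serializer produced.

```rust
/// Compress runs of zero bytes: `0, n` stands for `n` zeros (1 <= n <= 255).
pub fn rle_zeros_encode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if input[i] == 0 {
            // Measure the run, capped at what one length byte can hold.
            let mut run = 1usize;
            while run < 255 && i + run < input.len() && input[i + run] == 0 {
                run += 1;
            }
            out.push(0);
            out.push(run as u8);
            i += run;
        } else {
            out.push(input[i]);
            i += 1;
        }
    }
    out
}

/// Inverse of `rle_zeros_encode`. Returns `None` on malformed input.
pub fn rle_zeros_decode(input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        if input[i] == 0 {
            // A marker must be followed by a non-zero run length.
            let run = *input.get(i + 1)? as usize;
            if run == 0 {
                return None;
            }
            out.resize(out.len() + run, 0);
            i += 2;
        } else {
            out.push(input[i]);
            i += 1;
        }
    }
    Some(out)
}
```

An isolated zero grows from one byte to two under this scheme, so it only pays off when zeros tend to come in runs, as they do with small integers stored in fixed-width fields.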

My point being that I'm not sure that 32GB of memory-mapped data would necessarily load faster than <16GB of lightly serialized data. Of course in some cases it might, but that's sort of my point: you really need to know what you're doing if you decide to do this.

burntsushi · on Jan 23, 2021

> How else would you lazy-load a database of (say) 32GB into memory, almost instantly?

That's what the fst crate[1] does. It's likely working at a lower level of abstraction than you intend. But the point is that it works, is portable and doesn't require any cooperation from the OS other than the ability to memory map files. My imdb-rename tool[2] uses this technique to build an on-disk database for instantaneous searching. And then there is the regex-automata crate[3] that permits deserializing a regex instantaneously from any kind of slice of bytes.[4]

I think you should maybe provide some examples of what you're suggesting to make it more concrete.

[1] - https://crates.io/crates/fst

[2] - https://github.com/BurntSushi/imdb-rename

[3] - https://crates.io/crates/regex-automata

[4] - https://docs.rs/regex-automata/0.1.9/regex_automata/#example...

quotemstr · on Jan 23, 2021

You can't do that in C++ or any language. You need to do your own relocations and remember enough information to do them. You can't count on any particular virtual address being available on a modern system, not if you want to take advantage of ASLR.

The trouble is that we have to mark relocated pages dirty because the kernel isn't smart enough to understand that it can demand fault and relocate on its own. Well, either that, or do the relocation anew on each access.

secondcoming · on Jan 23, 2021

It works with C++ if you use boost::interprocess. Its data structures use offset_ptr internally rather than assuming every pointer is on the heap.

amelius · on Jan 23, 2021

That introduces different data-types, rather than using the existing ones (instantiated with different pointer-types).

secondcoming · on Jan 23, 2021

Indeed. I don't know if there's a plan for the standard type to move to offset-ptr, or if there's even a std::offset_ptr, but it would be great if there was.

For us, some of the 'different data type' pain was alleviated with transparent comparators. YMMV.

Edit: It seems C++11 has added some form of support for it... 'fancy pointers'

https://en.cppreference.com/w/cpp/named_req/Allocator#Fancy_...

quotemstr · on Jan 23, 2021

Sure. But that counts as "doing your own relocations". Unsafe Rust could do the same, yes?

secondcoming · on Jan 23, 2021

I don't know enough about Rust to say. If it doesn't have the concept of a 'fancy pointer' then I assume no, you'd have to essentially reproduce what boost::interprocess does.

ekimekim · on Jan 24, 2021

I'm still learning Rust, but iiuc you could do this by creating an OffsetPtr type that implements the Deref trait (https://doc.rust-lang.org/std/ops/trait.Deref.html). This is exactly a "fancy pointer" as you describe.

whimsicalism · on Jan 23, 2021

What is being relocated?

ithkuil · on Jan 23, 2021

If you use offsets instead of pointers you're doing relocations "on the fly"

whimsicalism · on Jan 23, 2021

I don't see what the issue with doing this in C++ is.

The only thing that'll break will be the pointers and references to things outside of the mmap'd area.

simias · on Jan 23, 2021

By that logic you can do it in unsafe Rust as well then. Obviously in safe Rust having potentially dangling "pointers and references to things outside of the mmap'd area" is a big no-no.

And note that even intra-area pointers would have to be offset if the base address changes. Unless you go through the trouble of only storing relative offsets to begin with, but the performance overhead might be significant.
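To make the relative-offset option concrete, here is a sketch in safe Rust of a linked list stored inside a byte region, where each link is an offset from the start of the region instead of an address. The node layout, the constant and the function names are all assumptions made for this example. Because nothing in the data depends on where the region is mapped, the same bytes can be walked at any base address; the cost is a bounds-checked lookup on every hop.

```rust
/// Hypothetical node layout inside the region, all little-endian:
///   bytes 0..4  value (u32)
///   bytes 4..8  offset of the next node from the region start (0 = end)
/// Offset 0 is reserved as the end marker, so no node lives at the start.
const NODE_SIZE: usize = 8;

/// Read a little-endian u32 at `at`, or `None` if it is out of bounds.
fn read_u32(region: &[u8], at: usize) -> Option<u32> {
    let b = region.get(at..at.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Walk the list starting at offset `first`, resolving every link against
/// the region slice, wherever it happens to live in memory.
pub fn collect_values(region: &[u8], first: u32) -> Option<Vec<u32>> {
    let mut values = Vec::new();
    let mut at = first as usize;
    // Bound the walk so a corrupted region with a cycle cannot loop forever.
    for _ in 0..=region.len() / NODE_SIZE {
        if at == 0 {
            return Some(values);
        }
        values.push(read_u32(region, at)?);
        at = read_u32(region, at.checked_add(4)?)? as usize;
    }
    None
}
```

Copying the region elsewhere, or mapping the file at a different address on the next run, leaves every link valid, which is the "relocation on the fly" trade-off discussed in this subthread.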

Hello71 · on Jan 23, 2021

libsigsegv (slow) or userfaultfd (less slow) can be used for this purpose.

turminal · on Jan 23, 2021

This is impossible without significant performance impact. No language can change that.

Edit: except theoretically for data structures that have certain characteristics known in advance

amelius · on Jan 23, 2021

Well, one approach is to parameterize your data-types such that they are fast in the usual case, but become perhaps slightly slower (but still on par with hand-written code) in the more versatile case.

gpderetta · on Jan 24, 2021

Boost.Interprocess does exactly that for the STL.

the8472 · on Jan 23, 2021

Work on custom allocators is underway, some of the std data structures already support them on nightly.

https://github.com/rust-lang/wg-allocators/issues/7

comonoid · on Jan 23, 2021

Yes, you can.

You cannot with standard data structures, but you can with your custom ones.

That's all about trade-offs, anyway, there is no magic bullet.

remram · on Jan 23, 2021

What about Rust makes this more difficult than doing the same thing in C++?

jnwatson · on Jan 23, 2021

There's no placement new in Rust? That's disappointing.

steveklabnik · on Jan 23, 2021

Not in stable yet, no. It’s desired, but has taken a while to design, as there have been higher priority things for a while. We’ll get there!