I don't think the two cases are analogous. First, block devices don't have any metadata to speak of. Second, filesystems have journals to "cover up" for the main-store operations being asynchronous and/or non-atomic. However, sync() is completely synchronous by definition - hence the name - so such "covering up" would be superfluous. There must be some metadata that's being written to etcd because it needs to be read from there later, but there's nothing in block-device semantics to require any such thing.

Thinking about it further, I think I can guess at what's going on here. The key observation is that there's no sync() at the block device level. It's a filesystem operation; block devices don't see it. Sure, there are queue flags and FUA and such, but those are different (and I'm not sure any of those exist in NBD). Where is this sync() path? I'm guessing it's internal on the data servers, to deal with data that's being buffered there. With both replication and erasure coding, correct recovery requires exact knowledge of what has been fully written where, and that's the kind of metadata I suspect is being put in etcd. There's not even necessarily anything wrong with it, unless updating that information only on sync() means that supposedly durable writes since the last (unpredictable to the client) sync could be lost on failure. I hope that's not the case.

Maybe I'll find time, in the midst of my work on an actual production-level distributed filesystem, to look at the code and see if my guess is correct.




Block devices (and NBD specifically) absolutely have a notion of sync(). We use sync() as the unit of write visibility. All writes up until a sync are effectively anonymous until a sync().


Please go look at the man page for sync(2), which is also the page for syncfs(2). It is explicitly a filesystem-level operation. Obviously, this will cause data to be flushed from the filesystem down to lower layers. Obviously, you can "sync" virtual (e.g. NBD or loopback) block devices by syncing the filesystems that contain their backing stores, but that's not the same thing. No filesystem, no sync(2). For block devices with no file-based backing store, sync(2) is inapplicable. Also, sync(2) is latency-inducing overkill if you're trying to ensure durability for anything less than all filesystems attached to a machine. More often, fsync(2) on the backing files is what you should be using.
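To make the fsync(2)-versus-sync(2) point concrete, here is a minimal Python sketch, assuming a POSIX system where the virtual device is backed by an ordinary file. The helper name `write_durably` is made up for illustration; `os.fsync` and `os.sync` are the standard-library wrappers for the two syscalls.

```python
import os

def write_durably(path: str, data: bytes) -> int:
    """Write data to one backing file and flush only that file.

    Hypothetical helper: it shows fsync(2) on a single descriptor,
    as opposed to sync(2), which flushes every mounted filesystem.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # os.write may write fewer bytes than requested, so loop.
        while written < len(data):
            written += os.write(fd, data[written:])
        # Flush this file's data and metadata to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written

# The heavyweight alternative would be os.sync(), which takes no
# descriptor and asks the kernel to flush all filesystems.
```

A newly created file may also need an fsync on its parent directory before the directory entry itself is durable; the sketch leaves that out.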

> All writes up until a sync are effectively anonymous until a sync().

"Anonymous" means nothing in this context. Do you mean non-durable? First, please use the correct term. Second, if that is what you mean then you're probably doing it wrong. File writes are allowed to be asynchronous (unless O_SYNC and friends). Block device writes are expected to be synchronous, or at least to preserve order. This is exactly the kind of thing that needs to be thoroughly thought out before code is even written, and that thinking should be spelled out somewhere for people to help make sure all the nasty corner cases are covered. Your cart is way before your horse.
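For reference, "O_SYNC and friends" looks roughly like this from Python on a POSIX system. It is only a sketch: `open_synchronous` is a hypothetical helper name, and what O_SYNC actually guarantees depends on the filesystem and the device underneath.

```python
import os

def open_synchronous(path: str) -> int:
    """Open path so that each write(2) is synchronous (hypothetical helper).

    With O_SYNC, a write returns only after the data and the metadata
    needed to retrieve it have been handed to the underlying storage.
    """
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)

def write_all(fd: int, data: bytes) -> int:
    """Write the whole buffer, looping over short writes."""
    written = 0
    while written < len(data):
        written += os.write(fd, data[written:])
    return written
```

Without O_SYNC, the same writes can sit in the page cache until a later fsync(2) or kernel writeback, which is the asynchrony that file writes are allowed and block-device clients generally don't expect.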



