
Ideally, whenever a file is modified, the checksums should be recomputed when the file is closed, after having been opened as read-write.

However, this behavior can be obtained only by replacing the file system drivers with a custom version. That could be done on Linux, but it cannot be done as a cross-platform solution.

So for a cross-platform solution you need a pair of scripts: one that verifies the checksums, recursively, for all the files in a directory, and another that updates all the checksums for the files in a directory. The update script is either invoked manually after editing or downloading some files, or run periodically, e.g. once per day, perhaps just before a backup script.

Nowadays a bash script can run on any operating system, so that is cross-platform enough, but writing a separate pair of scripts for each operating system, in whatever scripting tool is native there, is not much work either.

Each of the two scripts actually consists of a pair: a top-level script that enumerates all the files in a directory (e.g. on UNIX using 'find "$1" -type f') and a second script that it invokes to update or verify the checksums for a single file.
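
A minimal sketch of the top-level script (the name of the per-file script, "checkfile.sh", is just an example):

    #!/bin/sh
    # Walk the directory tree given as $1 and run the per-file script
    # on every regular file found.
    find "$1" -type f -exec ./checkfile.sh {} \;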

The script that does the work for one file just needs to invoke a standard checksumming program, e.g. sha256sum, which is now available on any platform if you take care to install it. For verifying, a missing checksum or a checksum that differs from the newly computed one gives an error; for updating, the checksums are recomputed if they are missing or if their timestamp differs from the file's modification timestamp.
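
A rough sketch of the verification side, where read_stored_sum is a placeholder for however the stored checksum is read back (the storage method is discussed below):

    #!/bin/sh
    # Verify one file: recompute the checksum and compare with the stored one.
    f="$1"
    new=$(sha256sum "$f" | cut -d' ' -f1)
    old=$(read_stored_sum "$f")    # placeholder for the actual storage method
    if [ -z "$old" ]; then
        echo "MISSING CHECKSUM: $f"
    elif [ "$old" != "$new" ]; then
        echo "CHECKSUM MISMATCH: $f"
    fi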

In the beginning, I stored the checksums in one extra file per directory, but that had many disadvantages. Then I switched to extended attributes, which allow the checksums to be stored together with the file, transparently.

Extended attributes are much more convenient, but their cross-platform use is more complex.

For updating or verifying the extended attributes, you need three commands: list, read and write extended attributes. These are not only different on each operating system, but are sometimes different even within a single operating system for different file systems.

So you either need to write different scripts for each platform, or you must detect the platform and choose the appropriate extended attribute commands. The work is about the same either way, so it might be simpler to just have different scripts for each platform. Per-platform scripts can be small, less than a page each, so mistakes in them are less likely.
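
For reference, the per-platform read/write commands look roughly like this (the attribute name "user.sha256" is just an example):

    # Linux (attr package):
    setfattr -n user.sha256 -v "$sum" "$f"
    getfattr -n user.sha256 --only-values "$f"

    # FreeBSD:
    setextattr user sha256 "$sum" "$f"
    getextattr user sha256 "$f"

    # macOS:
    xattr -w sha256 "$sum" "$f"
    xattr -p sha256 "$f"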

When you use extended attributes, be aware that on open-source systems like Linux or FreeBSD, not all file systems support extended attributes, and those that do support them might have been configured to ignore them.

So, especially on Linux, you must check that support for extended attributes is enabled, and you might have to recompile the common utilities, like fileutils, because by default you might have commands like cp and mv that lose the extended attributes. Also, not all GUI file managers and not all archiver versions (e.g. of tar or pax) handle extended attributes correctly.

The main pitfall on Linux is that tmpfs has only partial support for extended attributes, so on many Linux systems, copying a file to /tmp will lose its extended attributes.
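
A quick way to test this on a given system (Linux syntax; the file name and attribute are just examples):

    setfattr -n user.test -v hello myfile
    cp --preserve=xattr myfile /tmp/
    getfattr -n user.test /tmp/myfile   # an error here means the attribute was lost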

When copying files through the network, the most reliable way is to use rsync, preferably rsync over ssh. With the right option it will copy any file together with its extended attributes, regardless of which operating systems are at the two ends.
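
For example (paths and host are placeholders; -X is the option that carries the extended attributes):

    # -a = usual metadata, -X = extended attributes, -e ssh = tunnel over ssh
    rsync -aX -e ssh /data/photos/ user@backuphost:/backup/photos/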

Otherwise, Samba can be used between most operating systems. I have used it in mixed configurations with Linux, FreeBSD and Windows computers, without problems preserving the extended attributes.
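
On the Samba server side, extended-attribute support is a per-share setting; a minimal sketch (the share name and path are placeholders, and defaults differ between Samba versions):

    [data]
        path = /srv/data
        ea support = yes
        # streams_xattr may also be needed for Windows alternate data streams
        vfs objects = streams_xattr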

There are new NFS versions with extended attribute support, but all the computers must be configured correctly, with non-default options. I have not used such a version of NFS.

So the conclusion is that the files themselves are cross-platform, as long as you take care not to transfer them by methods that silently lose the extended attributes; the simplest approach is to rewrite the pair of updating/verification scripts for each platform, because those can be very simple, while a single cross-platform script would be large and cumbersome.




Thanks.

You typed enough interesting stuff here for an above-average blog/whatever post. Consider writing it up and submitting it.


Out of curiosity, how would you compare this particular distribution of complexity to that of setting up and using ZFS?

ZFS gives me set-and-forget sha256 checksums that are kept up to date as data is written, in real time. And the checksums are block-indexed, so 100GB files do not need to be re-hashed for every tiny change.
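
For example, the checksum algorithm is a per-dataset property (the default is fletcher4; the pool/dataset name is a placeholder):

    zfs set checksum=sha256 tank/data
    zfs get checksum tank/data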

I also never have to worry about the "oh no which copy is correct" logistics nightmare intrinsic to A->B backup jobs that fail halfway through, as both copies of my data (I'm using 2-way mirroring at the moment because that's what my hardware setup permits) are independently hashed and ZFS always verifies checksums on read. (And reads are fast, too, going at 200MB/s on a reasonably old i3. No idea how any of this works... :) )
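
And a scrub walks every block on every copy and verifies it against its checksum (the pool name is a placeholder):

    zpool scrub tank
    zpool status -v tank    # shows progress and any checksum errors found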

As for cross-platform compatibility, my understanding is that if the pools are configured carefully then zfs send/recv actually works between Linux and FreeBSD. But this kind of capability definitely exceeds my own storage requirements, and I haven't explored it. (I understand it can form the basis of incremental backups, too, but that of course carries the risk of needing all N prior increments up to the current one to be intact to restore...)
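
The send/recv workflow, sketched (dataset, snapshot and host names are placeholders):

    # initial full copy
    zfs snapshot tank/data@monday
    zfs send tank/data@monday | ssh backuphost zfs receive backup/data

    # later, send only the changes between the two snapshots
    zfs snapshot tank/data@tuesday
    zfs send -i tank/data@monday tank/data@tuesday | ssh backuphost zfs receive backup/data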

The two mostly-theoretical-but-still-concerning niggles that I think a lot of setups sweep under the carpet are ECC memory distribution and network FS resiliency.

All the ranting and raving (and it does read like that) out there about ECC means nada if only the office file server in the corner implements defense in depth (if you will), while files are mindlessly edited on devices that don't have ECC, passing through consumer/SOHO-grade network gear (routers, switches, client NICs, ...) that mostly also doesn't use ECC. Bit of a hole in the logic there.

I did a shade-beyond-superficial look into how NFS works, and got the impression that the underlying protocol mostly trusts the network to be correct and not flip bits or what have you, and that the only real workaround is to set up Kerberos (...yay...) so you can then enable encryption, which IIUC (need to double check), if set up correctly, has AEAD properties (i.e. some form of cryptographic authentication that proves the data wasn't modified/corrupted in flight).
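
The relevant mount option, if I understand correctly, is sec=krb5p (server and paths are placeholders):

    # krb5  = Kerberos authentication only
    # krb5i = + integrity checking of each request/response
    # krb5p = + encryption ("privacy") of the traffic
    mount -t nfs4 -o sec=krb5p fileserver:/export/home /mnt/home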

SMB gets better points here in that it can IIRC enable encryption (with AEAD properties) much more trivially, but one major (for me) downside of SMB is that it doesn't integrate as tightly with the client FS page cache (and mmap()) as NFS does, so binaries have to be fully streamed to the client machine (along with all dependent shared objects) before they can launch - every time. On NFS, the first run is faster, and subsequent launches are instantaneous.
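
For example, with mount.cifs the "seal" option requests SMB3 transport encryption (server, share and username are placeholders):

    mount -t cifs -o username=alice,vers=3.1.1,seal //fileserver/share /mnt/share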



