This is a cool idea. Please don't take the below as gratuitous negativity, just a reminder that these are hard problems for which there are no general solutions.
The README says it was tested on ZFS, but I doubt its utility in real-world deployments. I don't know of anyone who has significant data in a ZFS pool that isn't one or more of: raidz, compressed, encrypted, or embedded_data.
raidz implies that logical blocks aren't allocated as single physical blocks, but instead striped across multiple drives. Finding the SBX magic isn't enough to get you the rest of the block, though the checksum might (but, given that it's CRC16, probably won't) let you try appending blocks from other disks to find the remainder of the block.
Transparent compression prevents you from identifying the magic header on each block, unless you decompress every disk sector that could have data (which is certainly feasible, but complicates recovery if you don't know which compression was in use; zfs supports at least three algorithms, and a pool will generally have at least one in use even if compression is nominally off).
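For concreteness, a minimal sketch of that "append candidate chunks until the CRC-16 matches" idea. The header layout assumed here (3-byte "SBx" magic, version byte, big-endian CRC-16, 6-byte UID, 4-byte sequence number) and the CRC coverage/seed are my reading of the SeqBox README, not verified against the code, so treat the check as advisory:

    import struct

    BLOCK = 512

    def crc16_ccitt(data, crc=0xFFFF):
        # Plain CRC-16-CCITT; SeqBox may use a different seed (assumption).
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                if crc & 0x8000:
                    crc = ((crc << 1) ^ 0x1021) & 0xFFFF
                else:
                    crc = (crc << 1) & 0xFFFF
        return crc

    def match_tail(head, candidate_tails):
        """head: the stripe fragment that starts with the SBx header.
        candidate_tails: chunks pulled from the other raidz members."""
        stored_crc, = struct.unpack(">H", head[4:6])
        matches = []
        for tail in candidate_tails:
            block = (head + tail)[:BLOCK]
            if len(block) < BLOCK:
                continue
            # Assumed coverage: everything after the magic/version/CRC fields.
            if crc16_ccitt(block[6:]) == stored_crc:
                matches.append(tail)
        return matches  # 16 bits means roughly 1 false match per 65k candidates

With a wide raidz vdev and many candidate positions, those false positives add up, which is the "probably won't" above.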
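As a rough illustration of what "decompress every sector that could have data, then look for the magic" involves (the gzip and lz4 framing details here are assumptions on my part, and lzjb/zle would need their own decoders entirely):

    import struct, zlib

    try:
        import lz4.block  # pip install lz4 (optional third-party package)
    except ImportError:
        lz4 = None

    MAGIC = b"SBx"

    def candidates_with_magic(blob, max_out=1 << 17):
        """Try a few decompressors on one allocated-looking block and return
        any outputs that contain the SBX magic."""
        outputs = []
        # ZFS "gzip-N" records are deflate streams; whether zlib accepts them
        # with or without a header wrapper is an assumption, so try both.
        for wbits in (15, -15):
            try:
                outputs.append(zlib.decompress(blob, wbits))
            except zlib.error:
                pass
        # ZFS lz4 records (assumed) start with a 4-byte big-endian length of
        # the compressed payload, followed by a raw lz4 block.
        if lz4 is not None and len(blob) > 4:
            comp_len = struct.unpack(">I", blob[:4])[0]
            if 0 < comp_len <= len(blob) - 4:
                try:
                    outputs.append(lz4.block.decompress(
                        blob[4:4 + comp_len], uncompressed_size=max_out))
                except (lz4.block.LZ4BlockError, ValueError):
                    pass
        return [out for out in outputs if MAGIC in out]

And since ZFS record sizes vary (512 bytes up to 128K, more with large_blocks), you'd have to repeat this over every plausible record boundary, which is where "certainly feasible" starts to mean "a very long scan".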
Encryption (present in Oracle ZFS) means there's no plaintext data to recover.
embedded_data is a feature flag (and on by default in supporting versions of zfs) that packs blocks into block pointer structs when the amount of data is small. I can easily imagine the final block of an SBX, which may be mostly padding, getting compressed into one of those block pointers, which itself may be embedded in a larger structure that's part of an array compressed by default. That array is also probably long enough that the compressed stream takes multiple blocks, and you may have lost some of the early ones, making the rest of it unrecoverable.
Alba ( https://github.com/openvstorage/alba ) implements the same idea, but for a distributed object store:
The objects (files) are split into chunks and erasure coded into fragments, which are stored on storage devices (disks). The structure of the object and the metadata is stored in a distributed key value store. But the metadata is also attached to the fragments, so that if the distributed key value store is lost, everything can be restored from the storage devices alone. Obviously, this is a very heavy operation and is the very last line of defense against data loss, only needed in case of a severe calamity.
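To make the "self-describing fragment" idea concrete, a toy sketch (this is not Alba's actual on-disk layout, just the general shape of the trick):

    import json, struct

    def wrap_fragment(payload, object_id, chunk_no, fragment_no, k, m):
        # Attach enough metadata to each fragment that a scan of the raw
        # devices alone can rebuild the object -> chunk -> fragment mapping.
        meta = json.dumps({
            "object_id": object_id,   # which object this fragment belongs to
            "chunk": chunk_no,        # which chunk of the object
            "fragment": fragment_no,  # which erasure-coded fragment of the chunk
            "k": k, "m": m,           # erasure-coding parameters (k data, m parity)
        }).encode()
        # payload, then metadata, then a fixed-size length footer so a
        # scanner can walk backwards from the end of the fragment.
        return payload + meta + struct.pack(">I", len(meta))

    def read_fragment_meta(fragment):
        meta_len, = struct.unpack(">I", fragment[-4:])
        return json.loads(fragment[-4 - meta_len:-4])

The cost is exactly as described: recovery means reading and parsing every fragment on every device, so it's a last resort, not a normal lookup path.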
The author Mark0 writes: "each SBX file can contain only 1 file and there are no error correcting informations (at least at the moment) ... But it's possible to create an SBX file out of a RAR archive with recovery records, for example."
IMHO that would be the win-win combination for data recovery. How common is the failure scenario? Common enough that there are people who make a living piecing lost files back together. See http://forensicswiki.org/wiki/File_Carving
The whole issue is that file carving (once the filesystem structures are destroyed) works fine with contiguous data, but it doesn't work (or works only very partially, depending on a number of factors) with fragmented data.
But I will give you a common enough example, one that I have seen happen many times.
The user wants to format a USB stick, but by mistake he/she selects a data volume instead.
Two possibilities (under Windows later than XP):
1) The user chose NOT the "/q" (CLI) or "quick" (GUI) format: all data is lost forever, since starting from Vista a "full" format overwrites the whole volume with 00's.
2) The user chose the "quick" format: only the filesystem structures are recreated (blank), and - given that the volume was originally formatted on the same OS - they will sit exactly where the previous filesystem structures were, thus cleanly replacing them.
In this latter situation all files are still there; what is lost is where they are. The address info for a contiguous file is two pieces of data: where it begins and how big it is. There are a number of programs that can "recognize" a file (or its beginning) by its "signature" (usually a header, but sometimes a combination of header and footer), among them the excellent TrID tool by the same author, Marco Pontello. A large number of file formats include some info on the file size, and there are slightly more sophisticated programs that can usually recognize whether data belongs to a given filetype.
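For what it's worth, header/footer recognition for a contiguous file really is that simple; here is a toy JPEG carver (file names and the size cap are arbitrary, and the first FF D9 found may belong to an embedded thumbnail, so results still need validation):

    HEADER = b"\xff\xd8\xff"   # JPEG SOI marker plus the start of the next marker
    FOOTER = b"\xff\xd9"       # JPEG EOI marker

    def carve_jpegs(image_path, out_prefix="carved", max_size=50 * 1024 * 1024):
        # Reading the whole image at once is fine for a USB-stick-sized dump.
        data = open(image_path, "rb").read()
        count = 0
        pos = data.find(HEADER)
        while pos != -1:
            end = data.find(FOOTER, pos, pos + max_size)
            if end != -1:
                with open(f"{out_prefix}_{count:04d}.jpg", "wb") as out:
                    out.write(data[pos:end + 2])
                count += 1
            pos = data.find(HEADER, pos + 1)
        return count

It only works, of course, when the file was stored contiguously.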
The address info of fragmented data is as many starting points and lengths as there are fragments, and since a fragment - bar the first one - has no header, it is extremely difficult (often impossible) to find all the fragments of a file, know which file each fragment belongs to, and rebuild the file by reassembling the fragments in the right order.
In a nutshell, if the volume was only made of contiguous files, they can be recovered 100% or very near 100%, only losing their filenames and their path in the "previous" filesystem structure.
BUT any file that was fragmented will either be lost completely or will need manual reconstruction (only possible in some cases), something that may take days or weeks of work, very often with only scarce or partial success.
The idea of a "self-referencing" archive with sector-level granularity is simply great: you will hopefully never need it, but if you do, it could be a lifesaver.
When I was younger I worked for a while on file carving problems and algorithms, and while your take on it is accurate, it's not complete. There are algorithms that can do what you call "manual reconstruction" in a more or less automated way. Of course, as with everything that is automated, there are edge cases where it doesn't work correctly, but it'll save you a lot of time.
What has been discussed here and in the OC is mostly header-footer carving, but file-structure-based carving has been around for a long time already (I think Foremost included it in 2005/2006, and PhotoRec supports it for some formats). Using the file command (*NIX) or TrID to find the headers is a bit of a hack, and useful only if you want to sit for a long time with dd (or an equivalent tool) and direct the carving process yourself.
As for the fragmentation issue, there used to be a great program around called "Adroit Photo Forensic" which implemented the amazing SmartCarving(TM) approach (academically known as graph-theoretic carving), in which the disk is scanned, pools of segments are populated by file format, and then a graph is constructed and analyzed with a variant of Dijkstra's shortest-path algorithm that works on multiple paths from multiple start/end pairs. Very interesting, though a bit computationally expensive. Plus you need an extremely precise similarity metric between the segments, which was actually the weak spot of the approach. However, it seems the Adroit product has died (I can only find mirrors now hosting old versions of it), and as it was a try-and-buy program I don't think you'd be able to use it now.
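For anyone curious, a toy version of the graph-theoretic idea (nothing like Adroit's actual internals): fragments are nodes, edges are weighted by some dissimilarity between the tail of one fragment and the head of another, and a shortest path from a header fragment to a footer fragment proposes a reassembly order. The placeholder metric below is exactly the weak spot mentioned above; a real carver would use format-aware statistics:

    import heapq

    def dissimilarity(frag_a, frag_b, window=64):
        # Placeholder metric: mean absolute byte difference across the boundary.
        a, b = frag_a[-window:], frag_b[:window]
        n = min(len(a), len(b))
        return sum(abs(a[i] - b[i]) for i in range(n)) / max(n, 1)

    def reassemble(fragments, start, goal):
        """fragments: dict id -> bytes; start/goal: ids of the fragments
        holding the file header and footer. Returns an ordering of ids."""
        dist, prev = {start: 0.0}, {}
        heap = [(0.0, start)]
        while heap:
            d, node = heapq.heappop(heap)
            if node == goal:
                path = [node]
                while node in prev:
                    node = prev[node]
                    path.append(node)
                return path[::-1]
            if d > dist.get(node, float("inf")):
                continue  # stale heap entry
            for other in fragments:
                if other == node:
                    continue
                nd = d + dissimilarity(fragments[node], fragments[other])
                if nd < dist.get(other, float("inf")):
                    dist[other], prev[other] = nd, node
                    heapq.heappush(heap, (nd, other))
        return None  # no plausible path from header to footer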
Thing is, there are easier ways to handle fragmentation. For example, PhotoRec itself lets you apply an aggressive validation step to discard most of the "probably wrong" recovered files, and then dump the remaining data to an alternative disk image, on which you can try carving again. Of course, this iterative process takes time, but it can work, and of course it can be automated (thus saving you from manual intervention).
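The loop itself is trivial to script; in the sketch below, carve, validate, and dump_unclaimed are hypothetical stand-ins for whatever carver and validators you actually drive (PhotoRec itself is run from its own CLI/UI, not from Python):

    def iterative_carve(image, carve, validate, dump_unclaimed, max_rounds=3):
        kept = []
        for _ in range(max_rounds):
            recovered = carve(image)                      # run the carver
            good = [f for f in recovered if validate(f)]  # aggressive validation
            if not good:
                break
            kept.extend(good)
            # Build a smaller image from the sectors no validated file
            # claimed, then carve that again on the next round.
            image = dump_unclaimed(image, good)
        return kept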
While my work has taken me away from it, I never quite understood why the digital forensics community "forgot" about file carving, or disregarded it as an important problem -- it seems as if the same issues are still around as when I worked on it, with fewer tools available.
As for the SeqBox format, rather than an alternative to these tools, I find it an interesting complement to them.
Sure, I tested Adroit a couple of times; anyway, it is (was) only about some specific file formats (photos, i.e. JPEGs), and even the authors' "pitch" claimed only about 20% more photos recovered. I remember not being that impressed by the results of the actual tests (but maybe the devices on which I tested it at the time also suffered from other forms of corruption):
https://web-beta.archive.org/web/20120313073659/http://digit...
I am pretty sure that a similar tool (specific for photos/jpeg's) would be very useful, but extending the same principles to different file formats is - as I see it - extremely hard and in a large number of cases simply impossible (due to the actual structure of the file format itself).
Depends on the format. For example, PNG has checksums per segment, which make it extremely easy to tell whether a segment has been correctly rebuilt. The same goes for ZIP and RAR, and JPEGs can be matched against the Huffman tables for invalid byte sequences.
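The PNG case really is checkable with nothing but the standard library, since every chunk carries a CRC-32 over its type and data fields:

    import struct, zlib

    PNG_SIG = b"\x89PNG\r\n\x1a\n"

    def png_chunks_ok(data):
        """Return True if every chunk's CRC matches, up to and including IEND."""
        if not data.startswith(PNG_SIG):
            return False
        pos = len(PNG_SIG)
        while pos + 8 <= len(data):
            length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
            end = pos + 8 + length
            if end + 4 > len(data):
                return False                      # chunk runs off the end
            body = data[pos + 4:end]              # chunk type + chunk data
            stored_crc, = struct.unpack(">I", data[end:end + 4])
            if zlib.crc32(body) & 0xFFFFFFFF != stored_crc:
                return False                      # this chunk was rebuilt wrong
            if ctype == b"IEND":
                return True
            pos = end + 4
        return False                              # never reached IEND

A reassembly candidate that passes this on every chunk is very likely correct, which is what makes PNG such a friendly format to carve.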
Also for JPEG there was a (sort of experimental) carver that validated the image against the thumbnail and then decided whether it was correctly reconstructed or not (and maybe it also helped choose the correct block when there was an error). Still, most of these tools never left the alpha stage, and ended up as curiosities presented at a conference and then left to rot.
Yes :), it is odd how some programs/tools that should exist, and for which we already have the basic algorithms, were never properly written/developed, while we have tens or hundreds of near-identical tools all missing these functionalities.
Another kind of tool that was never properly developed is interactive JPEG repair. Besides the good JPEGsnoop:
http://www.impulseadventure.com/photo/jpeg-snoop.html
(which has only some of these functionalities), there are a handful of tools of little or "very narrow" use, but nothing very effective at rebuilding a damaged JPEG.
I've seen this in the steganography/steganalysis "community": almost every time a script or program is written, its purpose is to quickly meet the needs of some highly specialized expert, who deems the development too simple (to himself, that is) to be worth turning into a full-fledged tool.
Depends on the media. I think this would be awesome for archival usage on optical and tape media. Or even printed out on paper, since this looks like it'd let you rescan it in any order.
I wonder if maybe it's intended for use when the filesystem is intentionally wiped out, maybe so you can deny any data exists on a disk at all. (Like anti-forensics?) I don't know, I've never really wanted to do anything like that.
This brought parchive/par2 to mind; as I recall, given an index file, which you could keep separate, you could do a rolling scan for the recovery blocks and file blocks on the disk, even without the file system structure. Unfortunately, it sorta looks like that software hasn't been worked on in a while.
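The rolling scan is straightforward to sketch; the packet layout used below (8-byte "PAR2\0PKT" magic, 8-byte little-endian packet length, 16-byte MD5 of the packet body, then recovery-set ID and type) is from memory of the PAR 2.0 spec, so double-check the field order before relying on it:

    import hashlib, struct

    MAGIC = b"PAR2\x00PKT"

    def scan_par2_packets(path, chunk=1 << 20):
        """Return byte offsets of PAR2 packets found on a raw image/device."""
        offsets = set()
        with open(path, "rb") as img:
            buf, base = b"", 0
            while True:
                data = img.read(chunk)
                if not data:
                    break
                buf += data
                pos = buf.find(MAGIC)
                while pos != -1 and pos + 64 <= len(buf):
                    length, = struct.unpack("<Q", buf[pos + 8:pos + 16])
                    stored_md5 = buf[pos + 16:pos + 32]
                    if 64 <= length and pos + length <= len(buf):
                        body = buf[pos + 32:pos + length]  # assumed MD5 coverage
                        if hashlib.md5(body).digest() == stored_md5:
                            offsets.add(base + pos)
                    pos = buf.find(MAGIC, pos + 1)
                # keep a tail so packets spanning chunk boundaries get rescanned
                keep = max(len(buf) - chunk, 0)
                base += keep
                buf = buf[keep:]
        return sorted(offsets)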