There is a better way: instead of using fixed-size blocks, use variable-size blocks, and decide the block boundaries using the data in the blocks themselves. This reduces your search from O(n^2) to O(n).

Tarsnap does this. My project (ddar) does the same.
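
A rough sketch of the resynchronisation property in Python (a toy boundary rule, not Tarsnap's or ddar's actual scheme): because the split decision depends only on the bytes themselves, boundaries re-align shortly after an insertion, so finding duplicate blocks becomes one hash lookup per block instead of comparing every pair of offsets.

    # Toy content-defined chunker: split after any byte whose value is a
    # multiple of 64. The rule is deliberately simplistic; real tools use a
    # rolling hash over a sliding window (described further down the thread).
    def toy_chunks(data):
        block, out = bytearray(), []
        for b in data:
            block.append(b)
            if b % 64 == 0:              # data-dependent boundary rule
                out.append(bytes(block))
                block.clear()
        if block:
            out.append(bytes(block))
        return out

    old = bytes(range(256)) * 4
    new = old[:100] + b"X" + old[100:]   # one byte inserted near the front
    seen = set(toy_chunks(old))
    reused = sum(c in seen for c in toy_chunks(new))
    print(f"{reused} of {len(toy_chunks(new))} blocks already stored")

Only the one block containing the inserted byte changes; every other block is found again by its hash.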




I've heard that called 'anchored hashing'. It's recently been used by FreeArc: http://encode.ru/threads/43-FreeArc?p=28237&viewfull=1#p...

How does ddar do it?


ddar calculates a 32-bit Rabin fingerprint over a sliding 48-byte window (a Rabin fingerprint is a type of checksum that can be "slid" cheaply: the new fingerprint is computed from just the previous fingerprint, the byte coming in, and the byte going out). The fingerprint is compared against a mask with n bits set, where 2^n is the desired target block size. If there's a match, the location of the window is taken to be a block boundary. With random data this should lead to a binomial distribution of block sizes centred on the target block size, though minimum and maximum block sizes are also enforced for pathological cases, which skews this slightly.
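
Roughly, in Python (a sketch only, not ddar's code: a simple multiplicative rolling hash stands in for a true Rabin fingerprint, and the mask and the minimum/maximum sizes are illustrative placeholders):

    WINDOW   = 48                    # sliding-window size described above
    MASK     = (1 << 16) - 1         # 16 bits set -> ~64 KiB target blocks (assumed)
    MIN_SIZE = 4 * 1024              # illustrative minimum block size
    MAX_SIZE = 256 * 1024            # illustrative maximum block size
    PRIME    = 1000003
    MOD      = 1 << 32
    POW_OUT  = pow(PRIME, WINDOW - 1, MOD)   # weight of the byte leaving the window

    def chunk_boundaries(data):
        """Yield (start, end) offsets of content-defined blocks."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            if i >= WINDOW:                       # drop the byte leaving the window
                h = (h - data[i - WINDOW] * POW_OUT) % MOD
            h = (h * PRIME + byte) % MOD          # add the byte entering the window
            length = i + 1 - start
            hit = i + 1 >= WINDOW and (h & MASK) == MASK
            if (hit and length >= MIN_SIZE) or length >= MAX_SIZE:
                yield (start, i + 1)
                start = i + 1
        if start < len(data):
            yield (start, len(data))              # final partial block

The minimum/maximum caps bound the block size when the data never (or constantly) hits the mask, e.g. long runs of identical bytes.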

Blocks whose boundaries are determined by this algorithm are hashed (SHA256 currently). The hash is used to key each block in a reference-counted key/value store.

Then each archive member is just a list of block hashes.
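
In sketch form (not ddar's actual code: the dicts stand in for a persistent key/value store, and BlockStore/store_member/restore_member are names made up for illustration), with blocks coming from a content-defined chunker like the one above:

    import hashlib

    class BlockStore:
        """Reference-counted, content-addressed block store (in-memory stand-in)."""

        def __init__(self):
            self._blocks = {}     # SHA-256 digest -> block bytes
            self._refs = {}       # SHA-256 digest -> reference count

        def put(self, block):
            key = hashlib.sha256(block).digest()
            if key in self._blocks:
                self._refs[key] += 1          # duplicate block: just bump the refcount
            else:
                self._blocks[key] = block
                self._refs[key] = 1
            return key

        def get(self, key):
            return self._blocks[key]

        def release(self, key):
            self._refs[key] -= 1
            if self._refs[key] == 0:          # last reference gone: reclaim the block
                del self._blocks[key]
                del self._refs[key]

    def store_member(store, blocks):
        """An archive member is just the ordered list of its blocks' hashes."""
        return [store.put(b) for b in blocks]

    def restore_member(store, hashes):
        return b"".join(store.get(k) for k in hashes)

Identical blocks anywhere in any member collapse to a single stored copy, and deleting a member just releases its hashes.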


If you were to elaborate on and brush up the description, it would make an excellent HN submission. Good stuff anyway.


I wrote a library that does this. I haven't posted it publicly because it's part of a much larger project (not yet complete), but I could put it on GitHub if anybody is interested.

There are implementations in Lua, Java, and C.


This would be really cool.


I'll ping you on Twitter when I post it.



