Garage splis the data into chunks for deduplication, so it basically already doe...

j-pb · 2024-07-20T12:52:22 1721479942

Yeah, and as far as I understood they use the key hash to address the overall object descriptor. So in theory using the hash of the file instead of the hash of the key should be a simple-ish change.

Tbh I'm not sure if content aware chunking isn't a sirens call:

  - It sounds great on paper, but once you start storing encrypted (which you have to do if you want e2e encryption) or compressed blobs (e.g. images) it won't work anymore.

  - Ideally you would store things with enough fine grained blobs that blob-level deduplication would suffice.

  - Storing a blob across your cluster has additional compute, lookup, bookkeeping, and communication overhead, resulting in worse latency. Storing an object as a contiguous unit makes the cache/storage hierarchies happy and allows for optimisations like using `sendfile`.

  - Storing the blobs as a unit makes computational storage easier to implement, where instead of reading the blob and processing it, you would send a small WASM program to the storage server (or drive? https://semiconductor.samsung.com/us/ssd/smart-ssd/) and only receive the computation result back.