Yeah, and as far as I understood they use the key hash to address the overall object descriptor. So in theory using the hash of the file instead of the hash of the key should be a simple-ish change.
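To make the "hash of the file instead of hash of the key" idea concrete, here is a minimal in-memory sketch. It is not how their system works; `ContentAddressedStore`, `key_address`, and `content_address` are made-up names, and SHA-256 is just an assumed hash function.

```python
import hashlib


def key_address(key: str) -> str:
    """Address derived from the object key (hypothetical stand-in for the current scheme)."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def content_address(data: bytes) -> str:
    """Address derived from the object contents instead of its key."""
    return hashlib.sha256(data).hexdigest()


class ContentAddressedStore:
    """Toy store: keys map to content hashes, content hashes map to blobs."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}  # content hash -> blob
        self.keys: dict[str, str] = {}     # object key -> content hash

    def put(self, key: str, data: bytes) -> str:
        addr = content_address(data)
        # Identical contents land on the same address, so they are stored once.
        self.blobs.setdefault(addr, data)
        self.keys[key] = addr
        return addr

    def get(self, key: str) -> bytes:
        return self.blobs[self.keys[key]]


store = ContentAddressedStore()
store.put("photos/a.jpg", b"same bytes")
store.put("backup/a.jpg", b"same bytes")
print(len(store.keys), "keys,", len(store.blobs), "stored blob(s)")
```

With key hashing the two keys would resolve to two different addresses; with content hashing they share one blob, which is where whole-object deduplication comes from.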
Tbh I'm not sure content-aware chunking isn't a siren's call:
- It sounds great on paper, but once you start storing encrypted blobs (which you have to do if you want e2e encryption) or compressed blobs (e.g. images), it won't work anymore.
- Ideally you would store things as blobs fine-grained enough that blob-level deduplication suffices.
- Spreading a blob's chunks across your cluster adds compute, lookup, bookkeeping, and communication overhead, which hurts latency. Storing an object as a contiguous unit keeps the cache/storage hierarchies happy and allows for optimisations like using `sendfile`.
- Storing the blobs as a unit makes computational storage easier to implement, where instead of reading the blob and processing it, you would send a small WASM program to the storage server (or drive? https://semiconductor.samsung.com/us/ssd/smart-ssd/) and only receive the computation result back.
They probably don't expose it publicly though.
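For the encryption point in the list above, a rough sketch of why chunk-level dedup stops finding matches. Assumptions: fixed-size chunks instead of real content-defined chunking (for brevity; the effect is the same), and `toy_encrypt` is a made-up SHA-256 keystream cipher that is NOT secure and only stands in for a real nonce-based cipher.

```python
import hashlib
import os

CHUNK_SIZE = 32  # assumed chunk size, chosen only for the demo


def chunk_hashes(data: bytes) -> set[str]:
    """Hashes of all fixed-size chunks; shared hashes are dedup candidates."""
    return {
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    }


def toy_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Toy stream cipher (illustration only, not secure): XOR with a SHA-256 keystream."""
    out = bytearray()
    for offset in range(0, len(plaintext), 32):
        keystream = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        segment = plaintext[offset:offset + 32]
        out.extend(p ^ k for p, k in zip(segment, keystream))
    return bytes(out)


key = os.urandom(32)
plaintext = bytes(range(256)) * 4  # the same file, uploaded twice

# Each upload gets a fresh random nonce, as e2e encryption normally requires.
upload_a = toy_encrypt(key, os.urandom(12), plaintext)
upload_b = toy_encrypt(key, os.urandom(12), plaintext)

shared_plain = chunk_hashes(plaintext) & chunk_hashes(plaintext)
shared_cipher = chunk_hashes(upload_a) & chunk_hashes(upload_b)
print("shared chunks, plaintext: ", len(shared_plain))
print("shared chunks, ciphertext:", len(shared_cipher))
```

The plaintext copies share every chunk, while the two ciphertexts of the same file share none, so the storage layer has nothing left to deduplicate.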