Rdedup – backup deduplication with asymmetric encryption (in Rust) (github.com/dpc)
120 points by dpc_pw on May 5, 2016 | 24 comments



I've been looking for something like this to allow a collection of clients to encrypt locally and store centrally with deduplication. Ideally without trusting the central store with the password/passphrase. I don't quite follow how Rdedup works though.

I get using bup -> chunks and tracking said chunks by their SHA256. Makes perfect sense. As does making digests of the collections of SHA256s.

However to dedupe across clients you'll be using the encrypted chunks right?

With libsodium sealed boxes and an ephemeral key, the same plaintext on different clients will be encrypted into different chunks. I'm probably just confused. You mention deduplicating backups across multiple systems. Maybe a 2 system example would help?

My main confusion is that you say "Thanks to public key cryptography, secure passphrase is required only when restoring data, while adding and deduplicating new data does not." But if you don't need the passphrase for deduplication, and the encrypted bits use an ephemeral key, how could they possibly be deduplicated?


As I understand it (from reading the README) it's storing SHA(chunk) -> Enc(chunk) in its database. When adding to the backup, it checks whether SHA(chunk') is already there. Thus, it doesn't have to look at the encrypted data. However, it also can't verify that a chunk stores the correct data.
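
Something like this toy sketch (not rdedup's actual code; the names and the sha2 dependency are just assumptions for illustration):

    use std::collections::HashMap;

    use sha2::{Digest, Sha256}; // assumed dependency, only for the sketch

    /// Hypothetical chunk index: plaintext digest -> encrypted blob.
    struct ChunkStore {
        index: HashMap<[u8; 32], Vec<u8>>,
    }

    impl ChunkStore {
        /// Adding a chunk needs only the plaintext hash plus an encryption
        /// function; no secret (decryption) key is ever required.
        fn add_chunk(&mut self, plaintext: &[u8], encrypt: impl Fn(&[u8]) -> Vec<u8>) {
            let digest: [u8; 32] = Sha256::digest(plaintext).into();
            // Dedup check: if the digest is already present we trust the
            // existing blob. Nothing here verifies that it actually decrypts
            // back to `plaintext`.
            self.index.entry(digest).or_insert_with(|| encrypt(plaintext));
        }
    }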


So in a multiuser environment, the "first" client to upload an encrypted chunk could lie about its plaintext hash, which would poison the well: anybody else gets a nasty surprise when their "backup" turns out to be corrupted.


Multi-user deduplication also leaks information via timing attacks (user2 can upload a million key files to see if user1 already stored one).

It's better to only deduplicate on a per-user basis.


If your encryption is deterministic, the second client can check with the server that Hash(Enc(chunk)) is the same on the client and server.


> If your encryption is deterministic, the second client can check with the server that Hash(Enc(chunk)) is the same on the client and server

But the chunk on the server was encrypted using a different public key, so how can hash(pub-key-1(chunk)) == hash(pub-key-2(chunk))?


Isn't it only the decryption keys that are encrypted to the public keys?


They use the NaCl crypto_box primitive.

This means that you are right. Alas, the decryption key (the symmetric key used to encrypt this particular message) is derived deterministically from the private key and nonce. The nonce they use is the hash of the chunk. Thus, the same chunk will always be encrypted with the same symmetric key.
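
For illustration, a deterministic-encryption sketch along those lines, using the sodiumoxide bindings (this is my reading of the scheme, not necessarily rdedup's exact construction):

    use sha2::{Digest, Sha256};
    use sodiumoxide::crypto::box_; // NaCl crypto_box via sodiumoxide

    /// Hypothetical sketch: the nonce is derived from the chunk's own digest,
    /// so for a given keypair the same plaintext always produces the same
    /// ciphertext -- which is what makes deduplicating encrypted chunks
    /// possible at all.
    fn encrypt_chunk(
        chunk: &[u8],
        receiver_pk: &box_::PublicKey,
        sender_sk: &box_::SecretKey,
    ) -> Vec<u8> {
        let digest = Sha256::digest(chunk);
        // crypto_box nonces are 24 bytes; reuse the first 24 bytes of the
        // digest instead of picking a random nonce.
        let nonce = box_::Nonce::from_slice(&digest[..box_::NONCEBYTES]).unwrap();
        box_::seal(chunk, &nonce, receiver_pk, sender_sk)
    }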


> Isn't it only the decryption keys that are encrypted to the public keys?

From the readme, it appears (to me at least) that chunks are encrypted using public keys. Concretely, the following line:

"Every time rdedup saves a new chunk file, it's data is encrypted using public key so it can only be decrypted using the corresponding secret key. "


Ah, rats, I think you are right.

I was hoping for a client I could run on all the machines I use, having it encrypt locally and decrypt centrally.

Now that I think about it I think you really need a 3 tier setup.

Clients have to be trusted; otherwise all keystrokes could be recorded, all files could be corrupted, ransomware could strike, etc. So tier1 encrypts locally before sending to tier2.

Tier2 is less trusted, never sees plaintext, but you trust it enough not to spend significant resources attacking your client.

Tier3 is not trusted, may be run by other individuals or organizations, may ship copies of your data to entities that try to break your encryption. Think of an offsite backup service or some distant friend/relative that's willing to let you store offsite backups on their potentially insecure machine.

So tier1 encrypts locally with convergent encryption (symmetric is easy, not sure if there's an asymmetric version) and offers encrypted blobs to tier2.

Tier2 does a dedup check and either accepts an upload or just tells the client they are subscribed to that encrypted blob. Then it applies Reed-Solomon coding to add some redundancy in case one of the offsite backups dies.

Tier3 just receives fixed-size encrypted blobs to provide additional copies of backups in case the tier2 site dies. Maybe even have the tier3s find each other with a DHT. People just decide how many copies they want to keep and what redundancy is acceptable, and the tier3s maintain that.

So trusting sorts would use tier1, a shared tier2, and keep 3-4 copies in tier3s. They would enjoy deduplication benefits.

Non-trusting sorts would run a tier1 and tier2 on the same client. More secure, but also no benefit from deduplication across clients.

The really paranoid sorts would run a tier1 and tier2 locally and only trust manually introduced tier3s that they have a high confidence in.
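
To make the tier2 role concrete, a toy sketch (hypothetical types; deduplication is keyed on the ciphertext hash, which works across clients because convergent encryption makes identical plaintexts produce identical blobs):

    use std::collections::{HashMap, HashSet};

    /// Hypothetical tier2 state: encrypted blobs keyed by ciphertext hash,
    /// plus which clients reference each blob (useful for later garbage
    /// collection).
    struct Tier2 {
        blobs: HashMap<[u8; 32], Vec<u8>>,
        subscribers: HashMap<[u8; 32], HashSet<String>>,
    }

    enum Offer {
        AlreadyStored, // client is merely recorded as a subscriber
        PleaseUpload,  // blob is new, client should send the ciphertext
    }

    impl Tier2 {
        /// A client offers a blob by its ciphertext hash before uploading it.
        /// Tier2 never sees keys or plaintext, only encrypted blobs.
        fn offer(&mut self, client: &str, blob_hash: [u8; 32]) -> Offer {
            self.subscribers
                .entry(blob_hash)
                .or_default()
                .insert(client.to_string());
            if self.blobs.contains_key(&blob_hash) {
                Offer::AlreadyStored
            } else {
                Offer::PleaseUpload
            }
        }

        /// Upload path: because dedup is keyed on the ciphertext hash, tier2
        /// could recompute SHA256(blob) here and reject mismatches, so a lying
        /// client can only corrupt blobs nobody else would reference.
        fn upload(&mut self, blob_hash: [u8; 32], blob: Vec<u8>) {
            self.blobs.insert(blob_hash, blob);
        }
    }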


Yes, I think you're right in your conclusions.


Haven't looked at the code, but my guess is that it calculates the SHA256 of a chunk, checks if this digest already exists, and only if not does it encrypt and add a new chunk. Since the SHA256 is taken over the plaintext, it indeed deduplicates, but it also reveals some information to attackers.


So you're looking for the collection of clients to be able to make secure backups of their data, but be able to store those backups in a central location with de-duplication and without having to trust the central location with keys? That's certainly feasible, though it seems this particular software won't achieve it.

Off the top of my head, the scheme is simple. Break data down into chunks as usual. To encrypt a chunk: chunk_encryption_key = KDF(HMAC(shared_secret1, plaintext)). Encrypt. chunk_id = SHA256(ciphertext). Upload the ciphertext to the central repo, which can store and index it by SHA256(ciphertext).

When you've got all chunks encrypted and uploaded, build the backup archive by listing all files, their properties, paths, etc., plus their respective chunks. Chunks are referenced in the archive as (chunk_id, encryption_key) pairs. Encrypt the archive using a per-client asymmetric key with authentication, then upload the encrypted archive to the central repo. Store it by name or however you want to handle that.
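
A rough Rust sketch of that per-chunk step (hypothetical helper names; the symmetric cipher is left abstract, and a real implementation would run the MAC output through a proper KDF such as HKDF):

    use hmac::{Hmac, Mac};
    use sha2::{Digest, Sha256};

    type HmacSha256 = Hmac<Sha256>;

    /// Hypothetical per-chunk encryption following the scheme above: the key
    /// is derived from the plaintext (convergent encryption keyed with
    /// shared_secret1), and the chunk id is the hash of the ciphertext.
    /// `encrypt` stands in for whatever authenticated symmetric cipher is
    /// used.
    fn encrypt_chunk(
        shared_secret1: &[u8],
        plaintext: &[u8],
        encrypt: impl Fn(&[u8; 32], &[u8]) -> Vec<u8>,
    ) -> ([u8; 32], [u8; 32], Vec<u8>) {
        // chunk_encryption_key = KDF(HMAC(shared_secret1, plaintext));
        // a real implementation would add a KDF step here.
        let mut mac = HmacSha256::new_from_slice(shared_secret1)
            .expect("HMAC accepts any key length");
        mac.update(plaintext);
        let chunk_encryption_key: [u8; 32] = mac.finalize().into_bytes().into();

        let ciphertext = encrypt(&chunk_encryption_key, plaintext);

        // chunk_id = SHA256(ciphertext): the repo can index and verify by
        // this without being able to derive the encryption key from it.
        let chunk_id: [u8; 32] = Sha256::digest(&ciphertext).into();

        (chunk_id, chunk_encryption_key, ciphertext)
    }

Note that chunk_id can't be used to derive chunk_encryption_key, which is what keeps the repo blind.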

When it comes time to decrypt, the client can pull the archive from the repo, authenticate, decrypt, and then begin extracting files. When it needs a chunk, it just asks the repo using the chunk_id. It can verify integrity and authenticity using the chunk_id (the chunk_id itself is already authenticated because we authenticated the archive it came from). Decrypt, and you're done.

The central repo doesn't need any of the keys. The clients all share shared_secret1, and then each have their own asymmetric keys. Clients can't upload corrupted data to mess with each other. They also can't pull random chunks and decrypt them, because decryption requires the "chunk_encryption_key" which can't be derived from chunk_id. They can only decrypt chunks that they've used in a backup.

Garbage collection can be handled by tagging archives on the central repo with a list of used chunk_ids. That's still secure.

The only caveat: You'll need to trust at least that clients won't attempt to exploit the flaws in convergent encryption (you can brute-force fields in common documents). Though there are ways to mitigate that in certain scenarios.

My WIP secure backup tool, written in Rust, does most of this: https://github.com/fpgaminer/preserve The only difference is that I decoupled the chunk_id from the ciphertext's hash, as part of an effort to reduce archive size (archives only need to store 32 bytes per chunk, rather than 64 bytes). That means it doesn't handle the case of malicious clients uploading corrupt chunks. That can be mitigated though.


Asymmetric is a nice touch (sets you apart from restic and attic/borg)!

I've PR'd this project into the restic/others list.


Thanks, merged!

Yes, the main reason I wrote it is asymmetric cryptography.


> the main reason I wrote it is asymmetric cryptography

Why do you prefer asymmetric cryptography instead of whatever restic and attic/borg are using?


Asymmetric cryptography means that the private key doesn't need to be stored on the client or the server.

If you have someone's public key, and no other key material whatsoever, you can encrypt something in a way that only they can read it. Even you can't decrypt it.

(Because asymmetric cryptography is slow, if your messages are large you'd typically encrypt the data with a symmetric key, and then encrypt just that key asymmetrically.)
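
For example, a minimal sketch of that hybrid pattern with libsodium via the sodiumoxide crate (an illustration, not rdedup's exact construction):

    use sodiumoxide::crypto::{box_, sealedbox, secretbox};

    /// Bulk data is encrypted with a fresh symmetric key, and only that small
    /// key is encrypted to the recipient's public key. No secret key is needed
    /// to run this function; only the recipient can recover the data key.
    fn hybrid_encrypt(
        data: &[u8],
        recipient_pk: &box_::PublicKey,
    ) -> (Vec<u8>, secretbox::Nonce, Vec<u8>) {
        let data_key = secretbox::gen_key();
        let nonce = secretbox::gen_nonce();
        // Fast symmetric encryption for the (possibly large) payload.
        let ciphertext = secretbox::seal(data, &nonce, &data_key);
        // Only this small wrapped key requires (slow) asymmetric crypto.
        let wrapped_key = sealedbox::seal(&data_key.0, recipient_pk);
        (wrapped_key, nonce, ciphertext)
    }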


Very nice! But how is this different from Obnam? AFAIK obnam also dedups and encrypts (gpg).


AFAIK Obnam doesn't use a rolling hash for deduplication, but rather a simple chunking method. As such, the space savings from Obnam's deduplication are much worse than tarsnap's and zbackup's.
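
For anyone unfamiliar, here's a toy illustration of content-defined chunking with a rolling hash (a simple additive rollsum; bup-style tools use stronger rolling hashes, and this is not Obnam's or rdedup's code):

    /// Toy content-defined chunking: a rolling sum over a sliding window,
    /// cutting wherever the masked bits of the sum are zero. Boundaries
    /// depend on content, so inserting bytes early in a file only changes
    /// the chunks around the edit instead of shifting every fixed-size block.
    fn chunk_boundaries(data: &[u8], window: usize, mask: u32) -> Vec<usize> {
        let mut cuts = Vec::new();
        let mut rolling: u32 = 0;
        for i in 0..data.len() {
            rolling = rolling.wrapping_add(data[i] as u32);
            if i >= window {
                // slide the window: drop the byte that just left it
                rolling = rolling.wrapping_sub(data[i - window] as u32);
            }
            if i + 1 >= window && (rolling & mask) == 0 {
                cuts.push(i + 1); // cut after this byte
            }
        }
        cuts
    }

With, say, window = 64 and mask = 0xFFF, boundaries land on average every ~4 KiB regardless of where data was inserted upstream.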


Or Borg backup?

Seems like a lot of wheel reinvention is going on in this space.


It seems like the main difference from attic/borg is the asymmetric encryption.


So why not just dedup backups and then encrypt the backups instead of encrypting and backing up?

As I understand it this is doing `file -> encrypt -> dedup -> compress`


If you encrypt locally you don't have to trust the backup server with your privacy. This is doubly an issue when the server is running on some random cloud somewhere instead of the laptop that you have physical possession of.

The trick is deduplication across multiple clients.

In response to your second comment: that's pretty close, except the encrypt step is deterministic (unlike the usual approach with a unique IV and/or nonce), and compressing after encryption is useless. If your encrypted blocks are not effectively random and incompressible, you are doing your encryption wrong.


I don't understand the question. Can you explain a bit more?



