Summary: Jungledisk doesn't protect the integrity of encrypted data and doesn't securely derive keys, and is thus vulnerable to fast offline attacks. The one thing Jungledisk gets right is using the same block cipher mode as Tarsnap (and, incidentally, virtually every mainstream encrypted storage system).
The impact of using unauthenticated encryption to store data is that your backup provider could end up owning your machine. Attackers can carefully choose which data to corrupt. They can exploit the randomization of corrupted decryption to set up conditions for memory corruption exploits, and, in more sophisticated but totally realistic attacks, exploit guesses about known plaintext to produce attacker-controlled nonrandom plaintexts. A backup provider with client-authenticated crypto can't do that, because the keys that encrypt the data also ensure its integrity.
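To make that concrete, a minimal sketch of what authenticated encryption buys you -- Python, using the cryptography library's Fernet construction as a stand-in, not anything Jungledisk or Tarsnap actually uses:

    from cryptography.fernet import Fernet, InvalidToken

    key = Fernet.generate_key()              # the same key covers secrecy and integrity
    box = Fernet(key)
    token = box.encrypt(b"backup block contents")

    # The storage provider flips a byte of what it stores...
    tampered = bytearray(token)
    tampered[20] ^= 0x01

    # ...and decryption refuses to hand back attacker-influenced plaintext.
    try:
        box.decrypt(bytes(tampered))
    except InvalidToken:
        print("tampering detected; nothing decrypted")

Without the authentication step, the decrypt call would happily return corrupted plaintext for your restore process to choke on.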
The password storage issue is no different than any other password storage problem; again, direct your attention to http://codahale.com/how-to-safely-store-a-password/, mentally substituting "derivation of AES key" for "storage of password hash".
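If it helps, the sane version of that derivation step looks something like this -- a sketch with Python's standard library, with a salt and iteration count I picked out of the air, not anything Jungledisk does:

    import hashlib, os

    password = b"correct horse battery staple"
    salt = os.urandom(16)                     # stored in the clear next to the ciphertext

    # Each password guess now costs the attacker 200,000 hash invocations
    # instead of a single fast hash.
    aes_key = hashlib.pbkdf2_hmac("sha256", password, salt, 200_000, dklen=32)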
To my mind, the key derivation is the real problem here. A surprisingly large number of secure encrypted storage products don't ensure data integrity. Realistic attacks against that vulnerability are feasible but difficult: you'd have to be targeted.
If you're going to write an article about how a competitor's encryption is inferior to yours and cast it as a vulnerability report, I'd suggest not recommending your own encryption scheme as the replacement. The scrypt recommendation in this article sticks out like a sore thumb. Virtually nothing uses scrypt.
We can nerd out on CTR mode vs. CBC mode; I'm starting to come around to Colin's take on CTR because of ciphertext indistinguishability as I see more practical vulnerabilities that take advantage of it. I think the padding issue is a red herring. CBC padding is easier to get right than absolute rock solid reliable generation of CTR nonces and absolute rock solid management of CTR counters, which are things I see people get wrong regularly. Distinguishability is the real problem with CBC.
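To illustrate the bookkeeping I mean, a CTR sketch in Python with the cryptography library; the whole game is that the nonce must never repeat under the same key:

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key = os.urandom(32)

    def ctr_encrypt(plaintext: bytes) -> bytes:
        nonce = os.urandom(16)                # must never repeat under this key
        enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
        # Reusing a nonce here leaks the XOR of two plaintexts. There's no padding
        # to fumble, but this bookkeeping is the part people routinely get wrong.
        return nonce + enc.update(plaintext) + enc.finalize()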
To my mind, the key derivation is the real problem here. A surprisingly large number of secure encrypted storage products don't ensure data integrity. Realistic attacks against that vulnerability are feasible but difficult: you'd have to be targeted.
I think the lack of integrity is more important than you're making it sound. There are a lot of situations where a lack of integrity can be exploited to create a lack of privacy too.
But the main reason I mentioned the lack of integrity first is that I needed to mention the lack of HMAC to explain why they had the ridiculous "salted key hash" construct.
If you're going to write an article about how a competitor's encryption is inferior to yours and cast it as a vulnerability report, I'd suggest not recommending your own encryption scheme as the replacement. The scrypt recommendation in this article sticks out like a sore thumb. Virtually nothing uses scrypt.
I think you're misstating what I wrote a bit. I said that scrypt is the state of the art in the field -- which it is -- and that given that Jungle Disk was around before I developed scrypt, they should have used PBKDF2 or bcrypt.
I'd rather geek out about CTR v CBC than harp on the scrypt recommendation. Consider the scrypt thing a friendly style note. You wrote an article about a competitor's insecurities. When you do that, don't recommend they adopt your own cryptosystem unless (like CRI had to do with DPA countermeasures) they have to. Here, it just made you look unnecessarily petty.
What privacy attacks were you thinking of? Call some of them out.
I think the author's point about privacy is valid, but also a little silly. If I understand correctly (the article is very confusingly worded in some places), he is saying that weak passwords are weak. Anyone who cares about privacy should already be choosing long, complex, strong passwords for this kind of application.
Also, I'm confused about one feature of JD. When I signed up years ago, they allowed me to hold my key privately and it never left the client. I had the option to upload that key to the server, if I wanted to, or not. I understand from the article that the client might misbehave and, for example, share my key in ways I don't want it to. Am I getting this right?
When I looked into secure cloud-based storage two years ago, I found that JD was the best mix of privacy and convenience, if for no other reason than it could be deployed on a mix of Windows, Mac and Linux boxes. It was clear even then that data integrity was the weak link/trade off.
I'm interested in hearing about the latest, best solutions for easy, cross-platform, secure backups to cloud services that offer better data integrity.
This article points out two flaws. Neither of them are silly.
First, there's no integrity protection on data stored on Jungledisk. Jungledisk can own up your machine. That's not a good property for a secure backup system to have.
Second, the key derivation scheme it uses makes every passphrase, no matter how carefully chosen, drastically weaker.
I'm glad you like Jungledisk and I don't think you need to read stories like this as an indictment of your choice or a demand to change services. But it doesn't help to downplay them.
Second, the key derivation scheme it uses makes every passphrase, no matter how carefully chosen, drastically weaker.
I'd just like to repeat this point because it's so important. The password verification method in JungleDisk is fundamentally broken and needs to be rearchitected immediately.
For non-cryptography people, this is similar to the vulnerability that allowed passwords to be retrieved from the Gawker database hack a couple months ago (just not quite as vulnerable).
OK, I freely admit that I'm not an expert in this area, so I'll rescind my "silly" comment.
But, "drastically weaker" than what? If the password is strong, JD doesn't make it weak. JD just doesn't make it as strong as it should/could? Is this correct?
But, "drastically weaker" than what? If the password is strong, JD doesn't make it weak. JD just doesn't make it as strong as it should/could? Is this correct?
Correct. The vast majority of people can't remember strong passwords, so it's necessary to "strengthen" them using a good key derivation function. Jungle Disk doesn't do this.
OK, well, I guess I don't see how that's fairly described as a "flaw" in JungleDisk.
I can understand why a responsible developer should assume their users are simple-minded mouth-breathers who can't be trusted to choose a proper password (and I'm sure there's plenty of evidence to support that assumption), but it just isn't right to characterize JungleDisk as compromised from a security perspective because it relies on the user to choose a strong password.
Saying that Jungle Disk is secure as long as users pick strong passwords is like saying that the Ford Pinto is safe as long as drivers don't get into rear-end collisions. In both cases you're asking for behaviour which we know perfectly well that users don't exhibit; and in both cases there is a simple fix for the problem.
The Ford Pinto is an unsafe car, and Jungle Disk is an insecure backup service.
I'm trying to understand this. Again, I'm no expert.
I can see why the data integrity issues may allow external factors to compromise the security of my buckets and/or local device. That's me in a Pinto, at the mercy of the bad driver behind me.
I don't see how password strength is open to any external factor; it would seem to be purely a matter of user error. That part doesn't seem to fit the Pinto analogy. That's where I'm struggling to follow your article.
The issue is how fast the password can be broken. MD5 is a very fast hash, so even a relatively slow computer can do a lot of attempts very quickly.
Bcrypt, on the other hand, can be tuned to go as slow as you want. You can force it to take 250 milliseconds, regardless of how good or bad the password is.
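Rough illustration of the gap (single-threaded Python on whatever machine you run it on; PBKDF2 from the standard library stands in for bcrypt here since it ships with Python, and the numbers are only meant to show the ratio):

    import hashlib, time

    password, salt = b"hunter2", b"somesalt"

    t = time.perf_counter()
    for _ in range(100_000):
        hashlib.md5(salt + password).digest()               # one fast hash per guess
    print("100,000 MD5 guesses:", time.perf_counter() - t, "seconds")

    t = time.perf_counter()
    hashlib.pbkdf2_hmac("sha256", password, salt, 100_000)  # one deliberately slow guess
    print("1 PBKDF2 guess:", time.perf_counter() - t, "seconds")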
And that is the fundamental flaw. Jungle Disk's key derivation makes it possible to crack your password in a reasonable time; bcrypt does not. Because of that decision, everybody's data is much less safe (I'm referring to everybody's data in a statistical sense: the average password sucks and is easily broken in this scheme, so the average file is at risk).
As a provider of security software (like my company is doing), Jungle Disk should be doing everything it can to help users keep their data secure. Jungle Disk isn't doing that.
OK. I think I understand now. I still don't think it's fair to call it defective design (and I'm not really certain that you ever did call JD's password privacy defective, BTW). More like a design that is unsafe in the hands of the typical driver, perhaps.
Why do I care? I just want to understand the risks for someone like me, who has taken care to choose very strong passwords.
My conclusions from all of this:
(1) The data integrity issue is serious because it presents an opportunity for introduction of malicious code, creates a risk of data loss, and may lead to security breaches.
(2) The local binary is opaque, and therefore presents a theoretical risk of compromising even the most "close to the vest" key management strategy.
(3) The password protection issue is a serious shortcoming that can, and should, be mitigated by choosing strong passwords.
One way requires the user to have a drastically stronger password to be safe, and the other significantly strengthens passwords, protecting a subclass of users that will always exist (those that are unable to remember strong passwords or don't know enough about the dangers of password cracking to know how to effectively choose passwords).
It is madness to defend the use of MD5 for password hashing these days. It is clearly not designed for that at all.
My understanding is that SpiderOak, Tarsnap, and Wuala all do this correctly (using one of PBKDF2, bcrypt, or scrypt.)
Colin - Perhaps the companies in the backup space that put effort into handling this carefully should work together and create a PSA style website with a matrix chart of how the various providers handle "encrypted" data. Make it a separate domain and do our best to be elaborately objective about it. Any interest?
What block cipher mode does SpiderOak use, and how does it verify the authenticity of its data? Tarsnap goes through a lot of extra trouble to MAC its data; few other providers do. You'd hate to see everyone treat key derivation as a shorthand for "doing all of encryption right".
I looked on the SpiderOak site, saw a lot of material on how keys are derived and not stored on SpiderOak servers (great!), but didn't see a lot of details about the mechanics of actually encrypting and checking data.
Thanks for asking. If you're interested, would be very happy to discuss SpiderOak's crypto strategy in depth with you the next time I'm in Chicago. Could share source code, etc. IMO, most interesting parts are the key scoping, which allow users to selectively publish ("share") portions of stored data by publishing the appropriately scoped keys.
SpiderOak uses AES256 in CFB mode with authentication via HMAC. The code is careful about unique nonce/counter usage, crypto code is confined to specific modules that rarely change, and reviewed by cryptographers outside SpiderOak. Client and server have minimal trust relationship.
Being paranoid about data integrity (not only because of crypto issues, but also because bitrot happens routinely at petabyte scale), we authenticate the data redundantly at a few different layers. Across all end user devices, we see about one bit error per 4.2 TB of upload transactions.
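Roughly, the encrypt-then-MAC shape of that looks like the following -- a sketch in Python, not our actual code:

    import hmac, hashlib, os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    enc_key, mac_key = os.urandom(32), os.urandom(32)    # separate keys for each job

    def seal(plaintext: bytes) -> bytes:
        iv = os.urandom(16)
        enc = Cipher(algorithms.AES(enc_key), modes.CFB(iv)).encryptor()
        ct = iv + enc.update(plaintext) + enc.finalize()
        return hmac.new(mac_key, ct, hashlib.sha256).digest() + ct

    def unseal(blob: bytes) -> bytes:
        tag, ct = blob[:32], blob[32:]
        # Check integrity before the ciphertext ever touches the block cipher.
        if not hmac.compare_digest(tag, hmac.new(mac_key, ct, hashlib.sha256).digest()):
            raise ValueError("authentication failed")
        dec = Cipher(algorithms.AES(enc_key), modes.CFB(ct[:16])).decryptor()
        return dec.update(ct[16:]) + dec.finalize()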
Not really. I've seen far too many "best practices" and "standards" bodies go nowhere to think that a committee can put together a useful website like this.
@cperciva: Thanks for this; now I'll convert my 8-char ASCII system password to a 10-char one. Do you have any data showing how large a password needs to be to make it ridiculously expensive for a TLA (agency) to commit a large amount of hardware to cracking? I.e., how much extra cracking time do the characters past 10 buy?
It's BSD licensed but probably not easy to integrate on your platform. BCrypt is an easier choice. When we see Java and .NET implementations of scrypt, we'll start recommending it, but I'll be honest and tell you that we rarely recommend scrypt today.
I wish I could find a link, but US military spec for secure passwords is 14 characters with capitals and special chars. And they have to be changed every 30 days.
True enough, if you're targeted it's not going to help very much. However, like outrunning a bear, you only need to be harder to catch than the guy behind you.
Cryptographers hate GPG. GPG is ugly as sin†. Unfortunately (and I mean that only with a little bit of snark), GPG mostly still works, in the sense of standing up to active, informed attackers with modern techniques.
† For instance, look how it handles message integrity.
This is a slippery slope argument that ends in you arguing that the best tested cryptosystem in common use (TLS) is also insecure. All cryptosystems have vulnerabilities; the question is how workable the system is after those flaws are fixed.
For the record, I respect the critiques practitioners have of GPG. Unfortunately, their alternatives tend to be ad-hoc. There should be a clean, simple, GPG-like standard, perhaps based on ECC and AE cipher constructions, to replace GPG. But until that happens, in the choice between ugly and workable vs. simple and fragile, ugly and workable is the right choice for most people.
As always I think you drastically underestimate how dangerous this stuff is because you've dedicated your career to it, while normal implementors --- even crypto enthusiasts (look at Tor and SSH) --- have little of the nuance required to get it right.
I like the fundamentals of TLS more than you do; I don't think it's a bad or needlessly complex protocol (except maybe session resumption). I see that reasonable people can differ on that point. But, very importantly, TLS is also a vehicle for collecting and implementing the best known methods in cryptography. I think you tend to overlook that.
As always, my opinions are as a software security practitioner and not as a cryptographer, since I am not one.
The appearance and track record† of the code in OpenSSL does the credibility of TLS no favors, and it is totally understandable why someone who had to deal with software security for a platform that ships and depends on OpenSSL would become allergic to it.
But, two responses to that:
* First, what Joel Spolsky says about rewrites. Sometimes code is ugly for a reason. Clean rewrites of OpenSSL will inevitably introduce bugs. Introducing bugs in SSL†† implementations is perilous.
* Second, there are mature alternatives to OpenSSL. For instance, most? browsers don't use it.
† In fairness, that's because OpenSSL dates back to a time when nobody was getting C software security even close to right.
†† I use TLS and SSL interchangeably, which is a foible I should work on correcting, but the difference doesn't matter much here.
Hey, since you mentioned TLS/SSL: I can't seem to find an answer to this question: does my browser or system need to contact the CA each time it encounters a new SSL cert, or is having the root certificate enough?
Your browser does not need to contact a CA to verify the signature in an SSL certificate, but may in some cases want to contact the CA to check for revocation.
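Concretely, a quick illustration with Python's standard library; the chain check happens locally against the bundled roots, and a revocation check would be a separate OCSP/CRL lookup that this code doesn't do:

    import socket, ssl

    ctx = ssl.create_default_context()        # loads the locally bundled root certificates

    # The certificate chain is verified on the client during the handshake;
    # no connection to the issuing CA is needed for the signature check.
    with socket.create_connection(("example.com", 443)) as sock:
        with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
            print(tls.getpeercert()["subject"])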
GPG is big and complicated. The more code you have the more likely it is that you'll have security vulnerabilities. (This is especially true for code like GPG which reads attacker-provided inputs, since it allows the attacker to pick which of many code paths get invoked.)
They don't publish their key derivation scheme, but I'd be shocked (and pleased) to find that they were savvy enough to actually use PBKDF2 or even stretched SHA1. Believe it or not, plenty of commercial vendors literally take the ASCII of the password as the key.
I'd also worry, based on that spec, that the Arq developers believe the SHA1 hashes they store are fully equivalent to a deliberate MAC.
I should have noted that Arq's git-like scheme makes them inherently more careful about storage and data integrity (under non-adversarial conditions) than Jungledisk. My perusal of their site was casual. I really don't know much about them and am not offering a professional opinion.
I emailed you asking for professional help in reviewing the security aspects of Arq. I'm not an expert, and I'd like to get it right. If anybody else has the expertise to do this review, please email me at stefan@haystacksoftware.com.
In general, if you're an indie developer and you're doing custom crypto stuff, I'm happy to do a consult free of charge. You'll probably find other software security firms are similarly willing to do that kind of stuff, just like the good law firms will tend to do up-front consults for free.
Full-on software reviews, particularly by consultants competent enough to review crypto, are very expensive. You can probably get away without doing one, as long as you get good advice and have people to bounce ideas and problems off of.
Karmically, being someone to bounce ideas and problems off of has paid off for Matasano dramatically, so, anyone else reading this thread, consider this an open invitation.
I'm in the process of getting an app review done by a security expert. Then I can answer that question (hopefully) definitively. (I'm the author of Arq)
I'm happy to cross-check this stuff with you in private, if you'd like a free consult from another professional (reiterating something I said on Twitter a minute ago).
I took a look at https://www.tarsnap.com/gettingstarted.html and I don't understand why "If you have multiple machines, you almost certainly want to create a separate key file for each machine." Can you explain why, if I'd like to access the same data from different machines? Or is the main assumption that every machine has its own, non-shareable backup? Isn't the main advantage of an online service to have the same data accessible from multiple machines?
Tarsnap isn't Dropbox. It's a backup system. Its cost structure and security model are optimized for backup, which is why you can't e.g. read your Tarsnap files from a web interface at Tarsnap.
I continue to not understand how people imagine these services working (de-dupe, block-level updates, etc.) without access to the unsecured version of the data. As for the claims about what Amazon could do to your data... there are even less sinister options. S3 is not 100% safe storage. There's a chance of bit rot, and it may occur. If you don't check the file yourself, you won't know. Again, that seems a bit inevitable, no?
There's a sucker born every minute. Jungle Disk claims to be secure, and most people believe them -- most people have no way to assess whether they're doing things right or not.
De-dupe: Wuala encrypts the file with a key derived from the file itself. This key is then encrypted with the user's key and both (the file and the encrypted key) are uploaded to the cloud. Disadvantage: if the file is known to an attacker (e.g., a copyright holder), the attacker can possibly find out which users have access to this file. Advantage: allows for de-duplication, but is more secure than Dropbox.
Block-level updates: I don't see a problem with this. Partition the file into blocks on the client (before the encryption). The server doesn't need access to the data for this.
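A toy sketch of that convergent-encryption idea in Python with the cryptography library -- not Wuala's actual implementation, and the deterministic nonce is exactly the property that both enables de-dupe and leaks "who has this file" (user_key is assumed to be 16, 24, or 32 random bytes):

    import hashlib, os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def convergent_encrypt(file_bytes: bytes, user_key: bytes):
        file_key = hashlib.sha256(file_bytes).digest()               # key derived from content
        nonce = hashlib.sha256(b"nonce" + file_bytes).digest()[:12]  # deterministic on purpose
        blob = nonce + AESGCM(file_key).encrypt(nonce, file_bytes, None)
        # Identical files produce identical blobs, so the server can de-dupe them,
        # and anyone who already holds the file can recognize it in storage.
        key_nonce = os.urandom(12)
        wrapped_key = key_nonce + AESGCM(user_key).encrypt(key_nonce, file_key, None)
        return blob, wrapped_key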
As Steve Weis pointed out in an earlier thread about schemes like this, deriving keys from the contents of files breaks semantic security. Lay engineers reason about this problem the way you just did: "the RIAA can tell I have Lady Gaga MP3s". But practitioners are worried about much more subtle and devastating flaws, particularly in cases where attackers may exercise some control over the blocks being encrypted.
Any scheme that derives passwords from file contents gives me the willies.
Most file-level encryption that I'm aware of destroys the benefits of blocked data. For example, changing a few bytes in the file will cause MANY bytes to change in the encrypted file on the disk... at least with ecryptfs and truecrypt. If there is an encryption scheme that works well with striping, I'd really super appreciate you pointing me in that direction - it would greatly help with a problem I'm currently trying to solve.