I think this is a topic worth raising. I spend so much time auditing the "magic" in open source projects, trying to find out how they work behind the scenes and whether they are doing anything to protect me.
For example:
Pip (the Python package manager) tells you to run a "secure" script off the internet [0], where the only real protection is that it's served over SSL. But surely I can just check the code of that script? Oh wait, it embeds a zip file inside the Python script, so I need to manually unzip it and look inside before I can even start to consider it safe.
I like what CoreOS does: they don't even provide SSL download links for their ISOs, it's all plain-text HTTP, so they stress that you need to verify the image signature against their GPG signing key, which is distributed via GitHub.
> so they stress that you need to verify the image signature against their GPG signing key, which is distributed via GitHub.
If someone has the ability to MITM HTTPS connections, they could just as easily MITM the victim's connection to GitHub to return a different GPG signing key. Also, the protection requires active checking rather than coming 'for free' like it does when downloading via HTTPS, so 95%+ of users won't bother and won't have any protection.
I think CoreOS took a big step backwards here. Just offer the download links via HTTPS in addition to the GPG key. That way people that don't bother with GPG have protection.
If you use CoreOS, you just need to get the CoreOS key once and sign it with your own key, then let the keyservers mirror it for you. An adversary can still make sure you can't connect to any key servers -- but it's an order of magnitude easier to verify a single GPG key (even if that involves getting on a plane and talking to a CoreOS developer in person) than it is to verify every single CA cert shipped with your browser.
Just look at the latest CNNIC removal: essentially Google and Mozilla are saying that SSL was never secure, because we couldn't trust CNNIC, even though everyone shipped their root certificate.
Before that, no browser was secure, because a number of CAs were compromised and/or incompetent.
While GPG isn't some magic dust that makes trust easy, it's crazy to claim that SSL is somehow more secure or, more importantly, easier to trust.
[edit: And that doesn't even touch on the number of lines of code involved, or the directness of the path between what is signed and who will use it (the security of the web server, DNS, etc. isn't an issue if you know you can trust the GPG signature)...]
SSL does nothing to guarantee the integrity of code you download and run off the internet.
I'm not worried about MITM attacks; SSL provides adequate protection against those in most cases. But if someone gains access to the CoreOS systems, SSL would not help.
In this case, I have a GPG key saved for the CoreOS image signing, which hopefully is done offline. I got that key through a third party I have a certain amount of trust in (GitHub), and I can always verify it through other channels.
But did you verify it on other channels? Or did you do like 99.9% of users and assume nobody would be attacking you? My point is mainly that gpg is such a usability nightmare that it's effectively a broken security model.
> I like what CoreOS does: they don't even provide SSL download links for their ISOs, it's all plain-text HTTP, so they stress that you need to verify the image signature against their GPG signing key, which is distributed via GitHub.
Sounds very much like TRTTD. I'll see about this with `sigpipe` (hashpipe with signatures). The fundamental problem is that PKI is weak; using GitHub isn't great either.
We need a chain of trust that's sane (CA system is not sane).
This is non-portable, as OS X has no sha256sum out of the box. But it does have shasum from Perl, and Linux distros typically come with Perl, so `shasum -a 256` should work...
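A rough sketch of a portable fallback along those lines (the function name is made up; it assumes one of the two tools is present):

  # Use GNU sha256sum when available, otherwise Perl's shasum (ships with OS X).
  portable_sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
      sha256sum "$@"
    else
      shasum -a 256 "$@"
    fi
  }
  # e.g.:  curl -fsSL <url> | portable_sha256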
I wonder if there is any concise way to do this without needing to save a file to disk. I can't think of one, as it requires splitting the input in two, which bash can do with >(command) but not sequentially or with the ability to communicate from the subshell to the outer one.
You don't get your hash without consuming the entire input.
You can't start feeding the input to the shell, until you verify the hash.
You don't want to download it twice, in case it comes back different the second time (or in case your network is slow).
At the point that you are verifying the hash, the entire file must exist on your system somewhere. Disk, memory, OCR-friendly printout, whatever.
Buffering to memory can only work if the input is "small", whatever that means. And pipelines aren't meant to do this, so you'll have to do something a bit odd (ie, confusing to anyone trying to understand your code) to make it work.
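Given those constraints, the boring fallback is to stage the download somewhere and only run it once the hash checks out. A minimal sketch, assuming GNU sha256sum and a placeholder URL and hash:

  tmp=$(mktemp)
  curl -fsSL https://example.com/install.sh -o "$tmp"
  # run the script only if the checksum verifies
  echo "<expected-sha256>  $tmp" | sha256sum -c - && sh "$tmp"
  rm -f "$tmp"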
> Buffering to memory can only work if the input is "small", whatever that means.
Well, increasingly it means "up to 8GB or more". Or put another way, RAM is increasing faster than network speed (or the speed of light, for that matter). So I'd say that, yes, for many things you'd want to download from the internet, caching in RAM is absolutely an option.
Maybe we should just recommend that software is distributed as a (not detached) signed file. So:
curl http://yolo.example.com/lulz.gpg | gpg -d - | sh
(Which of course doesn't work either, as one could replace lulz.gpg with "#!/usr/bin/env sh; rm -rf /" signed by some key one trusts... which isn't so far-fetched, assuming some trusted entity publishes a utility for wiping a system automatically (think: shred /dev/sd*))...
Still, gpg -d seems preferable to detached signatures (not to mention detached hashes) for this type of thing? The files can't be used without verifying, or running them through gpg [ed: or other manual intervention, for shooting oneself in the foot] -- and there is a single, sane way to do that.
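For what it's worth, a sketch of what that could look like with GnuPG (lulz.sh/lulz.gpg are placeholders carried over from above). One caveat: in a straight pipe, sh never sees gpg's exit status, so the careful version writes to a file and checks the status before running anything:

  # Publisher: wrap the script and its signature into a single file (not a detached .sig).
  gpg --output lulz.gpg --sign lulz.sh

  # Consumer: let gpg unwrap and verify, and only run the result if it verified.
  curl -fsS http://yolo.example.com/lulz.gpg -o lulz.gpg
  gpg --output lulz.sh --decrypt lulz.gpg && sh lulz.sh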
I don't see how this gives additional security. When you run
curl https://project.com/script.sh | sh
...you're relying on three things:
1. That the people running the project are trustworthy
2. That the server hasn't been compromised
3. That the CA system will ensure you're talking to the correct server
(I can think of recent news stories where each of those were violated.)
If, instead, you go to `https://project.com`, read the instructions, and paste in the following command...
curl https://project.com/script.sh | hashpipe <somehash> | sh
...then you're relying on those same three things! Someone who wants to serve a modified version of `script.sh` just has to serve modified instructions as well. You also have a new requirement: you have to get a trusted install of hashpipe first.
It is trivially easy for a MitM to interrupt the download of the script between two TLS packets, without any CA or server compromises. When you do:
curl https://project.com/script.sh | sh
Then sh happily executes instructions as they come in. It may be that the script starts by moving important directories aside or by creating large temporary files, so if the script is incomplete the user may end up with a broken system. Or maybe if you're really unlucky an attacker might manage to truncate "rm -Rf /..." to "rm -Rf /".
With hashpipe, you are at least guaranteed to have the complete script before you run it. I still don't like the practice, but it is better.
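You can see the streaming-execution behavior locally, no network needed:

  # sh runs each command as soon as the line arrives; if the stream is cut,
  # whatever has already arrived has already run.
  { echo 'echo step one'; sleep 5; echo 'echo step two'; } | sh

"step one" prints immediately, five seconds before "step two".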
Deployment one-liners with hashpipe will only work if hashpipe is installed, and installing it properly would be just as difficult for users as installing the software itself. Then you'd need something like this:
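(A sketch; the install URL is hypothetical.)

  # First bootstrap hashpipe itself via the very pattern it is meant to fix...
  curl https://hashpipe.example.com/install.sh | sh
  # ...and only then use it to verify the actual software.
  curl https://project.com/script.sh | hashpipe <somehash> | sh

...which of course just moves the trust problem back one step.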
In addition to your point (and perhaps this has already been made elsewhere in the thread): the type of people who would install hashpipe are the kind of people who would avoid the practice entirely.
Perhaps there's some mechanism for convincing people to use the tool, and if that mechanism ends up being easier than convincing people to stop piping directly to bash, it sounds like it would be worth pursuing :)
Well, it'd be easier to get hashpipe into all the typical package managers than to get every piece of software ever into them (though that way would of course be preferable).
Since the main use case for this utility is verifying network shell scripts, it would be interesting to see a query param convention, so we could use a tool such as:
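Purely hypothetical (neither the tool nor the query-parameter convention exists), but something like:

  # The expected digest rides along in the URL itself...
  safecurl 'https://project.com/script.sh?sha256=<somehash>' | sh

  # ...which a thin wrapper could implement by splitting the URL and reusing hashpipe:
  safecurl() {
    url=${1%%\?sha256=*} hash=${1##*sha256=}
    curl -fsSL "$url" | hashpipe "$hash"
  }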
When the just-over-the-next-hilltop promised-land nirvana of content-centric networking arrives, the hash will be enough to locate and download the content, so you shouldn't even need a URL:
$ hashcurl QmUJPTFZnR2CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8 | sh
Maybe it's even a special filesystem path that contains (but does not list) everything-that's-nameable-and-findable:
$ sh /everything/QmUJPTFZnR2CPGAzmfdYPghgrFtYFB6pf1BqMvqfiPDam8
(BTW, personally not a fan of the opaque 'multihash' format, which obscures the algorithm-in-use to save a few characters.)
If any of you are designing a system like IPFS, please use Merkle tree roots as the identifiers (and make sure leaf nodes are hashed differently from inner nodes, as in THEX, or better yet Joan Daemen's Sakura construction).
The main reason is that at some point you probably want to support downloading from multiple untrusted sources. Properly implemented Merkle trees (such as using the Sakura construction) are provably as strong as the underlying hash algorithm and allow lightweight cryptographic proof that a given block belongs to the file in question and belongs in the claimed place.
Gnutella uses the SHA-1 of the whole file as the identifier and then just assumes the first peer to give it a Merkle tree root is giving it the correct root. Because there's no guaranteed consistency between the identifier and the tree root, this makes it vulnerable to an attack where an attacker tries hard to be first and hands out a root for a corrupted version of the file, so the peer wastes a lot of time and bandwidth re-downloading perfectly good blocks that fail to verify against the bogus root.
Alternatively, you could make the identifier the concatenation of a full-file hash and a Merkle tree root. However, since the Sakura construction (like several other constructions) is provably as strong as the underlying hash, this is a waste of space. Concatenating the SHA-256 of the file with a SHA-256 Merkle tree root only gives you as much strength as the first 257 bits of a SHA-512 Merkle tree root. If you can spare the space for longer identifiers, you're better off using a longer hash instead of two shorter hashes.
In other words, it's a waste of space to concatenate a whole-file hash with a Merkle tree root if you use a tree construction that's provably as strong as the underlying hash function.
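To make the leaf-versus-inner-node point concrete, here is a toy two-leaf sketch of THEX-style domain separation (not Sakura itself; assumes bash, sha256sum, and xxd):

  # Leaves are hashed with a 0x00 prefix byte, inner nodes with a 0x01 prefix,
  # so a leaf hash can never be reinterpreted as an inner-node hash (or vice versa).
  h_leaf() { { printf '\000'; cat; } | sha256sum | awk '{print $1}'; }
  h_node() { { printf '\001'; printf '%s%s' "$1" "$2" | xxd -r -p; } | sha256sum | awk '{print $1}'; }

  l1=$(printf 'block one' | h_leaf)
  l2=$(printf 'block two' | h_leaf)
  h_node "$l1" "$l2"    # the two-leaf Merkle root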
It's not to save a few characters. It's to allow for flexibility of encoding: "sha1" all but requires ASCII, and sometimes you're limited to hex, base64, etc.
Not sure I understand; almost any context where "QmTpn…" can appear could also handle a "sha1:" human-readable prefix. Including, notably, this shell context.
If another context is sufficiently different, it would be fine to use a controlled fixed-length binary-vocabulary there. (But, the ASCII bytes for "sha1:" would still be a pretty robust and largely self-documenting choice.)
oh the "Qm..." string is the base58 representation. you can have a binary packed version as well. check out more at https://github.com/jbenet/multihash/
the important thing here is standardizing the encoding, i.e. there should be a valid repr in {binary, hex, b32, b58, b64, ... bemoji ...}
so that, say, if your input field _only takes hex_ you have a way to enter the hash. this problem surfaces when tools expect reprs to be in a particular encoding (all too common), and you dont have power or access to make the tool better (sadly all too common too).
So... it's readable in one out of dozens of possible encodings? And realistically, in that encoding (ASCII), the part following the readable prefix is unreadable garbage. In exchange for this convenience, you add four bytes of unreadable prefix instead of one.
I'm not sure I see the point.
I deal with outputs from crypto functions on a daily basis. Never once have I intentionally rendered it as 8-byte ASCII. It's most often in hex. Mixing and matching encodings (e.g., prefixing a hex string with an ASCII string) for a single blob of data is silly and just causes headaches whenever you need to change encodings, either by having some data double-encoded or by having to specially handle the prefix separately.
My hunch is that the overwhelmingly dominant and important use case is where these identifiers appear in URLs, including URL fragments (and potentially in brand-new protocols). A few other important cases are also where they're visible to people, as in the example (or hypothetical) command-lines that kicked off this thread. In those cases, explicitness-at-a-glance helps, as will having one canonical encoding (such as b32, b64, or b58).
Compared to that, adaptation to other constrained systems is a case-by-case issue. And if those systems are already capable of squeezing in these non-native slightly-longer hash-like-strings, then a few more bytes usually won't hurt, or if they do then whatever deep-in-the-bits coder (like yourself) who's shoehorning things in can handle compactification. The display/exchange format should be as casually readable as possible.
Such an ASCII-name-inspired prefix is readable in all encodings. In some, it's just a magic number (but one that any coder can make educated guesses about); but in the one encoding that's likely most important for mutual comprehension between humans (user-exchanged strings and URLs), it's super-duper-readable.
And yes, the prefix should in general have special handling: as the multihash project README notes in an important "warning", the prefix lacks the same distribution as the other bytes. Treating the whole thing as an opaque-but-still-reliable identifier invites indexing misoptimizations right off the bat, and then other later bugs, if any of the hash functions become deprecated, or a new hash is added with variant semantics.
Not only could it, downloading additional code is often the entire point.
Additionally, it's common for the installer to include things like version numbers, which means the hash will change with each release.
Meteor suffers from the "|sh" install pattern. Creating a Docker packaging of it that felt safe required a lot of extra work as a result:
1) Transforming the installer file into a canonical form free of version numbers. This verifies whether the assumptions made about the installer are still valid. It also enables a "latest" tag which installs whatever MDG has currently published.
2) The installer is patched so it checksums the tarball it downloads.
(To be clear, I'm aware of areas both upstream and downstream in the process where unverified code could sneak in easily. But at least I can feel good about the part I'm responsible for.)
Please, please don't use hashpipe thinking you'll be super safe about everything. It only raises the bar a bit! It solves my biggest gripe with most `curl <url> | sh` things, which is that any MITM can own my machines without compromising the origin HTTP servers.
(Of course, if the HTTP server + page you got the checksum from is owned too -- good luck!)
Good point - of course, this would mean that the additional code was vetted (or at least consciously included!) by someone who cared enough to use hashpipe in the first place.
You might want to `set -o pipefail` in your bash scripts, because by default a failing process in the middle of a pipeline doesn't affect the pipeline's exit status:
`echo OK | false | echo OK2` -- this command returns a zero exit code even though `false` returns non-zero. If you `set -o pipefail`, the entire pipe will fail with the non-zero exit code (of `false`).
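For example, in an interactive bash session:

  $ echo OK | false | echo OK2; echo "exit=$?"
  OK2
  exit=0
  $ set -o pipefail
  $ echo OK | false | echo OK2; echo "exit=$?"
  OK2
  exit=1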
It doesn't have to read into memory; if that becomes a problem they can fix it. The hashing algorithm just needs a small, constant amount of state. While it is hashing, it can divert the data to a temporary file. Then it can just pass the file to its standard output.
A "hashexec" command could also be produced which passes the name of this temporary file to a subprogram:
whatever | hashexec <hash> arbitrary --command with args
This "arbitrary" command's standard input is redirected from a temporary file created by hashexec (so no cat-like loop has to execute to feed the data).
Also, since this is for scripts, there could be an argument which limits the size. This could have default value, say one megabyte. Anyone pulling down scripts which are anywhere near one megabyte has to add explicit overrides for the size:
whatever | hashpipe --max=4M <hash> | ...
The real problem isn't that hashpipe buffers everything, because the scripting language running the script will also do that; rather that because it buffers everything, it can be DDoS'ed with an infinite stream or whatever.
By incorporating a size limit, hashpipe could provide an additional protection measure to the next pipeline element: it protects against content which doesn't match the hash, and against content which is too large.
Idea: the size of the input could be encoded as a few digits of the hash. Then the argument is unnecessary. The hash itself tells you that the script is exactly 6059 bytes long; if you read 6060 bytes, the input is not the right one. Or the program could just stop reading at 6059 and check the hash at that point, and either pass on the 6059 bytes or error out.
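A hypothetical sketch combining the hashexec idea above with such a size cap (the function, its argument order, and the choice of SHA-256 are all made up; assumes sha256sum and mktemp):

  # Stage stdin to a temp file, refusing anything over a byte limit, verify the
  # hash, then run the given command with stdin redirected from that file.
  hashexec() {
    max=$1 want=$2; shift 2
    tmp=$(mktemp) || return 1
    head -c "$((max + 1))" > "$tmp"
    if [ $(wc -c < "$tmp") -gt "$max" ]; then
      echo "input exceeds $max bytes" >&2; rm -f "$tmp"; return 1
    fi
    if [ "$(sha256sum < "$tmp" | awk '{print $1}')" != "$want" ]; then
      echo "hash mismatch" >&2; rm -f "$tmp"; return 1
    fi
    "$@" < "$tmp"; status=$?; rm -f "$tmp"; return "$status"
  }
  # e.g.:  whatever | hashexec 1048576 <hash> arbitrary --command with args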
Yeah, hashes are decided by every single bit, including the last one, and not a single bit should be output until the hash matches. It currently buffers everything in memory, but might do this: https://github.com/jbenet/hashpipe/issues/1
(some settings don't have a disk, though).
hashpipe is intended for typical executable use cases (usually under 50MB).
You should be able to detect if the stream is seekable (by checking the result of 'lseek(fd,0,SEEK_CUR)') and only buffer if it's not.
Of course, if you're really paranoid, the file could get changed out from under you. But honestly you're probably screwed either way with an attacker who can do that.
Loading everything into memory at once shouldn't be necessary to produce a hash of the entire input. All of the hash functions currently supported allow for incremental hashing. That means you can hash in blocks instead of all at once.
Another fun option would be to use a tree hash. The distributor of the content-to-be-verified would then envelop it in a format (to be defined) that includes proofs-up-to-root every N bytes. Then the verifier could stream, and know that everything it emits fits under the target hash, needing only N bytes of working space.
Caveats:
The source doing-the-enveloping will need two passes (and enough working space for the remainder-tree).
An attacker could still choose the moment-when-content-goes-bad; in the envisioned use of immediately-executing the verifier output, this might leave things in a problematic/resource-consumptive state. (Scripts could be hardened against such partial-execution failures.)
Cool. The use case that comes to mind is knowing you're going to get the file you think you're getting when you look it up in some local or remote hash-indexed storage scheme.
It would be nice if we had (it probably exists already) a standard block-based scheme so files larger than RAM could be handled without writing to disk. Making something like a .torrent for every file, using that for the per-block checksums, and then checking the hash of the concatenated checksums could do it, right? I'm sure there is a real name for that, but I don't know it.
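A rough sketch of that idea, assuming GNU coreutils and a placeholder file name (this is basically what a .torrent's piece hashes are):

  # Split into fixed-size pieces, checksum each piece, then hash the list of checksums.
  split -b 4M big.iso piece.
  sha256sum piece.* | awk '{print $1}' > pieces.sha256    # per-block checksums
  sha256sum pieces.sha256                                 # hash of the concatenated checksums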
Really this seems like something the FS should be asked to do. ZFS seems to be close, but I haven't found out how to address a block by its hash yet.
How does it protect against man-in-the-middle attacks? The man in the middle can simply replace the hash. If there's an additional communication channel to provide the hash, you could simply provide the whole command with it.
This is cool. What was your thinking about potentially hashing whole commands (and even their executables) in addition to their parameters? You could be the hash (with, as you say, PKI) gateway for all execution... :)
Maybe OSX developers should just stop asking people to pipe random scripts into their shell. This is a nice idea with a clever name, but it's kind of a false sense of security.
Getting this in before the gratuitous negativity brigade starts hammering down:
This is an implementation of a stupid joke looking for a problem. If someone actually needed this in their toolbelt, they could use Perl or even (amazingly) bash to handle it.
I would personally not name anything I worked on after a song from one of Weezer's shittiest albums, but that's just me.
[0] https://bootstrap.pypa.io/get-pip.py