Show HN: Rain hashes – well designed, simple and fast variable sized hashes (github.com/dosaygo-research)
79 points by keepamovin 9 days ago | 52 comments





This is my hash, included in the best hash-testing library out there, "SMHasher3", maintained by Frank T. Wojcik: https://gitlab.com/fwojcik/smhasher3 This test library has many improvements over SMHasher 1 & 2, listed at that link. Results are: https://gitlab.com/fwojcik/smhasher3/-/blob/main/results/REA...

Rainbow is the fastest 128-bit hash and the fastest 256-bit hash (non-crypto). It is the 8th fastest 64-bit hash by family (9th fastest overall, and 13th fastest overall if you include 32-bit hashes). The advantages of rainbow are its easily scalable output size (64, 128 and 256 bits), its high quality (it passes all the tests), and its utter simplicity: it is under 140 source lines of code, easily readable.

The random constants are primes that were selected based on their avalanche quality under a multiply-modulo operation. This part of the development was interesting: different primes had very distinctly different avalanche qualities. The highest-quality primes caused good avalanche (~50% bit-flip probability) across the widest possible set of bits, on average. These are quite rare. A lot of large primes only avalanche across a narrow range of bits, even in a 128-bit space. The search program took a couple of days to discover all these primes running on a 2020s-era MBP. Primes are chosen because they give a complete residue set under the modulus, ensuring a long cycle length, at least with regard to the nominal test operation.
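For a flavor of what that kind of search measures, here's a minimal sketch (my own illustration, not the actual search program) of testing a multiplier's avalanche under multiply mod 2^64. The constant used is an arbitrary odd stand-in, not one of rain's actual primes:

```python
# Sketch (not the actual search program): estimate a multiplier's
# avalanche quality under multiply mod 2^64. The constant below is an
# arbitrary odd stand-in, not one of the primes used in rain.
import random

def avalanche_scores(mult, width=64, trials=500):
    """For each input bit, estimate the average fraction of output
    bits that flip when that input bit is flipped (ideal: ~0.5)."""
    mask = (1 << width) - 1
    scores = []
    for bit in range(width):
        flipped = 0
        for _ in range(trials):
            x = random.getrandbits(width)
            a = (x * mult) & mask
            b = ((x ^ (1 << bit)) * mult) & mask
            flipped += bin(a ^ b).count("1")
        scores.append(flipped / (trials * width))
    return scores

scores = avalanche_scores(0x9E3779B97F4A7C15)
# The top input bit can only ever flip the top output bit under
# multiply mod 2^64 - the "narrow range" problem mentioned above.
print(f"min={min(scores):.3f} max={max(scores):.3f}")
```

Low input bits mix well under a good multiplier; high bits cannot, which is one reason real designs combine the multiply with rotations or XORs.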

The rest of the hash was developed by trial and error using my intuition on developing hashes arising from long experience of doing so, and using SMHasher3 to evaluate the results, by iterating to improve and re-testing, over a period of a couple of weeks in the holidays a few years ago. I started making hash functions in my teens as a fun hobby.

There's also a fun little wasm powered dashboard you can play with, here: https://dosaygo-research.github.io/rain/


I am using fnv-1a to hash Objective-C method selectors, which are generally just identifier characters and zero or more ':'. At the time of my research, fnv-1a had the fewest collisions over my set of "real life" selectors. I think it could be worthwhile, at some point, to try out other constants for maybe even fewer collisions. Is your list of good primes available? (And maybe also those that are not quite perfect.)
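For reference, here's a minimal 32-bit FNV-1a with the standard published constants (offset basis 2166136261, prime 16777619); the selector strings below are just made-up examples:

```python
# Minimal 32-bit FNV-1a with the standard published constants
# (offset basis 2166136261, prime 16777619).
def fnv1a_32(data: bytes) -> int:
    h = 2166136261
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

# Made-up selector-like strings, just for illustration.
selectors = [b"initWithFrame:", b"setObject:forKey:", b"description"]
hashes = [fnv1a_32(s) for s in selectors]
```

Swapping in other multiplier constants is a one-line change, which makes this kind of experiment cheap to run over a real selector corpus.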

Everything is in the source code. I highly doubt any of the good hash functions listed in smhasher3 (i.e. all tests passed) would collide over identifiers.

So they should all have zero collisions, meaning there's no 'least' among the good-quality ones - they're all equally collisionless (they differ in other tests).

Sounds like an interesting project. What’s its purpose?


Cool. I forgot to mention that I am truncating the hash down to 32 bits to hopefully generate tighter CPU instructions. At these few bits, collisions are still rare enough, but they are a concern.

Now my understanding of the choice of prime is that you are "weighing" the input bits and the computed bits that will form the hash. So in the case of identifiers, it's very likely that bit 7 of the input is always 0, and maybe bit 4 is statistically more likely to be 1 by some margin. The other input bits would have some entropy as well. I would expect that certain (imperfect) primes would then help me get better use of the 32-bit space, and therefore less risk of a collision, for my Objective-C runtime.

You can check out the project here: https://github.com/mulle-objc.


Ah, that's interesting. At 32 bits, yes, you would get some collisions even from good hashes, just statistically.

Also now I understand your constraints. Very interesting, so you are designing a custom hash function to use in this specific domain of keys with specific probabilistic properties, and you are thinking that there would be some way you could multiply by a certain prime that would ideally fan out these keys to be evenly distributed over the space?

mulle-objc looks fascinating: a fast, portable Objective-C runtime written 100% in C11. I encourage you to post a Show HN - I'm sure people here would like it.


Actually I already did, a few years ago: https://news.ycombinator.com/item?id=13030568

Hahaha! :) Good on you. Nothing to stop you posting again :)

Truly I have much experience with posting Show HNs. There's very little quality difference between something you post that gets 3 points and something you post that gets 300 points. A lot of it depends on timing, the current dynamic audience, and the other posts at the time.

Repeat post to get a better idea of the true interest in your work. I encourage you to do it again!!! :)


Spooky also has some good results on common identifiers.

But fnv-1a is in a completely different league. It's recommended for hash tables that rely on security measures other than the hash function itself. This hash is a typical hybrid, but not universal. umash would be the perfect hybrid: insecure, pretty fast, passes all tests, universal.


Thanks for the umash tip.

Broadly: I don't understand the design space of "cryptographic-ish" hashes. There's an intensely competitive design space of cryptographic hashes. What does it mean to be "more secure" than an insecure hash function? When would it ever be appropriate to use such a hash?

I have the same question.

Might be useful to clarify for people replying with suggested use cases: this repo has two hash functions, describing them as “General-purpose, non-crypto hash” (OK) and “Experimental cryptographic-like hashing” (???)

The question is, what is that second one for? Do they just mean it could eventually become a cryptographic hash?


Thanks for adding the clarification that yes, it includes two hashes: one not meant as a crypto hash; one proposed to be a crypto-hash. The text in the README means it (the second one) is proposed as a crypto-hash. But the wording you quoted does look unclear - is that really what we have in there?

edit: AH, I see you mean in the comparison table. Yeah that wording could be improved. I'll change it now. The rest of the text is hopefully clearer:

"Consider it (rainstorm) experimental, but designed to be secure. Note on Cryptographic Intent: Rainstorm's design includes concepts from cryptographic hashing, but without formal analysis, it should not be considered secure. Feedback and analysis are welcome."

Although, without a bunch of accepted analyses, people shouldn't trust a cryptohash that merely claims to be secure - which this doesn't claim.

It is funny people get obsessed with wording when what's really needed is evidence.

I will take a look at the wording again now and if there's any wording that's insufficiently clear that rainstorm is proposed as a crypto-hash and requires analysis, I'll update such wording.


That's a clear distinction, thanks for bringing it up.

For clarity, this lib has 2 hashes: rainbow and rainstorm. "rainbow", the faster one, is not intended to be crypto-secure. "rainstorm", the slower one, is designed for strength, but it has not been analyzed publicly.

My intent is that rainstorm is proposed as a cryptohash and eventually is analyzed properly in order to be trusted, and used - or improved! We probably can't remove the "ish" until then.

In terms of saying "more secure"? Without formal cryptanalysis it is hard to say, but if you have good statistical properties (as measured by these tests in SMHasher3), it's fair to say you're more secure, as a lot of the attacks on (cryptographic) hashes have to do with finding dependencies between input and output ("differentials").

I wouldn't make sweeping recommendations unless a hash was vetted because that's how we end up trusting primitives, through the trust in the collective work of people trying to attack them. Right now we just have the speed results, the SMHasher3 results and the design, and some limited analysis. But more is coming, some people are doing good stuff on GitHub right now.

In terms of aptness, to use either of my hashes in rain as a cryptohash without having a deep understanding of potential attacks against it would be rolling the dice, but you could probably trust that at least potential attacks would seem to be difficult given the lack of statistical predictabilities. And they're both fine to use as regular non-cryptographic hashes.

Of course I want to recommend people use them, but I couldn't say use them in crypto applications without being irresponsible and I don't want to do that. Also, to be more lawyerly about it, heh, it's provided "as is".

I guess the main draw right now is if you want good statistical properties and hashes that can be 64, 128 or 256 (or even 512 in rainstorm) bits wide, and fairly fast for that.


What does "vetted" mean to you here?

Just held to the same standards of cryptanalysis that the community decides imbues a cryptohash with trust for use.

Pretty standard, and consistent with the rest of the text.

What did you think it meant?


Nothing in particular, but, like, there was just a NIST competition that produced SHA3, and "vetted" Blake2, and this wasn't submitted to it. It's not like there's a test suite to which you can just submit a cryptographic hash to validate it. You actually have to convince a bunch of mathematicians to do studies of your hash, right?

I'm not trying to pick on you, it's just that this keeps coming up with constructions like PCG. There's a cryptographic literature, and then a side-hustle of people building stuff like this. We used to have a sci.crypt discourse about it, but now I worry that HN has lost the antibodies.


There's 2 hashes here. Rainbow is non-crypto; rainstorm is proposed as a crypto-hash, requiring cryptanalysis. It's not claiming to be validated as a cryptohash by SMHasher3, which tests statistical properties. You need (crypt)analysis to say it's a secure and trusted crypto-hash. This was developed a couple of months past the NIST submission deadline for SHA3.

Maybe your immune response was wrongly triggered? It happens. But what we really need antibodies against is excessive negativity and vague criticism, combined with attempts to justify the same with "you're not someone who is supposed to do that" gatekeeping.

Could some of this be motivated by the fact that you want to do stuff like this yourself, but instead of doing so, or analyzing it, you just criticize it, inappropriately? I don't think you're trying to pick on me, just confused and powered by righteous protective instinct that is better were it more informed and discerning. Reflect some?


No immune response! Making new things and posting them is an absolute good thing. I'm literally asking, straight up: what's the use case for a hash that is "more secure" than an insecure hash, but "less vetted" than an actual cryptographic hash. That's all.

Alright - if you want a strict use case for rainstorm in 2024 before public cryptanalysis: as a subject of cryptanalysis and experiment; as a way to make your name as a junior cryptographer attacking it.

What other ideas can you think of? How about in a construct for password hashing? You can add more rounds to the loop to increase the cost like pbkdf1/2.
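A minimal sketch of that rounds idea, with SHA-256 from hashlib standing in for rainstorm (assumption: any hash with a similar interface would slot in the same way):

```python
# PBKDF1-style stretching sketch. SHA-256 stands in for rainstorm here;
# the salt/round scheme is illustrative, not a vetted construction.
import hashlib

def stretched_hash(password: bytes, salt: bytes, rounds: int = 100_000) -> bytes:
    """Iterate the hash `rounds` times to raise the attacker's cost."""
    h = hashlib.sha256(salt + password).digest()
    for _ in range(rounds):
        h = hashlib.sha256(h).digest()
    return h
```

In practice you'd reach for a vetted KDF (PBKDF2, scrypt, Argon2) rather than hand-rolling this; the sketch only shows how the round count becomes the cost knob.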

For rainbow you can use it in place of whatever other non-crypto hash you have, especially if you want fast 128 or 256 bit widths.


Yep. Dunno. A question! It's the Internet, I get it, I'd bucket responses into "this is amazing" and "this is the worst" too. I have a more philosophical question here. :)

You would? That’s not what I’m doing: "bucketing". I always assume good, but look at everything closely, and call it out if I see it. I respond to what’s there individually.

If you're trying to be friendly you got a funny way about it. What’s your philosophical question?


The one at the root of the thread: "why would I ever use a hash that's slower than the fast, 'insecure' hashes, and isn't a serious cryptographically secure hash". This comes up a bunch!

Rainstorm is serious, but not yet publicly vetted. Anyway, I thought that might be your question. You don't have any answers?

If not...Oh well, you might have to wait until it gets analyzed a lot. Use Rainbow now without fear: fast, statistically great, and unlikely collisions - which makes most uses better. As someone else put it: "fastest 128-bit and 256-bit non-crypto hash, passes all tests, and under 140 source lines of code."


Why do you expect cryptography researchers to analyze your hash? That's maybe the question I should have started with.

Probably is. Why do you not expect them to?

I'm sure they will eventually, not too long. What's your big concern / lack of confidence here?

In the meantime you can take care of it and nurture it by encouraging people to analyze it, using your influential voice for good.


I'd be happy to if I understood better why someone would want to undertake that project, which is why I asked. There is an enormous literature on cryptographic hashing; where does your hash fit into it?

Nice question! Rainstorm pulls from a rich history of ciphers and cryptohashes. It’s based on ARX-like mixing (though it swaps addition for subtraction—hence XSR: XOR, Subtract, Rotate) to create non-linearity, while borrowing structural ideas from Feistel networks, sponge constructions, and Merkle–Damgård iteration. It fits into the “experimental hybrid” category, exploring whether combining paradigms can offer fresh resilience or reveal new weaknesses. Definitely more to expand on, and you can read more, here: https://dosaygo-research.github.io/rain/paper/crypto-note.pd...
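To make "XSR" concrete, here is a purely illustrative round in that style - not the actual rainstorm round function; the rotation amounts and round constant are arbitrary placeholders:

```python
# Purely illustrative XSR-style step (XOR, Subtract, Rotate) on 64-bit
# words. NOT the actual rainstorm round; constants and rotation amounts
# here are arbitrary placeholders.
MASK64 = (1 << 64) - 1

def rotl64(x: int, r: int) -> int:
    return ((x << r) | (x >> (64 - r))) & MASK64

def xsr_round(a: int, b: int, k: int):
    a ^= k                    # XOR in a round constant
    a = (a - b) & MASK64      # Subtract in place of ARX's addition
    a = rotl64(a, 17)         # Rotate to spread the borrows
    b = rotl64(b, 29) ^ a     # cross-mix the other word
    return a, b
```

Like ARX, each step is cheap and invertible on its own; the mixing strength comes from composing many such rounds.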

Hopefully this gives an overview of some of the historical influences and inspirations.


Checksums are a good use case, where you’re not worried about any sort of forged collisions and you want to calculate the checksum as fast as possible. Although cryptographic hashes are fast enough that there likely aren’t too many use cases where the difference in speed is meaningful.

If you don't care about forged collisions, what makes a secure-ish checksum better than a faster insecure checksum? Both are likely to result in a different value for bit-flipped input, which is the point.

Note that you said “likely” and not “equally likely”. Therein lies your answer.

There are applications where hashes are used as identifiers, where it would be impossible to use the original data to resolve possible collisions. One example would be in RTTI (Run Time Type Information), when you want to check if two objects (of unknown-at-compile-time type) are instances of the same type. The only performant way to achieve this is if the compiler emits each type's RTTI with a unique integer key to efficiently compare against, and this key must be unique across all object files and shared objects. In practice, this is done using hashing. If there is a collision, then the program behavior is undefined, so it is ideal to minimize the probability of collision. You can see this issue in the Rust compiler project, where there is ongoing discussion about how to handle this hashing, trying to balance compiler speed and collision likelihood:

https://github.com/rust-lang/rust/issues/129014

Another scenario would be like what Backtrace does: a hash is used to identify certain classes of errors in application code. The cost of analyzing each error is quite high (you have to load objects into memory and source maps (if available), perform symbolication, etc.), so comparing any two errors every time you want to compute metrics or retrieve associated data (like issue tracker identifiers) would reduce performance to the point that any such application would be too slow to be useful. So a lot of thought was given to producing fast, non-cryptographic hashes (these hashes aren't shared across tenants, and no customer is going to go out of their way to work out initialization parameters so they can be their own adversary) with a known likelihood of collision. You can read about the hash (UMASH) here:

https://engineering.backtrace.io/2020-08-24-umash-fast-enoug...


Well, also notice that I said "bit-flipped input." I.e., the use case I believed we were talking about was data-corruption detection (which is also associated with "checksum" as a term of art, as opposed to a generic "hash"). I agree that hashing for identity is a different use case than checksumming for data integrity.

FWIW I think commonly used checksums are basically ideal ("equally likely" to come to a different value) on single-bit flipped inputs, but don't quote me on that.


That’s a good question. A cryptographic hash must not leak any information other than “yes” or “no”. A checksum-type hash might have an additional property where data that only has a single bit flipped might hash to something close to the original hash, allowing you to refine your collision attack by knowing if you’re hot or cold.

For a checksum, this isn't a bad thing and actually could be good.


Hash tables and related data structures.

If the distribution of bits isn't better, you would prefer a faster hash for a table that doesn't care about adversarial inputs. (And if you do care about adversarial inputs, I think you need an actually secure hash? Medium confidence on that.)

For adversarial inputs, even a cryptographic hash won't save you: the cost of brute-force searching for pathological keys is based on the number of buckets, not the cost of finding full collisions.
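To illustrate the point: with an unkeyed hash, an attacker can find keys landing in one bucket at a cost of roughly one hash call per bucket count, per key, even using SHA-256. A quick sketch (bucket count and key scheme here are arbitrary):

```python
# Sketch: finding many keys that land in one bucket of an unkeyed hash
# table costs ~nbuckets hash calls per key, even with SHA-256.
import hashlib

def bucket(key: bytes, nbuckets: int = 256) -> int:
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % nbuckets

colliding, i = [], 0
while len(colliding) < 10:
    k = str(i).encode()
    if bucket(k) == 0:
        colliding.append(k)
    i += 1
# All ten keys now hash to bucket 0 - a pathological insertion sequence.
```

This is the gap keyed hashes like SipHash address: with a secret seed, the attacker can no longer compute bucket placement offline.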

There's a whole literature of cryptographic hashes designed for this problem.

Content addressable storage. For example, the Git DAG would be one place to use it (depending on which camp you're in: should the DAG be cryptographically secure, or should the security be out-of-band e.g. signatures).

What operations would be faster in Git from using an insecure hash instead of SHA256? Commit is already reasonably fast.

Where in my comment do I say it should be used? I merely used Git as an example of a production CAS.

> For example, the Git DAG would be one place to use it

I read that as you thinking Git is a motivating use case for this hash.


The benchmarks are bogus, at the least: you are essentially just measuring the startup time of the executable/runtime and not your hash function.

Right - how would you improve that? That Node.js bench script is meant to compare between the C++ and WASM builds.

For hash speed, consult the SMHasher3 results.


Just to make sure I'm understanding correctly, I think that your intent is that rainbow is / should be treated as cryptographically weaker than SIPhash, right? (despite both of them not being something you would use in a high security context)

Right: rainbow has not been evaluated for HMAC or PRF (Sip's main security objectives), but it's not designed to be secure, just to have good stats and be fast af. I wouldn't be surprised if it was a good PRF, but also wouldn't be surprised if it was broken (by the standards of cryptohashes or HMAC or even PRF).

Some more about sip: https://docs.kernel.org/security/siphash.html


However there’s some chance it’s actually better and more secure than Rainstorm

The fastest 128-bit and 256-bit non-crypto hash, passes all tests, and under 140 source lines of code

Nice!


Yes. Thank you :)

It’s pretty good


I failed to integrate it into my smhasher, the rain branch. Lots of collisions and wrong results all over. Pretty fragile for a workday

Hey Reini, I'm no C++ expert, but yes, your SMHasher2 is pretty fragile. I don't know about "for a workday" or not, but it's fragile. And what you said is a pretty nasty thing to say, given there are lots of bugs in your SMHasher2 that have already been fixed in SMHasher3. You can't blame a hash for what's your fault. You could always just do fixes instead of saying toxic, nasty stuff.

If you want to put it in, take a look at the SMHasher3 library code, and the rain integrations there. I think you’ll have to port it as there’s some internal differences with your SMHasher2. Also, I'm sorry for the unclear and wrong default branch: it should be 'boss' branch. That's corrected now.


smhasher2 is not fragile at all. The rain test API functions are static in a .cpp. This will not work anywhere other than in smhasher3 with its special wrappers. You need to put them in headers if you want them static inline.

So, questions for people who know more than I do: what's up with the 'passing' and 'failing' hashes in SMHasher3?

Specifically, what happened to xxHash[1] to kick it down to the 'failing' category? I'm not certain about SMHasher's testing criteria, but comparing the benchmark numbers posted in Rain vs. xxHash shows xxH3 as rather faster? Not that I have a personally vested interest in xxHash, but it seems to be rather widely used (e.g. RocksDB [2]) with a number of implementations - it would be useful to understand the criteria that make it no longer a passing SMHasher3 entry.

[1] https://github.com/Cyan4973/xxHash

[2] https://rocksdb.org/blog/2022/07/18/per-key-value-checksum.h...


> Specifically, what happened to xxHash to kick it down to the 'failing' category?

Looking at [1], it seems to be collisions (search for "!!!!!").

[1] https://gitlab.com/fwojcik/smhasher3/-/blob/main/results/raw...



