Hacker News new | past | comments | ask | show | jobs | submit login

You are describing a bad implementation of a bloom filter [1]. Anyway, people expect "uniq" to be correct in all cases (i.e., to never filter a unique line). A default implementation where it would possible (even with a minuscule chance) that this doesn't happen would be a recipe for disaster. It may be a cool option though ;)

http://en.wikipedia.org/wiki/Bloom_filter




No, this is not a bad implementation of a bloom filter, and the size of the minuscule chance matters; unless SHA-256 has a flaw in it that we don't know about, SHA-256 collisions are far less likely than undetected hardware errors in your computer. The universe contains roughly 2²⁶⁵ protons, 500 protons per distinct SHA-256 value, and has existed for roughly 2⁵⁸ seconds, which means there are roughly 2¹⁹⁸ SHA-256 values per second of the age of the universe.

Typical undetected bit error rates on hard disks are one error per 10¹⁴ bits, which is about 2⁴⁷. If your lines are about 64 characters long, you'll have an undetected bit error roughly every 2^(47-6-3) = 2³⁸ lines. SHA-256 will give you an undetected hash collision roughly every 2²⁵⁵ lines. That is, for every 2²¹⁷ disk errors, SHA-256 will introduce an additional error. If you're hashing a billion lines a second (2³⁰) then that will be 2^(217-30) = 2¹⁸⁷ seconds, while the disk is giving you an undetected bit error every minute or so. A year is about 2²⁵ seconds, so that's about 2¹⁶² years, about 10⁴⁹. By comparison, stars will cease to form in about 10¹⁴ years, all planets will be flung from their orbits around the burned-out remnants of stars by random gravitational perturbations in about 10¹⁵ years, the stellar remnants will cease to form cold, dark galaxies in about 10²⁰ years, and all protons will have decayed in about 10⁴⁰ years.

And if you somehow manage to keep running uniq on your very large file at a billion lines a second, in a mere 500 times the amount of time from the universe's birth to the time when nothing is left of matter but black holes, SHA-256 will have produced your first random false collision.


Another possibly relevant note: there are about 2¹⁴⁹ Planck times per second. None of the above takes into account the Landauer and related quantum-mechanical limits to computation, which may be more stringent.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: