Engine Yard Programming Contest - Win an iPhone 3GS PLUS $2k in Cloud Credit

brianliu · on July 14, 2009

Hey, if you guys are working on this, check out these papers on SHA1 collisions:

generated in 2^39 perms - http://www.cs.cmu.edu/~dbrumley/srg/spring06/sha-0.pdf

generated in 2^52 perms - http://eprint.iacr.org/2009/259.pdf

Since it would take an average supercomputer 5-6 days to generate the collision, I think I'm going to give the contest a rest (generating something close isn't useful outside the contest).

Another way to go about this is to sort huge rainbow tables by distance, which have been already created by a few reverse hash lookups ( http://md5.rednoize.com/ , http://hashcrack.com, http://sha1-lookup.com ) - You'll probably need to contact them directly or hurry up and make your own.

Here's the SHA1 algorithm itself for anyone interested: (Google Cache) http://74.125.95.132/search?q=cache:QjZpAaNXJr0J:https://www...

moonpolysoft · on July 15, 2009

Those papers reference a collision attack, which is not the same thing as the point of this contest. The papers describe ways to generate pairs of messages that will hash to the same value.

The contest describes a preimage attack, where one has a hash value and must guess an input which would have generated that hash. SHA-1 has no known preimage vulnerabilities, so the contest winner will either have to be lucky or break SHA-1.

The best advice would probably be to iterate the keyspace in a novel way and hope you happen upon a good match that someone else doesn't get.

MicahWedemeyer · on July 14, 2009

I'm glad to see it's a fun challenge and not a "code an app for our platform" type contest. Those have really started to annoy me as they seem very close to spec work.

akeefer · on July 14, 2009

I might suggest that there's some irony in suggesting that one use Ruby to write an algorithm that's going to basically be 100% compute bound . . . seems like this contest is a classic example of exactly the sort of problem where you do, in fact, care about raw language performance.

gamache · on July 14, 2009

I don't see anywhere on the page where they suggest Ruby. As you mention, it is a terrible choice.

akeefer · on July 14, 2009

Well, the example they give includes the words Ruby, DHH, active, record, rspec, mongrel, jruby, and rubinius as the potential dictionary, and they close the post with "The Ruby community is nothing if not persistent, creative and intelligent — so show us what you’ve got!" So I'm kind of assuming the contest is aimed at the Ruby community.

ezmobius · on July 14, 2009

feel free to write the code in any language or platform you want, we don't care.

luccastera · on July 14, 2009

This is smart. After the contest, anytime you search for one of the terms in their dictionary on search.twitter.com, then @engineyard will show up (assuming there are a lot of entries).

All they have to do is make the dictionary include terms that potential engineyard customers might search for on Twitter.

johns · on July 14, 2009

Not exactly 'anytime' since Twitter search history doesn't go back that far. I can't find the link now, but I'm pretty sure it only searches up to about 30 days back.

brown9-2 · on July 14, 2009

Just to be clear, there is probably no realistic way of winning this without having lots and lots and lots of computing resources available to you once the dictionary is released, is there?

Since you need to choose a twelve word permutation from a possible set of one thousand words, plus append a 0-5 character random string, calculate the SHA1 hash for all possible permutations, and calculate the Hamming distance of each to the hash of the challenge phrase... correct?

  1000! / 982! permutations of dictionary words * 36^5 permutations of random chars

jcl · on July 14, 2009

Agreed... The average developer has no hope of winning this. It will almost certainly be won by someone with a roomful of PS3s or graphics cards who already cranks through hashes for a living. (like these guys: http://www.win.tue.nl/hashclash/Nostradamus/)

timmaah · on July 14, 2009

What about..

"You may permute capitalization for the dictionary words (i.e. you may use Ruby, rUby, RUBY, and RUBy)"

brown9-2 · on July 14, 2009

Forgot to consider that; along with the fact that the 5 random characters can be any printable ASCII characters, not just alphanumeric.

So this is somewhere upwards of 10^50 phrases to hash (in a naive implementation), right?

sophacles · on July 14, 2009

Well, are you also including the fact that there is nothing forbidding repeated words?

jcl · on July 14, 2009

While this has a marginal effect on the total number of possible hashes, repeated words are still an important consideration: If you can guess even one word that is likely to appear in their 1000-word dictionary, you can precompute the hashes for all possible variants of the corresponding phrase (that word repeated twelve times) prior to the contest, giving you a jump on other contestants.

I'd guess that we'll either see the rules revised to prevent this or see a dictionary filled with non-dictionary words.

bcl · on July 14, 2009

According to the rules you can append up to 5 random chars. But you don't have to.

The potential for case changes in the words really makes the keyspace much larger, since we currently don't know the length of the words.

ncarlson · on July 14, 2009

My thoughts exactly. Also, there are enough additional permutation rules to balloon the result space way beyond the reach of even a modest super computer.

One approach would be to lease a massive cluster of cloud computers to carry out the calculations. However, this option isn't financially viable for a lone developer.

Maybe there's an alternative way of approaching the problem.

smwhreyebelong · on July 14, 2009

You can permute the capitalization of the words in the dictionary, and the random characters as well ..

andreyf · on July 14, 2009

If you can guess 5 words in the dictionary, you can pre-compute over 7 billion guesses...

jah · on July 14, 2009

IMO, this seems to be more of a "lottery for nerds" than a programming contest.

ncarlson · on July 14, 2009

Here's some math:

http://www38.wolframalpha.com/input/?i=%281000+choose+12%29+...

The real killer is the allowed capitalization of any letter. If we assume that the average word length will be 6 characters, then we can assume that the average character length of any 12 word string will be 72 (not counting spaces).

With capitalization, the permutations act like a bit string. That is, the capitalization is either on or off. This gives us 2^72.

The resulting hash space is somewhere near (1000 choose 12) * (2^72) * (95 choose 5), or roughly 2^188.

edawerd · on July 14, 2009

Engine Yard seems to suggest that renting cloud computing resources is the key to win this contest, but I'd suggest using parallel computing resources to anyone doing this. If you have an Nvidia CUDA-enabled GPU laying around, this might be perfect for you.

sanj · on July 14, 2009

Any tips for walking to nearby hashes from HN's crypto-gurus?

jrockway · on July 14, 2009

This is the sort of thing that hash functions are designed to avoid.

Since getting closer words does not get you a closer hash, usual optimization algorithms (like genetic algorithms, simulated annealing, etc.) are not going to be helpful.

Perhaps I'm missing something, but I think brute force and luck are going to win this contest, assuming that SHA1 is not broken between now and July 20th. If you want an iPhone 3G S, you should probably just head over to the Apple store and buy one and spend your valuable time on something else.

sanj · on July 14, 2009

I'm aware that this is what hashes are supposed to avoid. However, many crypto attacks seem to build on weaknesses in their design.

Hence my question.

javanix · on July 15, 2009

If someone had come up with something like you suggested it would have been a lot bigger news and this "contest" really wouldn't have happened.

notaddicted · on July 14, 2009

(Not a guru)

I think the main challenges are deciding how to manage the brute force effort.

You can precompute a lot of hashes, and then search them on the contest day.

Or you can run it only during the 24H of the contest, and only store the result which is nearest.

And the second issue: how much resources to invest. A high CPU on EC2 is $20 for the day. Another doubles the price, and doubles the number of hashes you can check.

EDIT: The restrictions they place on the input makes rainbow tables kind of pointless. Just partition the search space (i.e. the permutations) pick a piece of it at random, and send it to your brute forcers.

EDIT2:

Back of the envelope: you will be able to try about 100million hashes for $20.

jf · on July 14, 2009

(not a crypto-guru)

If you use a library function to do your SHA-1, you're doing it wrong.

Pad out your test strings to a fixed length of 448 bits (assuming that the words in their dictionary allow this).

Use a custom hashing algorithm that skips the "pre-processing" step by assuming a fixed length string and skips the for loop since it only needs to process one chunk. http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1_pseudo...

gojomo · on July 14, 2009

It's like Swoopo for compute cycles!

callmeed · on July 14, 2009

I'm definitely not great at this kind of stuff, but I'll give it a try ...

Check out the Amatch ruby gem ... it can compute the hamming distance ... might be easy to automate with ruby ...

http://amatch.rubyforge.org/doc/index.html

potatolicious · on July 14, 2009

Except doing this through Ruby is a guaranteed bet to not get done in time.

sho · on July 15, 2009

I agree that the gem mentioned might not be suitable, but Ruby could still have a part to play, especially as it allows direct use of native libraries, or just managing processes running on the OS.

Someone doing this should probably write the fast-spinning cogs in C, but it might be worth using Ruby to manage threads, processes, data segments, etc - indeed, maybe prototype the C code there first too. You'd gain an awful lot of convenience for very little overhead.

jpwagner · on July 14, 2009

Anyone able to break 10 using the example they gave?

  "I am not a big believer in fortune telling"

sunir · on July 15, 2009

10 is a stretch. After a couple hours of computing several hundreds of millions of attempts, the best I achieved was a distance of 43 bits (assuming duplicate words are acceptable).

jruby rubinius MRI Cloud postgresSQL record MRI exception record mongrel tokyo Cloud

This task is so unbalanced as a contest. The thought of spending a couple hundred dollars of cloud computing time to just buy more 'lottery tickets' for $3000 worth of prizes (assuming you could use the cloud time) is too much of a stretch for me.

0wned · on July 16, 2009

10 is a huge stretch. My testing in C++ agrees with you. The lowest score I've gotten is 41 after about 12 hours. Sure, I may get lucky and get one in the high 30's or something, but it would just be blind luck and persistence, nothing more. Not a very cool contest IMO.

icey · on July 14, 2009

Cool contest, but starting it on a Monday morning sucks pretty badly.

profquail · on July 14, 2009

You should be coding it now, you know what all of the rules are. Come Monday, you would just plug in the desired hash value and word dictionary and let it run for the 24 hours of the contest.

I think I'm going to try to put something together this weekend just for the heck of it. It would be cool if they published the code (with permission) of the top submitters, just to see how they went about solving it.

icey · on July 14, 2009

If you can seriously write software and not have to tweak it to run as you go for this type of challenge, then my hat is off to you. Unfortunately, I'm a mere mortal and would mostly likely need to make changes once I see the "real" data / dictionary.

profquail · on July 14, 2009

You could make your own word dictionary, pick 12 random words (and append a few random characters to the end of the string) and come up with your own hash.

Now that you have the hash and the word dictionary, you "forget" that you know the real string, and you set your program to try to find it.

At least that would make sure there weren't any major bugs in it (logic bugs not included) before you tried to solve the real problem. Since you only have 24 hours to crack it, every extra second will count ;)

icey · on July 14, 2009

I don't think we're talking about the same issue.

Of course you'd have to write the application before the 20th. My point is that for people with day jobs, Mondays are almost always the busiest day of the week so I'm sure that will preclude a number of people from participating.

0wned · on July 16, 2009

I've got some C++ up and running now (using my own test hash and word list). The lowest hamming score I get (12 hours run time) is in the low 40s... anyone getting lower than this? I think there will be a lot of tied submissions... no clear winner.

wifelette · on July 14, 2009

Heya! That's a great idea -- will definitely work on that :)

oliverkofoed · on July 14, 2009

Is this a marketing or a recruitment exercise? Or both? :-)

profquail · on July 14, 2009

At least marketing, perhaps a bit of recruiting if you come up with a clever solution. If they aren't hiring, perhaps some else that is will see your name up on the leaderboard.