I couldn't resist the challenge, so I wrote a non-recursive solution [0] in C++. It takes about 0.73s on an E5-4650 to loop 10000 times. Since writing it, I've tried testing it against bluecalm's solution, and how the two compare seems to be fairly sensitive to the particular compiler and hardware combination. But it's at least a slightly different take.
[0] http://pastebin.com/EciqQBg5