I wonder if there are a lot of edge cases (large numbers that aren't easily decomposed into shifts/additions) where the performance decreases a lot. If these optimizations are cheap, why aren't they used everywhere?
No, actually the performance is good for almost all divisors. I had the same thought you did, so I grabbed the SVN and ran the benchmark. If we trust their timing, on a Core i7 the fast divisions were almost all around 4x the speed of the native divide instruction. There were surprisingly small differences between the 2^n divisors and the harder ones --- the magic of long pipelines, I guess.
These optimizations aren't used much because they are specific to the divisor used. For a single division, it would be faster just to let the silicon do the work. They only help when you are repeatedly dividing by the same denominator, which is rarely a bottleneck. But when it is (repeated integer divisions by a divisor that isn't hardcoded, so the compiler can't do it for you), this looks like a fine solution!