There's a scheme described in this paper:
http://gmplib.org/~tege/division-paper.pdf
which works in practice for divisors of 2^b bits (e.g. 2^b = 32 or 64), i.e. divisors that fit in one machine word. It's trivial to implement and, amazingly, the precomputation plus the division together actually beat the chip's own division. Normally these methods only beat the chip on the division step once the precomputation has been done, not when you include the time for the precomputation too! This trick is used internally in the GMP library and its fork MPIR (and in various other places).
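To make the idea concrete, here is a minimal sketch in C of the two-word-by-one-word division with a precomputed reciprocal, as I read it from the paper. It uses 32-bit words so that plain 64-bit arithmetic can hold the double-word values. The names `reciprocal32` and `div2by1` are my own, not GMP's, and the reciprocal is computed here with an ordinary 64-bit division purely for clarity, so this sketch shows the structure of the method rather than its speed. It assumes the divisor is normalized (top bit set) and that the high word of the dividend is smaller than the divisor.

```c
#include <assert.h>
#include <stdint.h>

/* Reciprocal v = floor((2^64 - 1) / d) - 2^32 for a normalized divisor d
   (top bit set). The quotient lies in [2^32, 2^33), so truncating it to
   32 bits performs the subtraction of 2^32. */
static uint32_t reciprocal32(uint32_t d) {
    assert(d >> 31);                     /* assumption: d is normalized */
    return (uint32_t)(UINT64_MAX / d);
}

/* Divide the two-word value <u1,u0> by d using the reciprocal v.
   Requires u1 < d so that the quotient fits in one word.
   Returns the quotient and stores the remainder in *r_out. */
static uint32_t div2by1(uint32_t u1, uint32_t u0, uint32_t d,
                        uint32_t v, uint32_t *r_out) {
    assert(d >> 31);
    assert(u1 < d);

    /* Candidate quotient: <q1,q0> = v*u1 + <u1,u0> (cannot overflow 64 bits). */
    uint64_t q = (uint64_t)v * u1;
    q += ((uint64_t)u1 << 32) | u0;
    uint32_t q1 = (uint32_t)(q >> 32) + 1;   /* wraps mod 2^32 */
    uint32_t q0 = (uint32_t)q;

    /* Candidate remainder, computed mod 2^32. */
    uint32_t r = u0 - q1 * d;

    /* First adjustment: the candidate quotient was one too large. */
    if (r > q0) {
        q1--;
        r += d;
    }
    /* Second, rarely taken adjustment: the quotient was one too small. */
    if (r >= d) {
        q1++;
        r -= d;
    }
    *r_out = r;
    return q1;
}
```

For a divisor that is not normalized, the usual approach is to shift both the divisor and the dividend left by the divisor's number of leading zero bits, divide, and then shift the remainder back right by the same amount.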