I completely agree with all of the above. I was doing a linear scan, and I know that that is artificial. Your stride suggestion does slow things down dramatically; thanks for the simple-to-implement idea.
(Shrinking the lookup table to 1/8 of its size, I don't know how to do that fast.)
The annoying thing about lookup tables is that they are hard to benchmark properly. But superficially they often look like a good idea. (Here it was faster than the fastest reported result when tested naively.)
> I was doing a linear scan, and I know that that is artificial.
May be artificial :). Don't forget that your application is the best benchmark: if it uses linear access, you can take advantage of that. Let's just say that linear access is a special case and random access a worst-case result. The random-access behavior is the one to remember, IMO.
> Shrinking the lookup table to 1/8 of its size, I don't know how to do that fast.
Here is how I do bit vectors. I'm not suggesting that anyone should use a lookup table for this problem, since there are very fast solutions that use O(1) memory, but let's use it as an example since we're all familiar with it.
Since you mention a 2GiB table I guess you use a byte vector. The article uses unsigned int as the type for the argument, which would need 4GiB for all values, so I guess you used int instead. We want to use every bit of memory in the array to store boolean flags. It's probably faster to use the native word size of the machine than byte-sized elements, so let's use unsigned int. I assume a 32-bit machine, just like in the article; if you have a 64-bit machine it will be obvious what to change.
We need 2^31 bits (1u << 31 in C); unsigned int is 32 bits, and 32 is 2^5, so we need 2^31/2^5 = 2^26 elements.
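For concreteness, a sketch of what that declaration might look like (the name "table" is just for illustration):

    /* 2^26 32-bit words hold 2^31 one-bit flags: 256 MiB in total,
       an eighth of the 2 GiB byte vector. */
    static unsigned int table[1u << 26];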
When looking up things in a bit vector we need to find the right element in the array and the right bit in the array element that holds the boolean flag. To find the right element we divide the input by the number of bits in each element. To find the right bit we do mod (%) by the number of bits in each element and then shift a bit flag by that amount and AND (&) with the array element.
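In C that lookup might look like this (a sketch, using the hypothetical table declared above):

    /* Returns nonzero if the flag for x is set, 0 otherwise. */
    unsigned int get_flag(unsigned int x)
    {
        return table[x / 32] & (1u << (x % 32));
    }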
Since the constant 32 is a power of two, a good C compiler will change the divide and modulo operations to shift and AND. If you use some other language and/or your compiler doesn't optimize that you may want to do that optimization by hand. x / 32 == x >> 5, if x >= 0. x % 32 == x & 31, if x >= 0. NB: It's not that simple for negative numbers!
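The same lookup with the divide and modulo done by hand (again just a sketch; with an unsigned argument the caveat about negative numbers doesn't apply):

    /* Same as above: x / 32 becomes x >> 5, x % 32 becomes x & 31. */
    unsigned int get_flag(unsigned int x)
    {
        return table[x >> 5] & (1u << (x & 31));
    }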
The lookup table must be populated before use of course, that is left as an exercise for the reader ;-).
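(A hint for that exercise, if you want one: setting a single flag is the mirror of the lookup, and the population loop would call it for every value where the answer is true.)

    /* OR the bit flag into the right word. */
    void set_flag(unsigned int x)
    {
        table[x >> 5] |= 1u << (x & 31);
    }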
(imurray, if you think I explained things you already know, I did it for other readers.)
> The annoying thing about lookup tables is that they are hard to benchmark properly.
If you want a general rule: if you can get the table small, lookup tables are fast, sometimes the fastest solution. But maybe your application (not an artificial benchmark) doesn't use the function very often and the table won't be in the cache. Then the computation has to be slower than fetching memory with a cache miss for the lookup table to be worth it. Also, memory bandwidth has been a bottleneck for a long time and it will only get worse. This diminishes the value of lookup tables and of trading memory for computation time. As always when it comes to performance and optimization: benchmark your application with your data. General results may or may not apply in your context.
Edit: If you want the answer to always come out as 0 for false and 1 for true, you can shift the array element down and mask with 1 instead:
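(A sketch, using the same hypothetical table and 32-bit layout as above.)

    /* Shift first, then mask: the result is exactly 0 or 1. */
    unsigned int get_flag_01(unsigned int x)
    {
        return (table[x >> 5] >> (x & 31)) & 1u;
    }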