Due to multilevel caches I doubt this would be so applicable to modern systems - the index-set representation takes several times more memory and naturally requires lots of random accesses, which are bad for caches. In some ways, it's no longer a time-space tradeoff: smaller is faster.
The 'magic' here is that you can iterate over the data quickly because it lives in 'dense'. In dense cases 'm' = 'n' and we lose. The apparently faster clear is just because the length ('n') is stored as an auxiliary value, which is not unique to this scheme. As pointed out before, the extra logic will break vectorization and result in poor performance.
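For anyone who hasn't read the article: here's a minimal sketch of the scheme being discussed (two arrays plus a length; the names `dense`, `sparse`, `n`, `m` follow the usual presentation, the class wrapper is mine):

```python
class SparseSet:
    """Briggs/Torczon-style sparse set over the universe 0..m-1."""

    def __init__(self, m):
        self.dense = [0] * m    # members, packed at the front
        self.sparse = [0] * m   # sparse[v] = index of v in dense (may be stale garbage)
        self.n = 0              # current number of members

    def contains(self, v):
        i = self.sparse[v]
        # A stale/garbage index fails this round-trip check, so uninitialised
        # contents of sparse[] never affect correctness.
        return i < self.n and self.dense[i] == v

    def add(self, v):
        if not self.contains(v):
            self.dense[self.n] = v
            self.sparse[v] = self.n
            self.n += 1

    def clear(self):
        self.n = 0              # O(1): just forget the length

    def __iter__(self):
        return iter(self.dense[:self.n])  # only live members, densely packed
```

Iteration touches only the first `n` slots of `dense`, which is where the win over a bit vector comes from when `n` is much smaller than `m`.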
You can't just assert "we win" or "we lose"; you need to measure the different possible implementations for your particular application. In the paper I wrote with Linda Torczon (cited by Russ Cox), we did exactly that. In that application (a graph-coloring register allocator), we were very sparse (n was significantly less than m) and we cleared the working set quite often - the win was pretty significant.
If your universe has 2^64 elements, you don't want to use bit vectors or this scheme due to giant memory consumption. Instead, you'll probably want to use a hash table.
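To put rough numbers on that: a per-universe array is simply not allocatable at 2^64, while a hash table's footprint scales with the members actually present (a small illustrative sketch):

```python
universe_bits = 64
bitvec_bytes = 2 ** universe_bits // 8   # ~2.3e18 bytes (~2 EiB): not allocatable
# The sparse-set scheme is even worse: two word-sized arrays over the universe.

# A hash table stores only what's actually in the set:
members = {2 ** 63 + 1, 42, 2 ** 40}
assert 2 ** 63 + 1 in members
assert 43 not in members
```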
One data structure can't be the best for every situation. It's why they pay engineers the big ducks [sic].
Python uses a similar trick in its internal allocator (pymalloc) [1]. Given a memory address, pymalloc determines whether it controls that address by computing what would be the base of the arena containing it. It then reads the arena's index (or trash, if the memory isn't pymalloc's) at a constant offset from that base, and uses the index to check against a vector of all the arenas it controls.
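A toy model of that check (the real thing is C in CPython's obmalloc.c; the arena size, the dict-as-address-space, and the "index at the base" layout here are all simplifications of mine). The point of similarity to the sparse set is that the index read may be garbage, and a round-trip check validates it:

```python
ARENA_SIZE = 256 * 1024          # illustrative; the key property is it's a power of two

memory = {}                      # toy address space: address -> stored word
arenas = []                      # index -> base address of each controlled arena

def new_arena(base):
    memory[base] = len(arenas)   # arena stores its own index at a known offset
    arenas.append(base)

def address_in_range(addr):
    base = addr & ~(ARENA_SIZE - 1)      # would-be base of the containing arena
    idx = memory.get(base, -1)           # may be trash if the memory isn't ours
    return 0 <= idx < len(arenas) and arenas[idx] == base
```

As with `sparse`/`dense`, a garbage index can't produce a false positive because it has to point back at an arena whose recorded base matches.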
This is designed for a fast clear, but doesn't actually have constant time clear, it's just O(n) clear with very small (amortized) constant. It's a bit more cache friendly, though, I think, and gave us some nice speedups in practice.
I think I have seen this method used in a data structure called something like a 'B.* set' or 'B.* index', but I can't remember the exact name. Does anyone have a clue as to what I am (mis)remembering?
Update: I remembered that the algorithm had something to do with checking if an item is not in a set, which quickly led me to Bloom Filter - not particularly similar, as it turns out.
I didn't read all that, but it seems like the issue there is that, if the compiler sees you are reading from uninitialised memory, it can effectively 'choose' a value for the contents of that memory to suit its optimisation purposes. Is that right?
If so, then this isn't actually a problem in the case of sparse sets. With sparse sets we don't care what is in the uninitialised memory, and if the compiler wants to get tricky and choose values for it (which I don't think actually applies in this context), that doesn't change the correctness of the data structure.
> “Xoring an uninitialized variable with whatever other source of entropy you already have cannot hurt”, the conventional thinking goes. Conventional thinking is wrong. Your typical modern compiler deletes the code that gathers the original entropy, since it is only going to be xored with an uninitialized variable.
The sparse array in the OP can't be optimized in this way at compile time.
Also, initialising the space doesn't take so long; REP STOS on x86 is fast since it can write an entire cache line at once: http://ptspts.blogspot.ca/2014/10/on-speed-of-memset.html