I really like this algorithm, and I think it's instructive to think about a naiv...

danieldk · on March 6, 2015

Looping n times, calling rand(n) and putting the random element in the output. This mostly works, and is actually far better, in space and time,

Reservoir sampling is normally used on large streams where you do not want to or cannot keep the data you are sampling from in memory.

javajosh · on March 6, 2015

Then I'm a little bit worried about this algorithm, because if the probability of picking is n/idx then the odds of picking a long tail item get asymptotically close to 0. I'm not sure if it really qualifies as a sample of the entire thing.

loqi · on March 6, 2015

Intuitively, the reason why it balances out is that while earlier items are more likely to be picked, they're also more likely to get kicked out after being picked.

c0ldfusion · on March 6, 2015

proof is in the article http://en.wikipedia.org/wiki/Reservoir_sampling