I really like this algorithm, and I think it's instructive to consider a naive solution and how it differs.

Loop n times, calling rand(N) to pick a random element, and put it in the output. This mostly works, and is actually far better in space and time, especially in the case where "putting the element in the output" is expensive (as is implied by the "kajillions of records" premise).

The problem with this solution is that you might pick the same random index more than once. This means you have to keep track of the indices you've already used, if you absolutely can't tolerate a slightly smaller sample.
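A minimal sketch of that naive approach in Python (the name naive_sample and the reject-and-retry loop are mine, and it assumes the records support random access by index):

    import random

    def naive_sample(records, n):
        # Requires len() and indexing, the very thing a stream doesn't give you.
        seen = set()
        sample = []
        while len(sample) < n:
            i = random.randrange(len(records))
            if i in seen:
                continue  # duplicate index: redraw instead of shrinking the sample
            seen.add(i)
            sample.append(records[i])
        return sample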

The reservoir avoids reusing random numbers in a clever way: the random number only picks the target slot for replacement, while the algorithm iterates over all N inputs. So the same slot in the sample might be replaced twice, but no element of the N inputs is added to the sample twice.

One interesting thing is that when N is only slightly larger than n, those extra elements are very likely to replace something in the base sample: the element at position idx is picked with probability n/idx, which is n/N for the final element.
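For comparison, here is a sketch of the reservoir version, i.e. the classic "Algorithm R" (the function name is mine):

    import random

    def reservoir_sample(stream, n):
        reservoir = []
        for idx, item in enumerate(stream, start=1):
            if idx <= n:
                reservoir.append(item)     # the first n items fill the reservoir
            else:
                j = random.randrange(idx)  # uniform in [0, idx)
                if j < n:                  # true with probability n/idx
                    reservoir[j] = item    # evict a uniformly chosen slot
        return reservoir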

> Loop n times, calling rand(N) to pick a random element, and put it in the output. This mostly works, and is actually far better in space and time,

Reservoir sampling is normally used on large streams, where you either don't want to or can't keep the data you're sampling from in memory.


Then I'm a little bit worried about this algorithm: if the probability of picking is n/idx, then the odds of picking a long-tail item get asymptotically close to 0. I'm not sure it really qualifies as a sample of the entire thing.


Intuitively, the reason it balances out is that while earlier items are more likely to be picked, they're also more likely to get kicked out after being picked. Concretely: item idx enters the reservoir with probability n/idx, then survives each later step j with probability 1 - 1/j = (j-1)/j, so the product telescopes to (n/idx) * (idx/N) = n/N, the same for every item.
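If you want to convince yourself empirically, here's a quick check (reusing the reservoir_sample sketch above; n=3 and N=10 are just illustrative numbers):

    from collections import Counter

    # Sample 3 items from a 10-item "stream" many times. If the sampler
    # is uniform, every item, long tail included, should land in the
    # reservoir in roughly n/N = 0.3 of the trials.
    counts = Counter()
    trials = 100_000
    for _ in range(trials):
        counts.update(reservoir_sample(range(10), 3))
    for item in range(10):
        print(item, counts[item] / trials)  # each ratio hovers near 0.3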




