
[deleted]



If you have N locations with known addresses and you can select a sample of k from those addresses randomly with uniform probability, then sure, do that.

But there are lots of situations where you don't have that: say, you're sampling from a database join, or from a table with non-contiguous id values (e.g. a table with deletions). Or, in the case of truly "Big Data", the data is spread over multiple machines. In these cases, you need an approach that works when you don't know the size of the stream, can't load it all into memory, and/or can't select randomly from an address pool.
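For concreteness, here's a minimal Python sketch of the streaming approach (Vitter's Algorithm R); `stream` can be any iterable, e.g. a database cursor, and the names are illustrative:

    import random

    def reservoir_sample(stream, k):
        # Uniformly sample k items from an iterable of unknown length.
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                # Fill the reservoir with the first k items.
                reservoir.append(item)
            else:
                # Keep item i with probability k/(i+1), evicting a
                # uniformly chosen current resident of the reservoir.
                j = random.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir

Each item ends up in the final reservoir with probability k/n, using O(k) memory and a single pass, no matter how long the stream turns out to be.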


Sorry, I deleted this comment immediately after posting, but you managed to respond to it too quickly. It originally said:

>Having a large input set of known size (and a result set that is not large) is not a case where you would want to use reservoir sampling. Such a problem can be solved in O(k), where k is the size of the result set, while reservoir sampling is O(n), where n is the size of the input set.

and I have responded here: https://news.ycombinator.com/item?id=9158903
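For comparison, the O(k) approach from the quote assumes you can index the input directly. A sketch using Floyd's algorithm (names are illustrative; Python's built-in random.sample(range(n), k) does the same job):

    import random

    def floyd_sample(n, k):
        # Choose k distinct indices uniformly from range(n) in O(k) time.
        chosen = set()
        for j in range(n - k, n):
            t = random.randint(0, j)  # uniform over 0..j inclusive
            chosen.add(j if t in chosen else t)
        return chosen

It touches only k random indices, which is exactly why it beats an O(n) reservoir pass when the addresses are known and uniformly selectable.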



