> But these problems aren't solved by reservoir sampling. Reservoir sampling is still O(N), and it requires generating a random number for every element.

I had the same thought reading this. This solution is only really suitable if the number of elements to select is within a constant factor of the total number of elements.

> More generally, reservoir sampling is appropriate when you don't know the size of the set beforehand. But in the premise, we do know the size of the set. And if you know the size of the set (and have some way to address arbitrary elements), then randomly generating a set of X indices is going to be much simpler and faster. Yeah, you have to deal with duplicate indices, but that's an easy problem to solve.

Last time I had to solve this, I came up with the following solution[1], which doesn't generate any duplicates and runs in O(k), where k is the size of the result set.

[1] http://ideone.com/CAwuxW
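
In case the link rots: the gist (a sketch, not the exact linked code) is a partial Fisher-Yates shuffle over the virtual range [min, max]. Only the first k slots live in a real array; any slot beyond that is materialized lazily in a map the first time a swap touches it. Something like:

    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <unordered_map>
    #include <vector>

    // Sample k distinct values uniformly from [min, max] in O(k)
    // expected time and space: a partial Fisher-Yates shuffle where
    // slots [0, k) are a real array and slots >= k are created
    // lazily in a hash map. Assumes 1 <= k <= max - min + 1.
    std::vector<int64_t> sample(int64_t min, int64_t max, size_t k) {
        std::mt19937_64 engine{std::random_device{}()};
        std::vector<int64_t> array(k);
        std::iota(array.begin(), array.end(), min);  // min, min+1, ...
        std::unordered_map<int64_t, int64_t> map;
        for (size_t i = 0; i < k; ++i) {
            std::uniform_int_distribution<int64_t>
                dist(static_cast<int64_t>(i), max - min);
            auto index = dist(engine);
            // Value at virtual slot "index": in the array if it's one
            // of the first k slots, otherwise the lazily-created map
            // entry (insert returns the existing entry if present).
            std::swap(array[i],
                      index < static_cast<int64_t>(k)
                          ? array[index]
                          : map.insert({index, min + index}).first->second);
        }
        return array;  // the k sampled values
    }

Each draw picks uniformly from the slots not yet consumed, so it's exactly a Fisher-Yates shuffle stopped after k steps.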




As I said in another comment in this thread: if you have the addresses (i.e. database ID values) for all of your records in memory, then sure, sampling directly from that set is faster than iterating over every record in a table.

There are lots of cases where this assumption doesn't hold: if you're sampling from a join, or from a table with non-contiguous ID values, for example.


Or, even better, if you're sampling from a data set as it's being generated. For example, let's say you want to grab 1,000 random tweets out of the output of Twitter's firehose for the next 2 hours - when you decide whether to keep tweet #1,329, you have no idea how many tweets are going to come in after that one, because they haven't been written yet!
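
For completeness, the streaming algorithm itself is tiny. Here's a minimal sketch of the classic "Algorithm R" (the Reservoir name and offer() interface are just made up for illustration):

    #include <random>
    #include <vector>

    // Keep k items from a stream of unknown length. At any point,
    // every item seen so far has the same probability k/seen of
    // being in the reservoir.
    template <typename T>
    class Reservoir {
        std::vector<T> sample_;
        size_t k_;
        size_t seen_ = 0;
        std::mt19937_64 engine_{std::random_device{}()};
    public:
        explicit Reservoir(size_t k) : k_(k) { sample_.reserve(k); }
        void offer(const T& item) {
            ++seen_;
            if (sample_.size() < k_) {
                sample_.push_back(item);  // fill the reservoir first
            } else {
                // Keep the new item with probability k/seen, evicting
                // a uniformly chosen old one.
                std::uniform_int_distribution<size_t> dist(0, seen_ - 1);
                size_t j = dist(engine_);
                if (j < k_) sample_[j] = item;
            }
        }
        const std::vector<T>& sample() const { return sample_; }
    };

Feed every incoming tweet to offer(); whenever the two hours are up, sample() is a uniform random sample of everything that arrived.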


Another way to put it is that reservoir sampling only makes sense if you have to iterate over the entire table regardless. That means no keys, row numbers, or unique values that can be chosen at random and found more quickly.

It seems less, not more, likely that this kind of situation would exist in a huge table - what use would such a table be if you could only process it sequentially?


Ignore the database part - that's just a single concrete example. More abstractly, say you're receiving a continuous stream of events, and you don't know when it will end, and you want to take n samples from that. For example, maybe you have a web service that is receiving click event logs as a stream, and you want to sample 1000 of those, randomly, without storing every single one as it comes in.


On the other hand, you're likely to be taking a sample of rows that match a query. It's unlikely that the database already keeps track of the count of how many rows will match an arbitrary query, unless you set it up in advance. So you'd need to load the row ids of the matching rows, then sample from them.

A sufficiently smart database might understand "order by rand() limit n" and do reservoir sampling on row id's behind the scenes. I wouldn't count on it, though.

If you set it up in advance, you could also use reservoir sampling to avoid even storing all the data in the first place, while still saving a random sample.


Once-through sequential processing is a pretty common paradigm for large datasets. That's basically what Hadoop and column stores are about.

In fact, I think this is the underlying meme of 'big data' - that it's now plausible and useful to do analysis over all of a large dataset, so you might as well build everything around iterating over the entire table regardless.


You can still use this approach as long as you know the total size upfront: just sort the generated index array and use it to skip elements from your input stream. This saves you from having to generate a random number for every element.
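
Something like this sketch (assuming the indices are already distinct, e.g. generated with the method upthread, and that the stream is line-oriented):

    #include <algorithm>
    #include <cstdint>
    #include <istream>
    #include <string>
    #include <vector>

    // Pull the elements at the given (distinct) indices out of a
    // sequential stream. Sorting the indices first means one forward
    // pass with no random numbers drawn during the scan itself.
    std::vector<std::string> pick(std::istream& in,
                                  std::vector<int64_t> indices) {
        std::sort(indices.begin(), indices.end());
        std::vector<std::string> out;
        out.reserve(indices.size());
        std::string line;
        int64_t pos = 0;
        size_t next = 0;
        while (next < indices.size() && std::getline(in, line)) {
            if (pos++ == indices[next]) {
                out.push_back(line);
                ++next;
            }
        }
        return out;
    }

Note the result comes back in stream order rather than random order, so shuffle it afterwards if that matters.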


Could you please expand on your linked code? I'm not really a C++ programmer, and lines 22/25 confuse me a bit.

Unfortunately I don't understand how each element has the same probability of being chosen. I.e. the first element seems to have a probability of 1/n, while the second has 1/n + 1/(n-1), etc. Am I wrong about this?


Regarding the C++ code:

    /* This is standard modern C++ random number generation.
       It means, using the random generator "engine", return 
       a number according to distribution "dist". In this
       case, "dist" is a uniform distribution from "i" to
       "max-min".
       "auto" is the new way to type variables in C++: it
       infers the type of the variable from the type of the
       expression */
    22: auto index = dist(engine);

    /* The following lines are a bit more complex, as they
       contain a nested expression. Let us focus on the inner
       expression:
           (index < size) ? array[index]
              : map.insert({index, min + index}).first->second

       This is in the form:
           test-expr ? when-true-expr
              : when-false-expr

       This expression is a ternary condition. What it means
       is that the expression before the question mark is
       computed (index < size), and if it evaluates to true,
       the overall expression will resolve to the result of
       the expression between the question mark and the colon,
       otherwise it resolves to what is after the colon.

       Here, it means that if "index" is smaller than "size",
       use "array[index]", otherwise insert a new entry in the
       map, with key "index" and value "min+index", then reuse
       the result of the operation by accessing member "second"
       of the member "first". I am not familiar with STL maps;
       here is an explanation of the map::insert method from
       cplusplus.com:

           The single element versions (1) return a pair, with
           its member pair::first set to an iterator pointing 
           to either the newly inserted element or to the 
           element with an equivalent key in the map. The 
           pair::second element in the pair is set to true if 
           a new element was inserted or false if an equivalent
           key already existed.
    */

    23: std::swap(array[i], index < size
    24:                         ? array[index]
    25:                         : map.insert({index, min + index}).first->second);

    /* Then, the std::swap operation simply swaps the two indicated
       values in the array. */


> Could you please expand on your linked code? I'm not really a C++ programmer, and lines 22/25 confuse me a bit.

I've annotated the program here: http://ideone.com/gmaBri

> Unfortunately I don't understand how each element has the same probability of being chosen. I.e. the first element seems to have a probability of 1/n, while the second has 1/n + 1/(n-1), etc. Am I wrong about this?

This is just a partial Fisher-Yates shuffle[1] using a sparse sequence instead of an array. Keep in mind that if an element is not chosen, it is swapped with the chosen element and will be considered again. I think the best way to think about it is: you choose a random element from the set, "remove" it from the set with the swap, and then repeat.

[1] http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Th...
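
Spelled out, the per-draw probabilities don't add the way you listed, because an element can only be picked at draw 2 if it survived draw 1. For any fixed element:

    P(picked at draw 1) = 1/n
    P(picked at draw 2) = (1 - 1/n) * 1/(n-1) = 1/n
    P(picked at draw 3) = (1 - 2/n) * 1/(n-2) = 1/n
    ...

The draws are mutually exclusive for a given element, so over k draws each element lands in the sample with probability exactly k/n.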



