Algorithms Every Programmer Should Know: Reservoir Sampling (omniref.com)
332 points by timr on March 6, 2015 | 103 comments



This algorithm was introduced with the premise

> "I have a database table with #{17.kajillion} records, and I want to take a random sample of #{X} of them. How can I do this efficently?"

And goes on to say that something like `YourBigTable.all.sample(X)` is bad in time and space. And the link that explains why doing `ORDER BY RANDOM()` is bad lists one reason being that generation of random numbers is relatively expensive.

But these problems aren't solved by reservoir sampling. Reservoir sampling is still O(N), and it requires generating a random number for every element. It still beats `ORDER BY RANDOM()` because it doesn't have to sort the numbers (or even keep them all in memory at the same time). But that doesn't make it a good choice.

More generally, reservoir sampling is appropriate when you don't know the size of the set beforehand. But in the premise, we do know the size of the set. And if you know the size of the set (and have some way to address arbitrary elements), then randomly generating a set of X indices is going to be much simpler and faster. Yeah, you have to deal with duplicate indices, but that's an easy problem to solve.
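For concreteness, a minimal Ruby sketch of that index-picking approach, assuming you can fetch elements by position; the re-draw loop handles the duplicate problem:

    require 'set'

    # Draw x distinct random indices out of 0...n; re-draw on collision.
    # Fine whenever x is much smaller than n.
    def random_indices(n, x)
      picked = Set.new
      picked << rand(n) until picked.size == x
      picked.to_a
    end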


one reason being that generation of random numbers is relatively expensive

Say again? My laptop can read from /dev/urandom at over 1 Gbps. If you're running in userland on modern hardware you can easily hit 10 Gbps.

Seeding is hard, but once you've done that, generating random numbers is insanely cheap.


Yeah, I dunno why it says generating random numbers is expensive. I was just quoting his linked source.


> Seeding is hard, but once you've done that, generating random numbers is insanely cheap.

Depends on your use case. Generating random numbers is indeed relatively expensive:

    $ time dd if=/dev/zero of=/dev/null bs=1024 count=1M
    1073741824 bytes (1.1 GB) copied, 0.348118 s, 3.1 GB/s
    real    0m0.350s
    user    0m0.092s
    sys     0m0.260s

    $ time dd if=/dev/urandom of=/dev/null bs=1024 count=1M
    1073741824 bytes (1.1 GB) copied, 102.332 s, 10.5 MB/s
    real    1m42.334s
    user    0m0.152s
    sys     1m41.994s
(I7, 3.2.0-4-amd64)


You are comparing doing nearly nothing vs a PRNG. Presumably, you would spend your computation cycles doing something more useful than generating zeros, at which point I am not certain the difference would be quite so drastic.


Your /dev/urandom is astonishingly slow. Is this normal for Linux?


Yes. For example, if you choose the encrypted hard drive option when installing Debian, it'll take forever, like, hours and hours, two digit numbers of hours, because it's reading from some /dev/urandom (I presume) to initialize the drive with random data. You can skip that step though. It's way better to do it beforehand using some script that uses something other than /dev/urandom. And maybe, don't put a 1 TB hard drive in a netbook or an older machine.


btw, your sly remark wasn't lost .. ;)

(fyi, Dr. Colin Percival is an eminent security developer for FreeBSD, author and CEO of the trusted Tarsnap, and developer of the AWS AMIs for FreeBSD; when he takes a whack at Linux's PRNG, he's got some standing to do it from :).)

But, yep, this is normal for Linux.. although, frankly, I'm just glad that we, like FreeBSD, are not taking rdrand at face value (thx Ted Ts'o!) Perhaps post-seed performance will improve down the road.


I wasn't talking about RDRAND. I can't remember what throughput it gives, and it's a pain to use "live" in that it's allowed to fail and return all zeroes.

The inner loop of a PRNG microbenchmark should be identical to the inner loop of an AES-CTR microbenchmark (or your favourite block cipher or MAC if you prefer a different one) and those are fast these days.


> But these problems aren't solved by reservoir sampling. Reservoir sampling is still O(N), and it requires generating a random number for every element.

I had the same thought reading this. This solution is only really suitable if the number of elements to select is within some constant of the total number of elements.

> More generally, reservoir sampling is appropriate when you don't know the size of the set beforehand. But in the premise, we do know the size of the set. And if you know the size of the set (and have some way to address arbitrary elements), then randomly generating a set of X indices is going to be much simpler and faster. Yeah, you have to deal with duplicate indices, but that's an easy problem to solve.

Last time I had to solve this, I came up with the following solution[1] which doesn't generate any duplicates and runs in O(k) where k is the size of result set.

[1] http://ideone.com/CAwuxW


As I said on another comment in this thread: if you have the addresses (i.e. database ID values) for all of your records in memory, then sure, sampling directly from that set is faster than iterating over every record in a table.

There are lots of cases where this assumption doesn't hold: if you're sampling from a join, or from a table with non-contiguous ID values, for example.


Or, even better, if you're sampling from a data set as it's being generated. For example, let's say you want to grab 1,000 random tweets out of the output of Twitter's firehose for the next 2 hours - when you decide whether to keep tweet #1,329, you have no idea how many tweets are going to come in after that one, because they haven't been written yet!


Another way to put it is that reservoir sampling only makes sense if you have to iterate over the entire table regardless. That means no keys, row numbers or unique values that can be chosen at random and found more quickly.

It seems less, not more, likely that this kind of situation would exist in a huge table - what use would such a table be if you could only process it sequentially?


Ignore the database part - that's just a single concrete example. More abstractly, say you're receiving a continuous stream of events, and you don't know when it will end, and you want to take n samples from that. For example, maybe you have a web service that is receiving click event logs as a stream, and you want to sample 1000 of those, randomly, without storing every single one as it comes in.
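A minimal Ruby sketch of reservoir sampling for exactly that situation, assuming `stream` is any Enumerable; it never needs to know how long the stream is:

    # Keep a uniform random sample of n items from a stream of unknown length.
    def reservoir_sample(stream, n)
      sample = []
      stream.each_with_index do |item, idx|
        if idx < n
          sample << item                # fill the reservoir first
        else
          j = rand(idx + 1)             # uniform in 0..idx
          sample[j] = item if j < n     # keep item with probability n/(idx+1)
        end
      end
      sample
    end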


On the other hand, you're likely to be taking a sample of rows that match a query. It's unlikely that the database already keeps track of the count of how many rows will match an arbitrary query, unless you set it up in advance. So you'd need to load the row ids of the matching rows, then sample from them.

A sufficiently smart database might understand "order by rand() limit n" and do reservoir sampling on row id's behind the scenes. I wouldn't count on it, though.

If you set it up in advance, you could also use reservoir sampling to avoid even storing all the data in the first place, while still saving a random sample.


Once-through sequential processing is a pretty common paradigm for large datasets. That's basically what Hadoop and column stores are about.

In fact, i think this is the underlying meme of 'big data' - that it's now plausible and useful to do analysis over all of a large dataset, so you might as well build everything around iterating over the entire table regardless.


You can still use this approach as long as you know the total size upfront, just sort the generated index array and use it to skip elements from your input stream, this will save you from having to generate random numbers for every element.
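A rough sketch of that, assuming the total count is known up front; all the random numbers are generated before the stream is touched:

    require 'set'

    # Pick k distinct positions, sort them, then make one pass over the stream
    # and keep only those positions.
    def sample_by_sorted_indices(stream, total, k)
      wanted = Set.new
      wanted << rand(total) until wanted.size == k
      targets = wanted.to_a.sort
      picked = []
      stream.each_with_index do |item, idx|
        break if targets.empty?
        if idx == targets.first
          picked << item
          targets.shift
        end
      end
      picked
    end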


Could you please expand on your linked code? I'm not really a C++ programmer and lines 22/25 confuse me a bit.

Unfortunately I don't understand how each element has the same probability of being chosen. I.e. the first element seems to have a probability of 1/n, while the second has 1/n + 1/(n-1) etc. Am I wrong about this?


Regarding the C++ code:

    /* This is standard modern C++ random number generation.
       It means, using the random generator "engine", return 
       a number according to distribution "dist". In this case,
       "dist" is a uniform distribution from "i" to "max-min".
       "auto" is the new way to type variables in C++: it
       infers the type of the variable from the type of the
       expression */
    22: auto index = dist(engine);

    /* The following lines are a bit more complex, as they
       contain a nested expression. Let us focus on the inner
       expression:
           (index < size ) ? array[index]
              : map.insert({index, min + index}).first->second

       This is in the form:
           test-expr ? when-true-expr
              : when-false-expr

       This expression is a ternary condition. What it means
       is that the expression before the question mark is
       computed (index < size), and if it evaluates to true,
       the overall expression will resolve to the result of
       the expression between the question mark and the colon,
       otherwise it resolves to what is after the colon.

       Here, it means that if "index" is smaller than "size",
       use "array[index]", otherwise insert a new entry in the
       map, with key "index" and value "min+index", then reuse
       the result of the operation by accessing member "second"
       of the member "first". I am not familiar with STL maps;
       here is an explanation of the map::insert method from
       cplusplus.com:

           The single element versions (1) return a pair, with
           its member pair::first set to an iterator pointing 
           to either the newly inserted element or to the 
           element with an equivalent key in the map. The 
           pair::second element in the pair is set to true if 
           a new element was inserted or false if an equivalent
           key already existed.
    */

    23: std::swap(array[i], index < size
    24:                         ? array[index]
    25:                         : map.insert({index, min + index}).first->second);

    /* Then, the std::swap operation simply swaps the two indicated
       values in the array. */


> Could you please expand on your linked code? I'm not really a C++ programmer and lines 22/25 confuse me a bit.

I've annotated the program here: http://ideone.com/gmaBri

> Unfortunately I don't understand how each element has the same probability of being chosen. I.e. the first element seems to have a probability of 1/n, while the second has 1/n + 1/(n-1) etc. Am I wrong about this?

This is just a partial Fisher-Yates shuffle[1] using a sparse sequence instead of an array. Keep in mind that if an element is not chosen it is swapped with the chosen element and will be considered again. I think the best way to think about it is that you are choosing a random element from the set and then "removing" it from the set with the swap and then repeating.

[1] http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Th...
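For anyone who'd rather read it in Ruby, here is a rough sketch of the same trick: a partial Fisher-Yates shuffle over the virtual range min..max, with a Hash standing in for the sparse array so only touched positions are ever stored (O(k) time and space):

    # Sample k distinct values uniformly from min..max without materializing
    # the range. `sparse` records only the positions disturbed by swaps.
    def sample_range(min, max, k)
      sparse = {}                        # position => value currently there
      (0...k).map do |i|
        j = i + rand(max - min + 1 - i)  # uniform position in i..(max - min)
        vi = sparse.fetch(i, min + i)
        vj = sparse.fetch(j, min + j)
        sparse[i] = vj                   # swap positions i and j
        sparse[j] = vi
        vj                               # the value now sitting at position i
      end
    end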


Author here.

Yes, reservoir sampling is O(N), but my point there was that it doesn't require instantiating every record at once. Also, the cost of the random number generation isn't the big deal in an ORDER BY RANDOM() implementation -- it's that even the smart implementations end up doing multiple table scans. So even at O(n) this probably beats most implementations of that approach, in practice.

Reservoir sampling is appropriate with more than just a set of unknown size -- you very frequently know the size of a set, but it's still too big to sample directly. But yes, if your sets are small, you have a lot of options.


While good in theory, it's often really slow in practice as even small sample sets can trash your cache and it still needs a random number for every element.

You can make a better approach using a much longer list, but then the math is far less elegant. EX: Add the first X * N elements to the list. Then remove 1/X of them. Then 1/X of the time add an element to the list until it's full. Cull 1/X of them and then add new elements (1/X)^2 of the time. Repeat until you run out of elements at which point you cull the list back to X elements.

Downside is you're generating ~N log N random numbers so scanning the list and then sorting a random set may be faster.

If you have an approximate number of elements AND want a small percentage of them then randomly generate an X * N element list which is then culled to an N element list.

Finally, if N happens to be close to the full number of elements you’re better off creating an exclusion list.

PS: Reservoir Sampling is only good if you're memory constrained, need a constant time algorithm, and generating random numbers is expensive.
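A tiny sketch of that exclusion-list idea, assuming the items are addressable: when the sample size k is close to the total t, draw the t - k positions to leave out instead:

    require 'set'

    # Cheaper than sampling k directly when k is close to items.size.
    def sample_by_exclusion(items, k)
      t = items.size
      excluded = Set.new
      excluded << rand(t) until excluded.size == t - k
      items.each_with_index.reject { |_, idx| excluded.include?(idx) }.map(&:first)
    end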


If you know the size of the set, then you can generate a random set of (non-duplicate) indices, and use those to fetch data.

Offhand I'm not sure how to do that in SQL, since data isn't usually fetched by index (unless the row has an explicit index key). Worst-case you could iterate over the rows just as you do with reservoir sampling, and it would still be faster (since you aren't generating tons of random numbers and doing a lot of array manipulation). Or you could do one query fetching just the primary key, record the keys of the indices you're interested in, and then do a second query to fetch that set. Or maybe SQL does have some way to fetch by index after all.


Yeah - unfortunately there isn't a great way to select n random rows in standard SQL.

If you have an integral primary key, you could select its max and generate random numbers up to that. Continuity isn't guaranteed so you would need to check for nonexistent primary keys in addition to checking for duplicates.

If there's no integral primary key, then there's no standard SQL solution faster than O(n).
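A rough ActiveRecord sketch of that max-id approach (the Record model name is made up; maximum and where are standard ActiveRecord). Gaps from deleted rows just cause extra cheap, indexed lookups, and the hash handles duplicates:

    def sample_by_primary_key(k)
      max_id = Record.maximum(:id)
      found  = {}
      while found.size < k
        candidates = Array.new(k - found.size) { 1 + rand(max_id) }
        Record.where(id: candidates).each { |row| found[row.id] = row }
      end
      found.values.first(k)
    end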


If N is sufficiently larger than k, it might actually be faster to do something like

  let count = query("SELECT COUNT() FROM table")
  let indices = computeIndices(count, k)
  for index in indices {
    results.append(query("SELECT * FROM table LIMIT 1 OFFSET ?", index))
  }
Yeah, you're doing a bunch of queries, but you avoid copying all the data from all the rows.


OFFSET j is O(j) in most RDBMSes. You've made an O(N/2 * k) algorithm, given that j is a random number in [0, N).


Depending on the size of k compared to N, and how much data is in each row, it's still probably better than iterating over the entire contents of the table.

Also, do you have any citation for OFFSET j being O(j)? I can believe that's the case, but it also seems fairly obvious to optimize OFFSET for the case of a non-filtered unordered SELECT (or one ordered in a way that matches an index).


Assuming that the DB is backed by some kind of tree, it would need to maintain extra metadata in the branches in order to achieve even O(log n) offsetting. For this reason I could understand it being a linear operation.


My first solution would have been getting a count, maybe a cached count, and just doing random between 0 and that count, fetching by index. Or if you want more than one, just get a slice? Probably not the best, but that's off the top of my head.


I might be misunderstanding something, but this seems much more complex than systematic sampling, which is well understood even in the case where the number of records (read: population size) is unknown (this situation occurs in quality control on an assembly line).

Can anyone comment on why this reservoir method would be a better choice than the old fashioned systematic sample? (which, BTW, only requires the generation of one random number to determine the starting record).
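For comparison, systematic sampling is about this much code, assuming the step (population size / sample size) is known or estimated up front:

    # One random start, then every step-th record.
    def systematic_sample(stream, step)
      start = rand(step)
      stream.each_with_index.select { |_, idx| (idx - start) % step == 0 }.map(&:first)
    end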


>>Reservoir sampling is appropriate with more than just a set of unknown size -- you very frequently know the size of a set, but it's still too big to sample directly.

I'm not sure that I buy this...Do you have an example? Still, it is a neat algorithm. And it is a true stream query algorithm, that is, if you have an open-ended stream, it maintains a random sample at all times.


[deleted]


If you have N locations with known addresses and you can select a sample of k from those addresses randomly with uniform probability, then sure, do that.

But there are lots of situations where you don't have that: say, you're sampling from a database join, or you're just using a table that has non-contiguous id values (e.g. a table with deletions). Or, in the case of truly "Big Data", you have the data spread over multiple machines. In these cases, you need an approach suitable for when you don't know the size of the stream, you can't load it all into memory and/or you can't select randomly from an address pool.


Sorry, I deleted this comment immediately after posting but you managed to respond to it too quickly, it originally said:

>Having a large input set of known size (and your result set is not large) is not a time when you would want to use reservoir sampling. Such a problem can be solved in O(k) where k is the size of the result set, while reservoir sampling is O(n) where n is the size of the input set.

and I have responded here: https://news.ycombinator.com/item?id=9158903


doesn't your implementation require you to fetch all data into the client process for ruby to iterate over?

isn't the movement of such data -- when you have billions of rows -- simply impossible in reasonable time?


This implementation is useful at some scales, but not all: it uses find_each to iterate over the ActiveRecord scope, which is going to be slower than doing it all in the database (if that's possible), but still way better than instantiating every record in a large table.

(find_each instantiates a few records at a time, then throws them away)


it will, in general, instantiate N-n/N records (even distribution, right - N=total records, n=sample size). That is still too much data to fetch to the client for any large data set.


You don't have to instantiate the records; that's just the way I did it here. For example, you could do two passes: one to randomly select IDs, and one to fetch a small number of records for the sample set. That's O(2N), but still better than what most databases will do for an ORDER BY RANDOM() query.

There's also another, less-known variant of this algorithm that I'll go into in a future post that alleviates the concern.


Thanks for the clarification, thats exactly what I was wondering about. I have two questions:

- is it a common problem to have, not knowing the set size?

- is it O(N), or rather O(n)? My intuition tells me it depends on whether or not I have to retrieve every record in the set or not, because I/O should still be much more expensive than random number generation.


Imagine you are writing software to sit on a router to:

1) collect data for some length of time, and

2) thereafter have available for querying the (maybe approximate) 25th, 50th, 95th, and 99th percentile TCP connection bandwidth thresholds, and

3) signal when a connection grows beyond the 95th percentile.

You only have a fairly limited amount of memory.


While it's great for medians, reservoir sampling for calculating high or low percentiles, particularly with high dynamic range data, will probably have very high error.


> is it a common problem to have, not knowing the set size?

Yes. It could be as simple as receiving values from a Python generator (or equivalent mechanism in your favourite language), where you don't know the number of items beforehand (and querying for the length is either impossible or expensive).

In such cases, using reservoir sampling is an elegant solution, because you're probably in a situation where you'll be iterating through all values anyway.


> - is it a common problem to have, not knowing the set size?

One motivating case is when the items arrive over time, like you have a website and you want to sample uniformly from your visitors over the course of the next hour. You don't need to know how many people will arrive.


> Reservoir sampling is still O(N), and it requires generating a random number for every element.

I agree with your comment in general. Only: you can do better than generate a random number for each element. (Generate a random number to tell you how many elements in the steam to skip.)
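A sketch of that skip-counting idea (essentially what's known as Algorithm L), assuming an addressable array for simplicity; a real stream would just need a way to jump forward:

    def reservoir_sample_skips(items, k)
      sample = items.take(k)
      return sample if items.size <= k

      w = Math.exp(Math.log(rand) / k)   # rand is uniform in [0, 1)
      i = k - 1
      loop do
        # how many elements to skip before the next replacement
        i += (Math.log(rand) / Math.log(1 - w)).floor + 1
        break if i >= items.size
        sample[rand(k)] = items[i]       # replace a random reservoir slot
        w *= Math.exp(Math.log(rand) / k)
      end
      sample
    end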


Can you show that this will still maintain a uniform probability distribution?


> Can you show that [generating a number of records to skip instead of a yes/no for each record] will still maintain a uniform probability distribution?

That should work if you pull the skip count from the appropriate distribution (negative binomial, I think).


Yes, you have to work out the right distribution, and how to sample from it (at all, and perhaps how to sample from it efficiently).


Yes.


Do you have to make sure to address the scenario where a generated index may possibly not exist in a table (deletion, whatever)? It says nothing about the index being contiguous. This can potentially end up with your algorithm returning fewer than #{X} if your generated indices don't exist.


Just re-sample the ones that "miss". It won't affect the distribution.


You still have to loop over your returned records to verify uniqueness so wouldn't this automatically be solved?


Nicely put, sir/ma'am!


Downvoted for a compliment. Way to go HN, way to go.


You didn't add anything to the conversation. It's fine to compliment someone, but add to the conversation otherwise the comment is just noise. If all you have to say is that, just upvote.


Awesome.

Let's keep this going.


It's just the way that you're signalled not to make these kind of posts. It's not personal even if it feels like it.


https://news.ycombinator.com/newsguidelines.html

Resist commenting about being downvoted. It never does any good, and it makes boring reading.

Please don't bait other users by inviting them to downvote you.


Wow, OmniRef is cool. StackOverflow-style Q & A embedded right into source code.

Reservoir sampling: I'm not sure that applying this algorithm to database sampling is the right thing to do. By its nature, the algorithm has to touch every single row in a database, and it does that because it's designed for data streams where you don't know in advance the size of the stream -- which isn't the case with database tables.

Assuming a uniform distribution in your database, you could instead do something like:

    SELECT *
    FROM (
        SELECT
            @row := @row +1 AS rownum, [column, column, columns...]
        FROM (
            SELECT @row :=0) r, [table name]
        ) ranked
    WHERE rownum % [n] = 1
to get every nth record of your table, calculating n ahead of time for n ~= (table size)/(sample size). That should be a little bit faster and in most cases still provide acceptable results.


Clever, but that gives you every nth row - you might want a truly uniform sample.


I used to work on a hospital database, back before the internet.

It happened a few times that we were asked for a random sample of records. It was quite common to want to recalculate the rehabilitation rates and other stats.

This was a problem, as patient records filled several tapes and, being the youngster, I had to sit loading tapes for hours.

Because our data set was bigger than memory, we had trouble deciding what to sample. We'd read everything in, filtering the records, and recording just their offsets. Then we'd pick some number, then re-read all the tapes to fetch the actual records. Naive and tedious.

Luckily the IT manager knew all about reservoir sampling, and explained it to us.

We made a little Turbo Pascal program that ran over the data files each night before they were archived to tape and kept a large (but small enough to fit on one machine) reservoir!

Every time we were asked for a random sample after that, we just handed out the reservoir.

The care managers, who thought they were still causing us to sit for hours shuffling tapes, were terribly grateful at the quick turnarounds. We never did let them in on our secret.

These days of course hospital records fit in RAM. Not the same kind of problems.


So the hospital was expecting random samples each time, but kept getting the same sample (with some appended records)?


They were getting a new random sample each night. Same (or probably better!) distribution as before.


I think in any real life scenario I would have an idea about the number of records in my table, or could figure that out very cost effectively. So to follow in OP's example with Rails, I would just do Table.where(id: [array_of_random_ids]), which is going to be extraordinarily more efficient than his batch processing Table.find_all.

Something should really only be filed under "Every Programmer Should Know" if it will be encountered in the real world more than once or twice. So in this case, the title feels like click-bait.


Reservoir sampling does have legitimate use in monitoring: to track real-time percentiles (e.g. over the last 15 minutes), without logging everything and re-calculating on each new data entry. I do think it is important and useful to know about it, but probably would not use it for the database sampling (as it was in the article).


Just looking at the comments, I think there is a tacit assumption on the part of some commenters that a stream of unordered records from a database has a random distribution relative to the value you care about.

That is almost never true. In fact, I have seen clever performance optimizations in database applications that exploit I/O scheduler and layout induced bias in index-less record order. In practice, sampling 25% of very large database tables will often not be representative of sampling 100% of that database table even if the value is not explicitly indexed. Database engines automatically do a lot of low-level optimizations that introduce bias in a nominally random dump of records as a stream.

It is actually quite difficult to randomly sample a large database table.


Does this matter for this example? It scans every single record, and adds it to the output with a certain probability. The ordering of the original stream doesn't matter I think.


FWIW, the pseudo-code in the Wikipedia article is easier to grok: http://en.wikipedia.org/wiki/Reservoir_sampling


I don't know... Can you really trust a wikipedia page that has needed attention from an expert for five years?


Had a great opportunity to use this in anger a few months ago. Working on a mapping application, where the map could only handle a few hundred items at a time, but the customer account had tens or hundreds of thousands. And the user could set a realtime filter which would select an unpredictable subset of the larger collection.

QA raised a ticket complaining the subset we were using (just taking the first n items from the collection that matched the filter, collection size was enough that we didn't want to double enumerate or store the whole thing in memory) was unbalanced, eg we were giving the impression that areas of the map that had plenty of items were actually empty.

Switched to reservoir sampling to make sure the subset was a fair representation of the collection, with only one pass and not blowing our memory budget. And everyone you explain it to gets to admire the algorithm.


I find the paper mentioned much easier to follow:

http://www.cs.umd.edu/~samir/498/vitter.pdf


Indeed I've always thought this was a very cool algorithm.

I once had to extend it for a distributed setting (think mapreduce) where you can cheaply iterate all records but not sequentially. Instead you have multiple reservoirs, and you can't resample those uniformly since each saw a different number of records.

The unbiased way to aggregate 2 reservoirs is to draw a random number from a hypergeometric distribution. You aggregate a larger number by combining 2 at a time.
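A rough sketch of that merge, simulating the hypergeometric draw directly (n1 and n2 are how many stream items each reservoir has seen):

    # Number of "type 1" items among k draws without replacement from a
    # population of n1 type-1 and n2 type-2 items.
    def hypergeometric(n1, n2, k)
      from_first = 0
      k.times do
        if rand(n1 + n2) < n1
          from_first += 1
          n1 -= 1
        else
          n2 -= 1
        end
      end
      from_first
    end

    # Merge two reservoirs of size >= k into one unbiased reservoir of size k.
    def merge_reservoirs(r1, n1, r2, n2, k)
      take1 = hypergeometric(n1, n2, k)
      r1.sample(take1) + r2.sample(k - take1)
    end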

Fun story, an interviewer at Google once asked me to derive reservoir sampling a long time ago, which I completely whiffed.


I was able to derive it on the fly fast enough to make a StackOverflow answer once. I had never heard of or used the technique before, and certainly didn't know what it was called.

I'd probably whiff it in an interview too. For some reason my brain doesn't work as well in an interview setting. Thankfully I haven't had to worry about that too many times.


I really like this algorithm, and I think it's instructive to think about a naive solution, and how it differs.

Looping n times, calling rand(N) and putting the random element in the output. This mostly works, and is actually far better, in space and time, especially in the case where "putting the element in the output" is expensive (as is implied by the "kajillions of records" premise).

The problem with this solution is that you might pick the same random number twice or more in a row. This means you have to keep track of your random numbers if you absolutely can't tolerate a slightly smaller sample.

The reservoir avoids reusing random numbers in a clever way, by only applying the random number as the target for replacement, and iterating over every N. So the same elt in the sample might be replaced twice, but no elt in N will be added to the sample twice.

One interesting thing is that when N is slightly larger than n, those extra elements are very likely to replace something in the base sample. In fact, the probability is n/N.


Looping n times, calling rand(N) and putting the random element in the output. This mostly works, and is actually far better, in space and time,

Reservoir sampling is normally used on large streams where you do not want to or cannot keep the data you are sampling from in memory.


Then I'm a little bit worried about this algorithm, because if the probability of picking is n/idx then the odds of picking a long tail item get asymptotically close to 0. I'm not sure if it really qualifies as a sample of the entire thing.


Intuitively, the reason why it balances out is that while earlier items are more likely to be picked, they're also more likely to get kicked out after being picked.



I didn't want to use a public account (Omniref only accepts Github, Google, and FB) asking this question, so can someone here help me with the probability explanation?

Consider the simplest non-trivial example: a sample of 1 element, from a two-element reservoir. At iteration one, we accept the first element. At iteration two, if the sample is to be random, we have to decide to keep the new element with a probability of 1/2:

First element: (1/1) × (1/2) = 1/2

Why are we multiplying the acceptance of the first element with the probability of keeping it? And then...

But what if we're sampling from a pool of three elements instead? We know that we therefore have to keep the new element with a probability of 1/3…which means that we need to keep the element we already have with probability 2/3:

First element: (1/2) × (2/3) = 1/3

Where did the 1/2 come from? Why is it not 1/1 here?


The left side is the cumulative probability that it's been kept "so far". The right side is the probability it will be kept, rather than replaced by the new element (probability that the new element will not replace this one). Multiply them together and you get the overall probability that the element remains after the new element has been processed.
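To make that concrete, spelling out the three-element, sample-of-one case all the way, every element lands on 1/3:

    element 1: kept at step 1 (1/1), survives step 2 (1/2), survives step 3 (2/3):  (1/1)(1/2)(2/3) = 1/3
    element 2: chosen at step 2 (1/2), survives step 3 (2/3):                       (1/2)(2/3)      = 1/3
    element 3: chosen at step 3 (1/3):                                              1/3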


The timing of this post is pretty great! I work a lot with "largish files" in R, and while R has gotten much better at reading big files (thanks to data.table's fread), often times you really don't want to work with the whole file. I was surprised that there wasn't a tool to sample from a stream. I wrote a simple perl script https://github.com/earino/fast_sample to do just that. I was "getting around" the issue of needing reservoir sampling by allowing the user to pass in a "proportion of lines" (e.g. .01 for 1%).

I realized that I needed to add the ability to sample an exact number of records, so I went ahead and implemented reservoir sampling. It's a neat algorithm and the performance is pretty stellar. Now I just need to expand it to allow stratified sampling on a column! :)


IIRC this is how fortune[1] selects a message fairly given the crappy database format it uses (possible fortunes are separated by lines containing just '%').

[1] http://en.wikipedia.org/wiki/Fortune_(Unix)


This is awesome. I asked about this question on Stack Overflow back in September, but my question got shut down.

http://stackoverflow.com/questions/25942333/how-do-reservoir...

I use this algorithm in one of my tweet bots that tweets out a random tweet from a txt file containing 10k different quotes, works fantastically and has been for several months with no duplicates or malfunctions.


This reminded me of the card shuffling algorithm from Sedgewick

http://algs4.cs.princeton.edu/lectures/21ElementarySorts.pdf

In that, he assigns a random number to each element, sorts the elements based on the random number, then takes the first N elements.
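In Ruby, that whole approach is a one-liner (items and n here are whatever collection and sample size you have); the trade-off versus reservoir sampling is the O(N log N) sort and needing everything in memory:

    sample = items.sort_by { rand }.first(n)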


This seems similar to selecting a single random line from a file in a single pass:

http://41j.com/blog/2015/02/select-random-line-file-single-p...

Which is a question I've had come up in interviews. It's a neat trick.


Reservoir sampling in a few lines of awk:

https://github.com/taltman/scripts/blob/master/EDA/samp

It's shorter than this ruby implementation, and includes copious documentation, file handling, and corner-case support.


Java implementation:

https://github.com/weblicht/conll-utils/blob/master/src/main...

(Also returns items that were replaced.)


Also of interest:

"Algorithms Every Data Scientist Should Know: Reservoir Sampling" http://blog.cloudera.com/blog/2013/04/hadoop-stratified-rand...


Imagine cracking this question in a 45 min phone interview, if you have never heard about it.


Interesting, although I'd need to do the math on this to believe it.

Another interesting problem is how to pick 3 million rows out of 4 million at random, without duplicates. With O(N) time and picks streaming from the algorithm.
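One classic answer to that second problem is selection sampling (Knuth's Algorithm S): walk the rows once and keep each one with probability (still needed) / (still remaining). Exactly n picks come out, in order, with no duplicates, in O(N) time and O(1) extra space. A sketch:

    def select_n_of(stream, total, n)
      remaining = total
      needed = n
      stream.each do |item|
        if rand(remaining) < needed   # keep with probability needed/remaining
          yield item
          needed -= 1
        end
        remaining -= 1
        break if needed.zero?
      end
    end

    # usage (rows and emit are stand-ins):
    #   select_n_of(rows, 4_000_000, 3_000_000) { |row| emit(row) }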


I don't need to know this. I've never used a database in my programming career, ever (I'm a game programmer, btw). I'm not saying it's not interesting, or that I wouldn't want to familiarise myself with it for purposes of general interest, but not only do I not need to know this, it would waste valuable headspace at the moment. There are lots of kinds of programming. We don't all do webapp / internety stuff. Essentially the title of this article is not just hyperbolic, it's wrong. There are so many articles with '10 things programmers must know' (they are usually all related to the internet) that if I actually went and learned them all I wouldn't be able to remember how to write a shader or rotate stuff.


This discussion should actually be about the site omniref, since that's why it was posted. Not to say that's a bad thing, since the site seems cool. Just wanted to point that out.


yeah I never came across that website before (probably because I don't touch ruby) and I was pretty impressed. Thanks for sharing. :)


It would be really great to see some plain-english explanation of what the algorithm does, for those of us who aren't already knee-deep in ruby/ActiveRecord.


Is there a pointer to the other algorithms?


1. Count N.

2. Compute p = n/N.

3. SELECT * FROM table WHERE random() < p * (some lambda > 1) ORDER BY random() LIMIT n.

4. Repeat 3 until you get n rows.


Does it work with a database of dogs? :)


That isn't a uniform distribution. It's biased in favor of records with a bigger index. (I.e., rand() returns the same value twice.)


Yes - the code is wrong. A correct implementation would adjust the probability of sampling based on how many items have been trialed.

EDIT: I misread it as rand(n), rand(idx) is correct.


But that's happening here as well, right? The random number is an index picked between 0 and the current index.

Though, I think there is an off-by-one in this implementation (assuming that Ruby indexing starts at 0):

    j = rand(idx)
    out[j] = i if j < n
Say that the index is n, it will call rand(n), which gives a random number [0..n). However, the index should be picked from [0..n].


You may be right...I spent a lot of time thinking about that, and I convinced myself that this was correct, but I could be wrong. My thinking is that rand(N) in ruby returns a value from [0, N-1] inclusive, which is the complete index range of the sampled stream so far.

That sounded right to me. I could be convinced otherwise.


You are not taking the index of the element that you are currently sampling into consideration.

Suppose that the sample size is 1 and you are getting the second item (index 1). You will call rand(1), which has 0 as the only possible outcome. So, you will always replace the first item (index 0). Whereas if you would call rand(2) (possible outcomes: 0 and 1), you replace the item in the sample with probability .5 (assuming that the random number generator is uniform).


D'oh. Yes, I think you're right. That'll teach me to try to mentally debug code at 2AM!

I'll fix the code and publish a new gem a bit later today. Thanks!


I'm not a ruby guy at all (never used it), but if I understand this correctly:

  j = rand(idx)
  out[j] = i if j < n
..it is keeping the sampling probability at n/idx where n is the sample/reservoir size and idx is the number of items seen so far. Then if this item is selected for sampling, each of the n items in the reservoir is equally likely to be replaced. I think the implementation is correct, assuming that idx is the number of items seen so far.
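A quick empirical check of that argument (using the rand(idx + 1) form, where idx is the 0-based position, so idx + 1 is the number of items seen so far):

    counts = Hash.new(0)
    100_000.times do
      out = []
      (0..4).each_with_index do |item, idx|
        if idx < 2
          out << item
        else
          j = rand(idx + 1)
          out[j] = item if j < 2
        end
      end
      out.each { |v| counts[v] += 1 }
    end
    p counts   # each of 0..4 should come out near 40_000 (i.e. 2/5 of the runs)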


Probability is something else. I consider myself fairly good at understanding it. I like to think (to myself) I'm in the 95th percentile in the world in understanding probability. Yet these 4 concepts will regularly throw me off: http://tempr.org/54f9f429588f8.html



