> "I have a database table with #{17.kajillion} records, and I want to take a random sample of #{X} of them. How can I do this efficently?"
And goes on to say that something like `YourBigTable.all.sample(X)` is bad in time and space. And the link that explains why doing `ORDER BY RANDOM()` is bad lists one reason being that generation of random numbers is relatively expensive.
But these problems aren't solved by reservoir sampling. Reservoir sampling is still O(N), and it requires generating a random number for every element. It still beats `ORDER BY RANDOM()` because it doesn't have to sort the numbers (or even keep them all in memory at the same time). But that doesn't make it a good choice.
More generally, reservoir sampling is appropriate when you don't know the size of the set beforehand. But in the premise, we do know the size of the set. And if you know the size of the set (and have some way to address arbitrary elements), then randomly generating a set of X indices is going to be much simpler and faster. Yeah, you have to deal with duplicate indices, but that's an easy problem to solve.
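For concreteness, a minimal Ruby sketch of that index-sampling idea, assuming the size is known and rows can be fetched by position (`fetch_by_index` is a hypothetical accessor):

    require 'set'

    # Draw x distinct indices from [0, count); duplicates are handled by
    # redrawing into a Set until we have enough.
    def random_indices(count, x)
      picked = Set.new
      picked << rand(count) while picked.size < x
      picked.to_a
    end

    # random_indices(big_table_count, 100).map { |i| fetch_by_index(i) }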
You are comparing doing nearly nothing vs. a PRNG. Presumably, you would spend your computation cycles doing something more useful than generating zeros, at which point I am not certain the difference would be quite so drastic.
Yes. For example, if you choose the encrypted hard drive option when installing Debian, it'll take forever, like, hours and hours, a two-digit number of hours, because it's reading from some /dev/urandom (I presume) to initialize the drive with random data. You can skip that step, though. It's way better to do it beforehand using some script that uses something other than /dev/urandom. And maybe don't put a 1 TB hard drive in a netbook or an older machine.
(fyi, Dr. Colin Percival is an eminent security developer for FreeBSD, author and CEO of the trusted Tarsnap, and developer of the AWS AMIs for FreeBSD; when he takes a whack at Linux's PRNG, he's got some standing to do it from :).)
But, yep, this is normal for Linux... although, frankly, I'm just glad that we, like FreeBSD, are not taking rdrand at face value (thx Ted Ts'o!) Perhaps post-seed performance will improve down the road.
I wasn't talking about RDRAND. I can't remember what throughput it gives, and it's a pain to use "live" in that it's allowed to fail and return all zeroes.
The inner loop of a PRNG microbenchmark should be identical to the inner loop of an AES-CTR microbenchmark (or your favourite block cipher or MAC if you prefer a different one) and those are fast these days.
> But these problems aren't solved by reservoir sampling. Reservoir sampling is still O(N), and it requires generating a random number for every element.
I had the same thought reading this. This solution is only really suitable if the number of elements to select is within some constant factor of the total number of elements.
> More generally, reservoir sampling is appropriate when you don't know the size of the set beforehand. But in the premise, we do know the size of the set. And if you know the size of the set (and have some way to address arbitrary elements), then randomly generating a set of X indices is going to be much simpler and faster. Yeah, you have to deal with duplicate indices, but that's an easy problem to solve.
Last time I had to solve this, I came up with the following solution[1] which doesn't generate any duplicates and runs in O(k) where k is the size of the result set.
As I said on another comment in this thread: if you have the addresses (i.e. database ID values) for all of your records in memory, then sure, sampling directly from that set is faster than iterating over every record in a table.
There are lots of cases where this assumption doesn't hold: if you're sampling from a join, or from a table with non-contiguous ID values, for example.
Or, even better, if you're sampling from a data set as it's being generated. For example, let's say you want to grab 1,000 random tweets out of the output of Twitter's firehose for the next 2 hours - when you decide whether to keep tweet #1,329, you have no idea how many tweets are going to come in after that one, because they haven't been written yet!
Another way to put it is that reservoir sampling only makes sense if you have to iterate over the entire table regardless. That means no keys, row numbers, or unique values that can be chosen at random and looked up more quickly.
It seems less, not more, likely that this kind of situation would exist in a huge table - what use would such a table be if you could only process it sequentially?
Ignore the database part - that's just a single concrete example. More abstractly, say you're receiving a continuous stream of events, and you don't know when it will end, and you want to take n samples from that. For example, maybe you have a web service that is receiving click event logs as a stream, and you want to sample 1000 of those, randomly, without storing every single one as it comes in.
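For that streaming case, a bare-bones reservoir sampler looks something like this (a Ruby sketch; `events` is any Enumerable, e.g. the click-event stream above):

    # Keep n uniformly random events from a stream of unknown length.
    def sample_stream(events, n)
      reservoir = []
      events.each_with_index do |event, i|
        if i < n
          reservoir << event              # fill the reservoir first
        else
          j = rand(i + 1)                 # 0..i inclusive
          reservoir[j] = event if j < n   # keep with probability n/(i+1)
        end
      end
      reservoir
    end

    # sample_stream(click_events, 1000)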
On the other hand, you're likely to be taking a sample of rows that match a query. It's unlikely that the database already keeps track of the count of how many rows will match an arbitrary query, unless you set it up in advance. So you'd need to load the row ids of the matching rows, then sample from them.
A sufficiently smart database might understand "order by rand() limit n" and do reservoir sampling on row id's behind the scenes. I wouldn't count on it, though.
If you set it up in advance, you could also use reservoir sampling to avoid even storing all the data in the first place, while still saving a random sample.
Once-through sequential processing is a pretty common paradigm for large datasets. That's basically what Hadoop and column stores are about.
In fact, i think this is the underlying meme of 'big data' - that it's now plausible and useful to do analysis over all of a large dataset, so you might as well build everything around iterating over the entire table regardless.
You can still use this approach as long as you know the total size upfront: just sort the generated index array and use it to skip elements from your input stream. This saves you from having to generate a random number for every element.
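Something like this, as a Ruby sketch (assumes the total size is known up front):

    require 'set'

    # Draw k distinct indices, sort them, and pull only those positions
    # from the stream -- no random number per element.
    def sample_by_skipping(stream, total, k)
      targets = Set.new
      targets << rand(total) while targets.size < k
      targets = targets.to_a.sort
      out = []
      stream.each_with_index do |item, i|
        break if targets.empty?
        if i == targets.first
          out << item
          targets.shift
        end
      end
      out
    end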
Could you please expand on your linked code? I'm not really a C++ programmer, and lines 22/25 confuse me a bit.
Unfortunately I don't understand how each element has the same probability of being chosen.
I.e. the first element seems to have a probability of 1/n, while the second has 1/n + 1/(n-1) etc. Am I wrong about this?
/* This is standard modern C++ random number generation.
It means, using the random generator "engine", return
a number according to distribution "dist". In this case,
"dist" is a uniform distribution from "i" to "max-min".
"auto" is the new way to type variables in C++: it
infers the type of the variable from the type of the
expression */
22: auto index = dist(engine);
/* The following lines are a bit more complex, as they
contain a nested expression. Let us focus on the inner
expression:
(index < size ) ? array[index]
: map.insert({index, min + index}).first->second
This is in the form:
test-expr ? when-true-expr
: when-false-expr
This expression is a ternary condition. What it means
is that the expression before the question mark is
computed (index < size), and if it evaluates to true,
the overall expression will resolve to the result of
the expression between the question mark and the colon,
otherwise it resolves to what is after the colon.
Here, it means that if "index" is smaller than "size",
use "array[index]", otherwise insert a new entry in the
map, with key "index" and value "min+index", then reuse
the result of the operation by accessing member "second"
of the member "first". I am not familiar with STL maps;
here is an explanation of the map::insert method from
cplusplus.com:
The single element versions (1) return a pair, with
its member pair::first set to an iterator pointing
to either the newly inserted element or to the
element with an equivalent key in the map. The
pair::second element in the pair is set to true if
a new element was inserted or false if an equivalent
key already existed.
*/
23: std::swap(array[i], index < size
24: ? array[index]
25: : map.insert({index, min + index}).first->second);
/* Then, the std::swap operation simply swaps the two indicated
values in the array. */
> Unfortunately I don't understand how each element has the same probability of being chosen. I.e. the first element seems to have a probability of 1/n, while the second has 1/n + 1/(n-1) etc. Am I wrong about this?
This is just a partial Fisher-Yates shuffle[1] using a sparse sequence instead of an array. Keep in mind that if an element is not chosen it is swapped with the chosen element and will be considered again. I think the best way to think about it is that you are choosing a random element from the set and then "removing" it from the set with the swap and then repeating.
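For readers who'd rather not parse the C++, here is the same partial Fisher-Yates idea sketched in Ruby, using a Hash as the sparse "array" so only O(k) entries ever exist:

    # Return k distinct indices drawn uniformly from [0, total),
    # via a partial Fisher-Yates shuffle over a sparse array.
    def sample_indices(total, k)
      sparse = {}                       # slot -> value swapped into it
      (0...k).map do |i|
        j = i + rand(total - i)         # pick from the not-yet-used tail
        picked = sparse.fetch(j, j)     # value currently sitting at slot j
        sparse[j] = sparse.fetch(i, i)  # swap slot i's value into slot j
        picked
      end
    end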
Yes, reservoir sampling is O(N), but my point there was that it doesn't require instantiating every record at once. Also, the cost of the random number generation isn't the big deal in an ORDER BY RANDOM() implementation -- it's that even the smart implementations end up doing multiple table scans. So even at O(N) this probably beats most implementations of that approach, in practice.
Reservoir sampling is appropriate with more than just a set of unknown size -- you very frequently know the size of a set, but it's still too big to sample directly. But yes, if your sets are small, you have a lot of options.
While good in theory, it's often really slow in practice, as even small sample sets can trash your cache, and it still needs a random number for every element.
You can make a better approach using a much longer list, but then the math is far less elegant.
EX: Add the first X * N elements to the list. Then remove 1/X of them. Then 1/X of the times add an element to the list until it’s full. Cull 1/X of them and then add new elements (1/X)^2 of the time. Repeat until you run out of elements at which point you cull the list back to X elements.
The downside is you're generating ~N log N random numbers, so scanning the list and then sorting a random set may be faster.
If you have an approximate number of elements AND want a small percentage of them, then randomly generate an X * N element list which is then culled to an N element list.
Finally, if N happens to be close to the full number of elements you’re better off creating an exclusion list.
PS: Reservoir sampling is only good if you're memory constrained, need a constant time algorithm, and generating random numbers is expensive.
If you know the size of the set, then you can generate a random set of (non-duplicate) indices, and use those to fetch data.
Offhand I'm not sure how to do that in SQL, since data isn't usually fetched by index (unless the row has an explicit index key). Worst-case you could iterate over the rows just as you do with reservoir sampling, and it would still be faster (since you aren't generating tons of random numbers and doing a lot of array manipulation). Or you could do one query fetching just the primary key, record the keys of the indices you're interested in, and then do a second query to fetch that set. Or maybe SQL does have some way to fetch by index after all.
Yeah - unfortunately there isn't a great way to select n random rows in standard SQL.
If you have an integral primary key, you could select its max and generate random numbers up to that. Continuity isn't guaranteed so you would need to check for nonexistent primary keys in addition to checking for duplicates.
If there's no integral primary key, then there's no standard SQL solution faster than O(n).
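If there is an integral primary key, a hedged ActiveRecord-style sketch of the "generate ids up to max(id), then check they exist" approach (the model name is a placeholder, and it assumes k is much smaller than the table):

    require 'set'

    def sample_rows(model, k)
      max_id = model.maximum(:id)
      tried  = Set.new
      rows   = []
      while rows.size < k
        ids = Array.new(k - rows.size) { 1 + rand(max_id) }
        ids.reject! { |id| tried.include?(id) }
        tried.merge(ids)
        rows.concat(model.where(id: ids).to_a)   # silently drops ids that don't exist
      end
      rows.first(k)
    end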
If N is sufficiently larger than k, it might actually be faster to do something like
let count = query("SELECT COUNT(*) FROM table")
let indices = computeIndices(count, k)
for index in indices {
    results.append(query("SELECT * FROM table LIMIT 1 OFFSET ?", index))
}
Yeah, you're doing a bunch of queries, but you avoid copying all the data from all the rows.
Depending on the size of k compared to N, and how much data is in each row, it's still probably better than iterating over the entire contents of the table.
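In ActiveRecord terms that pseudocode would look roughly like this (`YourBigTable` and `k` are placeholders; as discussed below, OFFSET may still cost the database a partial scan):

    k       = 100                                # sample size
    count   = YourBigTable.count
    indices = Array.new(k) { rand(count) }.uniq  # may come up slightly short; redraw if exact k matters
    sample  = indices.map { |i| YourBigTable.offset(i).first }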
Also, do you have any citation for OFFSET j being O(j)? I can believe that's the case, but it also seems fairly obvious to optimize OFFSET for the case of a non-filtered unordered SELECT (or one ordered in a way that matches an index).
Assuming that the DB is backed by some kind of tree, it would need to maintain extra metadata in the branches in order to achieve even O(log n) offsetting. For this reason I could understand it being a linear operation.
My first solution would have been getting a count (maybe a cached count), picking a random number between 0 and that count, and fetching by index?
Or if you want more than one, just get a slice? Probably not the best, but off the top of my head.
I might be misunderstanding something, but this seems much more complex than systematic sampling, which is well understood even in the case where the number of records (read: population size) is unknown (this situation occurs in quality control on an assembly line).
Can anyone comment on why this reservoir method would be a better choice than the old fashioned systematic sample? (which, BTW, only requires the generation of one random number to determine the starting record).
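For comparison, a systematic sample really is only a couple of lines (a Ruby sketch):

    # One random start, then every step-th record.
    def systematic_sample(records, step)
      start = rand(step)
      records.each_slice(step).map { |slice| slice[start] }.compact
    end

    # step ~= population_size / sample_size when the size is known
    # systematic_sample(all_records, 1000)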
>>Reservoir sampling is appropriate with more than just a set of unknown size -- you very frequently know the size of a set, but it's still too big to sample directly.
I'm not sure that I buy this... Do you have an example? Still, it is a neat algorithm. And it is a true stream query algorithm; that is, if you have an open-ended stream, it maintains a random sample at all times.
If you have N locations with known addresses and you can select a sample of k from those addresses randomly with uniform probability, then sure, do that.
But there are lots of situations where you don't have that: say, you're sampling from a database join, or you're just using a table that has non-contiguous id values (e.g. a table with deletions). Or, in the case of truly "Big Data", you have the data spread over multiple machines. In these cases, you need an approach suitable for when you don't know the size of the stream, you can't load it all into memory and/or you can't select randomly from an address pool.
Sorry, I deleted this comment immediately after posting but you managed to respond to it too quickly, it originally said:
>Having a large input set of known size (and your result set is not large) is not a time when you would want to use reservoir sampling. Such a problem can be solved in O(k) where k is the size of the result set, while reservoir sampling is O(n) where n is the size of the input set.
This implementation is useful at some scales, but not all: it uses find_each to iterate over the ActiveRecord scope, which is going to be slower than doing it all in the database (if that's possible), but still way better than instantiating every record in a large table.
(find_each instantiates a few records at a time, then throws them away)
it will, in general, instantiate N-n/N records (even distribution, right? N = total records, n = sample size)
that is still too much data to fetch to the client for any large data set.
You don't have to instantiate the records; that's just the way I did it here. For example, you could do two passes: one to randomly select IDs, and one to fetch a small number of records for the sample set. That's O(2N), but still better than what most databases will do for an ORDER BY RANDOM() query.
There's also another, less-known variant of this algorithm that I'll go into in a future post that alleviates the concern.
Thanks for the clarification, that's exactly what I was wondering about. I have two questions:
- is it a common problem to have, not knowing the set size?
- is it O(N), or rather O(n)? My intuition tells me it depends on whether or not I have to retrieve every record in the set, because I/O should still be much more expensive than random number generation.
While it's great for medians, reservoir sampling for calculating high or low percentiles, particularly with high dynamic range data, will probably have very high error.
> is it a common problem to have, not knowing the set size?
Yes. It could be as simple as receiving values from a Python generator (or equivalent mechanism in your favourite language), where you don't know the number of items beforehand (and querying for the length is either impossible or expensive).
In such cases, using reservoir sampling is an elegant solution, because you're probably in a situation where you'll be iterating through all values anyway.
> - is it a common problem to have, not knowing the set size?
One motivating case is when the items arrive over time, like you have a website and you want to sample uniformly from your visitors over the course of the next hour. You don't need to know how many people will arrive.
> Reservoir sampling is still O(N), and it requires generating a random number for every element.
I agree with your comment in general. Only: you can do better than generating a random number for each element. (Generate a random number that tells you how many elements in the stream to skip.)
> Can you show that [generating a number of records to skip instead of a yes/no for each record] will still maintain a uniform probability distribution?
That should work if you pull the skip count from the appropriate distribution (negative binomial, I think).
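One standard skip-based formulation is known as Algorithm L; here is a hedged Ruby sketch of it (over an in-memory array for simplicity, and ignoring the measure-zero chance that rand returns exactly 0):

    # Reservoir sampling with skip counts: only O(k log(N/k)) random
    # numbers are generated instead of one per element.
    def reservoir_sample_with_skips(items, k)
      reservoir = items.first(k)
      w = Math.exp(Math.log(rand) / k)
      i = k - 1
      loop do
        i += (Math.log(rand) / Math.log(1 - w)).floor + 1   # how far to skip
        break if i >= items.size
        reservoir[rand(k)] = items[i]                       # replace a random slot
        w *= Math.exp(Math.log(rand) / k)
      end
      reservoir
    end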
Do you have to make sure to address the scenario where a generated index may possibly not exist in the table (deletion, whatever)? It says nothing about the indices being contiguous. This could potentially end up with your algorithm returning fewer than #{X} rows if your generated indices don't exist.
You didn't add anything to the conversation. It's fine to compliment someone, but add to the conversation otherwise the comment is just noise. If all you have to say is that, just upvote.
Wow, OmniRef is cool. StackOverflow-style Q & A embedded right into source code.
Reservoir sampling: I'm not sure that applying this algorithm to database sampling is the right thing to do. By its nature, the algorithm has to touch every single row in a database, and it does that because it's designed for data streams where you don't know in advance the size of the stream -- which isn't the case with database tables.
Assuming a uniform distribution in your database, you could instead do something like:
SELECT *
FROM (
    SELECT
        @row := @row + 1 AS rownum, [column, column, columns...]
    FROM (
        SELECT @row := 0
    ) r, [table name]
) ranked
WHERE rownum % [n] = 1
to get every nth record of your table, calculating n ahead of time for n ~= (table size)/(sample size). That should be a little bit faster and in most cases still provide acceptable results.
I used to work on a hospital database, back before the internet.
It happened a few times that we were asked for a random sample of records. It was quite common to want to recalculate the rehabilitation rates and other stats.
This was a problem, as patient records filled several tapes and, being the youngster, I had to sit loading tapes for hours.
Because our data set was bigger than memory, we had trouble deciding what to sample. We'd read everything in, filtering the records, and recording just their offsets. Then we'd pick some number, then re-read all the tapes to fetch the actual records. Naive and tedious.
Luckily the IT manager knew all about reservoir sampling, and explained it to us.
We made a little Turbo Pascal program that ran over the data files each night before they were archived to tape and kept a large (but small enough to fit on one machine) reservoir!
Every time we were asked for a random sample after that, we just handed out the reservoir.
The care managers, who thought they were still causing us to sit for hours shuffling tapes, were terribly grateful at the quick turnarounds. We never did let them in on our secret.
These days of course hospital records fit in RAM. Not the same kind of problems.
I think in any real life scenario I would have an idea about the number of records in my table, or could figure that out very cost effectively. So to follow OP's example with Rails, I would just do Table.where(id: [array_of_random_ids]), which is going to be vastly more efficient than his batch-processing Table.find_all.
Something should really only be filed under "Every Programmer Should Know" if it will be encountered in the real world more than once or twice. So in this case, the title feels like click-bait.
Reservoir sampling does have legitimate use in monitoring: to track real-time percentiles (e.g. over the last 15 minutes) without logging everything and re-calculating on each new data entry. I do think it is important and useful to know about, but I probably would not use it for the database sampling (as it was in the article).
Just looking at the comments, I think there is a tacit assumption on the part of some commenters that a stream of unordered records from a database has a random distribution relative to the value you care about.
That is almost never true. In fact, I have seen clever performance optimizations in database applications that exploit I/O scheduler and layout induced bias in index-less record order. In practice, sampling 25% of very large database tables will often not be representative of sampling 100% of that database table even if the value is not explicitly indexed. Database engines automatically do a lot of low-level optimizations that introduce bias in a nominally random dump of records as a stream.
It is actually quite difficult to randomly sample a large database table.
Does this matter for this example? It scans every single record, and adds it to the output with a certain probability. The ordering of the original stream doesn't matter I think.
Had a great opportunity to use this in anger a few months ago. Working on a mapping application, where the map could only handle a few hundred items at a time, but the customer account had tens or hundreds of thousands. And the user could set a realtime filter which would select an unpredictable subset of the larger collection.
QA raised a ticket complaining the subset we were using (just taking the first n items from the collection that matched the filter, collection size was enough that we didn't want to double enumerate or store the whole thing in memory) was unbalanced, eg we were giving the impression that areas of the map that had plenty of items were actually empty.
Switched to reservoir sampling to make sure the subset was a fair representation of the collection, with only one pass and not blowing our memory budget. And everyone you explain it to gets to admire the algorithm.
Indeed I've always thought this was a very cool algorithm.
I once had to extend it for a distributed setting (think mapreduce) where you can cheaply iterate all records but not sequentially. Instead you have multiple reservoirs, and you can't resample those uniformly since each saw a different number of records.
The unbiased way to aggregate 2 reservoirs is to draw a random number from a hypergeometric distribution. You aggregate a larger number by combining 2 at a time.
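A sketch of that merge step in Ruby (the hypergeometric draw is simulated by sampling k indices without replacement, which keeps everything O(k)):

    require 'set'

    # How many of k draws (without replacement) from seen_a + seen_b items
    # land in the first seen_a -- i.e. a hypergeometric sample.
    def hypergeometric_draw(seen_a, seen_b, k)
      picked = Set.new
      picked << rand(seen_a + seen_b) while picked.size < k
      picked.count { |x| x < seen_a }
    end

    # Merge two size-k reservoirs that saw seen_a and seen_b records.
    def merge_reservoirs(res_a, seen_a, res_b, seen_b, k)
      from_a = hypergeometric_draw(seen_a, seen_b, k)
      res_a.sample(from_a) + res_b.sample(k - from_a)
    end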
Fun story, an interviewer at Google once asked me to derive reservoir sampling a long time ago, which I completely whiffed.
I was able to derive it on the fly fast enough to make a StackOverflow answer once. I had never heard of or used the technique before, and certainly didn't know what it was called.
I'd probably whiff it in an interview too. For some reason my brain doesn't work as well in an interview setting. Thankfully I haven't had to worry about that too many times.
I really like this algorithm, and I think it's instructive to think about a naive solution, and how it differs.
Looping n times, calling rand(N) and putting that randomly chosen element in the output. This mostly works, and is actually far better, in space and time, especially in the case where "putting the element in the output" is expensive (as is implied by the "kajillions of records" premise).
The problem with this solution is that you might pick the same random number twice or more in a row. This means you have to keep track of your random numbers if you absolutely can't tolerate a slightly smaller sample.
The reservoir avoids reusing random numbers in a clever way, by only applying the random number as the target for replacement while iterating over every element of N. So the same slot in the sample might be replaced twice, but no element of N will be added to the sample twice.
One interesting thing is that when N is slightly larger than n, those extra elements are very likely to replace something in the base sample. In fact, the probability is n/N.
Then I'm a little bit worried about this algorithm, because if the probability of picking is n/idx then the odds of picking a long tail item get asymptotically close to 0. I'm not sure if it really qualifies as a sample of the entire thing.
Intuitively, the reason why it balances out is that while earlier items are more likely to be picked, they're also more likely to get kicked out after being picked.
I didn't want to use a public account (Omniref only accepts Github, Google, and FB) asking this question, so can someone here help me with the probability explanation?
Consider the simplest non-trivial example: a sample of 1 element, from a two-element reservoir. At iteration one, we accept the first element. At iteration two, if the sample is to be random, we have to decide to keep the new element with a probability of 1/2:
First element: (1/1) * (1/2) = 1/2
Why are we multiplying the acceptance of the first element with the probability of keeping it? And then...
But what if we're sampling from a pool of three elements instead? We know that we therefore have to keep the new element with a probability of 1/3…which means that we need to keep the element we already have with probability 2/3:
First element: (1/2) * (2/3) = 1/3
Where did the 1/2 come from? Why is it not 1/1 here?
The left side is the cumulative probability that it's been kept "so far". The right side is the probability it will be kept, rather than replaced by the new element (probability that the new element will not replace this one). Multiply them together and you get the overall probability that the element remains after the new element has been processed.
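If the algebra doesn't convince you, a quick simulation of the three-element case is easy to run (a Ruby sketch):

    # Sample 1 element from a 3-element stream many times; each element
    # should be kept about 1/3 of the time.
    counts = Hash.new(0)
    100_000.times do
      kept = nil
      [:a, :b, :c].each_with_index do |item, i|
        kept = item if rand(i + 1).zero?   # keep with probability 1/(i+1)
      end
      counts[kept] += 1
    end
    p counts   # all three counts should be roughly 33,000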
The timing of this post is pretty great! I work a lot with "largish files" in R, and while R has gotten much better at reading big files (thanks to data.table's fread), often you really don't want to work with the whole file. I was surprised that there wasn't a tool to sample from a stream. I wrote a simple perl script https://github.com/earino/fast_sample to do just that. I was "getting around" the issue of needing reservoir sampling by allowing the user to pass in a "proportion of lines" (e.g. .01 for 1%).
I realized that I needed to add the ability to sample an exact number of records, so I went ahead and implemented reservoir sampling. It's a neat algorithm and the performance is pretty stellar. Now I just need to expand it to allow stratified sampling on a column! :)
IIRC this is how fortune[1] selects a message fairly, given the crappy database format it uses (possible fortunes are separated by lines containing just '%').
I use this algorithm in one of my tweet bots that tweets out a random tweet from a txt file containing 10k different quotes, works fantastically and has been for several months with no duplicates or malfunctions.
Interesting, although I'd need to do the math on this to believe it.
Another interesting problem is how to pick 3 million rows out of 4 million at random, without duplicates. With O(N) time and picks streaming from the algorithm.
I don't need to know this. I've never used a database in my programming career, ever (I'm a game programmer, btw).
I'm not saying it's not interesting and I wouldn't want to familiarise myself with it out of general interest, but not only do I not need to know this, it would waste valuable headspace at the moment.
There are lots of kinds of programming. We don't all do webapp / internety stuff.
Essentially the title of this article is not just hyperbolic, it's wrong.
There are so many articles with '10 things programmers must know' (they are usually all related to the internet). If I actually went and learned them all, I wouldn't be able to remember how to write a shader or rotate stuff.
This discussion should actually be about the site omniref, since that's why it was posted. Not to say that's a bad thing, since the site seems cool. Just wanted to point that out.
It would be really great to see some plain-english explanation of what the algorithm does, for those of us who aren't already knee-deep in ruby/ActiveRecord.
You may be right...I spent a lot of time thinking about that, and I convinced myself that this was correct, but I could be wrong. My thinking is that rand(N) in ruby returns a value from [0, N-1] inclusive, which is the complete index range of the sampled stream so far.
That sounded right to me. I could be convinced otherwise.
You are not taking the index of the element that you are currently sampling into consideration.
Suppose that the sample size is 1 and you are getting the second item (index 1). You will call rand(1), which has 0 as the only possible outcome. So, you will always replace the first item (index 0). Whereas if you would call rand(2) (possible outcomes: 0 and 1), you replace the item in the sample with probability .5 (assuming that the random number generator is uniform).
I'm not a ruby guy at all (never used it), but if I understand this correctly:
j = rand(idx)
out[j] = i if j < n
...it is keeping the sampling probability at n/idx where n is the sample/reservoir size and idx is the number of items seen so far. Then if this item is selected for sampling, each of the n items in the reservoir is equally likely to be replaced. I think the implementation is correct, assuming that idx is the number of items seen so far.
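Putting the two comments above together, here is the loop with the indexing made explicit (a sketch; `stream` and `n` stand in for the input and the sample size, and idx is the 0-based position, so idx + 1 items have been seen when an item arrives):

    out = []
    stream.each_with_index do |item, idx|
      if idx < n
        out << item
      else
        j = rand(idx + 1)        # idx is 0-based, so idx + 1 items seen so far
        out[j] = item if j < n
      end
    end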
Probability is something else. I consider myself fairly good at understanding it. I like to think (to myself) I'm in the 95th percentile in the world in understanding probability. Yet these 4 concepts will regularly throw me off:
http://tempr.org/54f9f429588f8.html
> "I have a database table with #{17.kajillion} records, and I want to take a random sample of #{X} of them. How can I do this efficently?"
And goes on to say that something like `YourBigTable.all.sample(X)` is bad in time and space. And the link that explains why doing `ORDER BY RANDOM()` is bad lists one reason being that generation of random numbers is relatively expensive.
But these problems aren't solved by reservoir sampling. Reservoir sampling is still O(N), and it requires generating a random number for every element. It still beats `ORDER BY RANDOM()` because it doesn't have to sort the numbers (or even keep them all in memory at the same time). But that doesn't make it a good choice.
More generally, reservoir sampling is appropriate when you don't know the size of the set beforehand. But in the premise, we do know the size of the set. And if you know the size of the set (and have some way to address arbitrary elements), then randomly generating a set of X indices is going to be much simpler and faster. Yeah, you have to deal with duplicate indices, but that's an easy problem to solve.