Vitter's reservoir sampling algorithm D: randomly selecting unique items (getkerf.wordpress.com)
117 points by colinprince on April 4, 2016 | 74 comments



> The first obstacle that makes the “dealing” (without replacement) hard is that the “draw” method doesn’t work. If you draw (with replacement) over and over again, ignoring cards you’ve already picked, you can simulate a deal, but you run into problems. The main one is that eventually you’re ignoring too many cards and the algorithm doesn’t finish on time (this is the coupon collector’s problem).

I don't think this is true!

First note that if you want to pick 98 values from 100, it is much simpler to instead decide which 2 to skip. Similarly when picking 75 out of 100 - better to just cross out 25. The hardest case is picking half: N/2 out of N, without replacement.

This is the Coupon Collector's problem, except we're stopping at half the coupons. We can use the analysis given in the Wikipedia page [1], except we only take the second half of the series.

This gives an expected time of

    N * (H(N) - H(N/2))
where H(X) is the Xth Harmonic number.

In the asymptote this yields:

    N * (γ + ln(N) - γ - ln(N/2))
    = N * (ln 2)
so the expected time of the "retry on duplicates" approach is linear in N. (Of course, the worst case is infinite, if you keep redrawing the same card.)
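A quick simulation backs this up (just a sketch; the constants here are arbitrary):

    import random

    def draws_to_collect_half(N):
        # redraw with replacement, ignoring repeats, until N/2 distinct values are seen
        seen, draws = set(), 0
        while len(seen) < N // 2:
            seen.add(random.randrange(N))
            draws += 1
        return draws

    N = 10_000
    trials = 100
    avg = sum(draws_to_collect_half(N) for _ in range(trials)) / trials
    print(avg / N)   # hovers around ln(2) ~ 0.693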

edit: To complete the analysis, the requirement is that the answer be linear in k, the number of items selected. But we have k < N, so if we are O(N) we are also O(k) or better. So I think the argument works.

[1] https://en.wikipedia.org/wiki/Coupon_collector%27s_problem


I think it is true. We head off discussion here by referring to the naive approach, but what you're doing is past naive coupon collector's and past the technical scope of the context.

There are several steps where your method could (will) fail. Since you don't introduce any machinery for the hand, you imply an O(k^2) checking algorithm. You also imply an unsorted hand. If you generate the hand in random order, then you fall to an O(k*lg(k)) sort. "Sorted" is an implicit requirement that we initially omit, but revisit later, because introducing it early makes the discussion harder to follow.

If you try to introduce machinery like a hashtable, the next thing you might suggest, you violate an implicit O(1) additional space requirement, which I've grudgingly reintroduced to the more-formal description ("why is this here?"). This conversation keeps going, but other failure paths we leave unexamined.


In addition to not being sorted it's also not "online." You won't get the first value out of it in constant time if k is too big, and you need to access elements in a random order.


This is not correct as written: "But we have k < N, so if we are O(N) we are also O(k) or better."


That's right. We can fix it up though. The E(t_i) we care about are bounded above by a constant (2), so the sum of k such expectations is linear in k.

For the cases where k > n/2, the "full table scan" portion is of course also linear in k.


Nice save! FWIW, here's the fix I wrote up before I saw yours. Yours is much better ;)

We only need to consider values in the range 0 < k <= N/2. If k > N/2, we "pick the losers instead of winners" and that reduces k to that range.

The expected time E to pick k out of N is N times the last k terms of H(N):

    E(k) = N * (H(N) - H(N-k))
Applying the asymptote:

   E(k) ~ N * (γ + ln(N) - γ - ln(N-k))
        = N * (ln(N) - ln(N-k))
We wish to determine its behavior as a function of k. Taking the derivative:

   dE/dk ~ N/(N - k)
which is positive and bounded above by 2 for 0 < k <= N/2:

   dE/dk <= 2
Using the fact that E(1) = 1, we have:

   E(k) <= 2k
So E cannot be worse than linear in k.
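(If anyone wants a quick numeric sanity check of that bound, here's one in Python; N and the k values are just illustrative:)

    def H(x):
        # x-th harmonic number
        return sum(1.0 / i for i in range(1, x + 1))

    N = 1000
    for k in (1, 100, 250, 500):
        E = N * (H(N) - H(N - k))
        print(k, round(E, 1), "<=", 2 * k)   # E(k) never exceeds 2k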


This crossing out is, of course, not possible if you have a stream coming in, which is the case for which these sampling algorithms are used the most.


A much cleaner, GPL-ed implementation is provided here: https://github.com/WebDrake/GrSL/blob/master/sampling/vitter...

(I personally find Kevin's code very hard to follow)


IMO the Wikipedia article on this is better than this article: https://en.wikipedia.org/wiki/Reservoir_sampling


I only see an O(n) algorithm there; there is no O(k) one. Am I missing something?


Wiki shows Algorithm R, this article talks about Algorithm D, both by the same author.

The former is well known, while the latter is more obscure. This is probably because the extra complexity of Algorithm D is rarely necessary. If k and n are on the same order of magnitude, there is not much difference, and if k << n then simple random sampling would work nearly as well.


Here is an alternative:

Use a multiplicative linear congruential generator with a period of 2^n, where n is ceil(log2(m)) and m is the size of the list.

Seed the generator, start generating values and drop all values that are larger than m. In the worst case you will have to drop half of the values, in the best case you will drop none. This depends on how close to a power of 2 m is.

The generator is very short, one line of code, and very fast: one multiplication, one addition, and one bitshift. Space complexity is O(1).

If your tweets are generated from an integer value, then you can use this value to generate them. The tweets will be unique (assuming the tweet generator produces unique tweets from unique values), and you will have m of them.
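A minimal sketch of this in Python (the constants are mine, and I mask instead of shifting; a full-period LCG mod 2^n visits every value in [0, 2^n) exactly once, so filtering to [0, m) leaves exactly m distinct indices):

    def lcg_indices(m, seed=0):
        # visit every value in [0, 2^n) exactly once, keep those below m
        n_bits = max(1, (m - 1).bit_length())      # n = ceil(log2(m))
        mod = 1 << n_bits
        a, c = 6364136223846793005, 1442695040888963407  # a % 4 == 1, c odd -> full period
        x = seed & (mod - 1)
        for _ in range(mod):
            x = (a * x + c) & (mod - 1)            # one multiply, one add, one mask
            if x < m:                              # drop values outside the list
                yield x
For example, list(lcg_indices(10)) yields each of 0..9 exactly once, in an order fixed by the seed and constants.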


This is very practical, but the result is not uniform; in fact it only returns a tiny fraction of the possible combinations.

There are "N choose k" combinations, which in the worst case (k = N/2) grows as ~ 2^N. The number of combinations quickly outstrips the number of possible states of the LCG, and most combinations will never be found. For example, we have a list of length 16 and wish to choose 8. We have a choice of 16 distinct seeds, so there are only 16 different combinations (out of 12870 possible) that we can get for any given LCG.


By randomly adding skips, you can definitely get to all possible combinations even with only 16 seeds (since the skips can be seeded from a much larger space), although there could be a speed implication.


Good point. You would get different sequences by choosing different constants for the generator, but you couldn't cover all of them by a long shot.



This works, and is so much simpler.

It's not quite the same problem though; the algorithm described returns samples in sequential order. They mention performance reasons for this (C-f "tape sampling"). It also has the limitation that you need to know the sample size in advance.


MLCG is sequential. No value will be repeated within the period.


"sequential" in the sense that if returned values are i1, i2, ..., in, then 0 <= i1 < i2 < ... < in <= N.

A MLCG will return numbers all over the place; sorting them will take an additional O(n log n) time.


I see.


He means: outputs are sorted.


The generator is also highly predictable and suffers from defects that may affect the overall analysis.


This only matters in calculations where randomness is important, e.g. Monte Carlo methods.

Nevertheless, if you don't use the least significant bits, and if the constants are carefully chosen, MLCG passes most of the hardest statistical tests. For example it passes all DIEHARD tests, and most of TESTU01.


In addition, the described algorithm "is on-line - that is, it requires no preprocessing and can generate each element of the sample in constant expected time." (Quoting from the original paper.) It needs only an upper bound on the number of items, while your suggestion requires knowing the number beforehand.

You also need an MLCG over the range 0..2^64-1 to match the given algorithm. I believe that requires some extra code to handle potential overflow in a MLCG, so adds a few more lines to your estimate.


Vitter's algorithm is for when you need to generate a random list of unique items with sequential reads. If you drop that requirement, you can just use a substitution-permutation network: https://en.wikipedia.org/wiki/Substitution-permutation_netwo...

An SPN will generate a unique random permutation, from which you can draw your random list of unique items. With a little post-processing, you can generate random permutations of arbitrary size.

I have implemented it in Matlab and Python:

http://nl.mathworks.com/matlabcentral/fileexchange/36626-alg...

https://gist.github.com/jdfr/46633c630471494d67ae


I can't believe this article doesn't explain how it works. Especially after stating "after reaching Vitter’s papers it again takes a concentrated effort to figure out what you need".


Yeah, the article was pretty dense so I just slogged through it hoping I would get to the explanation part.

Nope, it just gave me ugly C code at the end.


I read very often (by people who don't even use it) that Perl is a write-only language. Then I see code like this roll along that demonstrates perfectly how the readability of any given piece of code is 99% on the coder, not the language.


Why can't you just create random numbers and discard duplicates?

For practical purposes, when list.length << total.length, shouldn't that be good enough?


I think the other replies to you are mistaken. This algorithm meets the requirements of running in linear time, once you apply the inverting trick when the sample size exceeds half of the pool size. See my top-level comment.


In certain circumstances, like when you're drawing from a limited pool, this becomes N^2 and unbounded - Imagine you're trying to shuffle a deck of cards. Select a card number, 1-52. The first draw, you're guaranteed to get a unique card, the second, you've got a 1/52 chance of having to generate a second number, all the way up to the final card you have a 1/52 chance of generating the final number you need, and you'll spin for quite a bit waiting for that to come up, with a possibility (if your RNG is bad) that you'll never get there. This gets amplified if you've got a larger set - If you draw from six decks, you'll go through an average of 156 'draws' before you get your final card... and the card before it almost as many!

The nicer way to do this is to have two lists. Put all of your cards into the first list. Choose a number between 1 and the length of list 1, then remove that element from the list and add it to the end of list 2. Repeat until list 1 is empty. It's linear-time (with a linked list), and there's no chance of duplicates.
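For reference, a minimal sketch of the two-list idea (I use swap-with-last removal on a plain Python list rather than a linked list, which keeps each removal O(1)):

    import random

    def deal(deck, k):
        deck = list(deck)                           # list 1: remaining cards
        hand = []                                   # list 2: dealt cards
        for _ in range(k):
            i = random.randrange(len(deck))
            deck[i], deck[-1] = deck[-1], deck[i]   # move the chosen card to the end
            hand.append(deck.pop())                 # O(1) removal, no duplicates possible
        return hand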

But this isn't the scenario presented in the article - It's an algorithm for a very specialized version of a shuffle, where you need a random sample from an unbounded sequential (meaning, you don't know its length ahead of time) list. The algorithm presented is for getting a good random sample from a dataset well beyond the size of memory, not for doing a perfect shuffle on a small list.


"The nicer way to do this is to have two lists. Put all of your cards into the first list. Choose a number between 1 and the length of list 1, then remove that element from the list and add it to the end of list 2. Repeat until list 1 is empty. It's linear-time (with a linked list), and there's no chance of duplicates."

I have to add my $0.02. This is correct; your algorithm will give the correct result with no chance of duplicates. However, in some contexts this absolutely must not be done - I've worked at a company which made online gambling games, and the auditing requirement was a 1-to-1 correspondence between the number drawn from the RNG and the card the player sees. For example, the Ace of spades must always be 49, whether it was drawn as the first or last card from the deck.


I had the same thought. However, thinking about it more, the requirement is that this algorithm be O(k). As k gets closer and closer to the size of the list, you will spend more and more time looking for unique items. You also need to pay the cost of checking for uniqueness, which also grows with k.


Well yeah, that is exactly why I said it's for the situation where the length of the list is much smaller than the total number of tweets, which should cover most practical usages.


You store the selected items in a hashtable or similar data structure with O(1) behavior of "x in y".
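Something like this, for example (a minimal sketch; Python's set plays the role of the hashtable):

    import random

    def sample_rejecting_duplicates(k, n):
        chosen = set()
        while len(chosen) < k:
            chosen.add(random.randrange(n))   # "x in y" and insert are O(1) on average
        return list(chosen)                   # note: order is arbitrary, not draw order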


You can, but that becomes inefficient for very large numbers


It's not O(k) then. As he mentioned, he's writing this as a language feature, so he can't make simplifying assumptions about k or n. It must handle large n, and can't assume k << n.


I see, as a language feature it makes sense to overoptimize for any situation.


   >>> random.sample(xrange(0, 1000000000), 5)
   [5258132, 23096612, 43529214, 91062733, 4912658]
This ran in a few milliseconds on my old MacBook in Python; xrange is a lazy iterator over a 1-billion-element range, so the full list is never built. Looking at the source, they seem to track previous selections for large populations in a set, for which "x in y" lookup is O(1) on average.

https://svn.python.org/projects/python/tags/r32/Lib/random.p...


Actually, due to seeding, no matter how many times you run this you will never generate all combinations of tweet orders. Of course that doesn't really matter since for large values of N there isn't enough time in the universe to do so.


Could someone explain how this works? I feel like unique shuffling given a random seed is very useful algorithm, but I can't follow this code because of the variable naming and stuff.


The general idea is to draw n uniform samples from [1..N] and return them in sorted order. You can generate these incrementally, in order (no sort needed). The mean difference between successive samples is ~N/n. If the samples are uniformly distributed, their differences should have roughly an exponential [0] distribution (if N, n are large).

So the inductive step is: given k'th sample a[k], let a[k+1] = a[k] + S, where S is drawn from a discrete exponential distribution with mean 1/λ = N/n. That's the `exp(log(rand()) * stuff)' part of the code.

This almost works; there's more details [1] if you want it to actually work, and have accurate statistics, not overflow N, etc.

[0] https://en.wikipedia.org/wiki/Exponential_distribution

[1] https://hal.archives-ouvertes.fr/file/index/docid/75929/file...
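Roughly, in code (a sketch of the gap idea only, not Vitter's actual Algorithm D; the adaptive mean and the clamp are my own crude fixes to land exactly n samples inside [0, N)):

    import math, random

    def approx_sequential_sample(n, N):
        # yield n increasing indices in [0, N) by jumping roughly exponential gaps
        pos, items_left, samples_left = -1, N, n
        while samples_left > 0:
            mean_gap = items_left / samples_left
            skip = int(-math.log(1.0 - random.random()) * mean_gap)   # the exp(log(rand())*...) part
            skip = min(skip, items_left - samples_left)               # crude clamp so we never overshoot
            pos += skip + 1
            yield pos
            items_left -= skip + 1
            samples_left -= 1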


Huh, is it really true that Reservoir Sampling is not well-known? It was part of the standard curriculum in my top-30-but-not-top-10 CS school, the University of Utah. I think that even a significant subset of my cohort could explain the correspondence between this and the Fisher-Yates shuffling algorithm.


I don't even have a CS degree and I know about it. I wrote some PowerShell (don't ask) to sample a large collection of documents from file shares looking for embedded links to other documents. We needed to break up the existing file share into smaller pieces and wanted to estimate how many documents would have to be touched due to broken links. (The surprising answer was "very few", since most links were relative to the containing directory).


The key to why the obvious algorithm of "half a Fisher-Yates shuffle" isn't enough is that it uses O(k) memory, whereas Vitter's algorithm is constant.

http://play.golang.org/p/UqC6XWrPLB


> Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory.

So if N is really large and doesn't fit into main memory, how does one iterate over such a large list in a reasonable amount of time? When I was taught algorithms, my sample inputs were relatively small, from 10,000 elements to 1M integers. Big deal. But I am not sure how to think about it when N is really huge, or whether we'd ever really have to deal with 10B integers in one pass. In that case, it's more likely I am paging through a database cursor... someone please shine some light.


Basically you do a reverse Fisher-Yates shuffle.

You are selecting n of N items, where N is unknown, in one pass, where every item is treated as unique, and where n is your sample size.

First get n items from N in an array of size n of candidates, which we'll call C.

Then, for each subsequent item k until you reach the end of N, you know how many items you have seen so far; select a random integer in the range [0, count). If it falls in 0 to n - 1 (assuming a zero-indexed array), replace that slot in C with the current item k.

This means that each item will have an equal chance of being a final candidate when you reach the end of N. If you care about the order of the random sample, you can always shuffle the n candidates in C after you reach the end.
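In code the idea looks roughly like this (essentially Algorithm R; the function and variable names are mine):

    import random

    def reservoir_sample(stream, n):
        reservoir = []
        for count, item in enumerate(stream):
            if count < n:
                reservoir.append(item)           # fill the reservoir with the first n items
            else:
                j = random.randrange(count + 1)  # random slot in [0, count]
                if j < n:
                    reservoir[j] = item          # replace with probability n/(count+1)
        return reservoir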


Seems like the solution is to employ MapReduce http://had00b.blogspot.com/2013/07/random-subset-in-mapreduc...


The algorithm accesses each element in the input only once, and in order.

It's like reading a big file. Just call read() repeatedly, and discard old data each time you call read() again.


GNU coreutils' shuf since v8.22 (Dec 2013) implements this to minimize memory usage when sampling large files or unknown numbers of items read from a pipe.

For example, this command never finishes, but consumes a constant (small) amount of memory.

  seq inf | shuf -n1


Everywhere in this article, and in all of the current comments, where the word "unique" appears, it really should be the word "distinct" instead.


This may be a stupid question, but why not store cards/tweets in a hashmap and add a .contains() check before adding?


Say you need to draw 100 billion samples without replacement from a dataset with a trillion-plus rows; the resource requirements for a hashmap would become enormous.


The C code in this article is a mirror of the code from Appendix 2 in Vitter's paper, which I guess explains/excuses the abbreviated variable names. The paper says things like, "use an exponentially distributed random variate Y," and uses variable names like n, N, U, S, X, y1, and y2 in the appendix.

Nonetheless, I find this coding style unreadable. "Y" is not a very good name for an exponentially distributed random variate, especially in a function that defines 24 variables with mostly meaningless names. :-(


var naming aside, the whitespacing is perplexing:

      U = random_double(); negSreal=-S;
      y1=exp(log(U*Nreal/qu1real)*nmin1inv);
      Vprime = y1 * (-X/Nreal+1.0)*(qu1real/(negSreal+qu1real));
      if(Vprime <= 1.0) break;       y2=1.0; top = -1.0+Nreal;       if(-1+n > S){bottom=-nreal+Nreal;limit=-S+N;}
      else{bottom=-1.0+negSreal+Nreal; limit=qu1;}
Sometimes two statements per line, sometimes one. Sometimes spaces before and after equals, sometimes none. Sometimes multiple ifs, breaks and statements on one line.

It looks like it was copy-pasted from a pdf. Maybe it was.


I feel like 2 seconds in an IDE would have improved readability quite a bit. A simple automated cleanup would have helped worlds.


On the other hand, you need the original names in order to reference back to the paper.


random_variate_y

:P


Bloom filters might be a good alternative for this use case if you don't mind false positives.


> generate a list of random tweets, without duplication.

Considering there are somewhere over 1 trillion tweets (over 200 billion per year), this is a very easy problem: you do not even have to check for duplicates, because the chance of getting a duplicate is so small that it is statistically unlikely to ever happen.


You're making the assumption that the requested list is quite short. It should be straightforward to show that the probability of a duplicate grows with the requested list length.
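A rough birthday-bound calculation makes this concrete (the numbers below are illustrative):

    import math

    def dup_probability(k, N):
        # probability of at least one duplicate when drawing k of N with replacement
        return 1.0 - math.exp(-k * (k - 1) / (2.0 * N))

    print(dup_probability(10_000, 10**12))   # ~5e-5: duplicates essentially never happen
    print(dup_probability(10**7, 10**12))    # ~1.0: duplicates are near-certain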


Based on your list size, try a reasonably designed hash function here.


This may be a stupid question, but what exactly is wrong with (using the dealing random cards analogy):

  for i = 1 to k
      pickedCardIndex = randomInt(n - i)
      hand[i - 1] = deck[pickedCardIndex]
      swap(deck[pickedCardIndex], deck[n - i])
As far as I can see this satisfies all conditions stated by the author (assuming that randomInt(x) produces a uniformly random integer in the range [0, x], there are n cards in the deck array, and the arrays are 0-indexed).


How do you implement swap if deck is too large to hold in memory?


mmap the deck


Your solution requires space to hold all items from the pool


I believe it would be easy to modify it to keep only the k chosen indexes in memory, and for k significantly smaller than n, with n on the order of 2^64, it would still be much faster than iterating through all n (which is the first algorithm presented here:

https://en.wikipedia.org/wiki/Reservoir_sampling )
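One way it could be done (my own sketch, not necessarily what was originally meant: a partial Fisher-Yates where a dict records only the displaced positions, so state stays O(k)):

    import random

    def sparse_sample(k, n):
        swapped = {}                              # position -> value currently stored there
        picks = []
        for i in range(k):
            j = random.randrange(i, n)
            picks.append(swapped.get(j, j))       # take whatever currently sits at position j
            swapped[j] = swapped.get(i, i)        # move the front element into the vacated slot
        return picks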


However, note that in Vitter's paper all the discussed algorithms are O(k) as per the previous notation, or O(n) using the paper's notation, where n elements are selected from a set of N:

http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf

In short, it's worth reading Vitter's paper, the original post is just a post containing the C code, but the discussion of the algorithm and the comparison with the alternatives is in the paper.

(Edit: corrected the notation of O() as robrenaud suggested)


Not o (little oh), but rather O (big oh).

There is a difference between them. Little oh roughly means "grows strictly slower than, in the limit", where as big oh roughly means "grows no faster than, in the limit".

http://stackoverflow.com/questions/1364444/difference-betwee...


I think you aren't allowed to mutate the deck.


Again, I am confused. Why doesn't this snippet solve the problem?

  void increasingRandomSequence(int *arrayptr, int base, int k, int n)
  {
      if (k == 0) return;
      int i = randInt(n - k);
      *(arrayptr) = base + i;
      increasingRandomSequence(arrayptr + 1, base + i + 1, k - 1, n - (i + 1));
  }
increasingRandomSequence(hand, 0, k, n) fills the array hand with a sequence which is picked with uniform distribution over all increasing sequences of length k with numbers in [0, n - 1].

This is O(k), and we can shuffle this in O(k). Why doesn't this solve the problem?


The method is on-line, that is, it doesn't need to know "n".

Assuming I translated your above code correctly into Python, as:

    import random

    def increasingRandomSequence(base, k, n):
      while k > 0:
        i = random.randrange(n - k + 1)
        yield base + i
        base += i+1
        k -= 1
        n -= i+1
then I checked the above using:

    N = 6
    counters = [0] * N
    for i in range(100000):
        for value in increasingRandomSequence(0, 5, N):
            counters[value] += 1
    print(counters)
and found a strong bias towards larger numbers. The counts are:

    [50033, 75037, 87438, 93686, 96900, 96906]
which means 5 is in the hand much more often than 0.


It looks like this dramatically undersamples the beginning. Like if I ask for 1000 numbers out of 10,000, the very first number you provide will be (on average) nearly halfway through the list.


You guys are right: it's uniform in the first element of the sequence, which in a correct sample actually shouldn't be uniform at all... Thanks for pointing it out!



