> The first obstacle that makes the “dealing” (without replacement) hard is that the “draw” method doesn’t work. If you draw (with replacement) over and over again, ignoring cards you’ve already picked, you can simulate a deal, but you run into problems. The main one is that eventually you’re ignoring too many cards and the algorithm doesn’t finish on time (this is the coupon collector’s problem).
I don't think this is true!
First note that if you want to pick 98 values from 100, it is much simpler to instead decide which 2 to skip. Similarly when picking 75 out of 100 - better to just cross out 25. The hardest case is picking half: N/2 out of N, without replacement.
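The invert-the-selection trick can be sketched like this (a Python sketch; `random.sample` stands in for whatever draw-without-replacement routine you have, and the function name is mine):

```python
import random

def sample_by_complement(N, k, rng=random):
    """Pick k distinct values from range(N); when k > N/2,
    it is cheaper to pick the N - k values to *skip* instead."""
    if k > N - k:
        skip = set(rng.sample(range(N), N - k))
        return [i for i in range(N) if i not in skip]
    return rng.sample(range(N), k)
```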
This is the Coupon Collector's problem, except we're stopping at half the coupons. We can use the analysis given in the Wikipedia page [1], except we only take the second half of the series.
This gives an expected time of
N * (H(N) - H(N/2))
where H(X) is the Xth Harmonic number.
Asymptotically (using H(X) ≈ ln X + γ) this yields:
N * (γ + ln(N) - γ - ln(N/2))
= N * (ln 2)
so the expected time of the "retry on duplicates" approach is linear in N. (Of course, the worst case is infinite, if you keep redrawing the same card.)
edit: To complete the analysis, the requirement is that the answer be linear in k, the number of items selected. But we have k < N, so if we are O(N) we are also O(k) or better. So I think the argument works.
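A quick empirical check of the N·ln 2 figure (a Python sketch; the function name is mine, it's just the retry-on-duplicates loop):

```python
import random

def draws_until_k_distinct(N, k, rng=random):
    """Count how many 'retry on duplicates' draws it takes
    to collect k distinct values out of N."""
    seen = set()
    draws = 0
    while len(seen) < k:
        seen.add(rng.randrange(N))
        draws += 1
    return draws

random.seed(0)
N = 10_000
draws = draws_until_k_distinct(N, N // 2)
# Expected: N * (H(N) - H(N/2)), which is about N * ln(2), i.e. ~6931 draws.
```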
I think it is true. We head off discussion here by referring to the naive approach, but what you're doing goes past the naive coupon collector analysis and past the technical scope of the context.
There are several steps where your method could (will) fail. Since you don't introduce any machinery for the hand, you imply an O(k^2) checking algorithm. You also imply an unsorted hand. If you generate the hand in random order, then you fall to an O(k*lg(k)) sort. "Sorted" is an implicit requirement that we initially omit, but revisit later, because introducing it early makes the discussion harder to follow.
If you try to introduce machinery like a hashtable, the next thing you might suggest, you violate an implicit O(1) additional space requirement, which I've grudgingly reintroduced to the more-formal description ("why is this here?"). This conversation keeps going, but other failure paths we leave unexamined.
In addition to not being sorted it's also not "online." You won't get the first value out of it in constant time if k is too big, and you need to access elements in a random order.
Wiki shows Algorithm R, this article talks about Algorithm D, both by the same author.
The former is well known, while the latter is more obscure. This is probably because the extra complexity of Algorithm D is rarely necessary. If k and n are on the same order of magnitude, there is not much difference, and if k << n then simple random sampling would work nearly as well.
Use a multiplicative linear congruential generator with period 2^n, where n = ceil(log2(m)) and m is the size of the list.
Seed the generator, start generating values and drop all values that are larger than m. In the worst case you will have to drop half of the values, in the best case you will drop none. This depends on how close to a power of 2 m is.
The generator is very short, one line of code, and very fast: one multiplication, one addition, and one bitshift. Space complexity is O(1).
If your tweets are generated out from an integer value, then you can use this value to generate them. The tweets will be unique (assuming the tweet generator will generate unique tweets from unique values), and you will have m of them.
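A minimal sketch of the traversal in Python. The constants a=5, c=3 are illustrative, not the ones the commenter had in mind; note that reaching the full 2^n period with a power-of-two modulus requires an additive constant (Hull-Dobell conditions: c odd, a ≡ 1 mod 4), matching the "one addition" in the description:

```python
def lcg_indices(m, seed=0):
    """Visit every value in range(m) exactly once, in pseudorandom order.
    Uses an LCG with modulus 2**n >= m (full period via Hull-Dobell);
    generated values >= m are dropped."""
    n = max(1, (m - 1).bit_length())   # ceil(log2(m)) for m > 1
    mask = (1 << n) - 1                # reduce mod 2**n with a bitmask
    a, c = 5, 3                        # illustrative full-period constants
    x = seed & mask
    for _ in range(1 << n):
        if x < m:
            yield x
        x = (a * x + c) & mask
```

Each seed just picks a different starting point on the same cycle, which is exactly the limitation discussed below.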
This is very practical, but the result is not uniform; in fact it only returns a tiny fraction of the possible combinations.
There are "N choose k" combinations, which in the worst case (k = N/2) grows as ~ 2^N. The number of combinations quickly outstrips the number of possible states of the LCG, and most combinations will never be found. For example, we have a list of length 16 and wish to choose 8. We have a choice of 16 distinct seeds, so there are only 16 different combinations (out of 12870 possible) that we can get for any given LCG.
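The counts are easy to verify:

```python
import math

# Sanity check on the counting argument: a fixed LCG with 16 possible
# seeds can reach at most 16 of the C(16, 8) possible 8-element subsets.
print(math.comb(16, 8))  # 12870
```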
By randomly adding skips, you can definitely get to all possible combinations even with only 16 seeds (since the skips can be seeded from a much larger space), although there could be a speed implication.
Good point. You would get different sequences by choosing different constants for the generator though, but you couldn't cover all of them by a longshot.
It's not quite the same problem though; the algorithm described returns samples in sequential order. They mention performance reasons for this (C-f "tape sampling"). It also has the limitation that you need to know the sample size in advance.
This only matters in calculations where the quality of the randomness is important, e.g. the Monte Carlo method.
Nevertheless, if you don't use the least significant bits, and if the constants are carefully chosen, MLCG passes most of the hardest statistical tests. For example it passes all DIEHARD tests, and most of TESTU01.
In addition, the described algorithm "is on-line - that is, it requires no preprocessing and can generate each element of the sample in constant expected time." (Quoting from the original paper.) It needs only an upper bound on the number items, while your suggestion requires knowing the number beforehand.
You also need an MLCG over the range 0..2^64-1 to match the given algorithm. I believe that requires some extra code to handle potential overflow in an MLCG, so it adds a few more lines to your estimate.
An SPN will generate a unique random permutation, from which you can draw your random list of unique items. With a little post-processing, you can generate random permutations of arbitrary size.
I can't believe this article doesn't explain how it works. Especially after stating "after reaching Vitter’s papers it again takes a concentrated effort to figure out what you need".
I read very often (by people who don't even use it) that Perl is a write-only language. Then I see code like this roll along that demonstrates perfectly how the readability of any given piece of code is 99% on the coder, not the language.
I think the other replies to you are mistaken. This algorithm meets the requirements of running in linear time, once you apply the inverting trick when the sample size exceeds half of the pool size. See my top-level comment.
In certain circumstances, like when you're drawing from a limited pool, this becomes O(N log N) in expectation and unbounded in the worst case. Imagine you're trying to shuffle a deck of cards. Select a card number, 1-52. On the first draw, you're guaranteed to get a unique card; on the second, you've got a 1/52 chance of having to generate another number; and by the final card you have only a 1/52 chance per draw of hitting the one number you still need, so you'll spin for quite a bit waiting for it to come up, with a possibility (if your RNG is bad) that you'll never get there. This gets amplified with a larger set: if you draw from six decks (312 cards), you'll go through an average of 312 'draws' before you get your final card, and half that many for the card before it.
The nicer way to do this is to have two lists. Put all of your cards into the first list. Choose a number between 1 and the length of list 1, then remove that element from the list and add it to the end of list 2. Repeat until list 1 is empty. It's linear time if each removal is O(1) (with an array, swap the chosen element to the end and pop it; indexing into a linked list would make it quadratic), and there's no chance of duplicates.
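A sketch of the two-list deal in Python, using the swap-to-the-end trick so each removal is O(1):

```python
import random

def two_list_shuffle(cards, rng=random):
    """Move cards one at a time from a pool (list 1) to the result
    (list 2), choosing uniformly each time; equivalent to a
    Fisher-Yates shuffle, with no chance of duplicates."""
    pool = list(cards)   # list 1
    result = []          # list 2
    while pool:
        j = rng.randrange(len(pool))
        pool[j], pool[-1] = pool[-1], pool[j]  # O(1) removal trick
        result.append(pool.pop())
    return result
```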
But this isn't the scenario presented in the article - it's an algorithm for a very specialized version of a shuffle, where you need a random sample from an unbounded sequential list (meaning you don't know its length ahead of time). The algorithm presented is for getting a good random sample from a dataset well beyond the size of memory, not for doing a perfect shuffle on a small list.
"The nicer way to do this is to have two lists. Put all of your cards into the first list. Choose a number between 1 and the length of list 1, then remove that element from the list and add it to the end of list 2. Repeat until list 1 is empty. It's linear time if each removal is O(1), and there's no chance of duplicates."
I have to add my $0.02. This is correct: your algorithm will give the right result with no chance of duplicates. However, in some contexts this absolutely must not be done. I've worked at a company that made online gambling games, and the auditing requirement was a 1-to-1 correspondence between the number drawn from the RNG and the card the player sees. For example, the Ace of Spades must always be 49, whether it was drawn as the first or the last card from the deck.
I had the same thought. However thinking about it more, the requirement is that this algorithm is O(k). As k gets closer and closer to the size of the list, you will spend more and more time looking for unique items. You also need to pay the costs of checking for uniqueness which also grows with k.
Well yeah, that is exactly why I said it's for the situation where the length of the list is much smaller than the total number of tweets, which should cover most practical usages.
It's not O(k) then. As he mentioned, he's writing this as a language feature, so he can't make simplifying assumptions about k or n. It must handle large n, and can't assume k << n.
This ran in a few milliseconds on my old MacBook in Python. It iterates over a 1-billion-element range. Looking at the source, for large populations they seem to track previous selections in a set, for which "x in y" lookup is O(1) on average.
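Presumably something along these lines (a sketch; `range` is lazy, so no billion-element list is ever materialized):

```python
import random

# CPython's random.sample tracks prior picks in a set when the sample
# is small relative to the population, so this is fast and uses O(k)
# extra space even for a population of a billion.
picks = random.sample(range(10**9), 10)
```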
Actually, due to seeding, no matter how many times you run this you will never generate all combinations of tweet orders. Of course that doesn't really matter since for large values of N there isn't enough time in the universe to do so.
Could someone explain how this works? I feel like unique shuffling given a random seed is very useful algorithm, but I can't follow this code because of the variable naming and stuff.
The general idea is to draw n uniform samples from [1..N] and return them in sorted order. You can generate these incrementally, in order (no sort needed). The mean difference between successive samples is ~N/n. If the samples are uniformly distributed, their differences should have roughly an exponential [0] distribution (if N, n are large).
So the inductive step is: given k'th sample a[k], let a[k+1] = a[k] + S, where S is drawn from a discrete exponential distribution with mean 1/λ = N/n. That's the `exp(log(rand()) * stuff)' part of the code.
This almost works; there are more details [1] if you want it to actually work, have accurate statistics, not overflow N, etc.
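For comparison, the exact O(N) baseline that the exponential-gap trick approximates is Knuth's selection sampling (Algorithm S): visit every index and keep it with probability (still needed)/(still remaining). Vitter's contribution is generating the gaps between kept indices directly, so the loop runs n times instead of N. A sketch of the baseline:

```python
import random

def selection_sample(N, n, rng=random):
    """Knuth's Algorithm S: return n indices from range(N), uniform
    over all C(N, n) subsets, already in increasing order. O(N) time."""
    sample = []
    needed = n
    for i in range(N):
        # Keep index i with probability needed / (N - i).
        if rng.random() * (N - i) < needed:
            sample.append(i)
            needed -= 1
            if needed == 0:
                break
    return sample
```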
Huh, is it really true that Reservoir Sampling is not well-known? It was part of the standard curriculum in my top-30-but-not-top-10 CS school, the University of Utah. I think that even a significant subset of my cohort could explain the correspondence between this and the Fisher-Yates shuffling algorithm.
I don't even have a CS degree and I know about it. I wrote some PowerShell (don't ask) to sample a large collection of documents from file shares looking for embedded links to other documents. We needed to break up the existing file share into smaller pieces and wanted to estimate how many documents would have to be touched due to broken links. (The surprising answer was "very few", since most links were relative to the containing directory).
The key to why the obvious algorithm of "half a Fisher-Yates shuffle" isn't enough is that it uses O(k) memory, whereas Vitter's algorithm is constant.
> Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory.
So if N is really large and the list doesn't fit into main memory, how does one iterate over such a large list in a reasonable amount of time? When I was taught algorithms, my sample inputs were relatively small, from 10,000 elements to 1M integers. Big deal. But I am not sure how to think when N is really huge, or whether we'd ever really have to deal with 10B integers in one pass. In that case it's more likely I am paging through a database cursor... someone please shine a light.
You are selecting n of N items, where N is unknown, in one pass, where every item is treated as unique, and where n is your sample size.
First, fill an array of n candidates, which we'll call C, with the first n items of N.
Then, for each subsequent item k in N until you reach the end, you know the count t of items seen so far, so select a random integer j representing a location in an array of size t. If j is 0 to n - 1 (assuming a zero-indexed array), replace C[j] with the current item k.
This means that each item will have an equal chance of being in the final sample when you reach the end of N. If you care about the order of the random sample, you can always shuffle C at the end.
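That scheme is reservoir sampling (Vitter's Algorithm R); a minimal Python sketch:

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Uniform sample of n items from a stream of unknown length,
    in one pass and O(n) memory (Algorithm R)."""
    reservoir = []
    for t, item in enumerate(stream):
        if t < n:
            reservoir.append(item)    # fill the candidate array C
        else:
            j = rng.randrange(t + 1)  # uniform over positions seen so far
            if j < n:
                reservoir[j] = item   # replace a random candidate
    return reservoir
```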
GNU coreutils' shuf since v8.22 (Dec 2013) implements this to minimize memory usage when sampling large files or unknown numbers of items read from a pipe.
For example, a command along the lines of "yes | shuf -n 1" never finishes, but consumes a constant (small) amount of memory.
Say you need to draw 100 billion samples without replacement from a dataset with a trillion-plus rows; the resource requirements for a hashmap would become enormous.
The C code in this article is a mirror of the code from Appendix 2 in Vitter's paper, which I guess explains/excuses the abbreviated variable names. The paper says things like, "use an exponentially distributed random variate Y," and uses variable names like n, N, U, S, X, y1, and y2 in the appendix.
Nonetheless, I find this coding style unreadable. "Y" is not a very good name for an exponentially distributed random variate, especially in a function that defines 24 variables with mostly meaningless names. :-(
Sometimes two statements per line, sometimes one. Sometimes spaces before and after equals, sometimes none. Sometimes multiple ifs, breaks and statements on one line.
It looks like it was copy-pasted from a pdf. Maybe it was.
> generate a list of random tweets, without duplication.
Considering there are somewhere over 1 trillion tweets (over 200 billion per year), this is a very easy problem: you do not even have to check for duplicates, because the chance of getting a duplicate is so small it would be statistically unlikely to ever happen.
You're making the assumption that the requested list is quite short. It should be straightforward to show that the probability of a duplicate grows with the requested list length.
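This is the birthday problem; the exact collision probability for k independent draws from N equally likely values is easy to compute (function name is mine):

```python
def duplicate_probability(k, N):
    """Exact probability that k independent uniform draws
    from N values contain at least one duplicate."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (N - i) / N
    return 1.0 - p_all_distinct
```

With N around 10^12 tweets the probability stays tiny for short lists, but for a million-item list it is already roughly 1 - exp(-k^2 / 2N) ≈ 0.39.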
This may be a stupid question, but what exactly is wrong with (using the dealing random cards analogy):
for i = 1 to k
    pickedCardIndex = randomInt(n - i)
    hand[i - 1] = deck[pickedCardIndex]
    swap(deck[pickedCardIndex], deck[n - i])
As far as I can see this satisfies all conditions stated by the author (assuming that randomInt(x) produces a uniformly random integer in the range [0, x], there are n cards in the deck array, and the arrays are 0-indexed).
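The loop above in runnable Python (a sketch; names are mine, and `randrange(x + 1)` plays the role of randomInt(x) over [0, x]):

```python
import random

def deal(deck, k, rng=random):
    """Deal k cards without duplicates via a partial Fisher-Yates:
    each picked card is swapped out of the not-yet-dealt prefix.
    O(k) time beyond the initial copy."""
    deck = list(deck)
    n = len(deck)
    hand = []
    for i in range(1, k + 1):
        j = rng.randrange(n - i + 1)              # uniform over [0, n - i]
        hand.append(deck[j])
        deck[j], deck[n - i] = deck[n - i], deck[j]
    return hand
```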
I believe it would be easy to modify it to keep only the k chosen indexes in memory, and for k significantly smaller than n, with n on the order of 2^64, it would still be much faster than iterating through all n (which is the first algorithm presented here).
However, note that in Vitter's paper all the discussed algorithms are O(k) in the previous notation, or O(n) in the paper's notation, where n elements are selected from a set of N.
In short, it's worth reading Vitter's paper; the original post just contains the C code, while the discussion of the algorithm and the comparison with the alternatives is in the paper.
(Edit: corrected the notation of O() as robrenaud suggested)
There is a difference between them. Little oh roughly means "grows strictly slower than, in the limit", where as big oh roughly means "grows no faster than, in the limit".
Again, I am confused. Why doesn't this snippet solve the problem?
void increasingRandomSequence(int *arrayptr, int base, int k, int n)
{
    if (k == 0) return;
    int i = randInt(n - k);  /* uniform in [0, n - k] */
    *arrayptr = base + i;
    increasingRandomSequence(arrayptr + 1, base + i + 1, k - 1, n - (i + 1));
}
increasingRandomSequence(hand, 0, k, n) fills the array hand with a sequence which is picked with uniform distribution over all increasing sequences of length k with numbers in [0, n - 1].
This is O(k), and we can shuffle this in O(k). Why doesn't this solve the problem?
It looks like this dramatically undersamples the beginning. Like if I ask for 1000 numbers out of 10,000, the very first number you provide will be (on average) nearly halfway through the list.
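A quick numerical check of that (a Python sketch, assuming randInt is uniform over [0, n - k]):

```python
import random

# The recursive scheme's first element is uniform over [0, n - k], so
# its mean is (n - k) / 2. A truly uniform 1000-of-10000 sample has an
# expected minimum of (n - k) / (k + 1), i.e. about 9.
random.seed(0)
k, n = 1000, 10_000
trials = 20_000
mean_first = sum(random.randrange(n - k + 1) for _ in range(trials)) / trials
# mean_first comes out near 4500 - nearly halfway through the list.
```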
[1] https://en.wikipedia.org/wiki/Coupon_collector%27s_problem