It is shocking, but what's more shocking is that doing it correctly is kind of rocket science. MySQL simply isn't built for picking a truly random row performantly (I'm not sure if any of the common relational databases are?).
If you do something naive like "ORDER BY RAND() LIMIT 1" performance is worse than abysmal. Similarly, doing a "LIMIT 0 OFFSET RAND() * row_count" is comparably abysmal. While if you do something performant like "WHERE id >= RAND() * max_id ORDER BY id LIMIT 1", you encounter the same problem where the id gaps in deleted articles make certain articles more likely to be chosen.
There are only two "correct" solutions. The first is to pick id's at random and then retry when an id isn't a valid article, but the worry here is with how sparse the id's might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal. While the second is to maintain a separate column/table that includes all valid articles in a sequential integer sequence, and then pick a random integer guaranteed to be valid. But the problem with performance is now that every time you delete an article, you have to recalculate and rewrite, on average, half the entire column of integers.
So for something that's really just a toy feature, the way Wikipedia implements it is kind of "good enough". And downsides could be mitigated by recalculating each article's random number every so often, e.g. as a daily or weekly or slowly-but-constantly-crawling batch job, or every time a page gets edited.
You could also dramatically improve it (but without making it quite perfect) by using the process as it exists, but rather than picking the article with the closest random number, select the ~50 immediately lower and higher than it, and then pick randomly from those ~100. It's a complete and total hack, but in practice it'll still be extremely performant.
If it were up to me, I'd implement something sort of like an online[0] Fisher-Yates shuffle:
Add a column to the article database for "random_index". This gives each article a unique integer index within a consecutive range of 1 to N. To pick a random article, just pick a random number in that range using your favorite random number generator and then look up the row with that random_index value.
The tricky part, obviously, is maintaining the invariant that random index values are (1) randomly assigned and (2) contiguous. To maintain those, I believe it's enough to:
1. Whenever a new article is created, give it random_index N+1. Then choose a random number in the range (1, N+1). Swap the random_index of the new article and the article whose random_index is the number you just rolled. This is the Fisher-Yates step that gives you property (1).
2. Whenever an article is deleted or otherwise removed from the set of articles that can be randomly selected, look up the article with random_index N. Set that article's random_index to the random_index of the article about to be deleted. This gives you property (2).
Oh wow, yup that will do it. That was my second suggestion for what is "correct" but I never heard of your #2 suggestion, so I thought there would still be performance issues because you'd have to shift all the values above. I'm kind of annoyed now I didn't know about this before... :)
Yes, this should definitely considered to be the correct, definitive, performant solution then. Thank you!
I think you'll need a full table lock in the case of a race condition trying to do a remove and add at the same time, but that's probably not usually going to be an issue?
Actually, now that I think about it, step 1 isn't necessary. You don't actually need to shuffle the indexes since you always choose a random one anyway. It should be enough to just do the swap on deletes in order to keep the integers contiguous.
> I think you'll need a full table lock in case you're ever trying to do a remove and add at the same time
Yeah, I think you need some sort of locking to ensure N can't be concurrently modified but I'm not a DB person, so I don't know how to go about that.
I believe that for the operation of randomly sampling a single row, only the consecutiveness invariant is needed. The random permutation invariant is unnecessary, because the choice from the range (1, N+1) is random anyway.
However, I can imagine the random permutation invariant being useful for other operations (although I can't immediately think of one).
EDIT: the random permutation invariant could maybe be useful for a list that can be displayed to the user. I can imagine some cases where it is nice to have an order that doesn't change a lot while you don't want it to depend on any property in particular (like the order in which the rows were added).
What we actually have though are thousands (potentially millions) of well-fed human brains which are volunteering their downtime in a particular moment of their life to receive factual information of any kind (of the kind they expect to see on Wikipedia) in the hopes of getting something good. A system that chose the best possible article to present to them in that moment to bend the course of their life in a good direction or satisfy some need (or attempt/pretend to satisfy, as much as we can by shooting photons at their retinas) would be a better engineered system than one that randomly selects from its entire catalog of options correctly and efficiently.
I think that’s got a similar bias as the original, unfortunately. All deleted indices between K and N (K<N) will result in picking N, and the number of such deletions would vary wildly between individual articles.
> but the worry here is with how sparse the id's might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal.
Is this really a big deal? Surely for a SQL database, looking up the existence of a few IDs, just about the simplest possible set operation, is an operation measured in a fraction of a millisecond. Wikipedia is not that sparse, and a lookup of a few IDs in a call would easily bound it to a worst case of <10 calls and still just a millisecond or two.
It's not a big deal because there is no need to retry 20 times. The probability of getting a deleted article several times in a row is very low, so you could limit to e.g. 5 tries, and if all are deleted fall back to the next non-deleted article (or even a hard-coded article). The bias would be negligible assuming the proportion of deleted articles is low; to guarantee that it is low, one can periodically renumber to eliminated deleted articles (this can be done efficiently using the trick suggested by @munificent; but the naive O(n) approach would probably be good enough).
Oracle has `SAMPLE(N)` which can come after a table name to get N records from the table. I believe it's built specifically to be performant for this kind of thing. That said, it's the only case I've heard of and only because of an aside in a book a while ago.
Why does the randomization have to happen in the database query? Assuming there aren't any large gaps in the distribution of IDs, and if you know the max_id, couldn't you pick a random number between min_id and max_id and get the record whose ID matches that random number?
Precisely because of the gaps. Tons of Wikipedia article ID's aren't valid for random selection because they've been deleted, because they're a disambiguation page, because they're a redirect, or they're a talk page or user page or whatever else.
My comment covered your suggestion already -- that's why I wrote "you encounter the same problem where the id gaps in deleted articles make certain articles more likely to be chosen".
That requires setting up an entirely different service and somehow keeping it in perfect sync with the database, and along with all of the memory it requires.
And you've still got to decide how you're going to pick random ID's from an array of tens of millions of elements that are constantly having elements deleted from the middle. Once you've figured out how to do that efficiently, you might as well skip all the trouble and just use that algorithm on the database itself.
>And you've still got to decide how you're going to pick random ID's from an array of tens of millions of elements that are constantly having elements deleted from the middle. Once you've
If you do something naive like "ORDER BY RAND() LIMIT 1" performance is worse than abysmal. Similarly, doing a "LIMIT 0 OFFSET RAND() * row_count" is comparably abysmal. While if you do something performant like "WHERE id >= RAND() * max_id ORDER BY id LIMIT 1", you encounter the same problem where the id gaps in deleted articles make certain articles more likely to be chosen.
There are only two "correct" solutions. The first is to pick id's at random and then retry when an id isn't a valid article, but the worry here is with how sparse the id's might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal. While the second is to maintain a separate column/table that includes all valid articles in a sequential integer sequence, and then pick a random integer guaranteed to be valid. But the problem with performance is now that every time you delete an article, you have to recalculate and rewrite, on average, half the entire column of integers.
So for something that's really just a toy feature, the way Wikipedia implements it is kind of "good enough". And downsides could be mitigated by recalculating each article's random number every so often, e.g. as a daily or weekly or slowly-but-constantly-crawling batch job, or every time a page gets edited.
You could also dramatically improve it (but without making it quite perfect) by using the process as it exists, but rather than picking the article with the closest random number, select the ~50 immediately lower and higher than it, and then pick randomly from those ~100. It's a complete and total hack, but in practice it'll still be extremely performant.