
It is shocking, but what's more shocking is that doing it correctly is kind of rocket science. MySQL simply isn't built for picking a truly random row performantly (I'm not sure if any of the common relational databases are?).

If you do something naive like "ORDER BY RAND() LIMIT 1", performance is worse than abysmal. Similarly, doing a "LIMIT 1 OFFSET RAND() * row_count" is comparably abysmal. And if you do something performant like "WHERE id >= RAND() * max_id ORDER BY id LIMIT 1", you encounter the same problem where the id gaps in deleted articles make certain articles more likely to be chosen.
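
Spelled out against a hypothetical MySQL table articles(id, ...), the three approaches look roughly like this (note the threshold for the last one is computed once up front, since MySQL re-evaluates RAND() in a WHERE clause per row):

    -- Sorts the entire table just to keep one row: abysmal.
    SELECT * FROM articles ORDER BY RAND() LIMIT 1;

    -- Still walks past OFFSET rows; the offset has to be a literal,
    -- picked in application code as FLOOR(RAND() * row_count): abysmal.
    SELECT * FROM articles LIMIT 1 OFFSET 123456;

    -- One index seek, but ids that sit after gaps are more likely to be chosen.
    SET @r = (SELECT FLOOR(RAND() * MAX(id)) FROM articles);
    SELECT * FROM articles WHERE id >= @r ORDER BY id LIMIT 1;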

There are only two "correct" solutions. The first is to pick id's at random and then retry when an id isn't a valid article, but the worry here is with how sparse the id's might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal. The second is to maintain a separate column/table that assigns all valid articles a sequential integer sequence, and then pick a random integer guaranteed to be valid. But now the performance problem is that every time you delete an article, you have to recalculate and rewrite, on average, half the entire column of integers.

So for something that's really just a toy feature, the way Wikipedia implements it is kind of "good enough". And downsides could be mitigated by recalculating each article's random number every so often, e.g. as a daily or weekly or slowly-but-constantly-crawling batch job, or every time a page gets edited.

You could also dramatically improve it (without making it quite perfect) by keeping the process as it exists, but rather than picking the article with the closest random number, selecting the ~50 articles immediately below and the ~50 immediately above it, then picking randomly from those ~100. It's a complete and total hack, but in practice it'll still be extremely performant.
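
A rough MySQL sketch of that hack, assuming the precomputed random values live in an indexed column called page_random here (the name MediaWiki's page table uses):

    SET @r = RAND();
    SELECT * FROM (
        -- ~50 neighbours on each side of the random threshold
        (SELECT * FROM articles WHERE page_random >= @r ORDER BY page_random ASC  LIMIT 50)
        UNION ALL
        (SELECT * FROM articles WHERE page_random <  @r ORDER BY page_random DESC LIMIT 50)
    ) AS neighbours
    ORDER BY RAND()   -- sorting ~100 rows at random is cheap
    LIMIT 1;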




If it were up to me, I'd implement something sort of like an online[0] Fisher-Yates shuffle:

Add a column to the article database for "random_index". This gives each article a unique integer index within a consecutive range of 1 to N. To pick a random article, just pick a random number in that range using your favorite random number generator and then look up the row with that random_index value.
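
A minimal sketch of the lookup, assuming an indexed random_index column and that the application tracks N, the current count of selectable articles (held in @n below):

    SET @pick = FLOOR(1 + RAND() * @n);                  -- uniform integer in [1, N]
    SELECT * FROM articles WHERE random_index = @pick;   -- single index lookup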

The tricky part, obviously, is maintaining the invariant that random index values are (1) randomly assigned and (2) contiguous. To maintain those, I believe it's enough to:

1. Whenever a new article is created, give it random_index N+1. Then choose a random number in the range (1, N+1). Swap the random_index of the new article and the article whose random_index is the number you just rolled. This is the Fisher-Yates step that gives you property (1).

2. Whenever an article is deleted or otherwise removed from the set of articles that can be randomly selected, look up the article with random_index N. Set that article's random_index to the random_index of the article about to be deleted. This gives you property (2).
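
A hedged MySQL sketch of both steps. The table articles(id, title, random_index), the unique index on random_index, and the variables @n (current count) and @victim_id are all placeholders, and everything below is assumed to run inside one transaction so N can't change underneath you:

    -- Step 1: insert the new article at the end, then swap it into a random slot.
    INSERT INTO articles (title, random_index) VALUES ('New article', @n + 1);
    SET @j = FLOOR(1 + RAND() * (@n + 1));                -- uniform in [1, N+1]
    -- Swap random_index values @j and @n + 1, using 0 as scratch so the unique
    -- index is never violated (assumes indexes start at 1; no-op when @j = @n + 1).
    UPDATE articles SET random_index = 0      WHERE random_index = @j AND @j <> @n + 1;
    UPDATE articles SET random_index = @j     WHERE random_index = @n + 1 AND @j <> @n + 1;
    UPDATE articles SET random_index = @n + 1 WHERE random_index = 0;

    -- Step 2: before deleting, move the article at random_index = N into the hole.
    SET @hole = (SELECT random_index FROM articles WHERE id = @victim_id);
    DELETE FROM articles WHERE id = @victim_id;
    UPDATE articles SET random_index = @hole WHERE random_index = @n;
    -- (If the victim was already at N, the UPDATE matches nothing.)
    -- The application then decrements its stored count N.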

[0]: https://en.wikipedia.org/wiki/Online_algorithm


Oh wow, yup that will do it. That was my second suggestion for what's "correct", but I'd never heard of your step 2 trick, so I thought there would still be performance issues because you'd have to shift all the values above. I'm kind of annoyed now that I didn't know about this before... :)

Yes, this should definitely be considered the correct, definitive, performant solution then. Thank you!

I think you'll need a full table lock to avoid a race condition when a remove and an add happen at the same time, but that's probably not usually going to be an issue?


Actually, now that I think about it, step 1 isn't necessary. You don't actually need to shuffle the indexes since you always choose a random one anyway. It should be enough to just do the swap on deletes in order to keep the integers contiguous.

> I think you'll need a full table lock to avoid a race condition when a remove and an add happen at the same time

Yeah, I think you need some sort of locking to ensure N can't be concurrently modified, but I'm not a DB person, so I don't know how to go about that.


Yup, agreed on step 1. Basic autoincrement is perfectly fine, since you pick a random number at query time.


What an incredibly simple solution for something that was just deemed rocket science.

Also I liked your book.


Thanks to both of you for pointing out some quite good solutions to this problem.


This is the nicest solution I've seen so far.

I believe that for the operation of randomly sampling a single row, only the consecutiveness invariant is needed. The random permutation invariant is unnecessary, because the choice from the range (1, N+1) is random anyway.

However, I can imagine the random permutation invariant being useful for other operations (although I can't immediately think of one).

EDIT: the random permutation invariant could maybe be useful for a list that gets displayed to the user. I can imagine cases where it's nice to have an order that doesn't change much but also doesn't depend on any particular property (like the order in which the rows were added).


What we actually have, though, are thousands (potentially millions) of well-fed human brains volunteering their downtime, in a particular moment of their lives, to receive factual information of the kind they expect to see on Wikipedia, in the hopes of getting something good. A system that chose the best possible article to present to them in that moment, whether to bend the course of their life in a good direction or to satisfy some need (or attempt/pretend to satisfy it, as much as we can by shooting photons at their retinas), would be a better-engineered system than one that merely selects at random from its entire catalog of options, however correctly and efficiently.


Step 2 is exactly how removal from sets can be O(1), whereas for lists it's O(N).


You probably don't even need step 2. If your select with a random index k returns nothing, increment to k+1 and select again. You'll find one eventually.


I think that's got a similar bias to the original, unfortunately. Every draw that lands on a deleted index gets rolled forward to the next surviving article, so articles sitting after long runs of deletions are disproportionately likely to be picked, and the size of those runs varies wildly from article to article.


> but the worry here is with how sparse the id's might be, and the fact it might have to occasionally retry 20 times -- unpredictable performance characteristics like that aren't ideal.

Is this really a big deal? Surely for a SQL database, looking up the existence of a few IDs, just about the simplest possible set operation, is an operation measured in a fraction of a millisecond. Wikipedia is not that sparse, and a lookup of a few IDs in a call would easily bound it to a worst case of <10 calls and still just a millisecond or two.
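
As a sketch of that batched lookup: the application draws, say, ten candidate ids uniformly at random (the numbers below are made up) and the query returns the first candidate that actually exists. Because the candidates are drawn uniformly and kept in their drawn order, each surviving article is equally likely to win:

    SELECT *
    FROM articles
    WHERE id IN (912, 188204, 5571, 42, 77310, 9021, 66402, 123, 4455, 80808)
    -- FIELD() preserves the candidates' drawn order, so taking the first
    -- existing one doesn't bias toward low ids.
    ORDER BY FIELD(id, 912, 188204, 5571, 42, 77310, 9021, 66402, 123, 4455, 80808)
    LIMIT 1;
    -- No row back at all? Draw ten fresh candidates and run it again.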


It's not a big deal because there is no need to retry 20 times. The probability of getting a deleted article several times in a row is very low, so you could limit to e.g. 5 tries, and if all are deleted, fall back to the next non-deleted article (or even a hard-coded article). The bias would be negligible assuming the proportion of deleted articles is low; to guarantee that it is low, one can periodically renumber to eliminate deleted articles (this can be done efficiently using the trick suggested by @munificent; but the naive O(n) approach would probably be good enough).


Oracle has `SAMPLE(N)`, which can come after a table name to sample roughly N percent of the table's rows. I believe it's built specifically to be performant for this kind of thing. That said, it's the only case I've heard of, and only because of an aside in a book a while ago.

https://stackoverflow.com/a/22822473/9939508
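
Roughly like this (Oracle syntax; the 0.01 percent figure and the random tie-break are just illustrative):

    -- Sample roughly 0.01% of the table's rows, then keep one of them at random.
    SELECT *
    FROM articles SAMPLE (0.01)
    ORDER BY DBMS_RANDOM.VALUE
    FETCH FIRST 1 ROW ONLY;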


Why does the randomization have to happen in the database query? Assuming there aren't any large gaps in the distribution of IDs, and if you know the max_id, couldn't you pick a random number between min_id and max_id and get the record whose ID matches that random number?


Precisely because of the gaps. Tons of Wikipedia article ID's aren't valid for random selection because they've been deleted, because they're a disambiguation page, because they're a redirect, or they're a talk page or user page or whatever else.

My comment covered your suggestion already -- that's why I wrote "you encounter the same problem where the id gaps in deleted articles make certain articles more likely to be chosen".


Can't you just query the ID column and grab the entire list of valid IDs, put that into an array, store it, and pick random IDs from that?


That requires setting up an entirely separate service and somehow keeping it in perfect sync with the database, along with all the memory it requires.

And you've still got to decide how you're going to pick random ID's from an array of tens of millions of elements that are constantly having elements deleted from the middle. Once you've figured out how to do that efficiently, you might as well skip all the trouble and just use that algorithm on the database itself.


>And you've still got to decide how you're going to pick random ID's from an array of tens of millions of elements that are constantly having elements deleted from the middle.

How so? You have an array of only valid IDs: [1, 2, 3].


Oh I thought this was a one-time thing, like a research paper. Doing it continuously in real time is much harder.




