If you know the size of the set, then you can generate a random set of (non-duplicate) indices, and use those to fetch data.
Offhand I'm not sure how to do that in SQL, since data isn't usually fetched by index (unless the row has an explicit index key). Worst-case you could iterate over the rows just as you do with reservoir sampling, and it would still be faster (since you aren't generating tons of random numbers and doing a lot of array manipulation). Or you could do one query fetching just the primary key, record the keys of the indices you're interested in, and then do a second query to fetch that set. Or maybe SQL does have some way to fetch by index after all.
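Roughly, with a hypothetical query helper and Row type (and k as the number of rows wanted), the two-query version might look like:

let keys: [Int] = query("SELECT id FROM table")                      // keys only, cheap
let chosen = (0..<keys.count).shuffled().prefix(k).map { keys[$0] }  // k distinct picks
let marks = chosen.map { _ in "?" }.joined(separator: ", ")          // "?, ?, ?"
let rows: [Row] = query("SELECT * FROM table WHERE id IN (\(marks))", chosen)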
Yeah - unfortunately there isn't a great way to select n random rows in standard SQL.
If you have an integral primary key, you could select its max and generate random numbers up to that. The keys aren't guaranteed to be contiguous, so you'd need to check for nonexistent primary keys in addition to checking for duplicates.
If there's no integral primary key, then there's no standard SQL solution faster than O(n).
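A rough sketch of the max-plus-random approach, again with a hypothetical query helper and Row type — the retry loop is what handles the gaps:

let maxID: Int = query("SELECT MAX(id) FROM table")
var seen = Set<Int>()                       // guards against duplicate draws
var rows: [Row] = []
while rows.count < k {
    let candidate = Int.random(in: 1...maxID)
    guard seen.insert(candidate).inserted else { continue }   // already tried
    if let row: Row = query("SELECT * FROM table WHERE id = ?", candidate) {
        rows.append(row)                    // a gap in the keys just misses and retries
    }
}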
If N is sufficiently larger than k, it might actually be faster to do something like
let count: Int = query("SELECT COUNT(*) FROM table")   // total row count N
let indices = computeIndices(count, k)                 // k distinct random offsets
var results: [Row] = []
for index in indices {
    results.append(query("SELECT * FROM table LIMIT 1 OFFSET ?", index))
}
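Here computeIndices just needs to produce k distinct random offsets; one possible sketch (fine as long as k is much smaller than count):

func computeIndices(_ count: Int, _ k: Int) -> [Int] {
    var picked = Set<Int>()                 // distinct offsets in 0..<count
    while picked.count < min(k, count) {
        picked.insert(Int.random(in: 0..<count))
    }
    return Array(picked)
}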
Yeah, you're doing a bunch of queries, but you avoid copying all of the data from all of the rows.
Depending on the size of k compared to N, and how much data is in each row, it's still probably better than iterating over the entire contents of the table.
Also, do you have any citation for OFFSET j being O(j)? I can believe that's the case, but it also seems fairly obvious to optimize OFFSET for the case of a non-filtered unordered SELECT (or one ordered in a way that matches an index).
Assuming that the DB is backed by some kind of tree, it would need to maintain extra metadata in the branches (per-subtree row counts) in order to achieve even O(log n) offsetting. For that reason I could understand it being a linear operation.
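To make that concrete, here's a toy sketch (not SQLite's actual layout) of the branch metadata that O(log n) offsetting would require — each branch caching how many rows live under each child:

typealias Row = [String: String]            // stand-in row type

indirect enum Node {
    case branch(children: [Node], rowCounts: [Int])   // rowCounts[i] = rows under children[i]
    case leaf(rows: [Row])
}

// Walk to the row at `offset`, skipping whole subtrees via the cached counts.
func row(at offset: Int, in node: Node) -> Row {
    switch node {
    case .leaf(let rows):
        return rows[offset]
    case .branch(let children, let rowCounts):
        var remaining = offset
        for (child, count) in zip(children, rowCounts) {
            if remaining < count { return row(at: remaining, in: child) }
            remaining -= count
        }
        fatalError("offset out of range")
    }
}

Without those cached counts, the only way to resolve an offset is to walk rows one at a time, which is exactly the O(j) behavior in question.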
My first solution would have been getting a count (maybe a cached count), doing a random draw between 0 and that count, and fetching by index?
Or if you want more than one, just get a slice? Probably not the best, but that's off the top of my head.
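The slice version would just be a single LIMIT/OFFSET query — a sketch, assuming count was already fetched and count >= k:

let start = Int.random(in: 0...(count - k))               // random slice start
let slice: [Row] = query("SELECT * FROM table LIMIT ? OFFSET ?", k, start)

The catch is that the k rows come back contiguous, not independently sampled, which is presumably the "probably not the best" part.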