If you know the size of the set, then you can generate a random set of (non-duplicate) indices, and use those to fetch data.
Offhand I'm not sure how to do that in SQL, since data isn't usually fetched by index (unless the row has an explicit index key). Worst-case you could iterate over the rows just as you do with reservoir sampling, and it would still be faster (since you aren't generating tons of random numbers and doing a lot of array manipulation). Or you could do one query fetching just the primary key, record the keys of the indices you're interested in, and then do a second query to fetch that set. Or maybe SQL does have some way to fetch by index after all.
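Roughly, with a hypothetical query helper and Row type (and k as the number of rows wanted), the two-query version might look like:

let keys: [Int] = query("SELECT id FROM table")                      // keys only, cheap
let chosen = (0..<keys.count).shuffled().prefix(k).map { keys[$0] }  // k distinct picks
let marks = chosen.map { _ in "?" }.joined(separator: ", ")          // "?, ?, ?"
let rows: [Row] = query("SELECT * FROM table WHERE id IN (\(marks))", chosen)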
Yeah - unfortunately there isn't a great way to select n random rows in standard SQL.
If you have an integral primary key, you could select its max and generate random numbers up to that. The keys aren't guaranteed to be contiguous, so you'd need to check for nonexistent primary keys in addition to checking for duplicates.
If there's no integral primary key, then there's no standard SQL solution faster than O(n).
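A rough sketch of the max-plus-random approach, again with a hypothetical query helper and Row type — the retry loop is what handles the gaps:

let maxID: Int = query("SELECT MAX(id) FROM table")
var seen = Set<Int>()                       // guards against duplicate draws
var rows: [Row] = []
while rows.count < k {
    let candidate = Int.random(in: 1...maxID)
    guard seen.insert(candidate).inserted else { continue }   // already tried
    if let row: Row = query("SELECT * FROM table WHERE id = ?", candidate) {
        rows.append(row)                    // a gap in the keys just misses and retries
    }
}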
If N is sufficiently larger than k, it might actually be faster to do something like
let count: Int = query("SELECT COUNT(*) FROM table")   // total row count N
let indices = computeIndices(count, k)                 // k distinct random offsets
var results: [Row] = []
for index in indices {
    results.append(query("SELECT * FROM table LIMIT 1 OFFSET ?", index))
}
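Here computeIndices just needs to produce k distinct random offsets; one possible sketch (fine as long as k is much smaller than count):

func computeIndices(_ count: Int, _ k: Int) -> [Int] {
    var picked = Set<Int>()                 // distinct offsets in 0..<count
    while picked.count < min(k, count) {
        picked.insert(Int.random(in: 0..<count))
    }
    return Array(picked)
}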
Yeah, you're doing a bunch of queries, but you avoid copying all of the data from all of the rows.
Depending on the size of k compared to N, and how much data is in each row, it's still probably better than iterating over the entire contents of the table.
Also, do you have any citation for OFFSET j being O(j)? I can believe that's the case, but it also seems fairly obvious to optimize OFFSET for the case of a non-filtered unordered SELECT (or one ordered in a way that matches an index).
Assuming that the DB is backed by some kind of tree, it would need to maintain extra metadata in the branches (per-subtree row counts) in order to achieve even O(log n) offsetting. For that reason I could understand it being a linear operation.
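To make that concrete, here's a toy sketch (not SQLite's actual layout) of the branch metadata that O(log n) offsetting would require — each branch caching how many rows live under each child:

typealias Row = [String: String]            // stand-in row type

indirect enum Node {
    case branch(children: [Node], rowCounts: [Int])   // rowCounts[i] = rows under children[i]
    case leaf(rows: [Row])
}

// Walk to the row at `offset`, skipping whole subtrees via the cached counts.
func row(at offset: Int, in node: Node) -> Row {
    switch node {
    case .leaf(let rows):
        return rows[offset]
    case .branch(let children, let rowCounts):
        var remaining = offset
        for (child, count) in zip(children, rowCounts) {
            if remaining < count { return row(at: remaining, in: child) }
            remaining -= count
        }
        fatalError("offset out of range")
    }
}

Without those cached counts, the only way to resolve an offset is to walk rows one at a time, which is exactly the O(j) behavior in question.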
My first solution would have been getting a count (maybe a cached count), doing a random draw between 0 and that count, and fetching by index?
Or if you want more than one, just get a slice? Probably not the best, but that's off the top of my head.
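The slice version would just be a single LIMIT/OFFSET query — a sketch, assuming count was already fetched and count >= k:

let start = Int.random(in: 0...(count - k))               // random slice start
let slice: [Row] = query("SELECT * FROM table LIMIT ? OFFSET ?", k, start)

The catch is that the k rows come back contiguous, not independently sampled, which is presumably the "probably not the best" part.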