Hacker News

What about using histograms? A histogram is N bins, where N is the number of values an integer can assume, and each bin stores a count of the number of times that value is seen. Assume an integer is 32 bits: 2^32 ~= 4 billion bins. To store counts of up to a trillion we need a data type that goes at least that high; a 64-bit uint works. So 2^32 bins * 2^3 bytes per count = 2^35 bytes, or ~32GB. My hard drive is bigger than that, so we can break the bins down into a bunch of files, maybe even one file per bin if the OS lets us. After we've stored all the numbers in our bins, we just iterate from the first bin, adding up counts until we hit half a trillion. The bin that lands in holds our median.
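A minimal single-machine sketch of the idea, with a `Counter` standing in for the 2^32 on-disk bins (the function name and `offset` parameter are my own):

```python
from collections import Counter

def histogram_median(values, offset=0):
    """Median via counting bins: one bin per distinct integer value.

    Assumes unsigned values; `offset` would shift signed inputs into
    bin space. A real run over a trillion numbers would back the 2^32
    counters with files on disk, as described above.
    """
    counts = Counter(values)        # bin -> occurrence count (fits in a 64-bit uint)
    half = (len(values) + 1) // 2   # rank of the median element
    seen = 0
    for value in sorted(counts):    # walk bins from the lowest value upward
        seen += counts[value]
        if seen >= half:            # this bin contains the median
            return value + offset

print(histogram_median([5, 1, 9, 1, 7]))  # -> 5
```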

If we have more than 1 computer, we could map the range across all the computers (so with 2 computers, the first would take the first 2^31 bins, the second would only care about the remaining 2^31 bins, etc.). Then you iterate through the computers in order, passing along the running count, stopping when you hit half a trillion.
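The scan across range-partitioned machines can be sketched like this, with each "machine" simulated as a dict of bin-to-count (names are hypothetical):

```python
def distributed_median(machines, total):
    """Walk machines ordered by the value range they own, carrying the
    running count across machine boundaries, and stop in the bin that
    holds the (total/2)-th element."""
    half = (total + 1) // 2
    seen = 0
    for machine in machines:            # machines sorted by range: low bins first
        for value in sorted(machine):   # within a machine, scan bins upward
            seen += machine[value]
            if seen >= half:
                return value

# two "computers": one owning low values, one owning values >= 2^31
shards = [{0: 3, 1: 2}, {2**31: 1}]
print(distributed_median(shards, total=6))  # -> 0
```

Only the single running count crosses the network, not the bins themselves.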




Note that the range of integer values to consider is not specified. We don't know whether they are signed, or 32- or 64-bit values.

The histogram is indeed the best algorithm and the one I would use. I call it the hash select. Note that it may be applied recursively, by narrowing the bins. Quick select is then in fact the hash select with only two bins.

Note also that since histogram filling is distributed, one needs to add up the histograms to find the one containing the median value. This is a good reason not to use histograms with 2^32 or 2^64 bins.

A 1024-bin histogram would find the median of 32-bit integers in at most 4 iterations, and of 64-bit integers in at most 7.

The algorithm to identify the bin containing the median value is the same at every level: add up the counts of all lower bins until the bin containing the n/2-th value is found. That bin contains the median.
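The recursive narrowing can be sketched as an iterative "hash select" with a fixed number of coarse bins (the function name follows the comment above; the details are my own):

```python
def hash_select(values, k, lo=0, hi=2**32, n_bins=1024):
    """Find the k-th smallest value (0-based rank) by repeatedly
    histogramming the current [lo, hi) range into n_bins coarse bins
    and descending into the bin that contains rank k. With 1024 bins
    and 32-bit values this takes at most ceil(32/10) = 4 passes."""
    while hi - lo > 1:
        width = (hi - lo + n_bins - 1) // n_bins
        counts = [0] * n_bins
        for v in values:                # one pass over the data per level
            if lo <= v < hi:
                counts[(v - lo) // width] += 1
        for b, c in enumerate(counts):
            if k < c:                   # rank k falls inside bin b: narrow to it
                lo, hi = lo + b * width, min(lo + (b + 1) * width, hi)
                break
            k -= c                      # otherwise skip this bin's values
    return lo

print(hash_select([10, 3, 7, 100, 50], k=2))  # median -> 10
```

With n_bins=2 this degenerates into a quickselect-style binary narrowing, as noted above.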

Instead of adding up all the histograms, which may be done pairwise in O(log N) steps, one could do this bin by bin, progressively, until the bin containing the median value is found. I guess there is a nice optimal distributed algorithm for this out there. But we get the general picture.

Another optimization would be to adjust the histogram boundaries to fit the largest and smallest values in the data set.


I believe the original problem definition called for the data to be randomly spread over 1000 computers. Bringing together 32GB of data from 1000 nodes is going to stress your network, and pre-emptively sharding would be worse.

I think the best way to use histograms is 4 passes with 256 buckets (and bit-twiddling micro-optimizations), but other values are possible.
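The 4-pass, 256-bucket variant amounts to fixing one byte of the answer per pass, from the most significant byte down. A sketch under that reading (the bit-twiddling details are my own, not the commenter's):

```python
def radix_select(values, k):
    """k-th smallest (0-based) unsigned 32-bit value in exactly 4 passes
    of 256 buckets, bucketing by one byte per pass, high byte first."""
    prefix, mask = 0, 0
    for shift in (24, 16, 8, 0):
        counts = [0] * 256
        for v in values:
            if v & mask == prefix:          # only values matching the bytes fixed so far
                counts[(v >> shift) & 0xFF] += 1
        for b, c in enumerate(counts):
            if k < c:                       # rank k lands in bucket b: fix this byte
                prefix |= b << shift
                break
            k -= c
        mask |= 0xFF << shift               # next pass compares one more byte
    return prefix

print(radix_select([10, 3, 7, 100, 50], k=2))  # median -> 10
```

Bucketing by byte replaces the division in the general hash select with a shift and mask, which is the micro-optimization mentioned above.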


The author pointed this out; see "optimizing even more" under http://matpalm.com/median/erlang_multi.html




