Hacker News

What about using histograms? A histogram is N bins, where N is the number of values an integer can assume, and each bin stores a count of the number of times that value is seen. Assume an integer is 32 bits: 2^32 ~= 4 billion bins. To store counts of up to a trillion we need a data type that goes at least that high; a 64-bit uint works. So 2^32 bins * 2^3 bytes per count = 2^35 bytes, or ~32GB. My hard drive is bigger than that, so we can break the bins down into a bunch of files, maybe even one file per bin if the OS lets us. After we've stored all the numbers in our bins, we just iterate from the first bin, adding up counts until we hit half a trillion. The bin that lands in holds our median.
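A minimal single-machine sketch of the idea, with a `Counter` standing in for the 2^32 on-disk bins (the function name and `offset` parameter are my own):

```python
from collections import Counter

def histogram_median(values, offset=0):
    """Median via counting bins: one bin per distinct integer value.

    Assumes unsigned values; `offset` would shift signed inputs into
    bin space. A real run over a trillion numbers would back the 2^32
    counters with files on disk, as described above.
    """
    counts = Counter(values)        # bin -> occurrence count (fits in a 64-bit uint)
    half = (len(values) + 1) // 2   # rank of the median element
    seen = 0
    for value in sorted(counts):    # walk bins from the lowest value upward
        seen += counts[value]
        if seen >= half:            # this bin contains the median
            return value + offset

print(histogram_median([5, 1, 9, 1, 7]))  # -> 5
```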

If we have more than 1 computer, we could map the range across all the computers (so with 2 computers, the first would take the first 2^31 bins, the second would only care about the remaining 2^31 bins, etc.). Then you iterate through the computers in order, passing along the running count, stopping when you hit half a trillion.
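The scan across range-partitioned machines can be sketched like this, with each "machine" simulated as a dict of bin-to-count (names are hypothetical):

```python
def distributed_median(machines, total):
    """Walk machines ordered by the value range they own, carrying the
    running count across machine boundaries, and stop in the bin that
    holds the (total/2)-th element."""
    half = (total + 1) // 2
    seen = 0
    for machine in machines:            # machines sorted by range: low bins first
        for value in sorted(machine):   # within a machine, scan bins upward
            seen += machine[value]
            if seen >= half:
                return value

# two "computers": one owning low values, one owning values >= 2^31
shards = [{0: 3, 1: 2}, {2**31: 1}]
print(distributed_median(shards, total=6))  # -> 0
```

Only the single running count crosses the network, not the bins themselves.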




Note that the range of integer values to consider is not specified. We don't know whether they are signed, or 32- or 64-bit values.

The histogram is indeed the best algorithm and the one I would use. I call it the hash select. Note that it may be applied recursively, by narrowing the bins. Quick select is then in fact the hash select with only two bins.

Note also that since histogram filling is distributed, one needs to add up the histograms to find the one containing the median value. This is a good reason not to use histograms with 2^32 or 2^64 bins.

A 1024-bin histogram would find the median of 32-bit integers in at most 4 iterations, and of 64-bit integers in at most 7.

The algorithm to identify the bin containing the median value is the same at every level: add up the counts of all lower bins until the bin containing the n/2-th value is found. That bin contains the median.
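The recursive narrowing can be sketched as an iterative "hash select" with a fixed number of coarse bins (the function name follows the comment above; the details are my own):

```python
def hash_select(values, k, lo=0, hi=2**32, n_bins=1024):
    """Find the k-th smallest value (0-based rank) by repeatedly
    histogramming the current [lo, hi) range into n_bins coarse bins
    and descending into the bin that contains rank k. With 1024 bins
    and 32-bit values this takes at most ceil(32/10) = 4 passes."""
    while hi - lo > 1:
        width = (hi - lo + n_bins - 1) // n_bins
        counts = [0] * n_bins
        for v in values:                # one pass over the data per level
            if lo <= v < hi:
                counts[(v - lo) // width] += 1
        for b, c in enumerate(counts):
            if k < c:                   # rank k falls inside bin b: narrow to it
                lo, hi = lo + b * width, min(lo + (b + 1) * width, hi)
                break
            k -= c                      # otherwise skip this bin's values
    return lo

print(hash_select([10, 3, 7, 100, 50], k=2))  # median -> 10
```

With n_bins=2 this degenerates into a quickselect-style binary narrowing, as noted above.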

Instead of adding up all the histograms, which may be done pairwise in O(log N) steps, one could do this bin by bin, progressively, until the bin containing the median value is found. I guess there is a nice optimal distributed algorithm for this out there. But we get the general picture.

Another optimization would be to adjust the histogram boundaries to fit the largest and smallest values in the data set.


I believe the original problem definition called for the data to be randomly spread over 1000 computers. Bringing together 32GB of data from 1000 nodes is going to stress your network, and pre-emptively sharding would be worse.

I think the best way to use histograms is 4 passes with 256 buckets (and bit-twiddling micro-optimizations), but other values are possible.
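The 4-pass, 256-bucket variant amounts to fixing one byte of the answer per pass, from the most significant byte down. A sketch under that reading (the bit-twiddling details are my own, not the commenter's):

```python
def radix_select(values, k):
    """k-th smallest (0-based) unsigned 32-bit value in exactly 4 passes
    of 256 buckets, bucketing by one byte per pass, high byte first."""
    prefix, mask = 0, 0
    for shift in (24, 16, 8, 0):
        counts = [0] * 256
        for v in values:
            if v & mask == prefix:          # only values matching the bytes fixed so far
                counts[(v >> shift) & 0xFF] += 1
        for b, c in enumerate(counts):
            if k < c:                       # rank k lands in bucket b: fix this byte
                prefix |= b << shift
                break
            k -= c
        mask |= 0xFF << shift               # next pass compares one more byte
    return prefix

print(radix_select([10, 3, 7, 100, 50], k=2))  # median -> 10
```

Bucketing by byte replaces the division in the general hash select with a shift and mask, which is the micro-optimization mentioned above.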


The author pointed this out; see "optimizing even more" under http://matpalm.com/median/erlang_multi.html




