What about using histograms? A histogram is N bins, where N is the number of values an integer could assume, and each bin stores a count of the number of times that value is seen. Assume an integer is 32 bits: that's 2^32 ~= 4 billion bins. To store counts of up to a trillion we need a data type that goes up to at least a trillion; a 64-bit uint works. So 2^32 bins * 2^3 bytes per count = 2^35 bytes, or ~32GB. My hard drive is bigger than that, so we could break the bins down into a bunch of files, maybe even 1 file per bin if the OS lets us. After we've tallied all the numbers into our bins, we just iterate from the first bin, adding up the counts, until we hit half a trillion; the bin we stop in holds our median.
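A rough single-machine sketch of that in Python (the function name is mine; it assumes unsigned values and keeps the counts in RAM, so for real 32-bit data you'd want the ~32GB array memory-mapped or split into files as described):

    import numpy as np

    def histogram_median(values, bits=32):
        """Exact median via a full histogram: one 64-bit count per possible value.

        For bits=32 the counts array is 2**32 * 8 bytes (~32 GiB), so in practice
        it would live in memory-mapped files; this sketch keeps it in RAM and is
        only comfortably runnable as-is for smaller bit widths.
        """
        counts = np.zeros(2 ** bits, dtype=np.uint64)  # one bin per possible value
        n = 0
        for v in values:                               # single pass over the data
            counts[v] += 1
            n += 1
        target = (n + 1) // 2                          # 1-based rank of the median
        running = 0
        for value, count in enumerate(counts):         # scan bins in value order
            running += int(count)
            if running >= target:
                return value                           # the bin we stop in is the median

For example, histogram_median([5, 3, 9, 3, 7], bits=4) returns 5.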
If we have more than one computer, we could map the range across all of them (so with 2 computers, the first would take the first 2^31 bins and the second would only care about the second 2^31 bins, etc.). Then you could iterate through the computers in range order, passing along the running count, and stop when you hit half a trillion.
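In code, that final walk might look something like this sketch (the (range_start, counts) layout and the names are hypothetical, just to make the idea concrete):

    def median_across_machines(machine_counts, total_n):
        """Walk the machines in value-range order, carrying the running count,
        and stop in the bin where the running count reaches the median rank.

        machine_counts: one (range_start, counts) pair per machine, sorted by
        range_start; counts[i] is how many times the value range_start + i
        was seen on that machine's slice of the range.
        """
        target = (total_n + 1) // 2          # e.g. half a trillion for 10^12 values
        running = 0
        for range_start, counts in machine_counts:
            for offset, count in enumerate(counts):
                running += count
                if running >= target:
                    return range_start + offset

Only the running count crosses machine boundaries, so the hand-off between machines is a single integer.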
Note that the range of integer values to consider is not specified. We don't know whether they are signed, or whether they are 32- or 64-bit values.
The histogram is indeed the best algorithm and the one I would use. I call it the hash select. Note that it may be applied recursively, by narrowing the bins. Quick select is then in fact the hash select with only two bins.
Note also that since histogram filling is distributed, one needs to add up the histograms from all machines before scanning for the bin that contains the median value. This is a good reason for not using histograms with 2^32 or 2^64 bins.
A 1024-bin histogram would allow finding the median of 32-bit integers in at most 4 iterations, and of 64-bit integers in at most 7.
The algorithm to identify the bin containing the median value is then much the same: add up the counts of all lower bins until the bin containing the n/2-th value is found. That bin contains the median value.
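A minimal sketch of that narrowing loop, assuming unsigned values and a 1-based rank k (k = n/2 for the median); the name hash_select follows the comment above, everything else is mine:

    def hash_select(values, k, lo=0, hi=2 ** 32, bins=1024):
        """Find the k-th smallest value (1-based rank) in [lo, hi) by repeatedly
        histogramming into `bins` coarse bins and descending into the bin that
        contains rank k. With 1024 bins, 32-bit data needs at most 4 passes
        (ceil(32/10)); `values` must be re-scannable (a list, a file, ...).
        """
        while hi - lo > 1:
            width = -(-(hi - lo) // bins)          # ceiling division: bin width
            counts = [0] * bins
            for v in values:                       # one full pass per iteration
                if lo <= v < hi:
                    counts[(v - lo) // width] += 1
            running = 0
            for b, count in enumerate(counts):
                if running + count >= k:
                    k -= running                   # rank within the chosen bin
                    lo, hi = lo + b * width, min(lo + (b + 1) * width, hi)
                    break
                running += count
        return lo

For example, hash_select([10, 3, 7, 3, 42], k=3) returns 7, the median.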
Instead of adding up all the histograms up front, which may be done pairwise in O(log N) steps, one could do this bin by bin, progressively, until the bin containing the median value is found. I'd guess there is a nice, optimal distributed algorithm for this out there, but we get the general picture.
Another optimization would be to adjust the histogram boundaries to fit the largest and smallest values in the data set.
I believe the original problem definition called for the data to be randomly spread over 1000 computers. Bringing together 32GB of data from 1000 nodes is going to stress your network, and pre-emptively sharding would be worse.
I think the best way to use histograms is 4 passes with 256 buckets (and bit-twiddling micro-optimizations), but other values are possible.
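One way to read that (a sketch, not necessarily the bit-twiddling the comment has in mind): bucket by the most significant byte first, keep the byte that contains rank k as a growing prefix, and repeat on the next byte, so it's exactly 4 passes with only 256 counters of state per pass, which is also cheap to merge across machines:

    def byte_select(values, k):
        """k-th smallest (1-based rank) of unsigned 32-bit ints in exactly 4
        passes: pass i buckets the values that match the prefix found so far
        by their i-th most significant byte. Only 256 counters per pass."""
        prefix, prefix_bits = 0, 0
        for shift in (24, 16, 8, 0):                       # most significant byte first
            counts = [0] * 256
            for v in values:
                if prefix_bits == 0 or (v >> (shift + 8)) == prefix:
                    counts[(v >> shift) & 0xFF] += 1       # bucket by the current byte
            running = 0
            for byte, count in enumerate(counts):
                if running + count >= k:
                    k -= running                           # rank within the chosen bucket
                    prefix = (prefix << 8) | byte          # lock in another byte of the answer
                    prefix_bits += 8
                    break
                running += count
        return prefix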
I liked the question and after thinking hard about potential algorithms, I came to a different conclusion:
- In general, for data sets of this size you don't want anything worse than a linear algorithm (sub-linear, if you can get it).
- Also, because the median is a statistical measure, its use will be of a predictive nature. (Why would anyone need the exact median of a trillion numbers?)
- Because of the uncertainty involved in prediction and the size of the data set, I think a sampling approach would be appropriate.
- A sample of 1,000 - 10,000 numbers per machine is enough to recognize any reasonable distribution with statistical significance.
- We could take the median of those 1 - 10 million numbers directly (a rough sketch is below), or we could apply curve-fitting techniques to gain a lot more insight.
Of course if someone could provide a reason for needing the exact median I am completely wrong.
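A toy sketch of the sampling idea, assuming each machine's data can be sampled locally and only the samples are shipped; the per_machine default follows the 10,000 figure above, the rest of the names are mine:

    import random

    def approximate_median(machines, per_machine=10_000, seed=0):
        """Approximate the global median by pooling a small uniform sample from
        each machine and taking the median of the pooled sample. `machines` is
        a list of in-memory sequences here; in real life each sample would be
        drawn remotely and only ~10k numbers per machine sent over the network.
        """
        rng = random.Random(seed)
        pooled = []
        for data in machines:
            pooled.extend(rng.sample(data, min(per_machine, len(data))))
        pooled.sort()
        return pooled[len(pooled) // 2]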
That's a good point. Two things come to mind though. One is that since it's an interview question, they are probably just using this to test your knowledge of (distributed) algorithms.
The second is that you might want the exact answer if you are using the statistic to convince someone of something and you don't want to leave any room for doubt.
This is the perfect answer to an interview question, with the proviso that it is nearly impossible to demonstrate over the course of an interview. You can sketch out the general nature of the approach but actually doing something productive with it takes a wee bit longer.
For questions like this, you'd think that between the phone screen and the serious interview you could send someone an email saying "Here's homework: we have a torrent here of a substantial data set. Find the median number, document your approach, and we'll talk about it at the interview."
That all depends on what one is actually looking to find in an applicant. If the goal of the question is to actually determine if the interviewee has the ability to research a problem, think it through, and implement the solution, then the "homework" approach is definitely superior.
However, if the thrust of the question is to see the applicant think on his/her feet and possibly apply some knowledge from a college class on algorithms (I never covered a problem in this depth in my undergrad comp sci education, but it's definitely standard to discuss computing order statistics), presenting it in the interview is best. The idea of the solution as presented here could easily be sketched out in the interview, and a skilled interviewer could lead the applicant through fleshing out some details if that was so desired.
That said, I think the latter is more often the goal of a final round interview.
If you're familiar with the linear time deterministic selection algorithm, this is straightforward to solve. It's usually taught in intro to algorithms classes. See CLRS chapter 9.
Incidentally, the algorithm is from 1973 and four of the authors went on to win the Turing award: ftp://db.stanford.edu/pub/cstr.old/reports/cs/tr/73/349/CS-TR-73-349.pdf
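For reference, a compact sketch of that selection algorithm (median-of-medians pivoting, groups of five), roughly as presented in CLRS chapter 9:

    def select(xs, k):
        """Return the k-th smallest element of xs (0-based) in worst-case linear
        time, using the median-of-medians pivot rule (BFPRT 1973, CLRS ch. 9)."""
        xs = list(xs)
        while True:
            if len(xs) <= 5:
                return sorted(xs)[k]
            # Pivot: the median of the medians of groups of five elements.
            medians = [sorted(group)[len(group) // 2]
                       for group in (xs[i:i + 5] for i in range(0, len(xs), 5))]
            pivot = select(medians, len(medians) // 2)
            lows = [x for x in xs if x < pivot]
            pivots = [x for x in xs if x == pivot]
            if k < len(lows):
                xs = lows                              # answer is below the pivot
            elif k < len(lows) + len(pivots):
                return pivot                           # answer is the pivot itself
            else:
                k -= len(lows) + len(pivots)           # answer is above the pivot
                xs = [x for x in xs if x > pivot]

For the (lower) median of a list, call select(data, (len(data) - 1) // 2).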
Yeah, I commented right before bed. I thought of a bunch of problems with the average while trying to get to sleep.
I was thinking since the size of an outlier is constrained by fitting in 4 bytes, there must be a way to show the median is in some band of numbers around the average. I didn't think of a particularly useful algorithm though.
Actually, it would be even worse here. E.g. if all the numbers are positive you will get an overflow pretty quickly, so you'll need a better datatype. So if you are dealing with b-bit numbers, you'd need roughly O(n * (b + log n)) bit operations to calculate the average.
(Usually this doesn't matter, but since n >> b it does).
Edit: Notwithstanding my earlier comment that average doesn't work, merely commenting on the run time of average.
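A quick back-of-the-envelope check of the overflow point, assuming the inputs are signed 32-bit values:

    n = 10 ** 12                     # a trillion numbers
    max_value = 2 ** 31 - 1          # largest positive signed 32-bit integer
    worst_case_sum = n * max_value   # ~2.1 * 10^21
    print(worst_case_sum > 2 ** 63 - 1)   # True: even a 64-bit accumulator can overflow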
This will compute the exact historical average. I like to use 2/t instead of 1/t to get a moving average over a non-stationary distribution, which favors the more recent events.
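The update rule being replied to isn't quoted here, but presumably it's the usual incremental mean, avg += (x - avg) / t; a small sketch under that assumption:

    def running_average(stream, alpha_fn=lambda t: 1.0 / t):
        """Incremental average: avg += alpha * (x - avg). With alpha = 1/t this
        reproduces the exact historical mean; substituting alpha = 2/t (as the
        comment above suggests) weights recent values more heavily, giving a
        moving average that tracks a non-stationary distribution."""
        avg = 0.0
        for t, x in enumerate(stream, start=1):
            avg += alpha_fn(t) * (x - avg)
        return avg

    print(running_average([1, 2, 3, 4]))                     # 2.5, the exact mean
    print(running_average([1, 2, 3, 4], lambda t: 2.0 / t))  # ~3.33, skewed toward later values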