Sorting 1 million 8-digit decimal numbers in 1MB of RAM (stackoverflow.com)
144 points by joslin01 on Oct 21, 2012 | 81 comments



This was a Google interview question for a while (I know, I got it as one); I think it has since made it to the banned-question list. The 'trick' was, as answer #2 alluded to, to use bits to store the fact that you had observed the number, and then you could dump them back out in sorted order.

So imagine you have a bit field where 'one bits' indicate you have seen the number and 'zero bits' indicate you haven't. And you compress it with run-length encoding.

Your initial data structure is '99999999:0' (all zeros, haven't seen any numbers), and then let's say you see the number 3,866,344, so your data structure becomes '3866343:0,1:1,96133654:0'. As you can see, the counts always alternate between runs of zero bits and runs of one bits, so you can just assume the odd-position counts represent 0 bits and the even-position counts 1 bits. This becomes (3866343,1,96133654).

Anyway, as you get numbers you will split and coalesce the bit space until you've either seen all numbers (0, 99999999) or you run out of memory because a hacker has sent you only the even numbers from the space.

It's a clever 'math trick' which explores how you think about numbers (are they a thing or a representation?). But I never thought it really gave a good indication of whether or not you would be a good candidate for the company.


So as it turns out, this is wrong. Several folks have mentioned that the original Google interview question was 'sort the numbers' / 'remove duplicates', and the solution was to create a bitmap in a disk file to do a single-pass sort. This question mixes that up a bit, and my particular bitmap solution fails to meet the criterion of staying under a MB of memory.

With all the discussion I decided to code up a quick version in Perl. The code is pretty straightforward: start with the tuple (99999999,0) and on each number do one of three actions:

1) Break the number representing a string of zeros into two segments.

2) Extend the length of an existing region of one bits.

3) Collapse two regions of one bits separated by a single zero bit.

Now the two key reasons it doesn't work are that, one, it takes two 32-bit numbers to identify non-adjacent numbers, and two, for a uniform-distribution random number generator the ratio of a million numbers to a space of 100 million means that most of the numbers won't be adjacent. So each 1 bit will cost 8 bytes to store. A million numbers * 8 bytes is 8 million bytes (7.6MB if you use 2^20 as 1MB). It's fun watching the algorithm because it gets slower and slower (it's doing a linear search of segments to insert 'bits') and memory goes up and up, until you have about 50M uniques, and then it starts getting faster and faster again as it collapses more and more segments.
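
A minimal Python sketch of that split/extend/coalesce bookkeeping (illustrative only, not the original Perl; run lengths are kept as explicit [length, bit] pairs to make the memory growth easy to watch):

    import random

    SPACE = 100_000_000          # the numbers 0 .. 99,999,999

    def insert(runs, x):
        """Flip bit x from 0 to 1 in a run-length encoded bit field."""
        pos = 0
        for i, (length, bit) in enumerate(runs):   # linear scan, as described
            if x < pos + length:
                break
            pos += length
        if bit == 1:
            return                                 # duplicate, ignored here
        left, right = x - pos, pos + length - 1 - x
        piece = []
        if left:  piece.append([left, 0])
        piece.append([1, 1])
        if right: piece.append([right, 0])
        runs[i:i + 1] = piece                      # split the zero run
        j = max(i - 1, 0)                          # merge neighbours with equal bits
        while j < len(runs) - 1 and j <= i + 2:    # (the extend / coalesce cases)
            if runs[j][1] == runs[j + 1][1]:
                runs[j][0] += runs[j + 1][0]
                del runs[j + 1]
            else:
                j += 1

    runs = [[SPACE, 0]]                            # the initial (all zeros) tuple
    for _ in range(10_000):
        insert(runs, random.randrange(SPACE))
    print(len(runs), "runs ->", 4 * len(runs), "bytes at one 32-bit count per run")
    # ~20,000 runs after only 10,000 inserts; at 10^6 inserts that's the ~8MB above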

Storing it on disk as an uncompressed bitmap runs in O(n) time and of course takes 12,500,000 bytes, just about 12MB (to count multiples you need to multiply that by the number of bits you want to reserve per number), but doing it in memory alone requires a better compression algorithm than simple run-length encoding.


> So as it turns out, this is wrong. [...] I decided to code up a quick version

You rock. I wish more people at HN/SO/Google were like you.


It is extremely unlikely that you'll be able to sort a random sequence of 1 million 10-digit numbers like that, or with ANY method for that matter, unless the particular sequence is highly compressible. You won't even be able to store the answer in the average case. You'll need 10^6*lg(10^10) bits to store the raw input, from which you'll be able to save about lg(10^6!) bits because it's sorted. That comes out to more than 1.7 megabytes.

Edit: the exact amount of bits necessary is ceil(log_2((1e10+1e6-1) choose 1e6)), which is ~1.756 megabytes.

Edit edit: see udiv's comment below: that 1e10 should be 1e8 of course.
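
The bound is quick to sanity-check numerically, e.g. (a sketch; lgamma is used instead of an exact big-integer binomial just to keep it instant):

    from math import lgamma, log

    def log2_choose(n, k):
        # log2 of (n choose k) via log-gamma
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

    for space in (10**10, 10**8):
        bits = log2_choose(space + 10**6 - 1, 10**6)
        print(space, "->", round(bits / 8 / 2**20, 3), "MiB")
    # prints ~1.756 MiB for the 1e10 range and ~0.964 MiB for the corrected 1e8 range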


You've missed the second part, which is reading out the sorted list. By storing which numbers you've seen in a virtual bit field, you have 'codified' the seen numbers in the indices of the bit field. Thus to 'print the sorted list' you simply go through the bit field starting at index zero and print any index whose bit is a 'one'. That is your sorted list of numbers.

The reason this was an interview question was because it allowed the interviewer to see if the candidate could see 'past' the requirement of sorting (which pre-loads your mental algorithm cache with all sorts of algorithms) to the somewhat more subtle concept of embedding partial computation into the data structure.

Early in my career at Sun I was tasked with putting mandatory file and record locking into UFS in the kernel; the challenge had always been how to avoid deadlock. Of course one-level deadlock was fairly easy: if process A is holding a lock and waiting on process B, which is holding a lock elsewhere, and you are in the context of process B trying to get this lock, then you can see that you will deadlock A. But 'n'-level deadlock starts getting harder because the possibilities multiply. I discovered that you could create a doubly linked tree structure which had modest memory growth per process, but any process could walk it one way and look for 'itself' on the list to discover whether a deadlock was possible. Here the 'code' was: walk a list and note that you were already on it, so you would deadlock if you got the lock.

The number 'sorting' problem is very much like that, because while numbers have different values, those values are intrinsic, unlike strings, which consist of an arbitrary sequence of tokens. So the goal of the question was to identify people who could see the difference between 'sorting numbers' as a problem and 'general sorting.'


Your solution still doesn't in any way fit into the required space...for a random sequence there will be hardly any coalescing going on, and you'll blow way past 1MB with the method as described.


They're 8-digit decimals, not 10, so it works out to 0.96 MiB.

I think the parent's solution works. The typical separation between numbers is about 2^7: that works out to 8 bits per number -- 7 for the length of the '00...0', plus one length-one '1'. 8 bits * 10^6 is 10^6 bytes, just under 1 MiB. You need a variable-length encoding, so you can fit 2^7 in 7 bits while still allowing larger numbers to be encoded.


There is a fairly intuitive way to reason about the storage requirement. Let's start with the case where you've seen no numbers in the range: you have the tuple (99,999,999, 0). That is two 32-bit numbers, or 8 bytes of storage. Now let's say you've seen all the numbers in the range: you've got (0, 99,999,999), again 8 bytes of storage, two 32-bit numbers. Since we're describing a linear bit field, the worst case is (1,1,1,1,1,1,1,1,1, ... 1,0): every odd number and none of the even numbers. So the memory to encode the virtual bit field will be (assuming 32-bit numbers always) 4 bytes * n, where 'n' is the number of distinct regions in the bit field. With a perfectly uniform random number generator, each number generated has these possible outcomes:

1) It splits a region (it causes a string of zeros to have a one in it) (adds 8 bytes)

2) It coalesces a region (it removes the last remaining zero between two regions) (subtracts 8 bytes)

I'm sure there are other solutions, and as was pointed out it doesn't really deal with duplicates. I suspect a multi-value bit field might cover that.


With 10^6 random numbers between 0 and 10^8 almost all operations will be splits, and you'll end up with 8 bytes * 10^6 = 7.6 megabytes. So indeed the chances of this working for a random sequence are extremely slim. You have to have a huge number of coalesces, which is just extremely unlikely when you choose 10^6 random numbers in a range of 10^8. So for almost all inputs this method doesn't work.

I simulated it with a Python program and you get about 10^4 coalesces. That doesn't even begin to make a dent in 10^6, and storing this will still take ~7 megabytes.
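
Something along these lines reproduces that count (a reconstruction, not the exact program; it classifies each insertion by whether its neighbours have already been seen, since only adjacent insertions avoid adding a new segment):

    import random

    SPACE, N = 100_000_000, 1_000_000
    seen = set()
    splits = extends = coalesces = 0
    for _ in range(N):
        x = random.randrange(SPACE)
        if x in seen:
            continue                          # duplicates ignored for this count
        left, right = (x - 1) in seen, (x + 1) in seen
        if left and right:
            coalesces += 1                    # fills the last zero between two runs
        elif left or right:
            extends += 1                      # attaches to an existing run
        else:
            splits += 1                       # starts a brand-new run
        seen.add(x)
    print(splits, extends, coalesces)         # splits utterly dominate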


With 10^6 random numbers between 0 and 10^8 almost all operations will be splits,

Right.

and you'll end up with 8 bytes x 10^6 = 7.6 megabytes.

Not if you encode cleverly. The successive differences will be on the scale of 10^8/10^6 = 100, which is a very small number. It takes 7 bits to store, or at least 8 bits in a variable-width encoding.


Sure, if you encode cleverly. That is the essential difficulty of the problem which he glosses over (note that he explicitly says that he will store the differences as 4 byte numbers). Getting it under 1.5MB (or even 1.1MB) is probably easy, but I have yet to see an encoding which is sufficiently compact to fit in 1MB.


You're right, I misread that. Still, that's an information theoretic bound, and I'd like to see anybody achieve that. That's not a lot of room to play with until you hit 1 megabyte. Unless you're doing something complicated across multiple numbers, that variable length encoding will already waste 1 bit at which point you're back to 8 bits and already way over budget. Additionally, if I generate a random array, sort it, and compute adjacent differences, then around 30% of the numbers are greater than 2^7.


Even though it's almost certainly possible and even practically achievable to compress the required data into the 1MB of space, doing so without using more than 1MB even in temporaries might be impossible. (To get this close to the bound you'll need to use "excess" entropy from one bitpattern to save space in the next; i.e. a delta # doesn't straightforwardly correspond to a bit pattern, but instead depends on context.) Clearly it would be hard, as you say.

An argument that you need context info: if you don't use context info, each bit pattern needs to be rounded up to the nearest number of whole bits. That wastes about N/2 bits. According to my calculations, that would raise the bound from 0.96MiB to 1.02MiB, i.e. too much.

But really, it's even worse. Let's assume you can indeed define such a compression and you manage to use the context sufficiently to avoid wasting too many bits. Further, you manage to do so without using much temporary space at all.

That means you'll need to be updating your sorted list in place. Each update is likely to be a non-trivial task, since you need to find the place to update (with variable-width encoding and no space for luxuries like an index, that's probably a linear scan), and you need to take into account some amount of context - and then, because you've changed some data, recode possibly unbounded amounts (so N) of following context. That makes this sorting algorithm quadratic; and it'll have a high scaling factor too since the operations we're counting here aren't cheap: we're not moving bits around, we're decoding and recoding them all the time in a context-sensitive way and that means doing lots and lots of branches.

Assuming the 1MB limit is reasonably inspired because the chip needs to be tiny, you're probably not dealing with the fastest chip in the world. As an estimate, I'd guess this algorithm (if we could make it) would be about 1000000 times slower than a normal sort (assuming a 13 times higher constant factor), and you'd be running it on something very very slow. On a modern processor, a plain sort takes about 75ms, making this thing 75 ks, or approximately 1 day. On that very limited 1MB proc...

So even if you manage to pull it off (very hard), it's almost certainly going to be very slow.


It is possible. Just a bit complicated but possible with proper encoding. Definitely not easy to code in an interview. Or in a day of work.


How does that handle duplicate elements?


It doesn't. And it looks easily broken the more I think about it. Anyway, running time would be unusable. This is not the right path. Don't drink the Google kool-aid ;)


Thanks for the context.

So that solution does not cover all possible cases. Lovely of Google to ask that question.

Edit: not only that, the running time of Google's "solution" would be terrible. Also, aren't there other possible data sets not covered, or am I missing something?


Actually it can. I'll leave it up to you to figure it out. Proving that the adjusted algorithm works in all cases gets into some interesting number theory (think Galois Fields).


I'd rather spend my time on a useful solution than on this path. Also it's a BS question anyway.


Well in the case of Google's interviewing techniques they don't really care about the question per se, what they look for is how folks reason about things. You can start down one path, back up, head down another, and explore a variety of paths to possible solutions. The only bad answer is "That's a silly question, why would I ever need to do that?" or something similar. An unwillingness to engage in a discussion, even about 'impossible' problems, was the most common reason I saw on feedback where Google passed on the candidate.


> Well in the case of Google's interviewing techniques they don't really care about the question per se

From my experience interviewing for Google, I beg to differ. Many other accounts seem to confirm this.


Maybe they're just looking for improvements in the way they do sorting? Asking questions like these in interviews might be a way to find people who could teach their code monkeys new tricks?

Or it might have just been some silly way to eliminate candidates for arbitrary reasons. The usual.

Anyway, sorting is a big deal I think. If you can do it faster even by just a little bit than everyone else, that's a competitive advantage. Just my personal opinion.


I thought the popular Google version allowed you to use secondary storage, and the challenge was to be efficient in how often you swapped.


"The list of numbers may contain duplicates, which I must not discard."


I guess the googlers would answer having "0:1" for duplicates.

Big problem #1: insertions for 1M integers would take ages.

Big problem #2: And some distributions can't be covered this way. For example, 1M integers with distances in the range 0:99 (e.g. +99 each one). Now think of the same but with random distances in that range. (Note 99999999/1000000 ≈ 100, so it's a possible input.)

Google's approach is nonsense, too.


All of these bizarro explanations leave me wondering if I've missed something. Can't this be elegantly solved via the pigeonhole principle?

There are one million numbers to be sorted, out of a space of one hundred million. That means that no number may be more than 100 away from any other number once fully sorted (7 bits). Therefore, you can simply use deltas to encode your numbers, as 7 bit deltas * one million numbers < 1 MB RAM.

EDIT: should've been clearer: no number may be more than 100 away from any other number on average once fully sorted. Therefore, it's an average of 7 bits per number, maximum. Duplicates are even easier, since it's only one bit to encode the duplicate (a delta of zero).

EDIT 2: As for the encoding to be used, I think a universal code or Golomb code would probably be sufficient. They can get quite close to natural entropy.
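
For concreteness, a Golomb-Rice coder over the sorted deltas might look like the sketch below (an illustration of the idea only, not a claim that this particular parameter squeezes under 1 MB; that is exactly what is disputed elsewhere in the thread):

    def rice_encode(delta, k=6):
        """Golomb-Rice code: quotient in unary, then remainder in k bits."""
        q, r = delta >> k, delta & ((1 << k) - 1)
        return "1" * q + "0" + format(r, "0{}b".format(k))

    def rice_decode(bits, k=6):
        q = bits.index("0")                     # unary quotient
        r = int(bits[q + 1:q + 1 + k], 2)       # k-bit remainder
        return (q << k) | r, bits[q + 1 + k:]   # value plus the unread tail

    # a duplicate (delta 0) costs k+1 bits, a typical delta near 100 costs k+2
    print(rice_encode(0), rice_encode(100), rice_encode(1000))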


Duplicates are allowed, so you could for example have 500,000 zeros and 500,000 99,999,999's, which are more than 100 apart.


That is one of the more ideal distributions, as each number gets one (zero) bit for delta, and the number at the jump point gets a large but globally inconsequential number of bits. Bits per number is still less than 7.


> That means that no number may be more than 100 away from any other number once fully sorted (7 bits)

It's possible that 90% of numbers are <1 million, and the remaining 10% are > 91 million. That's a 90 million delta possibility.


You would of course use a delta scheme that can encode any number using O(log n) bits. At that point having a lopsided distribution of numbers actually helps you. Cramming the first 90% of the numbers into 1% of the space can save you about six bits per number. Even if you spread out the remaining 10% the price is going to be far less than sixty bits each.


Hmm, that doesn't follow, I can have 0, and from 99,000,002 - 100M.


Since no computational bound has been placed, this problem could be solved in O(n^2) by an insertion sort, keeping the list of numbers sorted in memory as they are received. The problem then boils down to encoding a list of sorted 8-digit decimal #s, where it's possible to insert new #s.

Since the #s are stored sorted and bounded in size, they can be encoded as deltas, which will be more space efficient than storing absolute values. Now we just need to figure out the worst-case encoding and whether 1 million values will fit.


i think that's the clearest answer i've seen. really, it's pretty much equivalent to the other answers, but at least for me it's by far the easiest to grasp. although i guess that may be because the idea is simply becoming familiar after reading the others.

the best "high level" explanation, i think, is that you are compressing the sorted numbers, which are therefore not random, and so concerns about the incompressibility of random streams are completely irrelevant.


Merge sort would work better. But yes, the secret sauce is in the compression, not the sorting.


Information theory to the rescue! Well, not really.

8 decimal digits takes 26.6 bits. An ordered list of 10^6 of these takes 3.17 MiB. The information contained in the ordering is lg(10^6!) ~= 2.20 MiB [0]. So as an unordered multiset, the information content is 0.96 MiB. It's at least theoretically possible to store the input in 1 MiB of working memory. But only just; in fact it's significant that the problem specifies 2^20 bytes, because for an SI megabyte (10^6 bytes), it wouldn't work.

I don't think it's actually possible though. The answers here don't do it. LZMA accomplishes nothing.

[0] Stirling's approximation in base 2: lg(n!) ~= n lg(n) - n/ln(2)
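
The arithmetic is easy to reproduce (a small sketch; lgamma stands in for Stirling but gives the same figures to the precision quoted):

    from math import lgamma, log, log2

    MIB = 8 * 2**20                          # bits in one MiB

    raw      = 10**6 * log2(10**8)           # ordered list: 10^6 values, ~26.6 bits each
    ordering = lgamma(10**6 + 1) / log(2)    # lg(10^6!) without computing the factorial
    print(raw / MIB, ordering / MIB, (raw - ordering) / MIB)
    # ~3.17 MiB, ~2.20 MiB and ~0.96 MiB respectively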


> The information contained in the ordering is lg(10^6!)

Can you point me to a resource that discusses the information theory behind this claim? I'm interested in learning more, but don't know what to search for.


You have N elements, so they have N! possible permutations. The ordering of the list is one such permutation; to describe one message out of a language of N! possible messages, it takes lg(N!) bits.


Interesting. Thanks for explaining. Are there ways to store (multi)sets that take advantage of this fact to reduce size? I can't envision a way to utilize this information meaningfully. It seems like any storage scheme would imply some ordering.


Yes, the other commenter ChuckMcM showed one

https://news.ycombinator.com/item?id=4680259

An ordering of a storage scheme doesn't always store information. E.g. if the list is sorted, the ordering is completely determined by the data -- it's redundant.

For storing integers, an idea is to store the pairwise differences between sorted elements. E.g. [30,10,20] -> [10-0,20-10,30-20] = [10,10,10]. If you have N integers of typical size X, the differences will be much smaller, typically on the scale of X/N. With variable-width integers, X/N takes lg(N) fewer bits to encode than X. So you save N lg(N) bits (asymptotically the same as lg(N!)), compared to storing the integers literally.
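
A tiny sketch of that (the helper names are made up):

    def deltas(xs):
        xs = sorted(xs)
        return [b - a for a, b in zip([0] + xs, xs)]   # first delta is from 0

    def undeltas(ds):
        out, total = [], 0
        for d in ds:
            total += d
            out.append(total)
        return out                                     # recovers the sorted multiset

    print(deltas([30, 10, 20]))     # [10, 10, 10]
    print(undeltas([10, 10, 10]))   # [10, 20, 30]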


> Yes, the other commenter ChuckMcM showed one

I saw that comment. I'm not clear how that's not still storing ordered data, or rather how the order there is not storing information.

Obviously my information theory knowledge is weak.

> if the list is sorted, the ordering is completely determined by the data -- it's redundant.

I don't understand this. In what way is the information redundant? If there's 3.17 MB of data in the ordered list, and 2.2 MB of data in the ordering itself, the information stored in the ordering cannot be redundant, because that would mean >4.4MB of information is stored in the ordered list.


> So as an unordered multiset, the information content is 0.96 MiB

Could you please explain this more or point to other links?


The second answer does it.


Since code size is not restricted, the obvious solution is to store the entire list of numbers received in the program counter. In short, put every possible input combination into the code, and just follow whatever branch pattern you receive.

This requires no memory other than that for the networking stack. It is, of course, also completely impractical.


This was my thought as well. But if you can afford several TB of ROM you can probably afford to increase your RAM to 4MB.


It's not TB of ROM. It's much, much more: 2^(2^20) addresses... Even if you're faking it and the addresses aren't real (inevitable with those numbers), you're still just cheating and storing data in a multi-megabyte dynamic (not ROM!) program counter. After all, you need to distinguish all those addresses...


It's not cheating! If they wanted to limit the size of the program counter then they should have specified a limit on the amount of code!


EDIT: I put this same description into the stackoverflow page with an image which might help visualize it.

I think one way to think about this is from a combinatorics viewpoint: how many possible combinations of sorted number orderings are there? If we give the combination 0,0,0,....,0 the code 0, and 0,0,0,...,1 the code 1, and 99999999, 99999999, ... 99999999 the code N, what is N? In other words, how big is the result space?

Well, one way to think about this is noticing that this is a bijection of the problem of finding the number of monotonic paths in an N x M grid, where N = 1,000,000 and M = 100,000,000. In other words, if you have a grid that is 1,000,000 wide and 100,000,000 tall, how many shortest paths from the bottom left to the top right are there? Shortest paths of course require you only ever either move right or up (if you were to move down or left you would be undoing previously accomplished progress). To see how this is a bijection of our number sorting problem, observe the following:

You can imagine any horizontal leg in our path as a number in our ordering, where the Y location of the leg represents the value (image: http://i.stack.imgur.com/aJp4b.png ). So if the path simply moves to the right all the way to the end, then jumps all the way to the top, that is equivalent to the ordering 0,0,0,...,0. If it instead begins by jumping all the way to the top and then moves to the right 1,000,000 times, that is equivalent to 99999999,99999999,..., 99999999. A path where it moves right once, then up once, then right once, then up once, etc. to the very end (then necessarily jumps all the way to the top), is equivalent to 0,1,2,3,...,999999.

Luckily for us this problem has already been solved, such a grid has (N + M) Choose (M) paths:

(1,000,000 + 100,000,000) Choose (100,000,000) ~= 2.27 * 10^2436455

N thus equals 2.27 * 10^2436455, and so the code 0 represents 0,0,0,...,0 and the code 2.27 * 10^2436455 and some change represents 99999999,99999999,..., 99999999.

In order to store all the numbers from 0 to 2.27 * 10^2436455 you need lg2 (2.27 * 10^2436455) = 8.0937 * 10^6 bits.

1 megabyte = 8388608 bits > 8093700 bits

So it appears that we at least actually have enough room to store the result! Now of course the interesting bit is doing the sorting as the numbers stream in. Not sure what the best approach to this is, given we have 294908 bits remaining. I imagine an interesting technique would be to at each point assume that that is the entire ordering, find the code for that ordering, and then as you receive a new number go back and update the previous code. Hand wave hand wave.
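
Those figures are easy to check (a quick sketch; summing logs avoids ever building the astronomically large binomial coefficient):

    from math import log2

    N, M = 1_000_000, 100_000_000

    # lg2((N + M) choose M) = sum over i = 1..N of lg2((M + i) / i)
    bits = sum(log2((M + i) / i) for i in range(1, N + 1))
    print(bits)           # ~8.0937e6 bits
    print(8 * 2**20)      # 8,388,608 bits in 1MB, so the encoded result just fits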


I think the answers to most of these seemingly impossible questions depend on the discovery of isomorphic problem spaces where they can be solved a lot more easily.

In your case you translated the problem to a different problem space and solved it there. The other contributors on Stack Overflow tend to do the same, e.g. using network latency, compression, etc. to solve these problems.

These sort of solutions become very interesting when they become isomorphic to some other real world problems.


In case anyone is interested in how to prove "(N + M) Choose (M) paths":

Any such path ends at (N,M). Such a path can be represented as a sequence of N+M bits, 0=right, 1=up, where there are exactly M ups. So choosing a path is identical to choosing the positions of the M ups, thus there are (N+M) choose (M) such paths.

EDIT: My first proof was needlessly complicated because it dealt with paths that ended at arbitrary (N,m) but really you can just let the paths go all the way up to (N,M)-- even if the max value in the list is m < M-- and just ignore the last part of the path. It's still a bijection.


Yeah I think if you number the permutations in order from smallest to largest (i.e. all 0 to all 99999999) then you can update as new numbers come in by keeping track of the number (N) you've already seen and assuming the other (million - N) are 0. So you just have your 10^6-bit number representing the sorted list of numbers and a 27-bit integer representing the number of samples seen. The update function might be a bit ugly though :)

It's probably actually a question of whether the update function could be done in the remaining memory - it certainly couldn't unpack the whole representation.


Radix sort. Done.

Depending on his dataset characteristics a radix sort can have a space requirement as low as a few hundred bytes to sort several million values.

EG: for 8-bit values, simply make a 256-byte array. Increment the appropriate count each time you see a value. When you've gone through the list, loop through the array, outputting each index as many times as its count. It's also quite cache friendly, though mind that the last time I compared it against quicksort was on a Pentium Pro.

For larger datasets, you actually want to compare on the digits (LSD or MSD first), and that'll take more memory.

EDIT: Originally posted that it'd take 256 bytes of memory. That's not true for his dataset.
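
A sketch of the 8-bit counting variant described above (usually called a counting sort; purely illustrative, since, as the reply below notes, the 8-digit case would need 10^8 counters):

    def counting_sort_bytes(values):
        counts = [0] * 256                # one counter per possible 8-bit value
        for v in values:
            counts[v] += 1
        out = []
        for v, c in enumerate(counts):
            out.extend([v] * c)           # emit each value as many times as seen
        return out

    print(counting_sort_bytes([5, 3, 200, 3, 0]))   # [0, 3, 3, 5, 200]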


There is not enough memory to keep all the numbers, and therefore radix sort is not a viable solution. But I think you actually meant counting sort; that will not work either. It requires 100,000,000 slots, each 20 bits wide, for a total of almost 238.5 MiB.


> Radix sort. Done.

Got code?


Reminds me of the sleep sort that was on here a few months ago. If you don't care about the run time and your processor is fast enough, it might be hacked to work.

http://rosettacode.org/wiki/Sorting_algorithms/Sleep_sort


As others have mentioned, you need 7 bits per number (on average) if you store the numbers as deltas in sorted form. So 7M bits out of 8,388,608 bits leaves about 1.3 million bits of working set.

One could implement a simple block allocator, where each block contains a sequential list of deltas.

The trick to fast insertion is to place new blocks at addresses interpolated between 0 and 10^8. If there is a collision, merge blocks. If the input distribution is off, physically reallocate colliding blocks left or right into free space.

So inserting the numbers 10, 20, 1000, 2000, 1M, 2M would give you a heap looking like:

[head->[10,10]->[980]->[1000]->[998000]->[1000000]->tail]

As more numbers are inserted, blocks combine until you end up with one contiguous block.


Am I the only one who thinks this was someone's homework assignment?


It does sound similar to the opening problem in the book "Programming Pearls." The solution proposed there was to write temporary files to disk and do a k-way sort.


Except in this case, you have no disk.


Compressed buckets of number ranges. I love it because it's incredibly easy to understand which makes the code easy to maintain, and it works.


Here's the solution that I came up with prior to following the link or reading any comments.

We're given 1 million integers from 0 to 99,999,999. We only have 1MB of RAM, or an average ~8 bits for each of the million numbers. So we can't store the numbers directly since they take ~27 bits each.

First thought was to use a bitset but that would require 100 million bits, and we only have ~8 million bits RAM, so that's not going to work. Also need to deal with duplicates.

How about this: something similar to a selection sort algorithm that stores deltas of distances between sorted numbers. As a number is streamed in, we start scanning from the beginning of the list until its correct position is found, where it is inserted, and then we push down the remaining numbers. This will be O(n^2).

Since the average delta distance between numbers is about 100, we'll use 8-bits to store the delta value. Value 0 means the number is a duplicate of the current number. Values 1-254 mean add this number to the current number for the new number. Value 255 means add 255, then use the next byte as the delta value (repeat until the value != 255).

(Case 1) 1 million ints exactly 100 apart: 0, 100, 200, 300, 400, ..., 99999800, 99999900 Stored as a list of 8-bit delta values: 0, 100, 100, 100, 100, ..., 100, 100 (1 million bytes total)

(Case 2) 1/2 million zeros, then 1/2 million values of 99999999. Stored as: start: 0, 0, 0, 0, ... (1/2 million zero deltas) then: 255, 255, 255, 255, ... (99999999 / 255 = 392,156 times repeated, which gets us to number 99,999,780) then: 219, 0, 0, 0, 0, ... (another 1/2 million zero deltas)

So the total amount of storage for Case 2, which I presume is the worst case (but correct me if I'm wrong!), is: 500,000 + 392,156 + 1 + 500,000 = 1,392,157 bytes to store the delta values.

1MB = 1,048,576 bytes, so I'm over by 343,581 bytes... (so close!)

We'll have to modify this scheme so that we reduce the number of 255 values, which should not be hard to do and will get us under the 1MB size limit. Or we could try something fancier like huffman coding to reduce the size of the delta values.
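
A sketch of that escape-byte encoder (a reconstruction of the scheme above; running it on Case 2 lands around 1.39 million bytes, matching the over-budget arithmetic):

    def encode(sorted_values):
        out, prev = bytearray(), 0
        for v in sorted_values:
            d = v - prev
            while d >= 255:               # 255 means "add 255 and keep reading"
                out.append(255)
                d -= 255
            out.append(d)                 # 0 marks a duplicate of the previous value
            prev = v
        return out

    case2 = [0] * 500_000 + [99_999_999] * 500_000
    print(len(encode(case2)))             # ~1.39 million bytes, over the 1MB budget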


Compressed bucket-ranges work fine for random data, but if the data are skewed in some way (e.g. normal-curve shaped), the bucket sizes must be adjustable.

An alternative method is to use one bucket into which all values below a limit (e.g. 20,000,000) are sorted as they arrive, and compress all the rest. When 10^6 values have arrived, transmit the bucket and then reuse the empty bucket repeatedly to sort the compressed values.


Just a thought but given "have a computer with 1M of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, "

I was wondering why you can't use TCP as a form of storage; possibly many ways, but latency and buffers would actually work for you as crude storage. Not that it is needed in this case, but it is one form of queue that could be abused to store data.


If you go that way: mail all numbers as message subjects to your GMail account, then list your messages alphabetically.

You can 'store' data in DNS, too, by measuring response time to non-existent domains. The first lookup stores a one; skip it if you want to store a zero. The second lookup destructively reads it, with some probability of data loss.


That was actually suggested by one of the replies:

http://stackoverflow.com/questions/12748246/sorting-1-millio...


I must have missed that one, though it's good to know I'm just as sane as others out there. Many forms of storage that most forget are out there, as they're not directly obvious. Even the display RAM is usable - blame my ZX81 days for that perversion.


Store each number as a delta from a previous one, first number as a delta from 0. Each number starts with a code describing its length: 0 (duplicate), 7, 9 or 27 bits. For storing a code use values '1', '01', '001' and '0001'. Calculate statistics for different code types and when space is tight replace most common codes with shortest ones.


Bit late to this as I've been ill.

If you try that encoding scheme with some random data you'll find it won't always fit.

I took 1,000,000 random numbers between 0 and 99,999,999 and sorted them and calculated the deltas. Here's the distribution of the sizes of the deltas:-

   diffbits: 1 nos 5009
   diffbits: 2 nos 19937
   diffbits: 3 nos 38275
   diffbits: 4 nos 72319
   diffbits: 5 nos 128146
   diffbits: 6 nos 202000
   diffbits: 7 nos 252698
   diffbits: 8 nos 202681
   diffbits: 9 nos 72750
   diffbits: 10 nos 6148
   diffbits: 11 nos 37
   totbits = (1*5009)+(2*19937)+(3*38275)... = 6,408,685
That's the total number of bits required to store just the deltas, but doesn't include the length encoding.
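
Something like this reproduces that experiment (a reconstruction; the exact bit-length convention used for zero and tiny deltas isn't stated above, so the smallest buckets may differ slightly):

    import random

    xs = sorted(random.randrange(100_000_000) for _ in range(1_000_000))
    diffs = [b - a for a, b in zip([0] + xs, xs)]          # deltas, first one from 0

    totbits = sum(max(d, 1).bit_length() for d in diffs)   # zero deltas counted as 1 bit
    print("totbits =", totbits)                            # ~6.4 million
    print("spare   =", 8 * 1024 * 1024 - totbits)          # ~2 million bits left over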

8*1024*1024 - 6408685 = 1979923 spare bits.

Each of the 999,999 deltas will take at least one bit to determine the size of the delta, leaving you with 979,924 bits for extra length encoding and wastage (if you pack multiple lengths under one encoding).

There's no way I can see (after trying lots of permutations) to be able to encode all of those bunches of diffs using those few remaining bits.

What's even more difficult is that you don't have the space to 'calculate statistics for different code types' because you can't fit anywhere near all of the deltas into that 1MB of memory without having them encoded perfectly anyway. Calculating them on partial data is all that you can do, and that's not going to be accurate enough because you've no idea what the unseen data has to hold (it could all be duplicates or 11 bit differences...).

I'm keeping going on looking at this (and trying not to look at any of the posted solutions).


You are right, my solution is not really working. But I still believe that there might be a working solution of encoding deltas.

Thank you for your reply!


Does anyone know what the original question was? It's clear from the answers that the input used to be some kind of stream (or it was unspecified and everyone jumped on this interpretation), whereas now it is in ROM.


No, the input is 1 million 8-digit decimal numbers over TCP.

The ROM is used to store the program itself, so the code doesn't take up any of the limited memory itself.


I misread it. Thanks.


Now, can you please solve this with 50% less code using Haskell?


Go ahead.


I'm guessing it would take substantially more LOC to do this in Haskell than in Java. The problem seems to lend itself much better to OO design patterns than to functional ones.

edit: unless perhaps you make use of haskell's FFI, then maybe haskell could beat it on LOC


I really don't see how OO design patterns come into play here. I haven't seen anyone suggest OO-based solutions.


The POC code offered by Renat Gilmanov is OO.


That's not a solution.


i don't think you can run a JVM in less than 1MB of ram?


True. The problem states, though, that the code can be placed in ROM. I agree that this isn't realistic for a JVM.


What did the question mean by "8 digit decimal number"? Did that mean anything from 0.0000001 to 99999999?


Use a machine with 32 bit bytes.



