Sorting a million 32-bit integers in 2MB of RAM using Python (neopythonic.blogspot.com)
18 points by astrec on Oct 23, 2008 | 7 comments



Yes, the question is bogus since 1 million 32-bit integers require 4MB of storage.

Here is my solution: select the 500,000 smallest values, sort them using an in-place sorting algorithm and emit them; then select the 500,000 biggest values, sort them and emit them.

The problem then boils down to selecting the n/2 smallest values of the list. To do this, I store the first n/2 values in the 2MB buffer and organize them into a heap with the biggest value at the top. I then process the remaining n/2 values: when one of them is bigger than the heap's top value, it is skipped; if it is smaller, it replaces the heap's biggest value and the heap is restored to bring the next biggest value to the top. When all the remaining n/2 values have been processed, the 2MB buffer contains the n/2 smallest values of the million-integer list.
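A minimal sketch of this selection step, using Python's standard heapq module. heapq only provides a min-heap, so values are negated to simulate the max-heap described above; the function and parameter names are illustrative, not from the original comment.

    import heapq

    def smallest_half(values, k):
        """Retain the k smallest values seen so far in a size-k max-heap.

        heapq is a min-heap, so values are stored negated: -heap[0] is
        always the biggest of the k values retained so far.
        """
        heap = []
        for v in values:
            if len(heap) < k:
                heapq.heappush(heap, -v)
            elif v < -heap[0]:               # smaller than the current max: keep it
                heapq.heapreplace(heap, -v)  # pop the max, push the new value
        return heap                          # the k smallest values, negated

The standard library's heapq.nsmallest(k, values) produces the same result directly.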

Since the buffer is already organized as a heap, one can use heapsort to sort it: pick the biggest value at the top, swap it with the last value of the list, and restore the heap to bring the next biggest value to the top. Swap that with the last unsorted value, and proceed until the heap is reduced to one element.
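In Python the swap-and-restore heapsort can be approximated by simply draining the heap: since the values are stored negated, each pop yields the biggest remaining value, and reversing gives ascending order. A sketch (emit_ascending is a hypothetical helper; the in-place version described above avoids even the output list):

    def emit_ascending(heap):
        """Drain the negated max-heap built by smallest_half.

        Each pop returns the biggest remaining value, so the output is
        collected in descending order and reversed at the end. The heap
        shrinks as the output list grows, so peak memory stays near k.
        """
        out = []
        while heap:
            out.append(-heapq.heappop(heap))
        out.reverse()
        return out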

Apply the same algorithm to the n/2 biggest values, but using a heap with the smallest value at the top.
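For this second pass, heapq's native min-heap already keeps the smallest value at the top, so no negation is needed, and draining it with heappop yields the values directly in ascending order. Again a sketch with illustrative names:

    def largest_half(values, k):
        """Retain the k largest values seen so far in a size-k min-heap.

        heap[0] is the smallest retained value and is evicted whenever
        a bigger value arrives.
        """
        heap = []
        for v in values:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:               # bigger than the current min: keep it
                heapq.heapreplace(heap, v)
        return heap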

This algorithm requires two passes over the million integers, and uses nothing more than the 2MB buffer.

The complexity is O(n log n).


I am in the process of writing a program to sort one million 32-bit integers in 2MB of RAM.

The trick is to notice that sorting the numbers loses information. One million sorted numbers take (in theory) log_2(1000000!) fewer bits to store than the same numbers shuffled randomly. That's around 18488884.82 bits.

So you get an ideal storage requirement of 32000000 - 18488884 = 13511116 bits. That's about 1.6MB (thus less than 2MB). If you're clever you'll stay close enough to the theoretical limit in your implementation.
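The arithmetic is easy to check with Python's math.lgamma, which gives ln(n!) without materializing the factorial; a quick verification sketch:

    import math

    n = 1000000
    saved_bits = math.lgamma(n + 1) / math.log(2)  # log2(n!), ~18488884.82 bits
    ideal_bits = 32 * n - saved_bits               # ~13511116 bits
    print(saved_bits)
    print(ideal_bits / 8 / 2.0 ** 20)              # ~1.61 MB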

Strangely, bz2 seems to be able to compress my list of random numbers (in a binary file) to 164KB. That's even less than what my theory says.

At first I suspected my pseudo-random number generator (/dev/urandom on a recent Ubuntu) might produce hidden patterns. But I have since tested it on true random numbers from random.org. Same result.

I'll investigate.


I believe your ideal storage requirement, but what's the time complexity of your algorithm, and could you describe it in more detail than "clever"?

I'm pretty sure random reads and writes on any compressed form will be worse than O(1), and I'm pretty sure your sort will do O(n log n) of those, so I believe your algorithm is asymptotically (and in practice for n = 1 million) slower than chmike's two-pass algorithm.


I was thinking along eru's lines. The specifics I came up with:

1) Compress sorted lists by encoding the differences between successive numbers using a code that's shorter for smaller numbers (see the sketch below).

2) Use in-memory merge sort. That way you don't need random access to the lists, and it's overall O(n log n).

3) To merge large lists when memory is already almost full, divide the lists into blocks, and allocate blocks for the merged list as you deallocate blocks from the two source lists.

4) To do this in Python, use the "array" module.
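A minimal sketch of step 1 and its inverse, assuming a 7-bit varint as the "code that's shorter for smaller numbers" (the comment doesn't pin down the exact code, so that choice and the function names are assumptions). The gaps in a sorted run are non-negative, and the array module from step 4 keeps the encoded bytes compact:

    from array import array

    def encode_deltas(sorted_vals):
        """Encode a sorted run as varint-coded gaps.

        Each gap between consecutive values is emitted in 7-bit groups,
        low bits first; the high bit of a byte means "more bytes
        follow", so small gaps cost a single byte.
        """
        out = array('B')  # compact unsigned-byte storage
        prev = 0
        for v in sorted_vals:
            gap = v - prev
            prev = v
            while gap >= 0x80:
                out.append((gap & 0x7F) | 0x80)
                gap >>= 7
            out.append(gap)
        return out

    def decode_deltas(buf):
        """Inverse of encode_deltas: regenerate the sorted values."""
        prev = gap = shift = 0
        for byte in buf:
            gap |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7
            else:
                prev += gap
                yield prev
                gap = shift = 0

Because the decoder is a generator, two encoded runs can be merged with heapq.merge (or a plain two-pointer loop) without fully decompressing either one, which is exactly the streaming access that step 2 relies on.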


Ah, I made a mistake when compressing the real file - I assumed 2 million 16-bit numbers.

Using 1 million 32-bit numbers I still get down to 1891867 bytes with 7zip. That's less than 2MB, but more than the theoretical minimum of 13511116 bits (= 1688889.5 bytes).


[deleted]


He explains it when he talks about the "heap". See http://en.wikipedia.org/wiki/Heap_(data_structure)


http://netlib.bell-labs.com/cm/cs/pearls/cto.html

Millions of numbers in a single megabyte may be interesting.




