Anyway -- the machine is not swapping: Python grabs ~20 GB of RAM, and there is another ~100 GB available. It pegs one of the cores. Nothing else was running during this test, so there was no competition for the FSB. This is on a recent-generation 16-core Xeon server. The bit with the pipe and the popen is just me writing the header for the Matrix Market file to a separate file, because it's simpler to dump it from Hadoop that way. My time measurements above did not include file-reading time -- just the norms code itself (which is essentially a dot product of each column with itself). The matrix file is 2 GB uncompressed on disk; dimensions are [5e5, 13e7] with 1.2e8 nonzeros.
I rewrote it as a sparse, column-major matrix in Java, running on the same machine, with a sparse dot-product implementation. I was off a little before -- the time to stripe the entire dataset is ~2.7 seconds, averaged over 1000 tries. I was confused earlier because I'm spawning 5 threads, which gets the time to compute the dot product of one column against every other column down to an average of one pass through the data per second. If you contact me, I'm happy to share the code, but it's rather more lengthy.
One more edit -- the Java code is not particularly optimized and doesn't batch columns; it just spawns 5 threads that bang on my giant array. The data is stored as a column-major array of sparse vectors, where each sparse vector is a pair of arrays: integer indices and integer values.
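For what it's worth, here is a minimal Python sketch of that kind of sparse dot product -- the standard two-pointer merge over sorted index arrays. (This is my reconstruction, not the actual Java code; the names and details are illustrative only.)

    def sparse_dot(idx_a, val_a, idx_b, val_b):
        # Walk both sorted index arrays in lockstep, multiplying
        # values only where the indices coincide.
        i = j = 0
        total = 0
        while i < len(idx_a) and j < len(idx_b):
            if idx_a[i] == idx_b[j]:
                total += val_a[i] * val_b[j]
                i += 1
                j += 1
            elif idx_a[i] < idx_b[j]:
                i += 1
            else:
                j += 1
        return total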
Such large differences imply a difference in the algorithm.
Indeed, you write:

    c = mat2.getcol(j)
    norms[0, j] = scipy.linalg.norm(c.A)
which means (i) extracting a sparse column vector, (ii) converting it to a dense vector, and (iii) computing the norm of that dense vector. This should explain the speed difference: the dense norm touches all 5e5 entries of the column rather than just its handful of nonzeros, so, looking at the nnz, it can take up to a factor 5e5/(1.2e8/1.3e7) ~ 54000 longer :)
The main issue here is that the linear algebra stuff under `scipy.linalg` doesn't know about sparse matrices, and tends to convert everything to dense first. You'd need to muck around with `m2.data` to go faster.
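For instance, a minimal sketch (assuming m2 is in CSC format, where `m2.data` holds the nonzero values and `m2.indptr` marks each column's slice -- the toy matrix here just stands in for the real data):

    import numpy as np
    import scipy.sparse as sp

    m2 = sp.rand(500, 1000, density=0.01, format='csc')

    # Column norms straight off the sparse storage -- no densifying.
    norms = np.empty(m2.shape[1])
    for j in range(m2.shape[1]):
        col = m2.data[m2.indptr[j]:m2.indptr[j + 1]]
        norms[j] = np.sqrt(np.dot(col, col))

    # Loop-free variant: square elementwise (stays sparse), sum per column.
    norms2 = np.sqrt(np.asarray(m2.multiply(m2).sum(axis=0)).ravel())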
Thanks a bunch for the help. It would be nice if this were documented somewhere other than the source code, though :( None of my googling turned up m2.data.
I'd actually guessed that it might be densifying the columns, but I'd expected to see a step-ladder memory pattern as vectors were allocated, GC was triggered, more vectors were allocated, and so on. I didn't observe such a pattern; memory usage was almost constant.
Anyway, thanks again for your help -- I'd offer over email to buy you a beer if you're ever in SF, but there's no email address listed, so...
Is there any chance that SciPy was compiled without the ATLAS/BLAS/LAPACK libraries? (You can verify this with the "numpy.show_config()" and "scipy.show_config()" functions in an interactive Python session.) In that case, all the numerical math would be done in straight Python rather than optimized C/Fortran.
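That is, something along these lines:

    >>> import numpy, scipy
    >>> numpy.show_config()   # prints the BLAS/LAPACK libraries numpy was built against
    >>> scipy.show_config()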
ATLAS/BLAS/LAPACK is mostly useless for basic sparse operations, at least in scipy; it becomes useful for more advanced operations like SVD, etc. The exception is when you can use dot (the scalar product), which is very fast in numpy if you use ATLAS -- e.g., time the difference between np.sum(x * x) and np.dot(x, x): sum does not use ATLAS, dot does.
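A quick way to see the gap (the array size and repetition count here are arbitrary; the contrast is the point):

    import numpy as np
    import timeit

    x = np.random.rand(10 ** 7)

    # np.sum(x * x) allocates a temporary array and reduces it in plain C;
    # np.dot(x, x) dispatches to the BLAS ddot routine when numpy is linked
    # against ATLAS (or another optimized BLAS).
    print(timeit.timeit(lambda: np.sum(x * x), number=10))
    print(timeit.timeit(lambda: np.dot(x, x), number=10))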
For sparse SVD as I implemented it in scipy, even the fast dot does not seem to make that much of a difference (it would be interesting to do real benchmarks, though).
http://gist.github.com/578226#file_gistfile1.py (sorry, that's an editable link, so please be nice)