I worked on a very interesting project aligning point clouds using SVD. For a pair of point clouds of the same scene that are not aligned:
- select 3+ pairs of matching points in each cloud (tops of trees, edges of a building, etc.)
- calculate the vector to the centroid of each cloud
- use SVD to calculate the rotation that gives a minimum distance when applied from the source to the target
- translate and rotate the source cloud to the target
I did this using the Rust nalgebra^1 crate after reading a very helpful paper^2 detailing the process. I had planned to build the Rust lib into WASM so the process could run alongside the browser-based point cloud visualiser we were using, but I had limited time and instead used Neon^3 to build a native binding for our Node.js server.
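A rough sketch of that pipeline with nalgebra, assuming the matched points are already paired up by index (the function and variable names are mine, not from the project):

    use nalgebra::{Matrix3, Point3, Rotation3, Vector3};

    // Rigid alignment of matched point pairs via SVD, roughly following the steps above.
    fn align(source: &[Point3<f64>], target: &[Point3<f64>]) -> (Rotation3<f64>, Vector3<f64>) {
        assert_eq!(source.len(), target.len());
        let n = source.len() as f64;

        // Centroid of each set of matched points.
        let c_src = source.iter().fold(Vector3::zeros(), |acc, p| acc + p.coords) / n;
        let c_tgt = target.iter().fold(Vector3::zeros(), |acc, p| acc + p.coords) / n;

        // Cross-covariance of the centred points.
        let mut h = Matrix3::zeros();
        for (s, t) in source.iter().zip(target) {
            h += (s.coords - c_src) * (t.coords - c_tgt).transpose();
        }

        // SVD gives the rotation R = V * U^T; flip the last axis if we got a reflection.
        let svd = h.svd(true, true);
        let (u, v_t) = (svd.u.unwrap(), svd.v_t.unwrap());
        let mut r = v_t.transpose() * u.transpose();
        if r.determinant() < 0.0 {
            let d = Matrix3::from_diagonal(&Vector3::new(1.0, 1.0, -1.0));
            r = v_t.transpose() * d * u.transpose();
        }

        // Translation that carries the rotated source centroid onto the target centroid.
        let rotation = Rotation3::from_matrix_unchecked(r);
        let translation = c_tgt - rotation * c_src;
        (rotation, translation)
    }

The determinant check guards against the SVD handing back a reflection instead of a proper rotation.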
This is cool, I briefly read through the paper. One thing I’m curious about is how hard it would be to add the ability to do a scale transformation (e.g. zoom in/zoom out) in addition to the translation and rotation. Would it be as simple as adding a scale factor to the optimization objective and reworking a bit of the math?
Very interesting. I will be digging into this one. The first thing that popped out at me:
> We can decompose a given image into the three color channels red, green and blue. Each channel can be represented as a (m × n)‑matrix with values ranging from 0 to 255. We will now compress the matrix A representing one of the channels.
I wonder if the author considered converting to YCbCr colorspace first. The luminance component (Y) is substantially more important to the human visual system than the Cb/Cr components. Some subsampling of the chrominance components would probably work well in these schemes.
This leaped out at me too. Doing lossy compression on RGB is not setting yourself up for success. Of course, then you'd need two sliders in the UI: one for how much to compress the Y and one for the Cb/Cr.
SVD also works on complex matrices. I imagine there's value in compressing the subsampled Cb/Cr channels as real/imaginary components in a complex matrix.
The representation by SVD would work almost identically. SVD is one of those absolutely magical, amazing algorithms that power a ton of things we do every day.
Especially if YCbCr is a linear transform of the RGB. (I'm not sure if it is, but it's likely to be close.) If it is, it's essentially just making eigenvectors from a matrix with a different set of basis vectors.
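For reference, the full-range BT.601 / JFIF version of the conversion is linear in R, G and B apart from a constant +128 offset on the chroma channels. A quick sketch, with everything in 0..255:

    // Full-range BT.601 / JFIF RGB -> YCbCr: a 3x3 linear map plus a +128 chroma offset.
    fn rgb_to_ycbcr(r: f64, g: f64, b: f64) -> (f64, f64, f64) {
        let y  =  0.299    * r + 0.587    * g + 0.114    * b;
        let cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0;
        let cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0;
        (y, cb, cr)
    }

So up to that constant offset it really is just a change of basis on the colour values.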
Funny enough, I did this same project (minus the fancy web interface) for a numerical linear algebra course in college—except I had to do it in Matlab.
It's worse than just about any "real" image compression algorithm, but it works! (Plus you get lossless compression if your image is low-rank.)
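For anyone who wants to reproduce that project, the core of it is only a few lines on top of any SVD routine. A sketch with nalgebra, assuming a recent version where svd() returns the singular values sorted in decreasing order:

    use nalgebra::DMatrix;

    // Rank-k approximation of one image channel: keep only the k largest singular values.
    // Storing U_k, the k singular values, and V_k^T costs k * (m + n + 1) numbers versus
    // m * n for the raw channel, so it only pays off when k is well below m*n / (m + n + 1).
    fn rank_k_approx(channel: &DMatrix<f64>, k: usize) -> DMatrix<f64> {
        let svd = channel.clone().svd(true, true);
        let u = svd.u.unwrap();       // m x min(m, n)
        let v_t = svd.v_t.unwrap();   // min(m, n) x n
        let s = svd.singular_values;  // assumed sorted in decreasing order

        let sigma_k = DMatrix::from_diagonal(&s.rows(0, k).into_owned());
        u.columns(0, k) * sigma_k * v_t.rows(0, k)
    }

The lossless case is just k equal to the true rank: the discarded singular values are exactly zero.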
It is interesting that lossy compression algorithms are better, despite the Eckart–Young–Mirsky theorem. I guess “best low rank approximation under unitarily invariant norms” doesn’t mean much to eyeballs though.
Eckart-Young-Mirsky just relates to approximation by low-rank matrices: it says the truncated SVD gives the smallest error in the operator norm (and, by Mirsky's extension, in any unitarily invariant norm), i.e. the action of the approximation is closest.
When you’re trying to compress an image you’re trying to optimize something quite different, so it is actually not too surprising.
Images have structure that can be exploited, and exploitable structure is everything. A Fibonacci spiral isn't going to have a nice low rank decomposition but it can be specified exactly (and thus reproduced exactly) in a small handful of bytes. EYM is a structure-free guarantee, so we can't expect it to show that SVD would perform better on the natural image manifold than would purpose-built algorithms.
Perceptual loss is different from mathematical RMSE loss. There has been a lot of work on making things that decompress into something mathematically completely different, but which to the Mk1 Human Eyeball looks very similar.
SVD is probably the most important theorem in linear algebra. Basically, you can take any matrix and find out how much it rotates and stretches as a linear transformation.
It is a tough battle between SVD and the concept of eigenvalues/vectors. SVD is only meaningful for linear operators between inner product spaces, whereas eigenvalues/vectors do not even require a norm. On the other hand, eigenvectors are only meaningful for linear operators from a space to itself.
SVD is just an extension of the eigenvector decomposition that allows the two orthogonal matrices to differ. Think of SVD as eigenvectors of your data from both a row-wise and a column-wise perspective, and intuitively it works out pretty well.
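A quick way to see both perspectives numerically, sketched with nalgebra on a made-up 3x2 matrix (the two printed lists may come out in different orders, since the eigendecomposition doesn't sort its eigenvalues):

    use nalgebra::DMatrix;

    fn main() {
        // Any rectangular matrix will do for the illustration.
        let m = DMatrix::<f64>::from_row_slice(3, 2, &[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);

        // Columns of U are eigenvectors of M * M^T (the row-wise view), columns of V are
        // eigenvectors of M^T * M (the column-wise view), and the squared singular values
        // are the shared non-zero eigenvalues.
        let svd = m.clone().svd(true, true);
        let gram = m.transpose() * &m; // 2 x 2
        let eig = gram.symmetric_eigen();

        let sv_sq: Vec<f64> = svd.singular_values.iter().map(|s| s * s).collect();
        let ev: Vec<f64> = eig.eigenvalues.iter().copied().collect();
        println!("singular values squared: {:?}", sv_sq);
        println!("eigenvalues of M^T * M:  {:?}", ev);
    }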
Basically this expresses the image as one where each row is a linear combination of a set of K basis rows, with different coefficients for each row (or equivalently for columns).
In general it doesn't make sense to compress images this way, since the algorithm is not invariant with respect to 2D image rotation, a very relevant operation for realistic images, but is invariant with respect to row/column permutations, which are not a relevant operation for realistic images.
Mmh, the compression ratio doesn't seem super high. I wonder how the values are stored, perhaps as floats? Some thoughts:
Maybe using YUV would be better than RGB?
Maybe values could be represented as 0..1, and the values themselves could be stored as 0.16 fixed point numbers.
Use some generic compression on top.
Is there a smart way to store the basis vectors? Something about angles somehow (analogous to quaternions)? Also, once U is known, isn't the inverse implied? Also, if U is approximated and fixed, perhaps the singular values can be adjusted again to minimize errors.
> Is there a smart way to store the basis vectors?
Consider, instead, compressing small blocks of the image rather than a whole image plane at once. Let's say 8x8.
Over such a small region, pixels are usually very highly correlated. You could model them statistically as an order-1 autoregressive process with a high correlation coefficient (e.g., x[i + 1] = x[i]*rho + noise, with rho >= 0.95 usually, separately in both the horizontal and vertical directions).
Then, computing the eigendecomposition of the 8x8 cross-correlation matrix C_{ij} = rho^|i - j| produces a set of common eigenvectors that you can use for all blocks (recall that the left and right singular vectors U and V of a matrix M are just the eigenvectors of M.M^T and M^T.M, respectively). So now instead of several very large vectors, you only need a single 8x8 matrix of basis vectors (because it's square and orthonormal, the inverse is just the transpose).
But wait. Let's take the limit as rho -> 1. It just so happens that the resulting eigenvectors converge to the Type 2 Discrete Cosine Transform [0]. So, in fact, you don't need to transmit the basis at all.
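That convergence is easy to check numerically. A sketch with nalgebra (rho and the block size are just illustrative choices here):

    use nalgebra::DMatrix;

    fn main() {
        let n = 8;
        let rho: f64 = 0.95;

        // AR(1) cross-correlation matrix C_{ij} = rho^|i - j|.
        let c = DMatrix::from_fn(n, n, |i, j| rho.powi((i as i32 - j as i32).abs()));

        // Its eigendecomposition is the optimal (KLT) basis for this process.
        let eig = c.symmetric_eigen();

        // Sort eigenvector indices by decreasing eigenvalue.
        let mut idx: Vec<usize> = (0..n).collect();
        idx.sort_by(|&a, &b| eig.eigenvalues[b].partial_cmp(&eig.eigenvalues[a]).unwrap());

        // Orthonormal DCT-II basis: row k is scaled cos(pi * (i + 0.5) * k / n).
        let dct = DMatrix::from_fn(n, n, |k, i| {
            let scale = if k == 0 { (1.0 / n as f64).sqrt() } else { (2.0 / n as f64).sqrt() };
            scale * (std::f64::consts::PI * (i as f64 + 0.5) * k as f64 / n as f64).cos()
        });

        // |cosine similarity| between the k-th KLT eigenvector and the k-th DCT basis vector
        // (absolute value because eigenvectors are only defined up to sign).
        for k in 0..n {
            let u = eig.eigenvectors.column(idx[k]);
            let d = dct.row(k).transpose();
            println!("k = {}: |<u_k, dct_k>| = {:.4}", k, u.dot(&d).abs());
        }
    }

As rho gets closer to 1, those dot products approach 1, which is exactly the convergence being described.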
You can store the output of the transform as fixed point numbers with some arbitrary base (the choice controls the amount of compression: anything too small just gets rounded to zero). Maybe use run-length encoding as your generic compression, because you expect to have a lot of zeroes (visit them in a zig-zag order, to get the big values first and maximize the length of the runs of zeroes), with some Huffman coding of the actual symbols. Add a simple predictor for the (0,0) coefficient of each block (it winds up being the average of the pixel values, so it should be pretty similar from block to block).
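The zig-zag visit order itself is simple to generate. A sketch for an n x n block, which for n = 8 should match the familiar JPEG ordering:

    // Zig-zag scan order for an n x n block: walk the anti-diagonals, alternating
    // direction, so the low-frequency coefficients come out first.
    fn zigzag_order(n: usize) -> Vec<(usize, usize)> {
        let mut order = Vec::with_capacity(n * n);
        for s in 0..(2 * n - 1) {
            let lo = s.saturating_sub(n - 1);
            let hi = s.min(n - 1);
            if s % 2 == 1 {
                // Odd anti-diagonals: top-right to bottom-left.
                for i in lo..=hi {
                    order.push((i, s - i));
                }
            } else {
                // Even anti-diagonals: bottom-left to top-right.
                for i in (lo..=hi).rev() {
                    order.push((i, s - i));
                }
            }
        }
        order
    }

Visiting the quantized coefficients in this order is what pushes the long runs of zeroes to the end, where run-length coding handles them cheaply.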
I’m guessing there’s just a lot left on the table—
> Maybe values could be represented as 0..1, and the values themselves could be stored as 0.16 fixed point numbers. Use some generic compression on top.
Codecs like JPEG use a variable amount of quantization. You quantize by scaling the 0..1 float by some scalar, putting it in the range 0..k, and then truncating it to an integer, and encoding the integer. The integer value is encoded using an entropy coder like Huffman. The parameter k must also be encoded somehow, or fixed.
Look up “JPEG coefficient quantization” for how JPEG does it.
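The quantizer itself is tiny. A sketch of the uniform scalar version described above, where k is the quantization parameter (I round rather than truncate, which halves the worst-case error):

    // Uniform scalar quantization of a value in 0..1: scale by k and round to an integer.
    // Larger k keeps more precision; smaller k throws more away (and compresses better).
    fn quantize(x: f64, k: u32) -> u32 {
        (x.clamp(0.0, 1.0) * k as f64).round() as u32
    }

    // Reconstruction divides back; the error per value is at most 1 / (2 * k).
    fn dequantize(q: u32, k: u32) -> f64 {
        q as f64 / k as f64
    }

As noted, k (or a whole table of per-coefficient values, which is what JPEG actually stores) has to travel with the data or be fixed by the format.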
Codecs for audio and images are often made up of understandable parts that fit together: transformations, quantization, entropy coding. If you come up with a new transformation, you can make a whole codec by putting together the remaining pieces—but something like a new transformation is itself interesting, because somebody else can always assemble it into a codec if it shows promise.
> Codecs like JPEG use a variable amount of quantization
The reason quantization works so well in JPEG is because of the DCT step and its energy compaction properties. This gets most of the coefficients near zero. I think without this transform you would be introducing a lot more noise in the final result.
At some point, we are going to end up re-implementing a thing approximating JPEG here. Colorspace conversion, subsampling & DCT + quantization are most of the magic sauce.
The singular value decomposition is doing the same thing as the DCT, sort of. It’s transforming the image into coefficients where the important coefficients are packed in one location. It’s worth applying similar techniques. You can see the demo where it’s truncating the coefficients.
Half the image codecs out there look like JPEG if you squint hard enough. That’s ok, “reimplementing a thing approximating” another codec is how a lot of marginal improvements have been made in the past, and if you add up enough marginal improvements, you end up with a competitive new codec in its own right.
The DCT isn’t magic, it just happens to work really well and requires little computational power. There’s a lot of possibilities for replacing the DCT with something else, especially now that we have more computational power to throw around.
Have you done any PSNR analysis & comparison to prevailing standards like WebP (or even JPEG)? From the look of it, it seems not very efficient, as the compression ratio gets pretty bad for any visually comparable scenario.
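If anyone wants to run that comparison, PSNR is cheap to compute against the original. A sketch for a single 8-bit channel, with 255 as the peak value:

    // Peak signal-to-noise ratio between an original and a reconstructed 8-bit channel.
    // Higher is better; identical inputs give infinity.
    fn psnr(original: &[u8], reconstructed: &[u8]) -> f64 {
        assert_eq!(original.len(), reconstructed.len());
        let mse: f64 = original
            .iter()
            .zip(reconstructed)
            .map(|(&a, &b)| {
                let d = a as f64 - b as f64;
                d * d
            })
            .sum::<f64>()
            / original.len() as f64;
        20.0 * (255.0 / mse.sqrt()).log10()
    }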
Don't get me wrong -- they're great tools. OLS for analysis, especially, has this whole framework for understanding errors that you will not get in models fit using maximum likelihood methods.
But as a final use case for a product there are generally better options out there.
I think you're mainly thinking of machine learning and data science applications, and so your perspective may be a bit limited. But, of course, you didn't actually give any explanation of what you mean other than mentioning ordinary least squares. Would you like to elaborate and back your point up?
In computational science and engineering, there are many applications in which the SVD is a very reasonable and good choice. Some examples: fast direct solvers for integral equations, model order reduction, solving inverse problems, etc.
It's funny how the higher compression ratios make everything look similar to a style Gerhard Richter often used. It could be used as an effect in painting apps.
^1 https://docs.rs/nalgebra/latest/nalgebra/linalg/struct.SVD.h...
^2 https://igl.ethz.ch/projects/ARAP/svd_rot.pdf
^3 https://github.com/neon-bindings/neon