Doubling the speed of jpegtran with SIMD (cloudflare.com)
117 points by jgrahamc on Oct 8, 2015 | 31 comments



I sort of expect a top-notch compiler to perform this automatically (especially if it allows profiling). The MSVC++ compiler has pretty impressive auto-vectorization, as I've verified several times by testing the performance of my DSP code with SIMD enabled and disabled. Just make sure your code facilitates auto-vectorization, and it should be more readable and future-proof while achieving the same result, assuming the compiler is smart enough.
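For example, a plain loop over non-aliasing buffers with independent iterations is the kind of thing that auto-vectorizes at /O2 or -O2; a rough sketch (not from the article):

  #include <stddef.h>

  /* element-wise add: the iterations are independent, and restrict
   * (MSVC spells it __restrict) tells the compiler the buffers don't
   * alias, so the loop is an easy auto-vectorization target */
  void add_buffers(float *restrict dst, const float *restrict a,
                   const float *restrict b, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          dst[i] = a[i] + b[i];
  }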

(Also, since this is CloudFlare, <insert rant and dream about SIMD happening in LuaJIT>... Thanks CloudFlare!)


The problem is that the compiler often can't vectorize automatically because some preconditions are not met. Knowledge of those preconditions is mostly limited to compiler people, and as far as I know there's still no static analysis tool that tells you why the compiler couldn't vectorize a loop.


Clang can tell you why a loop wasn't vectorized with a couple of compiler flags:

-Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize

http://llvm.org/docs/Vectorizers.html#diagnostics
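For example (a made-up loop, just to show what to ask for):

  /* clang -O2 -c find.c \
   *   -Rpass=loop-vectorize \
   *   -Rpass-missed=loop-vectorize \
   *   -Rpass-analysis=loop-vectorize
   * The remarks say which loops were vectorized and, for the ones
   * that were not, why (here, the data-dependent early exit
   * typically blocks vectorization). */
  #include <stddef.h>

  long find_first(const int *a, size_t n, int key)
  {
      for (size_t i = 0; i < n; i++)
          if (a[i] == key)
              return (long)i;
      return -1;
  }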


The Intel compiler has various options to report autovectorisation success. You still have to go through them and follow the advice, of course, but by and large they are not entirely useless.


MSVC will also report whether vectorization succeeded, and why it didn't if it failed (via /Qvec-report:2, if I remember the flag correctly).


Do a profile run and I'll bet it will insert as many runtime checks as necessary to get around those precondition checks.


Do you want to help out with LuaJIT? We are looking for people...


Do you mean that you want to hire people as LuaJIT maintainers? That would be exciting news. If you have any details to share I would be happy to start pointing good people in your direction.


Yes. If someone is qualified and wants to do that we would be interested in hiring them in London or San Francisco.


I would love to, but I'm probably severely under-qualified (apart from having used LuaJIT extensively). I'd probably get lost pretty fast while delving further into the inner workings of the compiler, a domain in which I only have superficial knowledge. I can learn, of course, but it looks pretty steep. (And no, I do not have a CS degree, which seems like one of those useful things to have when working on a compiler.)


You should apply anyway. There are really only two possible outcomes:

* They say "no": you go on with your life, no biggie.

* They say "yes": you move to SF/London and start learning :)


Those two possibilities vary hugely in the time horizon they consider! 1 considers the far-future outcome ("you go on with your life"), but 2 only considers the next few months.

* 2a: you find out you can do it and enjoy it; great career path for the next n years.

* 2b: you find out you can't hack it and either quit or get fired.

* 2bi: you then do what you would have done in 1; not a huge opportunity cost.

* 2bii: your ego suffers a huge blow and you fall into a pit of depression, drunkenness, and self-doubt about ever learning anything again, and you:

* 2biiα: eventually get out and do 1 (larger opportunity cost and potential permanent damage), or:

* 2biiβ: die soon after...

A lot of people really want to avoid the 2bii tree and will shrink away from any action that seems to have a chance of leading from the bearable status quo to there.


>I sort of expect a top-notch compiler to perform this automatically

Based on the n-body portion of the Benchmarks Game, it seems like only ICC does this.


You can also use something like OpenMP (newer versions have pragmas for SIMD) to help the compiler expose the parallelism in your code. (It's hard for a compiler to figure out the intent of the programmer, so be nice and give the poor compiler some hints ;) ). Also, many consumer-grade chips have integrated GPUs, which can also be used for speeding things up.
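For example, with an OpenMP 4.0+ compiler (built with -fopenmp; a sketch):

  #include <stddef.h>

  /* the pragma promises the compiler that iterations are independent,
   * so it can emit SIMD code even when it can't prove that itself */
  void scale(float *dst, const float *src, float k, size_t n)
  {
      #pragma omp simd
      for (size_t i = 0; i < n; i++)
          dst[i] = k * src[i];
  }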


I'd hope that at least the branches would be turned into cmovs.


Depends on how predictable the branches are. LLVM used to use CMOVs whenever possible but I believe this was changed a few years back to generally improve performance.

See a related bug in GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

"conditional moves instead of compare and branch result in almost 2x slower code"


Yeah, easy-to-predict branches are really cheap. Very non-deterministic, hard-to-predict ones, on the other hand, are still surprisingly expensive. In such a case, conditional moves might make sense, or sometimes you can use the result of a comparison directly in arithmetic, something like:

  idx = 2*idx + (key > table[idx])
(to move a key down an implicitly-stored binary tree)
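Spelled out as a full descent loop (a sketch; the layout convention and names are mine):

  #include <stddef.h>

  /* table[1..n] stores a binary search tree implicitly: the root is
   * node 1 and the children of node i are 2*i and 2*i + 1 (index 0
   * is unused). The comparison result feeds straight into the index
   * arithmetic, so there is no branch for the predictor to miss. */
  size_t descend(const int *table, size_t n, int key)
  {
      size_t idx = 1;
      while (idx <= n)
          idx = 2 * idx + (key > table[idx]);
      return idx;   /* virtual leaf; the caller maps it back to a slot */
  }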


I remember Linus making much the same point: http://yarchive.net/comp/linux/cmov.html

It generally lines up with what I've observed. Surprisingly, I've found that even arithmetic comparisons like you mentioned ran a bit faster with branches instead of conditional moves. The one case where I have seen comparisons benefit is with SSE code where the comparison results go straight to a register instead of a flag.


I wonder if they've considered using libjpeg-turbo instead of libjpeg 9a; it comes with a lot of SIMD optimizations and is generally regarded as 2x-5x as fast as IJG's implementation.[1]

[1] http://www.libjpeg-turbo.org/About/Performance


We considered libjpeg-turbo, but the performance of jpegtran there is largely unchanged: if you look at the Huffman encoding function, it is exactly the same. Applying the optimizations to libjpeg-turbo also yields a 2x performance improvement for our usage.


My thought as well.

As far as I understand, these intrinsics map to SSE instructions (they are 128-bit). One could use the later AVX (256-bit, found in server chips since 2011). Probably they decided to start with the lowest common denominator of SSE because they don't have AVX in all of their servers? The term SIMD is generic.
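For anyone not familiar with the intrinsics, the difference is roughly this (names per Intel's intrinsics guide; the AVX2 one needs -mavx2 or /arch:AVX2 to compile):

  #include <immintrin.h>

  /* SSE2: eight 16-bit adds per instruction (128-bit XMM register) */
  __m128i add_8x16(__m128i a, __m128i b)  { return _mm_add_epi16(a, b); }

  /* AVX2: sixteen 16-bit adds per instruction (256-bit YMM register),
   * but only on CPUs that support AVX2 (roughly Haswell and later) */
  __m256i add_16x16(__m256i a, __m256i b) { return _mm256_add_epi16(a, b); }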


That is explicitly mentioned in the article:

Note: we could also choose to use 256-bit wide YMM registers here, but that would only work on the newest CPUs, while gaining little performance.

I'm not saying they're right and you're wrong, but you make it sound as if they never even considered AVX.


> Note: we could also choose to use 256-bit wide YMM registers here, but that would only work on the newest CPUs, while gaining little performance.

Should that be an #ifdef ?


Maybe if the machine running the code is the machine that compiles it, or if you have a good way to send processor-specific binaries to the right places.

More useful, I think, would be a runtime check far enough up the stack (i.e., away from tight loops) that it doesn't affect performance. Probably doesn't even need to be that high, it'll branch-predict correctly every time but the first iteration, and the binary bloat probably won't hurt your icache because the number of instructions you ever actually execute stays more or less the same.
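Something like this, done once well away from the hot loop (a sketch using GCC/Clang's __builtin_cpu_supports; the worker functions are made up):

  /* hypothetical scalar-SSE2 and AVX2 versions of the same routine */
  static void encode_block_sse2(short *coefs) { (void)coefs; /* ... */ }
  static void encode_block_avx2(short *coefs) { (void)coefs; /* ... */ }

  typedef void (*encode_fn)(short *);

  /* resolve the function pointer once; the tight loops just call it */
  encode_fn pick_encoder(void)
  {
  #if defined(__GNUC__) || defined(__clang__)
      if (__builtin_cpu_supports("avx2"))
          return encode_block_avx2;
  #endif
      return encode_block_sse2;
  }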


There is an #ifdef in the final code on GitHub.


AVX2 is required to get the wider integer instructions, so this would require Haswell or newer processors.
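For the compile-time side, the usual gate looks something like this (a sketch; __AVX2__ is only predefined when the file is built with -mavx2 or /arch:AVX2):

  #include <immintrin.h>
  #include <stdint.h>

  /* add 16 pairs of int16 values; falls back to two SSE2 operations
   * when the translation unit isn't built with AVX2 enabled */
  void add16(int16_t *dst, const int16_t *a, const int16_t *b)
  {
  #ifdef __AVX2__
      _mm256_storeu_si256((__m256i *)dst,
          _mm256_add_epi16(_mm256_loadu_si256((const __m256i *)a),
                           _mm256_loadu_si256((const __m256i *)b)));
  #else
      _mm_storeu_si128((__m128i *)dst,
          _mm_add_epi16(_mm_loadu_si128((const __m128i *)a),
                        _mm_loadu_si128((const __m128i *)b)));
      _mm_storeu_si128((__m128i *)(dst + 8),
          _mm_add_epi16(_mm_loadu_si128((const __m128i *)(a + 8)),
                        _mm_loadu_si128((const __m128i *)(b + 8))));
  #endif
  }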


Very nice! An independent but very useful way to speed things up, as long as you're working on images in batches and not one at a time:

  parallel jpegtran ::: *.jpg
Combine it with these SIMD improvements, and your batch will be done before you hit enter.



While this is a great project from both an academic and performance perspective, the resulting images are substantially larger than ones produced using Mozjpeg. From the article:

We at CloudFlare make sure that our servers run at top notch performance, so our customers' websites do as well!

Correct me if I'm wrong, but it does seem to me that this is of greater value to CloudFlare in terms of electricity savings than to CloudFlare's customers. Less computing time for CloudFlare at the expense of larger output files for CloudFlare's customers to serve, files which will potentially be served again and again and count against the customer's transfer quota. It might not seem like much, but it adds up when you consider how many times a single image in the wild can be downloaded. Consider the following example:

  # Original size is 21796912 bytes
  # jpegtran with SIMD
  jpegtran -copy none -optimize -progressive bigjpeg.jpg > bigjpeg_jpegtran.jpg
  # bigjpeg_jpegtran.jpg resulting size is 19771224 bytes

  # mozjpeg 3.1
  mozjpeg -copy none -optimize -progressive bigjpeg.jpg > bigjpeg_mozjpeg.jpg
  # bigjpeg_mozjpeg.jpg size is 19328832 bytes
That's a difference of 442 KB! I'm not focusing on the computing time here: I don't dispute that this is orders of magnitude faster than mozjpeg, but isn't it worth doing the extra work to get a smaller file? That file could be served billions of times. My argument is simply that the extra computing effort up front leads to much greater savings in the long run.

In my (quick) testing, jpegtran's squishing performance is consistently worse than mozjpeg's. So what, actually, is the right thing to do in the end?


If they get larger file sizes, then that probably means they are doing something wrong. I optimized an MPEG encoder with SSE once, and a larger file size was usually due to some loss of floating-point precision. Going through the functions and comparing the output of each serial function against its vector counterpart should surface the bug. Fixing this stuff can also give some more speedup!
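Roughly the kind of harness I mean (a sketch; the two implementations under test are hypothetical and not shown):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* the scalar reference and the SIMD rewrite of the same transform */
  void fdct_scalar(short block[64]);
  void fdct_simd(short block[64]);

  int main(void)
  {
      short a[64], b[64];
      for (int iter = 0; iter < 100000; iter++) {
          for (int i = 0; i < 64; i++)
              a[i] = b[i] = (short)(rand() % 2048 - 1024);
          fdct_scalar(a);
          fdct_simd(b);
          if (memcmp(a, b, sizeof a) != 0) {
              printf("mismatch on iteration %d\n", iter);
              return 1;
          }
      }
      puts("outputs match");
      return 0;
  }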


They should've included graphs in the article, e.g. comparing the speedups at each compression level with histograms. Even a table of their results would have looked good.



