Pillow-SIMD – Fast, production-ready image resize for x86

pedrocr · on July 6, 2017

Is doing lanczos really faster and/or better quality for scaling down than just doing a simple pixel mixing:

http://entropymine.com/imageworsener/pixelmixing/

I implemented this for my image pipeline:

https://github.com/pedrocr/rawloader/blob/230432a403a9febb5e...

Makes for simple enough code and even before any serious effort at optimization or SIMD it can convert a 3680x2456x4 image in 32bit float (source article is 3x8bit) to 320x200x4 also in 32bit float in about 60ms (across 4 threads in 2 cores on a i5-6200U).

jacobolus · on July 7, 2017

Probably not faster. Definitely better quality. As Alvy Ray Smith said, a pixel is not a little square, http://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf

Edit: If you want to do even better than the tensor product Lanczos filtering, you can do a filter based on Euclidean distance. Make sure that you work in linear (non-gamma-adjusted) color space.

A couple more resources to read: http://www.imagemagick.org/Usage/filter/#cylindrical http://www.imagemagick.org/Usage/filter/nicolas/

Edit 2: really weird that this is getting downvoted. There’s not really anything to dispute here. It is straight-forward to show that the Lanczos method has objectively better output. Moreover, Alvy Ray Smith’s paper is a classic which anyone interested in image processing should read.

wtallis · on July 7, 2017

That memo needs to die. There are better ways to make the few of its points that are still valid, without making assumptions and assertions that are frequently invalid with today's technology.

Pixels are often not representative of little squares (rectangles). However, they are more likely to be representative of little rectangles now than they were in 1995, now that most photographs are captured on Bayer arrays of square sensors, most computer-generated images are rendered as an average of point samples weighted according to estimated coverage of a square pixel, and everything is displayed on LCD or OLED panels with square pixels comprised of rectangular subpixels. If you take a random JPEG or PNG off the Internet, interpreting the pixel data as point samples will usually be less accurate than interpreting them as integrated over a rectangle. Interpreting the data as integrated over a gaussian distribution is also less realistic than the rectangle interpretation. Doing image processing in a point sample or gaussian context is certainly useful, but it's definitely not more fundamentally right than the little square model. Historically, that context was at best a neutral choice that was equally unrealistic no matter what kind of hardware you were working with, but mathematically convenient.

The paper's arguments about coordinate systems (whether the y-coordinate should grow toward the top of the screen or the bottom, and whether pixels should be centered on half-integer points) are also a waste of time for the modern reader.

mark-r · on July 7, 2017

Even if your physical sensor is made of little squares, they cease to be squares once you've converted them to single readings - you've integrated the square into a single point. This is the basis of sampling theory. Continuing to think of them as little squares leads you to bad intuitions such as the resampling algorithm that we're responding to. Maybe there's a better way to make that point than the cited paper, but I haven't seen it yet.

Whether pixels are on the half-integer points is a completely arbitrary decision, unless you're trying to mix raster and vector graphics. Then the correct solution will become obvious.

pedrocr · on July 7, 2017

From the point of view of sensors the square model seems correct. Sensor pixels are photon counters. If you want to turn a 2x2 pixel square into a single pixel summing the four pixels is the same as if you had a sensor with 1/4 of the pixels and each of them counts the photons over those 2x2 areas[1]. What we may be saying is that since you have the extra pixels you can actually do better than if the sensor was already natively at your desired resolution by using a more complex filter over the extra data (e.g., by avoiding the Moiré artifacts that are common in lower resolution sensors).

[1] Most sensors are bayer patterns so some extra considerations apply to color resolution

panic · on July 7, 2017

Yeah -- pixels are little squares, since a pixel appears as a little square on your display. If displays did the "correct" thing and interpolated the pixels using a band-limited signal, it wouldn't look good at all. The low sample rate compared to the resolution of human vision means you'd see ringing artifacts on every sharp edge.

If you want to do the correct thing from a signal processing perspective, you should upsample your images with a square-pixel filter until the Nyquist frequency is below the limit of human vision first. Then you can do your operations on the pixels as point samples before downsampling again with a square-pixel filter.

jacobolus · on July 7, 2017

On your display a pixel appears as a narrow sorta-rectangle, with a bit of black space above and below, and a whole bunch of empty space to the left and right (populated with 2 other pixels of other primary colors). But properly doing antialiasing for the precise geometry of sub pixels is going to be a big pain in the butt, and will be device specific and break down hard when you get to a device that uses a different pixel geometry, or when someone applies intermediate processing to your image before display. Not to mention, if you really want to get it exactly right, you’ll need to take into account the viewer’s visual acuity, the viewing distance, and the precise color characteristics of the display and surrounding environment.

jacobolus · on July 7, 2017

The better alternatives are all like 50-page papers with worse titles and a lot of math. If you know a nice 5–10-page “modern” summary of how to treat filters for image resampling aimed at a non-specialist audience, please link away. I’ll agree that treating pixels as Gaussians is not especially great.

In practice most images go through multiple layers of processing, physical sensor and display pixels come in all kinds of wacky shapes (and as you point out have channels offset by different amounts, etc.). It’s usually better to treat pixels as point samples for generic intermediate processing because you have no idea what kind of source an image comes from, and you have no idea what someone down the line is going to do with your image before sending a final signal to the display hardware. Creating synthetic images by integrating over little rectangles produces markedly inferior results to applying some real resampling filter to the samples, but gets done by computer games etc. because it’s computationally cheap, with the hope that there are enough pixels moving fast enough that someone won’t notice the artifacts.

jacobolus · on July 7, 2017

> The better alternatives ...

e.g. http://hhoppe.com/filtering.pdf

a_e_k · on July 7, 2017

The "Sampling and Reconstruction" chapter from PBRT is also pretty good. The authors have placed the 1st-edition version online as a sample chapter: http://www.pbrt.org/chapters/pbrt_chapter7.pdf

a_e_k · on July 7, 2017

> most computer-generated images are rendered as an average of point samples weighted according to estimated coverage of a square pixel

This may be true in terms of the sheer amount of GPU rasterized imagery.

I wrote the pixel filtering code currently used in one of the major production-quality film renderers. I can tell you that it uses the classical approach of treating pixels as point samples. Notionally, it convolves irregularly placed camera samples with a reconstruction filter to create a continuous function which is then point sampled uniformly along the half-integer grid to produce the rendered image.

Regarding the squarish pixels on the shown screen, I like to view them as just an analog convolution of those discrete samples so that the photoreceptors in our eyes can resample them.

jacobolus · on July 7, 2017

Nice paper (is this patented?) https://graphics.pixar.com/library/MultiJitteredSampling/pap...

You don’t have an email address in your HN profile. Is there one folks trying to contact you should prefer (before I hunt around the internet looking for one)?

a_e_k · on July 7, 2017

Thanks, just updated my profile.

And no, intentionally not patented.

dahart · on July 7, 2017

I totally understand the sentiment, Alvy Ray's paper feels anachronistic and perhaps too pushy, but I think it's still more right and more applicable that you're allowing.

The real point of the paper is that a pixel is a sample in a band-limited signal, not something that covers area. That's still just as true today, no matter what camera or display, no matter what pixel shape you're using. The point behind the paper still stands, even if the shape turns out to be a square, so we shouldn't get too hung up on the title and language railing against a square specifically.

While true that display pixels are more square today than when it was written, that's only one minor piece of the puzzle. Because we're talking about image resizing, there are multiple separate filters to consider, and for resizing it would be bad to treat pixels as squares even if you can.

If a camera's pixels are little squares, and we want to sample and then resize that image, our choice of resize filter needs to account for the little squares. We can't use a lanczos filter at all, we'd have to use something else entirely.

The big problem you have is that a sampled signal is band limited, and we treat them as perfectly band-limited. We have a body of knowledge about how to use and reason around perfectly band-limited signals, we don't have a strong image resizing theory for sampled data that is made of high-frequency samples.

If you don't convert to an ideal band-limited signal during initial sampling, then you'd have to keep the kernel shape with the image as some kind of meta data, and you'd have to use that during image resizes. If we don't have a perfectly band-limited signal, then our resize filter will always be larger than the ideal resize filter, and resizes with square pixels will take longer than resizes with band-limited point samples.

> The paper's arguments about coordinate systems are also a waste of time for the modern reader.

I'm curious why? These are still issues if you write a ray tracer, or if you mix DOM and WebGL in the same app. The paper was written for what was the SIGGRAPH going audience -- professors and Ph.D. Students -- at the time, which were all graphics researchers just learning about signal processing theory for the first time. Graphics textbooks today still cover Y-up vs Y-down for images and 0.5 offsets for pixels.

pcwalton · on July 7, 2017

> most computer-generated images are rendered as an average of point samples weighted according to estimated coverage of a square pixel

Or, in the case of fonts, they're rendered by computing the exact area coverage of the polygon over…a square representing each pixel.

a_e_k · on July 7, 2017

I'd say usually better quality rather than definitely better quality. While the Lanczos filter is always superior to the box filter in theory, there are some case in practice where the box filter may be better.

If you're downsampling a line-art image, for example, you may actually be better off with some variation of the box filter. The negative lobes on the Lanczos filter can induce objectionable ringing because line-art tends to be full of what are basically step functions. It's a similar issue as with strong mosquito artifacts on JPEG compressed line-art.

A filter function that is everywhere non-negative, such as the box filter or a Gaussian can't suffer from this problem. Of the two, the box filter will give you sharper results but of course it doesn't antialias as well, either.

mark-r · on July 7, 2017

Despite what the writeup says, you don't need to view pixels as little squares to make sense of this algorithm. It's a filter just like any other. You can evaluate its quality by examining its frequency response.

smallnamespace · on July 7, 2017

The 2D frequency response is poor because treating pixels as 'little squares' essentially substitutes Manhattan distance for Euclidean distance between pixel positions.

You have reasonable behavior along the orthogonal dimensions, but you're introducing a sqrt(2) stretch factor along diagonals.

mark-r · on July 7, 2017

I think you're missing my point. Although the "little squares" are used as a justification for the algorithm, the application of it is not dependent on that view. You're doing a convolution of points against a filter formula, just as you would with bicubic or Lanczos.

The stretch factor along diagonals will be the same for any separable filter will it not?

jacobolus · on July 7, 2017

Yes, any separable filter causes grid artifacts that can be partially ameliorated by using a radial filter.

My own pipe-dream is that we would use a triangular grid (if you like, the Voronoi cells here are hexagonal pixels) for intermediate image representations. This is more spatially efficient and has nicer 2-dimensionnal frequency response than a square grid, and displays are so heterogeneous nowadays that we need to do some amount of resampling for output pretty much all the time anyway, and our GPUs are getting fast enough that resampling at high quality from a hexagon grid to a square grid for output should add only relatively cheap overhead.

jacobolus · on July 7, 2017

Okay, sure. Go ahead and make a white paper about your analysis of the frequency response of this method for some particular scale of resizing (it ends up somewhere between a box filter and a bilinear filter), and you’ll see that it ends up doing a significantly worse job than the Lanczos filter. You’ll get less detail resolved and more aliasing artifacts.

mark-r · on July 7, 2017

I agree completely by the way.

dahart · on July 7, 2017

Yes, lanczos really is better quality, but it seems like the pixel mixing article admits and explains that openly. You don't always need better quality, sometimes a box filter is fine, but for some applications it can be super important to use a high quality filter.

Faster depends on implementation details, but from the looks of it, to implement pixel mixing, you either have to generate your convolution kernel dynamically, or chop some source pixels into potentially 4 separate pieces? I would guess that it's easier to optimize a static kernel than to make a dynamic kernel faster than a static one. But I'm not entirely sure how fast pixel mixing could be made.

> It can be confused with a box filter or with linear interpolation, but it is not the same as either of them... [pixel mixing] Treats pixels as if they were little squares, which gets on some experts’ nerves.

This is a funny way of putting it. Pixel mixing is definitely using a box filter, just slightly differently than what people normally call box filter resizing. The author even says that later: "Another way to think of pixel mixing is as the integral of a nearest-neighbor function." Pixel mixing as described here is clipping the source image under the box filter rather than using a static kernel. The reason that a box filter isn't ideal is well understood. It's because the filter itself has high frequencies in it. This is the reason that the quality of pixel mixing resizes is low.

One of the benefits of a box filter is that applied multiple times, it becomes a better filter, and approximates a Gaussian after several times. I'm not sure but I would bet that clipping to the exact filter boundary actually prevents you from being able to do that with pixel mixing.

mark-r · on July 7, 2017

Generating the factors dynamically isn't that hard. I implemented this algorithm back in the late 80's, and it had acceptable performance on a 25 MHz processor. Not only that but it had to do some distortion correction at the same time, so it was applying a different resize factor to every single line of the image. All with integer math.

dahart · on July 7, 2017

Oh absolutely, I believe it, and I looked at @pedrocr's code. Very simple. But, a static box filter is even simpler.

Did you call it pixel mixing in the 80s? I've been doing image resizing for decades and never heard the term before tonight. I have implemented a couple of very similar algorithms to pixel mixing for CG films, once in a shader and once for antialiasing in a particle renderer.

pedrocr · on July 7, 2017

>Did you call it pixel mixing in the 80s? I've been doing image resizing for decades and never heard the term before tonight.

I hadn't either. I just searched for a name for the algorithm I also came up with independently. The imagemagick docs had a particularly comprehensive discussion on resizing options and linked this page with this naming for the algorithm.

mark-r · on July 7, 2017

I didn't have a name for it back then. I developed it independently so there was no existing reference for the name.

I worked on a major application years later that was calling it bilinear, until someone pointed out that bilinear had a completely different definition. I think we renamed it Weighted Average.

homm · on July 7, 2017

Well, 3680x2456 / 0.06 seconds ≈ 150 Mp/s or 75 Mp/s/per core. Pillow-SIMD's current implementation runs on ≈ 700 Mp/s/pre core for bicubic (which is closer to this implementation).

Obviously, your implementation could be further optimized, but quality drawbacks will remain the same.

pedrocr · on July 7, 2017

I've since made a small change that doubled performance and I'm operating on >5x more data per pixel (32x4 vs 8x3). So a 10x factor makes this already faster without any explicit SIMD yet. Proper benchmarking on the same machine would need to be done obviously.

On the quality drawbacks I'd have to do some more checking. This algorithm is closest to what the image would be if you had a camera with that native sensor size. The standard filtering approach may very well have plenty of cases where it produces better output but it can also cause issues so I wonder if this isn't a conservative solution.

mark-r · on July 7, 2017

Nice writeup. I think it spells out the difference pretty well - less aliasing and artifacts for Lanczos. But you gotta admit it's fast. For large shrink factors you'd be just as well off averaging whole pixels, even if the rectangles being averaged aren't all the same size.

CyberDildonics · on July 7, 2017

Filters are not there for speed, they are there for quality. If you image is not a square power of two, pixel mixing will have artifacts.

pedrocr · on July 7, 2017

>If you image is not a square power of two, pixel mixing will have artifacts.

I've implemented it generally by weighing the pixels by the actual area overlapping so border pixels get weights <1.0 for non power of two reductions. Haven't seen any artifacts from it.

CyberDildonics · on July 10, 2017

That can be generalized to a filter kernel.

est · on July 7, 2017

can we add this to Pillow?

Asooka · on July 6, 2017

Let me insert my personal pet-peeve: have you thought of making it colour-space-aware? Most (all?) images you'll encounter are stored in sRGB colourspace, which isn't linear, so you can't do the convolution by just multiplying and adding (the result will be slightly off). The easiest way would be to convert it to 16-bit-per-channel linear colour space using a lookup table, do the convolution in linear 16-bit space, then convert back to 8-bit sRGB.

mark-r · on July 6, 2017

I particularly like this example image for testing: https://transloadit.com/blog/2010/10/is-your-image-scaling-s...

You might be able to do this as a 3-part process instead of expecting the resizing to handle it natively. But that brings up a good question, does the new SIMD goodness work on anything other than 8-bit data? You couldn't do linear in anything less than 16 bits.

homm · on July 6, 2017

Color space conversion is a hard topic in terms of performance. First of all, not all images are stored in sRGB. Most of them have another color profiles (such as P3 or ProPhoto). So, sRGB conversion is not enough, you need the full color management.

Second, you'll see a real profit of color management only on a few images. Most time you'll see the difference only when you see both images at the same time on the same screen.

For now, I came up to the resizing in original non-linear color space and saving the original color profile with the resulting image.

jacobolus · on July 7, 2017

By far most images in the wild are sRGB, and those that aren’t should typically be tagged with their color space.

Resizing in gamma-adjusted space (sometimes) causes nasty artifacts when resizing. If you can afford the CPU use, always convert to an approximately linear space first, then downsize, then convert back. If you get the gamma curve slightly wrong (e.g. gamma = 2.0 vs. 2.2) it’s not too big a deal, the resulting artifacts won’t really be noticeable, so feel free to use a square root routine or something if it has better performance.

jacobolus · on July 7, 2017

But note if you go to a linear space, you need to bump up the precision of your numbers. 8 bits per channel doesn’t cut it anymore, you’ll want 12 or 16 bits per channel (or even better, a floating point number). This might have a big effect on performance.

mark-r · on July 6, 2017

You don't really need the full color management for resizing, only fixing the gamma curve. I believe Apple's implementation of P3 uses the common 2.2 gamma, which is also approximately what sRGB uses. Unfortunately ProPhoto uses 1.8.

Matumio · on July 7, 2017

You can even see this problem in the article. The convolution-based sample image is clearly darker than the nearest-neighbour one.

People go all crazy about interpolation and then get the brightness wrong. It's even more obvious for high resolution photos of a tree or grass in bright sunlight. Once you start looking you notice the change in brightness everywhere, when you click on a thumbnail or while a JPEG is loading.

pishpash · on July 6, 2017

For that matter, the conversion should be to CIELAB, which is perceptually uniform.

lcrs · on July 6, 2017

Perceptual uniformity is in some ways opposite to the linearization suggested above - the L* component of CIELAB is much more like the gamma-encoded values of sRGB than a linear light measure.

It seems tough to come up with hard and fast rules for whether to mimic the linear physical processes, or work in a perceptual space more like the human visual system. I'd love to hear about more rigorous work in this area - most things I read have boiled down to "this way works better on these images".

It's interesting for example that using Sinc-type filters to resize truly linear data, like that from HDR cameras, usually gives rise to horrible dark haloing artifacts around small specular highlights, despite that being the most "physically correct" way to do it. Doing the same operation in a more perceptual space immediately sorts out the problem.

pornel · on July 6, 2017

No, it shouldn't be done in perceptually uniform color space. LAB, like sRGB, is non-linear.

Resampling simulates physical properties of the light, not subjective perception of the eye.

jacobolus · on July 7, 2017

Resampling in CIELAB space tends to work better than resampling in gamma-adjusted R′G′B′ space, because at least you never end up averaging two pixels and getting a lightness which is outside the range of input lightnesses, which is what causes the worst artifacts in R′G′B′. A linear space will give a better result, but CIELAB results are usually acceptable.

dahart · on July 7, 2017

I'd be very interested in an optional Pillow-SIMD downsampling resize that produces 16 bit output internally and then uses a dither to convert from 16 bit to 8 bit. Photoshop does this by default and it produces superior downsampling. Without keeping the color resolution higher, you can end up with visible color banding in resized 8 bit images that wasn't visible in the source image.

I am curious if the reason that Pillow-SIMD is more than 4x faster than IPP is due to features IPP supports - like higher internal resolution - that Pillow-SIMD doesn't? The reported speeds here are amazing, and I'm definitely going to check this project out and probably use it, but I'd love a little clarity on what the tradeoffs are against IPP or others. I assume there are some.

homm · on July 7, 2017

Each resampling algorithm will internally produce some high-precision result before cutting it to 8 bits. For Pillow-SIMD it is 32-bit integers. Currently, I haven't considered dithering, but it is a very interesting idea. Do you have any links for further reading about downsampling banding and dithering?

About IPP's features: the comparison is pretty fair: the same input, the same algorithm and filters, pretty much the same output. If IPP uses more resources internally with the same output, so, maybe it shouldn't.

Shame on me, I still haven't added the link to IPP's test file I used. Here is it: https://gist.github.com/homm/9b35398e7e105a3c886ab1d60bf598d... It is modified ipp_resize_mt program from IPP's examples. If you have installed IPP, you'll easily find and build it.

dahart · on July 7, 2017

> Do you have any links for further reading about downsampling banding and dithering?

Sadly, no, I wish I did. I just made some expensive mistakes printing giclee images from downsampled digital files, and whipped up my own dither for converting 16bit to 8bit. It wasn't until it bit me that I noticed Photoshop does it better than most apps because dithering is on by default. That's when I went looking and found an option for it in Photoshop's settings.

The main banding problem when downsampling is with slow changing gradients. Sky and interior walls, for example. I bump into it a lot with digital art too, since the source images don't have any noise. But even when there's noise in the source image, downsampling 2x or more with a good filter can eliminate the noise and cause gradients to stabilize and show their edges in 8bit color. In my experience, the problem is more common with print than on-screen resized images, but it's still pretty easy to spot on a screen, especially in the darks, and especially when jpeg compressing the results.

Implementation wise, the 16-to-8 bit dither is nowhere near as sensitive as the dithers we normally see converting 8bits to black & white or when posterizing. Almost anything you come up with will do. You don't need any fancy error diffusion or anything like that. Here's what I do: imagine the filtered 16 bit result as an 8.8 fixed point number in the [0-256) range, so the least significant bits are in the [0-1) range. I add a random number between -0.5 and +0.5 before rounding to the nearest integer. Viola, drop the low 8bits and the result is a dithered 8 bit value.

What I just described will be way slow in your world if you call a random number function every pixel, so don't do that. :P For Pillow-SIMD you'd want a random number lookup table or something slightly smarter than a random() function. And I dither on the color channels separately, but there might be some way to make it blaze by dithering the brightness and rounding all three channels up or down at the same time. I've just never tried to optimize it the way you're doing, but if you find a way and release anything that dithers, I would LOVE to hear about it.

jacobolus · on July 7, 2017

I suspect on a GPU it will be better to use 32-bit floating point internally. But yeah, dithering the output when converting back to integers would be great.

In Photoshop I always convert to 16-bit linear color before doing any kind of compositing or resampling.

FrozenVoid · on July 7, 2017

Obligatory: http://johncostella.webs.com/magic/

gioele · on July 7, 2017

Unrelated: since when has this picture of Bologna become a new lenna.jpg?

I think I have already seen it in a couple of recent posts about image compression. (Fits perfectly the definition of Baader-Meinhof phenomenon [1].)

[1] https://en.wikipedia.org/wiki/List_of_cognitive_biases#Frequ...

josteink · on July 7, 2017

> Unrelated: since when has this picture of Bologna become a new lenna.jpg?

Lenna does IMO not contain enough sharp edges and contrasts to highlight the differences between the different resize techniques.

With Bologna, you can clearly see the problems with a nearest-neighbour approach. I'm not sure that would have been equally visible with lenna.

homm · on July 7, 2017

This picture from great site with collections of high-resolution photos under free license https://unsplash.com/photos/vXpcpTl2Tt4

Bu the way, thank you fo pointing out that this is Bologna! I'm going to Italy at the end of the month, can visit it :-)

McKayDavis · on July 7, 2017

Even in 1996 there was discussion [1] about the controversy around lenna.jpg and its decline due to its origins and possible copyright issues.

[1] http://www.cs.cmu.edu/~chuck/lennapg/editor.html

stephencanon · on July 6, 2017

FWIW the Accelerate framework[1] gives roughly comparable performance[2] for Lanczos resizing. Apple platforms only, but all Apple platforms, not limited to x86.

[1] vImageScale_ARGB8888( ).

[2] I don't have identical hardware available to time on, and it's doing an alpha channel as well, so this is slightly hand-wavy.

arcticbull · on July 7, 2017

Accelerate is a really amazing and highly underrated framework. Sure the function names are, uh, sub optimal (I'm looking at you, vDSP). That said, having a framework guaranteed across all devices to implement algorithms and primitives in the fastest way for each new device as they come out is amazingly valuable.

I've built production systems over the last few years with it that really wouldnt have been possible without it.

stephencanon · on July 7, 2017

Just to provide a little context for the vDSP function names, vecLib predates OS X--it was part of OS 9--and the vDSP interfaces descend from the SAL library at Mercury; like LAPACK and BLAS, they needed to be short for Fortran 77 compatibility.

izacus · on July 7, 2017

What kind of (server-side) production system can afford to only run on Apple consumer hardware? Or was it a mobile app?

nuschk · on July 7, 2017

IMGIX (a image scaling service) used to use Mac Pros [1]. Dunno if they still do though.

[1] http://photos.imgix.com/racking-mac-pros

nostrademons · on July 6, 2017

Curious how this would compare vs. running it on the GPU? This is literally what GPUs are made for, and they often have levels of parallelism 500+ times greater than SIMD.

dimatura · on July 7, 2017

I tried to do this once with Theano, and found that the latency of the roundtrip to GPU and back made it not worthwhile for a single image. Maybe a batch of images at once would make it worthwhile. And this isn't what theano is intended for, admittedly - custom CUDA might do a better job.

fulafel · on July 7, 2017

I got curious about the numbers so I did a napkin calculation:

In a 2012 vintage Nvidia article[1] they get 5-6 GB/s in both directions (array size 4MB) which would be around 1500 Mpix/s with 8bit RGBA pixels.

15 Mpix image: Transfers both ways would take 20 ms, and given GPU kernel going at ~5x the CPU speed (CPU 30, GPU 150 Mpix/s), you would spend 100 ms doing the computation. So 120 ms on GPU vs 500 ms on the CPU.

[1] https://devblogs.nvidia.com/parallelforall/how-optimize-data...

edit: so I have no idea about the real GPU spedup, but this shows that the transfers shouldn't hurt too much unless the speedup vs CPU is very small.

nostrademons · on July 7, 2017

Interesting, thanks. So it seems like it'd still be a pretty heavy win for the GPU.

Also, a common use-case on the web today is to have one input image and then a large number of output images (usually smaller) for different screen resolutions & thumbnails. Seems like you could save a lot of time by uploading the input image once and then running a bunch of resize convolutions for different output sizes while it's still in the GPU memory, then download the output files as a batch.

mark-r · on July 6, 2017

I'm really happy to see this. The one time I tried looking at the PIL sources for resizing, I was appalled at what I saw. Simply seeing that you're expanding the filter size as the input to output ratio shrinks is a huge deal.

When I wrote my own resizing code, I found it helpful to debug using a nearest-neighbor kernel: 1 from -0.5 to 0.5 and 0 everywhere else. It shook out some off-by-one errors.

cvwright · on July 6, 2017

> No tricks like decoding a smaller image from a JPEG

Given that most cameras are producing JPEG now, I'm curious why you don't make use of the compressed / frequency-domain representation. To a novice in this area (read: me), It seems like a quick shortcut to an 8x or 4x or 2x downsample.

Or is the required iDCT operation just that much more expensive than the convolution approach?

jpap · on July 7, 2017

They would likely get another big speedup by doing this. iDCT gets faster as you perform a "DCT downscaling" operation because you require fewer add/mul [1].

You could probably go for another speedup, independently of DCT downscaling, by operating in YCbCr before a colorspace conversion to RGB. For example, for 4:2:0 encoded content (a majority of JPEG photographs), you end up processing 50% less pixels in the chroma planes.

When you combine both techniques, you can have your cake and eat it too: for example, to downsample 4:2:0 content by 50% you can do a DCT downscale on only the Y plane, keeping the CbCr planes as they are before colorspace conversion to RGB. No lanczos required!

If you need a downsample other than {1/n; n = 2,4,8}, you can round up to the nearest integer n then perform a lanczos to the final resolution: the resampling filter will be operating on a lot less data.

On quality I once saw a comparison roughly equating DCT downscaling to bilinear (if I can find the reference I'll update this comment). With the example above, it really depends on how you compare: if you compare to a 4:2:0 image decoded to RGB where the chroma is first pixel-doubled or bicubic-upsampled before conversion to RGB then downsampled, it might be that the above lanczos-free technique will look just as good because it didn't modify the chroma at all. Ultimately it's best to try-and-compare.

Lastly you could leverage both SIMD and multicore by processing each of the Y, Cb, and/or Cr planes in parallel.

[1] http://jpegclub.org/djpeg/

jacobolus · on July 7, 2017

That’s a shortcut if you only ever have to downsample by powers of two and you don’t mind worse image quality, since your down-sampled picture won’t use any data from across block boundaries.

mark-r · on July 7, 2017

You can use it in a multi-step process. Use JPEG blocks to get slightly above the target size, then Lanczos to finish it off.

_ul1u · on July 7, 2017

Looks really nice!

I'd love to see vips in the benchmark comparison, perhaps a Halide-based resizer too as those are the fastest I've found so far. Perhaps GraphicsMagick too, as I believe it's meant to be faster than ImageMagick in many cases.

dahart · on July 7, 2017

GraphicsMagick (GM) is definitely loads faster than ImageMagick (IM) for some resize operations, but I doubt it's in the same league as Pillow-SIMD, just guessing though. I've had to do some large image (gigapixel) resizing, and IM keeps the entire image in memory, which causes it to hit swap. GM streams the image resize instead, so it doesn't have to swap, and it can finish in a few minutes instead of many hours.

ashishuthama · on July 7, 2017

Another data point: MATLAB, glnxa64 AVX2, 12 core

>> maxNumCompThreads(1);

>> im = randi(255, [2560, 1600, 3],'uint8');

>> timeit(@()imresize(im,[320,200],'bilinear','Antialiasing',false))

ans =

    0.0083

>> timeit(@()imresize(im,[320,200],'bilinear'))

ans =

    0.0301

>> maxNumCompThreads(6);

>> timeit(@()imresize(im,[320,200],'bilinear','Antialiasing',false))

ans =

    0.0062

>> timeit(@()imresize(im,[320,200],'bilinear'))

ans =

    0.0113

Oh, missed that lanczos2 part:

>> maxNumCompThreads(1);

>> timeit(@()imresize(im,[320,200],'lanczos2','Antialiasing',false))

ans =

    0.0146

>> maxNumCompThreads(6);

>> timeit(@()imresize(im,[320,200],'lanczos2','Antialiasing',false))

ans =

    0.0049

Since MATLAB tries to do most of the computation in double precision, its harder to extract much from SIMD.

ttoinou · on July 6, 2017

Have you tried to use a fast blur (like StackBlur for example : http://www.quasimondo.com/BoxBlurForCanvas/FastBlur2Demo.htm... , the radius should be computed according to the ratio between original size and target size) as a first step before taking the classic nearest neighbor ? And also try to make an algorithm that resize to multiple resolution at the same time could improve speed

    I take an image of 2560x1600 pixels in size and resize it to the following resolutions: 320x200, 2048x1280, and 5478x3424

So you are also upscaling ?

vadiml · on July 6, 2017

Great work and great article. One question though: Did you consider to replace convolution based filters by FFT based ones ?

liuliu · on July 7, 2017

Lanczos or Gaussian are separable filters. That reduces computation considerably.

stagger87 · on July 6, 2017

For small filter sizes, convolution is going to be faster than an FFT approach. Plus, correct me if I'm wrong, but you need to perform a convolution for every output pixel where the filter kernel is different for each convolution (Sampling the lanczos filter at different points depending on the resample ratio), which would really slow down an FFT approach.

pishpash · on July 6, 2017

Beyond the mismatched signal size, the FFT approach creates circular convolutions, which is not what you want in images. You'd need windowing or a larger effective signal size, and pay the cost of conversions back and forth.

homm · on July 6, 2017

No. As I can see, no one uses FFT for image scaling. I guess because of it much slower.

cyphar · on July 7, 2017

It's a bit of a misnomer to talk about a distinction between convolution and FFT. By the convolution theorem, the two are mathematically equivalent. In addition, on paper, FFT-based convolution scales much better than traditional convolution because it reduces the complexity from O(n^2) to O(n log n).

I haven't done the benchmarks for a fully optimised implementation, but comparing naive implementations you can easily tell that FFT-based is much faster (even with all of the tricks that MATLAB does to optimise sparse matrix operations).

gfody · on July 6, 2017

> I wasn’t building it for fun: I work for Uploadcare and resizing images has always been a practical issue with on-the-fly image processing.

you ever consider pushing the work entirely to the client with a resize implemented in javascript? that would cut down on bandwidth as well.

igordebatur · on July 6, 2017

Uploadcare supports shrinking images before uploading, https://uploadcare.com/documentation/widget/#image-shrink (I'm one of co-founders)

homm · on July 6, 2017

It's exactly what my previous article "Image resize in browsers is broken" is about )

https://blog.uploadcare.com/image-resize-in-browsers-is-brok...

It originally was written two years ago, so some things have changed. But in general, it is still correct: for most browsers, you need to combine several ugly technics to get suitable results. Though, the quality will have nothing common with quality when you have direct access to hardware.

MichaelGG · on July 6, 2017

Why not use asm.js instead of relying on canvas and hacking around browser problems?

homm · on July 6, 2017

1. asm.js is dead in favor of WebAssembly 2. I need a solution for most browsers on most platforms, not for some very recent and most popular

nneonneo · on July 6, 2017

He's scaling images down, not up. Think taking an uploaded smartphone picture (multiple megapixels) and scaling it down to thumbnail-sized images for various screens.

gfody · on July 6, 2017

ic - I thought it was just the usual image-downsample-and-upload thing you see everywhere.

even so, if the service is hosting your images in multiple resolutions you could do it all client side at upload time. they'd be trading bandwidth for cpu time.

striking · on July 6, 2017

Bandwidth is more valuable than CPU time when communicating across the wider internet.

techdragon · on July 7, 2017

Any reason this had to be a fork?

I would much rather this feature be in Pillow so ALL of the python ecosystem could get 6 times faster image resizing.

girvo · on July 7, 2017

He addressed that in the article: Pillow is cross-platform and cross-architecture, so these sorts of specific optimisations (x86-64 only, with some pretty specific instruction requirements) meant that the author felt it wouldn't be a good fit in the original library

vortico · on July 6, 2017

Looks fantastic! Are there other bottlenecks, such as JPEG encoding and decoding that can be ported to SIMD code in Pillow?

lcnmrn · on July 6, 2017

No, that’s handled by libjpeg which can be replaced by libjpeg-turbo but nothing else at the moment.

legulere · on July 7, 2017

Can the speed increased even further using GPGPU?

MuffinFlavored · on July 6, 2017

> With optimizations, Uploadcare now needs six times fewer servers to handle its load than before.

This is devil's advocate, but did you guys have concrete need for this optimization? You now need six times fewer servers, but was that a crippling problem, or is it a cool statistic for the future when you get more users?

FridgeSeal · on July 7, 2017

Does it need to be a crippling problem before you do anything about it?

Even discounting that, the fact that their server bill will now be 6x smaller is justification enough? Even if the cost savings aren't quite that much (suppose they work out to be 50% of previous costs), if I was running a business I would totally be implementing optimisations that allowed me to halve my running costs...

igordebatur · on July 6, 2017

Now we're a little happier every time AWS send their bills

girvo · on July 7, 2017

One can make an argument that the CO2 and energy cost for wasted server usage is a decent reason for it! 6x fewer is not a small amount, that's a great result.

dmitrymukhin · on July 7, 2017

I thought this was in article already :)