Image Scaling Attacks (embracethered.com)
431 points by wendythehacker on Oct 29, 2020 | 73 comments



This obviously works when the image is "scaled" by sampling/nearest-neighbor (e.g. downscaling 2x by taking every second pixel and discarding the rest), not actually scaled through some better method (by doing math that involves all pixel values).

What the article doesn't mention, and the paper it links to probably covers somewhere amid so much other material that I haven't found it yet, is whether this also works on some of the better scaling algorithms, and thus whether it's a "duh, OBVIOUSLY" or actually interesting research.

The blog post gives a cv2.resize example which seems to default to "bilinear", but I'm not sure what this means for downscaling, in particular for downscaling by a large factor.

I suspect that the key takeaway is "default downscaling methods are bad".
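
Quick illustration of how small the sampled set is (my own sketch, not the article's code; it assumes OpenCV's documented mapping of output x to source (x + 0.5) * scale - 0.5):

    import cv2
    import numpy as np

    # Mostly-black "cover" image with a faint dot grid planted exactly where an
    # 8x INTER_LINEAR downscale will sample (output x maps to source 8x + 3.5,
    # so each output pixel only reads source rows/cols 3 and 4 mod 8).
    src = np.zeros((512, 512), dtype=np.uint8)
    for r in (3, 4):
        for c in (3, 4):
            src[r::8, c::8] = 255

    small_linear = cv2.resize(src, (64, 64))  # default INTER_LINEAR
    small_area = cv2.resize(src, (64, 64), interpolation=cv2.INTER_AREA)
    print(small_linear.mean(), small_area.mean())  # roughly 255 vs roughly 16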


You have to use AREA interpolation for downscaling. Bilinear will only interpolate among the 4 nearest source image pixels. It still ignores most of the source pixels.

This is in essence a special case of sampling artifacts, i.e. aliasing artifacts. Anyone writing image processing software should already know about aliasing, the Nyquist theorem, etc. Or, well, perhaps not in the current hype, where everyone is a computer vision expert who took one Keras tutorial...

Resizing with nearest neighbor or bilinear (i.e. ignoring aliasing) also hurts ML accuracy, so it's worth fixing regardless of this specific "attack".


Bilinear could mean downscaling with a triangle kernel, but it might well be the standard bilinear interpolation that's native to most GPUs and OSs.

Also area interpolation still has some pretty terrible aliasing, since box kernels are terrible at filtering high frequencies.

And of course with downscaling you could still freely manipulate the downscaled image if you're allowed to use ridiculously high or low pixel values, provided you know the exact kernel used.


Bilinear uses the triangular kernel over the source image (with size corresponding to the input pixel size).

Area interpolation works very well in practice; it's more sophisticated than just a box filter on the input followed by sampling. It calculates the exact intersecting footprint of each source pixel and computes a weighted average. Do you have examples where this causes aliasing, and can you show a better alternative?
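
For the integer-factor case, a minimal sketch of that idea (my own code) is just block averaging, where no source pixel is ignored:

    import numpy as np

    # Integer-factor area downscale: every source pixel in the k x k footprint
    # contributes to the average, so none are ignored. Real implementations
    # (e.g. INTER_AREA) also weight fractional footprints at the edges.
    def area_downscale(img, k):
        h, w = img.shape[0] // k * k, img.shape[1] // k * k
        blocks = img[:h, :w].reshape(h // k, k, w // k, k, *img.shape[2:])
        return blocks.mean(axis=(1, 3))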


You can use any image with a high frequency regular pattern. Wikipedia has the following example: https://en.wikipedia.org/wiki/File:Moire_pattern_of_bricks_s....

Anything softer than area will help with those kinds of issues (which is why the original https://en.wikipedia.org/wiki/Aliasing#/media/File:Moire_pat..., looks fine in most browsers even if you resize it). Bicubic tends to do better in this respect. It's a trade-off though.


Sorry, but this is wrong. Area has no aliasing; all the others introduce aliasing artifacts when DOWNscaling.

https://imgur.com/a/C6utkwr

Now you could use pre-smoothing with a kernel and then resampling, but then we are talking about something else.

It's important to understand that interpolation happens between the source pixels, so it does not help when downscaling. Cubic tends to look nice, yes, but only when UPscaling.


Yeah, if you're going to use interpolation to downscale, it's obviously going to look worse than even the most basic version of downscaling. That's why downscaling uses the transpose of the interpolation kernel; not doing that and then being surprised the result doesn't look good is just silly.
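
A 1-D sketch of what that means in practice (my own code; the "transposed" use amounts to widening the tent kernel by the scale factor s):

    import numpy as np

    # 1-D downscale by integer factor s with a tent kernel widened to radius s,
    # so every source sample under an output pixel's footprint gets weight > 0.
    def tent_downscale_1d(x, s):
        j = np.arange(len(x))
        out = np.empty(len(x) // s)
        for i in range(len(out)):
            center = (i + 0.5) * s - 0.5  # source position of output sample i
            w = np.maximum(0.0, 1.0 - np.abs(j - center) / s)
            out[i] = (w * x).sum() / w.sum()
        return out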


Do you know of any image processing library that has an implementation for that?


ImageMagick should work. It also has quite extensive documentation (https://legacy.imagemagick.org/Usage/resize/), though it's a bit hard to know where to start. I'm fairly certain it tells you somewhere that interpolation and downscaling use their kernels differently, but I couldn't tell you where.


There's another way to hide the image, and that is to exploit the nonlinearity of the response curves (gamma).

I have an image I crafted a long time ago which looks something like gray noise when you open it up, but when you downscale it, you see an image of Lt Cmdr Data from Star Trek. I wonder if I can dig it up.

The technique itself was not novel when I did it, a more sophisticated version involving embedded gamma values (which you can make quite large or small) was routinely used on image boards some ten or fifteen years ago.


It's ridiculous that so few websites actually handle this well. Even my own self-written imgur clone does it just fine:

https://i.k8r.eu/i/F_XCMA

https://i.k8r.eu/F_XCMAm.png

https://i.k8r.eu/F_XCMAt.png

You just have to go into a linear colorspace and use an area filter.
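
For anyone who wants the recipe, here's a minimal sketch (my code; assumes sRGB input and an integer scale factor):

    import numpy as np

    # Decode sRGB to linear light, average there, re-encode.
    def srgb_to_linear(u):
        u = u / 255.0
        return np.where(u <= 0.04045, u / 12.92, ((u + 0.055) / 1.055) ** 2.4)

    def linear_to_srgb(v):
        u = np.where(v <= 0.0031308, 12.92 * v, 1.055 * v ** (1 / 2.4) - 0.055)
        return np.clip(u * 255.0 + 0.5, 0, 255).astype(np.uint8)

    def downscale_gamma_correct(img, k):
        lin = srgb_to_linear(img.astype(np.float64))
        h, w = img.shape[0] // k * k, img.shape[1] // k * k
        blocks = lin[:h, :w].reshape(h // k, k, w // k, k, *img.shape[2:])
        return linear_to_srgb(blocks.mean(axis=(1, 3)))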


Related: you can get an idea of what your browser/display is doing in this shadertoy: https://www.shadertoy.com/view/Wd2yRt


Fwiw, the reason why Wikipedia doesn't do this when rescaling images (or at least didn't years ago when I was working on image resizing code for Wikipedia) is that doing it (with off-the-shelf software) required keeping the entire image in memory, which was a big no-no. I mean, I guess it would be fine for small images, but then you're using two different algorithms depending on image size, which seems bad.



The article links to this browser test page:

http://www.ericbrasseur.org/gamma_dalai_lama.html

On my machine, both Firefox and Chrome display grey rectangles when scaling down. Why do the browsers get this wrong?


Because resizing in a linear colorspace is more costly. JPEG can be resized without shifting colorspaces VERY cheaply, but a change of colorspace (or a gamma shift) requires fully decoding the image into RAM. The hit can be quite significant: on a phone or laptop it would hurt battery, and on an online service (a dynamic resizer) it would add latency.


> on an online service (dynamic resizer service) it would impact latency.

If it's even possible at all. Sometimes users upload things like https://commons.wikimedia.org/wiki/File:“Declaration_of_vict...


It can also depend on the monitor. When I drag this page between monitors I see different effects.


Max pooling could also be targeted extremely easily with this technique, and it is immensely popular as a scale-reduction technique in convolutional neural networks. So, yes, it could very well be a relevant and non-trivial attack in the context of 'dataset poisoning'. (It would also be relatively easy to defend against; just don't use max pooling in the first layer -- but the point is that this is a steganographic attack.)
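
A toy sketch of that (my own illustration):

    import numpy as np

    # One planted pixel per 2x2 window fully determines 2x2 max pooling,
    # so the payload survives the "downscale" while the cover vanishes.
    def max_pool_2x2(x):
        h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
        return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    cover = np.random.randint(0, 64, (8, 8))  # dim noise a reviewer would see
    cover[1::2, 1::2] = 255                   # planted payload pixels
    print(max_pool_2x2(cover))                # all 255: only the payload remains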


One key thing to be aware of is that not all "bilinear" scaling algorithms are created equal. If the "bilinear" in question is GPU-accelerated, it's quite possible that it's the Direct3D/OpenGL bilinear filter, which samples exactly 4 taps of the image from the highest appropriate mip level (which may be the only one, unless the application goes out of its way to generate more). That means if the scaling ratio is less than 50%, it becomes something like a smoothed nearest neighbor filter and is vulnerable to this attack.

The introduction of a mip chain + enabling mip mapping mitigates this, because when the scaling ratio is less than 50% the GPU's texture units will select lower mips to sample from, approximating a "correct" bilinear filter. This does also require generating mips with an appropriate algorithm - there are varying approaches to this, so I suspect it is possible to create attacks against mip chain generation as well.
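
A mip chain is essentially just repeated 2x box downscales; a sketch (mine; repeated averaging is one common approach among several):

    import numpy as np

    # Each mip level is a 2x box downscale of the previous one; sampling from
    # the right level approximates a correctly widened filter.
    def build_mips(img):
        mips = [img.astype(np.float32)]
        while min(mips[-1].shape[:2]) >= 2:
            m = mips[-1]
            h, w = m.shape[0] // 2 * 2, m.shape[1] // 2 * 2
            mips.append(m[:h, :w].reshape(h // 2, 2, w // 2, 2, *m.shape[2:])
                        .mean(axis=(1, 3)))
        return mips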

Thankfully, quality-focused rendering libraries are generally not vulnerable to this, because users demand high-quality filtering. A high-quality bilinear filter will use various measures to ensure that it samples an appropriate number of points in order to provide a smooth result that matches expectations.

One other potential attack against applications relying on the GPU to filter textures is that if you can manually provide mip map data, you can use that to hide alternate texture data or otherwise manipulate the result of downscaling. As far as I know the only common formats that allow providing mip data are DDS and Basis, and DDS support in most software is nonexistent. Basis is an increasingly relevant format though and could potentially be a threat, but as a lossy format it poses unique challenges.


> This does also require generating mips with an appropriate algorithm - there are varying approaches to this

http://number-none.com/product/Mipmapping,%20Part%201/index....

http://number-none.com/product/Mipmapping,%20Part%202/index....


Bilinear and trilinear with mipmaps are still relatively poor. 3D also uses anisotropic filtering, which eliminates a lot of artifacts, even in 2D scenarios.


It is a very common and often overlooked issue in image processing. Bilinear is widely used and not particularly good; for large-factor downscaling it is reminiscent of nearest neighbor.


> It is a very common (...)

Bilinear interpolation is perfectly acceptable for zooming in on an image (making it larger by adding new pixel values). If you want to zoom out, you can still use bilinear interpolation, but of course you have to filter the image data beforehand to avoid aliasing.
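
Something like this sketch (my code; the sigma rule of thumb is a common heuristic, not gospel):

    import cv2

    # Low-pass first (sigma scaled to the factor), then bilinear is safe.
    def downscale_prefiltered(img, factor):
        blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=factor / 2.0)
        h, w = img.shape[:2]
        return cv2.resize(blurred, (w // factor, h // factor),
                          interpolation=cv2.INTER_LINEAR)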


Most often scaling and filtering are an integrated process; when one says bilinear, it is usually implied that it is combined with nothing else.


Indeed. If you filter the image data, you should _not_ do bilinear on top of that, since bilinear is itself a (triangle) filter, so you'd soften the image for no good reason.


You still need some kind of interpolation if the zoom factor is non-integer, and bilinear is a good choice in that case.


Yeah, the default implementation should check the scaling factor and use AREA interpolation when downscaling and bilinear for upscaling.
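
E.g. with OpenCV, something like this sketch (the function name is mine):

    import cv2

    def safe_resize(img, dsize):
        # dsize is (width, height); shrink -> AREA, enlarge -> bilinear
        shrinking = dsize[0] < img.shape[1] or dsize[1] < img.shape[0]
        interp = cv2.INTER_AREA if shrinking else cv2.INTER_LINEAR
        return cv2.resize(img, dsize, interpolation=interp)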


Whether it works or not depends on how many samples are used to downscale. Amusingly, this attack was used for bait-and-switch and "click here to [x]" gimmicks on some websites, especially 4chan, and you can find examples tuned primarily for typical thumbnail generators (which, probably for performance reasons, tend to only sample a small number of pixels).

https://thume.ca/projects/2012/11/14/magic-png-files/


You're looking for section 3.1 in [1], where they analyze the effect of the scaling ratio and kernel size for an arbitrary downscaling kernel.

> Any algorithm is vulnerable to image-scaling attacks if the ratio of pixels with high weight is small enough.

[1] https://www.usenix.org/system/files/sec20-quiring.pdf


Just a quick thought: if you just average the surrounding pixels, you could possibly still add occasional pixels to skew the average and create a different image, though that might be much more noticeable.


If you add occasional pixels to skew the average, it will probably be noticeable in the original image. But an interpolation scheme that uses only the four corners while ignoring the rest can be easily fooled: you can blend an entire lower-resolution image into those four corner pixels.


I remember seeing this technique 8 or 10 years ago on 4chan. The thumbnail was some innocuous picture; when you clicked on it, it expanded to the larger version with a banana. The larger version also had these kinds of dots on it.


This is a different, related trick, which I explored in detail in PoC||GTFO 15:13.

https://archive.org/stream/pocorgtfo15#page/n96/mode/1up

This isn't based on attacking scaling algorithms per se, but rather on the fact that most browsers honor the gAMA gamma chunk in PNG files, while most image processing libraries don't, and strip it when downscaling.

The abuse potential for AI training exists here too, but both attacks are a bit of a stretch.


I'm curious about the use of the word 'attack' here - is that really what this is? If so, what exactly is being attacked? I thought this kind of thing was called steganography.


The attack part seems to be that Husky AI downscales the images it uses to train its model. If it were vulnerable to this attack, its downscaling would expose the hidden image and train on that instead of the user-visible image. I think this could be used to trick manual or even automated reviews of the input.


My guess is that an evil actor could contaminate a training data set with hidden images, resulting in a faulty ML model.

... but yeah, it's a stretch as a real-world application; it seems to require a really specific setup to work.


I guess you can potentially bypass automatic content filters on social media for example.


Steganography usually has the recipient intending to get the hidden message. Since this is about fooling the recipient "attack" seems apt.


There was a very popular yet useless trick in the late '90s/early 2000s where you'd combine two images in a checkerboard pattern: one at regular intensity, the other very bright (so it doesn't stand out that much upon regular viewing).

Internet Explorer had this feature where, if you pressed CTRL+A to select the page contents, it would overlay images with a 1px grid to indicate selection. If you got your pattern right, the hidden image would appear. This is essentially the same effect, but on steroids.
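
A rough reconstruction of the trick (my own sketch; the brightness offset is a guess):

    import numpy as np

    # Interleave a cover image with a brightened hidden image; masking every
    # other pixel (as IE's selection grid effectively did) reveals the hidden one.
    def checkerboard_mix(cover, hidden):
        mask = (np.indices(cover.shape[:2]).sum(axis=0) % 2).astype(bool)
        bright = np.clip(hidden.astype(np.int16) + 128, 0, 255).astype(np.uint8)
        out = cover.copy()
        out[mask] = bright[mask]
        return out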


This reminds me that a few years ago (almost two decades?) there was a lot of concern online, almost "moral panic", about the potential of digital steganography to hide information in public image files.

Even if this method is not feasible as an attack vector, at the very least it looks like a very practical way to share information that otherwise would be censored or restricted–all the more so if the hidden image data can be encrypted, which may make it impossible to detect.

On the other hand, I know nothing about steganography and I'm talking out of my arse, so maybe current steganography methods are much more powerful.


I remember that in the early '00s people would share books and movies by using a simple command that let you zip an archive into a JPEG. For example, they would put a book's PDF file in an image of its cover. Someone else could then download the image, unzip it, and get its contents.
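
The "command" was essentially just concatenation; a sketch (filenames are placeholders):

    # JPEG decoders stop at the end-of-image marker, while unzip locates the
    # central directory at the end of the file, so concatenation yields a file
    # that works as both.
    with open("combined.jpg", "wb") as out:
        with open("cover.jpg", "rb") as img, open("book.zip", "rb") as zf:
            out.write(img.read())
            out.write(zf.read())
    # afterwards: `unzip combined.jpg` recovers the archive's contents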

I can easily imagine how someone could use this for nefarious purposes.


Very recently someone created a method to encode files and data into videos - the video could then be uploaded to YouTube, and distributed/stored permanently there.


How could that be possible, though? YouTube doesn't serve the original video file back to users; it gets processed to create different video streams, so this seems pretty crazy.


According to the creator, /u/T0X1K01 on reddit:

> No, that's what's so cool about it. I explain it in more detail in the video, but basically because the videos are created using 1-bit color images, it makes it easy to retrieve data without having to worry about how YouTube changes the video.

There's a video explanation here: https://www.youtube.com/watch?v=yu_ZIr0q5rU&feature=youtu.be

Source code here: https://github.com/AlfredoSequeida/fvid/

An example here: https://www.youtube.com/watch?v=NzZDFxM5Coo
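
The core idea is simple; a sketch of the encoder side (block size and width are my guesses, not fvid's actual parameters):

    import numpy as np

    # One bit per large black/white block: a thresholded read-back only needs
    # each block's average to stay on the right side of 128, which easily
    # survives lossy re-encoding.
    def bits_to_frame(bits, block=8, width=640):
        cols = width // block
        rows = -(-len(bits) // cols)  # ceil division
        frame = np.zeros((rows * block, width), np.uint8)
        for i, b in enumerate(bits):
            r, c = divmod(i, cols)
            frame[r * block:(r + 1) * block,
                  c * block:(c + 1) * block] = 255 if b else 0
        return frame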


But you won't be able to download it - youtube-dl is not with us anymore.


There was a new version released yesterday. [0]

[0] http://youtube-dl.org/


I suppose a very stupid thumbnail generator could be attacked with something like this. Proper tools for downscaling images already take this (and also gamma correction) into account.

See http://www.ericbrasseur.org/gamma.html


It's one thing to take non-linearity into account. But you also need to take into account the embedded colorspace information of your source image, if it has one. It's not necessarily sRGB.



I was expecting the article to mention another use for this attack: to share porn on regular hosting sites and bypass automated detection systems.


Mmm, I wonder if it would work for videos too.


It would be spectacularly difficult to do for videos:

First there's lossy compression, which means there's no guarantee your injected pixels survive the encoding pass.

Then there's the additional hurdle of motion vectors, which will most likely be misaligned between the original video and the injected one.

This would result in hard to predict artefacts after encoding.

Finally, each decoder handles scaling slightly differently, so even if your embedded video trick works on one software/hardware decoder, it might fail on another (sometimes even depending on just the version or additional settings/filters being enabled).


This is what came to mind for me: breaking major social sites' automated censorship mechanisms, although I feel like that's largely crowdsourced these days?


Imagine combining this technique with the encoding of software like youtube-dl into an image, as in this twitter post:

https://twitter.com/GalacticFurball/status/13197659867911577...

Probably hard to get it working in every environment, but if you know what you are up against, it might be possible ;-)


Typically when you downsample, you want to low-pass filter and then apply whatever downsampling kernel you want with the correct stride. Since the filter is low-pass (think: take the Fourier transform, keep an inner smaller square of the spectrum, and invert), you can embed the poison image entirely within that retained frequency band. Now play with the power: if we downsample by a factor of 4, assume the true image loses much of its power (say it keeps only 1/4) while the poison image, living entirely in the retained band, loses none. So right off the bat, we are scaling up the poison image's relative power by a factor of the downsampling ratio; for example, we might go from the poison image having 1/4 the power of the true image to the two having equivalent power. The other aspect is that if the interpolation kernel and strides are known, we can also make sure the poison image has large values at exactly those sampled pixels and increase the gain further.
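
A grayscale sketch of that embedding (my code; assumes an ideal low-pass downscaler and ignores FFT normalization constants):

    import numpy as np

    # Embed the payload only in the centered low-frequency block that an ideal
    # low-pass downscale keeps: the small output is then (downscaled cover)
    # plus gain * payload, up to normalization.
    def embed_lowfreq(cover, payload, gain=4.0):
        F = np.fft.fftshift(np.fft.fft2(cover.astype(float)))
        P = np.fft.fftshift(np.fft.fft2(payload.astype(float)))
        h, w = payload.shape
        cy, cx = cover.shape[0] // 2, cover.shape[1] // 2
        F[cy - h // 2:cy - h // 2 + h, cx - w // 2:cx - w // 2 + w] += gain * P
        out = np.real(np.fft.ifft2(np.fft.ifftshift(F)))
        return np.clip(out, 0, 255).astype(np.uint8)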


Really impressive and ingenious. This looks scary in some sense...


I thought almost everyone uses some form of interpolation when resizing, which would defeat this attack completely. Or are there use cases for not using interpolation (I know it requires less processing)?


OpenCV does use linear interpolation by default. What you'd need is something that helps against aliasing, for example first blurring the image with a kernel of the appropriate size, or using a scaling method like OpenCV's INTER_AREA.


What actually helps here is to use linear colorspace for downscaling and to correctly detect the source image's colorspace.


Colorspaces are an issue with scaling/averaging, but it's not what's happening here.


Try filling a large square image with thin vertical lines, interpolated for smoothness but still visibly separate from each other; the width of each line should be about 1-3 pixels. Then map the image onto polar coordinates, so the lines meet in the middle. Finally, downscale it a couple of times with a basic avg(2x2) -> 1x1 mapping. Observe an elaborate "shadow shape" in the middle that looks like r = cos(4 pi a), but with a lot more nuanced detail.
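
For the curious, a sketch that generates something like it (the line count is my guess):

    import numpy as np

    # ~600 smooth radial lines meeting in the middle, then repeated naive
    # avg(2x2) downscales to expose the moire "shadow shape".
    n = 1024
    y, x = np.mgrid[-1:1:n * 1j, -1:1:n * 1j]
    img = 0.5 + 0.5 * np.cos(np.arctan2(y, x) * 600)

    small = img
    for _ in range(3):
        h, w = small.shape
        small = small.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))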


This is oddly specific. Can you point to a realistic scenario where this makes sense?


I'm just working on a very particular app in the WebGL rendering space and noticed this mysterious glitch in exactly this case. At first I thought I'd discovered something interesting, but it turned out to be just the aliasing bug being discussed here. I'll share a link to that demo on HN a little later: my account is still green and I'm afraid HN would shadowban me and my domain for sharing links now.


For those wondering how it works: it's explained in the article linked in the third paragraph. It takes advantage of aliasing.



So in the same way we build pipelines to sanitise user text input (Little Bobby Tables etc.), we need to treat image data the same way. I guess a pipeline that uses OpenCV to decode the image at full size and at thumbnail size, and flags it for review if the two are wildly different?
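
Something like this sketch (my code; the threshold is arbitrary):

    import cv2
    import numpy as np

    # Downscale with two unrelated algorithms and flag the upload when they
    # disagree a lot; a scaling attack is tuned to one sampling pattern.
    def looks_poisoned(img, dsize=(64, 64), threshold=25.0):
        a = cv2.resize(img, dsize, interpolation=cv2.INTER_LINEAR).astype(np.float32)
        b = cv2.resize(img, dsize, interpolation=cv2.INTER_AREA).astype(np.float32)
        return float(np.abs(a - b).mean()) > threshold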

It's still cool though


It's definitely neat that it works that way, but I don't really see it as a problem.


Hah, this is kind of brilliant. Hiding in plain sight…


In the example "attack image" you can see the husky and the outline of the fence in the sky. "That's amazing!"


Oh good, perhaps things will start defaulting to less aliasy downsampling kernels now.


That's incredible!


I did not see in the article, nor so far here in the comments, one example of this in the wild, which perhaps indicates that such a simple sampling approach isn't common. If someone could successfully execute this against Twitter or Reddit, for example, that would change its newsworthiness completely.



