> However, researchers (not me) are already looking for ways to extract data from the libraries. I think it's only a matter of time before this becomes a much bigger problem.
What's the use of the data, and what problems do you see?
> Perhaps there is a reason that they don't want really technical people looking at PhotoDNA. Microsoft says that the "PhotoDNA hash is not reversible". That's not true. PhotoDNA hashes can be projected into a 26x26 grayscale image that is only a little blurry. 26x26 is larger than most desktop icons; it's enough detail to recognize people and objects. Reversing a PhotoDNA hash is no more complicated than solving a 26x26 Sudoku puzzle; a task well-suited for computers.
> I have a whitepaper about PhotoDNA that I have privately circulated to NCMEC, ICMEC (NCMEC's international counterpart), a few ICACs, a few tech vendors, and Microsoft. The few who provided feedback were very concerned about PhotoDNA's limitations that the paper calls out. I have not made my whitepaper public because it describes how to reverse the algorithm (including pseudocode). If someone were to release code that reverses NCMEC hashes into pictures, then everyone in possession of NCMEC's PhotoDNA hashes would be in possession of child pornography.
Depends on what you count as "desktop icons". Many icons in Windows are 16x16, but the icons that literally sit on the Windows desktop are 32x32 or bigger.
I guess it has to be decided whether wide access to blurry images like this is better or worse than being able to find full-resolution images. This all assumes there isn't some obfuscation layer, I suppose (is that even possible?).
We might also see DL image upscaling used: obtain similar-looking pictures (at 128x128 or even smaller), downscale them to 26x26, and train a DL model to upscale back to 128x128 or larger, using the original pictures as training targets. Assuming most content in the actual DB is similar, you could probably obtain plausible results; see the sketch below. A similar technique is used by Peacemaker Filmworks to upscale 4K to 8K: https://youtu.be/umyglbDr4IE?t=257
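A minimal sketch of that training setup, assuming PyTorch; `SRNet`, the layer sizes, and the 128x128 target are hypothetical choices for illustration, not anything from a real system:

```python
# Toy super-resolution setup: learn to map 26x26 grayscale inputs back to
# 128x128, training on pairs made by downscaling ordinary images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRNet(nn.Module):
    """Hypothetical upscaler: 26x26 grayscale in, 128x128 out."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, x):
        # Upsample first, then let the convolutions sharpen the result.
        x = F.interpolate(x, size=(128, 128), mode="bilinear", align_corners=False)
        return self.body(x)

def train_step(model, optimizer, hires):
    """hires: (N, 1, 128, 128) batch of 'similar-looking' training images."""
    lowres = F.interpolate(hires, size=(26, 26), mode="bilinear", align_corners=False)
    loss = F.l1_loss(model(lowres), hires)  # L1 is a common SR reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With `model = SRNet()` and `optimizer = torch.optim.Adam(model.parameters())`, repeatedly calling `train_step` on batches of similar images is the whole idea: the network learns a prior over that kind of content and hallucinates plausible detail back into 26x26 projections.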
That would be an incredibly questionable dataset to train it on. And at that point there would probably be no need to upscale the hash projections at all: just downscale any image and use this questionable model to re-upscale it.
Once the algorithm is known, adversaries can figure out how to minimally modify images so that they reliably avoid detection; a toy demonstration follows. Uncertainty about detection capabilities may have been a deterrent.
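A minimal sketch of that evasion risk, using the open-source `imagehash` package and its dHash as a stand-in for PhotoDNA; the threshold and noise parameters are illustrative. A real adversary who knows the algorithm could perturb far more surgically than random noise:

```python
# Nudge an image's perceptual hash away from its original value with
# barely visible pixel noise, stopping once it would no longer match.
import numpy as np
from PIL import Image
import imagehash

def evade_dhash(img: Image.Image, threshold: int = 10,
                step: float = 2.0, max_iters: int = 1000) -> Image.Image:
    original = imagehash.dhash(img)
    arr = np.asarray(img, dtype=np.float32)
    rng = np.random.default_rng(0)
    candidate = img
    for _ in range(max_iters):
        # ImageHash subtraction returns the Hamming distance in bits.
        if imagehash.dhash(candidate) - original > threshold:
            break  # far enough from the original hash to miss a match
        arr = arr + rng.normal(0.0, step, size=arr.shape)
        candidate = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    return candidate
```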
One could argue that merely knowing the algorithm exists, for all cloud photo services (scanning isn't just an Apple iCloud thing), means these people simply won't upload to cloud services and won't need to worry about any of this. For the Apple implementation, avoiding detection is as simple as switching off iCloud sync in the settings. And if they modified images to evade detection while still sharing via iCloud, that would be a risky game: those tweaked images would most likely make their way back into the database, and any change to the algorithm could mean they all get detected. My naive assumption is that the latest model would be required to interact with the iCloud service (it seems like a good idea, at least).
All perceptual hashes, including AI-based perceptual hashes, have a "projection" property. If you have the hashes, you can project them into some kind of image. Some hashes result in blurry blobs (pHash, wavelet hashes). Some result in silhouettes (aHash, dHash). And some show well-defined images (PhotoDNA).
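To make the "projection" property concrete, here is the simplest case, aHash: the hash only records which pixels of a downscaled grayscale thumbnail were above the mean, so rendering those bits back out yields exactly the silhouette described above (a sketch using NumPy and Pillow):

```python
# aHash stores one bit per pixel of an 8x8 grayscale thumbnail:
# "was this pixel brighter than the image mean?" Projecting the hash
# back is just rendering those 64 bits as black and white.
import numpy as np
from PIL import Image

def ahash(img: Image.Image, size: int = 8) -> np.ndarray:
    g = np.asarray(img.convert("L").resize((size, size)), dtype=np.float32)
    return g > g.mean()  # this boolean grid *is* the entire stored hash

def project(bits: np.ndarray, scale: int = 32) -> Image.Image:
    sil = (bits * 255).astype(np.uint8)  # True -> white, False -> black
    return Image.fromarray(sil).resize(
        (bits.shape[1] * scale, bits.shape[0] * scale), Image.NEAREST)
```

Richer hashes keep more information per cell, so their projections recover correspondingly more detail.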
We don't know how Apple's solution works. But if its hashes can be projected into recognizable images, then people could use Apple's hash system to regenerate child porn.
Without details and an actual code review, I'm not willing to accept Apple's assurance that an image projection isn't possible. We heard the same promise from Microsoft about PhotoDNA, and it turned out to be false.
Imagine the problems if every iPhone and every Mac were in possession of child porn because Apple put it there...
Yes, this is a lot of 'if's, but until it has been evaluated, this is very much within the realm of the possible.