How an A.I. ‘Cat-And-Mouse Game’ Generates Believable Fake Photos (nytimes.com)
197 points by gk1 on Jan 4, 2018 | 81 comments



Good article and great tech. However, I don't know if I believe the results are as good as they claim. Many of the pictures look a bit off to me, like they all have dead eyes. Maybe celebrities generally look like that anyway, so it is being true to form. :)

In particular, I think this guy is missing a pretty significant part of his head: https://static01.nyt.com/newsgraphics/2017/12/26/ai-faces/8e...


GANs, which have been called the most interesting idea in ML of the last decade, were invented by Ian Goodfellow. I met him on Reddit a few years ago. I was supposed to get private ML tutoring from him, right around the time Andrew Ng opened the first Coursera course. I didn't get lessons because I gave up and eventually took the MOOC. But it's amazing to know we share the same forums and sometimes exchange a comment or two.

The great idea behind GANs is that they replace one of the hardest-to-understand parts of a neural net - the loss function - with another neural net, thus making the loss function learnable. This opens the door to a kind of unsupervised learning that was impossible to make work before. GANs are also very important because they are almost like reinforcement learning (actor + critic = RL, generator + discriminator = GAN), and RL is supposed to be the way to AGI.

The most famous problems with GANs are instability during training and mode collapse - which is like a student studying specifically for an exam (and not learning in general), thus optimising for the test instead of the real thing.
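
To make the learned-loss idea concrete, here is a minimal sketch in Keras (a toy of my own, not the article's method; the 64-dim vectors stand in for images): the discriminator D is itself a trainable network, and the generator G is trained through D's judgment rather than through a hand-written loss.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    latent_dim = 32

    # G: noise -> fake sample (a 64-dim vector stands in for an image)
    G = keras.Sequential([keras.Input(shape=(latent_dim,)),
                          layers.Dense(64, activation="relu"),
                          layers.Dense(64)])

    # D: sample -> P(real). This network plays the role of a learnable loss.
    D = keras.Sequential([keras.Input(shape=(64,)),
                          layers.Dense(64, activation="relu"),
                          layers.Dense(1, activation="sigmoid")])
    D.compile(optimizer="adam", loss="binary_crossentropy")

    # Stacked model: train G so that D labels its output "real" (1).
    D.trainable = False
    gan = keras.Sequential([G, D])
    gan.compile(optimizer="adam", loss="binary_crossentropy")

    real_data = np.random.randn(1000, 64) * 2.0 + 3.0  # stand-in "real" dataset

    for step in range(1000):
        z = np.random.randn(64, latent_dim)
        fake = G.predict(z, verbose=0)
        real = real_data[np.random.randint(0, 1000, size=64)]
        # 1) train the "loss network" D to separate real from fake...
        D.train_on_batch(np.concatenate([real, fake]),
                         np.concatenate([np.ones((64, 1)), np.zeros((64, 1))]))
        # 2) ...then train G against D's current judgment
        gan.train_on_batch(np.random.randn(64, latent_dim), np.ones((64, 1)))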


> The most famous problems with GANs are instability during training and mode collapse - which is like a student studying specifically for an exam (and not learning in general), thus optimising for the test instead of the real thing.

I must confess I haven't worked with GANs yet, but isn't that the whole point of GANs? The student is optimising for the test while the teacher is learning how to make tests as similar to reality as possible?

If I understand correctly, the main challenge is finding a way to allow teacher and student (well, generator and adversary) to learn at a similar rate, so that one doesn't stop learning because its competitor is too advanced. Is that correct?


> but isn't that the whole point of GANs?

not quite, but you're on the right path.

think about it this way: you (the generative model) are trying to predict a unit gaussian, which is just a fancy way to say bell curve. you get +1 if you predict a number in this distribution (eg 0.1 or -0.5, which are within one standard deviation of the mean of 0); you get -1 if you predict a number that's "far" from this distribution (something like 40, which has an infinitesimally low probability of being drawn from a unit gaussian).

mode collapse, then, is when you predict 0 all the time. yes, you are technically correct, but you've failed to learn the true distribution.

obviously i've simplified this quite a bit and anthropomorphized the model, but i hope you get the gist. otherwise, the original paper (https://arxiv.org/abs/1406.2661) is refreshingly easy to read.
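
to put rough numbers on that, a quick numpy sketch (my own toy, not from the paper):

    import numpy as np

    def unit_gaussian_pdf(x):
        # density of the standard normal at x
        return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

    print(unit_gaussian_pdf(0.1))   # ~0.397 -- a very plausible draw
    print(unit_gaussian_pdf(-0.5))  # ~0.352 -- also plausible
    print(unit_gaussian_pdf(40.0))  # ~0.0 (underflows) -- "far" from the distribution

    # mode collapse: always answering 0.0 maximizes the density (~0.399)
    # but has zero variance -- the true distribution was never learned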


Thanks!


> private ML tutoring from him

> I didn't get lessons because I gave up and eventually took the MOOC

Udacity still hasn't gotten him on board. I took the DLF ND because of the tutoring they promised, did GANs as my first project to be in the queue, then graduated later with still no mentoring sessions. So you didn't miss anything by dropping out. How were Ng's new lessons? Worth taking if I've already done DLF + fast.ai?

BTW, GANs' main use might be allowing almost fully unsupervised learning by extending small datasets with believable data.


> BTW, GANs' main use might be allowing almost fully unsupervised learning by extending small datasets with believable data.

I've wondered if dreams are basically this. Your brain uses its world-model-prediction subsystem to generate plausible inputs against which to train its action-generation-policy subsystem. Then, in real life, the action-generation-policy subsystem can react much more appropriately and quickly to real events.

Also, toddlers' stream-of-consciousness babbling when they first start talking. They narrate everything and more than once I've wondered if it's essentially them generating their own verbal training data. When they start talking to themselves their pronunciation, grammar etc. start improving much more rapidly.


> RL is supposed to be the way to AGI

Could you expand on that? Folks like LeCun & Chollet seem to disagree strongly. Just this week Yann posted about unsupervised modeling (with or without DL) being the next path forward, and described RL as essentially a roundabout way of doing supervised learning.


RL/DRL assumes the world is Markovian, i.e. the past doesn't matter given the current state, which is way too simple. It requires a huge number of tries/episodes and a properly tuned exploration-exploitation ratio. It is somewhat based on biological reinforcement learning, so there might be a basis in reality, as there is with convolutional neural networks and the visual field maps in the visual cortex (even if it's a very rough approximation). DRL is the technique that allows modeling decisions; so for prediction you have CNN/RNN/FCN, for generation GANs, and for decisions DRL; together they are the closest thing to AGI we have right now.


> RL/DRL assumes the world is Markovian, i.e. the past doesn't matter given the current state, which is way too simple.

There are plenty of RL papers using RNNs and some types of memory networks.


Likely as value function approximators for one piece of the whole algorithm (as is the case with DQN/DDQN). However, the main algorithm is likely using a variation of the Bellman equation, which assumes the Markov property and gives strong guarantees about convergence.


If you're using DQN or pretty much anything in DRL, you don't have any guarantees about convergence in the first place, and using an RNN does give you the history summary you need (at least up to the minimum error achievable with that fixed-length summary; not that that is any more likely to converge than the overall DRL algorithm is).


I meant that under the Markov assumption, the value iteration used to solve the Bellman equation is guaranteed to converge. So it makes the math people happy, even if that property holds neither in the real world nor in the problem they are trying to solve, and the "deep" in DRL is just heuristics, though surprisingly they work in many cases.
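
For concreteness, here is a toy value-iteration sketch (mine, not from any paper) showing exactly where the Markov assumption enters: the Bellman backup conditions only on the current state, never on history.

    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9
    # P[s, a, s']: transition probabilities; R[s, a]: rewards (random toy MDP)
    P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = np.random.randn(n_states, n_actions)

    V = np.zeros(n_states)
    for _ in range(10_000):
        # Bellman backup: depends on the current state only (Markov assumption);
        # with gamma < 1 it is a contraction, hence the convergence guarantee
        Q = R + gamma * (P @ V)   # Q[s,a] = R[s,a] + gamma * sum_s' P[s,a,s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new
    print(V)  # optimal state values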


that is true: popular rl techniques (eg policy gradients) are very similar to "vanilla" supervised learning techniques and architectures, but they are unsupervised in the sense that they require zero human input.

alphago zero is the canonical example of tabula rasa machine learning.


Even better: https://static01.nyt.com/newsgraphics/2017/12/26/ai-faces/8e...

That's one heck of a receding hairline - receding right out of the plane of existence.


You were primed to be looking for flaws by the nature of the article. It wouldn't be hard to come up with a context where each and every one of the 3x3 grid of pictures in that article was accepted at face value.


They might work as thumbnails, but these are terrible when blown up to full size. When given both images, I was trying to find one that might be real, thinking it could be some freaky filter or something. And I still had a "these are terrible fakes" feeling.

Even the 'best' headline image fails: the eyes are not the same size, and the rest of the face just looks off.


Did you even read my comment? They're not perfect, but you were expecting them to be fake. Someone not told there would be computer-generated images would be considerably easier to fool.

Also, probably the bigger risk is not that you'll be shown an entirely fabricated image, but rather that someone could convincingly be inserted into an existing image.


I was not thinking about fake images when looking at the article; this was pure instinctive revulsion. It's easier to avoid the uncanny valley with stills than with motion, but some of these fall deep into it, and many others don't even make it that far.


In most cases the hair gave it away for me, especially at the contours. The picture you linked is an extreme example.

However, hand them to a good Photoshop artist, as happens with most celebrity pictures, and I'm sure these issues would be fixed in no time.


The most likes I ever got on OKCupid was when I used a GAN-generated celebrity as my profile picture. Just one girl noticed something wasn't quite right.


The test of believability they give in the article is also bullshit. Both of the options are fakes, and they both even look like fakes. Her hair and forehead don't make sense; his mouth and ears don't make sense. They don't match them up against a real picture because people's performance on that task would contradict the headline.


> The test of believability

yep! that is the fundamental limitation of adversarial networks. there's no good measure or "loss", as it's highly subjective.


And the ears are not "compatible" either. Could this be the reason they prefer women's faces (with long hair that covers the ears) in the article?



Several of them seem to be reusing many features from specific celebrities. It may just be me, but there is a very strong similarity to Paul Walker, Liv Tyler, Michael Douglas and Adam Sandler in some of these. I wonder if it's a result of overfitting.


To me the fine details are incongruent: the grain of the hair sporadically changes direction, and patches of skin have different qualities. It looks a bit like Frankenstein's work.


The hair was the giveaway to me. I stared at the two "which one is real" images for a couple minutes thinking, "They both have that fake-looking wavy hair." I thought for sure neither was real.

Cheap trick NYTimes. Cheap.


look at their foreheads - there's still blurry, half-generated wavy hair texture on their skin.


For those looking for a well-commented you-don't-need-a-PhD-to-understand implementation of GANs + variants (using Keras), I recommend the examples in this repo: https://github.com/eriklindernoren/Keras-GAN


Scary to think where this will be in 10 years. Perhaps even video evidence will be hard to believe anymore. How do you convict someone once this technology is mature?


There was a post here a few days ago about amateurs (people who had never even touched Python before, let alone TensorFlow) using deep learning to generate fake celebrity porn. The results are actually pretty believable:

https://news.ycombinator.com/item?id=16040463


Is this safe to view at work?


Yes. The first link is to a 4-day-old post on HN, which links out to another SFW writeup. The title of that post (AI-assisted fake porn being used by people on Reddit for self-completion) is worth keeping off the monitor if you don't want someone to quickly scan the word 'porn' on your screen, but there are no NSFW photos or content.


There's a whole Radiolab series about this called The Future of Fake News: http://futureoffakenews.com/


And how do you prove someone's innocence if anyone can generate a believable fake crime video?

Supporting counter-evidence will become that much more important.


You might be able to use this same technology to counteract this.

If you generated content, you would have a baseline for testing an AI's ability to spot fake content. You could use videos and pictures like these to train a learning AI to spot discrepancies, then report its findings in detail.

Makes me wonder if there is a future in forensics for this type of technology.


This network was trained using an adversarial approach. What that means is that a second network that does exactly what you say was used to train the first.

They kept training until they created images that could reliably fool the discriminator. A more powerful discriminator would just be used to create better fakes.


If there were a government conspiracy to put you away, most of the time they wouldn't need fake video evidence to convict you - just a bit of perjury by officer witnesses.


There's going to need to be some kind of blockchain-style tech that allows the source and veracity of video to be determined.


A blockchain in this case would just be a timestamping service, to prove the minimum time that elapsed since the image existed. That’s only useful if the time to create a fake is significantly more than the resolution of the chain - 10 minutes for bitcoin, 15 seconds for ethereum.

But that only makes sense if you know ahead of time that the information will be valuable, and it only proves the age if there is sufficient hash power on the network, so a “private” blockchain would not be viable.
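
For what it's worth, the local half of the scheme is just hashing; a sketch (the filename is hypothetical, and actually anchoring the digest on a chain or timestamping service is the part left out):

    import hashlib

    def image_digest(path):
        # SHA-256 of the raw file bytes; this is what you'd anchor on-chain
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    digest = image_digest("photo.jpg")  # hypothetical file
    print(digest)  # publish this; anyone can later re-hash the file and compare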


Imagine how this could be misused to start wars:

https://www.youtube.com/watch?v=uIvvHwFSZHs


Heinlein's Fair Witness?


This side of the house is white.


You cannot infer that the type of structure supporting the visible surface is a house, or part of one.


By that scheme you cannot infer it is a surface either. But that was the example given in the book.


I don't understand the images associated with that article. They purport to show the progressive refinement of the output over a series of days. But the figure changes dramatically from image to image, all the way to the end of the run.

At the very least it seems the output is not stable: a human has to decide when to stop the Wheel of Fortune. For the NNs I'm used to, it looks more like a series of images taken from different training sets or parameters.

Caveat: I've done a lot of ML, but not GANs specifically. Is this common? How do you solve the 'where to stop' problem if the output is so unstable?


> They purport to show the progressive refinement of the output over a series of days. But the figure changes dramatically from image to image, all the way to the end of the run.

What they are probably doing is showing snapshots of the same noise vector (== random seed) at various epochs. Since the mapping of noise vector ~> face is totally arbitrary, the ProGAN is free to vary it as it pleases; thus, some but not perfect stability. I saw the same thing messing around with anime GANs: a fixed set of noise vectors would show the anime faces change eye or hair color etc. (there's a sketch of this trick at the end of this comment).

> At the very least it seems the output is not stable: a human has to decide when to stop the Wheel of Fortune.

Yeah, you can't do principled early stopping with GANs, really, because there's no held-out set and the loss is changing. I always ran until it diverged or I became impatient, and similarly with ProGAN: they ran as long as they could (takes like a week on big GPUs). To some extent, if you're using Wasserstein losses, the discriminator loss is supposed to be meaningful as a kind of absolute distance between the true image distribution and the generator distribution so you can do early stopping like 'stop if no improvement for 3 epochs'. (This is just in the pure generative approach; if you're using GANs for a semi-supervised application, presumably you can do early stopping as usual based on whatever you have held-out.)
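
As a concrete version of the fixed-noise snapshot trick, a minimal sketch (the checkpoint filenames and the 512-dim latent are assumptions on my part; any saved Keras-style generator would do):

    import numpy as np
    from tensorflow import keras

    # Fixed seed -> the SAME 16 noise vectors every run, so any change in the
    # output faces comes from the generator's weights, not from resampling.
    z = np.random.RandomState(0).randn(16, 512)

    for epoch in (1, 5, 10, 18):
        G = keras.models.load_model(f"generator_epoch{epoch}.h5")  # hypothetical checkpoints
        faces = G.predict(z, verbose=0)
        # save or plot `faces`; the identity drift across epochs is exactly
        # the instability being discussed in this thread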


I am not sure "unstable" is the word I would use. Sure, even after training for days the GAN produces some not-so-realistic images, but the rate at which it generates those gradually decreases over the training period, and the images get more "realistic".

> How do you solve the 'where to stop' problem if the output is so unstable?

Looking at the discriminator loss would be a good start for that.


It's not the quality I was referring to. Look at the main image sequence. The images from 0 to, say, Day 5 show the kind of progressive refinement I expected: the network is improving its image over time. Each image is a refinement of the previous.

But compare the images from Day 5 to the end. The eye colour is changing and then changing back. As is the background. And the hair colour. The position of the parting. Whether the mouth is closed or showing teeth. Day 16 is not an intermediate point between Day 9 and Day 18.

If it ran for another couple of days, would we get another version like Day 16?

That's what I mean by instability.


Ah, I understand what you are saying. The instabilities could be explained by the batches sampled during those training days and by the generator's input. Training a GAN is not very straightforward, and even minor changes in batch sampling can produce vastly different generated images.


One application I imagine for this is a future of game development similar to the movie Inception, where an "architect" designs the layout and setting and then all the details are filled in ad hoc by the computer's "imagination".

Today it's faces that feel familiar but aren't real. Tomorrow it's whole cities that feel familiar but aren't real. The cities are filled with people you swear you've seen before. Perhaps the details are tailored to you personally, based on the corpus of photos you've posted online.


This is happening. https://www.youtube.com/watch?v=1Ea57XERywM&index=6&list=PLc...

"Narrative Dungeon Design"


The paper was published in October and is titled "Progressive Growing of GANs for Improved Quality, Stability, and Variation": http://research.nvidia.com/publication/2017-10_Progressive-G...


And the one-hour movie of celebrity faces generated by the Progressive GAN (ProGAN) is here:

https://www.youtube.com/watch?v=36lE9tV9vm0

The most amazing AI video I have ever seen, actually. I spent hours staring at it; it works great as a background for many pieces of music. You can think of it as the AI version of the burning log video.


>> “QUESTION: Look at the two photos below and see if you can figure out which person is real.”

>> “ANSWER: Sorry! This was a trick question. Both images were generated by computers.”

Not really a trick question when, even if you know they're both fake, the only way to confirm you're right is to be wrong.


I’d also like to comment on the “ha, fooled you!” tactic used in this article, where the author asks the reader to choose the photo of the real person from two given photos and then reveals that, gasp, both are computer generated.

Whenever I run into this often-used tactic in papers and talks, I can’t help but feel – no, the author didn’t just convince me of their point. Instead they convinced me that they don’t value being trustworthy. Often I will just stop reading the article right then. Or if I do continue I will become unforgivingly skeptical of any claim that doesn’t provide a citation that is independently verifiable.

Use of the tactic feels particularly peculiar in an article that itself grapples with the implications of a future in which photos and videos are no longer trustworthy, a future in which personal reputation will be more meaningful.


Yeah, I thought both looked a tiny bit off. I think it has to do with the reflection in the eyes, which is slightly inconsistent, among other things.


Maybe so (they fooled me), but you were already prepped to scrutinize them. To the point others have made, we’ll soon need to be constantly prepared to assume fakery.

The technology of fakery is rising to meet the “everything is fake news” moment.


I immediately picked the right image, because I saw whisker stubble on the left, and I already knew that image-generation AIs seem to have a thing for painting whisker stubble all over anything even remotely resembling a male face.

Surprise! Guess I should have considered the possibility of a trick question.


So "cat and mouse" is the layman’s term for "adversarial network"?


No, it's the NYT's made-up term. "Arms race" is the traditional term.


Great overview. I especially appreciate that they linked straight to the paper instead of a popsci/buzzfeed regurgitation of the results.


The faces are pretty good, but the ears and craniums are awful, presumably because of a dependency on neighboring pixels and confusion from diverse background images. Why ruin their work by including the garbage parts in the presentation/claims? And why not learn foreground and background separately, and mask them together?
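
The masking idea is basically alpha compositing. A minimal sketch, assuming you already have a foreground, a background, and a soft matte as float arrays:

    import numpy as np

    def composite(fg, bg, mask):
        # classic alpha compositing: mask=1 keeps foreground, mask=0 keeps background
        return mask * fg + (1.0 - mask) * bg

    H, W = 64, 64
    fg = np.random.rand(H, W, 3)    # stand-in generated face
    bg = np.random.rand(H, W, 3)    # stand-in generated background
    mask = np.random.rand(H, W, 1)  # stand-in soft matte
    out = composite(fg, bg, mask)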


I would love this as a service for generating fake users.


See also: https://news.ycombinator.com/item?id=16040463 -- fake porn faceswap generated by AI


Excellent images and the more I think about what this could be used for the creepier it gets. One day, truly, software will be very dangerous.


And it makes you wonder: what is the current legal framework preventing this kind of tool from being misused? How does it differ across countries/unions?

How can such a thing be enforced to begin with?

Are companies/labs/universities/individuals themselves the only thing standing between fair play and massive misuse of realistically generated media?


Mr. Hwang believes the technology will evolve into a kind of A.I. arms race pitting those trying to deceive against those trying to identify the deception.

That's like a chess game. We have seen AlphaGo and other MCTS implementations take the "trying to detect the deception" into account.

By the time the image is generated, it would have already been factored in.


> the technology will evolve into a kind of A.I. arms race

Ha! Google Brain organised a competition on "Adversarial Attacks and Defenses" within the NIPS 2017 conference.

Reminds me of Harry Potter learning magic attack and defence arts at Hogwarts.


Where does AlphaGo try to detect deception? What is deception in perfect information games?


https://www.youtube.com/watch?v=XaQu7kkQBPc

Imagine an automated danger-recognition system at, say, an airport. These kinds of deception attacks could cause problems for such systems. Imagine if suddenly 10, 20, 100 airports around the globe all recognized weapons, bombs or other dangerous items. I can imagine the panic and huge news headlines badmouthing AI.

People don't trust AI. These kinds of errors could only delay proper integration, which in many ways could enhance the way we live.


I’d be really worried if I were a photo model.


We've made good looking people obsolete.


These are pretty damn good, but it seems to me like the program is sort of over-optimizing the pictures: to my eye, the pictures from the 5th to the 7th day are the most realistic.

After that, it feels to me like the "realness" slowly degrades.


Great PR, but none of the faces look quite human: the weird position of the nose, the strange curly artifacts around the hair. There is still work to do to trick the most fundamental tool of the brain, recognizing a fellow human face.


“Believe nothing you hear, and only one half that you see.” ― Edgar Allan Poe


I found the Obama video at the end very interesting. It would be a neat next step to map non-Obama audio to the generated video. For example, pull audio from an Obama impersonator.


> pull audio from an Obama impersonator

We can already impersonate voices with neural nets. Timbre and style can be cloned, and this tech is being used commercially by Baidu at the very least (keyword: Deep Voice 3).


Watching the progress of this system reminds me of that scene at the beginning of The Thing where The Thing almost, but not quite, mimics one of the humans.


Very interesting. Is this kind of tech something that mere mortals can play with?



