Nvidia Vid2vid: High-resolution photorealistic video-to-video translation (github.com/nvidia)
303 points by davedx on Sept 22, 2018 | hide | past | favorite | 127 comments



I'm probably being captain obvious here, but if this is what's being released for free, I wonder how much better a polished commercial version does, and when we reach the point where we can't trust anything we see anymore. It doesn't even have to be super perfect, even reaching the point where it takes experts about two weeks to determine if something's real or not might already be long enough to do great damage.

From a technical standpoint I think this is very impressive, and I'm also interested in creative/artsy use of this. Their "replace trees by houses" example is pretty dull but gives a good glimpse at what can be done.


AI can also be used to identify these fake videos, and it is probably better at it than humans. There will be a rise of AI forensics.

The type of neural network they use (a GAN) works by having two networks battle each other: one (the generator) tries to produce fakes, and the other (the discriminator) tries to identify them. It's a constant arms race; as the generator gets better, the discriminator gets better too. Which means that if the fake video is this good, there must be a discriminator network that identifies fakes just as well.

We did a similar project using a GAN, generating images from a text description. You can see the generator and discriminator battling each other, and both getting better over time.

https://github.com/yonkshi/text2imageNet
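To make the arms race concrete, here's a toy sketch (nothing like vid2vid's architecture; all the numbers and the 1-D setup are illustrative assumptions): a "generator" G(z) = t0 + t1*z tries to imitate real data drawn from a Gaussian around 4, a logistic-regression "discriminator" tries to tell real from fake, and the two take alternating gradient steps.

```python
import numpy as np

# Toy 1-D GAN: generator vs. discriminator, alternating gradient steps.
rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

t0, t1 = 0.0, 1.0        # generator params: G(z) = t0 + t1*z (starts far off)
w, b = 0.0, 0.0          # discriminator params: D(x) = sigmoid(w*x + b)
lr, batch = 0.03, 64

for step in range(3000):
    x_real = rng.normal(4.0, 1.0, batch)   # "real" data around 4
    z = rng.normal(0.0, 1.0, batch)
    x_fake = t0 + t1 * z

    # Discriminator ascent on log D(real) + log(1 - D(fake)).
    dr, df = sigmoid(w * x_real + b), sigmoid(w * x_fake + b)
    w += lr * (((1 - dr) * x_real).mean() - (df * x_fake).mean())
    b += lr * ((1 - dr).mean() - df.mean())

    # Generator ascent on log D(fake) (the non-saturating objective).
    df = sigmoid(w * x_fake + b)
    g = (1 - df) * w
    t0 += lr * g.mean()
    t1 += lr * (g * z).mean()

print(round(t0, 1))  # typically lands near the real mean of 4
```

Each side's improvement is exactly what forces the other to improve, which is the arms-race dynamic the parent describes.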


This is how their model was trained, but I think what you've said may not quite be the case.

Because the discriminator (D) and generator (G) usually compete in a minimax game, the equilibrium probability of D correctly classifying an image as fake tends to 1/2 (ignoring distributional factors). If the competing networks have enough capacity and can be stably trained, then in theory they will reach equilibrium as the data distribution from G converges to the actual data distribution. If this is the case, then the discriminator correctly identifies fake videos with a probability of 1/2.

They may not reach equilibrium (making D > 0.5), but it's not clear that the discriminator itself is a panacea for identifying fake videos/images.
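The 1/2 figure comes from the standard derivation in the original GAN paper: for a fixed generator, pointwise maximization of the discriminator's objective gives

```latex
V(D) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)]
     + \mathbb{E}_{x \sim p_g}[\log(1 - D(x))]
% pointwise, a \log y + b \log(1-y) is maximized at y = a/(a+b), hence
D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}
% and at equilibrium, where p_g = p_{\mathrm{data}},
D^{*}(x) \equiv \tfrac{1}{2}.
```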


The whole GAN thing is useless here, though, if neither model has a concept of symmetry, or of the fact that those pixels represent actual solid objects in space. Modern neural nets can't represent abstract spatial and semantic concepts or reason, so the videos are full of glaring perspective inconsistencies and large-scale artifacts. So nope: AI can't help "identify fake videos", nor do we need AI for that.


The problem is that it's a game of superhuman cat and mouse.

We'll have systems arguing with each other and no way to tell which is correct. If, for example, someone is able to get a copy of the forensic AI model, they can train their decoder/generator to work against it until the results pass as legitimate. With no human ability to argue with the results of the forensics AI, we'll just trust it and pass it off as truth.

If you can clone the jury and conduct a quadrillion private trials, your chance of success in court is going to increase substantially.

Things gonna get creepy.


It could also lead to crimes like blackmail becoming extinct. It would be hard to hold incriminating recordings over anyone if near-perfect audio and video synthesis were common.

Especially for public figures with lots of training data available.


Yeah... The problem is trying to explain deepfakes to your significant other when they are randomly sent what looks like a video of you cheating on them. Sure, a fake is possible - but not likely.


It's all a matter of cultural awareness. Everyone now thinks when seeing an unlikely photo -- "Photoshop?"

This stuff even has the catchy name "deep fake".


Maybe but only after extensive proliferation.

Initially, every Tom, Dick & Harry script kiddie will try to blackmail people while the public and the justice system aren't really aware of what the technology makes possible.

When enough of the public eventually becomes aware not to trust photos and videos anymore, these types of blackmail attempts will be less effective. But there will still be less informed, gullible targets.


Surely lawyers will know about the tech soon enough to simply show the same video evidence with the judge in place of the defendant -- reasonable doubt right there.

The blackmail angle I expect will work for a while. People falsely accused of sexual crimes can have their lives ruined by the accusation.


The justice system will hopefully catch up quickly.

But as you say the media (real and social), political propaganda, etc will exploit the uninformed masses for all it is worth, so blackmail will still work. For a while.


Someone's gonna come up with a simple way to sign pics and videos with public-key cryptography, after which signed media will be the only stuff that's trusted. Of course, people will still make sex tapes and sign them when they're drunk, because the signing software will be automatically integrated into the camera app, so blackmail is still a possibility.


But... if the camera app automatically signs pics/vids, that would mean a private key is available to the app without any passphrase (or one embedded in the app :O). So why not just extract the key and sign your fake vid?


Well, for something like this the user's signing key itself would probably be managed by the OS, so not extractable. When the app is done editing, it asks the user via the OS to approve the result and the OS performs the signature. You could even embed a downsample of the original that came signed from the camera hardware.
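The mechanics can be sketched with textbook RSA on the classic toy primes (this offers zero real security and is purely illustrative; a real design would use a hardware-backed key and a modern scheme like Ed25519). The idea is that the "OS/hardware" holds the private exponent and apps only ever receive finished signatures:

```python
import hashlib

# Toy "camera signing" sketch: textbook RSA on a tiny key.
# The small primes are the classic example values; do not use for anything real.
p, q, e = 61, 53, 17
n = p * q                      # public modulus (3233)
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent, held by the "OS/hardware"

def digest(data: bytes) -> int:
    # Reduce a SHA-256 hash into the tiny message space of this toy key.
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

def sign(data: bytes) -> int:
    # Only the OS/hardware can do this, since only it knows d.
    return pow(digest(data), d, n)

def verify(data: bytes, sig: int) -> bool:
    # Anyone can check with just the public pair (n, e).
    return pow(sig, e, n) == digest(data)

frame = b"camera frame bytes"
s = sign(frame)
print(verify(frame, s))             # True
print(verify(frame, (s + 1) % n))   # False: any forged signature fails
```

The extraction problem the parent raised disappears only if `d` genuinely never leaves the secure element; the approval dialog then gates *what* gets signed, not *whether* signing is possible.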


If the key is managed by the OS but is inaccessible to the user, the concept would seem to be incompatible with free software operating systems. Also, if the camera app can have anything it "makes" approved, the app itself could take a deepfake video (from the web or device storage) and have the OS sign it.

The only way I see it working is if the key is burnt into the camera hardware and applications cannot MitM it.


Why would you assume it would be inaccessible to the user? I'd expect a key management interface at least. I guess I did say "not extractable", but I meant more that random apps don't have access to it directly, but they call an api to do the signing.

> the app itself could take a deepfake video and have the OS sign it

Note how I said "it asks the user via the OS to approve the result". I would expect a modal OS dialog to let the user review and approve the content before being signed and passed back to the app.

Thinking about it, there's actually nothing stopping this from happening on today's hardware using just application sandboxing. Substitute "OS" above with "Signing App" that does the same thing (accepts media signature requests from other apps, and opens dialog to request approval from user with a preview).


That's equally horrifying in a different way.


Also, map images pulled from facebook to the bodies of pornstars. The creepiness and invasion of ... personal image(?) this enables is horrifying.


That's already happened. Search for deep fakes. We already have face substitution in videos which is working surprisingly well sometimes.

There's been r/deepfakes where around 30% of the content was porn with swapped faces. It was banned though.


It was pretty limited, though. For split seconds the videos seemed real, but the seams showed soon enough.

Fake celebrity porn has been on the internet since 1996 at the very least. It's always been crummy; but porn in general requires a thick suspension of disbelief and an intense focus on a partial object of desire (what Lacan calls the objet petit a) that blurs everything else.


DRM will be pushed hard, starting from video/audio acquisition, perhaps assisted by blockchain to keep footage verified at all processing steps.


Don't you think that a blockchain that works for anything other than a rather useless currency should be created before suggesting one for such a use? I see comments all the time about how we should use blockchain for this and that, and yet far simpler uses for blockchain haven't yet worked out.


A blockchain is actually a reasonable part of the solution here (just not necessarily a new one). What a blockchain really provides is decentralized timestamping. Usually you combine that with a bit of cryptography to process flows of money, but in the same blockchain you can just write "I know hash XXXX", and if you later produce a work with that hash, you can prove it already existed at the point your message was written into the blockchain.

That's not all you would need for verification, but it is a big help.
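The commit-then-reveal scheme the parent describes is just hashing; a minimal sketch (where "publishing" the hash to a blockchain, notary, or newspaper is left abstract):

```python
import hashlib

def commitment(data: bytes) -> str:
    # The hash reveals nothing about the footage but pins it down uniquely.
    return hashlib.sha256(data).hexdigest()

footage = b"raw video bytes ..."
h = commitment(footage)
# Publish h anywhere that carries a trusted timestamp: a blockchain
# transaction, a notary, a newspaper ad. The footage itself stays private.

# Later: anyone holding the footage can re-derive the commitment, proving
# the footage existed no later than the published timestamp.
assert commitment(footage) == h
assert commitment(footage + b"edit") != h
```

Note what this does and doesn't prove: it timestamps the footage's existence, but says nothing about whether the footage was authentic when the hash was published.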


The sentiment and idea is there, just need the implementation.


Isn’t this something a lookalike and some good make up could already do for ages?


That would require an entire team's worth of skill, preparation and work. This can be done with a little already-existing media and a single person behind a computer.


Yeah, the Tweet I found this from had a similar sentiment:

https://twitter.com/PiratePartyINT/status/104296466807811686...

"Starting now, we cannot trust video or audio evidence. The ramifications for our legal & political systems will not be known for many years"


"Starting now"? How about 5 to 10 years ago? This is being released for free now, which indicates to me that this is now disposable tech and the authors have something much, much better in their labs.


Usually it's the other way around with ML researchers - what they show is better than what they have in the lab :)


Probably not; the GAN technique wasn't even published until 2014, and this is based on other work that's been done even more recently. The field moves fast and is more open than you might expect (because it's still so academic).


Not sure it works this way except in extremely specific areas of research (and I'm not even sure about that).

Publicly available research is usually the cutting edge in most areas of knowledge.


If you compare OSS or free software to commercial software generally, I don’t think there are that many massive gaps. It’s mostly polish and small incremental improvements, but the underlying tech is mostly the same. Why would that be different in this case?


The gaps are massive in several electrical/silicon CAD verticals and simulation. Due to IP secrecy, no OSS Verilog/VHDL synthesis alternative exists for Vivado/ISE/Quartus and the Intel/Xilinx line of FPGAs. I don't think any practical alternatives exist for Mentor Graphics' line of silicon design or simulation software, nor have I seen any OSS software capable of complex mixed-domain simulations like COMSOL or Ansys - many of the pieces exist, but it takes a lot of work to verify that algorithmic physical models actually work together.


On average I agree they are on par. Sometimes OSS is better (ffmpeg), sometimes commercial stuff (adobe after effects).

In this specific case I think it might be beneficial if you can spend a lot of money on gathering training material and tweaking the network. But then again, someone here mentioned the quality 4chan's fake porn has reached, so maybe I'm wrong after all.


This might actually be one of those edge cases where commercial is better, but the government's secret version is by far the best.


Like what, though? Historically, what tech did governments have that was unknown to commercial domains?

People are motivated by money, governments pay little and there’s no way to get rich working there so it doesn’t make that much sense.


Around WWII: computers, radar, nuclear power, crypto, jet aircraft, rockets.

Around 1960: space flight, supersonic flight.

More recently, hypersonic flight. And we don't really know what else, because people back in WWII had no idea either.


True but now all that stuff is pretty much produced in the private sector by government contracts with these big companies. DARPA doesn’t have better scientists than MIT and there are a lot of extremely intelligent scientists who have moral issues about working for the military.


MS Office is still miles ahead of any OSS “alternative“, despite the many glaring bugs.


MS Office is still miles ahead in the benchmark of opening its own proprietary files.


There is no competitor, proprietary or open, that comes close to Excel. It's been relentlessly, extensively polished for years and years, and keeps gaining new features every year.

And that's while sticking to the spreadsheet concept, which is very limiting.

---

Contrast for example Tableau -- it's a great idea and generated a lot of enthusiasm for a while, but never quite took off as an office package one needs to have. The normal awkwardness of its first versions is still there; they don't have the deep <whatever> that the Excel team has.


Tableau is great, but it has a much narrower use case: given one or more tables of data, generate graphs for presentation or for exploring the dataset.

In comparison, Excel can do that too (just worse), but it can also solve equations, do your company's bookkeeping, and pretty much every other task that relies mostly on numbers.

I would argue Open/LibreOffice Calc comes fairly close to Excel if you ignore the worse user interface (which is fair in the original assertion that it's "mostly polish and small incremental improvements")


> if you ignore the worse user interface

Considering that's a major part of "better", that's a big ask!


Yes, but given switching costs and habit formation, why would people care about something that's not strictly Pareto dominant?


Adobe have definitely bested most OSS competitors in their space. Although, with the amount of manpower at their disposal, they would be hard to beat.


The vast bulk of Adobe's advantage is in UX, not technical algorithms. Which makes perfect sense, because that tends to be the case with most F/OSS software - technically brilliant, but with a face only a programmer could love.

Yes, Adobe do have some remarkable algorithms that would be difficult to replicate (e.g. heal brush and content aware fill) but these are a small minority of Adobe's software advantage.

The one that irritates me the most is vector drawing programs: open source programs (and even paid competitors) just can't touch Adobe Illustrator for the sort of work I do. I'm sure at least 50 percent of it is familiarity and muscle memory, but I've desperately tried switching to a few different options like Inkscape or Affinity and left wildly disappointed.


You're right that UX is one of the biggest problems they have. One thing that's also hard to replicate is how well Adobe's software works together: embedding smart objects and Illustrator files in Photoshop documents, right-clicking a clip in Premiere and sending it to After Effects and back again without rendering an intermediate file, etc.

I would be interested in a Lightroom alternative if anyone can recommend one though.


For content-aware fill and heal brush, both G'MIC ( https://patdavid.net/2014/02/getting-around-in-gimp-gmic-inp...) and Resynthesizer ( http://www.logarithmic.net/pfh/resynthesizer ) work quite OK.


Gimp even had content-aware fill first. It was based on a SIGGRAPH paper, IIRC.


Do you use the Astute plugins for AI, or the native pen tool? Affinity feels different, maybe less precise, but the functionality was way better compared with the native Adobe tool. You should try Figma's pen tool; I like it.


My uses are relatively trivial: logo design, SVG generation, basic layout work and PDF tinkering. My main need is fine control of beziers, with auto-guides to ensure consistency.


I'm sure that point has already been crossed. People on 4chan are making fake porn that is nearly indistinguishable - and they are complete amateurs.


Are you sure they're amateurs, and not professionals doing it for fun (/pro boner/!?!)?


I meant something a little different. Any random dude with a mid-level PC can download the software and produce amazing results without any special hardware or knowledge. I recommend trying it out for yourself.


Has there been huge progress since the news about deepfakes broke early this year?

Around March/April this year, I actually did download the TensorFlow toolkit for face transfer that was used by the /r/deepfakes people and tested it out (samples of photos of politicians were included); the results were, at best, worse than what I could do in 2 minutes in Gimp. Maybe they could get better if I had an expensive GPU farm at my disposal, but I'm pretty doubtful - given that the news died down pretty quickly, and no reasonable-quality faked pictures or videos have been reported since.


It's more about experimenting with your training data and other configuration. From what I've heard, getting great results takes time - but it's possible. Most of the community's focus is on porn, so I can imagine not many journalists are checking the newest results and reporting on them.


Nvidia is probably not interested in this type of tool itself; this is more like "hey, look what you can do with a lot of Nvidia cards - buy a lot of them and you can do this and much more".


Yeah, but that was already the case with CGI. It's true that it's going to get easier and easier to make fake porn, fake speeches, fake voice recordings, fake videos...


We reached that point of not being able to trust any media a while ago. It's just getting easier and more widely distributed.


Honestly, you just need to convince one person to execute a successful social engineering attack.


From the paper: "we have to use all the GPUs in DGX1 (8 V100 GPUs, each with 16GB memory) for training. We distribute the generator computation task to 4 GPUs and the discriminator computation task to the other 4 GPUs. Training takes ∼10 days for 2K resolution."

As I don't have a DGX1 here, training the 2K resolution net for 10 days on a p3.16xlarge instance (also has 8 V100 GPUs) would cost USD 5875 on AWS. (USD24.48 per hour on-demand pricing * 24 hours/day * 10 days)


The DGX-1 costs $129,000, so AWS is cheaper unless you need to do it 22 times. And you can have multiple instances running at once and get all of your results in ten days, instead of waiting ten days again for each run.


Well, plus electricity. A DGX-1 draws 4 kilowatts or so, so that 10-day training run will use just about a megawatt-hour, or about $100 at retail. So the crossover point is more like 23 runs :)
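The arithmetic, using the thread's numbers plus an assumed $0.10/kWh retail electricity price:

```python
aws_hourly = 24.48                 # p3.16xlarge on-demand (2018 pricing)
run_hours = 24 * 10                # ten-day training run
aws_per_run = aws_hourly * run_hours          # ~$5875 per run on AWS

dgx_price = 129_000
kwh_price = 0.10                   # assumed retail electricity price
elec_per_run = 4 * run_hours * kwh_price      # ~4 kW draw -> ~$96 per run

# Smallest number of runs at which owning the DGX-1 beats renting:
n = 1
while n * aws_per_run <= dgx_price + n * elec_per_run:
    n += 1
print(n)  # 23
```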


Or you can build your own 8x2080Ti rig which will have 80% of performance for 1/10th of the cost.


The DGX-1V has $7,500 worth of CPUs alone to feed the GPUs. Throw in 8 TB of NVMe SSD for training data and you're looking at something more like 1/5th the cost.

V100 has ~50% higher memory bandwidth than 2080Ti, so you probably wouldn't get 80% of the performance. Also, only two 2080Ti can be connected via nvlink.


Yeah, but it does not look like something many hobbyists would try for fun ;-)


Their GitHub README says 24GB, and with 12/16 GB you need cropping and performance isn't guaranteed. I've only seen the P100 with 16 each, and it's the big Quadros that have 24.


Which is why it's nice that the pre-trained models are available.

If you want to train on your own dataset, that price does not seem unreasonable to me.


Seeing the example of one facial pose video translated into three different-looking women, I'm imagining a future where Netflix does A/B testing on its shows, using similar tech to swap in different "actors" to find which one resonates best with audiences.

They could even generate a new "cast" for each market, after only shooting the show once.


Porn industry will benefit the most.


Some applications for this kind of tech:

- Porn, yeah, first application you can think of, there are already some startups doing it.

- Doubling actors. Applied to sound, maybe you could translate from one language to another while kind of keeping the accent and tone.

- Propaganda and misinformation. Now you can get your enemy to say and do whatever you want, on video.

- Photo-realistic games. Create a rough 3D model of a scenario and train the AI on it. Instead of photo-realistic rendering with math, render with the AI from a rough render, in real time.


> Photo-realistic games. Create a rough 3D model of a scenario and train the AI on it. Instead of photo-realistic rendering with math, render with the AI from a rough render, in real time.

According to last month's Nvidia RTX presentation/launch event [1], they are going to do something similar quite soon. Games will ship with a DNN pre-trained offline on extremely high-quality renderings. The game itself renders at a lower resolution (limited by the performance needed for proper raytracing) and uses the DNN to upscale it.

[1] https://youtu.be/Mrixi27G9yM?t=51m15s


I wonder: since the NN cores of the GPU are used for real-time raytracing, will they be able to run custom NNs, possibly unrelated to visuals, in parallel with the raytracing work?

Edit: found the answer on the internet. Apparently the RT (raytracing) cores are separate from the Tensor (NN) cores on the RTX.


I think this is about trading storage for computation - you replace terabytes of model/texture data with a compute heavy NN.


Any ideas about porn startups working on anything related to this? I know Pornhub and their network banned deepfakes, and there's also some work being done on detecting them.



> Propaganda and misinformation. Now you can get your enemy to say and do whatever you want, on video.

This could open the market for video propaganda-detector applications.


A company that used this tech to actually change the lip movements of actors for each translation would stand to make a lot of money right now.


Regarding photo-realistic games, can anyone comment on how quickly the model executes?


I am a bit surprised how shallow the comments on this one are.

Look closely, while it does generate videos of a passing similarity, they aren't "photorealistic" in the slightest. They are good locally across time and space domains, but globally they are as far from realism as Doom 2 was.

The only explanation I see for the attitude in this submission is that most IT people trained themselves to spot CGI by looking at local artifacts, assuming that global artifacts won't happen because the stuff on the scene is reasonable. There is no "stuff on the scene" with these videos; it's just mindless vector manipulation with no underlying world model. Cars wave around, trees grow a foot from each other and behave in a way incompatible with 3D perspective.

Relax, it'll require at least another AI/ML revolution (or even several) to achieve photorealism.


What media would someone collect now to be used in the future to reproduce the likeness of loved ones? Video clips of them moving? Talking? Different poses of pictures? Reading the dictionary out loud to get vocal patterns?

Heck with impersonating the POTUS. What about a lost friend, sibling or parent?


Black Mirror did this I think - if we can make video, then why not VR (down the line if processing catches up), Second Life iterated.

Heaven on Earth?


Also see the film Marjorie Prime.

https://en.m.wikipedia.org/wiki/Marjorie_Prime


Wouldn't that be some kind of horrific emotional uncanny valley?


Yeah. https://www.youtube.com/watch?v=fkE6RBlfbXA This isn't the worst one, there was a time when PKD's daughter interviewed the head and he went off on a rant about how much he disliked his family. That was really rough.


Am I reading that right? It's making videos that look real from the simplistic input?

If so, that is amazing.

And if so, how do I turn a video I have into a simple/line version, to be able to then put a different 'skin' on it?


You could try OpenCV Canny Edge Detection: https://docs.opencv.org/3.1.0/da/d22/tutorial_py_canny.html

Example with video here https://www.youtube.com/watch?v=1Ndxtb0q76c

Obviously would need some tweaking but could be a good starting point


The level of realism can be gauged from the examples they provide right there on the page. Of course, your results may vary based on the initial bulk of realistic source images you use.

You have the code right there on Github, just install it on some PC with powerful GPUs (or rent one), tune some parameters, train the network and you can do the same things.


The hardware they used costs tons of money, and so does doing it on a cloud provider. It's not something to do in a weekend with your gaming card.


It is if you drop the resolution.


With edge detection. Normally edge detection means looking for local sharp changes in brightness and marking them with a white pixel. The edge detection used in this case looks more sophisticated to me; I don't know how it works.
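The "local sharp changes in brightness" version can be sketched in a few lines of numpy: a crude gradient-magnitude detector, nothing like whatever Nvidia actually used (real pipelines like Canny add smoothing, non-maximum suppression and hysteresis thresholding on top).

```python
import numpy as np

def edges(gray: np.ndarray, thresh: float = 0.2) -> np.ndarray:
    # Mark pixels whose local brightness changes sharply, relative to
    # the strongest gradient in the image.
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    return (mag > thresh * mag.max()).astype(np.uint8) * 255

# A synthetic frame: dark left half, bright right half.
frame = np.zeros((64, 64))
frame[:, 32:] = 1.0
edge_map = edges(frame)

# Only the vertical boundary lights up.
print(sorted({int(c) for c in np.nonzero(edge_map)[1]}))  # [31, 32]
```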


Looks just like edge detection + contouring from RGBD data.


Yes. Also, you might want to look up DLSS - they use a pretrained upscaling network on the GPU to generate a 4K picture from native 1440p, an instant ~20-50% performance bump with free AA.

https://www.youtube.com/watch?v=MMbgvXde-YA

Of course, this being Nvidia, they didn't implement it universally; you need to sign up for API access to black-box GameWorks-like scam programs in order to implement it in your game.


I'm scared. I can't trust anything I haven't taken myself. The problem is that other people don't even know this technology exists, and if you tell them about it, you're a liar.


I don't think there's going to be a long period where this technology is being used and isn't widely known about. It'll be used extremely quickly to abuse people using their Facebook pictures, and once there's a Facebook angle the media will be able to run with it and everyone will get it.


It's already been in use for quite a while, and you talk as if you aren't aware of the fact. Quite ironic.


This is why some people believe that Assange had been dead for more than a year.


Can't wait for someone to put The Simpsons through this.


Is there a way a person could create a QR code signed with a private key? The QR code would contain the date, time and other metadata, verified with a public key, proving the QR code in the video was made by, for example, the person in the video. This would "prove" the video was real? I guess the speech could still be manipulated, so the whole transcript would have to be signed this way too...


I don't know why you're talking about QR codes and transcribing/encrypting the audio when you can just sign the file:

    gpg --sign video.mp4


I assume they are talking about signing/watermarking a video in a way that survives video encoding/lossy transmission. A QR code probably wouldn’t work for various reasons but is an easy mental analogy to think of.


For embedding, you should just put it in the metadata. Encoding it in the video itself... I don't really see the point.


It would be interesting to see the applications of this technology with the use of thermal cameras. Extracting environments from thermal imaging would be nice.


One of the examples translates a full human pose into a video of a dancer. If the network were trained on facial pose(?) / features only, would that recreate something like the facial reenactment in http://niessnerlab.org/projects/thies2016face.html (the source code for Face2Face is not public)?


Look into Deepfake, that's the tool 4chan is using for face swapping in their fake porn


Yes, but it just cuts out the face and pastes it on a different person/background. It does not do full reenactment where you keep the entire target video environment.


If you have any doubts that the face synthesis one is faked (faked fake?), watch the face of the woman in the bottom left as it loops.

https://github.com/NVIDIA/vid2vid/blob/master/imgs/face.gif


I'm not following - can you explain?


Her(?) face morphs slightly in the first few frames.


Does anybody know a) the performance (e.g. introduced latency) and processor requirements on the client/input side (e.g. is real-time Canny edge detection good enough, and how fast would it run)? and

b) what the latency impact is on the NN side when building the images (e.g. how many ms are we talking about)?

Thanks a lot!


Kind of surprised this stuff hasn't totally blown away traditional video compression techniques yet.


This got me thinking: in the future we will probably stop streaming video and instead just send simple vectors. The data then gets rendered into actual video in real time. Customers will even be able to customize the movie: pick their favorite actors, the environment, etc.


I suspect a combination of the two is where it's at, i.e. store a lossy classically-compressed version, then remove the artifacts / dream up details with deep learning.


Or pretrain the network on the movie you want to compress, then ship the cartoon-compressed version + the trained network.


Right, that was exactly what I was thinking.


As someone not into AI / machine learning:

Can we expect CUDA to become the x86 of PCs and servers? Literally all this work defaults to CUDA and Nvidia's libraries. I don't even see a contender trying to compete; I don't see AMD's ROCm being used or even mentioned anywhere.


I'm curious what would happen if somebody tried to impersonate the US president using this.

He would say it's fake, but who would believe him?

How exactly can computer scientists explain deepfakes to laymen?



Well, in this case the results are pretty good locally, but they have pretty obvious artefacts too. Especially in the synthesised road videos: look at the trees, or even more at the lane change in the linked video.


Can't you easily obfuscate that by making the video intentionally grainy and low-res and passing it off as "caught by CCTV" or "found footage"?


Probably much simpler to get a lookalike, and that's been possible for a long time.


You say 'pretty obvious artifacts' but that is because you have some clue about the process and know what you are looking for and are interested enough to look.

I tried pointing out to some friends some really bad artifacts in a video we watched, and they just could not grasp it. They couldn't see what I was seeing as it didn't look out of place to them. It isn't for lack of intelligence, they just didn't care enough to understand. That pretty much describes vast swathes of the population.

Show a video made with the above technique to anyone with strongly held political/ideological beliefs and an inclination to accept 'alternative facts' over actual facts, and it will spread like wildfire and be almost impossible to refute!


I think a layman would be perfectly capable of understanding "computers can now create fake videos so real you can't spot them". Not that that claim is quite true yet.


Chances are there would be corroborating evidence to the contrary, since it's not often the President does or says anything without multiple witnesses and cameras being involved. In that case, it might be easier to impersonate them through their social media accounts.

The real danger here, if anything, is impersonating common citizens or lower ranking government officials.


Are there any high resolution examples available? Am I just not finding them in the README?


holodeck programming step 1.


ouch, this hurts



