One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing (nvidia-research-mingyuliu.com)
144 points by jonbaer on June 28, 2021 | 49 comments



Wow, this works pretty well.

Makes me think of that chapter in Infinite Jest where videoconferencing gets popular, until people start using "optimized" computer-rendered images instead of showing their actual faces, at which point everyone goes back to audio-only.


I wouldn't mind video conferencing with computer generated avatars. I don't video conference to know what the other person looks like, and in fact knowing what they look like just creates lots of unnecessary bias. I do it for the cues from their gestures, facial expressions, the direction they are looking, etc. With a good tracking setup that works perfectly well today with digital avatars.


Ah yes, the solution to bias: Hide everyone's faces. That'll teach everyone to celebrate our differences.


I like faces. Why deny a key part of being human?


[flagged]


Why is everything about race nowadays?

Anyway, your comment is bullshit. I'm a white guy (actually from Spain, so I don't know if I qualify as "white" from the USA point of view) living in Japan, so I am the discriminated-against minority here (and yes, I have been a victim of minor racism), and I agree with GP. We humans evolved for face-to-face conversation; if racism is the problem, hiding behind a mask is not the solution.


> Why is everything about race nowadays?

Because racism exists and is a real problem for lots of people.

You don't fix that by pretending race doesn't matter, because in the lived experience of all of those people, it matters, a lot. They would love it if it didn't matter, but that is not what happens in reality.


> I'm a white guy (actually from Spain, so I don't know if I qualify as "white" from the USA point of view)

No, Spanish-speaking countries are considered nonwhite.[1] Portugal and France are both white; Spain not so much.

Brazil is a hazier issue.

(The reason is that Spanish speakers in the US don't tend to be Spaniards, and the laws were written in a sloppy way.)

[1] Technically, the "Hispanic" category is not supposed to be mutually exclusive with the various race categories. But that is how it is treated.


So following this logic train, the only actual faces one may see would be white males. Everyone else gets a mask?


Maybe? But I'm going to get what I want.


Looking like a white male isn't enough if you go down that road. Accent, tone, and pitch of voice would also need to match.


That's much easier if you've spent enough time in the US and basically "sound white" (whatever that means) but don't look white.

But I imagine accent change could be another subject of a future deep learning project.


Would that then just be a proxy for not being a white male and present the same problems? I am not sure what this solves to prevent discrimination.


It doesn't prevent discrimination, it just allows me to do what I want to do in the short term, e.g. raise funding for startups or get dream jobs or whatever.

I mean if a VC hands me a term sheet or a hiring manager hands me a job and the ONLY thing I misrepresented is my face, I don't think they have any ethical grounds to retract their offer.

Solving the discrimination problem is another matter, and will take years if not decades; my dreams cannot wait for that.


Sincerely, I'm sorry you feel that you haven't achieved certain things because of skin colour/accent/etc. However, let me challenge that belief: if your assumption is wrong, i.e. your lack of desired success is not actually caused by things outside of your control, then perhaps you won't make the adjustments you may actually need in order to achieve those things. If your dreams truly cannot wait (as you said, and they shouldn't!), then imaginary AI solutions to not-catastrophic, not-life-ending barriers shouldn't be seen as relevant to your eventual success.

None of us can control skin colour, accent, connected families, or other benefits that others may have. But we can outwork the privileged, and today's world, with just a computer and internet connectivity, gives us all many more options than were previously imaginable.

I do sincerely wish you great luck and success, but I really think you shouldn't get stuck on things you can't control, and devote all your energy to things you can.


Oh yes of course. I just think it would be a super cool technical project to actually make this work real time in a videoconference, and there would be lots of use cases for it. Eliminating survey biases during interview-style surveys is another.


Good luck to you. I am not convinced people would be more likely to give money or a job to someone when they have only seen an avatar, but what do I know.


They don't have to know it's an avatar. For most jobs, if they make any decision based on how the candidate looks, they are in the wrong.

It's not really any different from make-up, which also hides your real face, just in a different way.


If you deceive someone by wearing a different face during an interview, I expect that you would not be hired. How do you prove you were the person interviewed?


There are plenty of ways to prove that. For example: state a public key during the interview, have the interviewer sign it with their key, do the same in reverse with the interviewer's key, and then demonstrate possession of the same keys in person.

Hell, how to do that can even be an interview question.
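
For what it's worth, a rough sketch of the signing step in Python (my own illustration, assuming Ed25519 keys via the 'cryptography' package; the exact exchange is up to the parties):

  # Hypothetical sketch of the mutual key-signing idea described above.
  from cryptography.hazmat.primitives.asymmetric import ed25519
  from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

  candidate_key = ed25519.Ed25519PrivateKey.generate()
  interviewer_key = ed25519.Ed25519PrivateKey.generate()

  # The candidate states their public key during the interview...
  candidate_pub = candidate_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)

  # ...and the interviewer signs it, attesting "this key was present in the interview".
  attestation = interviewer_key.sign(candidate_pub)

  # Later, in person: verify the attestation, then have the candidate prove
  # possession of the matching private key by signing a fresh challenge.
  interviewer_key.public_key().verify(attestation, candidate_pub)  # raises if forged
  challenge = b"prove you are the person who interviewed"
  proof = candidate_key.sign(challenge)
  candidate_key.public_key().verify(proof, challenge)  # raises if keys don't match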


That's a terrible answer to the hypothetical interview question, because your solution only works if the person who took the interview is not colluding with the person who wants the job. The interviewee could just email their private key to the job seeker.

The key signing only protects against some very uncommon threats.


This is no solution.

What happens when you need to meet face to face?


There is a crappy movie called "Surrogates" that goes down this path.


> at which point everyone goes back to audio-only

That defeats the entire purpose of using facial and body expressions that only video provides.

We already have video filters that remove wrinkles and blemishes in videoconferencing to make you look better.

Even if we replace ourselves entirely with computer-rendered images, they're still going to be reproducing our expressions, movements and gestures, which is what matters.


My phone already has a video call beautification setting built into the OS at the camera level.

We've been skirting the line for a while.

If I could, right now I absolutely would prefer to send a synthesized avatar rather than the real me. My desktop setup doesn't allow optimal camera placement with large monitors, but for maximum impact I ideally want to send my face making direct eye contact with the camera.


Is there a filter available that just fixes the apparent direction of the pupil, so the image is looking at the camera, without doing any other edits? That would be really useful.


Permutation City has another take, where people virtually meet one another but use masks to hide their emotions.


Inviting terms as usual:

When you upload, submit, store, send or receive User Content to or through the NVIDIA Research AI Playground, you give NVIDIA (and parties NVIDIA works with, including its affiliates, suppliers and customers) a worldwide license to use (including without limitation for neural network training), host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes), communicate, publish, publicly perform, publicly display and distribute such User Content. The rights you grant in this license are for the limited purpose of operating, promoting, and improving the NVIDIA Research AI Playground and content available to all users, and to develop new NVIDIA offerings. This license continues even if you stop using the NVIDIA Research AI Playground. The NVIDIA Research AI Playground may offer you ways to access, download, and remove content that has been provided, but make sure to keep your own back-up copies of your User Content. Also, the scope of services is limited and not all content in all formats can be loaded in the NVIDIA Research AI Playground.


The reason it's usual is that all sites with upload features need the right to host what you upload. Facebook, for example, has the exact same thing.

“Permission to use content you create and share. [...] when you share, post, or upload content that is covered by intellectual property rights on or in connection with our Products, you grant us a non-exclusive, transferable, sub-licensable, royalty-free, and worldwide license to host, use, distribute, modify, run, copy, publicly perform or display, translate, and create derivative works of your content”

https://www.facebook.com/legal/terms


Sounds like a general CYA license so that developers can try it out before an actual product is rolled out with a proper commercial license.


It doesn't sound like they are trying to sell it or somehow leverage it to make a profit? I guess 'develop new NVIDIA offerings' is kind of vaguely suspicious, but the rest doesn't seem onerous to me, unless I'm missing something?


Those license terms seem to me like they'd allow using someone's picture in unlimited advertising, for free. They also sound like they'll definitely be using them to train models. I wouldn't be surprised if some of that means they can sell them to advertisers for whatever nonsense.


My take is that they are limiting their liability in case anyone is overly litigious, or in case they mishandle the data. Yes, they could do the things you describe, but my guess is they don't want to get sued by the people they are providing a free service to.


That doesn't really change what the license they wrote allows though.


The side-by-side against other algos is pretty cool:

https://www.youtube.com/watch?v=nLYg9Waw72U&t=125s


I'm getting connection-refused errors on the request to http://54.186.34.220:443/face_vid2vid_rotate. Plain HTTP on port 443? Hm.


Interesting that they're running plaintext HTTP on the HTTPS port. Are some networks filtering port 80 on egress now, and this is how people get around it? Or are the devs just using some cloud setup that tries to force HTTPS when the builders of the service don't want it?
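
For the curious, a quick way to check whether that port actually speaks TLS (a rough sketch in Python; the IP is from the parent comment):

  # Probe whether the endpoint completes a TLS handshake on port 443.
  import socket, ssl

  ctx = ssl.create_default_context()
  ctx.check_hostname = False
  ctx.verify_mode = ssl.CERT_NONE  # we only care whether TLS is spoken at all

  try:
      with socket.create_connection(("54.186.34.220", 443), timeout=5) as sock:
          with ctx.wrap_socket(sock) as tls:
              print("TLS handshake OK:", tls.version())
  except OSError as exc:  # ssl.SSLError is a subclass of OSError
      print("No TLS on this port (probably plaintext HTTP):", exc)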


Surprisingly flexible algorithm. I threw a headshot illustration of a comic villain into it and it was able to successfully do all the rotations and even manipulate the eyes despite it not being photorealistic.


There was an app going around a few months ago called “wombo.ai” that would make a headshot sing. It had the same outcome when given non-human pictures as a source. Easily killed an afternoon trying out different things.


That was using something similar to the First Order Motion Model, not the same model NVIDIA has been using for Maxine or the subsequent improvements to it shown in the linked demo.


If you'll allow me, on the subject of such animated pictures, you might be interested in my pet project -- 1000 animated(!) fantasy avatar faces, 100% AI-generated. Check it out; it's free and beautiful, and it feels like next-gen.

https://www.cgtrader.com/free-3d-models/character/fantasy/3d...

I made it shamelessly with https://www.fantasy-faces.com/ for the GAN and myheritage.fr for the animation -- it was still a lot of work to select the 1000 most beautiful.

Ultimately, I did it for the lulz...


Interesting tool, but it's pretty ridiculous that the clientside JavaScript is obfuscated (not only minified) for what little it does.


While the client-side has input validation, it appears that the server-side does not, as I can edit the request body freely and it'll return accordingly.

It's interesting to see how the model fails at extreme values. I can see why they chose the cutoffs they did!


It seems to use the EXIF data to display the image, so if you're uploading a portrait-orientation photo you might want to strip that, or it breaks.
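
If anyone else hits this, a minimal sketch of stripping the orientation tag first (assuming Pillow; file names are made up):

  # Bake the EXIF orientation into the pixels, then re-save without metadata.
  from PIL import Image, ImageOps

  img = Image.open("portrait.jpg")
  img = ImageOps.exif_transpose(img)  # rotate pixels per the Orientation tag
  img.save("portrait_clean.jpg")      # save() without exif= drops the EXIF block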

It's pretty cool, but one thing that always bugs me: why is the demo page so bad?

For the majority of people, this is the only way they are going to see this thing working. Would it really have hurt to have an actual frontend dev knock something together for this? It takes away from the great work behind the scenes, imho.


This makes a request to a server to get the result back. The Hacker News hug of death has already happened.

I wish this were deployable in the browser so it was fully standalone.


https://www.youtube.com/watch?v=xzLHZbBvKNQ talks about https://developer.nvidia.com/maxine, which requires a GPU, so I'm assuming an RTX-level GPU will be required for full motion? The demo calls an API at what appears to be a static IP, which returns the end result from the algorithm.

The full paper is at https://nvlabs.github.io/face-vid2vid/main.pdf. (It only mentions a GPU once, for the training set.)

I'm quite impressed by how well NVIDIA Broadcast cleans up a simple webcam image already, on a 3070 GPU; the background blur will catch the gap between the headphone bridge and the head with sharp cuts. It's impressive enough in my book to warrant such a gaming-grade GPU for work purposes if you're a remote worker.

I have my cam off to the side; I'm really looking forward to being able to try the angle correction!


I wish they had done this client-side with e.g. TensorFlow.js. It would have been much more fun to play with.


I'm getting a "Failed to load resource: the server responded with a status of 500 (INTERNAL SERVER ERROR)". Looks like we broke it?


Unable to upload custom photos; I tried Chrome and Firefox, but both failed.


Nothing about this is nice.



