With that kind of technology, what are the key problems still to be solved before it gets massively applied to deepfakes? More specifically:
- how much data (pictures or video) of the "target" is needed to use this? Does it require specific lighting, a lot of different poses... or is it possible to just use some "online" videos (found on TikTok for example) or to record the "target" in the street with a phone? How hard is it to create a "virtual doppelganger"?
- once there is a "target" model, is it possible to use it in real time? How much power would it need? A small laptop? A big machine in the cloud? Only state-sponsored infrastructure?
It looks like this technology has real potential to "impersonate" anybody very efficiently.
I worked on a DARPA anti-deepfakes project up until spring 2021, so just before the real tidal wave of generative AI. At that time, the state of the art (of publicly known tech) required a few hours of target footage to train a passable deepfake. Since then, there have been huge advancements in the generalizability of models. I don't know how low the threshold is now, but it has gone from "only really feasible on celebs/politicians/folks with extensive video presence" to "feasible from a handful of videos". Like your average American's social media footprint.
You still need a pretty beefy rig (an array of multiple 4090 GPUs) to do convincing video generation in a non-glacial amount of time, but it's totally possible with readily available hardware.
The bigger problem is actually "cheapfakes": so many people are so confirmation-biased that they will readily amplify even poorly put-together disinformation.
It's not clear to me that you can have an unrelated target?
But that's a good question: can you take a canonical pose of Peter and make it perform an animation of Jenni's dance? Jenni will have breasts and hips. Those offsets in the texture map could be enough to throw it off, who knows?
At least for the work they did, it seems everything was trained for each subject separately. That's still useful, but it's obviously going to constrain the use cases for the technology.
I could be misunderstanding, however; I've only looked at it for the past 5 or 6 minutes.
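To make the retargeting question above concrete, here is a rough sketch of what driving one subject's canonical Gaussians with another subject's pose sequence might look like, assuming an LBS-style animatable avatar. This is not the paper's actual pipeline; all names and shapes below are hypothetical.

    # Hypothetical sketch: retarget "Jenni's" pose onto "Peter's" canonical
    # Gaussian avatar via linear blend skinning (LBS). Not the paper's code.
    import numpy as np

    def lbs_deform(canonical_means, skin_weights, bone_transforms):
        """Deform canonical Gaussian centers with per-bone rigid transforms.

        canonical_means : (N, 3)    Gaussian centers of Peter's canonical pose
        skin_weights    : (N, B)    per-Gaussian skinning weights (rows sum to 1)
        bone_transforms : (B, 4, 4) one frame of Jenni's pose as bone matrices
        """
        n = canonical_means.shape[0]
        homo = np.concatenate([canonical_means, np.ones((n, 1))], axis=1)      # (N, 4)
        # Blend the bone matrices per Gaussian, then apply them to each center.
        blended = np.einsum("nb,bij->nij", skin_weights, bone_transforms)      # (N, 4, 4)
        return np.einsum("nij,nj->ni", blended, homo)[:, :3]

    # Toy usage: with identity bone transforms the avatar stays in canonical pose.
    rng = np.random.default_rng(0)
    means = rng.normal(size=(1000, 3))
    weights = rng.random((1000, 24))
    weights /= weights.sum(axis=1, keepdims=True)
    pose = np.tile(np.eye(4), (24, 1, 1))
    assert np.allclose(lbs_deform(means, weights, pose), means)

The point of the sketch is that the skeleton transfers easily, but everything that makes Peter look like Peter (the canonical Gaussians, appearance/texture offsets) stays per subject, which is exactly where body-shape differences could throw the result off.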
For animations, video games already use motion capture, and an interesting thing is that you cannot ask your mocap actor to act naturally and expect the result to look good. For some reason natural motion is not desired in games: studios ask the actor to perform in a very specific way, and animation artists still have a lot of work tweaking the animations afterwards so that they look good.
It's for the same reason that acting on a theater stage doesn't just require you to project yourself into the role and try to embody the character: you also have to adapt your movements and expressions so they can be read by an audience sitting far away.
That's super interesting. Do you think a model could be trained to exaggerate the necessary motions, and that output could then be splatted? There is probably enough mocap data already to "rebuild" games with new characters using similar techniques.
There's so much cool work bringing realism to Gaussian sources that I think avatar-mediated collaboration will get across the uncanny valley, so that where you are and how well you can communicate are no longer related.
Also, if you like Degas, here is another state-of-the-art project in progress: VR-GS: A Physical Dynamics-Aware Interactive Gaussian Splatting System in Virtual Reality.
The second-to-last video shows a Gaussian splatting advantage I hadn't thought of: when Gaussians clip into each other, the failure is more gradual than when polygons do.
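Loosely speaking, that's because each splat contributes a soft, Gaussian-falloff opacity rather than a hard surface, and the compositing is a smooth blend (with \mu_i, \Sigma_i the projected 2D mean/covariance and o_i the learned opacity):

    \alpha_i(x) = o_i \exp\!\left(-\tfrac{1}{2}(x-\mu_i)^\top \Sigma_i^{-1}(x-\mu_i)\right),
    \qquad
    C = \sum_i c_i \,\alpha_i \prod_{j<i}(1-\alpha_j)

So when two splats interpenetrate, their contributions fade into each other instead of producing the hard intersection edge you get from z-tested triangles.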
Realism in more hands will make everyone a loveable artist. I look forward to being able to jump scare people with The Mummy face in video meetings until the last of their believability is shot.