Stuff like this can obviously be used to make things like deepfakes 'better'. But I think it might be cool for creating virtual meeting rooms, where you can take a webcam shot of a person's face, normalize the skin tones for lighting, map it to a 3D surface, then relight it for the virtual room.
When you can rig the meshes to drive each other, you could wear 'masks' of other people's faces (or critters).
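For the relighting step, here's a minimal sketch in Python, assuming you already have a per-pixel albedo map and a depth map (e.g. from a decomposition like the one in the paper); the simple Lambertian model and the function names are just illustrative, not the paper's exact formulation:

    import numpy as np

    def normals_from_depth(depth):
        # Approximate surface normals from the depth gradients.
        dzdy, dzdx = np.gradient(depth)
        n = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
        return n / np.linalg.norm(n, axis=2, keepdims=True)

    def relight(albedo, depth, light_dir, ambient=0.3, diffuse=0.7):
        # Lambertian shading: ambient + diffuse * max(0, n . l), per pixel.
        n = normals_from_depth(depth)
        l = np.asarray(light_dir, dtype=float)
        l = l / np.linalg.norm(l)
        shading = ambient + diffuse * np.clip(n @ l, 0.0, None)
        return np.clip(albedo * shading[..., None], 0.0, 1.0)

    # albedo: HxWx3 in [0,1] (the "normalized" skin tones), depth: HxW,
    # light_dir: the virtual room's light direction, e.g. (0.3, 0.2, 1.0).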
> But I think it might be cool for creating virtual meeting rooms, where you can take a webcam shot of a person's face, normalize the skin tones for lighting, map it to a 3D surface, then relight it for the virtual room.
You might be interested in the HeadOn project at TU München.
Justus Thies gave a presentation at our university about a year ago. IIRC they don't use any fancy NN stuff; instead they extract the face geometry using a stereo camera and use interpolation to project the movement onto a target mesh. Stereo goggles for VR meetings and various other applications were discussed during the presentation, but the main focus was of course on entertaining the audience with fake videos.
On a side note: This is probably the closest to a real-life Max Headroom that we have so far. Not sure if that influenced the name.
These photogrammetric, structure-from-motion, structured-light, etc. techniques have been around a while, so I don't think this changes too much, though it may make it a bit easier to generate realistic depth maps for other purposes.
At some point you could probably reconstruct large portions of scenes in existing movies and change perspectives, especially if you had good techniques (perhaps AI-based) for filling occlusions in the data.
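As a rough sketch of what "changing perspectives" from a depth map involves (plain forward warping in Python; the intrinsics K and the new pose R, t are assumed inputs, and the hole filling an AI-based method would do is deliberately left out):

    import numpy as np

    def reproject(image, depth, K, R, t):
        # Warp an HxWx3 frame to a new camera pose using its per-pixel depth.
        h, w = depth.shape
        ys, xs = np.mgrid[0:h, 0:w]
        pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
        # Back-project pixels into 3D, move to the new camera, project again.
        pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
        pts_new = R @ pts + t.reshape(3, 1)
        proj = K @ pts_new
        u = np.round(proj[0] / proj[2]).astype(int)
        v = np.round(proj[1] / proj[2]).astype(int)
        out = np.zeros_like(image)
        ok = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (pts_new[2] > 0)
        out[v[ok], u[ok]] = image.reshape(-1, image.shape[-1])[ok]
        return out  # the black holes are the occlusions you'd still need to fill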
One of Vernor Vinge's novels describes 3D models transmitted as part of video conferencing, and re-skinning used to fool an adversary (while relying, I seem to remember, on a deliberately noisy, reduced-bandwidth connection to make it plausible that the model is imperfect).
Lots of people don't have perfectly good voices, and if you can copy the best voices, why settle for lesser ones? There are likely lots of voices people like more than yours.
To give an example, 15-kun recently built on the Pony Preservation Project to use neural nets to voice-clone, among others, My Little Pony voices, offering it as a service: https://fifteen.ai/ People have used it for all sorts of things: https://www.equestriadaily.com/2020/03/pony-voice-event-what... Suppose you want to do, say, an F1 commentary on Austrian GP 2019 (#4) - why do it with your voice if you can do it with Fluttershy's voice?
This will be the next evolution of streamers, especially Virtual Youtubers and their ilk.
Because reading from a script for five minutes is likely to require multiple takes for someone who isn't a practiced voice actor, while text to speech requires no extra effort on their part?
> Because reading from a script for five minutes is likely to require multiple takes for someone who isn't a practiced voice actor
This depends on how much you can tolerate speech errors. Most listeners will gloss over them, preferring the human voice to the speech synthesizer while not even really noticing the errors.
Could also have speech problems. Could be lazy. Could want to save time. Could be useful for producing consistent CC information across media. Could allow people to choose arbitrary voice synthesis in the future, which super-futurists may like the idea of. Could have used a translator to produce the text (I haven't listened) and not know English at all.
Personally, I'll take the human voice unless you literally cannot speak (e.g. disability) or feel uncomfortable.
Many academics like the ability to "compile" LaTeX, and probably want to "compile" their videos too, complete with an autogenerated script. That way, when they make a small change to their source, the new version will autogenerate a new video with an updated script.
I like a real voice, but I could see that if you were generating lots of videos, and doing so in multiple languages, you could abstract that away a bit and dynamically generate the videos these days. This is part of the reason video containers are often separated into components/layers (video and audio tracks). I don't see why you couldn't also have the subtitle data read and the audio generated dynamically based on language. Some of this probably already happens somewhere, by some group. Just an idea I found interesting, similar to composing documents with LaTeX etc. Think of how a lot of visual frameworks separate out a "presentation" layer, and imagine a similar structure for the audio. It's especially useful for videos where the speaker isn't visible, so syncing audio with lip movements across languages isn't a problem.
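A rough Python sketch of that subtitle-driven idea, just to make it concrete: pull the dialogue text out of an .srt file per language, hand it to whatever TTS you have (synthesize() below is only a placeholder, not a real library call), and mux the result back in with ffmpeg. Timing against the subtitle timestamps is ignored here, which is fine for the no-lip-sync case mentioned above.

    import subprocess

    def read_srt_text(path):
        # Keep only the dialogue lines of an .srt file (drop indices/timestamps).
        lines = open(path, encoding="utf-8").read().splitlines()
        return " ".join(l for l in lines
                        if l.strip() and not l.strip().isdigit() and "-->" not in l)

    def synthesize(text, lang, out_wav):
        # Placeholder: call your TTS backend of choice here and write out_wav.
        raise NotImplementedError

    def build(video, subs_by_lang):
        # One narration track and one output file per language.
        for lang, srt in subs_by_lang.items():
            wav = f"narration_{lang}.wav"
            synthesize(read_srt_text(srt), lang, wav)
            subprocess.run(["ffmpeg", "-y", "-i", video, "-i", wav,
                            "-map", "0:v", "-map", "1:a",
                            "-c:v", "copy", "-shortest", f"out_{lang}.mp4"],
                           check=True)

    # build("talk.mp4", {"en": "talk_en.srt", "de": "talk_de.srt"})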
My thought as well. The TTS is good enough that it won't take much of an accent before the accent is harder to understand than the TTS. I know my own accent is strong enough that I'd have to put in very conscious effort to be easier to understand than this video.
I did research on a speech-to-text-to-speech system, and lots of non-native English speakers were self-conscious about their speech and preferred text-to-speech that wasn't in the style of their original voice.
Also, it's much simpler to make changes to a publication video, since using your original voice requires re-recording with a high-quality microphone and post-processing to remove background noise.
Judging from SIGGRAPH videos/presentations, it's pretty common for graphics researchers who are non-native speakers to use text-to-speech or a native-speaking acquaintance for narration. I think it's done explicitly to help comprehension, although I think self-consciousness or fear of public speaking plays a role, too.
Well, I have a slight speech impediment that probably 1 in 1,000 people notices maybe once in a while, and then thinks they must be mistaken when speaking to me in person, but if they were listening to a video they might hear it more clearly.
In addition to the possible reasons stated by peer comments, revisability. One-click builds are just as attractive for writing as they are for programming.
Please fix the title of the submission. It is "Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild". No anime examples to be found in the paper :(