Stuff like this can obviously be used to make things like deepfakes 'better'. But I think it might be cool for creating virtual meeting rooms, where you can take a webcam shot of a person's face, normalize the skin tones for lighting, map it to a 3D surface, then relight it for the virtual room.
When you can rig the meshes to drive each other, you could wear 'masks' of other people's faces (or critters).
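For the relighting step, here's a minimal sketch in Python, assuming you already have a per-pixel albedo map and a depth map (e.g. from a decomposition like the one in the paper); the simple Lambertian model and the function names are just illustrative, not the paper's exact formulation:

    import numpy as np

    def normals_from_depth(depth):
        # Approximate surface normals from the depth gradients.
        dzdy, dzdx = np.gradient(depth)
        n = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
        return n / np.linalg.norm(n, axis=2, keepdims=True)

    def relight(albedo, depth, light_dir, ambient=0.3, diffuse=0.7):
        # Lambertian shading: ambient + diffuse * max(0, n . l), per pixel.
        n = normals_from_depth(depth)
        l = np.asarray(light_dir, dtype=float)
        l = l / np.linalg.norm(l)
        shading = ambient + diffuse * np.clip(n @ l, 0.0, None)
        return np.clip(albedo * shading[..., None], 0.0, 1.0)

    # albedo: HxWx3 in [0,1] (the "normalized" skin tones), depth: HxW,
    # light_dir: the virtual room's light direction, e.g. (0.3, 0.2, 1.0).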
> But I think it might be cool for creating virtual meeting rooms, where you can take a webcam shot of a person's face, normalize the skin tones for lighting, map it to a 3D surface, then relight it for the virtual room.
You might be interested in the HeadOn project at TU München.
Justus Thies gave a presentation at our university about a year ago. IIRC they don't use any fancy NN stuff; instead they extract the face geometry using a stereo camera and use interpolation to project the movement onto a target mesh. Stereo goggles for VR meetings and various other applications were discussed during the presentation, but the main focus was of course on entertaining the audience with fake videos.
On a side note: This is probably the closest to a real-life Max Headroom that we have so far. Not sure if that influenced the name.
These photogrammetric, structure-from-motion, structured-light, etc. techniques have been around a while, so I don't think this changes too much, though it may make it a bit easier to generate realistic depth maps for other purposes.
At some point you could probably reconstruct large portions of scenes in existing movies and change perspectives, especially if you had good techniques (perhaps AI-based) for filling occlusions in the data.
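As a rough sketch of what "changing perspectives" from a depth map involves (plain forward warping in Python; the intrinsics K and the new pose R, t are assumed inputs, and the hole filling an AI-based method would do is deliberately left out):

    import numpy as np

    def reproject(image, depth, K, R, t):
        # Warp an HxWx3 frame to a new camera pose using its per-pixel depth.
        h, w = depth.shape
        ys, xs = np.mgrid[0:h, 0:w]
        pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
        # Back-project pixels into 3D, move to the new camera, project again.
        pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
        pts_new = R @ pts + t.reshape(3, 1)
        proj = K @ pts_new
        u = np.round(proj[0] / proj[2]).astype(int)
        v = np.round(proj[1] / proj[2]).astype(int)
        out = np.zeros_like(image)
        ok = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (pts_new[2] > 0)
        out[v[ok], u[ok]] = image.reshape(-1, image.shape[-1])[ok]
        return out  # the black holes are the occlusions you'd still need to fill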
One of Vernor Vinge's novels describes 3D models transmitted as part of video conferencing, and re-skinning used to fool an adversary (while relying, I seem to remember, on a deliberately noisy, reduced-bandwidth connection to make it plausible that the model is imperfect).
Lots of people don't have perfectly good voices, and if you can copy the best voices, why settle for lesser ones? There are likely lots of voices people like more than yours.
To give an example, 15-kun recently built on the Pony Preservation Project to use neural nets to voice-clone, among others, My Little Pony voices, offering it as a service: https://fifteen.ai/ People have used it for all sorts of things: https://www.equestriadaily.com/2020/03/pony-voice-event-what... Suppose you want to do, say, an F1 commentary on Austrian GP 2019 (#4) - why do it with your voice if you can do it with Fluttershy's voice?
This will be the next evolution of streamers, especially Virtual Youtubers and their ilk.
Because reading from a script for five minutes is likely to require multiple takes for someone who isn't a practiced voice actor, while text to speech requires no extra effort on their part?
> Because reading from a script for five minutes is likely to require multiple takes for someone who isn't a practiced voice actor
This depends on how much you can tolerate speech errors. Most listeners will gloss over them, preferring the human voice to the speech synthesizer while not even really noticing the errors.
Could also have speech problems. Could be lazy. Could want to save time. Could be useful for producing consistent CC information across media. Could allow people to choose arbitrary voice synthesis in the future, which super-futurists may like the idea of. Could have used a translator to produce the text (I haven't listened) and not know English at all.
Personally, I'll take the human voice unless you literally cannot speak (e.g. disability) or feel uncomfortable.
Many academics like the ability to "compile" LaTeX, and probably want to "compile" their videos too, complete with an autogenerated script. That way, when they make a small change to their source, the new version will autogenerate a new video with an updated script.
I like a real voice, but I could see that if you were generating lots of videos, and doing so in multiple languages, you could abstract that away a bit and dynamically generate the videos these days. This is part of the reason video containers are often separated into components/layers (video and audio tracks). I don't see why you couldn't also have the subtitle data read and the audio generated dynamically based on language. Some of this probably already happens somewhere, by some group. Just an idea I found interesting, similar to composing documents with LaTeX etc. Think of how a lot of visual frameworks separate out a "presentation" layer, and imagine a similar structure for the audio. It's especially useful for videos where the speaker isn't visible, so syncing audio with lip movements across languages isn't a problem.
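A rough Python sketch of that subtitle-driven idea, just to make it concrete: pull the dialogue text out of an .srt file per language, hand it to whatever TTS you have (synthesize() below is only a placeholder, not a real library call), and mux the result back in with ffmpeg. Timing against the subtitle timestamps is ignored here, which is fine for the no-lip-sync case mentioned above.

    import subprocess

    def read_srt_text(path):
        # Keep only the dialogue lines of an .srt file (drop indices/timestamps).
        lines = open(path, encoding="utf-8").read().splitlines()
        return " ".join(l for l in lines
                        if l.strip() and not l.strip().isdigit() and "-->" not in l)

    def synthesize(text, lang, out_wav):
        # Placeholder: call your TTS backend of choice here and write out_wav.
        raise NotImplementedError

    def build(video, subs_by_lang):
        # One narration track and one output file per language.
        for lang, srt in subs_by_lang.items():
            wav = f"narration_{lang}.wav"
            synthesize(read_srt_text(srt), lang, wav)
            subprocess.run(["ffmpeg", "-y", "-i", video, "-i", wav,
                            "-map", "0:v", "-map", "1:a",
                            "-c:v", "copy", "-shortest", f"out_{lang}.mp4"],
                           check=True)

    # build("talk.mp4", {"en": "talk_en.srt", "de": "talk_de.srt"})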
My thought as well. The TTS is good enough that it won't take much of an accent before the accent is harder to understand than the TTS. I know my own accent is strong enough that I'd have to put in very conscious effort to be easier to understand than this video.
I did research on a speech-to-text-to-speech system, and lots of non-native English speakers were self-conscious about their speech and preferred text-to-speech that wasn't in the style of their original voice.
Also, it's much simpler to make changes to a publication video, since using your original voice requires re-recording with a high-quality microphone and post-processing to remove background noise.
Judging from SIGGRAPH videos/presentations, it's pretty common for graphics researchers who are non-native speakers to use text-to-speech or a native-speaking acquaintance for narration. I think it's done explicitly to help comprehension, although I think self-consciousness or fear of public speaking plays a role, too.
Well, I have a slight speech impediment that probably 1 in 1,000 people notices maybe once in a while, and then thinks they must be mistaken when speaking to me in person, but if they were listening to a video they might hear it more clearly.
In addition to the possible reasons stated by peer comments, revisability. One-click builds are just as attractive for writing as they are for programming.
Please fix the title of the submission. It is "Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild". No anime examples to be found in the paper :(