I appreciate this is research, but I wonder: are these gestures actually semantically distinct information that the model is better at extrapolating from audio than the listener is? Or are they just redundant visual cues that perhaps relieve some cognitive load when communicating with someone?
I'm apprehensive about accepting nonverbal communication that a model has appended to a human source.
I was also shocked. I think Meta is making this for their metaverse project, and their internal Facebook data shows that most people most of the time talk like complete idiots, barely able to mash a thought out of their mouth. I’ve never heard more inarticulate training data.
Who came up with these conversations? Hard to believe this was part of the training data. And where does the training data come from? To what extent is it applicable to academic or corporate settings?
This is one of the coolest things I've seen that I also cannot understand... why? Aren't you going to need to tune it on yourself? Otherwise you're going to adopt the gesticulation of whoever it was trained on. Maybe for videogames? Or NPCs in VR environments? But doesn't that become robotic, so we end up back in the uncanny valley once we've normalized it? I mean, the network __has__ to do significant amounts of memorization, unless the microphone can conceivably pick up a signal that actually corresponds to the 3D spatial movements (could be possible, but this doesn't seem to be that). Maybe that's what they're working towards and this is an iteration along the way?
It's technologically impressive, but I'm failing to see the use. Can someone else enlighten me? I'm sure there's something I'm failing to see.
Why do so many of these new demos require old versions of CUDA? It's quite annoying having to juggle the installs and disk usage of CUDA 11.6, 11.7, 11.8, 12.1, 12.2, and 12.3.
Because new CUDA versions often come with new bugs, so as a researcher with a publication deadline, you're incentivised to never upgrade. I still vividly remember how messy it was to downgrade a large TensorFlow project to a previous version which still supported an older CUDA, because the then-current cuBLAS couldn't do matrix multiplication without silent memory corruption (I believe it was the 9.x to 10.0 jump that broke it). And we lacked the connections to talk directly to any engineer at NVIDIA, which meant that nobody even confirmed the bug's existence after we submitted a GitHub issue. Eventually we found a way to trigger a crash on Google Colab using the bug, someone from Google filed it again with NVIDIA, and things finally got fixed, about six months after the research paper was finished.
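For what it's worth, the regression test that would have caught it early is cheap to write. A minimal sketch, assuming PyTorch as the frontend (the helper name here is mine, not from any library): compare the GPU matmul against a CPU reference and flag anything beyond rounding noise.

```python
# Rough sanity check for silent GPU matmul corruption: compare the cuBLAS result
# against a CPU reference. Purely illustrative; assumes a CUDA-capable PyTorch build.
import torch

def gpu_matmul_looks_sane(n: int = 2048, rel_tol: float = 1e-3) -> bool:
    torch.manual_seed(0)
    torch.backends.cuda.matmul.allow_tf32 = False  # keep both paths in plain float32
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    cpu = a @ b                        # trusted reference on the CPU
    gpu = (a.cuda() @ b.cuda()).cpu()  # result under test (goes through cuBLAS)
    rel_err = ((cpu - gpu).abs().max() / cpu.abs().max()).item()
    print(f"max relative difference: {rel_err:.2e}")
    return rel_err < rel_tol           # tiny rounding noise is fine; big gaps are not

if __name__ == "__main__":
    assert torch.cuda.is_available(), "needs a CUDA device"
    print("cuBLAS matmul looks sane:", gpu_matmul_looks_sane())
```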
Because it takes time to write a paper and publish it. And researchers usually don't care about code, so the CUDA version used when starting the project never gets upgraded.
When it was brought up a bit ago, I thought I was aphantasic. It turned out I'm just poorly trained at it - I don't tend to rely on memory for physical details, and I'm aggressively poor at visual arts. With a little thought put into it, though, I realized I can "See" in my "Mind's Eye", and even with some active training I'm still not good at it - it's not a tool I reach for naturally - but I can "Imagine" some images if I try at it.
Look at something. Close your eyes. Retain the image. Try to "look" at parts of the image. Focus on parts of the image. Then... change them. No, that apple was green, not red. You know what a red apple looks like, you know what a green apple looks like; Focus on the overall feel of the rest of the image while the apple is different.
It's not easy; it's a trained skill, though I guess one that's pretty easy to acquire accidentally by daydreaming and doing art. I think I actually trained myself out of it as part of "safety" training: I stopped relying on mental models of many things and started forming mental models of the checklists for getting them into that state.
Pretty cool. It's going to take a while to make it into a usable product though. Having conversations with people flailing their hands algorithmically is going to feel weird until it gets more natural. Right now it feels like those "blink every n" scripts.
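To show what I mean by that, here's a toy sketch (made-up numbers, not from the paper or any real engine) of what a "blink every n" script amounts to next to one with randomized gaps; the metronomic version is exactly the rhythm people clock as fake.

```python
# Toy illustration of why fixed-interval "blink every n" scripts read as robotic:
# a metronomic timer vs. randomized intervals. Values are purely illustrative.
import random

def fixed_blinks(duration_s: float = 30.0, every_s: float = 4.0) -> list[float]:
    # Blink at exactly t = 4, 8, 12, ... - viewers pick up on the regularity fast.
    return [t * every_s for t in range(1, int(duration_s / every_s) + 1)]

def jittered_blinks(duration_s: float = 30.0, mean_gap_s: float = 4.0) -> list[float]:
    # Draw gaps from an exponential distribution so the rhythm never quite repeats.
    times, t = [], 0.0
    while True:
        t += random.expovariate(1.0 / mean_gap_s)
        if t > duration_s:
            return times
        times.append(round(t, 2))

print("fixed:   ", fixed_blinks())
print("jittered:", jittered_blinks())
```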
It's incorrect to generalize like this. NPCs may use broad procedural rules, full mocap, or really anything in between.
Yes, this could apply to them. No, it doesn't mean it always does.
This also doesn't really add to the point that it wouldn't be suited to human conversation. You rarely use NPCs to have human-like chats, especially for several minutes. The few games that do, like Mass Effect or LA Noire, use mocap to avoid the effect I'm referring to.
Like with audio, the challenge is going to be doing this in real time. You can have a conversation with e.g. ChatGPT, but it needs a few seconds to process what you say, come up with an answer, and then talk to you. It's a bit awkward. Imagine a version of that with an AI avatar like this. It would probably require quite a bit of performance improvement to get to a level of responsiveness that feels natural (a rough latency budget is sketched below). And real-life conversations are not just about talking realistically but also about nonverbal communication while listening. So it would have to adapt to what you are saying while you're saying it for it to be completely natural.
That being said, this is pretty awesome for a lot of use cases that are less interactive.
Being able to have an avatar that fits your voice without having to actually look like that has many applications.
Whether you're trans or you just want to join a video call early in the morning without dressing up, the applications are endless.
In many situations we demand that people dress or present a certain way, just out of bullshit social expectations. This is one way to eat your cake and have it too.
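To put rough numbers on the responsiveness point above: here's a minimal sketch with hypothetical per-stage latencies (assumptions of mine, not measurements of this system or any real product).

```python
# Back-of-envelope latency budget for an interactive talking avatar.
# All per-stage numbers are hypothetical assumptions, not measurements.
stages_ms = {
    "streaming speech recognition (tail)": 200,
    "language model (time to first token)": 500,
    "text-to-speech (first audio chunk)": 200,
    "gesture/face synthesis + rendering": 100,
}
total_ms = sum(stages_ms.values())
for stage, ms in stages_ms.items():
    print(f"{stage:40s} {ms:4d} ms")
print(f"{'time until the avatar starts responding':40s} {total_ms:4d} ms")
# Gaps between turns in human conversation are typically around 200-300 ms,
# so a budget like this still reads as a noticeable pause.
```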
I get that this is to drive the avatar, but I'm curious as to why audio. There are stronger signals in video: I'm fairly certain that even a very low resolution image stream would convey movement better than audio does. So either the network is memorizing a lot (which is fine, but limited), or this is an iteration towards a high-sensitivity audio-driven setup where the sound itself carries precise 3D spatial information, or something else. I mean, the Quest has cameras in it, so why not use those? Computation? These aren't big models (the largest is 1.42 GB, the smallest 0.58 GB).
For those use cases you should be able to get much more accurate results using a base video stream. This better fits use cases where you're lacking a video stream entirely, not just ones where you don't want to turn it on.
I think the metaverse is their primary target in making it as well, as I mentioned in a sibling comment, but modeling an avatar and then rendering it is probably the computationally cheapest way of generating a video, even if the video itself isn't volumetric. See: the traditional 3D avatars already in meeting apps, which use the video feed.
I'm not sure if there would be any other potential use cases beyond these two. Or rather, I'm not able to think of any so far.
Either games, or it's just interesting research that mostly ties in with what FB is doing. Because there are real problems here: imagine, for example, the bandwidth requirement of streaming 3D copies of 20 people in a room.
It's simply not possible in the near future; even today, Zoom/Teams video conferencing is highly compressed and shit quality with just low-res 2D video.
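To put a rough number on it (purely illustrative assumptions on my part, not figures from the paper): even a naive uncompressed mesh stream blows way past a home connection.

```python
# Back-of-envelope bandwidth for naively streaming raw 3D "copies" of 20 people.
# All numbers are illustrative assumptions, not figures from the paper.
vertices_per_avatar = 25_000      # a moderately detailed body + face mesh
bytes_per_vertex = 3 * 4          # x, y, z as float32, no compression
frames_per_second = 30
people = 20

bits_per_second = vertices_per_avatar * bytes_per_vertex * 8 * frames_per_second * people
print(f"{bits_per_second / 1e6:.0f} Mbit/s uncompressed")   # ~1440 Mbit/s
# Compare with roughly 1-3 Mbit/s per participant for a compressed 720p video stream.
```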
That's amazing. It's a non-commercial license though.
How feasible is it to replicate what this model and codebase are doing, so it can be used in a commercial capacity?
Did they release the dataset?
It would also be nice if Facebook would consider making an API to give Heygen and Diarupt some competition, if they aren't going to allow commercial use.
Although there will probably be a bunch of people who become millionaires using this for their porn gf bot service who just don't care about license restrictions.
I expected something like this: https://speech2face.github.io/ (arbitrary voices)... This model seems to have been trained for each specific speaker?
I don't know, but I can't imagine having this as a feature in any app (Zoom, etc.) and leaving it on. That is how most of FB's AI research seems: not good enough to make into a real product or feature.
The nature of this type of research is that there are long-term goals which are currently unachievable, with no clear concept for how to approach them, so researchers need to start putting small pieces together and working out how to make it all work smoothly as a single concept. It looks like someone had a neural network for mouth movement, someone had one for body movement, etc. Composing multiple systems into one teaches us how we can approach more complex problems and how to tie things together better than just feeding the output of one into the input of another.
Long term this type of work helps solve big problems even if the intermediate steps don’t produce exciting results.
As an example, early image generators were pretty uninteresting but today they are widely utilized and generally considered impressive. The thing that researchers in the field know that the public doesn’t is that there’s 100 boring steps before the exciting release, and some of the boring steps are very exciting on a technical level. Those intermediate achievements represent 99% of what machine learning research actually is and others in the field appreciate those works.
Let's suppose the code gets re-implemented within the huggingface transformers library, and then you pour a bunch of money into a new dataset and the training, and either license it under MIT or create a separate product and sell the result of its work. Would this violate the CC-NC license?
That was my point. What you need is money and time. If you invest in an independent implementation, you will have your own model and product, at the required quality, for whatever purposes apply.