I appreciate this is research, but I wonder: are these gestures actually semantically distinct information that the model is better at extrapolating from audio than the listener is? Or are they just redundant visual cues that perhaps relieve some cognitive load when communicating with someone?
I'm apprehensive about accepting nonverbal communication that a model has appended to a human source.
I was also shocked. I think Meta is making this for their metaverse project, and their internal Facebook data shows that most people most of the time talk like complete idiots, barely able to mash a thought out of their mouth. I’ve never heard more inarticulate training data.
Who came up with these conversations? Hard to believe this was part of the training data. And where does the training data come from? To what extent is it applicable to academic or corporate settings?
This is one of the coolest things I've seen that I also cannot understand... why? Aren't you going to need to tune it on yourself? Otherwise you're going to adopt the gesticulation of whoever it was trained on. Maybe for videogames? Or NPCs in VR environments? But doesn't that become robotic, so we end up back in the uncanny valley once we've normalized it? I mean, the network __has__ to do significant amounts of memorization, unless the microphone can conceivably pick up a signal that actually corresponds to the 3D spatial movements (could be possible, but this doesn't seem to be that). Maybe that's what they're working towards and this is an iteration along the way?
It's technologically impressive, but I'm failing to see the use. Can someone else enlighten me? I'm sure there's something I'm failing to see.
Why do so many of these new demos require old versions of CUDA? It's quite annoying having to juggle the installs and disk usage of CUDA 11.6, 11.7, 11.8, 12.1, 12.2, and 12.3.
Because new CUDA versions often come with new bugs, so as a researcher with a publication deadline, you're incentivised to never upgrade. I still vividly remember how messy it was to downgrade a large TensorFlow project to a previous version which still supported an older CUDA, because the then-current cuBLAS couldn't do matrix multiplication without silent memory corruption (I believe it was the 9.x to 10.0 jump that broke it). And we lacked the connections to talk directly to any engineer at NVIDIA, which meant that nobody even confirmed the bug's existence after we submitted a GitHub issue. Eventually we found a way to trigger a crash on Google Colab using the bug, someone from Google filed it again with NVIDIA, and things finally got fixed, about six months after the research paper was finished.
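For what it's worth, the regression test that would have caught it early is cheap to write. A minimal sketch, assuming PyTorch as the frontend (the helper name here is mine, not from any library): compare the GPU matmul against a CPU reference and flag anything beyond rounding noise.

```python
# Rough sanity check for silent GPU matmul corruption: compare the cuBLAS result
# against a CPU reference. Purely illustrative; assumes a CUDA-capable PyTorch build.
import torch

def gpu_matmul_looks_sane(n: int = 2048, rel_tol: float = 1e-3) -> bool:
    torch.manual_seed(0)
    torch.backends.cuda.matmul.allow_tf32 = False  # keep both paths in plain float32
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    cpu = a @ b                        # trusted reference on the CPU
    gpu = (a.cuda() @ b.cuda()).cpu()  # result under test (goes through cuBLAS)
    rel_err = ((cpu - gpu).abs().max() / cpu.abs().max()).item()
    print(f"max relative difference: {rel_err:.2e}")
    return rel_err < rel_tol           # tiny rounding noise is fine; big gaps are not

if __name__ == "__main__":
    assert torch.cuda.is_available(), "needs a CUDA device"
    print("cuBLAS matmul looks sane:", gpu_matmul_looks_sane())
```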
Because it takes time to write a paper and publish it. And researchers usually don't care about code, so the CUDA version used when starting the project never gets upgraded.
When it was brought up a bit ago, I thought I was aphantasic. It turned out I'm just poorly trained at it - I don't tend to rely on memory for physical details, and I'm aggressively poor at visual arts. With a little thought put into it, though, I realized I can "See" in my "Mind's Eye", and even with some active training I'm still not good at it - it's not a tool I reach for naturally - but I can "Imagine" some images if I try at it.
Look at something. Close your eyes. Retain the image. Try to "look" at parts of the image. Focus on parts of the image. Then... change them. No, that apple was green, not red. You know what a red apple looks like, you know what a green apple looks like; Focus on the overall feel of the rest of the image while the apple is different.
It's not easy; it's a trained skill, though I guess one that's pretty easy to acquire accidentally by daydreaming and doing art. I think I actually trained myself out of it as part of "safety" training: I stopped relying on mental models of many things and started forming mental models of the checklists for getting them into that state.
Pretty cool. It's going to take a while to make it into a usable product though. Having conversations with people flailing their hands algorithmically is going to feel weird until it gets more natural. Right now it feels like those "blink every n" scripts.
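To show what I mean by that, here's a toy sketch (made-up numbers, not from the paper or any real engine) of what a "blink every n" script amounts to next to one with randomized gaps; the metronomic version is exactly the rhythm people clock as fake.

```python
# Toy illustration of why fixed-interval "blink every n" scripts read as robotic:
# a metronomic timer vs. randomized intervals. Values are purely illustrative.
import random

def fixed_blinks(duration_s: float = 30.0, every_s: float = 4.0) -> list[float]:
    # Blink at exactly t = 4, 8, 12, ... - viewers pick up on the regularity fast.
    return [t * every_s for t in range(1, int(duration_s / every_s) + 1)]

def jittered_blinks(duration_s: float = 30.0, mean_gap_s: float = 4.0) -> list[float]:
    # Draw gaps from an exponential distribution so the rhythm never quite repeats.
    times, t = [], 0.0
    while True:
        t += random.expovariate(1.0 / mean_gap_s)
        if t > duration_s:
            return times
        times.append(round(t, 2))

print("fixed:   ", fixed_blinks())
print("jittered:", jittered_blinks())
```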
It's incorrect to generalize like this. NPCs may use broad procedural rules, full mocap, or really anything in between.
Yes, this could apply to them. No, it doesn't mean it always does.
This also doesn't really add to the point that it wouldn't be suited to human conversation. You rarely use NPCs to have human-like chats, especially for several minutes. The few games that do, like Mass Effect or LA Noire, use mocap to avoid the effect I'm referring to.
Like with audio, the challenge is going to be doing this in real time. You can have a conversation with e.g. ChatGPT, but it needs a few seconds to process what you say, come up with an answer, and then talk to you. It's a bit awkward. Imagine a version of that with an AI avatar like this. It would probably require quite a bit of performance improvement to get to a level of responsiveness that feels natural (a rough latency budget is sketched below). And real-life conversations are not just about talking realistically but also about nonverbal communication while listening. So it would have to adapt to what you are saying while you're saying it for it to be completely natural.
That being said, this is pretty awesome for a lot of use cases that are less interactive.
Being able to have an avatar that fits your voice without having to actually look like that has many applications.
Whether you're trans or you just want to join a video call early in the morning without dressing up, the applications are endless.
In many situations we demand that people dress or present a certain way, just out of bullshit social expectations. This is one way to eat your cake and have it too.
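To put rough numbers on the responsiveness point above: here's a minimal sketch with hypothetical per-stage latencies (assumptions of mine, not measurements of this system or any real product).

```python
# Back-of-envelope latency budget for an interactive talking avatar.
# All per-stage numbers are hypothetical assumptions, not measurements.
stages_ms = {
    "streaming speech recognition (tail)": 200,
    "language model (time to first token)": 500,
    "text-to-speech (first audio chunk)": 200,
    "gesture/face synthesis + rendering": 100,
}
total_ms = sum(stages_ms.values())
for stage, ms in stages_ms.items():
    print(f"{stage:40s} {ms:4d} ms")
print(f"{'time until the avatar starts responding':40s} {total_ms:4d} ms")
# Gaps between turns in human conversation are typically around 200-300 ms,
# so a budget like this still reads as a noticeable pause.
```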
I get that this is to drive the avatar, but I'm curious as to why audio. There are stronger signals in video: I'm fairly certain that even a very low resolution image stream would convey movement better than audio does. So either the network is memorizing a lot (which is fine, but limited), or this is an iteration towards a high-sensitivity audio-driven setup where the sound itself carries precise 3D spatial information, or something else. I mean, the Quest has cameras in it, so why not use those? Computation? These aren't big models (the largest is 1.42 GB, the smallest 0.58 GB).
For those use cases you should be able to get much more accurate results using a base video stream. This better fits use cases where you're lacking a video stream entirely, not just ones where you don't want to turn it on.
I think the metaverse is their primary target in making it as well, as I mentioned in a sibling comment, but modeling an avatar and then rendering it is probably the computationally cheapest way of generating a video, even if the video itself isn't volumetric. See: the traditional 3D avatars already in meeting apps, which use the video feed.
I'm not sure if there would be any other potential use cases beyond these two. Or rather, I'm not able to think of any so far.
Either games, or it's just interesting research that mostly ties in with what FB is doing. Because there are real problems here: imagine, for example, the bandwidth requirement of streaming 3D copies of 20 people in a room.
It's simply not possible in the near future; even today, Zoom/Teams video conferencing is highly compressed and shit quality with just low-res 2D video.
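To put a rough number on it (purely illustrative assumptions on my part, not figures from the paper): even a naive uncompressed mesh stream blows way past a home connection.

```python
# Back-of-envelope bandwidth for naively streaming raw 3D "copies" of 20 people.
# All numbers are illustrative assumptions, not figures from the paper.
vertices_per_avatar = 25_000      # a moderately detailed body + face mesh
bytes_per_vertex = 3 * 4          # x, y, z as float32, no compression
frames_per_second = 30
people = 20

bits_per_second = vertices_per_avatar * bytes_per_vertex * 8 * frames_per_second * people
print(f"{bits_per_second / 1e6:.0f} Mbit/s uncompressed")   # ~1440 Mbit/s
# Compare with roughly 1-3 Mbit/s per participant for a compressed 720p video stream.
```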
That's amazing. It's a non-commercial license though.
How feasible is it to replicate what this model and codebase are doing, so it can be used in a commercial capacity?
Did they release the dataset?
It would also be nice if Facebook would consider making an API to give Heygen and Diarupt some competition, if they aren't going to allow commercial use.
Although there will probably be a bunch of people who become millionaires using this for their porn gf bot service who just don't care about license restrictions.
I expected something like this: https://speech2face.github.io/ (arbitrary voices)... This model seems to have been trained for each specific speaker?
I don't know, but I can't imagine having this as a feature in any app (Zoom, etc.) and leaving it on. That is how most of FB's AI research seems: not good enough to make into a real product or feature.
The nature of this type of research is that there are long-term goals which are currently unachievable, with no clear concept for how to approach them, so researchers need to start putting small pieces together and working out how to make it all work smoothly as a single concept. It looks like someone had a neural network for mouth movement, someone had one for body movement, etc. Composing multiple systems into one teaches us how we can approach more complex problems and how to tie things together better than just feeding the output of one into the input of another.
Long term this type of work helps solve big problems even if the intermediate steps don’t produce exciting results.
As an example, early image generators were pretty uninteresting but today they are widely utilized and generally considered impressive. The thing that researchers in the field know that the public doesn’t is that there’s 100 boring steps before the exciting release, and some of the boring steps are very exciting on a technical level. Those intermediate achievements represent 99% of what machine learning research actually is and others in the field appreciate those works.
Let's suppose the code gets re-implemented within the huggingface transformers library, and then you pour a bunch of money into a new dataset and the training, and either license it under MIT or create a separate product and sell the result of its work. Would this violate the CC-NC license?
That was my point. What you need is money and time. If you invest in an independent implementation, you will have your own model and product, at the required quality, for whatever purposes apply.