Hacker News
Avatarify lets users run realtime deepfakes on live video calls (inputmag.com)
411 points by adenner on April 18, 2020 | 117 comments



It would be interesting to see how far you could get using deepfakes as a method for video call compression.

Train a model locally ahead of time and upload it to a server, then whenever you have a call scheduled the model is downloaded in advance by the other participants.

Now, instead of having to send video data, you only have to send a representation of the facial movements so that the recipients can render it on their end. When the tech is a little further along, it should be possible to get good quality video using only a fraction of the bandwidth.
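As a back-of-envelope sketch (the frame size and landmark count here are assumptions, not from any real system), just packing per-frame facial keypoints instead of pixels hints at the scale of savings:

```python
import struct

# Hypothetical sketch: send facial keypoints per frame instead of pixels.
# Assumes a 640x480 24-bit frame and 68 dlib-style landmarks, each packed
# as two unsigned 16-bit coordinates. A real codec would compress further.

RAW_FRAME_BYTES = 640 * 480 * 3   # one uncompressed RGB frame
NUM_LANDMARKS = 68

def encode_keypoints(landmarks):
    """Pack (x, y) landmark pairs into a compact binary payload."""
    return b"".join(struct.pack("<HH", x, y) for x, y in landmarks)

# Stand-in for a face detector's output
landmarks = [(i * 9, i * 7) for i in range(NUM_LANDMARKS)]
payload = encode_keypoints(landmarks)

print(len(payload))                     # 272 bytes per frame
print(RAW_FRAME_BYTES // len(payload))  # roughly 3000x smaller than raw pixels
```

Even before entropy coding, that is a few hundred bytes per frame versus nearly a megabyte of raw pixels.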


This is a minor plot point in Vernor Vinge's excellent SF novel A Fire Upon the Deep.

One of the premises of the novel's universe is that computational power is generally absurdly plentiful, but communications bandwidth over interstellar distances is not. Most communications are in plain text (modeled after USENET) but in some cases, "evocations" are used to extrapolate video and audio from an ultra-compressed data stream.

The trouble, of course, is that it's not very obvious what aspects of the image you're seeing are real, and what aspects were dreamed up by the system doing the extrapolating.


A main premise of the Fear the Sky trilogy as well, but solved a different way. Machines representing various political factions from the home planet are uploaded with AI that mimics them emotionally and politically, for all intents and purposes. I really enjoyed this book.


Eh, I personally enjoyed the series, but I wouldn't recommend anything beyond book 1. Book 2 is OK. Book 3 really spoiled the series for me because of the inconsistent behavior of the main character. (Keeping it vague to avoid spoilers.)


Same. Notice I said "the book" while mentioning the trilogy ;-)


+1 recommendation for this trilogy


> it's not very obvious what aspects of the image you're seeing are real, and what aspects were dreamed up by the system doing the extrapolating.

It would be quite obvious, unless the raw data being extrapolated from were destroyed - for which there is no reason, nor is it possible to stop others in the vicinity from receiving that raw data.


That assumes that the "raw data" is reasonably human-comprehensible (which neural network weights and activations are notoriously not) and/or that you have time to sit down and analyze the data at your leisure.

But saying more would be spoilery...


For that to be true the compression algorithm mustn’t be very efficient.


Google recently introduced something like that for Audio in Duo: https://ai.googleblog.com/2020/04/improving-audio-quality-in...

> WaveNetEQ is a generative model, based on DeepMind’s WaveRNN technology, that is trained using a large corpus of speech data to realistically continue short speech segments enabling it to fully synthesize the raw waveform of missing speech.

I don't think you need to train for each person specifically; you could train a model for all heads, then transmit a few high-quality pics when the call starts and interpolate from those afterward.


Excellent idea, and we'll surely be seeing something like this; there are AR apps that already map facial expressions to avatars.

Downside could be some uncanny valley if the models are not very high quality.

But if I had to make a prediction, I'd expect we'll get much more value from higher bandwidth, ultra high definition streaming and features like 3d cameras / virtual reality. I think we have a tendency to really underestimate how important high definition is for human communication.


> I'd expect we'll get much more value from higher bandwidth, ultra high definition streaming and features like 3d cameras / virtual reality. I think we have a tendency to really underestimate how important high definition is for human communication.

Low latency is probably more important to me.

Recently I seem to have a 3 second delay on many VC calls at work (and just for me it seems), and I either end up interrupting people or feeling reluctant to talk at all since it becomes impossible to time gaps and conversations right.

I get a crystal-clear HD picture of all participants despite that, but I'd happily sacrifice video quality (in fact I'd accept audio-only in some cases) to get a more real-time experience (disabling video doesn't seem to have any effect).


> I get a crystal-clear HD picture of all participants despite that, but I'd happily sacrifice video quality (in fact I'd accept audio-only in some cases) to get a more real-time experience (disabling video doesn't seem to have any effect).

If you're really willing to sacrifice video completely, at least for Zoom, and probably for lots of other videoconferencing solutions, you can call into meetings with your phone. In fact, I think Zoom allows you to join with the computer for video and the phone for audio, which might be the best of both worlds.


Yes, zoom supports that.

Slight issue in Toronto has been cellular system overloading and calls not being completed. But once on, no problem.

I can’t blame the providers though. How could they have predicted that people would use the service they’re paying for?


This can be helped with hand-raising (queue style) and a dedicated facilitator for each meeting.


This is a long shot - but are you running on battery while this is happening? I had some weird issues that worked themselves out by plugging in the charger. Probably had to do with power saving and CPU throttling.


> Downside could be some uncanny valley if the models are not very high quality.

That can be controlled, since these compression algorithms usually work by making a prediction and sending the difference between the prediction and the actual value.

That works both for lossless compression - where the difference is sent in full - and lossy as well - where only the most important part of the difference is sent.


Even better would be for RPGs and things like Roll20. I’d love to deep fake different voices/ character faces on cue.


This is very loosely what Nvidia's DLSS game upscaling does: a generalized NN trained on super-high-resolution game engine output. You can run a game at a quarter to half resolution and it upscales the rest.

https://www.nvidia.com/en-us/geforce/news/nvidia-dlss-2-0-a-...


Very cool idea. The coding used in H.264 is a variant of the DCT, so moving one layer of abstraction up from there basically moves from semi-analog to fully digital. I agree that it should only require a fraction of the bandwidth, because you'd only be sending parametric data rather than full video.
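For intuition on the "semi-analog" DCT side of that comparison: for smooth signals, almost all the energy lands in the first few coefficients, which is what pixel-level codecs quantize and keep. A plain 1-D DCT-II illustrates this (H.264 actually uses a small integer approximation, so treat this as a sketch only):

```python
import math

# Unnormalized 1-D DCT-II over a smooth 8-sample signal. Smooth inputs
# concentrate almost all their energy in the lowest-frequency coefficients.

def dct(signal):
    N = len(signal)
    return [sum(signal[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
            for k in range(N)]

smooth = [10, 11, 12, 13, 14, 15, 16, 17]  # a gentle gradient
coeffs = dct(smooth)
print([round(c, 1) for c in coeffs])  # energy piles up in the first coefficients
```

A parametric face model skips this stage entirely: instead of transmitting quantized frequency coefficients, it transmits a handful of model parameters.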


I think this is largely possible, and accuracy to a human is very different from the MSE accuracy used in a traditional lossy compression algorithm.

To a human, for example, the exact pattern of every strand of hair isn't important at all -- all that matters is that the hairstyle and hair color stays the same.

The algorithm also doesn't have to worry about encoding and reconstructing skin blemishes, because people might actually enjoy not having to put on makeup for a video call.


I was thinking the same thing today. I wonder if it can be done on the spot: capture your image from the camera initially, then send the rest as data points for deepfake generation on the other side, based on your own image. That would be amazing for low/limited-bandwidth situations.


> Now, instead of having to send video data, you only have to send a representation of the facial movements so that the recipients can render it on their end.

MPEG-4 Part 2 actually had something like that, called "face and body animation" (FBA). As far as I know, there are no implementations in widespread use.


I wonder if the same kind of thing is feasible for someone's voice.


Update: I did some searching and found some interesting demos of a hybrid of neural nets and more conventional DSP called LPCNet:

https://people.xiph.org/~jm/demo/lpcnet_codec/

Sure enough, it was discussed on HN when it came out last year. I think I missed it then.

For those who didn't catch this from the URL, this is by Jean-Marc Valin, of Speex and Opus fame.


I believe that’s what the vocoder was created for.

https://en.m.wikipedia.org/wiki/Vocoder


Mentioned in Jim Akhaleli's (sp?) "Revolutions" episode about the smartphone, currently on Netflix (my lad was watching it; really good for juniors or non-technical people IMO).


Almost right. It's Jim Al-Khalili. Great show.


I second the already expressed sentiments!

An utterly brilliant idea!


This has already been done, with criminals posing as CEOs.


The actual GitHub repo for this, called Avatarify: https://github.com/alievk/avatarify

The original repo Avatarify is based on, called First Order Model: https://github.com/AliaksandrSiarohin/first-order-model

Short video demonstrating First Order Model: https://www.youtube.com/watch?v=mUfJOQKdtAk


Voice fake has been done already as well.

https://github.com/CorentinJ/Real-Time-Voice-Cloning


As someone who's tried this repo extensively with politician sound clips (I wanted to troll buddies on Discord): it kinda blows. Don't get me wrong - it's really neat - but the results are far worse than one might expect.

Sometimes it almost works, and then it's just totally absurd. Long pauses, voices that don't sound compelling, total failures on female voices. It's great in theory but it showed me that there's a ton of work to be done with voice cloning.

Props to the author for using UMAP to separate voices, though.

Also lol at the demand being so high that there are open issues of people offering to pay others to install this on their machine. Freelancing opportunities show up in the strangest of places...


Real Time Voice Cloning certainly has iffy output, but it's probably the most popular because it provides the easiest plug-and-play experience with even a simple UI to get started.

The author says he's working on a more polished toolkit called Resemble.AI, but I've never tried it. https://www.resemble.ai/

There's certainly a market out there for just beautifying existing repos to making it easier for non-scholars to get going. Even having a Colab Notebook ready to click-and-start is quite powerful -- probably a big reason why First Order Model (source paper to the original story) got so much traction so quickly.


See https://www.descript.com/lyrebird-ai for another one with an on-site demo.


And sometimes it's not even non-scholars, just people from a different, and not even very-different field.


could you post some examples with dropbox of the 'less than stellar' results you were getting?


Find audio clips of Trump speaking. Try running them through this.

See what you get.

To be fair, I was trying this repo almost 6 months ago, so updates may have improved it.


Voice fake can copy timbre but not intonation or many other stylistic features.


I ran this on Ubuntu 18.04. It took a little work to get around a small bug that will be squashed when the v4l2loopback library gets officially rebuilt, but here is a solution: https://github.com/alievk/avatarify/issues/37#issuecomment-6...

Anyway, on a 4930K at 4.5 GHz I am seeing reasonable performance (~30 fps) but minimal CUDA utilization (Titan, Pascal). The biggest issue is that you need a well-lit, stationary face for it to properly map features of the JPG you are using to substitute for your face. Also, the JPG needs clearly identifiable features. Even then, the amount of facial expression is subdued (for example, closing your eyes is not properly processed).

I seem to recall software from about 10 years ago where you drew line segments over corresponding features on two images and the JPG was then mapped onto the video. It was more accurate and expressive than this is, but it did require time to set up.


The way that the mouth and eyes move, but that the rest of the face is pretty static, is actually really charming to me.

It betrays that it's a fake, which I think makes it easier to be a fun joke.


That’s simple. If you are trying to fool someone....

“My internet speed is very low. I know the video is choppy”.


Which is the line taken by the guy who made his own video call bot shown in social media recently.

https://redpepper.land/blog/zoombot/


Reminds me of the Chuck E. Cheese robots.


This is the best application of deepfakes I have seen! If someone were selling deepfakes for Star Trek/Star Wars, they would sell like hot cakes among the crowd here, except that too many Kirks and Picards might be seen in meetings.


More meetings should definitely start with arguments about who gets to use which avatar. "I only took Riker because my boss is on the call!" "Geordi is already taken; you have to be Wesley!"


Sounds like it would be worth a few minutes of fun on a daily status call during these work-from-home days of quarantine. For the time being, the goofiest it’s gotten was someone using a virtual webcam that allowed for a green screen and a looping video of the Max Headroom background... Which reminds me, though slightly off topic: is there virtual webcam software whose sole point is to keep a log of all the applications using the built-in camera or mic? That might be useful.


I’m sure we will see more and more of this. One question is, when they get so good, how can people tell whether it’s a deepfake or a real human? Will we have captchas for videos soon? Sigh.


Ask the person to turn their head left and right. Or to put on glasses. Combine simple tests like this and it becomes exponentially more complex for the gadget to pass them.
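That "exponentially more complex" intuition is just independent probabilities multiplying. Assuming, purely for illustration, that a fake passes any single challenge half the time:

```python
# Toy model: chance a fake survives k independent liveness challenges,
# assuming a made-up 50% pass rate per challenge.
p = 0.5
for k in (1, 3, 5, 10):
    print(k, p ** k)  # ten challenges -> under a 0.1% chance of passing all
```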


Like the Turing Test computational attack from Rick and Morty.

Rick speaking to a crowd of hologram characters: Everyone who's first name starts with an "L" who isn't Hispanic, walk in a circle the same number of times as the square root of your age times ten! - https://www.imdb.com/title/tt3333830/


My opinion on this is that we will "soon" have "trusted" webcams that create some sort of signature in each frame. In other words, a way to say "this is what the camera saw, guaranteed".

Probably something that would be built into phones and laptops first.

Of course, we know that messing with such a datastream would be easy for us, but for enterprise users (think liability) and news organizations it could be a real boon.


That's when you point your camera at a screen.


Will the "trusted" webcam manufacturers then attempt to detect any screen curvature?


Technically it would not be "this is what the camera saw, guaranteed" but "this is what the possessor of some signing key asserts they saw".

If a single shoddy manufacturer of cheap webcams somehow leaks a key that should have been on that webcam, then any software can easily encode a video stream with signatures asserting "this is what the camera saw, guaranteed".
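A toy illustration of that point, with an HMAC standing in for whatever signature scheme real hardware would use (the key name and frame contents here are invented): once the device key is available to software, a fabricated frame verifies exactly like a real one.

```python
import hmac
import hashlib

# A signature proves only that someone holding the key signed these bytes,
# not that the bytes came from a real sensor.

LEAKED_CAMERA_KEY = b"supposedly-secret-device-key"

def sign_frame(frame_bytes, key):
    return hmac.new(key, frame_bytes, hashlib.sha256).hexdigest()

def verify_frame(frame_bytes, signature, key):
    return hmac.compare_digest(sign_frame(frame_bytes, key), signature)

real_frame = b"pixels straight from the sensor"
fake_frame = b"deepfaked pixels from software"

# With the leaked key, software can sign anything and it verifies identically.
assert verify_frame(real_frame, sign_frame(real_frame, LEAKED_CAMERA_KEY), LEAKED_CAMERA_KEY)
assert verify_frame(fake_frame, sign_frame(fake_frame, LEAKED_CAMERA_KEY), LEAKED_CAMERA_KEY)
print("both frames verify with the leaked key")
```

The chain of trust ends at whoever holds the key, not at the sensor.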


Good question, maybe sign your videos with a key?

Or perhaps another neural network to distinguish the real from the fake?


Isn't that already done? They're called generative adversarial networks...


Yeah, the real reason Zoom should actually be E2E encrypted.


Authenticity is about signing more than encrypting. But having that actually work would require a robust key distribution system that's actually usable by the general public.


If I understood it correctly, deepfakes are relatively easy to detect with software if you know what you're looking for. I can imagine video conferencing software implementing this and displaying a warning, similar to the way phishing emails are currently handled.


Deepfakes are implemented with generative adversarial networks (GANs), where one component is a discriminative network that is already trying to distinguish real from fake, in order to provide feedback to the generator.

So I think any detection algorithms would get into a never-ending arms race.


None of the popular deepfake implementations use GANs, except that GANs have been used in experimental (later abandoned) modules. Unfortunately the deepfake/GAN fallacy has been stuck in the Wikipedia entry for years.


The discriminative networks aren't very good at discriminating. I remember hearing people in the field say they deliberately used under-powered discriminators because they got better generators that way. That was a year or two ago, though, so who knows if it's still true.


Spotting it in a static photo may be harder, but the artifacts will be really obvious in video until the technology progresses considerably.


Voight-Kampff test


I seriously wonder how this will affect online dating. Not that I've dated in quite a few years, but even if I wanted to, I wouldn't go back, because last time I did, the proliferation of obnoxious Instagram filters and photoshopping made the experience unenjoyable. Fake people aren't appealing.

I would bet that the ease with which the average person can deepfake will only make matters much worse. There will always be a demand for Tinder, Match, Bumble, etc., but they will be strictly used for hookups. (I know some will say that's what they're already for, but people make that argument about every dating app, so I can't take the opinion very seriously.) Actual dating will have to either go back to being more in-person or require a third party to handle photography.


Why do you think people would use fake pictures en masse? If they wanted to create a profile with pictures that aren't of themselves, surely it's easier to just steal someone else's?


Because they already do, to some extent. Now this is my own experience and intuition, but I would say that at least 15% of people's photos on dating apps are doctored in some way. I know some might dispute this, but I've seen enough questionable artifacts in such photos that I believe quite a few of them are artificially flattering. This isn't even including those puppy-face photos. And yes, already there are people who outright steal other people's photos. I've been on many dates where the person looked nothing like the photos.

There are a few problems with these kinds of fakery. If you are simply touching up photos to make them more flattering, that takes a lot of work that the average person doesn't want to do. There's AI that can make you look sexier, but said AI is often inconsistent and isn't that good at arbitrarily modifying features. And people who might want to use photos of people who aren't them may have a moral hangup about using someone else's likeness, which prevents them from going ahead with it.

Deepfake technology solves both of those problems. You can use it to change your eyes, your hair, the shape of your face, consistently at every angle, and possibly in real time. More people would feel comfortable using a likeness that isn't theirs because they don't feel like they're stealing, especially if they have no intention of meeting people in real life.

I'm not saying that all or even most people would use this technology in dating, but it would require only an appreciable portion of an audience to be dishonest to cause many people to throw in the towel.

These are just my crazy theories, so maybe I'm way off base. Soon we will have deepfaking that is easy to use and requires minimal training, and I believe they will be used often and eventually be normalized in society.


> These are just my crazy theories, so maybe I'm way off base.

You absolutely are not. From my experience I'd say about 30%+ of the pictures that were posted to OkC 2+ years ago were doctored in some major way (I'm a straight man so this anecdotally only applies to women's pictures).

Using your "own" picture that's doctored is easier to justify for these folks vs. just using someone else's. Also, these people are still going on real-world dates, so if they look like an entirely different person they're taking on more risk vs. using weird Instagram filters.

> I'm not saying that all or even most people would use this technology in dating

I think most wouldn't employ this sort of deceit like you posit, as MOST of people I went on dates with were totally operating in good-faith.

BUT - there will be a huge chunk of people that are happy to do the mental gymnastics to justify posting a fake picture of themselves.. and those people are exhausting.

Protip: NEVER meet up at a bar initially, because it's expensive when this happens (AND IT DOES!!!). ALWAYS do an informal coffee. It's easy to keep coffee to a 15-20 minute thing and not get on the hook for at minimum 2x $15-20 drinks. If you choose a coffee place, pick one an hour or so before lunch/dinner, within walking distance of a nice restaurant. This gives you both the opportunity to keep the date going if it's a positive experience =)


It's easier to automatically detect at large scale if you're reusing someone else's pictures, so on some social media sites this will immediately flag you as a potential fake profile and escalate to requirements such as mobile phone number verification.
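One common building block for that kind of reuse detection is a perceptual hash, which stays stable under small edits where a cryptographic hash changes completely. A toy average-hash over made-up 4x4 grayscale grids (real systems would use PIL/OpenCV and much larger hashes):

```python
# Hypothetical sketch of how a site might flag reused photos: an "average
# hash" marks each pixel as above/below the image's mean brightness, so a
# lightly retouched copy still hashes close to the original.

def average_hash(pixels):
    flat = [p for row in pixels for p in row]
    avg = sum(flat) / len(flat)
    return tuple(p > avg for p in flat)  # one bit per pixel

def hamming(h1, h2):
    return sum(a != b for a, b in zip(h1, h2))

original  = [[10, 200, 10, 200]] * 4
retouched = [[12, 198, 11, 203]] * 4   # slightly edited copy
unrelated = [[200, 10, 200, 10]] * 4

print(hamming(average_hash(original), average_hash(retouched)))  # 0: flagged as duplicate
print(hamming(average_hash(original), average_hash(unrelated)))  # large: different image
```

A site would compare a new upload's hash against its corpus and flag any profile whose photos sit within a small Hamming distance of someone else's.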


All the new mobile banks seem to do their ID verification/KYC using video selfies. It's interesting to consider whether deepfakes are being used in the wild to fool these systems and commit fraud.


They try to mitigate this. They ask you for example to move your hands around your face and bend your ID card.


Just apply the same technology to mitigate their mitigation. Arms race all the way down.


Finally a solution for daily standups.

I can pay someone to attend meetings all over the world, by just impersonating me.


I am just waiting for someone to build a deep_nude_realtime_zoom plugin so I can finally tell people that we should take digital security, privacy and identity seriously.


Absolutely. I could do my video conference calls naked and blame the result on my new deep fake software. Hell, I can even do that now without having the software. My colleagues are nerdy enough to assume such software already exists


That would give everyone an alibi and make them care even less about privacy.


I really wonder what government agencies will do with deepfakes. Will they doctor evidence like they do now, at a greater scale?

They already distribute CP, encourage people to believe in Russian conspiracies, troll, encourage violence and the list goes on. So fucked up that we think it is normal.


If it truly becomes commonplace, then there is a chance video of any kind will lose some of its visual impact. Having seen enough of these, people will hopefully be more cautious about assuming videos are true.


Hey guys! I'm one of the founders of Impressions, the first mobile deepfake app. Try it out and give us your thoughts. Right now it's out for iOS, but Android is en route. Here's our website: https://impressions.app


I strongly recommend you read up on "likeness rights" (as in celebrity likeness rights) - they are a big deal, particularly for any product available within the state of California (which has extremely strong laws around them). As your product stands today, you will be sued and you will lose, because you are absolutely using celebrity likenesses to sell your product in the App Store. It doesn't matter whether the images you generate are debatably non-celebrity or not, because you're using the likeness of specific celebrities to sell the app.

For reference on Likeness Rights, one of Kevin Smith's Jay and Silent Bob movies has an entire sub-plot that revolves around the financial significance of likeness rights.


Very weak output I’ve seen so far. Trying another before passing judgment, but honestly the output is so obviously fake so far.

Edit: now done three images, all with a well-lit face and no dramatic movements, and the output is just terrible.


It really depends on lots of factors: lighting, angle, face type, the celebrity you choose, and your facial hair. If you're squinting your eyes, for example, you won't get decent results. DM me your ID from under settings so I can give you credits to play with.


I wanted to play with this, but ran into an error upon starting a Miniconda prompt after a fresh install of Miniconda3 on Win7:

    ModuleNotFoundError: No module named 'conda'
It's triggered by this line in conda-script.py:

    from conda.cli import main
Same thing happened when I tried installing Anaconda instead. Any suggestions?

----

EDIT: To get this to work, I had to remove a PYTHONHOME environment variable lingering from an old (but still valid) Python install (see https://github.com/conda/conda/pull/9239), and set CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1 to avoid mkl_intel_thread.dll issues.

Leaving my comment here for anyone else who gets stumped.


I had a different set of challenges on Windows 10:

- The Miniconda installer suggests you don't add it to the path, so I didn't, but install_windows.bat assumes you do.

- There is something with Windows permissions happening with the installer, where it can't access the network and can't run the conda activation process. I didn't spend any time figuring this out; I just executed the commands from the (short) batch file one by one and it worked fine.

- I had to add 'fomm' to PYTHONPATH in order for it to find sync_batchnorm.


I would love to use this with an avatar of myself so that I don’t have to get dressed up for calls!


Nice, now I can pass my Binance verification with fake KYC more easily


Any hope of being able to run this on a machine with an AMD GPU?


Finally, the solution to the “haircut problem” with videoconferencing. My gf spends 3-4 hours in meetings every day and is quite concerned about this.


I wish deepfakes could get the mouth right. It's been the biggest challenge since the first CGI face animations and is always the giveaway.


Good answer to the question asked in a different comment regarding detecting such fakes!


This is pretty funny. I wonder, how hard would it be to use tech like this to make it look like you are looking the camera “in the eyes”?


Apple is doing this in FaceTime with the latest iOS. Pretty neat feature; it makes video calls more personal.



You're right, I didn't know about that. Thanks!

Link: https://9to5mac.com/2019/07/03/facetime-eye-contact-correcti...


OK, so how about an app that does the opposite of this and can prove that the video/audio data hasn't been tampered with since it was created? I don't think it would need much more than public/private key cryptography, and a blockchain ledger to record every time the data was transmitted from one user to another. Thoughts?


How would you prevent against a deepfake program presenting itself as a camera? From your applications perspective, it's raw data directly from the camera but there's no guarantee the camera is just a camera.


The "secure video stream" would probably need to be baked into the hardware, like Apple's Secure Enclave https://support.apple.com/guide/security/secure-enclave-over...


And the Cam industry was never the same.


Theoretically you could deepfake yourself but then not be at the place, just piping in your voice. Could probably be accomplished by linking it to some Twilio API.


Why would you do remote work even more remotely?


So you can go to the office ;-)


This is a posting about an article (inputmag) about another article (vice) about a git repo. Wow.


"I thought what I'd do was, I'd pretend I was one of those deaf-mutes."


81 comments at the moment and not one single mention of Infinite Jest?! How disappointing...


I read that over 20 years ago; could you explain the relevance? deepfakes or something similar would have fit right in with "the entertainment" etc., but I don't recall that happening...

In general I find that adding an oblique reference is more appreciated on HN than simply name-dropping a particular fictional work.


In the book there is a long passage about how video calls were initially very popular, but after a while people realized that with video calls they were supposed to be looking at the screen all the time, and how that led to unnecessary anxiety - e.g., on the telephone you could talk while clipping your nails; on video that would never happen.

There is also a part where people had to start worrying about their looks on the screen, so that led to a cottage industry of virtual make-up, stand-ins to pose in front of the camera while you could stay away, etc. After everyone realizes that no one is actually in front of the camera anymore, people start going back to regular phone calls.


I don’t understand why everyone thinks they have to be on video for every single meeting. I never turn on my webcam. It makes no difference to the outcome of the call.


It is most important in calls with 6+ participants. It gives folks a chance to signal that they have something to say when many times a call may be dominated by one or two individuals.

Perhaps there are other approaches for that particular case such as a button to 'raise your hand', however, many people have friendly relationships with their colleagues and simply enjoy seeing them when conversing.


A lot of information can be transmitted through non-verbal communication (e.g. facial expressions). Do you look at the faces of other people on the call with you who do have their cameras turned on? If so, why?


Nope since someone is usually presenting their screen anyways.


I guess we should all start wearing blindfolds in face-to-face meetings because it will make no difference to the outcome of the meeting.


From the site's text:

> and plenty of Americans and Russian's willing to use whatever means necessary to ensure Trump gets re-elected

That is not what the Russian attacks are about.

In fact, the maker of the site seems to have fallen for the attacks himself.

The attacks are about dividing society. Radicalizing people towards the left and right. The "Trump haters" are just the same as the "Trump fanatics".

The author, by implying that the "bad guys" are just one side, is playing into the hands of the Russian trolls. It's exactly what the attacks are trying to achieve.


Both things can be true. They don't really care for Trump personally, they just want to sow discord, but Trump is a great way to sow discord.

Non-Russian-aligned intelligence agencies were pretty unanimous in their assessment that Putin was trying to get Trump elected and that they troll both extremes of the political spectrum.


Wasn't there an investigation that found no evidence on that?


Couldn't I just wear an N-95 mask?

Once the pandemic is over, I'm never using Zoom again anyway.


You’re never working remotely or do you plan to use a different app/service?


And good news... this works with other services, not just Zoom.



