Hacker News
VToonify: Controllable high-resolution portrait video style transfer (github.com/williamyang1991)
149 points by godmode2019 on Jan 15, 2023 | 22 comments



Also reviewed at Two Minute Papers with a few examples: https://youtube.com/watch?v=C9LDMzMRZv8 (hold on to your papers...)


What a time to be alive!


Well it's cool and all, but the results sit right at the very deepest part of the uncanny valley.


It's like a tech demo, a preview of the future. Give it 5 years and it will be super refined and probably the future of low-cost animation for kids' TV shows and the like. Then further out: just as no one animates without a computer now, no one will animate without AI assistance.


Idk about the future of animation. It would require look-alike actors. Wouldn't it be easier to use already available tech to transfer gestures and facial expressions to cartoon models?


As far as I can tell, it really depends on the intensity of the "slider".

When the intensity of the style transfer is pushed mostly to the right (high), it just looks like Pixar or cartoons. Nothing uncanny whatsoever.

But when they show it about a quarter of the way to the right... it's utter nightmare fuel, like plastic surgery taken way too far. The worst kind of uncanny valley, so I definitely agree with you there.


I wonder what integrating this into e.g. live streaming would require.


On Linux there are techniques involving a loopback device, such as v4l2loopback: https://github.com/umlaeute/v4l2loopback

Effectively these let an app (e.g. some VToonify tool) generate content that, from the perspective of your live-streaming app, looks like it is coming from a webcam.
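
As a rough sketch of how that could look in practice (pyvirtualcam is one wrapper around a v4l2loopback device; stylize() here is just a hypothetical stand-in for the VToonify inference call, not part of the project):

    import cv2
    import numpy as np
    import pyvirtualcam

    def stylize(rgb: np.ndarray) -> np.ndarray:
        # Hypothetical placeholder for the style-transfer model; identity here.
        return rgb

    cap = cv2.VideoCapture(0)           # the real webcam
    width, height, fps = 1280, 720, 10  # ~10fps, roughly what the paper's numbers suggest

    with pyvirtualcam.Camera(width=width, height=height, fps=fps) as cam:
        # cam.device is the v4l2loopback node (e.g. /dev/video2); point OBS/Zoom at it
        while True:
            ok, bgr = cap.read()
            if not ok:
                break
            bgr = cv2.resize(bgr, (width, height))
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            cam.send(stylize(rgb))      # streaming apps see this as a normal webcam
            cam.sleep_until_next_frame()

The v4l2loopback kernel module has to be loaded first (sudo modprobe v4l2loopback) so there is a dummy /dev/videoN device for the virtual camera to write to.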


I'm glad things are progressing, but it bugs me that AI is largely being developed for... things like this? I know this comment is a bit disparaging and minimizes greater achievements, and I apologize for that, but the closeness of content consumerism and AI is becoming quite off-putting.


Based on the numbers in the paper this is just a little too slow for use as a real-time video effect. At ~0.1 seconds per frame we need about a 3x improvement in performance to hit 30fps "real-time" video frame rates.
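
Spelling out that back-of-the-envelope arithmetic (numbers taken from the paper as quoted above, not re-measured):

    seconds_per_frame = 0.1                    # ~what the paper reports per frame
    current_fps = 1 / seconds_per_frame        # ~10 fps
    target_fps = 30
    speedup_needed = target_fps / current_fps  # ~3x
    print(current_fps, speedup_needed)         # 10.0 3.0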

And on that thought, since it appears they used Nvidia hardware (based on the CUDA dependency), it would be interesting to see how this performs on something like an M1/M2, where there's dedicated ML hardware to help offload and accelerate things.


The paper also says they used 8 Tesla V100s. Those are GPUs dedicated to ML and quite a bit more powerful than an M2.


Can you confirm that was for inference? I thought that was only for training (55 min on 8x V100).


You are right, inference only uses a single V100 according to the paper.


I missed the bit about using 8 of them to run it! Wow, that's a lot of GPU horsepower for this. More efficient to just use a vtuber-style pipeline with Unreal Engine and MetaHumans or other avatars... you only need one good GPU for that.


Follow-up: glad to know I read it right the first time.


Does M1/M2 really outperform CUDA on beefy ML GPUs in tasks like this? I'd love to see numbers if so; this seems extremely surprising.


M1 performs better in some realtime use-cases because of the unified memory: the GPU and ML hardware can work on a camera framebuffer directly without any copy.

CUDA always requires sending data over the PCI bus, at least when it comes to realtime camera processing. GPUDirect exists, but it's optimized for disks and NICs; I don't believe it's possible to use it with cameras.
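
For scale, a rough estimate of what one uncompressed 1080p frame costs to push across PCIe (assumed bandwidth figure; bandwidth only, not the latency and synchronization overhead that a zero-copy path avoids):

    frame_bytes = 1920 * 1080 * 3     # ~6.2 MB of uncompressed RGB
    pcie_bytes_per_s = 16e9           # assumed effective PCIe 3.0 x16 bandwidth
    copy_ms = frame_bytes / pcie_bytes_per_s * 1e3
    print(f"~{copy_ms:.2f} ms per frame, per direction")   # ~0.39 ms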


No idea actually. I just find all sorts of odd benchmarks cropping up where the unified memory architecture on the M1/M2 gives surprisingly good performance, because other CPU/GPU combinations take a DMA-transfer performance hit. It's far from universal, but it's been surprising to see, and this looked like the sort of workload that might benefit: camera decoding, the ML and GPU processing, then "rendering" back out. Hence my wondering out loud.


nVidia hardware has dedicated ML silicon, though?


It’s been a while since I bothered looking at anything made by nVidia, I didn’t like their business practices and to be honest… Living in the Mac ecosystem for 90% of my work/life means I followed the paltry AMD rocM stuff ardently and was so happy when Apple made an effort up front to get the M1 and M2 chips supported by the major ML frameworks.


Sure, but that’s not too far from the 12fps that cartoon animators often actually use.


That's a good point. I did briefly think about 24fps for "cinema", but as that's a bit of a weird frame rate for a streaming webcam I dismissed the idea. But the video does actually have just enough frame rate to look smooth, and I suppose that's because it lands near enough to 12fps that, combined with "looks like a cartoon", my brain is filling in the rest and barely noticing.



