Hacker News
VToonify: Controllable high-resolution portrait video style transfer (github.com/williamyang1991)
149 points by godmode2019 on Jan 15, 2023 | 22 comments



Also reviewed at Two Minute Papers with a few examples: https://youtube.com/watch?v=C9LDMzMRZv8 (hold on to your papers...)


What a time to be alive!


Well it's cool and all, but the results sit right at the very deepest part of the uncanny valley.


It's like a tech demo, a preview of the future. Give it 5 years and it will be super refined and probably the future of low-cost animation for kids' TV shows and the like. Then further out: just as no one animates without a computer now, no one will animate without AI assistance.


Idk about the future of animation. It would require look-alike actors. Wouldn't it be easier to use already available tech to transfer gestures and facial expressions to cartoon models?


As far as I can tell, it really depends on the intensity of the "slider".

When the intensity of the style transfer is pushed mostly to the right (high), it just looks like Pixar or cartoons. Nothing uncanny whatsoever.

But when they show it about a quarter of the way to the right... it's utter nightmare fuel, like plastic surgery taken way too far. The worst kind of uncanny valley, so I definitely agree with you there.


I wonder what integrating this into e.g. live streaming would require.


On Linux there are techniques involving a loopback device, such as v4l2loopback: https://github.com/umlaeute/v4l2loopback

Effectively these let an app (e.g. some VToonify tool) generate content that, from the perspective of your live-streaming app, looks like it is coming from a webcam.
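
As a rough sketch of how that could look in practice (pyvirtualcam is one wrapper around a v4l2loopback device; stylize() here is just a hypothetical stand-in for the VToonify inference call, not part of the project):

    import cv2
    import numpy as np
    import pyvirtualcam

    def stylize(rgb: np.ndarray) -> np.ndarray:
        # Hypothetical placeholder for the style-transfer model; identity here.
        return rgb

    cap = cv2.VideoCapture(0)           # the real webcam
    width, height, fps = 1280, 720, 10  # ~10fps, roughly what the paper's numbers suggest

    with pyvirtualcam.Camera(width=width, height=height, fps=fps) as cam:
        # cam.device is the v4l2loopback node (e.g. /dev/video2); point OBS/Zoom at it
        while True:
            ok, bgr = cap.read()
            if not ok:
                break
            bgr = cv2.resize(bgr, (width, height))
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            cam.send(stylize(rgb))      # streaming apps see this as a normal webcam
            cam.sleep_until_next_frame()

The v4l2loopback kernel module has to be loaded first (sudo modprobe v4l2loopback) so there is a dummy /dev/videoN device for the virtual camera to write to.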


I'm glad things are progressing, but it bugs me that AI is largely being developed for... things like this? I know this comment is a bit disparaging and minimizes greater achievements, and I apologize for that, but the closeness of content consumerism and AI is becoming quite off-putting.


Based on the numbers in the paper this is just a little too slow for use as a real-time video effect. At ~0.1 seconds per frame we need about a 3x improvement in performance to hit 30fps "real-time" video frame rates.
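
Spelling out that back-of-the-envelope arithmetic (numbers taken from the paper as quoted above, not re-measured):

    seconds_per_frame = 0.1                    # ~what the paper reports per frame
    current_fps = 1 / seconds_per_frame        # ~10 fps
    target_fps = 30
    speedup_needed = target_fps / current_fps  # ~3x
    print(current_fps, speedup_needed)         # 10.0 3.0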

And on that thought, since it appears they used Nvidia hardware (based on the CUDA dependency), it would be interesting to see how this performs on something like an M1/M2, where there's dedicated ML hardware to help offload and accelerate things.


The paper also says they used 8 Tesla V100s. Those are GPUs dedicated to ML and quite a bit more powerful than an M2.


Can you confirm that was for inference? I thought that was only for training (55 min on 8x V100).


You are right, inference only uses a single V100 according to the paper.


I missed the bit about using 8 of them to run it! Wow, that's a lot of GPU horsepower for this. More efficient to just use a vtuber-style pipeline with Unreal Engine and MetaHumans or other avatars... you only need one good GPU for that.


Follow-up: glad to know I read it right the first time.


Does M1/M2 really outperform CUDA on beefy ML GPUs in tasks like this? I'd love to see numbers if so; this seems extremely surprising.


M1 performs better in some realtime use-cases because of the unified memory: the GPU and ML hardware can work on a camera framebuffer directly without any copy.

CUDA always requires sending data over the PCI bus, at least when it comes to realtime camera processing. GPUDirect exists, but it's optimized for disks and NICs; I don't believe it's possible to use it with cameras.
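
For scale, a rough estimate of what one uncompressed 1080p frame costs to push across PCIe (assumed bandwidth figure; bandwidth only, not the latency and synchronization overhead that a zero-copy path avoids):

    frame_bytes = 1920 * 1080 * 3     # ~6.2 MB of uncompressed RGB
    pcie_bytes_per_s = 16e9           # assumed effective PCIe 3.0 x16 bandwidth
    copy_ms = frame_bytes / pcie_bytes_per_s * 1e3
    print(f"~{copy_ms:.2f} ms per frame, per direction")   # ~0.39 ms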


No idea actually. I just find all sorts of odd benchmarks cropping up where the unified memory architecture on the M1/M2 gives surprisingly good performance, because other CPU/GPU combinations take a DMA-transfer performance hit. It's far from universal, but it's been surprising to see, and this looked like the sort of workload that might benefit: camera decoding, the ML and GPU processing, then "rendering" back out. Hence my wondering out loud.


nVidia hardware has dedicated ML silicon, though?


It’s been a while since I bothered looking at anything made by nVidia, I didn’t like their business practices and to be honest… Living in the Mac ecosystem for 90% of my work/life means I followed the paltry AMD rocM stuff ardently and was so happy when Apple made an effort up front to get the M1 and M2 chips supported by the major ML frameworks.


Sure, but that’s not too far from the 12fps that cartoon animators often actually use.


That's a good point. I did briefly think about 24fps for "cinema", but as that's a bit of a weird frame rate for a streaming webcam I dismissed the idea. But the video does actually have just enough frame rate to look smooth, and I suppose that's because it lands near enough to 12fps that, combined with "looks like a cartoon", my brain is filling in the rest and barely noticing.



