Before the pandemic, my tiny startup was doing quite well selling edge AI systems based on our own lightweight AI inference engine, with object detection and face recognition for smart city and smart retail & food service applications.
When the real world shut down, there was suddenly nothing to monitor on streets and in restaurants, so I set out to evolve our real-time face recognition system into a video codec for high-quality face-to-face online interactions, as I was not satisfied with the quality of Zoom and friends. I got it to work, and the first iOS release was just approved on Apple's App Store: https://apps.apple.com/app/vertigo-focus/id1540073203
The way it works is that you create a meeting URL, which you can share out-of-band, for instance via Slack or text message. You can also share it as a QR code, which the app can scan to join a call. You then place your device on a surface in front of you so that the front camera can see you; it will recognize your face and assign you your own session, which is broadcast to the meeting channel. If more than one person is in view, both of you will be broadcast, but with separate session IDs, as if you were on separate cameras. Other meeting participants will show up on your screen and you can start talking. It is optimized for eye contact, meaning that the eyes actually make it through to the other side as more than just dark pixel clouds, so things should feel a bit more personal than the standard Zoom/Teams/Google Meet call.
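If you're curious what the sharing side could look like in code, here's a simplified sketch (not the actual app code; the URL format is just a placeholder) of generating a meeting link and rendering it as a QR code with Core Image:

    import CoreImage
    import UIKit

    // Simplified sketch: create a shareable meeting link and render it as a QR code.
    // The URL scheme/host below is a placeholder, not the app's real link format.
    func makeMeetingQRCode() -> (url: URL, qr: UIImage)? {
        let meetingID = UUID().uuidString.lowercased()
        guard let url = URL(string: "https://example.com/meet/\(meetingID)") else { return nil }

        guard let filter = CIFilter(name: "CIQRCodeGenerator") else { return nil }
        filter.setValue(url.absoluteString.data(using: .utf8), forKey: "inputMessage")
        filter.setValue("M", forKey: "inputCorrectionLevel")

        // Scale up the tiny generated code so it renders crisply on screen.
        guard let output = filter.outputImage?
            .transformed(by: CGAffineTransform(scaleX: 8, y: 8)) else { return nil }
        return (url, UIImage(ciImage: output))
    }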
Because it uses face recognition, you can ONLY show your face, and if you disappear from view your audio will stop after a while, to avoid situations like when you need to go to the restroom but forget to mute. This also solves dick pics etc.
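Roughly, the gating logic is something like this. A simplified sketch, not the shipped code: it uses Apple's Vision framework as a stand-in for our own detector, and the 5-second grace period is just an illustrative value:

    import Foundation
    import CoreVideo
    import Vision

    // Sketch of face-gated audio: keep broadcasting audio only while a face has
    // been seen within the last few seconds. Values and structure are illustrative.
    final class FaceGate {
        private var lastFaceSeen = Date.distantPast
        private let gracePeriod: TimeInterval = 5 // assumption, not the app's real timeout

        var audioEnabled: Bool {
            Date().timeIntervalSince(lastFaceSeen) < gracePeriod
        }

        func process(_ frame: CVPixelBuffer) {
            let request = VNDetectFaceRectanglesRequest { [weak self] req, _ in
                if let faces = req.results as? [VNFaceObservation], !faces.isEmpty {
                    self?.lastFaceSeen = Date()
                }
            }
            try? VNImageRequestHandler(cvPixelBuffer: frame, options: [:]).perform([request])
        }
    }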
The codec is not based on H26[45]; it is pure AI running on the GPU. One neural network compresses the video in real time, and another decompresses it on the receiving end. Finding a tight network architecture that could do this in real time with acceptable quality was a major part of the effort. Several quality settings are possible, but right now it is set fairly high: at 20 FPS it maxes out around 700 kbit/s, though it typically uses about half that. I've demonstrated good results down to around 200 kbit/s, so in theory it should work over satellite links or even Bluetooth. The protocol is UDP with no congestion control, but with (Wirehair) FEC to protect against mild packet loss; future versions will detect packet loss and adapt to the available bandwidth.
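To give a sense of the transport: at 700 kbit/s and 20 FPS, each compressed frame has a budget of roughly 4-5 KB, i.e. a handful of UDP datagrams. Below is a simplified sketch of the chunking idea; the header layout and chunk size are placeholders rather than the real wire format, and the Wirehair FEC step is omitted:

    import Foundation
    import Network

    // Sketch: split one compressed frame into MTU-sized UDP datagrams, each with a
    // tiny header (frame id, chunk index, chunk count) so the receiver can reassemble
    // the frame or feed the chunks to an FEC decoder. Placeholder wire format.
    struct FrameSender {
        let connection: NWConnection
        let payloadSize = 1100 // assumption: stay comfortably under a 1500-byte MTU

        init(host: String, port: UInt16) {
            connection = NWConnection(host: NWEndpoint.Host(host),
                                      port: NWEndpoint.Port(rawValue: port)!,
                                      using: .udp)
            connection.start(queue: .global())
        }

        func send(frame: Data, frameID: UInt16) {
            let chunks = stride(from: 0, to: frame.count, by: payloadSize).map {
                frame.subdata(in: $0..<min($0 + payloadSize, frame.count))
            }
            // Assumes a frame never needs more than 255 chunks (true at these bitrates).
            for (index, chunk) in chunks.enumerated() {
                var packet = Data()
                packet.append(contentsOf: withUnsafeBytes(of: frameID.bigEndian) { Array($0) })
                packet.append(UInt8(index))
                packet.append(UInt8(chunks.count))
                packet.append(chunk)
                connection.send(content: packet, completion: .contentProcessed({ _ in }))
            }
        }
    }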
The audio just uses Opus and may click a little bit; I blame AudioEngine, or the fact that the last time I wrote audio code was for the game I published for the Amiga in 1994.
If you don't have a friend around or multiple devices to play with, there is an "echo test" server mode that lets you be in a meeting with yourself. Traffic will be peer-to-peer if possible; otherwise you will be relaying through my tiny Raspberry Pi server, so YMMV. I plan to try to switch to something like fly.io soon to improve scalability.
There is also a macOS version coming very soon, and the underlying AI engine also runs on Windows & Linux. Android support is planned.
Please take a look and let me know what you think.
We worked on this about 3 years ago, plus background removal (real-time alpha matting is still not really done well by anyone), portrait re-lighting (Google Meet is now doing this), and even eye contact (adjusting eye position to create the illusion of looking directly into the camera).
Some findings:
- Competing with dedicated H264/5 chips is very hard, especially when it comes to energy efficiency, which mobile users ultimately care a lot about. Super-resolution on top of H264 is probably the way to go (see the sketch after this list).
- It's hard to turn this into a business (corporates that pay for Zoom don't seem to care much).
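To illustrate what I mean by super-resolution on H264: let the dedicated hardware decode a low-resolution stream as usual, then upscale each decoded frame with a small on-device model. A rough Core ML sketch, where the super-resolution model is something you'd supply yourself (nothing like it ships with the OS):

    import CoreML
    import CoreVideo
    import Vision

    // Sketch: run a (user-provided) Core ML super-resolution model on frames coming
    // out of the hardware H264 decoder. The model itself is an assumption.
    final class FrameUpscaler {
        private let request: VNCoreMLRequest

        init(superResModel: MLModel) throws {
            request = VNCoreMLRequest(model: try VNCoreMLModel(for: superResModel))
            request.imageCropAndScaleOption = .scaleFill
        }

        // Returns the upscaled version of one decoded (low-resolution) H264 frame.
        func upscale(_ decodedFrame: CVPixelBuffer) throws -> CVPixelBuffer? {
            try VNImageRequestHandler(cvPixelBuffer: decodedFrame, options: [:])
                .perform([request])
            return (request.results?.first as? VNPixelBufferObservation)?.pixelBuffer
        }
    }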
PS Also a big fan of super-tiny AI models (binarized NNs, frequency-domain NNs, etc) for edge applications. Happy to chat!