Better video compression? Faster network speeds? Alternative network protocols?
With all due respect to all the amazing folks working in the domain, as someone outside the field I find the quality of even 1:1 video communication still far from ideal.
I wanted to understand a bit about what the main underlying hurdles are. Folks say there's less room for improvement in compression after H.264. I'm not sure how much network speeds are a factor, given that things can get glitchy even on wired, high-bandwidth connections. Audio artifacts definitely impact the perceived quality, so I'm not sure whether there's room for technical improvement there either.
A lot of what makes Skype/FaceTime/WebRTC/Chrome suck is the compromises and complexity inherent in trying to do the best you can when network conditions aren't ideal -- and sometimes, those techniques end up adding latency even when you do have a great network connection.
Receiver-side dejitter buffers add latency. Sender-side pacing and congestion control add latency. In-network queueing (when the sender sends more than the network can accommodate and packets wait in line at a bottleneck) adds latency. Waiting for retransmissions adds latency. Low frame rates add latency. Encoders that can't accurately hit a target frame size on an individual-frame basis add latency. And networks that decrease their available throughput (either because another flow is now competing for the same bottleneck, or because the bottleneck link capacity itself deteriorated) cause previously sustainable bitrates to start building up in-network queues, which adds latency.
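To make the first of those concrete, here's a minimal Python sketch of the dejitter-buffer idea (toy numbers and made-up names, not code from any real stack): every frame is held until a fixed playout deadline, so jitter up to that budget is hidden, but every frame pays that delay even when the network is behaving.

    import heapq

    class DejitterBuffer:
        """Toy receiver-side dejitter buffer: hold each frame until a scheduled
        playout time, trading a fixed added delay for smooth playback."""

        def __init__(self, target_delay_ms):
            self.target_delay_ms = target_delay_ms  # latency we accept to absorb jitter
            self.base_offset = None                 # sender-clock-to-receiver-clock offset
            self.heap = []                          # (playout_ms, seq, frame)
            self.seq = 0

        def on_frame(self, sender_ts_ms, arrival_ms, frame):
            if self.base_offset is None:
                # Calibrate off the first arrival; later frames that took longer
                # than this one show up as positive jitter.
                self.base_offset = arrival_ms - sender_ts_ms
            playout_ms = sender_ts_ms + self.base_offset + self.target_delay_ms
            heapq.heappush(self.heap, (playout_ms, self.seq, frame))
            self.seq += 1

        def frames_due(self, now_ms):
            # Release every frame whose playout deadline has passed.
            due = []
            while self.heap and self.heap[0][0] <= now_ms:
                due.append(heapq.heappop(self.heap)[2])
            return due

Picking target_delay_ms is the whole tradeoff: too small and late packets miss their deadline, too big and everything you see and hear is stale.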
And automatic echo cancellation can make audio incomprehensible, no matter how good the compression is (but the alternative is feedback, or making you use a telephone handset).
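To give a flavor of what echo cancellation is doing (this is a textbook NLMS adaptive filter sketch, not any particular product's implementation): the canceller learns the speaker-to-microphone echo path and subtracts its prediction from the mic signal, and the extra machinery a real system bolts on top of this (double-talk detection, residual echo suppression) is where near-end speech can get mangled.

    import numpy as np

    def nlms_echo_cancel(far_end, mic, filter_len=128, mu=0.5, eps=1e-6):
        """Toy NLMS acoustic echo canceller. far_end and mic are equal-length,
        time-aligned sample arrays; returns the mic signal with the estimated
        echo of the far-end signal subtracted."""
        w = np.zeros(filter_len)                    # adaptive echo-path estimate
        out = np.zeros(len(mic))
        padded = np.concatenate([np.zeros(filter_len - 1), far_end])
        for n in range(len(mic)):
            x = padded[n:n + filter_len][::-1]      # recent far-end samples, newest first
            echo_hat = w @ x                        # predicted echo at the mic
            e = mic[n] - echo_hat                   # residual = near-end speech, we hope
            out[n] = e
            w += (mu / (x @ x + eps)) * e * x       # NLMS weight update
        return out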
Another problem is that the systems in place are just incredibly complex. The WebRTC.org codebase (used in Chrome and elsewhere) is something like a half million lines of code, plus another half million lines of vendored third-party dependencies. The WebRTC.org rate controller (the thing that tries to tune the video encoder to match the network capacity) is very complicated and stateful, has a bunch of special cases, and is written in such a general way that it's hard to reason about.
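For contrast, the core job -- stripped of all that generality -- looks roughly like this toy delay-based controller (emphatically NOT the WebRTC.org algorithm; the thresholds and constants are made up): back off when rising one-way delay suggests a queue is building, probe upward when the path looks idle.

    class ToyRateController:
        """Toy delay-based rate controller: nudge the encoder's target bitrate
        down when one-way queueing delay grows, probe upward when it doesn't."""

        def __init__(self, start_bps, min_bps=100_000, max_bps=5_000_000):
            self.target_bps = float(start_bps)
            self.min_bps = min_bps
            self.max_bps = max_bps
            self.base_delay_ms = None   # smallest delay seen; proxy for an empty queue

        def on_feedback(self, one_way_delay_ms):
            # Track the minimum observed delay as the uncongested baseline.
            if self.base_delay_ms is None or one_way_delay_ms < self.base_delay_ms:
                self.base_delay_ms = one_way_delay_ms
            queueing_ms = one_way_delay_ms - self.base_delay_ms

            if queueing_ms > 50:
                self.target_bps *= 0.85       # queue building: back off multiplicatively
            elif queueing_ms < 10:
                self.target_bps += 50_000     # path looks idle: probe upward additively

            self.target_bps = max(self.min_bps, min(self.max_bps, self.target_bps))
            return int(self.target_bps)       # new target to hand to the encoder

The real thing also has to deal with packet loss, feedback gaps, probing, and an encoder that doesn't hit its target anyway, which is where the statefulness and special cases come from.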
And the fact that the video encoder and the network transport protocol are usually implemented separately, by separate entities (the encoder is designed as a plug-in component that serves many masters, of which low-latency video is only one, and is often baked into hardware), and that each has its own control loop running at similar timescales, also makes things suck. Things would work better if the encoder and transport protocol were purpose-designed for each other, maybe with a richer interface between them (I'm not talking about changing the compressed video format itself, just the encoder implementation). BUT then you probably wouldn't have access to such a competitive market of pluggable H.264 encoders to slot into your videoconferencing program, and it wouldn't be so easy to swap out H.264 for H.265 or AV1 when those come along. And if you care about the encoder being power-efficient (and implemented in hardware), making your own better encoder isn't easy, even for an already-specified compression format.
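To illustrate what a richer interface could look like (a loose sketch of the general idea, with hypothetical names, not anyone's actual API): let the encoder hand the transport more than one candidate encoding of each frame, and let the transport pick whichever candidate fits its current estimate of what the network can absorb -- or skip the frame entirely rather than queue it behind a bottleneck.

    from dataclasses import dataclass

    @dataclass
    class EncodedFrame:
        quality: int     # encoder quality setting used for this candidate
        payload: bytes   # compressed bits for this candidate

    def pick_frame_to_send(candidates, budget_bytes):
        """Given candidate encodings of the SAME frame, ordered from lowest to
        highest quality, return the best one that fits the transport's current
        per-frame byte budget, or None to skip the frame rather than queue it."""
        chosen = None
        for cand in candidates:
            if len(cand.payload) <= budget_bytes:
                chosen = cand          # keep upgrading while it still fits
        return chosen

    # Example: the transport currently thinks the path can absorb ~12 kB per frame.
    versions = [EncodedFrame(quality=20, payload=bytes(9_000)),
                EncodedFrame(quality=35, payload=bytes(14_000))]
    best = pick_frame_to_send(versions, budget_bytes=12_000)
    print(best.quality if best else "skip")   # -> 20

The point is that the quality-versus-size decision gets made with the transport's up-to-the-moment information, instead of by two control loops second-guessing each other.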
Our research group has some results on trying to do this better (and also more simply) in a principled way, and we have a pretty good demo video: https://snr.stanford.edu/salsify . But there are a lot of practical/business reasons why you're using WebRTC or FaceTime and not this.