Ask HN: What's preventing us from achieving seamless video communication?
112 points by orangep on May 12, 2019 | 62 comments
Better video compression? Faster network speeds? Alternative network protocols?

With all due respect to the amazing folks working in the domain: as a person outside the field, I find the quality of even 1:1 video communication still far from ideal.

Wanted to understand a bit what the main underlying hurdles are. Folks say there's less room for improvement in compression after H.264. I'm not sure how much network speeds are a factor, given that things can get choppy even on wired high-bandwidth connections. The audio artifacts definitely impact the perceived quality, so I'm not sure whether there's room for technical improvement there.




Low-latency packet video can work incredibly well over a dependable network connection (with a known constant throughput and no jitter), low end-to-end per-packet latency, and good isolation between everybody's microphone and speaker. This was mostly solved in the 1990s.

A lot of what makes Skype/FaceTime/WebRTC/Chrome suck is the compromises and complexity inherent in trying to do the best you can when these things don't hold -- and sometimes, those techniques end up adding latency even when you do have a great network connection.

Receiver-side dejitter buffers add latency. Sender-side pacing and congestion control add latency. In-network queueing (when the sender sends more than the network can accommodate, and packets wait in line at a bottleneck) adds latency. Waiting for retransmissions adds latency. Low frame rates add latency. Encoders that can't accurately hit a target frame size on an individual-frame basis add latency. And when a network's available throughput decreases (either because another flow is now competing for the same bottleneck, or because the bottleneck link capacity itself deteriorated), a previously sustainable bitrate starts building up in-network queues, adding latency.
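To make the first of those concrete, here's a toy dejitter buffer in Python (purely illustrative, not from any real stack): packets are held for a fixed playout delay so that late arrivals can still be played in order, and that playout delay is exactly the latency you pay to absorb jitter.

    import heapq, time

    class DejitterBuffer:
        # Hold packets for a fixed playout delay so late arrivals can
        # still be played in sequence order. The playout delay is the
        # latency we deliberately add to absorb network jitter.
        def __init__(self, playout_delay=0.060):   # 60 ms of added latency
            self.playout_delay = playout_delay
            self.heap = []   # (sequence_number, arrival_time, payload)

        def push(self, seq, payload):
            heapq.heappush(self.heap, (seq, time.monotonic(), payload))

        def pop_ready(self):
            # Return packets whose playout deadline has passed, lowest
            # sequence number first; anything still "young" keeps waiting.
            now = time.monotonic()
            ready = []
            while self.heap and now - self.heap[0][1] >= self.playout_delay:
                seq, _, payload = heapq.heappop(self.heap)
                ready.append((seq, payload))
            return ready

A bigger playout_delay rides out more jitter but directly worsens end-to-end delay; that's the trade every item in the list above is making in one form or another.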

And automatic echo cancellation can make audio incomprehensible, no matter how good the compression is (but the alternative is feedback, or making you use a telephone handset).
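For a flavor of why echo cancellation is delicate, here's a toy normalized-LMS canceller (a sketch of the general technique, not what any particular product ships): it adapts a filter to predict the speaker-to-mic echo path and subtracts the prediction, and when the adaptation goes wrong (e.g. during double-talk, when both ends speak at once), it's the near-end speech that gets mangled.

    import numpy as np

    def nlms_echo_cancel(mic, far_end, taps=256, mu=0.5, eps=1e-6):
        # Adaptively estimate the echo path from the far-end (speaker)
        # signal and subtract the predicted echo from the mic signal.
        w = np.zeros(taps)                    # FIR estimate of the echo path
        out = np.zeros_like(mic, dtype=float)
        for n in range(taps, len(mic)):
            x = far_end[n - taps:n][::-1]     # most recent far-end samples
            echo_hat = w @ x                  # predicted echo at the mic
            e = mic[n] - echo_hat             # near-end speech + residual echo
            out[n] = e
            w += mu * e * x / (x @ x + eps)   # NLMS weight update
        return out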

Another problem is that the systems in place are just incredibly complex. The WebRTC.org codebase (used in Chrome and elsewhere) is something like a half million lines of code, plus another half million of vendored third-party dependencies. The WebRTC.org rate controller (the thing that tries to tune the video encoder to match the network capacity) is very complicated and stateful and has a bunch of special cases and is written in a really general way that makes it hard to reason about.

And the fact that the video encoder and the network transport protocol are usually implemented separately, by separate entities (and the encoder is designed as a plug-in component to serve many masters, of which low-latency video is only one, and often baked into hardware), and each has its own control loop running at similar timescales also makes things suck. Things would work better if the encoder and transport protocol were purpose-designed for each other and maybe with a richer interface between them (I'm not talking about changing the compressed video format itself; just the encoder implementation), BUT, then you probably wouldn't have access to such a competitive market of pluggable H.264 encoders you could slot in to your videoconferencing program, and it wouldn't be so easy for you to swap out H.264 for H.265 or AV1 when those come along. And if you care about the encoder being power-efficient (and implemented in hardware), making your own better encoder isn't easy, even for an already-specified compression format.
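As a sketch of what "purpose-designed for each other" could look like (loosely in the spirit of the above; camera, encoder, and transport here are hypothetical interfaces, not a real API): the transport's moment-to-moment estimate of what the network can absorb directly sets the size of the next encoded frame, so there's one control loop instead of two fighting each other.

    MIN_FRAME_BYTES = 2000   # assumed floor below which a frame isn't worth sending

    def conferencing_loop(camera, encoder, transport):
        # One joint control loop: the transport tells the encoder how big
        # the next frame may be, and frames are skipped rather than queued.
        while True:
            frame = camera.next_frame()
            budget = transport.bytes_safe_to_send()   # e.g. from recent ACK timing
            if budget < MIN_FRAME_BYTES:
                continue                  # skip this frame; don't build a queue
            compressed = encoder.encode(frame, target_size=budget)
            transport.send(compressed)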

Our research group has some results on trying to do this better (and also simpler) in a principled way, and we have a pretty good demo video: https://snr.stanford.edu/salsify . But there's a lot of practical/business reasons why you're using WebRTC or FaceTime and not this.


Keith, thanks for your inputs.

Noob question: Where do you think Google's BBR https://ai.google/research/pubs/pub45646 and CMU's HFSC https://www.cs.cmu.edu/~hzhang/HFSC/main.html fall short?

--

For folks unaware, Keith created https://mosh.org and is an expert in Computer Networks.

Here's a relevant talk by Keith on Sprout, a new transport protocol for live video on noisy cellular networks that uses probabilistic inference to predict congestion; and Remy, a program that generates transport protocols on-the-fly in response to network conditions: https://youtu.be/UsCOVF0vDe8

And here's a talk at Usenix on Salsify: https://youtu.be/LPj2ffe7Isk news.yc discussion: https://news.ycombinator.com/item?id=16964112

Cambridge's David MacKay on Information Theory, Pattern Recognition, and Neural Networks: https://videolectures.net/course_information_theory_pattern_...


I remember good old Skype from 3 or 4 years ago that didn't use to suck at this so much, at least on the audio side. When the network connection was really bad, it dropped to something like phone quality; when it was good, the quality was closer to FaceTime. You could hear the quality shift dynamically as network conditions changed (i.e. someone starting a download). I really enjoyed talking on Skype back then; I knew I could depend on it to get the best possible, whatever the conditions were. Now it's much worse: the quality is average, it doesn't handle bad connections well, and it doesn't fully utilize good ones. I don't think we have anything as good now as we had back then. There's TeamTalk, which is amazing for audio, but much more cumbersome to set up.


Why do you believe the quality has gone backwards? Is there something they're optimizing for that compromises quality?


Not sure myself, I'm way out of my depth here, but a friend who knows much more about this sort of stuff told me it's probably something to do with mobile connections. Apparently slow landline connections and bad cellular connections like 3G are unreliable in different ways. Also, most mobile phones (definitely all iPhones, not sure about Android) can't go higher than 16 kHz for audio calls, so anything better is pointless. Maybe it's just M$ trying to save some pennies on the infra, especially since the calls aren't p2p any more (also because of mobile connections and restrictive NATs). I know they're supposedly not that bad any more, but considering what they've done to Skype's other aspects, I wouldn't be surprised, really.


Well put! It's also worth mentioning that under these network conditions, UDP is a better choice than TCP: a TCP retransmission of a real-time video packet usually arrives too late to be useful.

However, every protocol doing this must have a fallback from UDP to TCP, because a surprising number of corporate firewalls out there arbitrarily limit the use of UDP.

We worked extensively with WebRTC, but are now switching to our own protocol. The main reason is that WebRTC is very complex and generic; you can build a better user experience if you adapt the protocol to your use case.

Another reason is that we can more easily detect firewalls and give the user feedback about it.
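A minimal version of that firewall detection might look like this, assuming you run your own UDP echo service at a known host:port (a sketch of the idea, not our actual protocol):

    import socket

    def udp_blocked(host, port, timeout=1.0):
        # Send a UDP probe to our own echo server; if nothing comes back
        # within the timeout, assume a middlebox is dropping UDP and fall
        # back to TCP (and tell the user why their call may be worse).
        probe = b"probe"
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(timeout)
        try:
            s.sendto(probe, (host, port))
            data, _ = s.recvfrom(64)
            return data != probe
        except socket.timeout:
            return True
        finally:
            s.close()

In practice you'd probe a few times before concluding anything, since a single lost packet isn't evidence of a firewall.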


Thanks for your work on the Salsify paper -- it was a great read, and the novel approach was illuminating.

I wonder if it would be possible to get Salsify into the browser using something like web-udp[0]. I don't think the lower level access to the encoder information is available to coordinate network usage...

For those who are interested in code, check out the encoder on Github[1].

[0]: https://github.com/osofour/web-udp

[1]: https://github.com/excamera/alfalfa


I bookmarked your link and the paper it links to for reading on my Monday commute.

In college, I had one Networks course which discussed IPs, subnets, cidr, UDP, TCP, various higher level protocols, packet headers, etc. There were a couple projects - I think I recall implementing TCP or sliding window. Anyway, that’s the extent of my background knowledge.

Where would one start if they want to dive deeper in video transmission or related topics that provide more than the very basic understanding I have?


Since you seem to be very familiar with the topic: do you know whether most popular video chat applications stream voice and video directly between peers, or do they have to go through some central server? And if so, how much latency does that cost us?


WebRTC media is peer-to-peer when it can be. You rely on a separate signaling channel to set up the call and on STUN servers for NAT traversal; when hole punching fails, a TURN server relays the media, so it isn't always strictly p2p.


The answer might be a mix, if the video conference involves more than a trivial number of participants.


Are there any video chat platforms that just assume a good network quality (or have an option for such) so in the happy case there's less latency?


Any relation between Salsify and the Puffer project?


Thank you for making mosh!


Zoom and FaceTime are pretty decent, but you are completely right. The fundamental problem is that we have a huge stack of technologies that just barely work. It runs from USB to video drivers, to operating systems, to videoconferencing software, to Wi-Fi, to home/office routers, to cable modems, to cable infrastructure, to TCP/IP, to backbone network capacity, and back. Everything is pushed to the limit. It's basically Richard Gabriel's "worse is better" problem, compounded. Everyone gets their part to something like 99% reliability.

If you're depending on 20 things, each having 99% reliability, the system has 82% reliability. Roughly speaking, that's what's happening. There is no silver bullet to fix this. Bringing one layer from 99% to 100% brings the system from 82% to 83%.
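The arithmetic, for anyone who wants to check it:

    print(0.99 ** 20)   # ~0.818, i.e. ~82% system reliability
    print(0.99 ** 19)   # ~0.826: perfecting one layer buys about one point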


Seems like we would do well to find a way to implement some kind of certification process, like we do with engineers. [1]

If you could lose your certification for writing low-quality code, then (hopefully, if the certifiers have the right processes in place) you wouldn't write that code. That way, we could finally compare the importance of quality in writing a device driver to the importance of quality in designing a bridge.

[1] https://www.nspe.org/resources/licensure/what-pe


The public has consistently chosen the cheaper but slightly worse option across decades of technology. This, more than engineer skill, puts an upper bound on reliability.


An engineer's job is to meet the customer's specification. Wasting the customer's money to deliver something far in excess of the requirements is also a path to no longer working as an engineer.

You've no right to be angry at the engineer because you wish to drive a semi truck down what you commissioned and budgeted as a footbridge.


Let's start by getting seamless audio communication.

There's too much focus on video, when deficient audio has a greater impact on rapport. Even an old-school landline phone gives a more fluid conversation than modern video conferencing.

The trouble is, mainstream conferencing solutions are challenged by customers who expect great audio out of poorly spec'd rooms and microphones. The result is so much software 'masking' of poor inputs that we all feel totally disconnected from why the system as a whole just isn't working well.

(A small plug for my own focus on audio; cleanfeed.net, which is WebRTC-based with some additional magic)


The microphones and speakers in an analog landline phone are not high end either, but they still sound amazing compared to Zoom.


I’m sayin!!!


I used to work in this field several years ago. It’s gotten way better over time but there’s room for continuing improvement.

Personally I think it’s a bit of everything:

There’s almost no standardization in signaling protocols. Things like FaceTime and WhatsApp don’t interoperate.

NAT hole punching remains a complex problem. It’s not easy to solve.

Bandwidth is often not stable for long periods of time. Bandwidth drops, latency spikes, packets get lost or retransmitted. WiFi connections are sketchy. Wired connections are better but still packet switched. Cellular wireless systems are overloaded and suffer from multipath fading.

Encoders are insanely complicated to build. Hardware acceleration isn't easy to implement either. Configuring an encoder's parameters for a given connection environment is hard, and remains a craft rather than a science.

The human eye seems to be much more sensitive to artifacts than the human ear. Cameras are hard to tune and expensive. Autofocus, white balance, etc. affect call quality quite badly. Camera placement is still a challenge. Minor changes to lighting and colors can make huge shifts in quality.

This is why video from dedicated conference rooms is way better than video calls from phones or laptops. The state of the art under controlled conditions is really unbelievably amazing.


Here’s my question: why is video conferencing designed such that when the video drops out or lags, so does the audio? I never (or far less often) have problems with plain VoIP. Why not make a video chat client that treats the audio as a first-class citizen, and then displays the video if it happens to be coming through clearly? Is re-syncing the streams too hard? This seems like a fix that would make the experience so much better without pushing more data through the network.
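Something like this receiver logic, say (a sketch of the idea; play, display, and the stream objects are hypothetical): audio plays on its own clock and never waits for video, while a video frame is shown only if it lands close enough to the audio position, with the last good frame staying on screen otherwise.

    MAX_SKEW = 0.150   # assumed tolerable audio/video offset, in seconds

    def render_loop(audio_stream, video_stream):
        # Audio is the first-class citizen: it is played unconditionally.
        while True:
            samples, audio_ts = audio_stream.next_chunk()   # never blocks on video
            play(samples)
            frame = video_stream.latest_frame()
            if frame is not None and abs(frame.timestamp - audio_ts) <= MAX_SKEW:
                display(frame)
            # else: keep showing the previous frame; audio continues regardless

(Re-syncing is presumably the hard part: if you let the skew grow past the threshold instead of dropping frames, you get the badly-dubbed-movie effect, which may be why most clients keep the streams locked together.)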


Probably because crappy audio is received better than audio out of sync with the video. Having the "badly dubbed foreign film" effect on a conference call is distracting as hell.

For what it's worth, voice over IP is essentially solved; tools like Mumble can host thousands of people with top-tier quality. I'm not sold at all on why you'd even want video; 90% of business meetings (in my experience) are screen shares that could be communicated by just sending a link anyway. For personal calls, use a direct peer-to-peer connection; in my experience, all the problems of video calling come from going through the intermediating server.


I think Skype might already be doing something like this. The audio passes through OK when the screen feed freezes due to network congestion.


I basically do this already for crucial video conferences: I call in on my phone as well as laptop and keep the phone audio on standby for the inevitable dropouts


Someone needs to design vidconf hardware where (somehow) the camera is in the middle of the screen so you are looking directly into each other's eyes.


This exists in several forms, ranging from teleprompter-style setups to cameras embedded in the screen. They're currently expensive, but I'd expect to see them hitting the market in the next 5-10 years.

There are also software/post-processing implementations of this out there which try to emulate eye contact. The ones I saw 4-5 years ago were definitely uncanny valley but there may be more happening there of late.

This gets even more interesting in multi-party situations, where there may be 3+ locations with more than one person at each location.


> cameras embedded in the screen

Seems like a privacy nightmare. You can’t have a hardware toggle to close the camera nor cover it with a piece of tape.


The Librem guys are making hardware with power switches for camera, GPS, etc. [1]

1: https://puri.sm/posts/lockdown-mode-on-the-librem-5-beyond-h...


Or maybe just clever software trained to filter the video feed so the person's gaze is shifted up.


Get an autocue (teleprompter), set up your screen below it so it's reflected, and feed the camera in as an external input.


Today, there are N video-conferencing solutions on the market, all of them rather "enterprise" in their quality. Most meetings start with a good 5-10 minutes of "Can you hear me now?".

Aspiring youths see this market ripe for disruption, and next year there are N+1 video-conferencing solutions, all of them rather shit.

The biggest challenge doesn't seem to lie in the actual video quality. The problem is in getting the damn call up in the first place, with all participants seeing and hearing all other participants.


That people don't really want it.

As far as I can see, it is possible and good enough today with Skype, FaceTime, and so on, plus a good internet connection; yet people prefer email, chat, telephone... I think that is what is holding it back, and because of the lack of demand there isn't really big investment in creating hardware and software to support it.


I believe it's a chicken and egg problem. I know I would rely on video way more if my first sentence in every video call didn't have to be "can you hear me?"


The eggs have been there for a long time and plenty of chickens have been hatched; but no one is eating them.

I am pretty sure that I can do a high quality FaceTime call with a number of people.

Yet still I prefer doing a regular phone call. As so does just about everyone else. The question you should ask is: Why is that?

It is not a technical problem.


> I am pretty sure that I can do a high quality FaceTime call with a number of people.

A number of people that could afford $1000+ to be able to use that platform.

I personally wouldn't consider this problem to be technically-solved until an Android that can barely be considered as a mid-ranger can do it.


I wouldn't expect a mid-ranger phone to be fast and completely reliable. That's why it's just a mid-ranger. The price is a compromise; you get shitty radio hardware and shitty software so you get shitty videoconferencing.


Is a mid-range phone today worse than a high end phone 5 years ago? Because the hardware at the high end 5 years ago could handle it decently iirc. Though it really depended on your internet connection, latency and other issues much more than the hardware on the device really.

For the most part, given the margins at the high end, it's not significantly better.


Latency SLAs are what’s missing: “No packets shall be lost, and latency shall not fluctuate”.

Classical phone lines worked so well because the latency is constant. You might get static, but we're adapted to listening through static. You never get timing disruptions.

Videoconferencing is a black art of trying to smooth over tiny latencies that the human brain is wired to be extraordinarily sensitive to. People read too much into a single lost packet.

This same problem applies to VR - if the latency is not rigorously consistent, people vomit.

It is possible to design a network that connects two people reliably and with predictable latency - last century’s landline phone network stands as proof. Until someone builds a network that offers the same level of service for videoconferencing, it will continue to be a tool of last resort.


That technology is circuit switching, i.e. the antithesis of computer networking.


Dolphin used fiberoptic interconnects to provide either maximum bandwidth or minimum latency, with a cross-connection grid that could easily be tuned with a software switch to provide narrowly-predictable extremely low latency with an SLA.

To say that this is circuit switching may or may not be correct, but it's wrong to say that it's the antithesis of computer networking.

EDIT: See also Cloudflare’s ultra-predictable, low-jitter backbone: https://blog.cloudflare.com/argo-and-the-cloudflare-global-p...


The biggest problem with videoconferencing is using movie video codecs (designed to maximize compression, not to control latency in multicast) tuned and tweaked for the worst client. The typical video conference server uses a hand-tuned GStreamer pipeline and a mechanism for requesting keyframes at the pace of the slowest client. That works OK-ish for up to a few hundred connections when all peers have quality (fiber) connections. Scalable video conferencing with, e.g., thousands of concurrent clients requires a different approach to the problem; it's gradually being solved with specialized video hardware acceleration in the conference gateway server, changes in the network infrastructure, and lower-latency devices on the client side. A sketch of that keyframe-pacing pattern is below.
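Here's roughly what the keyframe pacing looks like (a sketch of the pattern described above; encoder.force_keyframe() and the 2-second floor are assumptions, not any particular server's API):

    import time

    class KeyframeThrottle:
        # Clients that lose packets ask for a keyframe, but requests are
        # coalesced so the encoder isn't forced to emit expensive keyframes
        # faster than some floor interval -- i.e. at the pace of the
        # slowest client rather than on every individual request.
        def __init__(self, encoder, min_interval=2.0):
            self.encoder = encoder
            self.min_interval = min_interval
            self.last_keyframe = 0.0
            self.pending = False

        def on_client_request(self):
            self.pending = True

        def tick(self):
            now = time.monotonic()
            if self.pending and now - self.last_keyframe >= self.min_interval:
                self.encoder.force_keyframe()   # hypothetical encoder hook
                self.last_keyframe = now
                self.pending = False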


The lack of circuit switching and bandwidth.


Exactly.

ISDN was a 64kb/s circuit switched channel end to end, rigidly clocked. Every bit came in on schedule. Voice with no jitter. A friend of mine in Switzerland had ISDN home phones until last month, when Swisscom discontinued it in favor of a VoIP system with worse voice quality.

If there had been a video successor to ISDN, say a 10mb/s circuit, we'd have real time HDTV video chat with no jitter.

Voice and video over IP only work because of horrible kludges to deal with jitter and lag.


How about this:

    $ ping google.com.au
    PING google.com.au (216.58.196.131) 56(84) bytes of data.
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=1 ttl=51 time=3239 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=2 ttl=51 time=4524 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=3 ttl=51 time=4434 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=4 ttl=51 time=3622 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=5 ttl=51 time=1022 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=6 ttl=51 time=849 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=7 ttl=51 time=1030 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=8 ttl=51 time=974 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=9 ttl=51 time=897 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=10 ttl=51 time=1022 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=11 ttl=51 time=1008 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=12 ttl=51 time=949 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=13 ttl=51 time=871 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=15 ttl=51 time=1103 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=16 ttl=51 time=1005 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=17 ttl=51 time=830 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=18 ttl=51 time=752 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=19 ttl=51 time=703 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=20 ttl=51 time=899 ms
    64 bytes from syd15s04-in-f3.1e100.net (216.58.196.131): icmp_seq=21 ttl=51 time=821 ms

Does this answer the question? =)


You need deterministic performance. There is a standard for that, AVB [0], but it requires AVB-compliant switches/hubs. It's used in pro audio, live sound, home audio, and corporate environments with great success. In a perfect world, the internet (at least the wired internet) would be AVB-compliant.

https://en.m.wikipedia.org/wiki/Audio_Video_Bridging


Could we send two parallel, redundant streams of video and audio and then, at the receiver, pick packets selectively? When a packet arrives on either stream, keep it if it's the first copy we've seen, and discard it if we already got one with the same id. Something along the lines of adding redundancy to compensate for a poor connection? A sketch of the receiver side is below.
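Something like this, say (a sketch; merge_arrivals is a hypothetical helper that yields (packet_id, payload) pairs from both sockets as they arrive):

    def dedup(merged_arrivals):
        # Keep the first copy of each packet id and drop the duplicate.
        # Costs 2x bandwidth, but a packet is lost only if BOTH copies are lost.
        seen = set()
        for packet_id, payload in merged_arrivals:
            if packet_id in seen:
                continue            # we already delivered this packet
            seen.add(packet_id)
            yield packet_id, payload

    # usage: for pid, data in dedup(merge_arrivals(sock_a, sock_b)): ...

This is basically packet duplication, the crudest form of forward error correction; FEC schemes get similar protection for less than 2x overhead.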


1. Check out this recent comment thread from a few days ago as an entry point to learning about the numerous performance bounds caused by both design limitations and long-standing feature gaps in the network protocols we nowadays use for global communications (I intentionally avoid the term "Internet" here because, as one can hopefully glean from the content linked there, that name seems undeserved by our "global information superhighway"):

https://news.ycombinator.com/item?id=19864808

2. Something important and very relevant to your question, which I forgot to mention in that thread: a recent ITU slide deck by Richard Li (Huawei):

https://www.itu.int/en/ITU-T/Workshops-and-Seminars/201807/D...

From this recent ITU workshop: https://www.itu.int/en/ITU-T/Workshops-and-Seminars/201807/P...

The webcast of which you can find here:

https://www.itu.int/webcast/archive/t2017075g#video (third presentation in the video)

Naturally, given the information from point #1 on RINA, GNUnet, and current network technology issues, combined with the fact that the plans presented above don't seem to factor that information in, I dislike some of the directions the ITU and Huawei seem to be going there. BUT, even so, this presentation basically answers most of your questions exactly (I think the other commenters already did an excellent job answering what the presentation doesn't), even generalizing them to seemingly "far out", yet apparently entirely feasible, ideas like 'Holographic Teleports' (think VR on steroids).


The talk starts at around 42:32 in the fourth video.


Related question: why does Netflix provide a so much better video streaming service than a short-distance VoIP session? (I say short-distance because Netflix obviously uses a content distribution network of some kind; not taking this into account would make the comparison unfair).


Because having a VoIP call is like producing a Netflix movie 0.1 seconds before screening that movie.

E.g., you can't buffer 5 seconds.


Yes, this is probably the explanation.


Because video streaming is one-way, and not real-time, in that the entire video file exists already. These factors make Netflix streaming a very different problem from that of two-way, realtime communication, where you send and receive data and cannot ever buffer more than maybe 1-2 seconds.


The answers given are the obvious one: pre-made content can be buffered heavily.

But the internet can, at its core, manage packet delivery within the delays required for real-time communication.

More precisely, the issue is probably one of supply and demand: our internet is tuned for content to radiate out from large-scale producers, flowing down to its consumers. There's far less capacity and focus "laterally" in the network -- between consumers and other consumers -- as is required by VoIP sessions.


I heard that Netflix colocates servers in many ISPs' data centers.


Because Netflix has no real-time constraints.


Just curious if you've tried Zoom. I recently switched from Skype and/or Hangouts.


Not enough IPv4 addresses.


The only reason (original) Skype existed was that there were, even at that time, not enough IPv4 addresses, so users who had public IPv4 addresses tunneled data for users behind NAT. A small player had no chance to create any kind of p2p communication app, because they had no infrastructure like Skype's. If everybody had a public IPv4 address, it would be trivial, and also fast because of the direct connection. Video calls are hard, but codecs have nothing to do with it.


Not enough context


The core issue is not technical but rather human. In the 1960s, Bell Labs poured massive investment into trying to make video calls a reality. They solved virtually all the technical issues, in the 1960s, and had real systems deployed in NYC and other places. Here's why it never took off: people just don't like others seeing them, although they like to see others. This might feel like a funny little thing, but it ultimately led to pulling video handsets from the market. No amount of network effects and marketing helped its adoption.

This experiment has been repeated in many forms by different companies in different settings, with no real success. Everyone has solved the tech issues; it's just that screen resolution keeps increasing. The only areas where I can think of video calls having minor success are corporate group meetings and talking to kids/parents, but even in those cases people tend to be very selective about when to do video calls vs. voice-only calls.



