WebRTC is not the future of low-latency live streaming... at least not outside of video conferencing. It's incredibly complex as a specification, and it has limitations and numerous issues that cap how scalable it can be. Conversely, HTTP segment-based formats like HLS and DASH have limits, rooted in their design (and that of HTTP), on how low latency can actually go.
Where's the future? A likely candidate may come out of the "Media over QUIC" work in the IETF, which already has several straw man protocols in real world use (Meta's RUSH, Twitch's WARP). It'll be a few more years before we see a real successor, but whatever it is will likely be able to supersede both WebRTC and HLS/DASH where QUIC and/or WebTransport is available.
* Diverse users make the ecosystem rich. WebRTC supports conferencing, embedded devices, P2P/NAT traversal, remote control... Every group of users has made the ecosystem a little better.
* Client code is minimal. Most users just need to exchange Session Descriptions and they are done (see the sketch after this list); additional APIs are there if you need to change behaviors. Other streaming protocols expect you to put lots of code client side, which is a pretty big burden if you want to target lots of platforms.
* Lots of implementations. C, C++, Python, Go, TypeScript
* The new thing needs to be substantially better. I don't know what n is, but it isn't enough to just be a little better than WebRTC to replace it.
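To make "exchange Session Descriptions and you are done" concrete, here's a minimal browser-side sketch; the `sendToPeer`/`onMessageFromPeer` pair is a hypothetical stand-in for whatever signaling channel (WebSocket, REST, copy/paste) you already have, not part of any WebRTC API:

```ts
// Minimal offer-side sketch of a WebRTC session, assuming an out-of-band
// signaling channel supplied by the caller (hypothetical helpers, not WebRTC APIs).
async function startCall(
  localStream: MediaStream,
  sendToPeer: (msg: string) => void,
  onMessageFromPeer: (handler: (msg: string) => void) => void,
): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // Send our media and trickle ICE candidates to the other side.
  localStream.getTracks().forEach((track) => pc.addTrack(track, localStream));
  pc.onicecandidate = (e) => {
    if (e.candidate) sendToPeer(JSON.stringify({ candidate: e.candidate }));
  };

  // Apply whatever the other side sends us: an answer or a candidate.
  onMessageFromPeer(async (msg) => {
    const data = JSON.parse(msg);
    if (data.answer) await pc.setRemoteDescription(data.answer);
    if (data.candidate) await pc.addIceCandidate(data.candidate);
  });

  // Create and publish the offer; the browser handles the rest.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToPeer(JSON.stringify({ offer: pc.localDescription }));
  return pc;
}
```

The answering side is symmetric: apply the offer with setRemoteDescription, call createAnswer, and send that back over the same channel.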
> QUIC/WebTransport seems simple because it doesn't address all the things WebRTC does.
Partially agree here, but the design of QUIC (and WebTransport/TCPLS) makes some of the features in WebRTC unnecessary:
1. No need for STUN/TURN/ICE. With QUIC you can have the NATed party make an outbound request to a non-NATed party, then use QUIC channels to send/receive RTP between the sender and receiver (see the sketch after this list).
2. QUIC comes with encryption so you don't need to mess with DTLS/SRTP
3. Scaling QUIC channels is much more similar to scaling a stateless service than scaling something heavily stateful like a videobridge and should be easier to manage with modern orchestration tools.
4. For simple, 1:1 cases, QUIC needs a lot less signaling overhead than a WebRTC implementation. For other VC configurations, a streaming layer on QUIC will probably need to implement some form of signaling and will end up looking just like WebRTC signaling.
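As a rough illustration of point 1, here's a browser-side sketch of the dial-out pattern using the WebTransport API; the URL, the JSON "hello" message, and the idea of carrying media packets as datagrams are assumptions for illustration, not an existing protocol:

```ts
// Sketch of a NATed receiver dialing out to a publicly reachable QUIC endpoint.
// The outbound connection punches the NAT, so no STUN/TURN/ICE is involved.
async function connectOut(url: string): Promise<void> {
  const wt = new WebTransport(url);
  await wt.ready; // QUIC handshake already includes TLS, so no DTLS/SRTP setup

  // In-band "signaling": describe ourselves on the first bidirectional stream.
  const signaling = await wt.createBidirectionalStream();
  const writer = signaling.writable.getWriter();
  await writer.write(
    new TextEncoder().encode(JSON.stringify({ role: "receiver", codec: "vp8" })),
  );

  // Media could then arrive as datagrams (or on further streams).
  const dgReader = wt.datagrams.readable.getReader();
  for (;;) {
    const { value, done } = await dgReader.read();
    if (done || !value) break;
    // value is a Uint8Array carrying one media packet; hand it to a decoder here.
  }
}
```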
---
I just wish WebRTC wasn't so prescriptive of DTLS/SRTP. I'm often fiddling around with VC and video feeds on private networks (for example IPsec or an encrypted VPN like ZeroTier), and having to opt into the whole CA system there makes it a bit of a pain. There's also the fact that having the browser read from a video or voice source isn't always very low latency even if the DTLS/SRTP traffic is going as fast as the network allows, which hurts glass-to-glass latency, though as you indicated there are non-browser ways to use WebRTC and many language frameworks.
All-in-all small complaints for a good technology stack though.
ICE is needed when both parties are NATed; if one party were not NATed, we wouldn't need ICE in WebRTC either.
Agree on 2.
On 3: the videobridge needs state about who is in the session and who to forward to. That requirement doesn't go away with QUIC, unless you're thinking of the video streams as some kind of named resource or object.
I think the thing most people gripe about is SDP and its prescription of negotiation and encoding. I agree that capability negotiation can be vastly simplified, given that some of the capabilities can be inferred later in the session.
I would presume that the receiver will still make an outbound request to the QUIC endpoint to at least bring up the connection/stream, which should be enough to populate the path in NAT tables, no? It shouldn't be any more wasteful than the regular process, which receives packets out-of-band but still needs a signaling channel. This just does in-band signaling: you bring up the connection, perform signaling, then open a new QUIC stream to receive data.
You're right. I thought your comment was saying that in current implementations (not in the context of QUIC), the NATed peer wouldn't need STUN anyway, as of today. I had lost the context of it referring to a hypothetical implementation over QUIC.
> I just wish WebRTC wasn't so prescriptive of DTLS/SRTP.
There was a webrtc-webtransport spec, but it got renamed/retasked to p2p-webtransport[1]. It got renamed/rebuilt ~1 year ago[2]. Feels like a pretty strong indicator of WebRTC being deconstructed, but who's to say whether this goes anywhere. We'd also need WebCodecs.
It's somewhat scary & also somewhat exciting thinking of the one good, working, browser-supported standard being ripped into pieces (p2p-webtransport, WebCodecs, more) & being user-implemented. Having the browsers & servers share a well-known target is great but also perhaps confining. If we leave it up to each site/library to DIY their solution and figure out how to balance the p2p feeds, it'll be a long long time before the Rest of the World (other than the very big few) has reasonable tech again. WebRTC is quite capable & a nice even playing field, with lots of well-known rules to enable creative interoperation. We'd be throwing away a lot. I'd hoped for webrtc-webtransport, to at least keep some order & regularity, but that seems out at the moment. But WebRTC-NV is still ultra-formative; anything could happen.
The rest of the transport stack is also undergoing massive seismic shifts. I feel like we're in for a lot of years of running QUIC or HTTP/3 over WebRTC data channels and over WebTransport, so we can explore what the new capabilities make possible without having to ram each & every change through with the browser implementers. It feels like a less visible but far more massive Extensible Web Manifesto moment ("Browser vendors should provide new low-level capabilities that expose the possibilities of the underlying platform as closely as possible."), only at sub-HTML levels[3]. The browsers refused to let us play with HTTP Push and never let appdevs know realtime resources had been pushed at the browser, so we're still debating terrible WebSocket vs SSE choices; terrible. I think of gRPC-web & what an abomination that is, how sad & pointless that effort is; all because the browser is a mere glimmer of the underlying transport. I feel like a lot of experimentation & exploration is going to happen if we start exploring QUIC or HTTP/3 over WebTransport. Attempts to reimagine alternatives to WebRTC are also possible if we had specs like p2p-webtransport, or just did QUIC over data channels. Running modern protocols in the client, not the browser, seems like a semi-cursed future, but necessary, at least for a while, while we don't yet know what we could do. The browsers are super laggy, slow to expose capabilities.
Having attempted to use WebRTC as a generic video transport, I can say that WebRTC has insurmountable problems. The two biggest issues are:
1) Lack of client-side buffering. This is a benefit in real-time communication, but it limits your maximum bitrate to your maximum download speed. It also makes the stream incredibly sensitive to network blips.
2) Extremely expensive. To keep bitrate down, video codecs only send key frames every so often. When a new client starts consuming a video stream they need to notify the sender that a new key frame is needed. For a video call, this is fine because the sender is already transcoding their stream so inserting a key frame isn’t a big deal. For a static video, needing to transcode the entire thing in real time with dynamic key frames is expensive and unnecessary.
The WebRTC protocol doesn't dictate 1 or 2, although browsers do implement some of their own assumptions here. By default the client-side buffer can be on the order of 100s of milliseconds; this is, as you pointed out, tuned for real-time or live applications.
If you’re doing something like YouTube/Netflix and want to avoid going to a lower definition of the stream, that too can be tuned, albeit you’d want to use simulcast and implement your own player (to feed the video and audio frames for decoding at the pace you dictate).
None of these problems are specific to WebRTC. You'll run into them in a WebRTC implementation, you'll run into them with QUIC, and even with ffmpeg on the CLI you'll need to specify buffer sizes. As you mention, these are both problems with livestreaming, and the more you buffer, the less "live" your stream becomes. If you're interested in transmitting static videos, then why not go with HLS, or even just make the static file available for direct download through HTTP instead of using a live technology?
The buffer sizes in ffmpeg are more about ensuring that the calculated bitrate is accurate iirc than ensuring smooth streaming (although you need your bitrate enforced to guarantee smooth streaming).
IIRC (it's been a bit since I've configured this), you can specify both codec buffers and buffers for streaming to smooth out issues reading from the codec output. I could be wrong though.
1.) Why can't you buffer on the client side for WebRTC? That sounds like a client issue (what library were you using?) not the protocol.
2.) I use the same tactic as HLS: generate your video with a reasonable (~2 seconds) keyframe interval. When a new client connects, start sending at the keyframe (sketched below).
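A rough sketch of that tactic (not tied to any particular library): the relay keeps everything since the most recent keyframe and replays it to a joining viewer before forwarding live frames.

```ts
// "Start new viewers at the last keyframe" sketch. Frame source and transport
// are left abstract; only the buffering logic is shown.
interface EncodedFrame {
  keyframe: boolean;
  data: Uint8Array;
}

class KeyframeRelay {
  private gop: EncodedFrame[] = []; // frames since the last keyframe
  private subscribers = new Set<(f: EncodedFrame) => void>();

  // Called for every encoded frame from the source (keyframe every ~2s).
  onFrame(frame: EncodedFrame): void {
    if (frame.keyframe) this.gop = [frame]; // restart buffer at each keyframe
    else this.gop.push(frame);
    this.subscribers.forEach((send) => send(frame));
  }

  // A new viewer is first caught up from the keyframe, then follows live.
  addSubscriber(send: (f: EncodedFrame) => void): void {
    this.gop.forEach(send);
    this.subscribers.add(send);
  }
}
```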
1) I don't think WebRTC has one specific point. Lots of users came together with their use cases and it was designed by consensus. WebRTC can (and does) have toggles around latency/buffering.
2) I am not aware of a way you can have no keyframes but still be decodable at any time. I have just done it 'HLS style' or WebRTC 1:1. Curious if anyone else has different solutions.
1) WebRTC and RTP both have RT in their name. RT stands for real-time. If I recall correctly, the only buffer WebRTC has is the jitter buffer, which is used for packet ordering, not for ensuring that enough has buffered to handle bitrate spikes.
2) Yes, you either need a high keyframe interval or some type of out-of-band signaling framework to generate keyframes. WebRTC uses RTCP. A good question is why does WebRTC feel RTCP is necessary at all? Why not generate a keyframe every N seconds like you do with HLS and remove the complexity of RTCP entirely? The answer is that many clients cannot handle the bitrate at real-time speeds.
1) That is a specific implementation, and has nothing to do with the protocol, which certainly doesn't define a "jitter buffer". People routinely use RTMP--which also has RT in the name--to transfer content to streaming services with massive buffers at every step in the pipeline.
Most common browser implementations use an open GOP. That means an IFrame is inserted when needed: on scene change or when there's high motion.
Only naive implementations would burst an IFrame onto the network; most pace them. And if needed, you could split your IFrame across several frame intervals and decode it without creating a burst in bitrate.
Actually a lot of WebRTC implementations use a 1s or 2s GOP length. Again, it depends on how much control you have over your pipeline. Browser implementations do make some assumptions about the use case.
That is not what open GOP means. Open GOP means pictures can reference IDR frames other than the most recent one in decode order, and is a pain in the ass for various reasons, but is technically more efficient. You're referring to a dynamic GOP.
I don't know much about webrtc but I do have some security cameras, frigate and home assistant all working together with rtmp streams.
There are some webrtc solutions for getting those streams into home assistant with low latency but they are... I don't know the word. They aren't difficult to set up because the instructions are very simple; however, they don't work when I follow them and, from reading forums, that's not uncommon. I have _no_ idea why it doesn't work.
I don't really understand why I can't spin up a docker container that will take my rtmp streams and convert them to webrtc then hook that into home assistant.
I've gathered that webrtc just doesn't work that way but why can't it?
Heh, welcome to the world of livestreaming media. The reason why it's hard to create this kind of simple "stream in, stream out" abstraction is because most IP Voice/Video stacks are architected very differently than stateless net protocols that are popular today. IP streaming generally works by:
1. A signaling layer that helps setup the connection metadata (a layer where the sender can say they're the sender, that they'll be sending data to port n, that the data will be encoded using codec foo, etc)
2. Media streams that are opened based on the metadata transferred over the signaling layer that are usually just streams of encoded packets being pushed over the wire as fast as the media source and the network allows.
Most IP Media stacks (RTSP, RTMP, WebRTC, SIP, XMPP, Matrix, etc) follow this same pattern. This is different than "modern" protocols like HTTP where signaling is bound together with data using framing (e.g. HTTP headers for signaling vs the HTTP request/response body for data.) This design makes IP media stacks especially fragile to NAT connectivity issues and especially hard to proxy. There are typically good reasons this is done (due to latency, non-blocking reads, head-of-line blocking, etc) but these "good reasons" are becoming less good as innovations in lower networking layers (like QUIC or TCPLS) create conditions that make it much easier to organize IP Media in a manner more similar to HTTP. Hopefully one day you'll just be able to take IP Media streams and "convert" or "proxy" them from one format to another.
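To make the split concrete, here's a small illustrative sketch of roughly what lives on the signaling layer versus what flows on the media streams; the field names are made up, not taken from any specific protocol:

```ts
// 1. Exchanged once (or occasionally) over the signaling channel.
interface SessionDescription {
  senderId: string;                 // "I'm the sender"
  mediaPort: number;                // "I'll be sending data to port n"
  codec: "h264" | "vp8" | "opus";   // "the data will be encoded using codec foo"
  encryptionKeyId?: string;
}

// 2. Pushed continuously over the media transport, as fast as source + network allow.
interface MediaPacket {
  sequenceNumber: number; // for reordering / loss detection
  timestamp: number;      // media clock, not wall clock
  payload: Uint8Array;    // the encoded audio/video data
}

// The NAT/proxy pain comes from the receiver having to accept the traffic
// described by SessionDescription on a separate connection, rather than reading
// it off the same connection the signaling arrived on.
```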
All the listed protocols came after HTTP. RTSP, SIP borrowed heavily (albeit badly in retrospect) from HTTP.
I do not have all the historical context (early 90s), but for WebRTC the idea was to not define any new protocol(s) or do a clean-slate design, but rather to just agree on the flavors of the various protocols, and then to universally implement those. We already had SDP, RTSP, RTP, SAP, etc. And the idea was to cobble together the existing protocols into something everyone could agree on (the young companies, the old companies, etc.).
We ended up defining variations to the flavors that we already had and for the most part everything turned out okay (maybe the SDP plan wars did not end up where we wanted it, but… it was a good enough compromise).
For realtime media, if we are able to resolve the “locator:identifier” issue, we will be able to make media and signaling work in-band.
I know they came later, so I'm still confused why RTSP and SIP weren't implemented atop HTTP. I realize that RTSP and SIP can push server to client, but there are ways around that, though perhaps long polling and WebSockets weren't conceivable when RTSP and SIP were invented. I mean, in a pinch, I have an HTTP server serving a folder where SDP files are generated, and I've written clients that just look for a well-known SDP file and use that to consume an RTP stream. It's a ghetto form of "signaling" that I love using when doing experiments (not suitable for production for various reasons obvious to you, I imagine).
I'm not saying WebRTC had poor design decisions or anything. I think it was very smart for WebRTC to reuse SDP, RTP, etc. so the same libraries and pipelines could keep working with minimal changes. It also means very little new learning for folks familiar with the rest of the stack.
> For realtime media, if we are able to resolve the “locator:identifier” issue, we will be able to make media and signaling work in-band.
+1000. I think RTSP+TCP is a decent way to do in-band signaling and media, and RTMP defines strict ways to send both anyway.
To me, the whole typical IP Multimedia stack screams telco. They prefer to remove and reattach headers upon passing interfaces, separate control and data plane, and rely on synchronization for session integrity. Great when there's a phone line to HQ and a heavily metered satellite link to do a live, I guess...
WebRTC is used by phenixrts as the delivery from server to client. The promise of WebRTC was P2P direct connections for video/data transport, and server/client for coordination and fallback.
> The scalability of Phenix’s platform does not come from the protocol itself, but from the systems built and deployed to accept WebRTC connections and deliver content through them. Our platform is built to scale out horizontally. In order to serve millions of concurrent users subscribing to the same stream in a short period of time, resources need to be provisioned timely or be available upfront.
> With WebRTC, you can add real-time communication capabilities to your application that works on top of an open standard. It supports video, voice, and generic data to be sent between peers...
I agree that RTP over QUIC [1] is closer to what we'd build today if we were starting from scratch than WebRTC is. (Partly benefiting from the lessons learned getting to WebRTC 1.0, of course.)
It's worth noting that QUIC is also a very complex specification and is only going to get more complex as it continues through the standardization process. In parallel, there's ongoing work on the next generation of the WebRTC spec. [2] (WebRTC-NV also adds complexity. Nothing ever gets simpler.)
My guess is that we're at least three years away from being able to use anything other than HLS and WebRTC in production. And -- pessimistically, because I've worked on video for a long time and seen over and over that new stuff always takes _forever_ to bake and get adoption -- maybe that's going to be more like 10 years.
Media over QUIC is interesting. For RTP or peer-to-peer QUIC, there is more work to be done. But you will end up engineering many of the same things as the WebRTC suite of protocols (ICE -- STUN, TURN, multiplexing, etc).
QUIC and WebTransport can definitely already do DASH/HLS without some of the protocol complexity by using QUIC streams (but to use QUIC's underlying features, DASH/HLS need to change as well).
Live streaming was a motivating example for both of those, as you can tell from the video. And both of them grew out of our efforts to make WebRTC better for live streaming.
> "Where's the future? A likely candidate may come out of the "Media over QUIC" work in the IETF"
The "future" is going to be a goddamned UDP socket sending compressed media streams across the web. We've reached peak abstraction. We need to come back to first principles, instead of piling on more crap on-top of the browser.
Corporate firewalls will be blocking QUIC until the end of time. Anyone implementing streaming over QUIC will have to implement an HTTP/2 fallback, probably WebRTC, but maybe we will get something new.
In the case of QUIC, it is likely that the streaming would be over H3 (HTTP/3), i.e. HTTP over QUIC. They may fall back to H1 or H2, but typically, over a long enough time, firewall rules become more relaxed.
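A small sketch of what that fallback might look like client side, assuming a service that exposes both a WebTransport (HTTP/3) endpoint and a WebSocket endpoint at hypothetical URLs:

```ts
// Try QUIC/WebTransport first; if UDP/443 is blocked (or the API is missing),
// drop down to a TCP-based WebSocket transport.
type Transport =
  | { kind: "webtransport"; wt: WebTransport }
  | { kind: "websocket"; ws: WebSocket };

async function connectWithFallback(): Promise<Transport> {
  if ("WebTransport" in globalThis) {
    try {
      const wt = new WebTransport("https://media.example.com/session");
      await wt.ready;
      return { kind: "webtransport", wt };
    } catch {
      // QUIC blocked or handshake failed; fall through to TCP.
    }
  }
  const ws = new WebSocket("wss://media.example.com/session");
  await new Promise<void>((resolve, reject) => {
    ws.onopen = () => resolve();
    ws.onerror = () => reject(new Error("websocket failed"));
  });
  return { kind: "websocket", ws };
}
```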
We're using DASH and experimenting with LL-HLS to get reliable and affordable (using CDNs) end-to-end latency of 3 seconds, with which you can already interact live with your audience without too much trouble. Latencies down to about 1 second are also feasible already, at the cost of some client-side buffering. So I would have expected to see more information about DASH than HLS here. It's usable on all platforms except iPhones and Apple TV, where we're forced to use HLS at higher latencies for now and hopefully later also DASH or LL-HLS once that's reliably usable.
See: https://liveryvideo.com
I am a bit skeptical about "down to about 1 second" being achievable with DASH or LL-HLS reliably. Of course, I could be wrong. And a lot depends on the definitions of "about" and "reliably," as well as your user cohort (where they are in the world, etc). :-)
The reason I didn't write much about DASH is that the basic concepts are the same for both DASH and HLS. And my sense for the last couple of years has been that most of the momentum in the ecosystem was taking place around HLS/LL-HLS. But I could be wrong about that, too.
In some areas of the world where network and/or device performance is limited, such low latency will indeed be impossible to achieve. Inter-regional broadcasting also comes with a latency penalty, but with good CDNs like those provided by our partner Akamai you can already get 1-second latency in many cases. That being said, the buffers will be very small and we don't recommend using that yet in general.
HLS and DASH are similar indeed, but I think the main reason that HLS is still used a lot is that it's the lowest common denominator; you can get it to work everywhere. Perhaps in the future that will be LL-HLS, or something else entirely, but for now most really low latency broadcasting that I'm aware of is using DASH with HLS as a fallback (i.e: CMAF). But I could also be wrong about that of course :-)
> It's usable on all platforms except iPhones and Apple TV, where we're forced to use HLS at higher latencies for now and hopefully later also DASH or LL-HLS once that's reliably usable.
Interesting, since it was Apple who designed & built both the original HLS and the LL extensions. What's currently preventing LL from being usable there?
Their initially released specification depended on some relatively impractical services being provided by the CDN. In later revisions that was improved, but last I checked the major CDNs still didn't support it. I think that's because it still requires much more from the server side than DASH and struggles to reach comparably low latencies. In the meantime, people like us have started to work around it to get DASH support on iOS instead.
Have you tried LL-DASH? It's somewhat more available for usage in tools like ffmpeg. I'm looking to experiment more with it on my own streaming platform.
Low-latency HLS creates partial segments by bucketing 200ms of frames, instead of the 6s segments in standard HLS, whereas in WebRTC the endpoint sends each frame as soon as it is ready.
The apples-to-apples comparison here is 0ms (in WebRTC, no send-side buffering) vs 200ms (in low-latency HLS) or 6s (in standard HLS). This is independent of the latency of the endpoint from the CDN or source.
Another distinction is playback wait time, i.e., how quickly upon joining can it start rendering video.
I'm assuming the full reference picture (typically an IFrame or a golden frame, depending on the codec) in low-latency HLS is only available at the start of each 6s segment and not in partial segments. So when joining a live stream, the receiving endpoint would have to wait at most 6s before rendering.
Similarly in WebRTC, it's up to the system to generate a reference frame at regular intervals, as low as every second, or to do it reactively: a receiving endpoint can ask the sender to send a new reference picture. This is done via a Full Intra Request, and the wait time can be as quick as 1.5 times the round-trip time (as new codecs can generate a new IFrame instantaneously upon receiving a request). There's a slight CPU penalty for this, which means a sender getting too many Full Intra Requests will typically throttle its response to one per second.
So the apples-to-apples comparison for wait time would be up to 1s for WebRTC vs 6s for HLS.
You don't necessarily need to wait for a reference picture to start playback. Modern codecs all support "intra-refresh", which allows you to reconstruct a reference frame from a set of existing frames. With that, you can set periodic intra refresh much lower than 6s keyframe intervals.
An HLS segment can carry any number of GOPs. A GOP length of FPS/2 or FPS/4 will get you an I-frame pretty quickly allowing the GOP to be decoded. MPEG-DASH can do the same IIRC. So there doesn't need to be a segment length delay in playback and typically is not.
In addition to what vr000m said above, I'll just add that when you make HLS chunks smaller, you're reducing the leverage you get from HLS's core design decisions. I tried to cover some of this in the post.
One way to think about this intuitively is that HLS and WebRTC are opposite ends of one important trade-off axis.
HLS is about delivering media streams in a way that scales as cost-effectively as possible.
WebRTC is about delivering media frames at the lowest possible latency.
These are very different goals, and given current infrastructure and standards it's not possible to have your cake and eat it too, here. That may change in the future as low-latency video becomes more and more important. QUIC, for example, is a new approach to building out a full stack that works around some of the fundamental tradeoffs that exist today.
The result is that pushing HLS segments down to 200ms is not at all a clear win. We'll see what happens as HLS implementations improve. And I should say that my brain has been warped by working on UDP/RTP stuff for a long time. But my bet is that using 200ms HLS segments is, for most real-world users, going to make HLS worse in every way than WebRTC would be, for the same use cases. (That's definitely true today with the early implementations of LLHLS.)
I appreciated your post so thank you. I would be more interested in understanding why you don't see <1s HLS chunk sizes as working in most cases? I feel like p99 real world latency stuff would show some natural buffer sizes.
The smaller the chunk (file) sizes, the less benefit you're getting from pushing the chunks through a CDN. More requests will come back to your origin server. And there's a lot of complexity in the pipeline from encoder -> origin server -> CDN that's mostly hidden (which is good) until you hit big performance cliffs (which is bad).
This is something that I think most people trying to implement low latency HLS and DASH have struggled with. It's not only the connection to the client that can stall. Your CDN can stall internally, too, waiting on chunks. And, in fact, if your CDN never ever has any internal performance issues, that's probably an indication that it's configured in such a way as to make the costs of delivering video through the CDN pretty much the same as the costs of delivering that same data through cascading WebRTC media servers!
Also, the CDN -> client link is TCP, so you're giving up the ability to just drop packets. TCP is going to do its lossless/ordered thing for you, which again is great for most of what we do on the Internet but starts to actively work against you when you're trying to get down to very low latencies.
Does that make sense? I tried to cover some of this in the footnotes. Apologies for not doing a better job.
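To put rough numbers on the origin-load point above (with made-up but plausible figures, and ignoring mid-tier caching), shrinking segments multiplies the number of fresh objects every edge has to fetch:

```ts
// Back-of-the-envelope: each rendition produces one new object per segment
// duration, and every CDN edge with at least one viewer must fetch each new object.
const renditions = 5;
const edges = 100; // points of presence with at least one viewer (illustrative)

for (const segmentSeconds of [6, 2, 0.2]) {
  const originFetchesPerSecond = (renditions / segmentSeconds) * edges;
  console.log(`${segmentSeconds}s segments -> ~${Math.round(originFetchesPerSecond)} origin fetches/s`);
}
```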
This will be the winner purely because Apple devices are the lowest common denominator. That is, one must support iOS, and if iOS only supports LL-HLS while other devices optionally support LL-HLS, then LL-HLS is the winner.
This all looks like a lot of Very New Design Decisions. I tend not to be an early adopter on these, although it is mostly moot because I am no longer in the video delivery business. I have been considering streaming sets of short films, commercials, trailers, and movies to friends in a live stream via a pipeline of comparatively hoary old tech:
1) A playlist of multiple different video files in VLC as a source for ...
2) OBS Studio, which produces RTMP to be consumed by ...
3) nginx, which calls ffmpeg to produce multiple bitrates and resolutions to be rebundled as HLS to be sent to ...
4) a "live TV" channel of my own specification as an input to JellyFin, which can be read by ...
5) various clients on Roku, Apple TV, Firestick, Chromecast "apps"
At this point, I don't think the industry will ever really settle down to something manageable.
If your friends have access to VLC on all of their devices, and have some appetite for tech, why not use an RTSP server and just hand them RTSP URLs as "channels"? You don't get browser playback, but as long as everyone has VLC or an RTSP capable media player they can just open the stream on whatever device they have.
Oh gotcha. Ya, then your setup sounds great, and it looks like ffplayout will simplify a lot of the steps!
I just proposed that setup because that's what my partner and friends do to view livestreams of movies together or, specifically with my partner, our home security camera setup. We're all technical though and I don't offer the rest of my family this setup.
I somehow ended up as Technical Guy out of the people I know. It can be a burden.
Livestreaming is, well, "the space" exists but it is largely occupied by solutions where the Hollywood types figured this time they won't be caught on the back foot, so they're busy policing what is sent around. So the homegrown stuff is largely where it's at if you want to shove movies around, but then you're veering into "how technical are the recipients?" and it gets interesting.
I used to run a RealServer decades ago. Video is always ... interesting.
Why steps 2 and 3? I mean, I don't totally see the reason for not consuming the video files directly with ffmpeg, instead of going in such a roundabout way to bring the media into it.
Well, and understand I haven't yet attempted this, but apparently the discontinuities between switching files has caused a lot of problems downstream for Jellyfin, so OBS Studio would exist as kind of a way to smooth out the transitions.
Twitch latency in low-latency mode is ~2s (you can check for yourself in the settings cog menu -> advanced -> video stats) because of "prefetch segments", which are delivered via HTTP 1.1 streaming body responses. The client requests this segment and long polls while frames are delivered as they're being encoded. It's simpler and cheaper than WebRTC or Apple's low-latency HLS parts, and scales to hundreds of thousands of viewers. The downside is that it's not part of the HLS specification so support for it needs to be bespoke, but it is a proven technology that wasn't covered here (and to my knowledge, the most widely used low latency HTTP solution by # of users). Twitch has been using this for years.
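For anyone curious what consuming such a long-lived segment looks like client side, here's a minimal sketch using the standard fetch streaming-body API; the URL and the onBytes callback are placeholders, and this is not Twitch's actual player code:

```ts
// A single HTTP request whose body keeps streaming as the segment is encoded;
// fetch() exposes the body incrementally via a ReadableStream, so the player
// can feed bytes to the demuxer/decoder as they arrive.
async function consumePrefetchSegment(url: string, onBytes: (b: Uint8Array) => void) {
  const res = await fetch(url);
  if (!res.body) throw new Error("streaming body not supported");
  const reader = res.body.getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done || !value) break; // encoder finished this segment
    onBytes(value);            // hand the partial segment to the demuxer/decoder
  }
}
```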
The reason I didn't talk about it in the post is because the basic ideas that go into reducing the latency of any HLS/DASH implementation are the same. Smaller segment sizes, chunked transport, and smart encoding and prefetch/buffering implementations.
But ... while there's lots of terrific engineering underpinning the Twitch approach, there's no way around the real-world limits to that approach. As a result, on good network connections you'll usually get ~2s latency. But not so much on the long tail of network connections. If your use case can gracefully accommodate a range of latencies for the same shared session across different clients, that's fine. If it can't, it's not fine. Plus, I don't think you can get down very much below 2s.
(I don't have access to any Twitch/IVS internal data, so I don't know what latencies they see globally across their user base. But I've done a lot of testing of this kind of stuff in general.)
Agreed, all of the early HTTP-based streaming was about HTTP progressive download. Microsoft's Smooth Streaming, which was dominant in the early 2000s, used it too. Glad that Twitch has been using it.
For me, DASH and HLS are just manifests or playlist definitions, of how best to find the resource. It is not dissimilar to webrtc signalling, i.e., go to a resource to discover where to get other parts of the resource.
I'm a bit surprised to not see RTP mentioned anywhere. It was a major standard for real-time voice last time I was looking at these things, and it was starting to get some use for videoconferencing.
A future blog post will talk about all the networking work that went into optimizing for participation at this large scale.
WebRTC uses a few protocols. RTP is very central to it; ICE and SDP are also very important protocols for NATs/firewalls and capability selection. We use WebRTC as a shortcut to refer to the suite of protocols, and didn't call out RTP specifically for that reason.
RTP is the fundamental media transport protocol in WebRTC (which is not a protocol, but rather a suite of protocols working together in a defined way). Basically all web videoconferencing uses RTP already.
Okay it's funny because I had vaguely remembered something like that, but a "C-f rtp" on the wikipedia page for WebRTC didn't yield any hits, even though SIP (over websockets) was prominently mentioned, so I just figured the similarity in RTC and RTP had caused me to misremember.
I assume WebRTC includes STUN/TURN/ICE (negotiated over SIP?) then for traversing NATs? The last time I was really into networking was 2001-ish so that stuff was still around the corner, but I kept up with my reading for a few years after that. I also had some of these acronyms refreshed when setting up Jingle, which uses XMPP instead of SIP, but establishes an RTP connection much like traditional VOIP would use.
WebRTC doesn't prescribe a signalling mechanism at all. SIP is sometimes used, Jingle (XMPP) sometimes, or sometimes it's just a custom protocol exchanging SDPs (or equivalent structures) over WebSockets or a REST API.
WebRTC itself is RTP (DTLS-SRTP), ICE (incl. STUN/TURN), codecs & related parameters, capture mechanisms, all bundled up into a Web API.
Great article, thank you! Minor nit about footnote #10 (the very bad network failure mode): it really depends on where the bottleneck is and how the client has been implemented to react - there's a huge spectrum of client implementations out there, ranging from nearly dumb to those having wizard level heuristics.
If it's the user's last mile connection (between e.g. their home and their ISP), then the big HLS/DASH/etc buffer translates into a lot of time to react. So clients have the option to shift quite low - and do so quickly - if there are some very low bandwidth options, in theory even switching down to an audio-only or nearly-audio-only stream if one is provided, and can also choose to be optimistic/aggressive to resume playback as soon as one full chunk is downloaded - and some implementations will even resume playback when less than a full chunk is downloaded. The client side logic has a lot of latitude here to balance fast start/resume times vs sustaining playback.
When the bottleneck or failure is elsewhere, HLS can be incredibly durable. For extremely high profile events, for example, there are typically multiple CDNs involved, multiple sources going to independent encoders, etc. So an HLS/DASH client might talk to many different servers on a given CDN, as well as servers on alternate CDNs, and even grab what amount to being different copies of the stream spit out by different encoders. It's not uncommon for a client to be testing different CDN endpoints throughout playback to migrate away from congestion automatically.
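A minimal sketch of the rendition-switching part of that client-side latitude, assuming the bitrates come from the manifest and the throughput estimate comes from recent downloads; the 0.8 safety factor is an arbitrary illustration, and real players layer buffer-level logic on top:

```ts
// Pick the highest rendition that still leaves headroom; worst case we fall
// back to the lowest (possibly audio-only) rendition.
interface Rendition {
  bitrate: number; // bits per second, from the manifest
  audioOnly?: boolean;
}

function pickRendition(renditions: Rendition[], measuredBps: number): Rendition {
  const sorted = [...renditions].sort((a, b) => a.bitrate - b.bitrate);
  let choice = sorted[0]; // worst case: lowest/audio-only rendition
  for (const r of sorted) {
    if (r.bitrate <= measuredBps * 0.8) choice = r;
  }
  return choice;
}
```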
That's a really great point about multi-CDN architectures.
Partly because big parts of WebRTC are not standardized (session setup signaling, of course, but also in practice lots of necessary state management) it's a little bit hard to imagine how to build an equivalent for WebRTC.
Relatively recently, I would have said that our experience running large-scale WebRTC stuff in production made "core" infrastructure failure relatively low on our list of concerns. The two components of the last mile connection, on the other hand, are always a huge pain point because of the long tail of bad ISPs and bad Wifi setups.
However ... many of us who try to deliver always-available video services got something of a wakeup call in November and December last year when AWS had two pretty big outages two months in a row.
With each of these approaches, will any of them (or something else) also provide a way to get the stream saved on a remote server for viewing again later? I'm guessing I need some client which also connects to the stream and saves it, so it can then be served again. Or is there a solution already baked in?
From a technical perspective, that's more or less baked into HLS: these are just discrete files being served over HTTP, so you can save the files to disk along with the manifest file and you have the content.
In practice, there is a little more to it. For example, when DRM is enabled, you need some way to preserve the decryption keys. And for live content, the manifest file usually just tells the client about a sliding window of files, so you need a tiny bit of additional client side logic to pay attention to this fact.
One cool thing about DASH/HLS is that you can do some pretty complex mixing of content - you can build a traditional TV-channel like experience that mixes live and prerecorded content, you can replace and inject ads, you can make live content immediately available for on-demand playback, etc.
If the streaming server you're using has the ability to store the chunks generated by the HLS encoder, you can always generate a new playlist containing all the chunks for the video - or you can merge the chunks back into a single mp4 file.
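A very rough sketch of the "generate a new playlist containing all the chunks" idea, assuming the segments were already saved to disk and their durations recorded while the stream was live; real playlists (byte ranges, init segments, DRM keys) need more than this:

```ts
// Write a basic VOD media playlist referencing the saved segment files.
import { writeFile } from "node:fs/promises";

interface SavedSegment {
  filename: string; // e.g. a .ts segment already on disk (hypothetical name)
  duration: number; // seconds, as reported in the live playlist
}

async function writeVodPlaylist(segments: SavedSegment[], outPath: string) {
  const maxDuration = Math.ceil(Math.max(...segments.map((s) => s.duration)));
  const lines = [
    "#EXTM3U",
    "#EXT-X-VERSION:3",
    `#EXT-X-TARGETDURATION:${maxDuration}`,
    "#EXT-X-PLAYLIST-TYPE:VOD",
    ...segments.flatMap((s) => [`#EXTINF:${s.duration.toFixed(3)},`, s.filename]),
    "#EXT-X-ENDLIST",
  ];
  await writeFile(outPath, lines.join("\n") + "\n");
}
```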
Daily's API will let you record the stream as a MP4 video file stored on Amazon S3 where it's immediately available after the live stream ends. Everything happens automatically on the same server that encodes the live stream.
Pavlov's comment is correct. I came to add that soon the stream can be stored on the customer's own S3. Ergo, you'd be able to do a call in real time, store it in your S3 account, and make it available for streaming.
If we use Vimeo for video hosting would there be some way to automatically post to Vimeo or some other way to get the file up there and available for immediate viewing? Other approaches besides S3?
If Vimeo offers an ingest API using either RTMP(S) or HLS, that would be one way to get the stream from Daily directly to them without any extra processing step in between.
MBone was tunneled, and expanded to native IP multicast to save bandwidth in networks that supported it. Meanwhile, current networks use multicast (non-routed, granted) all the time for e.g. service discovery and lower-level stuff like finding the MAC address corresponding to an IP address.