If you break videos up into short chunks, you could simply encode those chunks on demand into the perfect encoding for the requester.
The advantages are:
- You don't waste CPU encoding video into formats that won't be used.
- You can use a standard caching solution to reuse those chunks.
- Everyone gets the perfect encoding, always.
- If most people watch the first 2 minutes and then give up on a 20 minute video, you don't waste CPU encoding the other 18 minutes.
- You can introduce new encodings instantly for all videos, without going back and re-encoding historical videos.
- You don't waste storage on video chunks in formats that will never be used.
- It's really simple.
The disadvantages would be:
- Load is less predictable if a lot of people start watching (different) videos at once, although the common case of a lot of people watching the same video is still fine. You could "cap" the load by switching to a fallback format to avoid becoming overloaded.
- There might be an initial delay when playing or seeking within a video whilst the first chunk is encoded. On the other hand, it can't get much worse than it is already, and you could make sure these initial chunks are prioritised, or else serve them via a fallback format.
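The request path described above can be sketched in a few lines. Everything here is a hypothetical stand-in: `encode_chunk` represents the real transcoder, and the in-process `lru_cache` represents a real caching layer or CDN:

```python
from functools import lru_cache

def encode_chunk(video_id: str, chunk_idx: int, fmt: str) -> bytes:
    # Stand-in for the real on-demand transcode (e.g. an ffmpeg job).
    return f"{video_id}/{chunk_idx}/{fmt}".encode()

@lru_cache(maxsize=4096)  # stand-in for a standard caching layer / CDN
def get_chunk(video_id: str, chunk_idx: int, fmt: str) -> bytes:
    # Cache hit: serve the already-encoded chunk. Miss: encode on demand.
    return encode_chunk(video_id, chunk_idx, fmt)
```

Chunks nobody requests are never encoded, and a popular chunk is encoded at most once per format.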
Have you ever written a stateless transcoder like this? Of course it can be done, but saying “you could simply encode those chunks” and “It's really simple” is pretty misleading, especially if you are changing frame rates, sample rates, or audio codecs during the encoding process.
That said, if there is anyone who could do this at scale, it would be Facebook.
Also, this would mess up ABR streaming, at least for the first people to watch the video, which would not really guarantee “the perfect encoding, always”.
I have written such a transcoder [0] and while it is definitely not "simple," it has definitely never been easier to achieve than today.
If the input video source has been prepared properly (i.e., constant framerate, truly compliant VBR/ABR, fixed-GOP), or if your input is a raw/Y4M, then segmenting each GOP into its own x264 bytestream is rather trivial.
If the input is not prepared for immediate segmentation, it is also fairly easy now to fix this before segmenting for processing. Using hardware acceleration, a transcoder could decode the non-conforming input to Y4M (yuv4mpegpipe) or FFV1, which can then be re-encoded with a proper GOP structure.
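As a concrete sketch of that preparation step, here is one way to build an ffmpeg command line that forces a constant framerate and a fixed GOP with scene-cut keyframes disabled. The function name and defaults are mine; the flags are standard ffmpeg/x264 options:

```python
def normalize_cmd(src: str, dst: str, fps: int = 30, gop: int = 60) -> list[str]:
    """Build an ffmpeg invocation that re-encodes `src` to a constant
    framerate with fixed GOPs, so every future segment boundary lands
    on a predictable keyframe."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"fps={fps}",      # constant framerate
        "-c:v", "libx264",
        "-g", str(gop),           # maximum GOP length...
        "-keyint_min", str(gop),  # ...and minimum, i.e. a fixed GOP
        "-sc_threshold", "0",     # no extra keyframes at scene cuts
        dst,
    ]
```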
It's not that simple, especially if you deal with videos that have open GOPs and multiple B-frames per GOP; encoders don't do such a great job in those cases. Breaking videos up into short chunks is also easier said than done: you need to understand where it makes sense to split the video, make sure the I-frames are aligned, and generally try to keep a consistent segment size. With very dynamic videos encoded by different types of source encoders, this can result in very inconsistent encoder performance across the chunks. For that reason, it's always best to have some sort of two/three-pass encoding where the analysis step is integral, and the actual split & encode is then performed based on it. Which, of course, does not work for low-latency live-streaming scenarios.
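The analyse-then-encode split mentioned here is essentially what x264's two-pass mode does. A sketch of building the two invocations (the function name and bitrate are illustrative; the flags are real ffmpeg options):

```python
def two_pass_cmds(src: str, dst: str, bitrate: str = "2M"):
    # Pass 1 analyses the whole input and writes a stats file; pass 2
    # encodes using those stats, so bit allocation stays consistent
    # across the entire file rather than varying chunk by chunk.
    analyze = ["ffmpeg", "-y", "-i", src, "-c:v", "libx264",
               "-b:v", bitrate, "-pass", "1", "-an", "-f", "null", "/dev/null"]
    encode = ["ffmpeg", "-i", src, "-c:v", "libx264",
              "-b:v", bitrate, "-pass", "2", dst]
    return analyze, encode
```

Note that pass 1 must see the whole input before pass 2 can start, which is exactly why this scheme breaks down for low-latency live streaming.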
That's not always reasonable, though. If I upload a video at 4K, there needs to be some "baseline" encoding so that when the video is published, there's something playable without streaming 4K to, say, cell phones with a resolution smaller than 4K.
Even then, chewing that 4K file down into a 1080p video "on demand" for a desktop user on high-speed internet is no small task. First and foremost, you need to assume concurrency: if two people request the video at the same time, there's a complex coordination problem in a large distributed system to ensure the video is encoded once (or a very small number of times). You also need to do the encoding _faster than the video can be played_, or at least faster than the baseline/fallback version can be retrieved and sent. And you need to queue up the next chunk(s) of video, so viewers aren't watching a chunk, buffering, watching a chunk, etc.
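That "encode once under concurrency" requirement is the classic single-flight pattern. A minimal single-process sketch (the class is mine; a real deployment would need this distributed, with leases and failure handling):

```python
import threading

class SingleFlight:
    """Deduplicate concurrent requests for the same chunk: one caller
    runs the expensive encode, everyone else waits for its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> Event for the in-progress encode
        self._results = {}    # key -> finished result

    def do(self, key, fn):
        with self._lock:
            if key in self._results:          # already encoded: serve it
                return self._results[key]
            ev = self._inflight.get(key)
            if ev is None:                    # first requester: become leader
                ev = threading.Event()
                self._inflight[key] = ev
                leader = True
            else:                             # encode in progress: follow
                leader = False
        if leader:
            result = fn()                     # run the encode exactly once
            with self._lock:
                self._results[key] = result
                del self._inflight[key]
            ev.set()                          # wake the followers
            return result
        ev.wait()
        with self._lock:
            return self._results[key]
```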
In a system at the scale of FB, it's not smooth sailing for compute jobs like this: you're subject to network latency, noisy neighbors, failures (disk/network/software/power/etc.). The case where you're able to stream the original file from storage, start encoding it and streaming the output back to storage and to everyone around the world who is requesting it _at that moment_, and coordinating the encoding of the next chunk is actually not very likely.
Want to talk about weird failure modes?
- I start watching your video and click all over the seek bar. Am I DoSing your compute cluster?
- Two users on opposite sides of the world request the same location of the same file with the same fidelity. Does one of those users get a dirt-slow experience, or do I double my compute costs?
- A thousand users start watching the same video at roughly the same time. A software bug causes the encoder(s) to crash. Do 1000 users suddenly have a broken experience, or does the video pause while your coordination software realizes there was a failure, releases the lock, and restarts that encoding job from the top while your users all get in line for the new job?
I'd argue that this is the _least simple_ approach. In the happy path, you get a nice outcome while reducing compute cost, and users get high-quality video. In the unhappy path, users get slow loading from slow encoding instead of reduced resolution, or you start to need to trade performance for compute (do you encode twice in two datacenters, or move compute further from the viewer?).
As other people have pointed out, video encoding isn't stateless. You can chunk and encode in parallel, but there are tradeoffs.
The biggest problem is that you've turned what is effectively a CDN problem (serve the first chunk of a video) into both a CDN problem and a CPU-scheduling problem.
Serving video is cheap, apart from the bandwidth. So anything that reduces the number of bytes transferred yields savings. Real time encoders are not as efficient as "slow" encoders.
For low-volume videos (i.e., 99.5% of all video) the biggest cost is storage, so storing things in a high-quality codec or, worse still, the original codec makes storage expensive. Not only that, you still have to transcode on the way in, or else support every codec ever made, in real time.
In short: yes, for some applications this approach might work, but for Facebook or YouTube it won't.
It seems that this process will be incredibly stateful in the encoder component.
Most codecs targeted at low-bandwidth mobile streaming track scene changes, and if you make chunks in a naive way (split at I-frame borders and encode them independently of each other), the reassembled final video will look choppy due to broken scene-change relations.
So after encoding each chunk, you will have to carefully save the relevant parts of the encoder state and reuse them for the next chunk. Seems doable, but tricky to get right.
This is kind of simplistic. For instance, if ‘most people’ only watch the first 2 minutes of a 20-minute video, you still have to encode all of it for the minority that does watch the whole thing. Also consider that very large groups of people use very similar hardware and connections.
Anyway, of course videos are already chopped into chunks that are stored separately. It’s much easier to distribute and cache these independent chunks. On demand encoding doesn’t change that.