It's not only a matter of bandwidth, but also of CPU utilization. I've tried to feed screenshots to ffmpeg and other tools, and it's just... unusable. It works, but consumes far too many resources. At least on my computer (MacBook 13-inch, 2019).
So on one side you have CPU utilization, on the other - network. Network is cheap, but encoding is expensive. That's my thinking, at least. I don't have proof, only local experiments, but measuring this properly is a good place to start.
I also have other ideas in mind on how to scan the screen and send only the parts of the screen that have been updated. If I send only half of the screen, it will probably beat video encoding in terms of network. The diff algo has to be very fast though, since at 1280x720 we're dealing with 1280x720 = 921,600 pixels, and 921,600 * 4 bytes ≈ 3.7 MB to process per frame - at 60 fps that's over 200 MB/s to scan.
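To get a rough feel for it, here's a crude sketch of the tile-diff idea, assuming ImageMagick's convert/compare are installed; prev.png/curr.png and the 2x2 tiling are made-up examples. A real implementation would diff raw pixel buffers in memory, not PNG files on disk:

# split two consecutive 1280x720 frames into 640x360 tiles
convert prev.png -crop 640x360 +repage prev_tile_%d.png
convert curr.png -crop 640x360 +repage curr_tile_%d.png
# compare -metric AE prints the count of differing pixels to stderr;
# only tiles with a non-zero count would need to be sent
for i in 0 1 2 3; do
  changed=$(compare -metric AE prev_tile_$i.png curr_tile_$i.png null: 2>&1)
  [ "$changed" != "0" ] && echo "tile $i changed ($changed px), send it"
done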
Also, curious to hear about video encoding efficiency vs creating 60 JPEGs a second. Is it comparable?
> I've tried to feed screenshots to ffmpeg and other tools, and it's just... unusable. It works, but consumes far too many resources.
Did you try using the hardware encoder? Modern computers have dedicated silicon to accelerate/offload video encode/decode. Your 2019 Mac has an Intel GPU with H.264 and HEVC hardware encoders, and it also has a T2 co-processor that can encode HEVC video.
If you don't pass a specific encoder (the ones with the _videotoolbox suffix on Mac) via -c:v, ffmpeg defaults to a software encoder, which burns CPU.
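If you're unsure what your build supports, ffmpeg can list its encoders; on a Mac build you should see the videotoolbox ones:

ffmpeg -hide_banner -encoders | grep videotoolbox
# typically lists h264_videotoolbox and hevc_videotoolbox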
> how to scan the screen and send only the parts of the screen that have been updated
You'll be reinventing video codecs with interframe compression.
> Also, curious to hear about video encoding efficiency vs creating 60 JPEGs a second. Is it comparable?
I see that you are comparing pixel by pixel for each image to dedupe, and also resizing the image to 1280px. The image then has to be encoded to JPEG. All of the above happens on the CPU. In essence you've implemented Motion JPEG. Below is a command to let you evaluate a more efficient ffmpeg setup.
# -f avfoundation is specific to mac; -c:v h264_videotoolbox selects the hw encoder
ffmpeg \
-f avfoundation -i "<screen device index>:<audio device index>" \
-c:v h264_videotoolbox output.mp4
Keep in mind most codecs can be tuned. Live encoding is a very different use case from encoding a video file you only need later. Most codecs have knobs you can turn to trade somewhat larger output for lower latency and CPU use.
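For example, x264's knobs for exactly this trade-off (input/output file names are placeholders; a live pipeline would use a capture input instead):

ffmpeg -i input.mp4 -c:v libx264 -preset ultrafast -tune zerolatency output.mp4
# ultrafast minimizes CPU per frame, zerolatency disables lookahead/B-frames;
# both make the output larger than the defaults would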
I co-built a similar screen sharing app (though with a web server in the middle seeing the traffic) many years ago. MPEG isn't a good fit. We tried JPEG but it didn't look great. Version 1 decomposed the screen into regions, looked at which regions changed, and transmitted PNGs.
The second version tried the VNC approach developed years ago at AT&T and open sourced. The open source implementation glitched just a bit too much for my liking. Various companies white-labelled VNC in those days; I'm not sure they all fed back their fixes. But the raw VNC protocol has a lot of good compression ideas specific to the screen sharing domain, documented in papers. People also tunneled VNC over SSH; I jerry-rigged an HTTPS tunnel of sorts.
After a while I started to suspect that if I wanted higher frame rates I should use a more modern, screen-sharing-oriented, Microsoft-based/specific codec. But it wasn't my skill set, so I never actually went down that route. I'd still recommend researching other screen-optimized lossless codecs, open or closed, so you don't reinvent the wheel on that side of things if you're serious about reducing bandwidth.
I use many screen sharing apps and none of them have this issue. They do lossy compression by reducing the number of colors (gradients become bands), not with JPEG-style discrete cosine transforms.
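If you want to see what that kind of color reduction looks like on screen content, ffmpeg's palette filters give a quick approximation (the 64-color count is just an example I picked):

# quantize a screen recording down to 64 colors: gradients band,
# but text edges stay sharp, unlike JPEG's DCT artifacts
ffmpeg -i screen.mp4 -vf "palettegen=max_colors=64" -y palette.png
ffmpeg -i screen.mp4 -i palette.png -lavfi paletteuse -y banded.gif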