It's super neat to see that desktop-class machines should be able to play 1080p AV1 fine with zero hardware support.
I think the lack of mention of GPUs in the post means the answer will be "no", but is this an area where open-source folks could realistically someday lean on the GPU for any help with decoding at all?
I see mentions of CPU/GPU "hybrid decoding" from GPU vendors, but I can imagine that might only be realistically possible with the lower-level access to the GPU that the vendor's own driver team has, not via the documented shader languages and APIs.
> I think the lack of mention of GPUs in the post means the answer will be "no", but is this an area where open-source folks could realistically someday lean on the GPU for any help with decoding at all?
Very, very hard to do with standard GPU APIs. You need GPU assembly to do great stuff, and that is rarely available or portable across GPUs.
Also, the issue is that, after SIMD, the things that are easy to parallelize (and therefore GPU-izable) account for only around 25% or 30% of the run time. That could offer some improvement, but not a 2x improvement.
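(Back-of-the-envelope Amdahl's-law bound with that 30% figure, treating the GPU-friendly part as arbitrarily fast:)

    S_{\max} \;=\; \frac{1}{(1 - p) + p/s} \;\le\; \frac{1}{1 - p} \;=\; \frac{1}{1 - 0.30} \;\approx\; 1.43

So even the ideal ceiling is around 1.4x, well short of 2x.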
Also, CPU <-> GPU memory transfers need to be avoided, on desktop and on mobile devices where memory access is not uniform, because they add a lot of I/O latency.
So, some things are doable, but a full "GPGPU decoder" is unlikely...
The motivating observation here is that I know of a few GPU vendors offering hybrid decoding for HEVC and VP9, but no hybrid decoders put together by the open-source community. (Counterexamples are interesting!)
Reasons a GPU vendor might be better able to do this sort of thing than an outsider who can sling OpenGL include: 1) some hybrid decoders are described as leaning partly on special-purpose video decoding hardware, which tends to be a black box to us, and 2) more-detailed understanding of and access to the details of the hardware might let you efficiently express something that's inefficient or awkward in just GLSL--in other words, same kind of reason people care about Metal/Vulkan vs. OpenGL or asm vs. C.
(The further down in the weeds I get the less sure I am of precise technical correctness, but a couple of concrete things that seem to make shaderizing decoding tricky are: 1) AV1 has a ton of control-flow-y elements--blocks can be split many different ways and be different sizes, and there are lots of prediction modes--and branchy code can be bad for shader efficiency, and 2) some things seem to block parallelism, e.g. for intra prediction you need the blocks you're predicting from before you can do predictions for the next block. And given the CPU-GPU transfer latency you can't ping-pong back and forth at will; you need large chunks that run well strictly on the GPU. Could be that pieces like the transforms and post-filtering can be cleanly separated into GPU steps, though.)
An efficient open-source AV1 decoder based just on OpenGL/GLSL would be great! But since it wasn't mentioned as an ambition in the post, community-written hybrid decoders seem rare, and we had an expert about AV1 decoders in the thread, it did not seem unreasonable to me to ask how realistic it was.
Though if you manage to write an open-source OpenGL-accelerated AV1 decoder, that would definitely answer my question and leave everyone happy. :)
Does dav1d support scalability, such as spatial scalability? Is it possible to decode only the 1920x1080 frames from a 3840x2160 video (if the video has been encoded with spatial scalability)?
It would be nice to be able to decode smaller frame dimensions with faster decoding time. That would be useful for viewing 4K material on computers which can't decode the full resolution.
The same goes for 10- and 12-bit videos - it would be nice to be able to decode an 8-bit version for 8-bit displays with faster decoding time.
Hi! This is really cool. I've been browsing the code and I wanted to ask, how difficult do you think it would be to port this to a system without pthreads? Can it be used on one thread?
Update: a more thorough look at the code quickly disabused me of this idea. Same as libaom...
Hi! You have 2 options: 1) write pthread emulation for your target system. We wrote one for Windows native threads, but others should be straightforward. 2) if you want thread-less, that's possible (single-threaded performance shows 1080p is easy, and on high-end systems even 4K single-threaded might be doable); it basically just involves putting the two functions in thread_task.c under #if HAVE_THREADS, along with any code calling pthread_() functions or using pthread_ types from <pthread.h>, and then enforcing that Dav1dSettings.n_{tile,frame}_threads is always 1 (so it never enters those codepaths). Then you always get single-threaded, (p)thread-less decoding.
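For illustration, here's a rough sketch of what option 2 looks like (assuming an autoconf-style HAVE_THREADS define; the worker-function names are placeholders, not necessarily the exact identifiers in thread_task.c):

    /* thread_task.c: only build the pthread worker entry points when
       threading is available. */
    #if HAVE_THREADS
    #include <pthread.h>

    void *frame_worker_task(void *const data) {
        /* worker loop: pthread_mutex_lock(), pthread_cond_wait(), decode... */
        return data;
    }

    void *tile_worker_task(void *const data) {
        /* same pattern for tile-level work */
        return data;
    }
    #endif /* HAVE_THREADS */

    /* In the settings validation, refuse anything but 1 thread so the
       threaded codepaths are never entered in a thread-less build. */
    #if !HAVE_THREADS
        if (s->n_frame_threads != 1 || s->n_tile_threads != 1)
            return -1;
    #endif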
Feel free to come on IRC, happy to help you dive into this, it's not very difficult.
Right now, 10-bit decoding is horribly slow because the assembly optimizations only cover 8-bit, so it's probably 10-20x slower. We'll work on 10-bit next, and in the end, I'd expect it to be 30-50% slower than 8-bit.
10/12-bit can usually be done together, but they are completely different from 8-bit. However, it's possible we'll do 10-bit first and later make the tiny adjustments that let the same code handle both 10-bit and 12-bit.
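(As a generic illustration of why 10- and 12-bit can usually share a code path -- this is not dav1d's actual internals, just the idea of parameterizing on the maximum sample value instead of hard-coding it:)

    #include <stdint.h>

    /* Clip a reconstructed sample to the valid range for the given bit depth.
       bitdepth_max is 1023 for 10-bit and 4095 for 12-bit, so the same
       function body serves both; only the 8-bit (uint8_t) path differs. */
    static inline uint16_t clip_pixel(const int v, const int bitdepth_max) {
        return (uint16_t)(v < 0 ? 0 : v > bitdepth_max ? bitdepth_max : v);
    }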
Realistically speaking, comparing an HEVC (x265) run and a dav1d run producing a video of similar quality but ~20% smaller, what is the difference in encoding time?
Congrats to everyone on the progress, and a huge thanks from me to all the devs who are working on this! Are there any performance comparisons with dav1d (AV1) vs ffvp9 (VP9)? I’m curious how expensive decoding AV1 is compared to VP9 (in software) (and I’m hoping someone else has already done the benchmarking so I won’t have to).
> Therefore, the VideoLAN, VLC and FFmpeg communities have started to work on a new decoder
Is there a need to separate VideoLAN and VLC?
Anyway, nice progress; I didn't expect such good results so soon.
My main question right now is what the slowest system is on which AV1 is still playable. I know that older-CPU and ARM optimizations are on the horizon ("On the other platforms, SSE and ARM assembly will follow very quickly, and we're already as fast on ARMv8."), but I'm curious if my Raspberry Pi/ODROID will ever be able to play 1080p AV1 videos.
Thousands of special-purpose, minimally-featured, embedded systems. You don't notice them because they are invisible, and they are invisible because they "just work". For high-enough volume products they have a decoder chip or section of a gate array, but most are low-volume and can barely afford the ROM for the code.
It has to be C because so many embedded-system vendors are pathologically hostile to anything else. Most tolerate C only to try to win ports from other, typically end-of-lifed, targets, and resent it.
A few have begun to embrace LLVM, and so don't care about the front-end language -- they still only say they support C, but turn out not to notice if you feed in IR from something else. Then it becomes a question of how badly your code needs the language runtime support code, or how good you are at porting it, because they will not pick up maintaining any of that under any circumstance. GC? Ha.