Yup. For comparison, hashing the same file as above with all 32 threads on my system takes about 45 ms with b3sum allowed to run fully parallel. It does hit diminishing returns fairly quickly, though: unsurprisingly there's no perf improvement from hyperthreading, but the gains also drop off pretty fast as you add more cores.
b3sum --num-threads 16 /tmp/rand_1G ran
  1.01 ± 0.02 times faster than b3sum --num-threads 15 /tmp/rand_1G
  1.01 ± 0.02 times faster than b3sum --num-threads 14 /tmp/rand_1G
  1.03 ± 0.02 times faster than b3sum --num-threads 13 /tmp/rand_1G
  1.04 ± 0.02 times faster than b3sum --num-threads 12 /tmp/rand_1G
  1.07 ± 0.02 times faster than b3sum --num-threads 11 /tmp/rand_1G
  1.10 ± 0.02 times faster than b3sum --num-threads 10 /tmp/rand_1G
  1.13 ± 0.02 times faster than b3sum --num-threads 9 /tmp/rand_1G
  1.20 ± 0.03 times faster than b3sum --num-threads 8 /tmp/rand_1G
  1.27 ± 0.03 times faster than b3sum --num-threads 7 /tmp/rand_1G
  1.37 ± 0.02 times faster than b3sum --num-threads 6 /tmp/rand_1G
  1.53 ± 0.05 times faster than b3sum --num-threads 5 /tmp/rand_1G
  1.72 ± 0.03 times faster than b3sum --num-threads 4 /tmp/rand_1G
  2.10 ± 0.04 times faster than b3sum --num-threads 3 /tmp/rand_1G
  2.84 ± 0.06 times faster than b3sum --num-threads 2 /tmp/rand_1G
  5.03 ± 0.12 times faster than b3sum --num-threads 1 /tmp/rand_1G
(Runs with more than 16 threads are elided, since they all take about the same time as the 16-thread run.)
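To make the drop-off easier to see, here's a rough Python sketch that turns the ratios above into speedup over one thread and per-thread efficiency. The numbers are copied from the summary; the function names are just made up for illustration, and the ± error bars are ignored, so treat the results as approximate.

```python
# Ratios copied from the summary above: how many times faster the 16-thread
# run was than the N-thread run, i.e. time_N / time_16. The 16-thread entry
# is 1.00 by definition.
RATIO_VS_16 = {
    1: 5.03, 2: 2.84, 3: 2.10, 4: 1.72, 5: 1.53, 6: 1.37, 7: 1.27, 8: 1.20,
    9: 1.13, 10: 1.10, 11: 1.07, 12: 1.04, 13: 1.03, 14: 1.01, 15: 1.01,
    16: 1.00,
}


def speedup_over_one_thread(ratios):
    """Return {threads: time_1 / time_N}, derived from ratios against 16 threads."""
    base = ratios[1]  # time_1 / time_16
    return {n: base / r for n, r in sorted(ratios.items())}


def parallel_efficiency(ratios):
    """Return {threads: speedup / threads}; 1.0 would be perfect linear scaling."""
    return {n: s / n for n, s in speedup_over_one_thread(ratios).items()}


if __name__ == "__main__":
    speedups = speedup_over_one_thread(RATIO_VS_16)
    efficiency = parallel_efficiency(RATIO_VS_16)
    for n in speedups:
        print(f"{n:2d} threads: {speedups[n]:.2f}x speedup, "
              f"{efficiency[n]:.0%} efficiency")
```

Going by these ratios, efficiency is already under 90% at 2 threads and only a bit over 30% at 16, which is the same diminishing-returns picture in a different form.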
Figure 4 from https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak... is a related benchmark. On that particular machine we saw good scaling up to 16 threads, but yeah, somewhere in that neighborhood you start to run into memory bandwidth limits. Which is the problem you want to have, I guess :)
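As a back-of-envelope check on the memory bandwidth point, the timings quoted upthread can be converted into throughput. This sketch assumes /tmp/rand_1G is exactly 1 GiB (the thread only gives the file name) and that the 16-thread time is close to the ~45 ms all-threads time, as the note about elided runs suggests. The helper names are hypothetical.

```python
# Assumption: the file is exactly 1 GiB; only its name is given in the thread.
FILE_SIZE_BYTES = 1 << 30
# Reported time with all threads; runs above 16 threads were said to be
# about the same as the 16-thread run.
ALL_THREADS_SECONDS = 0.045
# From the summary: the 16-thread run was 5.03x faster than the 1-thread run.
SPEEDUP_16_OVER_1 = 5.03


def throughput_gb_per_s(size_bytes, seconds):
    """Throughput in decimal gigabytes per second (1 GB = 1e9 bytes)."""
    return size_bytes / seconds / 1e9


def estimated_single_thread_seconds(parallel_seconds, speedup):
    """Estimate the 1-thread time from the parallel time and the speedup ratio."""
    return parallel_seconds * speedup


if __name__ == "__main__":
    parallel = throughput_gb_per_s(FILE_SIZE_BYTES, ALL_THREADS_SECONDS)
    single_seconds = estimated_single_thread_seconds(
        ALL_THREADS_SECONDS, SPEEDUP_16_OVER_1
    )
    single = throughput_gb_per_s(FILE_SIZE_BYTES, single_seconds)
    print(f"all threads: ~{parallel:.1f} GB/s")
    print(f"one thread:  ~{single:.1f} GB/s (estimated)")
```

Under those assumptions the fully parallel run works out to roughly 24 GB/s, which is at least in the range where memory bandwidth could plausibly become the limit on a typical machine.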