If BLAKE2 (I assume you're using BLAKE2b) is your bottleneck, consider the BLAKE2bp (four-way parallel) or BLAKE2sp (eight-way parallel) variants. If even those are too slow for you (you have lots of continuous data and multiple cores available), you could try tree hashing on top of one of those BLAKE2 variants.
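To make the tree hashing idea concrete, here is a minimal two-level sketch in Python. It is not BLAKE2bp and not BLAKE2's built-in tree mode, so its digests will not match either; it is just a hand-rolled construction to show the shape. The names `tree_hash`, `_leaf_digest`, and `CHUNK_SIZE`, the 1 MiB leaf size, and the `person` tags are all my own assumptions. It relies on CPython's `hashlib` releasing the GIL when hashing large buffers, so the leaves can run on several cores from plain threads.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB leaves; hypothetical value, tune for your workload


def _leaf_digest(chunk) -> bytes:
    # Leaves are tagged with person=b"leaf" so a leaf digest can never be
    # mistaken for a root digest (domain separation).
    return hashlib.blake2b(chunk, person=b"leaf").digest()


def tree_hash(data: bytes, chunk_size: int = CHUNK_SIZE, workers: int = 4) -> str:
    """Two-level tree hash: hash chunks in parallel, then hash the leaf digests."""
    view = memoryview(data)  # slice without copying the input
    chunks = [view[i:i + chunk_size] for i in range(0, len(view), chunk_size)] or [b""]

    # pool.map preserves input order, so the leaves stay in chunk order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        leaves = list(pool.map(_leaf_digest, chunks))

    # The root commits to the total length and to every leaf digest in order.
    root = hashlib.blake2b(person=b"root")
    root.update(len(data).to_bytes(8, "little"))
    for leaf in leaves:
        root.update(leaf)
    return root.hexdigest()
```

Note that the result depends on the chunk size, so both sides have to agree on it. If you need something interoperable, the `fanout`, `depth`, `leaf_size`, `node_offset`, `node_depth`, `inner_size`, and `last_node` parameters of `hashlib.blake2b` expose BLAKE2's own tree mode.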
I'm curious though, what are you working on that saturates ~1 GiB/s/core?