In that case you might, for example, need to slice the input into roughly 8-16 kB substrings for UTF-8 validation, to strike a reasonable balance between per-chunk loop setup costs and not spilling the L1 cache. You'd of course need to make sure every slice ends on a valid UTF-8 boundary.
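To make the boundary check concrete, here is a rough sketch in Python. It only illustrates the slicing logic, not the cache behaviour, which you'd want to measure in a lower-level language. The names `CHUNK_SIZE`, `utf8_chunk_end` and `is_valid_utf8_chunked` are made up for this example, and 16 kB is just one pick from the range above.

```python
CHUNK_SIZE = 16 * 1024  # assumed value from the 8-16 kB range; tune with a profiler


def utf8_chunk_end(buf: bytes, limit: int) -> int:
    """Return an end index <= limit that does not cut a multi-byte sequence."""
    if limit >= len(buf):
        return len(buf)
    end = limit
    # A UTF-8 sequence has at most 3 continuation bytes (0b10xxxxxx), so step
    # back at most 3 bytes until the next chunk would start on a non-continuation byte.
    for _ in range(3):
        if (buf[end] & 0xC0) != 0x80:
            return end
        end -= 1
    # Four continuation bytes in a row is invalid anyway: keep the original
    # limit and let the validator reject the chunk.
    return end if (buf[end] & 0xC0) != 0x80 else limit


def is_valid_utf8_chunked(buf: bytes, chunk_size: int = CHUNK_SIZE) -> bool:
    """Validate buf as UTF-8 one chunk at a time."""
    if chunk_size < 4:
        raise ValueError("chunk_size must be at least 4 bytes")
    start = 0
    while start < len(buf):
        end = utf8_chunk_end(buf, start + chunk_size)
        try:
            # Strict decode is used here purely as the validator.
            buf[start:end].decode("utf-8")
        except UnicodeDecodeError:
            return False
        start = end
    return True
```

Each chunk ends at most 3 bytes short of the nominal size, so the chunks stay inside the intended cache budget. In a real pipeline the decode call would be replaced by whatever vectorized validator you are composing with the other streaming steps.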
Composing streaming operations without sacrificing memory bandwidth or adding more overhead than the CPU likes is pretty cumbersome. You definitely don't want to spill the cache, but you also need to avoid overhead by making sure the vectorized code can take full advantage of the CPU.
Anyone who tries to achieve high performance in these kinds of operations without a profiler is doomed.