The latency here is dominated by converting the 2D image into the 1D wavefront that gets fed into the device. That stage would involve some digital logic and relatively slow components with response times on the order of milliseconds.
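For concreteness, here's a minimal sketch of the digital side of that conversion, assuming a simple row-major raster scan and float32 samples (both assumptions; the actual ordering and bit depth would depend on the device's input modulator):

```python
import numpy as np

def image_to_wavefront(image: np.ndarray) -> np.ndarray:
    """Flatten a 2D frame into the 1D sample stream fed to the device.

    Row-major raster order and float32 samples are assumptions here; the
    real ordering and bit depth depend on the device's input hardware.
    """
    return np.ascontiguousarray(image, dtype=np.float32).ravel()

# A hypothetical 1024x1024 frame becomes a stream of ~1M samples.
frame = np.random.rand(1024, 1024)
wavefront = image_to_wavefront(frame)
assert wavefront.shape == (1024 * 1024,)
```

The flattening itself is just a memory copy; the milliseconds would presumably come from the slower components downstream of it.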
This is the usual pipeline problem, though: sometimes the bottleneck is the CPU, and sometimes it's memory bandwidth. This setup just puts the ball firmly in the memory-bandwidth court...
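A rough back-of-envelope makes the bandwidth framing concrete; the frame size and rate below are made-up, purely illustrative numbers:

```python
# Illustrative only: a 1024x1024 float32 frame streamed once per millisecond.
frame_bytes = 1024 * 1024 * 4      # ~4 MiB per frame (assumed size and dtype)
frames_per_second = 1_000          # one frame per millisecond (assumed rate)
required_bandwidth = frame_bytes * frames_per_second
print(f"~{required_bandwidth / 1e9:.1f} GB/s sustained")  # ~4.2 GB/s
```

Even a single stream at that (assumed) rate is a few GB/s, and every extra copy in the pipeline multiplies it.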
(You can have a hundred worker CPU cores doing the necessary conversions; you just need to manage the parallelization complexity. Then again, that's exactly what already happens when we feed data to hefty devices like GPUs and TPUs.)
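A sketch of that fan-out pattern, in the spirit of a data-loading pipeline sitting in front of a GPU or TPU; `upload_to_device` is a hypothetical stand-in for whatever the real driver call would be:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def image_to_wavefront(image: np.ndarray) -> np.ndarray:
    # Stand-in for the per-frame digital preprocessing: here just a
    # row-major flatten to float32.
    return np.ascontiguousarray(image, dtype=np.float32).ravel()

def upload_to_device(wavefront: np.ndarray) -> None:
    # Hypothetical driver call; a real feeder would reuse pre-allocated
    # buffers to avoid extra copies that eat memory bandwidth.
    pass

def feed_device(frames: list, workers: int = 8) -> None:
    """Fan per-frame conversions out across CPU cores, then stream the
    results to the device in order, the same shape as a GPU/TPU input
    pipeline."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so frames reach the device in
        # sequence even though conversions run concurrently on many cores.
        for wavefront in pool.map(image_to_wavefront, frames, chunksize=4):
            upload_to_device(wavefront)

if __name__ == "__main__":
    frames = [np.random.rand(1024, 1024) for _ in range(32)]
    feed_device(frames)
```

One caveat with this particular sketch: a process pool pickles each frame across the process boundary, which itself costs memory bandwidth; shared-memory buffers (or threads, since NumPy releases the GIL for the copy) would avoid that extra traffic.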