Prompted by pcwalton, I've added a tiny specialized iterator for zipping two vector slices; it is apparently simple enough to avoid the null pointer checks after all. This brings the runtime on my system down from ~3.1 seconds to ~2.06 seconds. Going from i64 to i32 brings it down further to ~1.03 seconds, beating your C++ version (thanks!), which runs in ~1.55 seconds on my system.
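For anyone wondering what such an iterator might look like, here is a rough sketch in current Rust syntax. It is not the code from the link. The names `SliceZip` and `squared_distance` are made up for illustration, and whether it changes the generated inner loop on a current compiler is something you'd have to check yourself.

```rust
/// Hypothetical specialized zip over two slices of the same element type.
/// The name and layout are assumptions for illustration only.
pub struct SliceZip<'a, T> {
    a: &'a [T],
    b: &'a [T],
    idx: usize,
    len: usize,
}

impl<'a, T> SliceZip<'a, T> {
    pub fn new(a: &'a [T], b: &'a [T]) -> Self {
        // Truncate both slices to the shorter length up front, so a single
        // index comparison in `next` covers both of them.
        let len = a.len().min(b.len());
        SliceZip { a: &a[..len], b: &b[..len], idx: 0, len }
    }
}

impl<'a, T> Iterator for SliceZip<'a, T> {
    type Item = (&'a T, &'a T);

    fn next(&mut self) -> Option<Self::Item> {
        if self.idx < self.len {
            let i = self.idx;
            self.idx += 1;
            Some((&self.a[i], &self.b[i]))
        } else {
            None
        }
    }
}

/// Hypothetical distance kernel: sum of squared element differences.
pub fn squared_distance(a: &[i32], b: &[i32]) -> i32 {
    SliceZip::new(a, b)
        .map(|(x, y)| {
            let d = x - y;
            d * d
        })
        .sum()
}
```

The idea is that there is only one end-of-iteration test per step, instead of one for each of the two underlying slice iterators.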
I've looked at the output of both inner loops, and it seems I put too much faith in std::valarray's expression templates. Changing the distance function to
results in the exact same inner loop as Rust. The remaining speed difference comes from the file loading code, which admittedly is crap (way too many memory allocations). So for all intents and purposes, I admit defeat.
Possibly. There's been some talk in that direction, but also some "can't we just fix LLVM?" objections. I guess someone needs to write up a pull request (and maybe see what other iterator traits to implement).
Code at http://ix.io/cUd