I think LLVM is smart enough to optimize regular `for i in 0..stuff.len()` loops into the same assembly as `for i in &stuff` loops in almost all cases. I imagine this sort of "we can tell that i is always less than len" optimization is a big contributor to the low cost of bounds checks. In some large portion of cases where they're not needed, the optimizer can already see that.
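To make that concrete, here's a minimal sketch (function names are mine, not from any particular codebase) of the kind of loop where the optimizer can prove the index is in bounds and elide the check:

```rust
// Indexed loop: LLVM can usually prove `i < stuff.len()` from the
// loop bound, so the bounds check on `stuff[i]` gets elided and the
// codegen ends up the same as the iterator version.
fn sum_indexed(stuff: &[u32]) -> u32 {
    let mut total = 0;
    for i in 0..stuff.len() {
        total += stuff[i]; // check provably redundant here
    }
    total
}

// Iterator loop: no index, so no bounds check to begin with.
fn sum_iter(stuff: &[u32]) -> u32 {
    let mut total = 0;
    for x in stuff {
        total += x;
    }
    total
}

fn main() {
    let v = [1, 2, 3, 4];
    assert_eq!(sum_indexed(&v), sum_iter(&v));
    println!("{}", sum_indexed(&v)); // prints 10
}
```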
IME loops with induction variables (integer indexes) often produce better codegen than iterators. Compare these two Rust functions for inverting bits: https://rust.godbolt.org/z/cE4vPdbdY
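I don't remember the exact code at that link, but the comparison is of roughly this shape (a sketch, with my own function names), one version indexed and one iterator-based:

```rust
// Index-based version: the optimizer sees an induction variable
// with a known trip count, which historically unrolled/vectorised
// more readily in some cases.
fn invert_indexed(bytes: &mut [u8]) {
    for i in 0..bytes.len() {
        bytes[i] = !bytes[i];
    }
}

// Iterator-based version: same semantics, expressed via iter_mut.
fn invert_iter(bytes: &mut [u8]) {
    for b in bytes.iter_mut() {
        *b = !*b;
    }
}

fn main() {
    let mut a = [0b1010_1010u8, 0xFF];
    let mut b = a;
    invert_indexed(&mut a);
    invert_iter(&mut b);
    assert_eq!(a, b); // both produce identical results
}
```

Both compile to the same behaviour; the comment's point is only about the quality of the generated assembly, which godbolt shows.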
This got improved in Rust 1.65 just this month, but the point stands.
Might be a matter of pass scheduling: at O2 the vectoriser isn't rerun after whatever pass manages to unroll everything, while at O3 it is.