There is something that I can't get to add up here. The phasing claims that there are only 3 pipeline stages compared to 5 in the textbook RISC architecture or 14-16 in a conventional Intel processor, but this can't possibly add up with the 4 cycle division or the 5 cycle mis-predict penalty.
The phase says when the op issues. It takes some number of cycles before it retires. So a divide issues in the "op phase" in the second cycle, and if on the particular Mill model it takes 4 cycles then it retires on the fifth.
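To make the arithmetic concrete, here is a minimal sketch. It assumes the issue cycle counts as the first cycle of execution, which is the convention that matches the numbers above. The function and constant names are made up for illustration and are not part of any Mill tooling.

```python
def retire_cycle(issue_cycle: int, latency: int) -> int:
    """Cycle in which an op retires.

    Assumption: the issue cycle is the first of the op's `latency`
    cycles, so an op issued in cycle 2 with latency 4 occupies
    cycles 2, 3, 4 and 5.
    """
    return issue_cycle + latency - 1


OP_PHASE_CYCLE = 2   # the "op phase" issues in the second cycle (from the post)
DIVIDE_LATENCY = 4   # divide latency on this particular Mill model (from the post)

print(retire_cycle(OP_PHASE_CYCLE, DIVIDE_LATENCY))  # 5
```

The phase fixes only `issue_cycle`. The latency is a separate property of the op on a given model.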
If there is a mispredict, there is a stall while the correct instruction is fetched from the instruction L1 cache. If you are unlucky, it's not there and you need to wait longer.
OK, so the phases aren't an apples-to-apples comparison to the traditional pipeline stage, but more in line with the TI C6x fetch, decode, execute pipeline, which for TI covers something like 4 fetch stages, 2 decode stages and between 1 and 5 execute stages. Thank you for the clarification.
What am I getting wrong?