Hacker News

Can the Mill not execute instructions ahead of others? Maybe I don't understand what it means to be out-of-order, but the Mill can decode up to 30 ops per cycle and issue many of them simultaneously.



The Mill is pipelined and superscalar, which are the key features to having multiple operations in-flight, and it is also speculative in that it will start feeding stuff into the pipeline before it's known whether those are definitely the right instructions to be executing.

Out-of-order execution means a CPU will re-order instructions on-chip to account for stalls due to cache misses or (on superscalar processors) due to a mismatch between the kinds of instructions and the kinds of execution units available. OoO execution requires analyzing instruction dependency information pretty early in the pipeline, and very quickly, to keep up a steady stream of ready-to-execute operations.
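A toy illustration of the difference (not how any real core is built; the instruction list and latencies are invented): an in-order machine stalls behind the first unfinished dependency, while an out-of-order machine issues any ready instruction.

```python
# Toy model: the same instruction stream issued in-order vs. out-of-order,
# one issue slot per cycle. Latencies and the program are made up.

# (name, inputs, latency_in_cycles)
PROGRAM = [
    ("a", [], 4),        # e.g. a load that misses in cache
    ("b", ["a"], 1),     # depends on the load
    ("c", [], 1),        # independent work
    ("d", [], 1),        # independent work
]

def run(program, out_of_order):
    cycle = 0
    finish = {}                      # name -> cycle the result becomes available
    pending = list(program)
    while pending:
        def ready(inst):
            _, deps, _ = inst
            return all(d in finish and finish[d] <= cycle for d in deps)
        if out_of_order:
            issuable = [i for i in pending if ready(i)]           # any ready op
        else:
            issuable = [pending[0]] if ready(pending[0]) else []  # oldest only
        if issuable:
            name, _, latency = issuable[0]
            pending.remove(issuable[0])
            finish[name] = cycle + latency
        cycle += 1                   # one issue slot per cycle (scalar machine)
    return max(finish.values())

print(run(PROGRAM, False))  # in-order: c and d wait behind the stalled b
print(run(PROGRAM, True))   # out-of-order: c and d issue while b waits
```

On this made-up program the in-order run finishes at cycle 7, the out-of-order run at cycle 5, because the independent ops fill the load-miss bubble.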

An in-order processor like the Mill only pays attention to instruction dependencies to determine whether a memory fetch has completed, and if not, to stall the pipeline until it has. The Mill compensates for that performance disadvantage by being statically scheduled: all the decisions about how to schedule the instructions are handled by the compiler toolchain, which has to know the exact instruction latencies and mix of available functional units on that model. An in-order processor that isn't statically scheduled could be superscalar, but it would be hard to get several functional units in use simultaneously. For example, the original Pentium was in-order but could issue up to two instructions simultaneously, with restrictions on which instructions could execute in which pipe, and if the two instructions in a pair didn't take the same amount of time, a third (and maybe fourth) instruction couldn't start until both had finished.
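Static scheduling can be sketched roughly like this (a greedy list scheduler with invented latencies, functional-unit counts, and ops; real toolchains are far more sophisticated): the compiler, knowing the target model exactly, packs operations into per-cycle issue groups.

```python
# Toy static scheduler: the "compiler" assigns each op to a cycle using
# latencies and functional-unit slot counts it must know exactly for the
# target model. All numbers here are invented for illustration.

LATENCY = {"load": 3, "mul": 2, "add": 1}
UNITS   = {"mem": 1, "alu": 2}          # issue slots per cycle by unit kind
KIND    = {"load": "mem", "mul": "alu", "add": "alu"}

# (name, opcode, inputs)
OPS = [
    ("x", "load", []),
    ("y", "load", []),
    ("p", "mul",  ["x", "y"]),
    ("q", "add",  ["p", "x"]),
]

def schedule(ops):
    """Greedy list scheduling: returns {cycle: [names of ops issued]}."""
    finish, placed, cycle, out = {}, set(), 0, {}
    while len(placed) < len(ops):
        slots = dict(UNITS)             # free issue slots this cycle
        for name, opcode, deps in ops:
            if name in placed:
                continue
            kind = KIND[opcode]
            # Issue if a slot of the right kind is free and inputs are ready.
            if slots[kind] > 0 and all(finish.get(d, 99) <= cycle for d in deps):
                slots[kind] -= 1
                placed.add(name)
                finish[name] = cycle + LATENCY[opcode]
                out.setdefault(cycle, []).append(name)
        cycle += 1
    return out

print(schedule(OPS))
```

With only one memory slot, the two loads go in consecutive cycles, and the dependent mul and add are placed exactly where the known latencies say their inputs arrive; no hardware dependency tracking is needed at run time.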


> ... all the decisions about how to schedule the instructions are handled by the compiler toolchain, which has to know the exact instruction latencies and mix of available functional units on that model.

That sounds pretty bad from a performance standpoint. You'd have to compile a binary for every single CPU to get the best performance. Maybe it could work with JIT languages running on the JVM or CLR.


Binaries would be distributed in a family-member-independent form, and then they would be "specialized" on each individual machine.
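A minimal sketch of what "specializing" could look like (the IR, model names, and latencies are all invented; this is just to show the shape of the idea): the same member-independent program is lowered with the latencies of one concrete family member, with stalls made explicit.

```python
# Sketch of "specialization": one member-independent program, lowered
# differently per model. Everything here is invented for illustration.

GENERIC = [("load", "r1"), ("mul", "r1", "r2"), ("add", "r2", "r3")]

MODELS = {
    "small": {"load": 5, "mul": 3},   # cheap member: slower ops
    "big":   {"load": 3, "mul": 1},   # high-end member: faster ops
}

def specialize(generic, model):
    """Rewrite generic ops for one family member, inserting explicit
    stall cycles (nops) where that member's latencies require them."""
    latencies = MODELS[model]
    out, ready_at, cycle = [], {}, 0
    for op, *args in generic:
        # Wait until all inputs are available on this member.
        need = max([ready_at.get(a, 0) for a in args], default=0)
        while cycle < need:
            out.append("nop")         # compiler-inserted stall
            cycle += 1
        out.append(f"{op} {', '.join(args)}")
        if args:
            # Convention in this toy IR: last argument is the destination.
            ready_at[args[-1]] = cycle + latencies.get(op, 1)
        cycle += 1
    return out

print(specialize(GENERIC, "small"))   # more nops: slower member
print(specialize(GENERIC, "big"))     # fewer nops: faster member
```

The distributed artifact is the generic form; the per-machine step is cheap because the hard scheduling decisions were already made, and only the model-specific numbers get plugged in.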


Wouldn't the individual latencies also depend on clock speed?

How would you have a CPU be able to clock-up and clock-down and still maintain good performance?


I'm unsure. When you increase RAM speed the timings are relaxed, and I would think that cache is the same.


That is, more or less, exactly what running on a VM is. You distribute Java bytecode or CIL.


You don't need a VM. You just push the last stage of compilation to the user, who can then compile to their architecture before starting execution. Optimal compilation involves some NP-complete problems, so depending on how good their heuristics are, this could cause a performance issue in loading code, or in running under-optimized code. However, these issues could be mitigated by moving compilation to install time.


Pushing a final compilation to the user is not a good idea - either it happens every time you run, or at install time. I have a lot of programs that are just executables and are never installed; having that overhead every time something runs is a waste. Install-time compilation means I can't expect reasonable performance if I upgrade the CPU, if it works at all.

So for any reasonable implementation we're looking at JIT with caching for running platform-independent code. The very definition of a process VM is "to execute computer programs in a platform-independent environment"[0].

[0]: https://en.wikipedia.org/wiki/Virtual_machine


You've really missed the point. The point is to lower power consumption and increase performance on the CPU level. Running a VM on top of x86 does not achieve that goal because you are still on x86.


I know very well that is the goal; I've been following this project for years now. My point still stands: you need to compile specifically for the CPU the code runs on, or you need to move the final compilation to the end user. The only way to achieve reasonable performance when distributing platform-independent code is by using a process VM.

I'm not saying it will not be faster, just that it is a weak point of the Mill architecture compared to x86.


The Mill isn't an “out-of-order” machine because it's not taking linear code and scheduling it to run out-of-order dynamically (i.e. when it executes). Instead, the compiler does the scheduling.



