Interesting, the architecture looks greatly simplified compared to even standard RISC (as opposed to, say, x86). Thanks to that simplification it should be power efficient while being inherently highly parallel.
Would be interesting to find out:
1. How high can that degree of parallelism be pushed? Are we talking tens or hundreds of pipelines?
2. What frequency will this operate at?
3. What about RAM? I saw nothing about memory, and with lots of pipelines it is bound to become memory-bound.
Hi, I'm the author of that intro. The talks which Ivan has been giving - there are links in that intro - go into everything in much more detail. But here's a quick overview of your specific questions:
1: we manage to issue 33 operations / cycle. This is easily a world record :) The way we do this is covered in the Instruction Encoding talk. We could conceivably push it further, but it's diminishing returns. We can have lots of cores too.
2: it's process agnostic; the dial goes all the way up to 11
3: the on-chip cache is much quicker than conventional architectures as the TLB is not on the critical path, and we typically have ~25% fewer reads on general-purpose code due to backless memory and implicit zero. Main memory is conventional memory, though; if your algorithm is zig-zagging unpredictably through main memory, we can't magic that away.
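To make the backless memory / implicit zero point concrete, here is the kind of pattern it targets; a rough C sketch of my own, not anything from the Mill material, and how much it saves obviously depends on the workload:

    /* A conventional CPU has to execute the zeroing stores for this fresh
     * frame (and pull the lines into cache before writing them); reads of
     * the untouched tail then hit those lines. With implicit zero a new
     * stack frame simply reads as zero, and with backless memory the frame
     * need not have DRAM behind it at all until it is evicted.            */
    #include <stdio.h>

    static long sum_frame(void)
    {
        long buf[256] = {0};          /* ~2 KB frame, mostly never written */
        buf[0] = 1;
        buf[1] = 2;

        long s = 0;
        for (int i = 0; i < 256; i++) /* reads mostly "untouched" slots */
            s += buf[i];
        return s;
    }

    int main(void)
    {
        printf("%ld\n", sum_frame()); /* prints 3 */
        return 0;
    }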
>>: the on-chip cache is much quicker than conventional architectures as the TLB is not on the critical path
I would really like to know your reasoning for claiming the TLB is a major bottleneck in conventional CPUs. CPUs perform the TLB lookup in parallel with the cache lookup, so there is usually no added latency except on a TLB miss.
Basic research on in-memory databases suggests that eliminating the TLB would improve performance by only about 10%; that certainly isn't a typical use case, and most of the benefit can be obtained simply by using larger pages. So I don't really know where your claim of 25% fewer reads comes from in relation to simply getting rid of virtual memory.
Right, most modern caches use the virtual address to get the cache index and the physical address for the tag comparison[1]. Since on x86 the index bits fall within the page offset, and are therefore the same in the virtual and physical address, the L1 lookup can proceed in parallel with the TLB lookup; on other architectures like ARM you may need to finish the TLB step before the tag comparison[2].
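For the curious, here is the arithmetic behind that overlap with typical x86-ish L1 numbers (32 KiB, 8-way, 64-byte lines; illustrative only, not tied to any particular core):

    /* With 64 sets the set index is bits 6..11 of the address, entirely
     * inside the 4 KiB page offset, so it is identical in the virtual and
     * physical address and needs no translation. Only the tag compare
     * waits on the physical page number coming out of the TLB.           */
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS 6      /* 64-byte lines                   */
    #define SETS      64     /* 32 KiB / 8 ways / 64 B per line */
    #define PAGE_BITS 12     /* 4 KiB pages                     */

    int main(void)
    {
        uint64_t vaddr = 0x7f12345678d3ull;

        uint64_t set = (vaddr >> LINE_BITS) & (SETS - 1); /* untranslated bits */
        uint64_t vpn = vaddr >> PAGE_BITS;                /* goes to the TLB   */

        printf("set %llu picked without translation, vpn %#llx sent to TLB\n",
               (unsigned long long)set, (unsigned long long)vpn);
        return 0;
    }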
But while I think the Mill people are overselling the direct performance benefits here, the single address space lets them do a lot of other things, such as automatically backing up all sorts of state to the stack on a function call and handling any resulting page fault the same way it would be handled if it were the result of a store instruction. And I think their backless memory concept requires it too.
The reason the TLB is so fast is also that it is fairly small and thus misses fairly often. Moving the TLB so it sits before the DRAM means that you can have a 3-4 cycle TLB with thousands of entries.
Issuing 33 operations/cycle requires operand storage with (at least) 66 ports: 33 for reads and 33 for writes; otherwise the extra slots are just no-ops. For two-operand instructions the count goes to 99 = 33*3, and for three-operand instructions (think ternary operations) it goes to 132 ports.
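Spelling out that port arithmetic (just restating the numbers above for a monolithic register file at an issue width of 33):

    /* Ports needed for a monolithic operand store at an issue width of 33:
     * width * inputs_per_op read ports plus width write ports for results. */
    #include <stdio.h>

    int main(void)
    {
        const int width = 33;
        for (int inputs = 1; inputs <= 3; inputs++)
            printf("%d-input ops: %d read + %d write = %d ports\n",
                   inputs, width * inputs, width, width * (inputs + 1));
        return 0;
    }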
As far as I know, the Elbrus 3M managed to achieve about 18 instructions per clock cycle, with VLIW and a highly complex register file, whose design slowed the overall clock frequency to about 300MHz on a 0.9um process. To put that in perspective, a plain Leon2 managed about 450MHz in the same process, without any tweaks or hand work, and Leon2 is not a speed champion.
So the question is: is your world record in simulation, or in real hardware such as an FPGA?
Some kinds of code will benefit from this: long calculations and deeply nested procedures. But a lot of the hangups in consumer applications are in synchronization, kernel calls, copying and event handling.
I'd like to see an architecture address those somehow. E.g. virtualize hardware devices instead of writing kernel-mode drivers. Create instructions to synchronize hyperthreads instead of kernel calls (e.g. a large (128-bit?) event register and a stall-on-event opcode). If interrupts were events then a thread could wait on an interrupt without entering the kernel.
Actually, the Mill is designed to address this: it has a TLS segment for cheap green threading, a single address space (SAS) for cheap syscalls and microkernel architectures, cheap calls, and several details for IPC which are not public yet.
What about synchronization? Folks are terrified of threads because synchronizing is so hard. But a thread model can be the simplest, especially in message-passing models.
Very, very interesting, thanks for sharing! What would the path be to using existing code, and where would the Mill logically appear first?
Also, could something like the Mill work well within the HSA/Fusion/hybrid GPGPU paradigm? E.g. from my very amateur reading of your documents, it looks like a much-needed and very substantial improvement to single-threaded code; how would a mixed case work, where we have heavy matrix multiplication in some parts of our code as part of a pipeline with sequential dependencies? Would the ideal case be a cluster (or some fast interconnect fabric in a multi-socket system) of multi-core Mill chips?
Realistically, is this something that LLVM could relatively easily target? A simple add-in card that could give something like Julia an order of magnitude improvement would be a very interesting proposition, especially in the HPC market. I come at this mainly from an interest in how this will benefit compute-intensive machine learning/AI applications.
The latest talk on their website mentions the LLVM status in passing at the end. Essentially they're moving their internal compiler over to use LLVM, but it requires fixing/removing some assumptions in LLVM because the architecture is so different, and the porting effort was interrupted by their emergence from stealth mode to file patents.
Great idea. Since it's all theoretical at the moment, I'm wondering how well the offloading of work onto the compiler will actually perform. Itanium was capable of some amazing things, but the compiler tech never quite worked out.
Ah, but the Mill was primarily designed by a compiler writer ;)
Here's Ivan's bio that is tagged on his talks:
"Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011."
So actually it's been designed almost compiler-first :)
Still interested in how it works in practice. I'm pretty sure the Itanium team combined with Intel's compiler team have similar credentials.
I'm not saying it can't work, and I'm not saying it won't work, but we know that most code chases pointers. While CPU and compiler design is above my pay grade, I know that a lot of fancy CPU and compiler tricks that make things twice as fast on some benchmark often lead to 2-3% performance gains on pointer-chasing code.
Not sure how the Mill is going to make my ruby webapp go 8 times as fast by issuing 33 instructions instead of 4.
> Not sure how the Mill is going to make my ruby webapp go 8 times as fast by issuing 33 instructions instead of 4.
8x speed is not being claimed, 10x power/performance is. That could mean that the app runs at the same speed but the CPU uses 10% of the power. A lot of the power saving probably comes from eliminating many parts of a modern CPU, like the out-of-order circuitry.
Ok, so now that it's 10x power/performance I buy 10 of these things and it still only delivers 5% more webpages.
This kind of mealy-mouthed microbenchmark crap is exactly what the industry doesn't need. If I have a bunch of code that is pure in-order mul/div/add/sub, I put it on a GPU that I already have and it goes gangbusters. The problem is that most code chases pointers.
Like I said, great idea; I would love to see something that can actually serve webpages 10x as fast or at 1/10th the power (at a cost similar to today's systems).
I never thought of serving webpages as being CPU-bound. Anyway, to get a 10x speedup, you would have to buy enough of these to use as much power as whatever you're replacing. So if one Mill CPU uses 2% as much power as a Haswell, then you'd have to buy 50 of them to see a 10x performance improvement over the Haswell.
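Spelling that out (the 2% figure is the hypothetical from above, not a published number):

    /* 10x perf/W at 2% of the power means each chip gives 0.2x the baseline
     * performance, so matching the baseline's power budget takes 50 chips
     * and yields 10x the aggregate throughput.                             */
    #include <stdio.h>

    int main(void)
    {
        const double perf_per_watt_gain = 10.0;  /* the claim              */
        const double power_fraction     = 0.02;  /* hypothetical, as above */

        double perf_per_chip  = perf_per_watt_gain * power_fraction;  /* 0.2  */
        double chips          = 1.0 / power_fraction;                 /* 50   */
        double aggregate_perf = chips * perf_per_chip;                /* 10.0 */

        printf("%.0f chips -> %.1fx performance at equal power\n",
               chips, aggregate_perf);
        return 0;
    }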
The speedup for Ruby will come from the Mill enabling faster DBs and services for you to use, and from Ruby VM improvements that are perhaps not Mill-specific.
If you pick Ruby as your platform, though, you are really picking a point on the runtime vs development speed tradeoff that perhaps suggests you plan to scale sideways rather than upwards anyway; in which case the hosting platform for your app may be interested in the Mill even if its users are ambivalent.
Pointer chasing is a major concern, and the Mill can't magic it away. But there are other parts of your Ruby webapp that are a big deal, such as event loops, continuations and garbage collection, where again the Mill has special sauce. There is also special attention paid to syscall performance on the Mill. Rails makes a staggering number of syscalls per request, and Django, to pick an alternative, makes very few; still, I'd hope Rails moderates its syscalls a bit.
The beauty of the Mill is that it's been designed from the start to make the compiler extremely simple and straightforward. There is no "magic" in the software here; it's all in the hardware.
Actually there's a fair bit of magic in the software as a result of exposing the hardware rather than trying to hide it. Once the software can know how long things will take, suddenly it can do things that in x86 land would be magical.
This seems to me philosophically similar to what Sony was trying to do with the Cell processor: expose the hardware to programmers so that they can manage things better. The big difference being that the Mill was designed by a compiler writer rather than a bunch of guys who design GPU pipelines.
> Actually there's a fair bit of magic in the software as a result of exposing the hardware rather than trying to hide it. Once the software can know how long things will take, suddenly it can do things that in x86 land would be magical.
Ahhh. When I think of magic I think of stuff like optimizer heuristics that give incredible performance with very carefully written micro-benchmarks and abysmal performance in the worst case.
Well, this seems to fall within the VLIW tradition and has an exposed pipeline like the original VLIW, but there are a bunch of differences. In the original VLIW every instruction pipeline was conceptually a different processor, while the Mill is very much unified around its single belt, though I wonder if you could have a similar design with separate integer and floating point belts.
And instead of having a fixed instruction format the Mill has variable length bundles, which is good. Instruction cache pressure is certainly a traditional weakness of VLIW. So maybe you could say Mill:VLIW::CISC:RISC? But the most important part of RISC was separating memory access from operations and the Mill still certainly does that.
Some of the memory ideas are similar--Itanium had some good ideas about "hoisting" loads [1] which I think are more flexible than the Mill's solution. In general, this is a larger departure from existing architectures than Itanium was. Comparing it with Itanium, I doubt it will be successful in the marketplace for these reasons:
-Nobody could write a competitive compiler for Itanium, in large part because it was just different (VLIW-style scheduling is hard). The Mill is stranger still.
-Itanium failed to get a foothold despite a huge marketing effort from the biggest player in the field.
-Right now, everybody's needs are being met by the combination of x86 and ARM (with some POWER, MIPS, and SPARC on the fringes). These are doing well enough right now that very few people are going to want to go through the work to port to a wildly new architecture.
The compiler part seems to be a core part of the Mill's strategy: the representation and design seem to be oriented towards making it easy to compile for (the guy who gives the talks is a compiler writer). If the performance gains are half as good as advertised, and porting is not a complete pain (and it seems it won't be too bad), then they will have little difficulty attracting market share, even if only in niche applications at first.
> Right now, everybody's needs are being met by the combination of x86 and ARM (with some POWER, MIPS, and SPARC on the fringes). These are doing well enough right now that very few people are going to want to go through the work to port to a wildly new architecture.
That's not true at all. The biggest high-performance computing is being done on specialized parallel architectures from Nvidia (Tesla) [1]. Intel is trying to bring x86 back into the race with its Xeon Phi co-processor boards [2].
> Right now, everybody's needs are being met by the combination of x86 and ARM (with some POWER, MIPS, and SPARC on the fringes).
I'm not sure. I think that a hard port to a new architecture must look a lot more like a worthwhile effort now that the wait-six-months Plan A no longer works, especially for single-threaded workloads. Provided the new architecture can actually deliver the goods, of course.
LLVM's intermediate representation and Mill code are going to be pretty different. The LLVM machine model is a register-based machine (with an arbitrary number of registers; the backends do the work of register allocation). Basically, an easier, RISC-ish assembly.
So, while LLVM would be helpful for porting things to the Mill, as it's largely a "solve once use everywhere" problem, it's still not trivial. It could take a lot of effort to make it competitive.
As someone who knows next to nothing about CPU architecture but has watched most of the videos, it seems as though all the concepts are broadly familiar to experienced architecture people, but the details of every corner are slightly rearranged.
The position of the translation lookaside buffer is one example of this. That portion of the memory talk goes something like:
Ivan: usually the TLB is located here [points to slide]. In the Mill it's here [flips to next slide].
I am pretty sure they are talking about per-cycle performance, since they can do 33 operations per cycle. IIRC the peak performance of an Intel chip at the moment is 6 FLOP per 2 cycles (or thereabouts).
Of course this is beyond ridiculous, since a 780 Ti can pull off 5 TFLOP/s at a little under a 1 GHz clock; 5,000 FLOP per cycle is a little more than 33.
It seems like an interesting design, but comparing performance against what an x64 chip can do is a bit silly; you can't just pick numbers at random and call that the overall improvement.
A Haswell core can do 2 vector multiply-adds per cycle, which results in a peak of 32 single-precision FLOP per cycle per core or 16 double-precision FLOP per cycle per core.
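For anyone wanting the arithmetic behind those numbers:

    /* Two 256-bit FMA units per Haswell core, each FMA = 2 FLOPs per lane. */
    #include <stdio.h>

    int main(void)
    {
        const int fma_units     = 2;  /* ports 0 and 1           */
        const int flops_per_fma = 2;  /* multiply + add          */
        const int sp_lanes      = 8;  /* 256-bit / 32-bit float  */
        const int dp_lanes      = 4;  /* 256-bit / 64-bit double */

        printf("SP peak: %d FLOP/cycle/core\n", fma_units * flops_per_fma * sp_lanes);
        printf("DP peak: %d FLOP/cycle/core\n", fma_units * flops_per_fma * dp_lanes);
        return 0;
    }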
The instruction encoding talk starts with a comparison between the Mill, a DSP and Haswell, and tries to explain the basic math. The Mill is a DSP that can run normal, "general purpose" code better - 10x better - than an OoO superscalar. The Mill used in the comparison - one for your laptop - is able to issue 8 SIMD integer ops and 2 SIMD FP ops each cycle, plus other logic.
I was strictly replying to the Intel FLOPs claim of the parent comment. I have only a faint idea how the Mill CPU works, so I can't really compare against it.
From the little I have read, the Mill CPU looks like a cool idea, but I'm skeptical about the claims. I'd rather see claims of efficiency on particular kernels (this can be cherry-picked too, but at least it will be useful to somebody) than pure instruction decoding/issuing numbers. Those are like peak FLOPs: depending on the rest of the architecture they can become effectively impossible to achieve in reality. In any case, I'm looking forward to hearing more about this.
A key thing generally is that vectorisation on the Mill is applicable to almost all while loops, so it is about speeding up normal code (which is 80% loops with conditions and flow of control) as well as classic math.
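As a concrete example of the kind of "normal" loop meant here (plain C; whether and how well the specializer vectorises it is the Mill talks' claim, not something verifiable yet):

    /* A while loop with a data-dependent exit and a branch in the body:
     * exactly the shape that classic auto-vectorisers usually refuse, and
     * that the Mill talks say can be vectorised via speculation (None/NaR). */
    #include <stddef.h>
    #include <stdio.h>

    size_t count_nonspace(const char *s)
    {
        size_t n = 0;
        while (*s != '\0') {   /* trip count unknown until we get there */
            if (*s != ' ')     /* control flow inside the loop body     */
                n++;
            s++;
        }
        return n;
    }

    int main(void)
    {
        printf("%zu\n", count_nonspace("a b  c"));  /* prints 3 */
        return 0;
    }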
Well, at least they should tell _which_ number they picked. They also mention power, so I would expect it's something more than just instructions per cycle.
Given that there's not a publicly-available simulator or compiler for the Mill, I hope they washed their hands after retrieving those figures. The GI tract is not a friendly place... ;)
They have been running simulations and have had working compilers for a while now. We cannot verify the numbers yet, but they don't seem to have been pulled out of thin air.
I think their first to third talks go into more detail on this. IIRC it's something like ops per second per watt (and not completely theoretical best-case either: based on running realistic code in sim).
For those that are mainly software-oriented, the Lighterra overview posted earlier is helpful background for understanding where VLIW fits into the zoo of CPU architectures:
This whole thing is just horribly exciting for a computer architecture geek like me. I am somewhat worried about the software side given the number of OS changes that would have to be made to support this. But then again, there are lots of places in the world where people are running simple RTOSes on high end chips and the Mill probably has a good chance there. The initial plan to use an older process and automated design means that the Mill can probably be profitable in relatively modest volumes.
There is something that I can't get to add up here. The phasing claim is that there are only 3 pipeline stages, compared to 5 in the textbook RISC architecture or 14-16 in a conventional Intel processor, but that can't possibly square with the 4-cycle division or the 5-cycle mispredict penalty.
The phase says when the op issues. It takes some number of cycles before it retires. So a divide issues in the "op phase" in the second cycle, and if on that particular Mill model it takes 4 cycles then it retires in the fifth.
If there is a mispredict, there is a stall while the correct instruction is fetched from the instruction L1 cache. If you are unlucky, it's not there and you need to wait longer.
OK, so the phases aren't an apples-to-apples comparison with traditional pipeline stages, but are more in line with the TI C6x fetch, decode, execute pipeline, which for TI covers something like 4 fetch stages, 2 decode stages and between 1 and 5 execute stages. Thank you for the clarification.
From what I understand it would have very similar characteristics to current register renaming. You just get direct access to the whole register file rather than just a few ISA registers.
I think it would require some instruction scheduling to make optimal use of it, but that means the silicon doesn't need that logic so cores can be smaller and more efficient.
It seems to me that one of their sources of inspiration was the Transmeta processors: a VLIW core with a software translator from some intermediate bytecode (x86 in Transmeta's case).
I hope they get it to work better this time.
Well, the plan is to distribute an intermediate representation and then specialize it to the particular Mill pipeline the first time you load the binary. That's probably a lot easier than translating something that wasn't designed for it.
I believe IBM mainframes have traditionally used something like that: binary code is shipped for a general mainframe architecture, and on first execution is specialized to the hardware / performance characteristics of the particular model within that architecture that you're running. Also allows for transparent upgrades, since if you migrate to a new model, the binary will re-specialize itself on the next execution, (ideally) taking advantage of whatever fancy new hardware you bought.
We are starting work on an LLVM back end now. The tool chain will be described in an upcoming talk, so subscribe to the mailing list if you want to be in the audience or watch any available live streams.
I am also going to make a doc or presentation called "A Sufficiently Smart Compiler" to explain how easily the Mill can vectorise your normal code and so on :)