The Mill faces the same compiler problems that Itanium and other VLIWs have faced. There's only so much ILP available to extract, and even that is hard to do. It turns out that the techniques Fisher developed, trace scheduling [1] and its successors, work equally well for superscalars, so they bring no net advantage to VLIWs.
The great wide hope lives on, somewhere in the future.
IDK, their register model "the belt" has a temporal addressing scheme that seems to lend itself well to software pipelining in a way that's a pain to extract with a standard register set.
Itanium did as well, but to no avail: it had direct support for modulo loop scheduling (rotating registers). Also, register renaming (which is temporal) is useful for software pipelining.
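To make the software-pipelining point concrete, here's a toy C sketch of what modulo scheduling does to a loop. The function names are made up, and the hand-pipelined version only mimics the schedule a compiler would emit; it is not Mill or Itanium output.

    /* Original loop: each iteration is a serial load -> multiply -> store chain. */
    void scale(float *dst, const float *src, float k, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

    /* Modulo-scheduled by hand: in the steady-state "kernel", the store of
     * iteration i-2, the multiply of iteration i-1, and the load of iteration i
     * are independent, so they could all issue in the same cycle. */
    void scale_pipelined(float *dst, const float *src, float k, int n) {
        if (n < 2) { scale(dst, src, k, n); return; }
        float loaded  = src[0];          /* prologue: load for iter 0         */
        float product = loaded * k;      /* prologue: multiply for iter 0 ... */
        loaded = src[1];                 /* ... and load for iter 1           */
        for (int i = 2; i < n; i++) {    /* kernel:                           */
            dst[i - 2] = product;        /*   store    for iter i-2           */
            product = loaded * k;        /*   multiply for iter i-1           */
            loaded = src[i];             /*   load     for iter i             */
        }
        dst[n - 2] = product;            /* epilogue */
        dst[n - 1] = loaded * k;
    }

The catch is that when the three kernel operations really do issue in the same cycle, the previous iteration's product has to stay readable while the new one is being produced. Itanium handled that renaming with rotating registers; the belt's temporal addressing gets the same effect by naming values by age, which is why it maps onto software pipelining so naturally.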
I think the Mill people should concentrate on what VLIW has had some success with in the past: embedded. There will be tears if they go after general purpose.
Compiler problems are not the sole reason Itanium failed, perhaps not even the primary one. They were not good initially, that's true. But Itanium was more killed by a combination of factors, and especially by AMD64 existing.
The first Itanium sucked, and had a very bad memory subsystem. Itanium 2 was pretty competitive in performance, with one major exception: the x86 emulation. Why buy Itanium, even if it's very fast, when you could buy cheaper AMD64 that runs your existing software? Add in that Itanium compilers were few in number, all proprietary (and expensive), and didn't generate very good code for the first few years. Also, Itanium violated some programmer assumptions (a very weak memory model, and accessing an undefined value could cause an exception).
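On the "accessing an undefined value could cause an exception" point, here's a hedged C sketch of the kind of code that trips it; the function names are made up, and the ld.s/chk.s placement in the comments is conceptual, not actual compiler output.

    /* Control speculation on Itanium: the compiler may hoist a guarded load
     * above its branch as a speculative load (ld.s). If that load would fault,
     * the destination register is marked NaT ("Not a Thing") instead, and the
     * fault only surfaces later, at a chk.s or a non-speculative use. */
    int read_if_valid(const int *p, int valid) {
        int v = 0;
        if (valid)
            v = *p;      /* may be emitted as: ld.s before the branch,
                            chk.s here to recover if it faulted */
        return v;
    }

    /* The programmer-assumption problem: on x86, reading garbage such as an
     * uninitialized local just yields an unpredictable value. On Itanium an
     * uninitialized stacked register can carry a NaT bit, so merely copying or
     * comparing the "undefined" value can raise a fault. (Undefined behavior
     * in C either way, but plenty of old code did it and got away with it.) */
    int read_uninitialized(void) {
        int x;
        return x;
    }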
Now, Mill has a better way of extracting ILP than Itanium does, compiler technology is much better, and JIT compilers are very common. VLIW processors can be very good JIT targets. Mill, if it ever materializes, has enormous potential.
Obviously I agree about Itanium vs AMD64. That's a one punch knockout. It was freaking brilliant of AMD.
However, the Mill doesn't extract ILP; compilers do that. And in a given code base there's only so much available. Yes, compiler technology is much better now, and still there's only so much ILP available.
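A minimal illustration of "only so much ILP available" (made-up functions):

    #include <stddef.h>

    struct node { struct node *next; int val; };

    /* Essentially no ILP: each next-pointer load depends on the previous one,
     * so no scheduler (VLIW compiler or OoO hardware) can overlap iterations;
     * throughput is limited to roughly one node per load-use latency. */
    int list_sum(const struct node *n) {
        int sum = 0;
        for (; n != NULL; n = n->next)
            sum += n->val;
        return sum;
    }

    /* Plenty of ILP: the loads are independent and the sum splits into
     * independent partial sums, so a wide machine can keep many in flight. */
    int array_sum(const int *a, int n) {
        int s0 = 0, s1 = 0;
        for (int i = 0; i + 1 < n; i += 2) { s0 += a[i]; s1 += a[i + 1]; }
        if (n & 1) s0 += a[n - 1];
        return s0 + s1;
    }

The first loop runs at the same speed no matter how clever the compiler or how wide the machine; that's the ceiling being talked about here.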
Lastly, VLIW has been tried with JITs at least twice: Transmeta Crusoe and IBM Latte. VLIW code generation is hard, and it's harder if you have very little time, which is the nature of a JIT.
Denver is not a VLIW; it's 7-way superscalar [1]. Haswell is 8-way. Wide superscalar is really common and has a lot of the advantages of VLIW without the impossible compiler headaches.
Haswell is generally considered 4-wide [1]. As far as I've heard, Denver really is VLIW. I don't think there are many in-order [2] CPUs that are that wide. I think the article is using superscalar loosely (as in 'can execute more than one operation per cycle', which is true for VLIW, although the operations are technically all part of the same instruction).
[1] Apparently it can sustain 5 instructions per clock given a very specific instruction mix.
[2] Or out-of-order, really, Power8 being the exception.
It isn't OoO, at least not a 224-entry-window OoO. 7-wide can simply mean that it has the decode and execution resources for 7 operations per cycle. It's a tablet core; they don't have a 7-wide OoO core in there.
I was at the EE380 talk :). It's amazing how few people show up to EE380 these days.
Yeah, Denver is in-order superscalar and it's a JIT but it isn't VLIW. Sad to say, they've tried JITing to in-order superscalar as well. They had a design win with the Nexus but even now NVidia is switching over to RISC-V for Falcon.
It is not from lack of effort that the JIT approach hasn't really worked. It's competitive but not outstanding. Denver, from the EE380 talk, thrives on bad bloated code. It's not so good on good code. This is not a winning combination.
Well, Falcon is intended to be a controller, not a (relatively) high-performance core, so that's not an apples-to-apples comparison. If it's not VLIW, then what is it? In-order superscalar? Do you mean superpipelined, or scoreboarded (like an ancient Cray)?
Bad bloated code is 90%+ of the code in the world ;)
Also, shoot me an email at sixsamuraisoldier@gmail.com you seem to have some inside info on Denver, I'd love to chat (don't worry, I won't steal any secrets ;) )
This. People fail to understand that Itanium wasn't the only representative of VLIW.
Better compiler tech in the past few years (you mentioned JIT for example, which Denver has adapted to do quite well) has made VLIW a strong technical contender in several markets. Alas, the overhead cost of OoO is no longer the issue in modern computer architecture.
Denver is a JIT but the microarchitecture is 7-way superscalar [1]. A lot of the Transmeta people ended up on Denver and I'll guess they didn't want to repeat VLIW.
"The Denver 2.0 CPU is a seven-way superscalar processor supporting the ARM v8 instruction set and implements an improved dynamic code optimization algorithm and additional low-power retention states for better energy efficiency."
Note that I actually think Itanium is somewhat architecturally interesting, and that VLIW is simultaneously both over- and under-appreciated. I wanted to just highlight the example of a chip that definitely shook the world, and it didn't even have to succeed to do it!
Regarding the Mill: I still haven't heard a (good) answer for the perennial question: Where's the ILP?
Every time I ask this question, I'm pointed to the "phasing" talk. The issue is:
1) It's quite similar to a skewed pipeline (which Denver already has), in which you "statically schedule across branches" and delay the start until its inputs are ready. Now, Denver did quite well indeed, but it's hardly a replacement for OoO.
2) Even from the examples they show, it's clear that this is nowhere near 3x the ILP; at best, even in their own example, it provides 2x.
I hope the Mill can do well in embedded, because they have quite a few good ideas (albeit many of the ideas they claim as their own have already been done, such as XOR-ing addresses, done by Samhita), and because the computer architecture industry is currently starved for innovation.
If so, it's from pseudo-vector instructions made by concatenating many opcodes into the same instruction word.
I don't think they claim it performs better than OoO, just that it uses less power.
Nah, it's late here, and I'm already saying stupid things. I was talking about the mechanism that gives parallelism to the CPU, while you are talking about the intrinsic level of parallelism in the programs.
Just dismiss it. I was even going to delete it, but I took too long.
Itanium was just bad execution. They made the classic "rewrite everything and promise parity from day 1" mistake. When you spin off a new line you need time for it to reach maturity, and that usually takes several years. Look at Microsoft Windows: they ran the hacked-up Windows 95 line of the OS alongside the purer NT line for about 5 years before they could fold the features of the mass-market client OS into NT and end up with only one major kernel line for development. Similarly, Itanium did end up getting significantly better, but it took years, and in the meantime every other architecture also got better, so it wasn't exactly hugely superior anymore.
When a new technology or system is oversold and overhyped then underperforms (which is almost inevitable because of product maturity issues) people tend to respond sharply negatively.
The huge disadvantages of Itanium's style of compile-time VLIW, as they were explained to me, were:
1. There is absolutely no flexibility in scheduling. Any EU stall due to memory-access delays, non-constant-time instructions, or the like delays completion of the entire instruction word. This makes cache misses disastrous (see the sketch after this list).
2. If a newer CPU makes improvements to the architecture (say, adding new EUs), programs cannot take advantage of those improvements until compilers are updated to support those improvements, and the programs are rebuilt with the new compiler. This is unlike typical superscalar architectures, where a new EU will be used automatically with no changes needed to the program.
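To make disadvantage 1 concrete, here's a hedged sketch; the bundle layout in the comment is generic pseudo-VLIW, not actual Itanium encoding.

    /* Two independent dependence chains in one loop. */
    int dot_and_count(const int *a, const int *b, const int *flags, int n) {
        int dot = 0, count = 0;
        for (int i = 0; i < n; i++) {
            dot   += a[i] * b[i];        /* chain 1: two loads feed a multiply, then an add */
            count += (flags[i] != 0);    /* chain 2: independent of chain 1                 */
        }
        return dot + count;
    }

    /* A static VLIW schedule has to assume a latency (i.e. a cache hit) for each load:
     *
     *   bundle k:   { ld ra = a[i]   | ld rb = b[i]        | ld rf = flags[i] }
     *   bundle k+1: { mul t = ra*rb  | count += (rf != 0)                     }
     *   bundle k+2: { dot += t       | ...                                    }
     *
     * If a[i] misses, an in-order VLIW stalls the whole machine at bundle k+1;
     * the compare/add for `count`, which needs nothing from the missing load,
     * waits too. An OoO core keeps executing chain 2 and later independent
     * loads while the miss is outstanding; that's the memory-level parallelism
     * (MLP) the reply below talks about. */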
As an overarching critique of VLIW, this isn't that bad, but it does ignore a few things:
1. It's not strictly true that there is "absolutely no flexibility in scheduling"; there are a few techniques that at least mitigate this issue, such as Itanium's advanced loads (see the sketch after this list) and a few others. In the aggregate, though, the lack of MLP is absolutely what killed VLIW, especially nowadays.
2. This is true, but IBM's style (which the Mill has adopted) of "specialization" can effectively negate this issue.
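For reference, the "advanced loads" mechanism mentioned in point 1: a hedged sketch, where the ld.a/chk.a lines in the comment are conceptual rather than real codegen.

    /* The compiler would like to hoist the load of *in above the store so its
     * latency is hidden, but it can't prove that maybe_aliases != in. */
    void update(int *out, const int *in, int *maybe_aliases) {
        *maybe_aliases = 0;
        int x = *in;
        *out = x + 1;
    }

    /* With Itanium data speculation the load is hoisted anyway:
     *
     *   ld.a  r = [in]              ; load early; record the address in the ALAT
     *   st    [maybe_aliases] = 0   ; the possibly-conflicting store
     *   chk.a r, recover            ; if the store overlapped [in], jump to
     *                               ; recovery code that redoes the load
     *
     * That buys back some scheduling freedom around memory, but only where the
     * compiler chooses to speculate; it doesn't give the blanket ability of OoO
     * hardware to slide work around an unexpected cache miss. */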
There's some truth to the claim that Itanium was terrible execution, but a number of attempts have been made since then as well; high-end general-purpose VLIW is just not the way to go.
Note the "high-end" and "general-purpose": if you're talking about smartphone cores or supercomputing, then VLIW can actually perform quite well. Indeed, Itanium enjoyed decent success in supercomputing, and the Qualcomm Hexagon, albeit not really general purpose, finds itself in many processors doing solid work.
Yup, that's the other classic mistake, committing to a design based on theoretical rather than practical merits.
It's a tempting way to do things; it's very difficult and expensive to spin up and maintain dual development and production pipelines in parallel. But more often than not that's the best way to go about it. Sometimes you make missteps that seemed sensible at the time. Sometimes you torpedo your whole business because you take too big a step back and can't even survive long enough to let the new thing mature. Look at Intel: they've had multiple missteps that did tremendous damage to their business. They jumped on the NetBurst bandwagon and that turned out to be a dead end. The Core architecture basically grew out of their Pentium M work.
I'm hoping the Mill folks can bring attention back to VLIW.