For the uninitiated, the bringup of a new processor is very complex - from compiler to libraries to kernel, from emulation to real hardware. The process described here is very similar to the one I was able to observe at SiCortex, though that chip was based on an existing one and so there was no FPGA stage. I'll bet they even use some of the same tools, such as the emulator. Kudos to them for not rushing ahead before their foundation is ready, and I look forward to seeing how this all works out.

Exactly. As a new entrant, it is impossible for us to immediately enter the mass markets dominated by the majors. Consequently we adopted a strategy of targeting an increasing set of niche markets that have been poorly served by the majors and their products, and then going after larger markets as we grow in resources.

This market strategy dictates the ability to produce many specialized products with small sales for each. That can't be done in the million-monkeys development approach used by the majors, which is why the majors neglect these markets. So we adopted a specification-based design strategy.

In turn, the design strategy dictates the development strategy: first the specification tools; then the assembler and simulator so we could try out various designs in small test cases and improve the architecture; then the tool chain so we could measure real quantities of code and confirm the ISA and macro-architecture. Then, and only then, write what manual RTL is left that the generators working from the specifications can't handle. The combined RTL must be verified, and it is much easier and cheaper to do that in an FPGA than with fab turns. As the message says, the FPGA is next.

Lastly, we will pick a particular market and specify the ideal product for it, run the spec through the process we have been so long building, and the fab gives us something in a flat pack.

Which won't work, of course. The first time, anyway :-)

Ivan, thank you very much for this comment. I have been struggling to communicate similar problems related to scale in the biological sciences. The methodology needed to discover something new is far too "specialized" for the techniques used by the "majors" to be economically viable. I have been pushing for a specification-based approach and your account here is the perfect articulation in support of it. I was lucky enough to spot the Mill talks when they first appeared and have been following along from the sidelines. I will keep cheering and look forward to more progress. Best of luck to you and the whole Mill team!

Bringup is when you throw everything together and that means that you have everything to throw together. It's like that Johnny Cash song, One Piece At A Time.

  Now, up to now my plan went all right
  'Til we tried to put it all together one night
  And that's when we noticed that something was definitely wrong.

That's bringup. Compiler development ain't part of bringup. They could/should have been doing an LLVM backend using a software simulator (and eventually an FPGA simulator) which is the traditional approach handed down from our knuckledragger forefathers and going back decades.

Still, I truly wish them well. It beats the crap out of 99% of the YC startups I see. The 37th recommendation engine, a webstore for sneakerheads, ...

They uh, are already doing an LLVM backend for a software simulator? The FPGA work is its own project.

The bringup process includes all the stuff leading up to that final moment, just like most other processes - e.g. release process - include the stuff leading up to a conclusion. The quibble really wasn't very constructive.


"After a design is created, taped-out and manufactured, actual hardware, 'first silicon', is received which is taken into the lab where it goes through bringup. Bringup is the process of powering, testing and characterizing the design in the lab."

Well, sorry, but I worked with a whole lot of people who were designing, verifying, and building that processor, and they used "bringup" for the entire process. I was part of that process myself. Actual experience counts for more than argumentum ad wikipedia. Also, even if the word was wrong, that matters far less than the context in which it was used. Dictionary flames are the last refuge of someone with nought worthwhile to say.

The "Software" section of the article says they did exactly the things you suggest they should do.

Is it reasonable to still not be running on an FPGA after 12+ years of working on it?

It seems that they're working on it somewhat part-time, and have prioritised getting the patents first. Which makes some sense as it's their only way to avoid being instantly destroyed by Intel.

Evolutionary development you can start at once, because you are building on what you had before; think an x86 generation. Evolution works if you already dominate a market and only need to run a little faster than your competitor. Evolution can be scheduled; tick-tock.

A newcomer can't sell yet another me-too, even with evolutionary improvement. Instead the newcomer has to rethink and create from first principles if it is to have any chance in the market. Rethinks can't be scheduled; they take as long as they take. Ours has taken longer than I'd hoped, but adding resources like more people to the project would have just slowed us down, or forced us to market with a broken product.

The Mill rethink stage is over, and we now can have reasonable schedules and put in more resources; that's why we are going out for a significant funding round this year, our first over $10M.

That doesn't sound right even though I support your niche strategy. Many of the successful companies had me-too products with incremental, ecosystem, or marketing improvements. Intel did with x86 starting as an incremental improvement. AMD turned into a huge company doing it to x86. Transmeta and Centaur did well adding power efficiency among other things. Quite a few vendors implemented POWER variants, with many acquired or still doing good business.

There's plenty of it on the software side, too, with the proprietary DOSes and UNIXes. Foreign cloners kept doing it with mainframes and embedded CPUs. So, incremental stuff (especially when patented) can certainly grab market share and generate revenue. It's been going on for some time even in the CPU market, even with dominant players.

Yes - by those with established businesses. I can't think of a startup that succeeded with an initial incremental, at least since Amdahl. It's also hard to do an increment: the Intel teams are not dumb, and many could and have built far better processors than Intel has - but not while keeping compatibility and Intel's ROI and marketing and pricing structure. Transmeta tried - RIP.

There's also personal strategy involved. If we had done a better x86 we would have needed huge dollops of money to crack the front door of the market - cf. Transmeta - and lost ownership of the company. By going for the disruptive approach we still own all of it - and funding rounds now are at a valuation that will keep us making our own mistakes, not someone else's. That matters to me, enough to go without a paycheck for a decade. YMMV.

Aren't almost all the businesses mentioned here extinct, or at least have tiny chunks of the market?

More disruptive startups are. These were either running profitably or absolutely huge at one point. Blame AMD's and VIA's management for the rest on their end. ;)

Starting out in x86, AMD had a license from Intel.

That justifies my use of it even more. Even with the license requirement, they still succeeded quite a bit. Even led on the 64-bit part since Intel screwed that up.

(Mill team)

Too right! 2017 is the year we get funding to go full-time and implement what we've been plotting at our Tuesday evening meetings for 10 years :)

Hmm, if you had a model with 128-bit words that skipped floating point, and left out the sneaky stuff that makes some people distrust Intel and AMD processors, you'd have an almost ideal chip for Ethereum nodes. (Make it 256-bit words and it'd really be ideal.)

Not that that's a huge market at this point, but apparently there's a lot of Fortune 500 interest so maybe that'll change by the time you're in production.

Something I don't see mentioned on your site is the rest of the hardware. Specifically, is the Mill intended to be more of a co-processor with computing work handed over to it, or is it standalone in a system? If standalone, what is the situation with bootup (e.g. UEFI or equivalent) and controllers like PCIe, USB, storage (NVMe, SATA etc.), NICs etc.?

AFAIK, it's intended as a general-purpose processor.

And when you show a reference design for a smart phone with provably strong security capabilities it's going to get very real indeed. Kudos for continuing the push toward real parts.

But if you wait long enough your earlier ideas can no longer be patented due to prior art.

If they run it on an FPGA the game will be over. The first questions to ask are how it compares in terms of performance and gate count to other architectures running some benchmarks. In particular they need to compare it to RISC-V, which is deliberately not patent encumbered and seems to be killing it in performance per area and per watt.

I can't say for certain that they'll fail. Maybe they'd be twice as good, but anything short of that and I just don't see viability.

Perhaps you are thinking that the FPGA is a product? It's not; it's an RTL validator. Moving chip RTL from FPGA to product silicon is a well understood step that is almost routine in the industry. Time was you would do initial RTL development work in silicon, but modern FPGAs are big enough to hold a whole CPU core. Today you wouldn't develop directly on the silicon without an FPGA step, even if you own a fab; it's just too expensive to debug.

An FPGA will require an actual gate count, and it can give you a meaningful DMIPS/MHz number. Both of which are relevant metrics.

Their claim was ~10x improvement in performance / power compared to existing general-purpose CPUs.

> If they run it on an FPGA the game will be over.


That's great then. An FPGA implementation should be able to validate that. Even a fraction of that 10x improvement could change the world. So why does it take so long to benchmark an actual implementation? That's a rhetorical question, people implement CPUs in FPGAs all the time and this one is supposed to be simpler. There really is no excuse for it to take so long. They need to stop talking about their project and actually show us something.

> That's a rhetorical question, people implement CPUs in FPGAs all the time and this one is supposed to be simpler.

Huh? You're not serious... The Mill is definitely not simpler than many other CPUs (in-order CPUs, for example); what it is supposed to have is better performance, or more exactly a better performance/power ratio.

I am not sure. I heard about the Mill some years ago and watched some presentations about the belt and my feeling was that it would be simpler to map functional languages onto it.

Maybe that was more wishful interpretation, but if I remember correctly the idea is to put more logic into the compiler (good old "sufficiently smart compiler") to do the optimization rather than do it at runtime in HW. IMHO this implies less logic in the HW, which might make the implementation simpler (I am a SW guy so my computer architecture knowledge might be skewed).

Furthermore, I believe that we might be able to infer more and more bounds about runtime behavior that could possibly give rise to more aggressive optimizations. In particular the Mill provides more predictable performance than traditional architectures, which is a feature in itself as it e.g. simplifies (again IMHO) real-time programming.

So that's why I believe the implementation doesn't have to be harder than for traditional architectures.

I am really looking forward to seeing some results and admire their effort.

The belt's designed to be easily compiled to (register coloring is NP hard) but it requires the CPU to analyse data flow somewhat to optimize in a way where there isn't a bunch of data copying. Godard frequently explains that the belt is conceptual, that the hardware should be doing optimizations underneath.

Whereas for a simple architecture registers are simple: they index into the register file. This only breaks down when we bring in out-of-order execution & register renaming in order to have more registers than the instruction set specifies.
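To make the "conceptual belt" a bit more concrete, here is a minimal toy sketch of the programmer-visible model only. It assumes the usual description of the belt as a fixed-length queue where new results land at the front and operands are named by position; the `Belt` class, its method names, and the belt length of 4 are all made up for illustration and say nothing about how the hardware implements it underneath.

```python
from collections import deque


class Belt:
    """Toy belt: a fixed-length queue of recent results (hypothetical model).

    Position 0 is the newest value. Dropping a new result shifts older
    values back, and the oldest one falls off the end.
    """

    def __init__(self, length=8):
        # deque with maxlen discards from the far end when full.
        self.slots = deque(maxlen=length)

    def drop(self, value):
        # New results always land at the front of the belt.
        self.slots.appendleft(value)

    def get(self, position):
        # Operands are named by how recently they were produced.
        return self.slots[position]

    def add(self, a, b):
        self.drop(self.get(a) + self.get(b))

    def mul(self, a, b):
        self.drop(self.get(a) * self.get(b))


belt = Belt(length=4)
belt.drop(3)      # belt: [3]
belt.drop(4)      # belt: [4, 3]
belt.add(0, 1)    # belt: [7, 4, 3]
belt.mul(0, 2)    # belt: [21, 7, 4, 3]
belt.drop(5)      # belt: [5, 21, 7, 4]; the 3 has fallen off the end
```

Note there are no destination names anywhere: every op just drops its result at the front, which is the contrast with indexing into a register file.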

Optimal register coloring is NP hard, but no compiler does that; heuristics are no worse than quadratic and give near-optimal.
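For readers who haven't seen one, a heuristic allocator can be sketched in a few lines. This is a plain highest-degree-first greedy coloring, chosen only because it is short; it is an assumption for illustration and not what any particular compiler does. The function name `greedy_color` and the example interference graph are made up.

```python
def greedy_color(interference, num_registers):
    """Assign registers with a simple highest-degree-first heuristic.

    interference maps each variable to the set of variables that are
    live at the same time. Returns (assignment, spilled).
    """
    # Visit the most constrained variables first; ties broken by name.
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    assignment = {}
    spilled = []
    for var in order:
        # Registers already taken by neighbours that have been colored.
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            # No register left: this variable would live in memory.
            spilled.append(var)
    return assignment, spilled


graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
assignment, spilled = greedy_color(graph, num_registers=3)
tight_assignment, tight_spilled = greedy_color(graph, num_registers=2)
```

With three registers nothing spills in this example; with two, the triangle a-b-c cannot be colored and one variable is spilled. The result is not guaranteed optimal, which is the trade the comment describes.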

The Mill specializer part that schedules ops is linear, while the part that assigns belt numbers and inserts spills is NlogN in the number of ops in an EBB because it does some sorts.

> my feeling was that it would be simpler to map functional languages onto it

I found this [1] blog post interesting - that mainstream architectures have defaulted to low level C-machines and that radically new CPU designs might return to the halcyon days of lisp.


Thx, that was an interesting read.

However, I think that Lisp is too powerful to be executed directly on the machine (as I understand it, Lisp machines provide HW capabilities to deal with cons and list atoms).

I often wonder if we should try to make a non-Turing-complete language, but one that we make "pseudo-Turing-complete" by using some kind of bounded automaton with insane worst-case upper bounds (e.g. about complexity) that we can feed to the OS/machine, which can then schedule aggressively as it knows upper timing bounds etc.
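One possible reading of that idea, as a minimal sketch: if every loop in the language carries a constant bound, a worst-case step count can be computed without running the program and handed to a scheduler. The instruction format (`"op"` and `"repeat"` tuples) and the function name `worst_case_steps` are invented for this example.

```python
def worst_case_steps(program):
    """Upper bound on executed steps, computed without running the program.

    A program is a list of instructions (hypothetical format):
      ("op", name)          costs one step
      ("repeat", n, body)   runs body at most n times; n is a constant
    """
    total = 0
    for instr in program:
        if instr[0] == "op":
            total += 1
        elif instr[0] == "repeat":
            _, bound, body = instr
            # Nested bounds multiply, so the total is always finite.
            total += bound * worst_case_steps(body)
        else:
            raise ValueError(f"unknown instruction: {instr[0]}")
    return total


program = [
    ("op", "load"),
    ("repeat", 10, [
        ("op", "mul"),
        ("repeat", 3, [("op", "add")]),
    ]),
    ("op", "store"),
]
bound = worst_case_steps(program)  # 1 + 10 * (1 + 3) + 1 = 42
```

The bounds can be made as large as you like, so in practice most programs still fit, but unbounded loops and unbounded recursion are not expressible.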

Btw. is there good literature about the implementation of intermediate languages as compilation targets for static functional languages? I think the go-to book was SPJ's "The Implementation of Functional Programming Languages", but I am not sure how relevant it is today.

To "execute Lisp directly" means to interpret the AST. No hardware machine does that that I know of. It can almost certainly be done, but arguably shouldn't. Interpreting the raw syntax is a bootstrapping technique to which it would seem foolish to commit a full-blown hardware implementation.

The Mill is simpler than many CPUs because it isn't OoO and it has a single address space, but it also has some magic in it (the way it zeros memory, the stacks...) which isn't free to implement.

An FPGA implementation on general-purpose fabric will not validate the performance/watt when compared to dedicated silicon. The two aren't comparable. This is part of the reason modern FPGAs have onboard ARM processors, as well as graphics and I/O peripherals.

>> An FPGA implementation on general-purpose fabric will not validate the performance/watt when compared to dedicated silicon.

No, of course not. But it will validate benchmarks per clock cycle, as well as providing the gate count (or LUT count) required to achieve that benchmark result. If the architecture is anywhere near as awesome as claimed, there should be a strong indication of that in the numbers produced by the FPGA implementation.
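The normalization being asked for is simple arithmetic. A small sketch, assuming the conventional Dhrystone reference score of 1757 Dhrystones per second for 1 DMIPS; the function names are made up, and the input numbers are invented purely to show the calculation, not measurements of the Mill or any real core.

```python
# Conventional reference: 1 DMIPS = 1757 Dhrystones per second (VAX 11/780).
DHRYSTONES_PER_SEC_PER_DMIPS = 1757


def dmips_per_mhz(dhrystones_per_sec, clock_mhz):
    """Per-clock score: removes the FPGA's low clock rate from the picture."""
    dmips = dhrystones_per_sec / DHRYSTONES_PER_SEC_PER_DMIPS
    return dmips / clock_mhz


def dmips_per_kilo_lut(dhrystones_per_sec, lut_count):
    """Area-normalized score: performance per thousand LUTs used."""
    dmips = dhrystones_per_sec / DHRYSTONES_PER_SEC_PER_DMIPS
    return dmips / (lut_count / 1000)


# Invented example figures for a hypothetical soft core.
example_rate = 175_700     # Dhrystones per second
example_clock_mhz = 100    # FPGA clock in MHz
example_luts = 20_000      # LUTs consumed by the design

per_clock = dmips_per_mhz(example_rate, example_clock_mhz)
per_area = dmips_per_kilo_lut(example_rate, example_luts)
```

Neither number says anything about watts, which is the earlier point about FPGA fabric; they only let two designs be compared at equal clock and roughly equal area.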

