
Interesting, the architecture looks greatly simplified compared to even standard RISC (as opposed to, let's say, x86). Due to that simplification it should be power-efficient while being inherently highly parallel.

Would be interesting to find out:

1. How high can that degree of parallelism be pushed? Are we talking about tens or hundreds of pipelines?

2. What frequency will this operate at?

3. What is up with RAM? I saw nothing about memory; with lots of pipelines it is bound to be memory-bound.




Hi, I'm the author of that intro. The talks which Ivan has been giving - there are links in that intro - go into everything in much more detail. But here's a quick overview of your specific questions:

1: we manage to issue 33 operations / sec. This is easily a world record :) The way we do this is covered in the Instruction Encoding talk. We could conceivably push it further, but it's diminishing returns. We can have lots of cores too.

2: it's process agnostic; the dial goes all the way up to 11

3: the on-chip cache is much quicker than in conventional architectures, as the TLB is not on the critical path, and we typically have ~25% fewer reads on general-purpose code due to backless memory and implicit zero. Main memory is conventional memory, though; if your algorithm is zig-zagging unpredictably through main memory, we can't magic that away.
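To make the backless memory / implicit zero idea concrete, here is a toy Python model (my own simplification, not the Mill's actual mechanism): memory that has never been written is treated as all-zero, so loading it costs no DRAM access at all.

    # Toy model of the implicit-zero / backless-memory idea (my own
    # simplification, NOT the Mill's actual mechanism): memory that has
    # never been written is treated as all-zero, so loading it costs no
    # DRAM access at all.

    class ToyMemory:
        def __init__(self):
            self.backed = {}      # line address -> value actually written
            self.dram_reads = 0

        def load(self, addr):
            line = addr & ~0x3F   # 64-byte line granularity
            if line not in self.backed:
                return 0          # implicit zero: no DRAM traffic
            self.dram_reads += 1  # a real read of backed storage
            return self.backed[line]

        def store(self, addr, value):
            self.backed[addr & ~0x3F] = value

    mem = ToyMemory()
    mem.store(0x1000, 42)
    print(mem.load(0x1000), mem.load(0x2000), mem.dram_reads)  # 42 0 1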


> "the on-chip cache is much quicker than in conventional architectures, as the TLB is not on the critical path"

I would really like to know your reasoning that the TLB is a major bottleneck in conventional CPUs. CPUs execute a TLB lookup in parallel with the cache, so there is usually no latency except on a TLB miss.

Basic research on in-memory databases suggests that eliminating the TLB would improve performance by only about 10%; that certainly isn't a realistic use case, and most of the benefits can be obtained simply by using larger pages. So I don't really know where your claim of 25% fewer reads comes from in relation to simply getting rid of virtual memory.


Right, most modern caches use the virtual address to get the cache index and use the physical address for tag comparisons [1]. Since on x86 the bits needed for the index are the same between the virtual and physical address, the entire L1 lookup can be done in parallel with the TLB, though for other architectures like ARM you need to finish the TLB step before the tag comparison [2].

But while I think the Mill people are overselling the direct performance benefits here, the single address space lets them do a lot of other things, such as backing up all sorts of things to the stack automatically on a function call and handling any page fault that results in the same way it would be handled if it were the result of a store instruction. And I think their backless storage concept requires it too.

[1] http://en.wikipedia.org/wiki/CPU_cache#Address_translation

[2] Unless you were to force the use of large page sizes, as some people suggest Apple might have done with their newest iPhone.
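For concreteness, the parallel-lookup point boils down to a bit-count argument. Here's a quick sketch with illustrative numbers (32 KiB, 8-way, 64 B lines, 4 KiB pages; these are my own assumed parameters, not figures from the talks):

    # Rough sketch of why a typical x86 L1 can be indexed before translation.
    # Cache parameters are illustrative (32 KiB, 8-way, 64 B lines, 4 KiB
    # pages), not taken from the Mill talks.

    cache_size  = 32 * 1024
    ways        = 8
    line_size   = 64
    page_size   = 4 * 1024

    sets             = cache_size // (ways * line_size)  # 64 sets
    offset_bits      = line_size.bit_length() - 1        # 6
    index_bits       = sets.bit_length() - 1             # 6
    page_offset_bits = page_size.bit_length() - 1        # 12

    # The set index uses address bits [6..11], all inside the 12-bit page
    # offset, so virtual and physical addresses pick the same set and the
    # TLB lookup can run in parallel with the cache indexing.
    print(offset_bits + index_bits <= page_offset_bits)  # True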


Part of the reason the TLB is so fast is that it is fairly small, which also means it misses fairly often. Moving the TLB so that it sits in front of DRAM means you can have a 3-4 cycle TLB with thousands of entries.


Have you watched Godard’s talks? He goes into considerable detail about this.


After watching the first Mill video I remember reading someone who said that this seemed like the perfect architecture for a Lisp.

It was something about scopes mapping very well onto the Mill's "memory model."

I'm not quite up to that sort of analysis myself, but I'm wondering if you see that too? If so, I would love to read more about it.


33 operations/cycle require operand storage with (at least) 66 ports: 33 for reads and 33 for writes; otherwise some of those slots are NOPs. For two-operand instructions the count goes to 99 = 33*3 ports, and for three-operand instructions (ternary operator) it goes to 132 ports.
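To spell out that arithmetic, here is a back-of-the-envelope sketch assuming a conventional monolithic register file where every issued operation reads all its sources and writes one result in the same cycle (my own accounting, not a claim about the Mill):

    # Back-of-the-envelope port counts for a conventional monolithic
    # register file, assuming all 33 operations read every source operand
    # and write one result in the same cycle. Illustrative only.

    ISSUE_WIDTH = 33

    def ports_needed(sources_per_op, results_per_op=1):
        reads  = ISSUE_WIDTH * sources_per_op
        writes = ISSUE_WIDTH * results_per_op
        return reads + writes

    print(ports_needed(1))  # 66  (one source, one result)
    print(ports_needed(2))  # 99  (two sources, one result)
    print(ports_needed(3))  # 132 (three sources, one result)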

As far as I know, Elbrus 3M managed to achieve about 18 instructions per clock cycle, with VLIW and a highly complex register file whose design slowed the overall clock frequency to about 300MHz on a 0.9um process. To put everything in perspective, a plain Leon2 managed about 450MHz in the same process, without any tweaks or hand work, and Leon2 is not a speed champion.

So the question is: do you have your world record in simulation, or in real hardware like an FPGA?


33 ops/sec? :)


Oooh, too late for me to correct that particular typo :)

33 ops / cycle, sustained. Last night we also published an example list of the FU mix on those pipelines here: http://ootbcomp.com/topic/introduction-to-the-mill-cpu-progr...


I am pretty sure this means 33 ops/cycle, as the question asked how far the multi-pipeline model can be pushed.


Well, that probably is actually a new world record.


Probably 33 operations in parallel since the original question was talking about parallelism.


> "Interesting, the architecture looks greatly simplified compared to even standard RISC"

Depends on how you define simplicity, really. Writing a good back-end for this architecture is likely to be very challenging.


Not really. The temporal addressing is quite simple for a compiler to generate. Plus, Mill was designed by a compiler writer in the first place.

What challenges would you think a back-end developer would face?
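For readers unfamiliar with the term, here is a toy Python model of what temporal addressing looks like (a heavy simplification of what the talks call the belt, not actual Mill code): each result is pushed onto a short fixed-length belt, and operands are named by how recently they were produced instead of by a register number.

    from collections import deque

    # Toy model of temporal addressing (the Mill's "belt"), heavily
    # simplified and NOT actual Mill code: every result is pushed onto the
    # front of a fixed-length belt, and operands are named by their current
    # position (0 = most recent result) rather than by a register number.

    BELT_LENGTH = 8
    belt = deque(maxlen=BELT_LENGTH)   # the oldest values silently fall off

    def push(value):
        belt.appendleft(value)

    def b(pos):
        return belt[pos]               # a belt position, not a register name

    # Compute (3 + 4) * (3 - 4) without ever allocating a register:
    push(3)             # belt: [3]
    push(4)             # belt: [4, 3]
    push(b(1) + b(0))   # add -> belt: [7, 4, 3]
    push(b(2) - b(1))   # sub -> belt: [-1, 7, 4, 3]   (3 - 4)
    push(b(1) * b(0))   # mul -> belt: [-7, -1, 7, 4, 3]  (7 * -1)
    print(b(0))         # -7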


> "3. What is up with RAM? I saw nothing about memory; with lots of pipelines it is bound to be memory-bound."

There's more information about memory in the talk at http://ootbcomp.com/topic/memory/



