Modern Microprocessors – A 90-Minute Guide (2001-2016) (lighterra.com)
178 points by codesuki on May 2, 2021 | 41 comments



I gave a presentation on this last year for QCon London (before the lockdown), and afterwards virtually for the LJC, if people prefer to listen/watch videos.

https://speakerdeck.com/alblue/understanding-cpu-microarchit...


I couldn’t find the video at the link above, but I guess this is the previous one:

https://youtu.be/rglmJ6Xyj1c


It’s at the bottom of the description:

The presentation was recorded at the London Java Community meeting in April 2020, and a recording is available here: https://youtu.be/C4HEoBYL0yk


Thank you for this - really engaging presentation.


Previous discussions:

2018, 87 comments: https://news.ycombinator.com/item?id=18230383

2016, 12 comments: https://news.ycombinator.com/item?id=11116211

2014, 37 comments: https://news.ycombinator.com/item?id=7174513

2011, 30 comments: https://news.ycombinator.com/item?id=2428403

This link has also appeared in 9 comments on HN, featuring threads on "Computer Architecture for Network Engineers", "X86 versus other architectures" by Linus Torvalds, and "I don't know how CPUs work so I simulated one in code". One of those also recommends a Udacity course on how modern processors work (https://www.udacity.com/course/high-performance-computer-arc...): https://ampie.app/url-context?url=lighterra.com/papers/moder...

Jason also has a couple of other interesting articles on his website, like an intro to instruction scheduling and software pipelining (http://www.lighterra.com/papers/basicinstructionscheduling/) and one I liked a lot and agree with, "Exception Handling Considered Harmful" (http://www.lighterra.com/papers/exceptionsharmful/).


I would love a Bartosz Ciechanowski interactive article on microprocessors. It may be outside his domain, though, since the visualisations and demos would be less 3D model design and more, perhaps, mini simulations of data channels or state machines that you can play through. Registers whose initial values can be set, and then you can step through each clock cycle. Add a new component every few paragraphs and see how it all builds up. I did all this at university, but would love a refresher that is as well made as his other blog posts.


I have been toying with making a game which is Factorio-ish but literally for processors, so potentially watch out for that (I've got a job now, though...)


>"One of the most interesting members of the RISC-style x86 group was the Transmeta Crusoe processor, which translated x86 instructions into an internal VLIW form, rather than internal superscalar, and used software to do the translation at runtime, much like a Java virtual machine. This approach allowed the processor itself to be a simple VLIW, without the complex x86 decoding and register-renaming hardware of decoupled x86 designs, and without any superscalar dispatch or OOO logic either."

PDS: Why do I bet that the Transmeta Crusoe didn't suffer from Spectre -- or any of the other x86 cache-based or microcode-based security vulnerabilities that are so prevalent today?

Observation: Intentional hardware backdoors would have been difficult to place in Transmeta VLIW processors -- at least in the software-based x86 translation portions. Now, are there intentional hardware backdoors in its lower-level VLIW instructions?

I don't know and can't speculate on that...

Nor do I know whether the Transmeta Crusoes contained secret, deeply embedded "security" cores/processors...

But secret deeply embedded "security" cores/processors and backdoored VLIW instructions aside -- it would sure be hard as heck for the usual "powers-that-be" to create secret/undocumented x86 instructions with side effects or covert communication to lower/secret levels, and to run that code through the Transmeta Crusoe's x86 software interpreter/translator -- especially if the code for that interpreter/translator is open source and thoroughly reviewed...

In other words, from a pro-security perspective, there's a lot to be said for architecturally simpler CPUs -- regardless of how slow they might be compared to some of today's super-complex (and, ahem, less secure...) CPUs...
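
(To make the quoted translation mechanism concrete, here is a minimal C sketch of the translation-cache loop at the heart of any such dynamic binary translator. The translate_block/fake_block helpers are hypothetical stand-ins so the sketch runs, not Transmeta's actual Code Morphing Software:)

    #include <stdint.h>
    #include <stdio.h>

    #define TCACHE_SIZE 4096

    typedef void (*native_block)(void);  /* a translated block of host code */

    static struct { uint64_t guest_pc; native_block code; } tcache[TCACHE_SIZE];

    /* Stand-in for the real translator: it would decode the x86 block at
       guest_pc and emit native (e.g. VLIW) code. Here it returns a canned
       function so the sketch runs. */
    static void fake_block(void) { puts("executing translated block"); }

    static native_block translate_block(uint64_t guest_pc)
    {
        printf("translating guest block at 0x%llx\n",
               (unsigned long long)guest_pc);
        return fake_block;
    }

    /* Look up the guest PC; translate on a miss, then run the host code.
       A real system also handles invalidation, block chaining, and
       self-modifying code. */
    static void run_guest(uint64_t guest_pc)
    {
        uint64_t slot = (guest_pc >> 2) % TCACHE_SIZE;
        if (tcache[slot].code == NULL || tcache[slot].guest_pc != guest_pc) {
            tcache[slot].code = translate_block(guest_pc);  /* slow path */
            tcache[slot].guest_pc = guest_pc;
        }
        tcache[slot].code();  /* fast path: already translated */
    }

    int main(void)
    {
        run_guest(0x1000);  /* miss: translated first, then executed */
        run_guest(0x1000);  /* hit: executed immediately */
        return 0;
    }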


Any speculation with any shared hardware is a no-no from a security standpoint.

You can either separate your hardware (physically or virtually) or do no speculation.


AFAIK Transmeta's core was doing a lot of speculative execution stuff, so if it had evolved until today I wouldn't bet against it also gaining issues like Spectre.


It would be an interesting test, wouldn't it?

See if Transmeta Crusoe is vulnerable to Spectre?

But, even if it is... keep in mind that when running x86 instructions, you still have the x86 translation software proxy layer... that means you could grab any given offending / problem-causing x86 instruction when you encountered it, and recode its translation to emit a different set of native VLIW instructions that you knew were safe...

In other words, with a Transmeta Crusoe -- if the x86 translation layer is open source and you possess it (and can code / understand things) -- then you'll have some options there.

Which is unlike a regular x86 CPU, where the way it decodes and executes instructions cannot be changed in any way by the user...


Transmeta Crusoe is unlikely to be vulnerable, but its Intel contemporaries aren't either. So you'd need to be looking at some hypothetical "today" version.

An open system with such a design would indeed be fascinating (the original wasn't open, and Transmeta was big on their patents on this stuff). More flexibility than microcode patches too.


Elbrus 2000 is a mass-produced VLIW microprocessor with binary translation of x86. There is a range of models with different numbers of cores and DSP blocks.

https://en.wikipedia.org/wiki/Elbrus_2000


>An open system with such a design would indeed be fascinating

Agreed completely!

>More flexibility than microcode patches too

Agreed completely!


The code morphing software itself is almost certainly a source of new Spectre-like side-channel attacks -- like being able to tell whether something is already in the translation cache via timing.
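
(The basic probe is just timing the same call twice and comparing: a large gap suggests the first run paid for translation and the second hit the cache. A rough, hypothetical C sketch -- target() stands in for whatever guest code is being probed, and real attacks need far more careful measurement:)

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    static volatile int sink;
    static void target(void) { sink++; }  /* hypothetical code being probed */

    /* Wall-clock nanoseconds taken by one call to fn. */
    static long ns_elapsed(void (*fn)(void))
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) * 1000000000L
             + (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void)
    {
        /* On a code-morphing CPU, the first call would pay for translation;
           later calls hit the translation cache and run faster. */
        printf("first run:  %ld ns\n", ns_elapsed(target));
        printf("second run: %ld ns\n", ns_elapsed(target));
        return 0;
    }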


It would be much easier for a vendor to turn off speculative compilation in that case, though -- e.g. for HPC you want all the performance, but a cloud vendor could still protect vulnerable interfaces.


The compilation doesn't need to be speculative. It'd be a lot like the micro-op cache attack that was on the front page not too long ago.

The point is to leak privileged code flow.


Well, again, if the x86 code morphing software is available and open source, and if someone understands this software and can modify it -- then that's infinitely better (from a security perspective) than having to run x86 code directly on a regular AMD/Intel x86 processor...

In the latter case -- you have absolutely no control whatsoever over how the processor interprets and dispatches its x86 instructions...


Oh totally. I'm more sure than not that something like open-source code morphing software could be made more secure against side-channel attacks, with greater flexibility than is afforded by microcode updates.

I'm mainly saying that it's a problem space that has actively shipping implementations (Nvidia Denver), has new levels of cache which affect performance based on previous code flows, and hasn't been fully explored publicly AFAIK. There are probably some dragons in there, at least in the pre-Spectre versions of that software.


You can just try attacking Apple's M1 since the technologies are quite similar.


If you're talking about Rosetta, it's not as clear that you'd see any successful attacks. It only runs at one CPU privilege level. And even in the browser-sandbox-escape scenario, Rosetta heavily uses AOT translation when it can, so your JS is probably not sharing a translation cache with much, if any, of the code you'd be attacking.

This is in contrast to Transmeta where the whole system more or less ran out of the one translation cache.

Now Nvidia Denver on the other hand...


Some consider the E2K (Elbrus 2000) a successor to the Transmeta Crusoe, and it is not affected by the Spectre issue. It does binary translation of x86 code as well, and quite fast (around a 20% performance loss).


Fascinating! Did not know that Elbrus continued on the same technology path as Transmeta!

Here, let's put a link for posterity:

https://en.wikipedia.org/wiki/Elbrus_2000

PDS: Opinion: Transmeta/Elbrus/VLIW designs in general -- worthy of future study...


Isn't the material a little bit old? I remember reading about all this stuff at university circa 1996.

Edit: originally said "outdated".


I wouldn't say outdated, but these ideas have a long history. Superscalar processors go back to Cray's CDC 6600 in 1964. Cache memory was used in the IBM System/360 Model 85 in 1969. Pipelining was used in the ILLIAC II (1962) and probably earlier. Branch prediction was used in the IBM Stretch (1961). Out-of-order execution and register renaming were implemented in the IBM System/360 Model 91 (1966).

It's interesting to see how many CPU architecture ideas that we consider modern were first developed in the 1960s, and how long they took to make their way into microprocessors.


That's of course true, but it might be misleading. OoO didn't take off until HPS came up with the reorder buffer (which enabled precise exceptions), with the Pentium Pro being the first (and highly successful) implementation. Also, branch prediction has dramatically improved since Stretch: two-level prediction (usually credited to Yeh and Patt) was a breakthrough, and the current state of the art is Seznec's TAGE-SC.

My point is that there has still been a lot of advancement since then.
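
(For a flavour of what "two-level" means, here's a minimal gshare-style sketch in C: the global branch history XORed with the branch address indexes a table of 2-bit saturating counters. TAGE-SC is far more elaborate, but indexing by history is still the core idea:)

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PHT_BITS 12
    #define PHT_SIZE (1u << PHT_BITS)

    static uint8_t  pht[PHT_SIZE];  /* 2-bit saturating counters */
    static uint32_t ghist;          /* global branch history register */

    /* Predict: history XOR branch address picks a counter; >= 2 means taken. */
    static bool predict(uint32_t pc)
    {
        return pht[(pc ^ ghist) & (PHT_SIZE - 1)] >= 2;
    }

    /* Update once the branch resolves: nudge the counter, shift the history. */
    static void update(uint32_t pc, bool taken)
    {
        uint32_t idx = (pc ^ ghist) & (PHT_SIZE - 1);
        if (taken  && pht[idx] < 3) pht[idx]++;
        if (!taken && pht[idx] > 0) pht[idx]--;
        ghist = (ghist << 1) | (taken ? 1 : 0);
    }

    int main(void)
    {
        /* A loop branch at pc 0x400: taken 7 times, then falls through.
           With history, even the final not-taken becomes learnable. */
        int correct = 0, total = 0;
        for (int rep = 0; rep < 100; rep++)
            for (int i = 0; i < 8; i++) {
                bool taken = (i != 7);
                correct += (predict(0x400) == taken);
                total++;
                update(0x400, taken);
            }
        printf("accuracy: %d/%d\n", correct, total);
        return 0;
    }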


You are right; "outdated" would imply not useful, which is obviously not true. But all of this stuff was in my copy of Hennessy & Patterson from those days, so it is not exactly new (because I am that old!).


Ahem, what HAS changed since then? Besides new models, more up-to-date "MHz" values, and some tables of performance figures, nothing that is of interest in a compressed introduction to the topic. So, personally, what would you have added to the article?


Well, I wouldn't know, that's why I clicked on the article :P

I said "outdated" in a wrong way, because this means is not valid today, which is obviously not the case.


Old, but still relevant for what it is discussing.

The wiki page on x86 has a pretty good summary of when the various features were added: https://en.wikipedia.org/wiki/X86


I was curious about the following comment on SMT in the post:

>"From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the "execution state" of each thread – things like the program counter, the architecturally-visible registers (but not the rename registers), the memory mappings held in the TLB, and so on. Luckily, these parts only constitute a tiny fraction of the overall processor's hardware."

Is each "SMT core" then just one additional PC and TLB then? I'm not sure if "SMT core" is the correct term or just "SMT" is but it seems like generally with Hyper Threading there is generally 1 hyper thread available for each core effectively doubling the total core count. It seems like it's been that way for a very long time. Is there not much benefit beyond offering single hyper thread/SMT for each core? Or is just prohibitively expensive?


IBM POWER8 has 8-way SMT, and POWER9 has 4-way and 8-way models.


I had a question about the following passage:

>"The key question is how the processor should make the guess. Two alternatives spring to mind. First, the compiler might be able to mark the branch to tell the processor which way to go. This is called static branch prediction. It would be ideal if there was a bit in the instruction format in which to encode the prediction, but for older architectures this is not an option, so a convention can be used instead, such as backward branches are predicted to be taken while forward branches are predicted not-taken.

Could someone give the definition of a "backward" vs. a "forward" branch? Is backward when the loop continues, and forward a jump out of or return from a loop?

Also, are there any examples of CPU architectures with static branch prediction?
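
(On the first question: "backward" means the branch target address is lower than the address of the branch instruction itself -- the shape of a loop's back edge, hence predicted taken -- while "forward" means the target address is higher, like skipping over an if body, hence predicted not-taken. A small C illustration, with the rough compiled-branch shape in comments:)

    #include <stdio.h>

    /* Backward branch: the loop's closing test jumps back up to the loop
       body (target address < branch address), so "predict taken" is right
       for every iteration except the last. */
    long sum(const int *a, int n)
    {
        long s = 0;
        for (int i = 0; i < n; i++)  /* compiles to: ...; cmp; jl LOOP_TOP */
            s += a[i];
        return s;
    }

    /* Forward branch: the error check jumps forward, past the common code,
       to a rare out-of-line path (target address > branch address), so
       "predict not-taken" is right almost always. */
    long checked_sum(const int *a, int n)
    {
        if (a == 0)                  /* compiles to: test; je ERROR_PATH */
            return -1;               /* rare, usually placed after hot code */
        return sum(a, n);
    }

    int main(void)
    {
        int a[4] = {1, 2, 3, 4};
        printf("%ld %ld\n", sum(a, 4), checked_sum(a, 4));
        return 0;
    }

(As for real examples of static prediction: PowerPC's conditional branch encoding includes a static prediction hint bit, and several early RISCs used exactly the backward-taken/forward-not-taken convention as a fallback.)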


I would love an update that covers recent developments in SoC integration, for example bringing RAM on-package and neural processing on-die in the M1 chip.


These kinds of optimizations always make me wonder whether they are worth it. Might it be more efficient to spend those transistors on more, simpler cores instead? Perhaps the fact that most problems are so sequential makes timing/clock-rate optimizations inevitable.


I would like to see a superscalar OoO CPU with a RISC-V ISA. Since RISC-V cores tend to be very small, I would expect to see a CPU with hundreds of cores.


You're probably describing a GPU


How old does something have to be before it can no longer be called modern? I think 5 years is a stretch.


Great post, last updated in 2016. That might be helpful context for the architectures and processors discussed.


Most info is from 2001; only about 20% (some graphs, tables, and descriptions) was updated in 2016.


Good read.

It must be noted this article discusses the von Neumann architecture and similar designs (Harvard).



