
It is fascinating that semantic confusion over RISC vs CISC persists since I was in college in the 80's. It is largely meaningless.

The naive idea behind RISC is essentially to reduce the ISA to near-register-level operations: load, store, add, subtract, compare, branch. This is great for two things: being the first person to invent an ISA, and teaching computer engineering.

Look at the evolution of RISC-V. The intent was to build an open source ISA from a 100% clean slate, using the world's academic computer engineering brains (and corporations that wanted to be free of Arm licensing) ... and a lot of the subtext was initially around ISA purity.

Look at the ISA today, specifically the RISC-V extensions that have been ratified. It has a soup of wacky opcodes to optimize corner cases, and obscure vendor specific extensions that are absolutely CISC-y (examine T-Head's additions if you don't believe me!).

Ultimately the combination of ISA, implementation (the CPU), and compiler struggle to provide optimal solutions for the majority of applications. This inevitably leads to a complex instruction set computer. Put enough engineers on the optimization problem and that's what happens. It is not a good or bad thing, it just IS.




To be fair, RISC-V has a small base, RV64I in the 64-bit case. These bases are small, reduced and frozen. But after that, yes, the extensions get whacky. L is Decimal Floating Point, still marked Open. I'm not sure what's reduced about that. But extensions are optional.

About the history of RISC, the basic idea dates to Seymour Cray's 1964 CDC 6600. I don't think Berkeley gives Cray enough credit.


Patterson and Waterman detail exactly what they were thinking during the design of RISC-V in The RISC-V Reader, and Cray is mentioned in multiple places.

https://www.goodreads.com/en/book/show/36604301


Cray gets mentioned in the Reader 3 times: the simplicity quote, the pioneer quote, and the timeline comparison with the Illiac IV (p. 80). Waterman's thesis does actually give some credit:

  The CDC 6600 [95] and Cray-1 [82] ISAs, in many respects the precursors to RISC, each had two lengths of instruction, albeit without the redundancy property of the Stretch.
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...


IEEE 754 specifies not only floating point, but decimal floating point too. You actually find hardware implementations on some systems (notably IBM POWER).

Better to reserve it and not finish it than to be unprepared.

In any case, decimal floating point is better for MOST programs/programmers and is only inferior in being harder to implement (imo).


The usual RISC-V FUD points. It gets boring.

>It has a soup of wacky opcodes to optimize corner cases

OK, go ahead and name one (1) such opcode. I'll wait.

>obscure vendor specific extensions that are absolutely CISC-y (examine T-Head's additions if you don't believe me!).

Yes, these extensions are harmful, and that's why they're obscure and vendor-specific.

RISC-V considers pros and cons, evaluates across use cases, and weights everything when considering whether to accept something into the standard specs.

Simplicity itself is valuable; that is at the core of RISC. So the default is to reject. A strong argument needs to be made to justify adding anything.


The RISC-V ISA is very inconsistent. For example, for addition with checked overflow the spec says that there is no need for such an instruction, as it can be implemented "cheaply" in four instructions. But at the same time they have fused multiply-add, which is only needed for matrix multiplication (i.e. only for scientific software), which is difficult to implement (it needs to read 3 registers at once), and which can be easily replaced with two separate instructions.


Fused floating point multiply-add with a single rounding from the infinite-precision answer is required by the IEEE 754-2008 floating point standard.

You don't get a choice in the matter.

> can be easily replaced with two separate instructions

It can't. You will get different answers.
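
A minimal C sketch of the difference, using fma() from <math.h> (link with -lm on many systems); the operands are just one pair chosen so that the intermediate rounding matters:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      double a = 1.0 + 0x1p-27;       /* 1 + 2^-27 */
      double b = 1.0 + 0x1p-27;
      double c = -(1.0 + 0x1p-26);    /* cancels the leading terms */

      double separate = a * b + c;    /* two roundings: the 2^-54 term is lost */
      double fused    = fma(a, b, c); /* one rounding of the exact product-sum */

      printf("separate = %a\nfused    = %a\n", separate, fused);
      /* prints 0x0p+0 for separate, 0x1p-54 for fused */
      return 0;
  }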

RISC-V allows you to choose a CPU without floating point instructions. But if you choose to have an FPU then you get multiply-add. Yes, it needs to read three registers, which is expensive. It is also the most common instruction in any floating point calculation, so that expensive three-port register file gets used constantly.

Checking overflow for addition on the other hand is something that is very seldom used (on any CPU). On RISC-V you need four instructions only if the operands are full register size and you don't know anything about either operand. If you know the sign of one operand then the cost reduces to one extra instruction.


> Checking overflow for addition on the other hand is something that is very seldom used (on any CPU)

I think a lot of that is due to the popularity of C, and the fact that C has no built-in support for overflow checking. In some alternate timeline in which C had that feature (or a different language which had that feature took C's place), I suspect it would have been used a fair bit more often.

Well, C23 finally adds checked arithmetic, in <stdckdint.h>. But it took until 2023 to do it; what if it had been there 20, 30, or 40 years ago? Very little software supports it yet anyway.

And it isn't using the same syntax as standard arithmetic. Instead of `c = a + b`, you have to do `ckd_add(&c, a, b)`. That isn't going to encourage people to use it.
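
For reference, a minimal C23 example of that interface (requires a compiler and library that ship <stdckdint.h>):

  #include <stdckdint.h>   /* C23 checked integer arithmetic */
  #include <limits.h>
  #include <stdio.h>

  int main(void) {
      int a = INT_MAX, b = 1, c;

      /* ckd_add returns true if the mathematical result did not fit in c */
      if (ckd_add(&c, a, b)) {
          puts("overflow detected");
      } else {
          printf("c = %d\n", c);
      }
      return 0;
  }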


Yes, the ugliness of the syntax for checked addition outweighs benefits like better accuracy and security from using it.


> Checking overflow for addition on the other hand is something that is very seldom used (on any CPU).

I believe you are wrong. Almost every addition used in software should be a proper addition, not addition modulo 2^32 or 2^64. For example, if you want to calculate the total for a customer's order, you want proper addition, not a remainder after division by 2^32. Or if you are calculating the number of visitors to a website, again, you need the correct number.

Addition modulo N is used only in niche cases like cryptography.

In my opinion, it is wrapping addition which is seldom used. I can't remember the last time I needed it. So it is surprising that RISC-V makes the rarely used operation take fewer instructions than the more popular one.

You might argue that some poorly designed ancient languages have a '+' operator that performs wrapping addition; however, that is only because they are poorly designed, not because users want that kind of addition. For comparison, a properly designed language, Swift, has non-wrapping addition.


> Addition modulo N is used only in niche cases like cryptography.

... and niche cases like GIS location computations on a spherical planet and any form of engineering | physics mesh computation.

It's inescapable when you peer into the guts of any earth-based modelling from satellite data ... something that happens on a daily terabyte scale.


Even if there are some more cases, the basic point is still true: The majority of additions don't warrant word-size wrapping and it has been a source of many, many bugs.


Which is almost never wrapping at a power of two, unless you scale your coordinate system to optimise it.


The quote I responded to was:

> Addition modulo N is used only in niche cases ...

and not modulo 2^N.

The point that I would make is that general purpose CPUs doing general everyday tasks such as https connections, secure transfers, GIS mapping, etc. are doing a great many more modulo operations than acknowledged above.


> Checking overflow for addition on the other hand is something that is very seldom used

Arithmetic on numbers larger than your word size. Poster child: crypto. It's 2023 and crypto is not rare. This post cannot get from me to you without crypto in the datapath.


Also, here is a list of languages where '+' operator doesn't overflow: PHP, Python, Javascript and Swift. JS doesn't even have wrapping addition, and nobody seems to be unhappy about that.


Python has arbitrary-precision integers. However, it runs on machines with a fixed word size. This means internally it has to perform all the usual tasks involved in arithmetic on numbers larger than the machine word size, just like back in the 70s: overflow checking, explicit propagation of carries, taking great care with magnitudes of inputs, etc., all over the dang place. Like, seriously, everywhere. Take a look: https://github.com/python/cpython/blob/main/Objects/longobje...

I certainly hope for at least some of that code the compiler ends up making use of adc and family, otherwise it's gonna be utterly miserable. It's great that the language is hiding that complexity from the programmer, but it's a big mistake to imply that this means it does not happen at all.
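
A rough sketch (not CPython's actual code, which is far more involved) of what multi-word addition looks like when the language gives you no carry flag: every limb has to detect carry-out with extra comparisons.

  #include <stdint.h>
  #include <stddef.h>

  /* Add two n-limb numbers; returns the final carry out. */
  uint64_t bignum_add(uint64_t *dst, const uint64_t *a,
                      const uint64_t *b, size_t n) {
      uint64_t carry = 0;
      for (size_t i = 0; i < n; i++) {
          uint64_t s = a[i] + carry;
          uint64_t c1 = (s < carry);      /* carry out of a[i] + carry */
          dst[i] = s + b[i];
          uint64_t c2 = (dst[i] < b[i]);  /* carry out of s + b[i] */
          carry = c1 | c2;                /* at most one of them is set */
      }
      return carry;                       /* would become a new top limb */
  }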

Javascript stores its "integers" in the mantissa of a double-precision float and makes no attempt whatsoever to hide the implications of that decision from the unsuspecting developer; and good grief, that leads to an incredible amount of pain and suffering.


JS doesn't even have integers. The only numeric type is double precision floating point. If you get to 2^53 then you start to lose LSBs -- that's a lower limit than integers wrapping on a 64 bit ISA (at 2^63 or 2^64).
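
Since a JS number is just an IEEE double, the effect is easy to reproduce in C:

  #include <stdio.h>

  int main(void) {
      double d = 9007199254740992.0;   /* 2^53 */
      printf("%.0f\n", d);             /* 9007199254740992 */
      printf("%.0f\n", d + 1.0);       /* still 9007199254740992: the LSB is lost */
      printf("%.0f\n", d + 2.0);       /* 9007199254740994 */
      return 0;
  }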


I’d be bored too if I showed up in RISC-V threads to read “FUD” into any comment that doesn’t immediately praise every aspect of the architecture.




You're missing the point. You said "The ISA [and ratified extensions] has a soup of wacky opcodes". You're being asked to name a single wacky opcode from the ISA and ratified extensions. The response mentioned only vendor-specific extensions having bad opcodes.


The reply is an obvious troll. Whenever someone yawns at your post, dismisses it, and then decides to be the arbiter of what constitutes a valid argument (classic fallacy), walk away. Proof? Others did reply and OP just pisses on them verbally. If you don't know how to read the ratified extensions which OP admitted were harmful, just walk away from the topic.


As an interested observer, I would like to know more about what you're talking about - it's not trivial to find the T-Head RISC-V extensions or identify which ratified extensions are 'wacky opcodes' or related. I agree that the OP has a biased view in favour of RISC-V, but it's also natural to react strongly to attacks like "soup of wacky opcodes".

I can see that RISC-V has made some short-sighted decisions (like having no carry flag or equivalent), and has a good few ratified extensions. But how does it compare to other architectures? I think everyone would agree that x86/x64 is a soup of wacky opcodes; is modern ARM(+SVE) better than RISC-V in this regard?


Why is having no carry flag "short-sighted"? Rather, it is dropping the unneeded historical baggage.

A carry flag was useful on an 8 bit CPU where you were constantly dealing with 16 bit and 32 bit numbers. On a 64 bit CPU it is almost never used, and updating it on every add is an absolute waste of resources.


Having integer operations wrap by default is a significant source of security issues, so having no other kind of addition and making it even harder to spot when it happens is not what my choice would be (I'd have both flags and trapping as options, at least). I should choose more charitable language.

I'd suggest that the unneeded historical baggage is default-wrapping integer operations, rather than carry flags. Back then, CPUs needed the simplicity.


My original post was about the futility of debating RISC vs CISC, supporting OP. The idea is that RISC has very few instructions, all near register-level and single-cycle: load/store/add/subtract/branch, etc. But in reality there is a tendency for these instructions to become more tied to the hardware, compiler, or application; hundreds of opcodes that perform multiple RISC operations per opcode are very un-RISC-y. Same with multi-cycle operations that are tied to specific hardware, thus not "pure". Same goes for applications. (The classic example is the Arm "RISC" instruction called "FJCVTZS", or "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero". There's an entire HN thread on this from years ago.)

The crux of my argument is the review and ratification of extensions, enabled by compiler switches, that add lots of functionality that is less about core compute and more about APPLICATIONS and HARDWARE. And that's where things get CISC-y. Hence the futility of comparing RISC v. CISC: if the clean-slate RISC-V project runs into this, stop arguing the legend/myth because it is a waste of energy.

My use of the term "wacky" was a poor choice. My problem with the first reply is that they insist that isn't the case, sneer at me, and say "tell me what you think is wrong and I'll tell you why you are wrong..." That's flamebait, because there are several other replies that the poster brushes off with more flamebait.

HARDWARE:

RISC-V has extensibility in the ISA, and T-Head added instructions that require an ASIC. So now we have an ISA+ that very clearly is hardware dependent. Should RISC-V really have entertained Alibaba's hardware-specific SIMD instructions as part of the standard, even if they are enabled with compiler switches? That's a question that will have consequences. RISC-V's biggest market is by far China, so maybe it makes sense? But these opcodes are "wacky" (ugh) in the sense that they require T-Head hardware, and are very CISC-y.

I'll stop picking on T-Head. Consider the extensions for non-temporal instructions based on memory hierarchies. This is incredibly hardware specific. But of course, there will always be memory hierarchies regardless of von Neumann vs. Harvard designs. And they can be left out, of course. But still they will only apply to specific implementations. Much like machines that don't have FPUs cannot make use of FP opcodes, machines without hierarchies cannot make use of non-temporal instructions. So do FPU instructions not belong in the ISA? Of course not.

APPLICATIONS:

Does an ISA need vector crypto? Well, it is an extension, so it can be turned off, but AES could easily become post-quantum obsolete. So why bloat the ISA? Sqrt/Sin/Cos will never become obsolete, but AES might.

Even security and privilege levels. Hypervisor extensions, and execution + code isolation extensions force a particular way of doing things on an ISA.

MY DARNED POINT

To recap: I may have made the biggest strawman of all time, but it is based on what I keep hearing. It is easy to wave away all four of my examples if you think RISC-V-can-do-no-wrong. But that misses my point: ISAs are complicated, and when you have hundreds of instructions that do complex things in multiple cycles with lots of side effects that are required by certain hardware or certain applications ... you no longer have a reduced instruction set computer. Even when the best minds start from a clean slate to create RISC they end up with CISC.

Which is why I think the debate is bunk and wastes everyone's time. It was entirely the product of 1990's marketing against Intel and still plagues us today.


AES, particularly AES-256, is considered quantum resistant/safe. AES-128 could have its keysize effectively reduced to 64 bits with Grover's Algorithm (AES-256 to 128 bits), but scaling quantum computers to do that brute force search is not as straightforward as building a bunch of FPGAs or ASICs.


RISC-V is copying wrong decisions made decades ago. Against any common sense it doesn't trap on overflow in arithmetic operations, and silently wraps the number around, producing an incorrect result. Furthermore, it does not provide an overflow flag (or any alternative), so it is difficult to implement addition of 256-bit numbers, for example.


It doesn't trap because trapping means you need to track the possibility of a branch at every single arithmetic operation. It doesn't have a flag so flag renaming isn't needed: you can get the overflow from a CMP instruction, and macro-op fusion should just work.


> you need to track the possibility of a branch at every single arithmetic operation

Every memory access can cause a trap, but CPUs seem to have no problem about it. The branch is very unlikely and can always be predicted as "not taken".


Hell, with non-maskable interrupts, any instruction can cause a trap!


Not even that - instruction fetch can cause a page fault. When an NMI happens, the CPU still has the choice of when to service it. If it needs to flush the pipeline, it might as well retire the instructions up to the first store.


Managing memory coherency is probably the single hardest part to design in any given CPU. Why add even more hard things (especially if they can interact and add even more complexity on top)?

Get rid of what complexity you can then deal with the rest that you must have.


Coherency is very hard but it's not what causes traps from accessing memory. That part is a relatively simple permission check.


I love how each and every criticism on RISC-V's decisions ignores the rationale behind them.

Yes, that idea was evaluated, weighted and discarded as harmful, and the details are referenced in the spec itself.


I tried searching the spec [1] for "overflow" and here is what it says on page 17:

> We did not include special instruction-set support for overflow checks on integer arithmetic operations in the base instruction set, as many overflow checks can be cheaply implemented using RISC-V branches.

> For general signed addition, three additional instructions after the addition are required

Is this "cheap", replacing 1 instruction with four? According to some old mainframe era research (cannot find link now), addition is one of the most often used instructions and they suggest that we should replace each instruction with four?
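
For the record, this is roughly what the flag-less approach looks like in portable C (the spec's own sequence compares signs after the add instead, but the cost is comparable):

  #include <limits.h>
  #include <stdbool.h>

  /* Pre-check whether a + b would overflow, using only compares and
   * branches -- no overflow flag required. */
  bool add_would_overflow(int a, int b) {
      if (b > 0) return a > INT_MAX - b;
      if (b < 0) return a < INT_MIN - b;
      return false;
  }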

Their "rationale" is not rational at all. It doesn't make sense.

Overflow check should be free (no additional instructions required), otherwise we will see the same story we have seen for last 50 years: compiler writers do not want to implement checks because they are expensive; language designers do not want to use proper arithmetic because it is expensive. And CPU designers do not want to implement traps because no language needs them. As a result, there will be errors and vulnerabilities. A vicious circle.

What also surprises me is that they added fused add-multiply instruction which can easily be replaced by 2 separate instructions, is not really needed in most applications (like a web browser), and is difficult to implement (if I am not mistaken, you need to read 3 registers instead of 2, which might require additional ports in register file only for this useless instruction).

[1] https://github.com/riscv/riscv-isa-manual/releases/download/...


So you are criticising RISC-V not compared to its actual x86 and Arm competition -- where overflow checking is also not free and is seldom used -- but against some imaginary ideal CPU that doesn't exist or no one uses because it's so slow.


> So you are criticising RISC-V not compared to its actual x86 and Arm competition -- where overflow checking is also not free and is seldom used

How do people do overflow checking on x86 and ARM in practice? For languages which implement it, such as Rust or Ada?

I know 32-bit x86 has the INTO instruction, which raises interrupt 4 if the overflow flag (OF) is set – but it was removed in x86-64, which gives me the impression that even languages which did do checked arithmetic weren't using it.

> but against some imaginary ideal CPU that doesn't exist

I'm not the person you are responding to, but to try to read their argument charitably (to "steelman" it) – if a person thinks checked arithmetic is an important feature, RISC-V's decision not to include it could be seen as a missed opportunity.

> or no one uses because it's so slow.

Is it inherently slow? Or is it just the chicken-egg problem of hardware designers feel no motivation to make it fast because software doesn't use it, meanwhile software doesn't use it because the hardware doesn't make it fast enough?


> How do people do overflow checking on x86 and ARM in practice? For languages which implement it, such as Rust or Ada?

> I know 32-bit x86 has the INTO instruction, which raises interrupt 4 if the overflow flag (OF) is set – but it was removed in x86-64, which gives me the impression that even languages which did do checked arithmetic weren't using it.

Languages still use the overflow flag, they just don't use interrupts. I'm most familiar with Rust, where if the program wants a boolean value representing overflow (e.g., with checked_* or overflowing_* operations), LLVM obtains that value using a SETO or SETNO instruction following the arithmetic operation. If the program just wants to branch on the result of overflow, LLVM performs it using a JO or JNO instruction. Overflow checks that crash the program (e.g., in debug builds) are implemented as an ordinary branch that calls into the panic handler.
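
The same machinery is reachable from C via a GCC/Clang builtin; on x86-64 the overflow result typically comes from those same flag-based instructions. A small sketch, with the trap standing in for the debug-build panic described above:

  #include <stdio.h>

  int checked_add(int a, int b) {
      int sum;
      if (__builtin_add_overflow(a, b, &sum)) {
          fprintf(stderr, "overflow in checked_add\n");
          __builtin_trap();
      }
      return sum;
  }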


> So you are criticising RISC-V not compared to its actual x86 and Arm competition -- where overflow checking is also not free

Do you suggest we should carry on bad design decisions made in the past? x86 is an exhibition of bad choices and I don't think we need to copy them.

> and is seldom used

I believe it is not like this. I think that in most cases you need non-wrapping addition, for example, if you are calculating totals for a customer's order, counting number of visits for a website, or calculating loads in a concrete beam.

Actually, wrapping addition is the one that is seldom used, in niche areas like cryptography. So it surprises me that the kind of addition that is used more often (non-wrapping) requires more instructions than exotic wrapping addition. What the CPU designers were thinking, I fail to understand.


You can't solve all the world's problems in one step. RISC-V solves a number of important problems, while making it as easy as practical to run existing software quickly on it.

If you want to have checked arithmetic, RISC-V's openness allows you to make a custom extension, implement hardware for it (FPGA is cheap and easy), implement software support and demonstrate the claimed benefits, and encourage others to also implement your extension, or standardise it.

It is simply not possible to do this in the x86 or Arm worlds. And that is one of the problems RISC-V solves -- a meta problem, not some one individual pet problem, but potentially all of them.


I agree that wrapping is a bad default, but I can provide some rationale.

If you do wrapped addition without flags, you have one self-contained instruction that even covers signed and unsigned integers. If you want other behaviour, you then have to specialize for signed or unsigned, specialize for the choice of wrap/trap/flag, and make those traps and flags work nicely with whatever other traps or flags you might have.

So, yeah, if you want the simplest possible thing, driven by some decision other than the best outcomes for software in general, then you would choose wrapping addition without flags or traps.


This seems like an oversimplification of how these things work. Every architecture is going to provide a way to do wrapping arithmetic. You seem to also want that there be dedicated instructions to check for overflow. Some architectures have this! But what happens in practice is that people are smarter than this and recognize that the number of instructions emitted is irrelevant if some of them are inherently slower than others. Compilers emit lea on x86-64 these days to save ports and you think they’ll use your faulting add that takes an extra cycle? Definitely not.

Anyways, this game is going to really end up won by people higher in the stack paying the price for bounds checks and including them no matter what, because not having them is not tenable for their usecase. This drives processor manufacturers to make these checks more efficient which they have been doing for many years.


> Compilers emit lea on x86-64 these days to save ports and you think they’ll use your faulting add that takes an extra cycle? Definitely not.

"Faulting" addition should be as fast as wrapping addition and take a single instruction. Yes, I want hardware-accelerated overflow checking because it leads to more accurate results and prevents security vulnerabilities.

By the way, I want FPU operations to cause traps too (when getting infinity or NaN).
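
For context, ISO C today only exposes the IEEE exceptions as sticky status flags that you poll via <fenv.h>; actual trapping is left to platform extensions (e.g. glibc's feenableexcept). A tiny sketch of the status quo:

  #include <fenv.h>
  #include <stdio.h>

  int main(void) {
      feclearexcept(FE_ALL_EXCEPT);
      volatile double x = 1e308;          /* volatile: keep the multiply at runtime */
      x = x * 10.0;                       /* overflows to +inf */
      if (fetestexcept(FE_OVERFLOW))
          puts("FE_OVERFLOW is set (no trap was taken)");
      printf("x = %g\n", x);
      return 0;
  }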


But there’s inherently more work. You need to keep track of some extra state and when the overflow actually occurs you need to unwind processor state and deliver the exception. You can make this cheap but it definitely cannot be free. From the words you’re using I feel like you have a model in your head that if you can just encode something into an instruction it’s now fast and that instructions are the way we measure how “fast” something is, but that’s not true. Modern processors can retire multiple additions per cycle. What this will probably look like is both of them are single instructions and one of them has a throughput of 4/cycle and the other one will be 3/cycle and compiler authors will pick the former every time.


> Modern processors can retire multiple additions per cycle.

Then add multiple overflow checking units.

> one of them has a throughput of 4/cycle and the other one will be 3/cycle and compiler authors will pick the former every time.

Currently on RISC-V checked addition requires 4 dependent instructions, so its throughput is about 1 addition/cycle.


> Then add multiple overflow checking units.

You can't.

With your favoured ISA style you can't just put 4 or 8 checked overflow add instructions in a row and run them all in parallel because they all write to the same condition code flag. You have to put conditional branches between them.

Or, if you want an overflowing add to trap then you can't do anything critical in the following instructions until you know whether the first one traps or not, e.g. if the instructions are like "add r1,(r0)+; add r2,(r0)+; add r3,(r0)+; add r4,(r0)+". In this example you can't write back the updated r0 value until you know whether the instruction traps or not. Even worse if you reverse the operands and have a RMW instruction.


> With your favoured ISA style you can't just put 4 or 8 checked overflow add instructions in a row and run them all in parallel because they all write to the same condition code flag.

This can be implemented using traps, without flags. And RISC-V supports delayed exceptions, which makes the implementation easier.

> In this example you can't write back the updated r0 value until you know whether the instruction traps of not.

RISC-V supports delayed exceptions, so you actually can.


Comparisons of code compiled for x86 or RISC-V show that (on average), the RISC-V code is significantly smaller.

Any code size increases are made up for elsewhere and they STILL get smaller code too.


And, amusingly, the instruction count is also very competitive, especially inside loops.

Furthermore, it achieves all of that with a much simpler ISA that matches x86 and Arm in features, while having an order of magnitude fewer instructions to implement and verify.


Compiler output is not a good way to show off the best of an ISA (which is more an indictment of how bad compilers actually are at optimising for code density). Look at the demoscene. x86 can be an order of magnitude denser than lame compiler output.

RISC-V wasn't around when this paper was written, but it's close enough to MIPS to disprove the claim that "RISC-V code is significantly smaller": https://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_den...


>Averages over large bodies of code do not matter

>Compiler output does not matter

>1987 paper

>RISC-V encoding "close enough to MIPS"

>disprove the claim that "RISC-V code is significantly smaller"

F for effort.


>1987 paper

Did you even look at the link?

Neither shilling nor trolling is welcome here. Is there a relationship you haven't disclosed with RISC-V?


>Did you even look at the link?

Yes, I did.

>RISC-V encoding "close enough to MIPS"

... while pointing at MIPS-X, 1987. Deranged.

>Is there a relationship you haven't disclosed with RISC-V?

Are you projecting? I have noticed a pattern in your appearances whenever a discussion about RISC-V pops up.


Bad decisions are evaluated & weighed, and often documented. I love the assumption that the RISC-V team is both infallible and immune to bias.


Great. Then confront the rationale, instead of dismissing it or pretending it is not there.


There is an iron triangle operating on ISA design. I would propose the vertices are complexity, performance, and memory model. The ideal ISA has high performance, a strong memory model, and a simple instruction set, but it cannot exist.


Define high performance.

Also, define strong memory model. This is the first time I have heard a memory model described as strong.

And, finally, define what is "simple" in an instruction set.


Strong and weak as terms to describe memory models is very common, the standard RISC-V memory model is called "weak memory ordering" after all :)


Strong and weak are not properties of a memory model, but of the memory access ordering within a memory model.

They are common in descriptions of memory access orderings, but not of the memory model itself.


Strong memory ordering is convenient for the programmer.

But it is a no-go for SMP scalability.

That's why most architectures today use weak ordering.

x86 is alone and a dinosaur.


I'm curious to hear which problems in particular you are thinking of that make it a no-go. The strong model has challenges, but I am not aware of any total showstoppers.

x86 has also illustrated the triangle, garnering some weakly ordered benefits with examples like avx512 and enhanced rep movsb.

The interesting thing is both solutions (weak ordering, special instructions) have been largely left to the compiler to manage, so it could become a question of which the compiler is better able to leverage. For example, if people are comfortable programming MP code in C on a strong memory model but reach for python on a weak memory model, things could shake out differently than expected.
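
To make the weak-vs-strong distinction concrete: under a weak model the ordering has to be stated explicitly (by the programmer or compiler), as in this C11 sketch; on a TSO machine like x86 plain stores and loads already behave this way at the hardware level, though portable code still needs the atomics.

  #include <stdatomic.h>
  #include <stdbool.h>

  int payload;                    /* ordinary data */
  atomic_bool ready;              /* publication flag, initially false */

  void producer(void) {
      payload = 42;
      /* release: all prior writes become visible to an acquire
       * load that observes ready == true */
      atomic_store_explicit(&ready, true, memory_order_release);
  }

  bool consumer(int *out) {
      if (atomic_load_explicit(&ready, memory_order_acquire)) {
          *out = payload;         /* guaranteed to see 42 */
          return true;
      }
      return false;
  }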


> Strong memory ordering […]

> But it is a no-go for SMP scalability.

The SPARC architecture introduced TSO and defaults to total store order, and SPARC (and later UltraSPARC) systems were among the first successful, highly scalable SMP implementations.

Sun Enterprise 10000 (E10k) servers could be configured with up to 64x CPUs (initially released in 1997), the Sun Fire 15K (available in 2002) could support up to 106x CPUs, and the Sun Fire E25K, released in 2004, could support up to 72x dual-core CPUs (144 CPU cores in total).

SPARC survives (albeit not frequently heard about today) as the Oracle SPARC T8-4 and M8-8 (up to 8x CPUs, 32 cores per CPU, 256 threads per CPU) and the Fujitsu[0] SPARC M12-2S (up to 32x CPUs, 384 cores and 3072 CPU threads in total).

All of the above is SMP and very many CPU's, CPU cores and CPU threads.

A successful, scalable SMP architecture has to get the cache coherence protocols right irrespective of whether the ISA implements TSO, is weakly ordered, or takes a hybrid approach.

To ensure cache coherence in a TSO UltraSPARC SMP architecture, the Sun E10k realised a threefold approach: 1) it broadcast cache coherence requests on a logical bus (as opposed to a physical bus) shaped as a tree, where all CPUs were leaves and all links between them were point-to-point; 2) greater coherence request bandwidth could be achieved by using multiple logical buses whilst still maintaining a total order of coherence requests (the E10k had four logical buses, and coherence requests were address-interleaved across them); and 3) data response messages, which are much larger than request messages, did not require the totally ordered broadcast network required for coherence requests.

The E10k scaled exceptionally well in its SMP setup whilst using the TSO. It was also highly performant in its prime time with the successor Sun Fire family improving even further.

Therefore, the strong memory ordering being a no-go for the SMP scalability statement is bunk.

[0] And Fujitsu has been a well-known poster child for making massively scalable, (Ultra-)SPARC-based supercomputing systems for a very long time as well.


Memory ordering tends not to play much into the design issues of xMP systems. As long as you have a coherent and properly scalable cache and NoC, the actual memory ordering of the local processor is irrelevant to the total performance of the system, since the LSU and L1 cache are (typically) responsible for providing ordering. The reason why most architectures use weaker memory ordering rules is that it allows you to more easily build faster individual cores, as it makes it much easier to extract memory parallelism.


Yep. There are plenty of Intel processors that have plenty of cores (and multiple sockets, even).


It's the story of every framework. It starts out clean and minimal, then gets features added on as users demand more for more and more specific uses.


See every web browser ever made. Also: Every GUI toolkit ever made. Also: IDEs. Eventually, someone is complaining about "bloat". Please: Give features, then I'll deal with the bloat.


This "feature hell" is often seen in open source projects, when users add dubious features that nobody except them needs, and as a result after many years the program has hundreds of CLI flags and settings and becomes too complex.

See openvpn as an example.


It is NOT feature hell. That is an absolutist/purist standpoint that only gets in the way in my experience. Products evolve to fit their market, which is literally why products are made.

Complexity needs to be managed, not labelled and shunned because it is "too hard" or "ugly". That is life. Learn that early and it will help.



