Having multiple threads does not mean that they are all doing equally useful work. Single threaded performance is absolutely critical for a desktop machine.
Even in multithreaded desktop applications, it's rare to see them effectively use more than 8 threads.
There are some tasks where single core definitely limits the performance (some games especially). For most of the 'compute' oriented tasks like CAD/3D, LLMs, etc. multicore is great, and the slow single core speed doesn't seem to get in the way.
I would still rather have 128 M1-class cores than 128 Neoverse-N1 cores :)
macOS, ChromeOS, Windows, and Android are heavily multithreaded; even when there is one main application thread, the underlying OS APIs use auxiliary threads.
That doesn't answer the question unless we know what fraction of the overall CPU design org the Austin team was. Was CPU design only done in Austin, or was the team in Austin one of many?
That clarifies things, and suggests Samsung is reconsidering its long-term ARM strategy: Exynos processors consistently underperform relative to the Snapdragon versions of the same Samsung device, despite what I assume were decades of investment.
I asked the rhetorical question in terms of answering the comment. Yes, all your answers are valid. In the context of the original question, you would have an accurate number of WFH vs RTO days in the product.
> However, the details of modern branch predictors are proprietary, so we don’t have authoritative sources on them.
I focused on Computer Architecture for a masters degree and now I work on a CPU design team. While I cannot say what we use due to NDA, I will say that it is not proprietary. Very nearly everything, including the branch predictors, in modern CPUs can be found in academic research.
Many of these secrets are easily found in the reading list for a graduate-level computer architecture course. Implementation details vary but usually not by too much.
I’m not related to academia. I don’t design CPUs. I don’t write operating systems and I don’t care about these side channel attacks. I simply write user-mode software, and I want my code to be fast.
The academic research used or written by CPU designers being public doesn’t help me, because I only care about the implementation details of modern CPUs like Intel Skylake and newer, AMD Zen 2 and newer. These details have non-trivial performance consequences for branchy code, but they vary a lot between different processors. For example, AMD even mentions neural networks in the press release: https://www.amd.com/en/technologies/sense-mi
What the GP is saying is that all the details of how modern processors work are out there in books and academic papers, and that the material covered in graduate-level computer architecture courses is very relevant and helpful, and they include all (or nearly all) the techniques used in industry.
From the GP's perspective, it doesn't matter at all if the course taught branch predictors on a MIPS processor, even though MIPS isn't really used anywhere anymore (well, that's wrong, they're used extensively in networking gear, but y'know, for the argument). They still go over the various techniques used, their consequences, etc., so the processor chosen as an example is unimportant.
You're saying that all this information is unhelpful for you, because what you want is a detailed optimization guide for a particular CPU with its own particular implementation of branch prediction. And yeah, university courses don't cover that, but note that they're not "outdated" because it's not as if at some point what they taught was "current" in this respect.
So yeah, in this sense you're right, academia does not directly tackle optimization for a given processor in teaching or research, and if it did it would be basically instantly outdated. Your best resource for doing that is the manufacturer's optimization guide, and those can be light on details, especially on exactly how the branch predictor works.
But "how a processor works" is a different topic from "how this specific processor works", and the work being done in academia is not outdated compared to what the industry is doing.
PS: Never believe the marketing in the press release, yeah? "Neural network" as used here is pure marketing bullshit. They're usually not directly lying, but you can bet that they're stretching the definition of what a "neural network" is and the role it plays.
> They still go over the various techniques used, their consequences, etc., so the processor chosen as an example is unimportant.
They also include various techniques not used anymore, without mentioning that’s the case. I did a search for “branch predictor static forward not taken site:.edu” and found many documents which discuss that particular BTFN technique. In modern CPUs the predictor works before fetch or decode.
> university courses don't cover that
Here’s a link to one: https://course.ece.cmu.edu/~ece740/f15/lib/exe/fetch.php?med... According to the first slide, the document was written in fall 2015. It has dedicated slides discussing particular implementations of branch predictors in Pentium Pro, Alpha 21264, Pentium M, and Pentium 4.
The processors being covered were released between 1995 and 2003. At the time that course was written, people were already programming Skylake and Excavator, and Zen 1 was just around the corner.
I’m not saying the professor failed to deliver. Quite the opposite: information about old CPUs is better than pure theory without any practically useful stuff. Still, I’m pretty sure they would be happy to include slides about contemporary CPUs, if only that information were public.
> They also include various techniques not used anymore, without mentioning that’s the case.
Definitely. Sometimes it's for comparative reasons, and sometimes it's easier to understand the newer technique in the context of the older one.
> discussing particular implementations of branch predictors in Pentium Pro, Alpha 21264, Pentium M, and Pentium 4.
Yeah, but the course is still not the optimization guide you wanted. The slides pick & choose features from each branch predictor to make the point the professor wanted to make and present the idea he wanted to. It's not really useful for optimizing code for that particular processor, it's useful for understanding how branch predictors work in general.
> I’m pretty sure they would be happy to include slides about contemporary CPUs, if only that information were public.
Only if they served as a good example for some concept, or helped make a point that the professor wanted to make. There's no point in changing the examples to a newer processor if the old one is a cleaner implementation of the concept being discussed (and older examples tend to be simpler and therefore cleaner). The point isn't to supply information about specific processors, it's to teach the techniques used in branch predictors.
P.S. See those 3 slides about a "Perceptron Branch Predictor"? Based on a paper from 2001? I'm betting AMD's "neural network" is really just something like that...
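For the curious, the perceptron predictor from that 2001 Jiménez & Lin paper is simple enough to sketch. This is a minimal illustration of the published idea, not anything a vendor actually ships; the table size, history length, and PC hash below are made-up numbers:

    #include <stdint.h>
    #include <stdlib.h>

    #define HIST_LEN   16                /* global history length (illustrative) */
    #define TABLE_SIZE 1024              /* number of perceptrons (illustrative) */
    #define THETA      ((int)(1.93 * HIST_LEN + 14))  /* training threshold from the paper */

    static int8_t weights[TABLE_SIZE][HIST_LEN + 1];  /* index 0 is the bias weight */
    static int8_t history[HIST_LEN];                  /* +1 = taken, -1 = not taken */

    static int8_t clamp8(int v) { return v > 127 ? 127 : (v < -127 ? -127 : v); }

    /* Predict: dot product of the perceptron selected by the branch PC
     * with the global history register. Nonnegative output -> predict taken. */
    static int predict(uint64_t pc, int *y_out)
    {
        int8_t *w = weights[(pc >> 2) % TABLE_SIZE];
        int y = w[0];
        for (int i = 0; i < HIST_LEN; i++)
            y += w[i + 1] * history[i];
        *y_out = y;
        return y >= 0;
    }

    /* Update on the real outcome: train only when the prediction was wrong
     * or the output magnitude was below the threshold, then shift the
     * outcome into the global history. */
    static void update(uint64_t pc, int y, int taken)
    {
        int t = taken ? 1 : -1;
        int8_t *w = weights[(pc >> 2) % TABLE_SIZE];
        if ((y >= 0) != (t > 0) || abs(y) <= THETA) {
            w[0] = clamp8(w[0] + t);
            for (int i = 0; i < HIST_LEN; i++)
                w[i + 1] = clamp8(w[i + 1] + t * history[i]);
        }
        for (int i = HIST_LEN - 1; i > 0; i--)
            history[i] = history[i - 1];
        history[0] = (int8_t)t;
    }

predict() would be consulted around fetch time and update() when the branch resolves. Real hardware does the dot product with adders rather than multipliers, since the history bits are just +/-1, and the output magnitude doubles as a confidence estimate.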
Practically, the only thing that matters is that branch prediction assumes history repeats itself: past patterns of a branch being taken under certain conditions are used to predict whether it will be taken again.
So that means that conditions that are deterministic and relatively constant throughout the lifetime of the program will most likely be predicted correctly, and that rare events will most likely not be predicted correctly. That's all you need to know to write reasonably optimized code.
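To make that concrete, the classic demonstration is running the same branchy loop over random versus sorted data. A toy benchmark (timings vary by CPU, and an optimizer may replace the branch with a conditional move or SIMD, so build at a low optimization level to actually see the effect):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (1 << 20)
    #define REPS 100

    /* The branch inside this loop is the thing being measured. */
    static long sum_big(const unsigned char *v, int n)
    {
        long s = 0;
        for (int i = 0; i < n; i++)
            if (v[i] >= 128)       /* ~50% taken on random bytes */
                s += v[i];
        return s;
    }

    static int cmp(const void *a, const void *b)
    {
        return *(const unsigned char *)a - *(const unsigned char *)b;
    }

    int main(void)
    {
        unsigned char *v = malloc(N);
        for (int i = 0; i < N; i++)
            v[i] = (unsigned char)(rand() & 0xff);

        /* Random order: the branch outcome has no pattern to learn. */
        long s1 = 0;
        clock_t t0 = clock();
        for (int r = 0; r < REPS; r++) s1 += sum_big(v, N);
        clock_t t1 = clock();

        /* Sorted order: a long run of not-taken, then a long run of taken. */
        qsort(v, N, 1, cmp);
        long s2 = 0;
        clock_t t2 = clock();
        for (int r = 0; r < REPS; r++) s2 += sum_big(v, N);
        clock_t t3 = clock();

        printf("random: %.3fs  sorted: %.3fs  (sums: %ld %ld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t3 - t2) / CLOCKS_PER_SEC, s1, s2);
        free(v);
        return 0;
    }

On most recent cores the sorted pass runs several times faster: the work is identical, but the branch outcome becomes a pattern the predictor can learn.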
Depends on what the poster above did. If they are just monitoring something at a constant rate, "just" low jitter is enough.
No idea if the ESP32 DMA engine can do it, but a common trick to get either constant-rate input or output is to set the DMA engine to copy data from/to an IO port to/from memory at a constant rate.
I remember someone using it to drive some ridiculous number of RGB LEDs by basically using DMA to bitbang multiple chains of them at once.
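Conceptually the trick looks something like this. A rough sketch in C against a made-up HAL: dma_start_cyclic is an illustrative name, not a real ESP32 API, and on a real chip the constant pacing would come from a timer-driven peripheral such as I2S:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical HAL call (illustrative, not a real vendor API): copies one
     * word from `src` to `dst_reg` every `ticks` peripheral-clock cycles,
     * wrapping around after `len` words, with no CPU involvement. */
    void dma_start_cyclic(volatile uint32_t *dst_reg,
                          const uint32_t *src, size_t len, uint32_t ticks);

    static uint32_t frame[1024];  /* each word = the state of up to 32 output pins */

    void start_led_output(volatile uint32_t *gpio_out_reg, uint32_t ticks_per_word)
    {
        /* Precompute pin states: bit N of every word drives chain N, so one
         * buffer bit-bangs many LED chains in lockstep. */
        /* ... encode pixel data into frame[] ... */

        /* The DMA engine then clocks the buffer out at a fixed rate, so the
         * output timing is set by the peripheral clock, not interrupt latency. */
        dma_start_cyclic(gpio_out_reg, frame,
                         sizeof(frame) / sizeof(frame[0]), ticks_per_word);
    }

The same idea works in the other direction for constant-rate sampling: point the DMA source at an input register and the destination at a ring buffer, then process completed buffers at your leisure.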
But yeah, using a simpler MCU as basically an IO co-processor is usually the easier route. I wish there was some kind of low-pin-count bus that let you, say, directly map a part of memory to the chip on the other side, akin to what RDMA does.
I needed to detect writes to a bus that could happen at 1MHz, which involved reading the state of the bus multiple times per microsecond (based on the timing of the various signals). The jitter in the worst case was multiple microseconds (causing missed accesses), no matter what I tried.
I wasn’t able to use DMA on the ESP32 to help— perhaps it could have if I had tried to massage the problem a little bit more though.
Three companies have walled gardens for game consoles and nobody cares about that. Those three companies own the entire console market. There's no alternative.
Why doesn't the EU do something about that? Why is Apple an exception?
This comment makes it sound like Apple shouldn't be forced to open up their hardware, but the better conclusion is that game consoles should be forced to open up their hardware. This is (allegedly) Hacker News, we should be all for giving people more control over the devices they (allegedly) own.
Just don’t buy closed devices if that’s what you want. There’s no shortage of open devices available to buy.
Complaining about iPhones and game consoles being closed is like complaining that your Honda Civic isn’t good at off-roading and won’t tow your horse trailer.
> There’s no shortage of open devices available to buy.
I'd like an open device that supports iMessage and Facetime.
I don't understand why I should have to give up all the iOS software that I really like just because I want to run one app on my device that Apple doesn't allow.
"Walled gardens" when MS has a policy of porting all their games to Windows and also offers an official way for users to run custom software on the device.
Agreed on the other two though, both should also be forced to open up.
> And once "computer says no" you'll have a hell of a time trying to fight it.
Cars have been equipped with event data recorders that capture information moments before the crash for at least 20 years. It is part of the air bag controller. Insurance companies have equipment to read that data and use it against customers all the time.
You don't need a Tesla or other black box insurance to get cheated.
Having a little accelerometer data that is probably a mild hassle to extract is far different than an always-on, internet-enabled, sensor-filled computer that literally has a camera pointing at your face.
Uh, that's uncalled for. I think your hint is right there in the title: Myths Programmers Believe about CPU Caches. Was there an inaccuracy in the article with respect to CPU Caches?