
Do you have a reference for that? 5 years is a pretty long time but I suppose it depends a lot on how one defines the start of a design.


Usually it’s what’s being run inside the terminal that matters. ssh? no big deal. ffmpeg? can slow a machine down and can have a significant memory footprint.

YMMV


Probably somewhere around two years before the M1 announcement is when development started.

It’s a lot of work to design one of these parts.

https://anysilicon.com/asic-design-flow-ultimate-guide/



But doesn’t your statement prove that there are decisions and things the user of a node can make which affect the end result? Therefore it’s not just the process node that matters.


Obviously you can choose not to aim for the best possible performance in order to reduce costs. But for the companies that are competing for the absolute best performance at any cost, which doesn't include Qualcomm, by far the biggest factor determining that performance is the fabrication technology.


That’s an interesting take given this thread because Apple has explicitly stated they don’t prioritize performance, they prioritize performance per watt, which is not the same thing. And that also shows there’s a whole design space here and just focusing on one dimension, process node, is overly simplistic.


Top end performance is power limited, so optimizing for performance per watt is almost the same thing as optimizing for performance.


I’m not sure what point you’re arguing anymore.

I’m arguing against this statement. I don’t think it’s true.

“You're looking in the wrong place. The magic comes from TSMC, not Apple.”

My point is a good process is necessary but not sufficient to make a part competitive with Apple on performance per watt. There’s a lot of “magic” to go around.


I think the argument is that Apple would have significantly less magic were it not for TSMC, which seems to be correct regardless of process or design (vertical integration).


Yes. Nowadays every computer tends to be power limited (or rather, heat limited).


I think it states that Qualcomm is not competing with Apple on the success metrics this post talks about re: catching up, so you can't use it as a comparison. No idea how quickly they could hit those specs if they wanted to (if it were core to their business).


> A lot of dating books recommend getting really specific about what you want

Also, from my personal experience, what you want is going to change over time. And even if someone checks your boxes today, they’re unlikely to be that same person in 10 years.

Being a parent changed me a lot (hopefully for the better) and 20s me, 30s me, 40s me are all pretty different guys in terms of priorities, willingness to listen to others, etc.

I echo other comments and say the priority is having a communication channel and the willingness to adapt.


>Also, from my personal experience, what you want is going to change over time. And even if someone checks your boxes today, they’re unlikely to be that same person in 10 years.

And what you can afford is going to change over time.


I know I love to grow and change and adapt. It’s a little challenging to tell if the person you meet is an adapter or a communicator. Everyone says communication in a relationship is important but the reality can be different. Is there some way early on you could tell your partner was someone willing to adapt?


I suppose traditionally this is what one learns during the fiancée stage of the relationship and given your accelerated timeline that’s definitely going to be a challenge. Unfortunately the only way I know to tell is to actually go through a couple of tough spots and see what happens and forcing it might not work. On the plus side, if you do find a test that is predictive and can also be used early in a relationship, you’ll be a best selling author of relationship books for sure.


Love this: "having a communication channel and the willingness to adapt"


I get your point but was Voyager a bank?

https://www.investvoyager.com/investorrelations/overview

“Voyager Digital Ltd. is a fast-growing, publicly traded cryptocurrency platform in the United States founded in 2018 to bring choice, transparency, and cost efficiency to the marketplace.”


Never heard of them before, but they talked about direct deposits and a debit card on the page linked above, so if they aren't a bank they are acting a lot like one.


Do you know the limit for the part? If it’s 120C then the M1 is leaving 26C on the table and the M2 only 12C.

Without the specs these measurements don’t mean much.


"The R1000 addresses 64 bits of address space instantly in every single memory access. And before you tell me this is impossible: The computer is in the next room, built with 74xx-TTL (transistor-transistor logic) chips in the late 1980s. It worked back then, and it still works today."

That statement has to be coming with some hidden caveats. 64 bits of address space is crazy huge so it's unlikely the entire range was even present. If only a subset of the range was "instantly" available, we have that now. Turn off main memory and run right out of the L1 cache. Done.

We need to keep in mind, the DRAM ICs themselves have a hierarchy with latency trade-offs. https://www.cse.iitk.ac.in/users/biswap/CS698Y/lectures/L15....

This does seem pretty neat though. "CHERI makes pointers a different data type than integers in hardware and prevents conversion between the two types."

I'm definitely curious how the runtime loader works.


"We need to keep in mind, the DRAM ICs themselves have a hierarchy with latency trade-offs" Yes this is the thing -- I'm not a hardware engineer or hardware architecture expert, but -- it seems to me that what we have now is a set of abstractions presented by the hardware to the software based on a model of what hardware "used to" look like, mostly what it used to look like in a 1970s minicomputer, when most of the intensive key R&D in operating systems architecture was done.

One can reasonably ask, like Mr Kamp is, why we should stick to these architectural idols at this point in time. It's reasonable enough, except that the landscape of heterodox, alternative architectures is itself heterogeneous -- new concepts that don't necessarily "play well with others." All our compiler technology, all our OS conventions, our tooling, etc. would need to be rethought under new abstractions.

And those are fun hobby or thought exercises, but in the real world of industry, they just won't happen. (Though I guess from TFA it could happen in a more specialized domain like aerospace/defence)

In the meantime, hardware engineering is doing amazing things building powerfully performing systems that give us some nice convenient consistent (if sometimes insecure and awkward) myths about how our systems work, and they're making them faster every year.


Makes me wonder if 50 years from now we'll still be stuck with the hardware equivalent of the floppy disk icon, only because retooling the universe over from scratch is too expensive.


As they say, C was designed for the PDP-11 architecture, and modern computers are forced to emulate it, because the tools we have for describing software (languages and OSes) can't easily describe other architectures.

There have been modern, semi-successful attempts though; see the PS3 / Cell architecture. It did not stick.

I'd say that the modern heterodox architecture domain is GPUs, but we have one proprietary and successful interface for them (CUDA), and the open alternatives (OpenCL) are still markedly weaker. And it's not even touching the OS abstractions.


> As they say, C was designed for the PDP-11 architecture

Not really though. A linear address space was not particularly specific to the PDP-11. The one point where C really was made to fit the PDP-11 was the addition of a byte datatype (aka char), but the PDP-11 wasn't unique in that regard either.


PDP-11 wasn't unique. To the contrary, it had many typical features.

- Uniform memory, cache memory is small, optional and transparent.

- A single linear address space; no pages, stack in the same RAM as data.

- A single CPU, with a single scalar ALU, and completely in-order execution; no need for memory barriers.

A typical modern machine larger than an MCU has several levels of memory hierarchy which affect performance enormously, the physical RAM is mapped all over the address space, several execution units process data in parallel and often out of strict order, there are many variants of vector (SIMD) instructions, and usually a whole vector co-processor ("graphics card"). This breaks many of the assumptions that C initially made, and hardware tries hard to conceal the memory hierarchy (well, your OS may allow you to schedule your threads to the same NUMA domain), to conceal the parallel execution, to conceal the memory incoherence between processing nodes, etc. Well, you sort of can make the compiler infer that you mean a vectorized operation, or use an intrinsic.
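
To make that last point concrete, here is a minimal sketch (my own function names, assuming x86 with the SSE headers available): a plain loop the compiler may auto-vectorize at -O3, and the same operation spelled out with intrinsics.

    #include <immintrin.h>

    /* Plain C: the compiler may infer a vectorized loop (e.g. gcc/clang -O3). */
    void add_plain(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Explicit SSE intrinsics: four floats per step; assumes n is a multiple of 4. */
    void add_sse(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }

Neither form says anything about the memory hierarchy underneath; that part the hardware still hides from you.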

In my eyes, the C's assumptions about hardware show their age, and also hold the hardware back.


My memory of the PDP-11 is different. However, the PDP-11 is/was a weird collection of systems, from tiny little single-chip implementations used to run a VAX console, up to large racks of boards making up a CPU and FPU like the 11/70. I mainly worked on 11/44s and 11/73s, which both shared a 16-bit paged address space (with no cache that I remember).

They had more physical memory (256k and I think 4M) than could be addressed by the instructions (64k).

The pages were 8k - so eight of them - and waving them around required an OS mapping function call.

The IO controllers were asynchronous, the OS did preemptive multiprocessing, and the problem space was larger than 64k and faster than the disk drive, so multi-processing and locks were required to address it.

We used C and assembler on them. C was nicer than assembler to work with.

I don't see a difference of kind between the PDP-11 and current computers. I do see a difference in the 'know-ability' of the software stack that makes up a system.

There are so many external dependencies in the systems I have worked on since, many of them larger than the systems that loaded into that PDP-11, so being certain that there is no fault was almost always a pipe dream. Automated tests helped - somewhat.

Often, confidence is based on the 'trajectory' of the rate of bugs discovered.


> That statement has to be coming with some hidden caveats. 64 bits of address space is crazy huge so it's unlikely the entire range was even present. If only a subset of the range was "instantly" available, we have that now. Turn off main memory and run right out of the L1 cache. Done.

So I did some digging around for documentation about this machine and it looks like it puts the upper 54 bits of the address through a hash function to select an entry in a set-associative tag RAM which is then used to select a physical page. This has the possibility for collisions, but it can get away with that because RAM is just a cache for disk contents.

Certain parts of the address technically mean something, but apart from leveraging that in the design of their hash function it has no real relevance to the way the hardware works. This scheme would work with linear 64-bit addresses just fine with an appropriate hash implementation. Basically all that's happening here is that the TLB is large enough to have an entry for each physical page in the system and a TLB miss means you have to fetch a whole page from disk rather than walking a tree of page tables in memory.
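
A rough sketch of that lookup as I understand it from the docs (the sizes, the hash, and the 10-bit page offset below are my own placeholders, not the R1000's real geometry):

    #include <stdint.h>

    #define SETS 1024
    #define WAYS 4

    /* One tag-RAM entry: which virtual page sits in which physical page. */
    struct tag_entry { uint64_t tag; uint32_t phys_page; int valid; };
    static struct tag_entry tag_ram[SETS][WAYS];

    /* Hash the upper address bits down to a set index. */
    static unsigned set_index(uint64_t vaddr) {
        uint64_t upper = vaddr >> 10;   /* drop the in-page offset */
        return (unsigned)((upper ^ (upper >> 17) ^ (upper >> 31)) % SETS);
    }

    /* Returns the physical page on a hit, or -1 meaning "go fetch the page from disk". */
    static long lookup(uint64_t vaddr) {
        uint64_t tag = vaddr >> 10;
        unsigned set = set_index(vaddr);
        for (int w = 0; w < WAYS; w++)
            if (tag_ram[set][w].valid && tag_ram[set][w].tag == tag)
                return tag_ram[set][w].phys_page;  /* resolved within the memory cycle */
        return -1;  /* collision or absent: RAM is just a cache for disk */
    }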

I think the other thing going on here is that the R1000 is a microcoded machine from the 80s with no cache (well, unless you're counting main RAM as cache), so it probably has a relatively leisurely memory read cycle which makes it more straightforward to have a very large TLB relative to the size of main memory. There's no magic here and no lessons for modern machines when it comes to how virtual address translation is done.


You are right that there are no lessons for "modern machines" as we build them now.

But that is precisely my point: Maybe there are better ways to build them?


What I mean is that the R1000 memory architecture is not fundamentally different from modern hardware in a way that seems to solve any modern design problems. The tag RAM is functionally equivalent to the TLB on a modern CPU, but it's much larger relative to the size of the RAM it's used with. The 2MB memory boards described in US Patent 4,736,287 (presumably an earlier version of the 32MB boards present in the R1000/s400) have a whopping 2K tag RAM entries. This is the same size as the 2nd level data TLB in Zen 2 which is supporting address lookup for a much larger main memory.

If you were to try and make a modern version of the R1000 architecture you're going to run into the same size vs speed tradeoffs that you see in conventional architectures. The server-oriented Rome SKUs of Zen 2 support 4 TB max RAM. Even if you bump the page size to 4MB, you still would need 1M TLB/tag RAM entries to support that with an R1000-style implementation.


Sorry, but you are simply wrong there. The TLB is just one of many hacks made necessary by the ever-deeper page-table-tree.

What the R1000 does is collapse the obj->phys lookup in the DRAM memory cycle, and if we did that today, we wouldn't need any page-tables to begin with, much less TLBs.


> Sorry, but you are simply wrong there. The TLB is just one of many hacks made necessary by the ever-deeper page-table-tree.

> What the R1000 does is collapse the obj->phys lookup in the DRAM memory cycle, and if we did that today, we wouldn't need any page-tables to begin with, much less TLBs.

You would need a TLB even with a completely flat page table because hitting the DRAM bus (some flavor of DDR on modern systems, but it's still fundamentally DRAM) on every access would absolutely destroy performance on a modern machine even if translation itself was "free". You need translation structures that can keep up with the various on-chip cache levels which means they need to be small and hierarchical. You can't have some huge flat translation structure like you have on the R1000 and have it be fast.

Anyway, my point is that at a mechanical level TLB and tag RAM work the same way. You take a large virtual address, hash the upper bits and use them to do a lookup in a set-associative memory (so basically a hash table with a fixed number of buckets for conflicts). In some CPUs (it's a little unclear to me how common it is for cache to be virtually or physically addressed these days) this even happens in parallel with data fetch from cache just like tag RAM lookup on the R1000 was done in parallel with data fetch from DRAM. This is not some forgotten technique, it's just moved inside the CPU die and various speed and die space constraints keep it from covering all the physical pages of a modern system.

Now, could you perhaps use a more R1000-like approach for the final layer of translation? Sure. Integrating it tightly with system memory probably doesn't make sense given the need to be able to map other things like VRAM into a virtual address space, but you could have a flat, hashtable-like arrangement even if it's just a structure in main RAM. You can even implement such a thing on an existing CPU with a software-managed TLB (MIPS, some SPARC).
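
For the software-managed-TLB variant, the refill handler could be as simple as probing one flat hashed table in RAM. A sketch with made-up sizes; tlb_insert() stands in for the architecture-specific TLB-write instruction (tlbwr on MIPS) and is stubbed out here:

    #include <stdint.h>

    /* Flat, hashed translation table living in ordinary RAM. */
    struct xlate { uint64_t vpage; uint64_t ppage; int valid; };
    #define BUCKETS (1u << 20)
    static struct xlate table[BUCKETS];

    /* Stand-in for the real TLB-write instruction. */
    static void tlb_insert(uint64_t vpage, uint64_t ppage) { (void)vpage; (void)ppage; }

    void tlb_miss_handler(uint64_t fault_vaddr) {
        uint64_t vpage = fault_vaddr >> 12;                   /* 4k pages assumed      */
        uint64_t h = (vpage * 0x9e3779b97f4a7c15ull) >> 44;   /* top 20 bits -> bucket */
        for (uint64_t i = 0; i < BUCKETS; i++) {              /* linear probe          */
            struct xlate *e = &table[(h + i) % BUCKETS];
            if (e->valid && e->vpage == vpage) { tlb_insert(vpage, e->ppage); return; }
            if (!e->valid) break;                             /* empty slot: unmapped  */
        }
        /* fall through to the normal page-fault path */
    }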


The TLB is an attempt to mitigate the horrible performance properties of a multi-level page-table-trees.

If you do away with the page-table-tree, there is no problem for the TLB to mitigate.


I'm sure there are improvements to be made but there are pretty fundamental physical reasons why reading a random piece of data from a large pool of memory is going to take longer than reading random piece of data from a small pool of memory. Hence, as memory pools have gotten bigger since the days of the R1000 we use caches both in memory itself and in address translation.


The entire address space (64 address bits: "only" 61 of them address bytes, the other 3 are "kind" bits separating code/data/control etc.) was available at all times.

The crucial point is that the RAM wasn't laid out as a linear array, but as a page-cache.

In a flat memory space, to allocate a single page far away from all others, you will need four additional pages (the fifth level is always present) for the page-tables, and four memory accesses to look them up before you get to the data.
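
For comparison, this is roughly what that cost looks like on current x86-64 with 5-level paging (a sketch, 4 KiB pages assumed): each 9-bit field picks an entry in one table page, so a cold translation walks one dependent memory read per level before touching the data itself.

    #include <stdint.h>

    /* level 4 = top level (always present), level 0 = bottom; 12-bit page offset. */
    static inline unsigned pt_index(uint64_t vaddr, int level) {
        return (unsigned)(vaddr >> (12 + 9 * level)) & 0x1ff;
    }

    /* Allocating one page "far away" from everything else means the four levels
       below the top may each need a fresh table page just to hold one entry. */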

In the R1000, you present the address you want on the bus, the memory board (think: DIMM) looks that address up in its tag-ram to see if it is present, completes the transaction or generates a memory fault, all in one single memory cycle, always taking the same time.


The problem is that there is a limit on how fast you can make the look-up in hardware. Today, given the large amounts of physical memory and the high frequency that CPUs are clocked at, single cycle lookup would be impossible. In fact today CPUs already have such lookup tables in the form of TLBs, as hitting the page table every time would have very high latency; even so, TLBs cannot cover the whole address space and still need multi-level structures even for a subset of it.

Single Address Space OSes are an option, but they restrict you to memory-safe languages, are very vulnerable to Spectre-like attacks, and any bug in the runtime means game over.


>Single Address Space OSs are an option, but it means that you are restricted to memory safe languages

CHERI works just fine for enforcing memory protection within an address space.


This has been a popular topic for some time in the computer architecture space.

“The End of the Road for General Purpose Processors & the Future of Computing - Prof. John L. Hennessy”

https://www.csail.mit.edu/news/end-road-general-purpose-proc...


Or maybe have the object code for the executable as one static constant and jump to it on the one and only line of the program: speed, size, and one line of code (maybe two, depending on how the counting is done).
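
Something like this, presumably (a Linux/x86-64 sketch; the byte string is just "return 42", and the mmap dance is needed because data pages aren't executable - hardened systems that enforce W^X will refuse the mapping):

    #define _DEFAULT_SOURCE
    #include <string.h>
    #include <sys/mman.h>

    /* x86-64 machine code for: mov eax, 42; ret */
    static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    int main(void) {
        void *buf = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        memcpy(buf, code, sizeof code);
        int (*fn)(void) = (int (*)(void))buf;   /* the "jump to the constant" */
        return fn();
    }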



