
I'm not a hardware engineer, so take this with a grain of salt:

The 6502 doesn't have a cache, only external memory, so performance would probably be much worse than naively expected (except perhaps for workloads that fit in CPU registers; edit: not even those, due to the lack of a code cache as well). Memory latencies haven't improved nearly as much as CPU speeds have, which is why modern CPUs have big caches.

The CPU would be idly waiting for memory to respond most of the time, which completely kills performance.
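
To put rough numbers on that (the latency, clock speed and per-instruction access count below are assumptions, just to illustrate the scale of the problem):

  # Back-of-the-envelope only; all figures here are assumed round numbers.
  cpu_cycle_ns = 1.0            # a 6502-style core clocked at 1 GHz
  dram_latency_ns = 70.0        # uncached external DRAM access, roughly
  accesses_per_instr = 3        # e.g. a zero-page load: 2 fetches + 1 data read
  ns_per_instr = accesses_per_instr * dram_latency_ns
  waiting = 1 - (accesses_per_instr * cpu_cycle_ns) / ns_per_instr
  print(f"~{1000 / ns_per_instr:.1f} MIPS, ~{waiting:.0%} of the time spent waiting")

So even a GHz-class clock buys you almost nothing without a cache.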




That's because on (most) modern systems, main RAM is shared (one combined memory space) and external to the CPU, connected through a fat but high-latency pipe.

A solution is to include RAM with each CPU core on-die. Afaik this is an uncommon approach because semiconductor fabrication processes for RAM vs. CPU logic don't match well? But it's not impossible: ICs like this exist, and e.g. SoCs with integrated RAM, CPU caches etc. are a thing.

So imagine a 'compute module' consisting of a 6502-class CPU core + a few (dozen?) KB of directly attached RAM + some peripheral I/O to talk to neighbouring CPUs.

Now imagine a matrix of 1000s of such compute modules, integrated on a single silicon die, all concurrently operating @ GHz clock speeds. Sounds like a GPU, with main RAM distributed across its compute units. :-)
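
A toy sketch of that topology, in software terms (the grid size, RAM size and mailbox scheme are made up for illustration):

  # Hypothetical fabric: each 'compute module' owns a small local RAM and
  # talks only to its grid neighbours via mailboxes; there is no shared main RAM.
  GRID = 64                                   # 64 x 64 = 4096 modules on one die
  def make_module():
      return {"ram": bytearray(8 * 1024),     # a few KB attached directly to the core
              "inbox": []}                    # messages arriving from neighbours
  fabric = [[make_module() for _ in range(GRID)] for _ in range(GRID)]
  def send(x, y, dx, dy, payload):
      # hand a value to a neighbouring module instead of touching shared memory
      fabric[(y + dy) % GRID][(x + dx) % GRID]["inbox"].append(payload)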

Examples:

  GreenArrays GA144
  Cerebras Wafer Scale Engine
(not sure about the nature of the 'CPU' cores on the latter: general purpose, or minimalist and specialized for AI/neural network processing?)

Afaik the 'problem' is more how to program such systems easily / securely, and how to arrange common peripherals like USB, storage, video output etc. in a practical manner so as to utilize the immense compute power + memory bandwidth of such a system.


> Now imagine a matrix of 1000s of such compute modules, integrated on a single silicon die, all concurrently operating @ GHz clock speeds. Sounds like a GPU, with main RAM distributed across its compute units. :-)

Sounds like a new version of the Connection Machine (the classic / original one).


The 6502 didn't need a cache because its clock was slower than the DRAM connected to it. That made memory accesses very inexpensive.

The Apple II took advantage of the speed difference between CPU and DRAM and designed the video hardware to read from memory every other memory cycle, interleaved with CPU memory access.
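
In simplified form (a toy illustration, not the actual Apple II bus timing):

  # The DRAM can service roughly twice as many accesses per second as a
  # ~1 MHz 6502 issues, so video and CPU simply alternate bus slots.
  for slot in range(6):
      print(f"memory slot {slot}: {'video read' if slot % 2 == 0 else 'CPU access'}")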


> The Apple II took advantage of the speed difference between CPU and DRAM and designed the video hardware to read from memory every other memory cycle, interleaved with CPU memory access.

Same for the C64. Sometimes the video display needed to read slightly more than that, though, so the VIC (video chip) had to pause the CPU for a bit, resulting in so-called "badlines".


I think they used SRAM and not DRAM, which is how it was faster than a clock cycle.


Both were faster. DRAM was a little cheaper, but required more circuitry to handle refresh on most CPUs, making it a wash cost-wise on some designs. Typical Woz engineering got the cost of the refresh circuitry down to where DRAM made sense economically on the Apple ][.

Interestingly, Z80s had DRAM refresh circuitry built in, which was one reason for their prevalence.


Screen output was the DRAM refresh.

And for the Z80: also that it only needed GND and 5V. The 8080 also needed 12V. And the Z80 only needed a single clock phase -- the 8080 needed two. The 6502 also only needed 5V and a single clock input (the 6800 needed two clock phases). The 6502 and Z80 were simply a lot easier to work with than most of the competition.


Nope, they used 16k x 1-bit DRAM ICs.


You'd probably want to add at least a small cache, sure. Putting the zero page (256 bytes) on-die on each core (because it's typically used as "extra registers" of sorts in 6502 code) plus a small additional instruction cache could do wonders. But you probably still wouldn't get anywhere near enough performance for it to be competitive with a more complex modern architecture.
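
A minimal sketch of that access path (sizes and structure are assumptions, purely illustrative):

  # Hypothetical per-core memory path: zero page on-die, a tiny instruction
  # cache in front of external RAM, everything else goes off-chip (slow).
  zero_page = bytearray(256)              # on-die, single-cycle
  icache = {}                             # addr -> byte, small instruction cache
  external_ram = bytearray(64 * 1024)     # off-die, high latency
  def read(addr, is_fetch=False):
      if addr < 0x100:
          return zero_page[addr]          # fast path: the "extra registers"
      if is_fetch and addr in icache:
          return icache[addr]             # fast path: recently fetched code
      value = external_ram[addr]          # slow path: external access
      if is_fetch:
          icache[addr] = value
      return value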

It'd be a fun experiment to see someone do in an FPGA, though.


> The 6502 doesn't have a cache

I don't think CPU caches were much of a thing back then, at least in the segments involved. AFAIK Intel wouldn't get on-die cache until the 486 (the 386 did support external L1; I think IBM's licensed variants also had internal caches, but those were only for IBM to use).

The most distinguishing feature of the 6502 is that it had almost no registers and used external memory for most of the working set (it had shorter encodings for the first page, and IIRC later integrators would use faster memory for the zero page).


> I don't think CPU caches were much of a thing back then

Indeed, because memory was fast enough to respond within a single cycle back then. Or alternatively, CPU cycles were slow enough for that to happen depending on how you want to look at it :)


6502 systems had RAM that ran at system speed. Everything was so much slower then, so this was feasible... plus, some systems (like the Commodore VIC-20) used fast static RAM.

The 6502 even had an addressing mode wherein accesses to the first 256 bytes of RAM -- the "zero page" -- were much faster due to not needing address translation, to the point where it was like having 256 extra CPU registers. Critical state for hot loops and such was often placed in the zero page.

Do not presume to apply principles of modern systems to these old systems. They were very different.


The saving for the zero page is only the more compact encoding, which saves one cycle per memory-access instruction (for the non-indexed variants, anyway).

E.g. LDA $42 is encoded as $A5 $42, and so takes 3 cycles (two to read the instruction, one to read from address $42). LDA $1234 is encoded as $AD $34 $12, and so takes 4 cycles: three to read $AD $34 $12, and one to read the contents of address $1234. Same for the other zero-page vs. equivalent absolute instructions.

See e.g. timing chart for LDA here:

http://www.romdetectives.com/Wiki/index.php?title=LDA


Now I have to wonder whether it would be possible to build a cache around an unmodified 6502. I.e. a component that just sits between the (unmodified) CPU and memory and does caching.


I don’t think an unmodified 6502 would get faster; it assumes memory is fast enough for its clock speed, and won’t, for example, prefetch code or data.

Also, if you add a modern-sized cache, you won’t need the memory at all in a single-CPU system; the cache would easily be larger than the addressable memory.


It'd only benefit if the latency of the external memory is bad enough to keep the cores waiting, sure, and certainly you're right regarding a single-CPU system. I think this fantasy experiment only "makes sense" (for low enough values of making sense) if you assume a massively parallel system. I don't think it'd be easy (or even doable) to get good performance out of it by modern standards - it'd be a fun weird experiment though.


If I were to build a system with lots of 6502s, I think I would build something transputer-like, so that each CPU has its own memory, and the total system can have more than 64 kB of RAM.

The alternative, a massively parallel system with shared memory on the 6502, would mean all CPUs have 64 kB of memory, combined. I think that would limit what you can do with such a system too much.
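
To put numbers on the difference (core count and per-core RAM size are just illustrative):

  cores = 1000
  shared_space = 64 * 1024            # one 16-bit address space for everybody
  per_core_ram = 32 * 1024            # transputer-style local RAM, assumed size
  print(f"shared: {shared_space // 1024} KB total, ~{shared_space // cores} bytes per core")
  print(f"local:  {cores * per_core_ram // (1024 * 1024)} MB total across the system")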


See: https://news.ycombinator.com/item?id=37011671

These were intended to get to 4k cores (but 32-bit), w/32K on-die memory, but with the ability to read from/write to other cores. A 6502-inspired variant would be cool, but to avoid getting killed on the cycle cost of triggering reads/writes you'd probably want a modified core, or maybe a 65C816 (16-bit, mostly 6502-compatible).


Should be possible: the 6502 has a RDY signal to insert wait states in case an access isn't ready yet. You'd need some way to configure the cache for address ranges and access types, but that would presumably just be more MMIO.
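
A rough software model of such a transparent cache (direct-mapped and simplified; line size and capacity are assumptions, and the real thing would be logic driving the RDY pin rather than Python):

  # Toy direct-mapped cache sitting between the CPU and slow RAM.
  # On a miss, real hardware would hold RDY low while the line is filled.
  LINE, LINES = 16, 64                    # 16-byte lines, 1 KB total (assumed)
  tags = [None] * LINES
  data = [bytearray(LINE) for _ in range(LINES)]
  slow_ram = bytearray(64 * 1024)
  def cpu_read(addr):
      index = (addr // LINE) % LINES
      tag = addr // (LINE * LINES)
      if tags[index] != tag:              # miss: stall the CPU via RDY
          base = (addr // LINE) * LINE
          data[index][:] = slow_ram[base:base + LINE]
          tags[index] = tag
      return data[index][addr % LINE]     # hit: respond without a wait state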


The Apple IIc+ did this with a 4 MHz 65C02.


Back then, memory latency was shorter than a CPU clock cycle, so a cache was not needed.



