
I know it's not a straight conversion, but I've wondered what performance would be like on a multi-core, multi-gigahertz 6502 successor. Even a 486 had a million transistors; think how many 6502s could fit on a die with that same transistor count.



For an idea in that vein (many 6502s on a die), look at the (sadly defunct) Epiphany, in the Parallella [1] (the buy links are still there, but I doubt any are still available).

I have two, and they're fun little toys and I wish they'd gotten further, with the caveat that actually making use of them in a way that'd be better than just resorting to a GPU is hard.

The 16-core Epiphany[2] in the Parallella is too small, but they were hoping for a 1k or 4k core version.

I'm saying it's a similar idea because the Epiphany cores had 32KB of on-die RAM per core, and a predictable "single cycle per core traversed" (I think; I haven't checked the docs in years) cost of accessing RAM on the other cores, arranged in a grid. Each core is very simple, though still nowhere near the simplicity of a 6502 (they're 32-bit, with a much more "complete" modern instruction set).

The challenge is finding a problem that 1) decomposes well enough to benefit from many cores despite low RAM per core (if you need to keep hitting main system memory or neighbouring core RAM, you lose performance fast), 2) does not decompose well into SIMD-style processing where a modern GPU would make mincemeat of it, and 3) is worth the hassle of figuring out how to fit your problem into the memory constraints.

I don't know if this is an idea that will ever work - I suspect the problem set where they could potentially compete is too small, but I really wish we'd see more weird hardware attempts like this anyway.

[1] https://parallella.org/

[2] https://www.adapteva.com/docs/e16g301_datasheet.pdf


That reminds me of the GreenArrays computer with 144 cores, programmed in colorForth.

Chuck Moore has a talk on it here, very interesting stuff: https://www.youtube.com/watch?v=0PclgBd6_Zs

I'm not sure how far this has gone since then. The site is still up with buy links as well: https://www.greenarraychips.com


Yeah, I think they're even more minimalist. It's an interesting design space, but hard to see it taking off, especially with few developers comfortable with thinking about the kind of minimalism required to take advantage of these kinds of chips.


Completely agree; it's an almost total shift from everything that's in use these days. Very interesting to play with though, and I'd love to see some real-world use.


And 4) how to handle problems that don't map well onto a massively parallel machine. Some problems / algorithms are inherently serial by nature.

Some applications like games decompose relatively easily into separate jobs like audio, video, user input, physics, networking, prefetching game data, etc. Further decomposing those tasks... not so easy. So e.g. 4..8 cores are useful. 100 or 1k+ cores otoh... hmmm.


True, but I'd expect that for a chip like that you'd do what the Parallella did, or what we do with CPU + GPU, and pair it with a chip with a smaller number of higher-powered cores. E.g. the Parallella had 2x ARM cores along with the 16x Epiphany cores.


For control parallelism. For data parallelism you get to soak up as many cores as you want if you can structure your problem the right way.

Personally I'd love to see more exploration of hybrid approaches. The Tera MTA supported both modes pretty well.


I'm not a hardware engineer, so take this with a grain of salt:

The 6502 doesn't have a cache, only external memory. So performance would probably be much worse than naively expected (except perhaps for workloads that fit in CPU registers; edit: not even those, due to the lack of a code cache as well). Memory latencies haven't improved nearly as much as CPU speeds have, which is why modern CPUs have big caches.

The CPU would be idly waiting for memory to respond most of the time, which completely kills performance.


That's because on (most) modern systems, main RAM is combined (shared memory space) and external to the CPU, connected through a fat but high-latency pipe.

A solution is to include RAM with each CPU core on-die. Afaik this is an uncommon approach because semiconductor fabrication processes for RAM vs. CPU don't match well? But it's not impossible - ICs like this exist, and e.g. SoCs with integrated RAM, CPU caches etc. are a thing.

So imagine a 'compute module' consisting of a 6502-class CPU core + a few (dozen?) KB of RAM directly attached + some peripheral I/O to talk to neighbouring CPUs.

Now imagine a matrix of 1000s of such compute modules, integrated on a single silicon die, all concurrently operating @ GHz clock speeds. Sounds like a GPU, with main RAM distributed across its compute units. :-)

Examples:

  GreenArrays GA144
  Cerebras Wafer Scale Engine
(not sure about the nature of the 'CPU' cores on the latter: general purpose, or minimalist and specialized for AI/neural network processing?)

Afaik the 'problem' is more how to program such systems easily / securely, and how to arrange common peripherals like USB, storage, video output etc. in a practical manner so as to utilize the immense compute power + memory bandwidth in such a system.


> Now imagine a matrix of 1000s of such compute modules, integrated on a single silicon die, all concurrently operating @ GHz clock speeds. Sounds like a GPU, with main RAM distributed across its compute units. :-)

Sounds like a new version of the Connection Machine (the classic / original one).


The 6502 didn't need cache because its clock speed was slower than the DRAM connected to it. That made memory accesses very inexpensive.

The Apple II took advantage of the speed difference between CPU and DRAM and designed the video hardware to read from memory every other memory cycle, interleaved with CPU memory access.


> The Apple II took advantage of the speed difference between CPU and DRAM and designed the video hardware to read from memory every other memory cycle, interleaved with CPU memory access.

Same for the C64. Sometimes it was necessary to read slightly more than that for the video display though, so the VIC (video chip) had to pause the CPU for a bit sometimes, resulting in so-called "badlines".


I think they used SRAM and not DRAM, which is how it was faster than a clock cycle.


Both were faster. DRAM was a little cheaper, but required more circuitry to handle refresh on most CPUs, making it a wash cost-wise on some designs. Typical Woz engineering got the cost of the refresh circuitry down to where DRAM made sense economically on the Apple ][.

Interestingly, Z80s had DRAM refresh circuitry built in, which was one reason for their prevalence.


Screen output was the DRAM refresh.

And for the Z80: also that it only needed GND and 5V. The 8080 also needed 12V. And the Z80 only needed a single clock phase -- the 8080 needed two. The 6502 also only needed 5V and a single clock input (the 6800 needed two clock phases). The 6502 and Z80 were simply a lot easier to work with than most of the competition.


Nope, they used 16k x 1-bit DRAM ICs.


You'd probably want to add at least a small cache, sure. Putting the zero page (256 bytes) on-die on each core (because it's typically used as "extra registers" of sorts in 6502 code) plus a small additional instruction cache could do wonders. But you probably still wouldn't get anywhere near enough performance for it to be competitive with a more complex modern architecture.

It'd be a fun experiment to see someone do in an FPGA, though.


> The 6502 doesn't have a cache

I don't think CPU caches were much of a thing back then, at least in the segments involved. AFAIK Intel wouldn't get on-die cache until the 486 (the 386 did support an external L1 cache; I think IBM's licensed variants also had internal caches, but those were only for IBM to use).

The most distinguishing feature of the 6502 is that it had almost no registers and used external memory for most of the working set (it had shorter encodings for the first page, and IIRC later integrators would use faster memory for the zero page).


> I don't think CPU caches were much of a thing back then

Indeed, because memory was fast enough to respond within a single cycle back then. Or alternatively, CPU cycles were slow enough for that to happen depending on how you want to look at it :)


6502 systems had RAM that ran at system speed. Everything was so much slower then so this was feasible... plus, some systems (like the Commodore VIC-20) used fast static RAM.

The 6502 even had an addressing mode wherein accesses to the first 256 bytes of RAM -- the "zero page" -- were much faster due to not needing address translation, to the point where it was like having 256 extra CPU registers. Critical state for hot loops and such was often placed in the zero page.

Do not presume to apply principles of modern systems to these old systems. They were very different.
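For a concrete sketch of that "extra registers" idiom: a 256-byte copy loop driven by 16-bit pointers kept in the zero page, since the (indirect),Y addressing mode only works with zero-page operands. SRC and DST here are arbitrary zero-page locations I picked for illustration, not anything standard:

  SRC = $FB        ; 16-bit source pointer in zero page (hypothetical location)
  DST = $FD        ; 16-bit destination pointer in zero page (hypothetical location)

          LDY #$00
  LOOP:   LDA (SRC),Y   ; load through the zero-page source pointer + Y
          STA (DST),Y   ; store through the zero-page destination pointer + Y
          INY
          BNE LOOP      ; Y wraps after 256 bytes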


The saving for the zero page is only the more compact encoding, which saves one cycle for the memory access (for the non-indexed variants, anyway).

E.g. LDA $42 is encoded as $A5 $42, and so takes 3 memory accesses (two to read the instruction, one to read from address $42). LDA $1234 is encoded as $AD $34 $12, and so takes 4 cycles: three to read $AD $34 $12, and one to read the contents of address $1234. The same goes for the other zero-page vs. equivalent absolute instructions.

See e.g. timing chart for LDA here:

http://www.romdetectives.com/Wiki/index.php?title=LDA
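Side by side, using the same numbers as above:

  LDA $42      ; zero page: $A5 $42,     3 cycles
  LDA $1234    ; absolute:  $AD $34 $12, 4 cycles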


Now I have to wonder whether it would be possible to build a cache around an unmodified 6502. I.e. a component that just sits between the (unmodified) CPU and memory and does caching.


I don’t think an unmodified 6502 would get faster; it assumes memory is fast enough for its clock speed, and won’t, for example, prefetch code or data.

Also, if you add a modern-sized cache, you won’t need the memory at all in a single-CPU system; the cache would easily be larger than the addressable memory.


It'd only benefit if the latency of the external memory is bad enough to keep the cores waiting, sure, and certainly you're right regarding a single-CPU system. I think this fantasy experiment only "makes sense" (for low enough values of making sense) if you assume a massively parallel system. I don't think it'd be easy (or even doable) to get good performance out of it by modern standards - it'd be a fun weird experiment though.


If I were to build a system with lots of 6502s, I think I would build something transputer-like, so that each CPU has its own memory, and the total system can have more than 64 kB of RAM.

The alternative of a massively parallel system with shared memory, on 6502, would mean all CPUs have 64 kB of memory, combined. I think that would limit what you can do with such a system too much.


See: https://news.ycombinator.com/item?id=37011671

These were intended to get to 4k cores (but 32-bit), with 32KB of on-die memory per core, but with the ability to read from/write to other cores. A 6502-inspired variant would be cool, but to not get killed on the cycle cost of triggering reads/writes you'd probably want a modified core, or maybe a 65C816 (16-bit, mostly 6502-compatible).


Should be possible; the 6502 has a RDY signal to insert wait states in case an access isn't ready yet. You'd need some way to configure the cache for address ranges and access types, but that would presumably just be more MMIO.


The Apple IIc+ did this with a 4MHz 65C02.


Back then memory latency was shorter than a CPU clock cycle, so a cache was not needed.


No division, no floating point, only 8-bit multiplication and addition, no pipeline, no cache, no MMU, no preemptive multitasking, very inefficient for higher level languages (even C). But you would get about 450 of them for the number of transistors on a 486.


Multitasking is tricky because the stack pointer is hard-coded to use page 1. It could first be moved on the 65CE02 and on the 16-bit 65C816.

In a code density comparison, the 6502 is as bad as some RISC processors where instructions are four bytes instead of one. <https://www.researchgate.net/publication/224114307_Code_dens...>

But it was fast, with the smallest instructions taking 2 cycles (one to read the op-code, one to execute). A 1 MHz 6502 is considered on par with a 3.5 MHz Z80, overall.


The 6502 does not have any multiplication instructions. It does have very fast interrupt latency for the processors of its time, though.


Not very good, I wouldn't think. It didn't have many instructions or registers. There would be a LOT of memory reads/writes, a lot of extra finagling of things that modern architectures can do in one or two instructions very cleanly.

Assuming no extra modern stuff was added on top of the 6502 and you just got a bog-standard 6502 at a very high clock speed, then as the other comment mentioned there was no memory caching, no pipelining (though lots of people have been interested in designing a pipelined one), and not a whole lot of memory space to boot as it had 16-bit wide registers.

Thus, most 6502 applications (including NES games, for example) had to use mappers to map different physical memory regions in and out of the address space, similar to modern-day MMUs (just without the paging/virtualization). It would be hammering external memory like crazy.
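As a rough sketch of what that bank switching looks like from the CPU side (the mapper latch address and bank layout here are made up; real mappers like the NES ones differ in the details):

  MAPPER = $C000   ; made-up mapper latch address

          LDA #$03       ; pick ROM bank 3
          STA MAPPER     ; write the bank number to the mapper latch
          JSR $8000      ; call into code that now sits in the newly mapped bank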


> and not a whole lot of memory space to boot as it had 16-bit wide registers.

If only. Its program counter is 16 bits, but the other registers are 8 bits.

> It would be hammering external memory like crazy.

For some 6502s, like crazier. The NMOS 6502 has a ‘feature’ called ‘phantom reads’ where it will read memory more often than required.

https://www.bigmessowires.com/2022/10/23/nmos-6502-phantom-r...:

“On the 65C02, STA (indirect),Y requires five clock cycles to execute: two cycles to fetch the opcode and operand, two cycles to read two bytes in the 16-bit pointer, and one cycle to perform the write to the calculated address including the Y offset. All good. But for the NMOS 6502 the same instruction requires six clock cycles. During the extra clock cycle, which occurs after reading the 16-bit pointer but before doing the write, the CPU performs a useless and supposedly-harmless read from the unadjusted pointer address: $CFF8. This is simply a side-effect of how the CPU works, an implementation detail footnote, because this older version of the 6502 can’t read the pointer and apply the Y offset all in a single clock cycle.”

I don’t know whether it’s true, but https://www.applefritter.com/content/w65c02s-6502 (Comment 8) claims:

“The Woz machine on the Disk II card is controlled by the address lines (which set its mode) and exploits this "phantom read" to switch from SHIFT to LOAD mode during floppy disk write operations so in the CPU cycle immediately following the phantom read, which is the write cycle, to the same address as the phantom read (assuming no page crossing this time), the Woz machine will be in LOAD mode and grab the disk byte ("nibble") from the 6502 data bus. The Woz machine does a blind grab of that byte, it does not look at the R/W signal to determine if it's a write cycle. If the phantom read is missing, it just won't work because the data byte will be gone before the mode changes to LOAD.

I am quite sure that this is how the Disk II works. But this is where my own hands-on experience ends.”


> If only. Its program counter is 16 bits, but the other registers are 8 bits.

Oh wow, I didn't remember that. I had to go back to double check it! Maybe the NES had wider registers? Or maybe I'm just completely misremembering, which is definitely probably absolutely the case.

Also, thanks for all the other links! Really interesting stuff.


There's the mostly 6502 compatible 16-bit WDC 65C816 that'd probably be a much better, less painfully minimalist starting point.

Apart from an MMU, the biggest hurdle I think for modern dev on the original 6502 is the lack of multiply/divide and the pain of having to either forgo modern levels of stack use or emulate a bigger stack.
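For example, even an 8-bit by 8-bit multiply has to be done in software with shift-and-add. A minimal sketch of the classic routine, assuming NUM1, NUM2 and RESULT are zero-page locations I picked for illustration (NUM2 gets destroyed):

  NUM1   = $F0     ; multiplicand (hypothetical zero-page location)
  NUM2   = $F1     ; multiplier, destroyed (hypothetical zero-page location)
  RESULT = $F2     ; 16-bit result: low byte at $F2, high byte at $F3

  MULT:   LDA #$00        ; clear the high byte of the partial product
          LDX #$08        ; 8 multiplier bits to process
  LOOP:   LSR NUM2        ; shift the next multiplier bit into carry
          BCC SKIP        ; bit clear: skip the add
          CLC
          ADC NUM1        ; bit set: add the multiplicand to the high byte
  SKIP:   ROR A           ; rotate the partial product right...
          ROR RESULT      ; ...through the low byte
          DEX
          BNE LOOP
          STA RESULT+1    ; store the high byte of the 16-bit result
          RTS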


Back in the day, Rockwell had a dual 6502 variant in their databook. Two cores, each basically interleaving memory access.

The fun part about the 6502 and many friends from that era was that memory was direct access, bytes moving in and out every CPU cycle. In fact the CPU was slow enough that it often did not need every memory cycle, allowing for things like transparent DMA.

Clock a 6502 at some GHz frequency and would there even be RAM fast enough?

Would be a crazy interesting system though.

I have a 16 MHz one in my Apple and it is really fast! Someone made a 100 MHz one with an FPGA.

Most 6502 ops took 3 to 5, maybe 7 cycles to perform.

At 1 MHz that is roughly 250 to 300k instructions per second (1,000,000 cycles / ~3.5 cycles per instruction ≈ 285k).

1 GHz would scale the same way, assuming fast enough RAM, yielding roughly 300M instructions per second.

I have always thought of it as adds and bit ops per second, and that is all a 6502 can really do.


This comment makes me want to read an article focusing on how performance-per-transistor has changed over time.



