Intel x86 documentation has more pages than the 6502 has transistors (2013) (righto.com)
231 points by optimalsolver 9 months ago | 105 comments



In other mind-boggling stats: A single modern games console has more RAM than all of the Atari 2600s ever manufactured put together.


>A single modern games console has more RAM than all of the Atari 2600s ever manufactured put together

In other mind-boggling movie lore, an RTX 4090 has 2x the TFLOPS of the killer AI Skynet in Terminator 3[1].

The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number for the world-ending AI that nothing could possibly come close to it, and 20 years later consumers can have twice that computing power in their home PCs.

It's also a nice reminder of how far technology has progressed in recent decades, even if the pace has slowed in recent years.

[1]https://youtu.be/_Wlsd9mljiU?t=155


> RTX 4090 has 2x the TFLOPS of the killer AI Skynet in Terminator 3. The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number

Also, a fact worth noting but routinely ignored in the popular press is that these astronomical peak floating-point ratings of modern hardware are only achievable for a small selection of algorithms and problems. In practice, realizable performance is often much worse; efficiency can be as low as 1%.

First, not all algorithms are well suited to the von Neumann architecture. Today, the memory wall is higher than ever. The machine balance (FLOPS vs. load/store) of modern hardware is around 100:1. To maximize floating-point operations, all data must fit in cache, which requires the algorithm to have a high level of data reuse via cache blocking. Some algorithms do this especially well, like dense linear algebra (the Top500 LINPACK benchmark). Other algorithms are less compatible with this paradigm; they're going to be slow no matter how good the optimization is. Examples include many iterative physics simulation problems, sparse matrix code, and graph algorithms (the Top500 HPCG benchmark). In the Top500 list, HPCG is usually 1% as fast as LINPACK. The best-optimized simulation code can perhaps reach 20% of Rpeak.
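
If it helps to make the memory wall concrete, here's a back-of-the-envelope roofline sketch in Python. The peak and bandwidth figures are made-up placeholders (not any specific chip), and real kernels land below these bounds:

  # attainable FLOPS ~= min(peak FLOPS, arithmetic intensity * memory bandwidth)
  peak_tflops = 60.0   # assumed peak throughput, TFLOPS
  mem_bw_tbs = 0.3     # assumed memory bandwidth, TB/s

  def attainable_tflops(flops_per_byte):
      # roofline bound for a kernel with the given arithmetic intensity
      return min(peak_tflops, flops_per_byte * mem_bw_tbs)

  # blocked dense matmul reuses data heavily; SpMV/stencils barely reuse it
  for name, intensity in [("dense matmul", 50.0), ("SpMV / stencil", 0.25)]:
      t = attainable_tflops(intensity)
      print(f"{name}: ~{t:.2f} TFLOPS ({100 * t / peak_tflops:.2f}% of peak)")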

This is why both Intel and AMD started offering special large-cache CPUs, using either on-package HBM or 3D V-Cache. They're all targeted at HPC. Meanwhile, in machine learning, people also made the switch to FP16, BF16 and INT8 largely because of the memory wall. Inference is a relatively cache-friendly problem; many HPC simulations are much worse in this respect.

Next, even if the algorithm is well suited for cache blocking, peak datasheet performance is usually still unobtainable because it's often calculated from the peak FMA throughput. This is unrealistic in real problems; you can't just do everything in FMA, so 70% is a more realistic target. In the worst case, you get 50% of the performance (disappointing, but not as bad as the memory wall). In contrast to datasheet peak performance, the LINPACK performance Rmax is measured by running a real benchmark.


When you measure peak FLOPS, especially "my desktop computer has X FLOPS", you're generally computing N FMA units * f frequency: a theoretical maximum FLOPS figure. This number, as you note, has basically no relation to anything practical: we've long been at the point where our ability to stamp out ALUs greatly exceeds our ability to keep those units fed with useful data.
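
To make that arithmetic concrete, a minimal sketch with illustrative numbers rather than any particular CPU:

  # datasheet peak ~= cores * SIMD lanes * FMA pipes * 2 ops per FMA * clock
  cores = 16          # assumed core count
  simd_lanes = 8      # e.g. 8 FP64 lanes per 512-bit vector unit
  fma_pipes = 2       # FMA units per core
  ops_per_fma = 2     # a fused multiply-add counts as two FLOPs
  freq_ghz = 3.0      # assumed sustained clock

  peak_gflops = cores * simd_lanes * fma_pipes * ops_per_fma * freq_ghz
  print(f"theoretical peak: {peak_gflops:.0f} GFLOPS")  # 1536 GFLOPS here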

Top500 measures FLOPS on a different basis. Essentially, see how long it takes to solve an N×N system Ax=b (where N is large enough to stress your entire system), and use a standard formula to convert N and the runtime into FLOPS. However, this kind of dense linear algebra is an unusually computation-heavy benchmark--you need to do about n^1.5 FLOPS per n words of data. Most kernels tend to do more like O(n), or maybe as high as O(n lg n), work for O(n) data, which requires a lot more memory bandwidth than good LINPACK numbers do.
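
For reference, the usual HPL-style accounting looks something like this. The problem size and runtime below are made up; the 2/3*N^3 + 2*N^2 operation count is the commonly quoted HPL formula:

  N = 1_000_000                      # assumed matrix dimension
  seconds = 3600.0                   # assumed time to solution
  flops = (2 / 3) * N**3 + 2 * N**2  # HPL operation count
  print(f"Rmax ~= {flops / seconds / 1e12:.1f} TFLOPS")

  # data is ~N^2 words while work is ~N^3 flops, so intensity grows with N
  print(f"~{flops / (N * N):.0f} flops per word of data")  # roughly 2/3 * N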

Furthermore, graph or sparse algorithms tend to do really badly because the amount of work you're doing isn't able to hide the memory latency (think one FMA per A[B[i]] access--you might be able to do massive memory-bandwidth fetches for the B[i] accesses, but you end up with a massive memory gather operation for the A[x] accesses, which is extremely painful).
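
A sketch of the access pattern being described, assuming a generic sparse/gather kernel: the B[i] stream is sequential and prefetches fine, but each A[B[i]] is effectively a random access the FMA has to wait on.

  def gather_dot(A, B, x):
      # one multiply-add per element, but A[B[i]] is a scattered read
      acc = 0.0
      for i in range(len(B)):
          acc += A[B[i]] * x[i]
      return acc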


> Meanwhile in machine learning, people also made the switch to FP16, BF16 and INT8 largely because of the memory wall

FP16 doesn't work any faster than mixed precision on Nvidia or any other platform (I have benchmarked GPUs, CPUs and TPUs). For matrix multiplication, computation is still the bottleneck due to N^3 computation vs. N^2 memory access.


With FP16 you can fit twice as many weights in cache, and also fetch twice as many weights from memory.

Also, this depends on the size of the matrix.
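
A rough way to see both points at once, treating each matrix as read or written exactly once (ignoring cache blocking, so take the numbers as illustrative only):

  def flops_per_byte(N, bytes_per_element):
      flops = 2 * N**3                             # multiply + add per term
      bytes_moved = 3 * N * N * bytes_per_element  # read A and B, write C
      return flops / bytes_moved

  for N in (64, 4096):
      print(N, "FP32:", round(flops_per_byte(N, 4), 1),
               "FP16:", round(flops_per_byte(N, 2), 1))
  # small matrices stay memory-bound; halving the element size doubles the ratio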


The 4090 provides over 80 TFLOPS of bog-standard raw FP32 compute, no tensor cores or MAD/FMA or any fancy instructions.


An SF book published in the 1950s (I have forgotten title and author, sigh) featured a then-imagined supercomputer with

- 1M bits of storage

- A mile on a side

- Located in Buffalo, NY, and cooled by Niagara Falls (vacuum tubes, natch)

- Able to surveil every citizen in the nation for suspicious activity

No mention of clock speed, cache-line size, or instruction set. I guess SF writers aren't computer designers :-)


The writers could still turn out to be right. I am not sure we are making good use of all that hardware yet.


The only thing keeping us alive is that skynet is an electron app.


> I am not sure we are making good use of all that hardware yet

Dunno, working out the color of 8 million pixels every 6ms seems pretty good to me


True, though I was talking about the AI workloads.


Damn high-level programming languages. Just go back to assembly, that'll fix everything.


Yeah how dare they. ;)

Truth be told though, I believe we are in for some more innovation in the area, especially with the advent of ARM lately. It's always kinda funny how these mega-machines we have still manage to stutter.


> It's always kinda funny how these mega-machines we have still manage to stutter.

I just figured that’s the trade-off for general purpose computing. We can optimize for whatever covers the wide swath of use cases, but we can’t optimize for everything, and some will continue to be mutually exclusive. Mind you, I’m no expert; I’m just extrapolating from how differently CPUs and GPUs are optimized these days and historically.


Nah, you are very correct. I just feel that our compilers could still do a better job with general-purpose code, because they are usually completely blind to the systems the compiled code runs on (as in, I/O takes orders of magnitude more time, for example, and this could be used to auto-parallelize code; but I know compiler authors will never auto-spawn threads). I feel this can be improved a lot, but for various (and likely good) reasons our tooling is not as good as it could be.


Good points. I wish I knew more about compilers sometimes, and this is one of those times.


> The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number for the world-ending AI that nothing could possibly come close to it, and 20 years later consumers can have twice that computing power in their home PCs.

If you look at the top500 supercomputer list of the time [1], they actually nailed it: the #1 system at the time hit a peak of 40 TFLOPS.

[1] https://www.top500.org/lists/top500/2003/06/


Isn’t it a great reminder that technology has not progressed enough to even take advantage of 60 TFLOPS?


> to even take advantage of 60TFLOPS

Rendering Electron apps and mining Dogecoin?


In scientific computing, it has become a serious problem. Because of the memory wall, many important algorithms can never take advantage of 60 TFLOPS due to their low arithmetic intensity. The only solutions are to (1) stop using these algorithms, or (2) stop using von Neumann computers (e.g. in-memory computing). The stop-gap solution is HBM or 3D V-Cache.


> In other mind-boggling movie lore, an RTX 4090 has 2x the TFLOPS of the killer AI Skynet in Terminator 3[1].

That isn't really mind boggling since you are quoting fiction.


>That isn't really mind boggling since you are quoting fiction

Fiction of the past plays an important role in seeing how far tech has progressed: what was once fiction is now a commodity.


How does this opinion explain a made-up number being "mind boggling"?


What makes you think it's a made-up number? Just because it's been featured in a movie doesn't mean the number can't be grounded in the reality of the era. Yes, there are exaggerations, but big budget movies usually hire technical consultants to aid writers, prop builders and art directors with setting scenes that look realistic, and don't just pull random numbers out of thin air, which could be embarrassing mistakes for tech-savvy moviegoers.

60 TFLOPS is the equivalent of 10,000x PS2s of processing power, the most powerful console at the time, or 2x the NEC Earth Simulator, the most powerful supercomputer at the time, which seems about right for what would be a virus taking over all the compute power of the DoD.

So definitely the writers consulted with some people who knew something about computers to get a figure grounded in the reality of the time, and didn't just pull a random number out of thin air; especially since at the time even average Joes were hearing about FLOPS as a measure of compute power, advertised in PC and gaming console specs, so naturally they had to come up with a number that seemed very impressive but was also believable.


> What makes you think it's a made-up number?

It's a fictional prediction of the future. Even people trying to predict the future get it wrong. People being wrong isn't mind blowing.

> Yes, there are exaggerations, but big budget movies usually hire technical consultants to aid writers,

Is that what happened here?

> prop builders and art directors with setting scenes that look realistic

That has nothing to do with the script

> don't just pull random numbers out of thin air

Yes they do

> 60 TFLOPS is the equivalent of 10,000x PS2s of processing power

60 TFLOPS was also about where the biggest supercomputer already was in 2003, so this was silly even using the present. That's fine, but it isn't "mind boggling" to base it on fiction.

https://en.wikipedia.org/wiki/TOP500

Why not say that the biggest supercomputer in the world from 20 years ago is now matched by a home graphics card? That's actually mind boggling. No need to live your life based on the fiction of someone else.

> which seems about right for what would be a virus taking over all the compute power of the DoD.

Why does that "seem about right"? Again, this is fiction vs. reality. That is a science fiction scenario that should make no sense to anyone experienced with computers. Why would a virus need a supercomputer?

> So definitely the writers consulted with some people

No, you've moved from 'technical consultants exist' to 'definitely the writers consulted people'. What are you basing this on?

> so naturally they had to come up with a number that seemed very impressive but was also believable.

Which part of the made-up number is mind blowing again?

Reality is 'mind blowing' enough, there is no need to mix reality and fiction.


It's not fiction that the writers thought 60TFLOPS would be huge today.


It kinda seems like the writers (writer?) either consulted someone or did the math and calculated where pretty powerful computers would be by now, and that the T-800 was more of a mid-tier model, and that there were higher-tier models (or AIs that ran in data centers) that individually ran on 4090 power and above.


So what?

It's a made up number that's supposed to sound fancy. It is for people who don't know much about computers. It's probably just there because people have heard the prefix 'tera', but wouldn't know what 'exa' or any other prefix means.

It doesn't mean anything. That documentation made by people has more pages than a CPU (also made by people) has transistors is interesting because these are real things made for specific purposes, not a number pulled out of thin air for fiction.

There is nothing 'mind blowing' about an uninformed person just being wrong. Is it 'mind blowing' that the original terminator was supposed to run on a 6502?

In Johnny Mnemonic 320 GB was supposed to be a lot of data in 2021 when it costs the same as lunch for two people.

https://www.imdb.com/title/tt0113481/plotsummary/


> Anyway, at the time I did these measurements, my 4.2 GHz kaby lake had the fastest single-threaded performance of any machine you could buy but had worse latency than a quick machine from the 70s (roughly 6x worse than an Apple 2), which seems a bit curious. To figure out where the latency comes from, I started measuring keyboard latency because that’s the first part of the pipeline. My plan was to look at the end-to-end pipeline and start at the beginning, ruling out keyboard latency as a real source of latency.

> But it turns out keyboard latency is significant! I was surprised to find that the median keyboard I tested has more latency than the entire end-to-end pipeline of the Apple 2. If this doesn’t immediately strike you as absurd, consider that an Apple 2 has 3500 transistors running at 1MHz and an Atmel employee estimates that the core used in a number of high-end keyboards today has 80k transistors running at 16MHz. That's 20x the transistors running at 16x the clock speed -- keyboards are often more powerful than entire computers from the 70s and 80s! And yet, the median keyboard today adds as much latency as the entire end-to-end pipeline of a fast machine from the 70s.

https://danluu.com/keyboard-latency/


> https://danluu.com/keyboard-latency/

This might be a bit off topic, but it was surprising to see a Logitech K120 have the same latency as a Unicomp Model M or other keyboards that are 5-10x more expensive than it.

No wonder I liked using it for work years ago: as far as membrane keyboards go, it's pretty dependable and decently nice to use, definitely so for its price.


A USB-C charger has much more computing power than the Apollo Moon lander.

https://www.theverge.com/tldr/2020/2/11/21133119/usb-c-anker...


We'll have computronium soon if we carry on like this!


But it is seriously i/o deficient!


The measurement methodology seems a bit odd for the purposes of measuring the difference between old and new computers: if a large fraction of the latency measured is due to the key travel, that's latency which is also present in the older computers (AFAICT buckling spring has a lot more key travel before activation than the scissor-switch keys of the Apple and most laptop keyboards). Surely for the purposes of the comparison you would want to look at switch-activation-to-bus-activity latency.


Why have that kind of resource in a keyboard?

Some keyboards were made with 4 bit processors. I have yet to look one up and perhaps I should.

Pretty much any 8 bit CPU would be luxurious. And low latency due to the single task, respectable code density, and rapid state changes for interrupts.


That write-up is fantastic, but it's undated and probably from 2016/2017.


For reference, the Atari 2600 had 128 bytes of RAM, with about 30 million devices sold.


I thought I sort of understood how computers work until I saw that.

I really can't figure out how to do a full screen video game with state in 128B


The program and assets are stored in a ROM cartridge, so only mutable data needs RAM.

Actually drawing things on screen depends on two things:

The first is racing the beam. The display output traces across the entire screen at 60Hz, one scanline at a time. At no point does a complete image exist; instead you just make sure to apply changes so that the machine can draw what needs to be drawn just before the beam traces that part of the image. You will need to cycle-count so that your program takes exactly the right time to execute every section, because you certainly won't have time for interrupts or other synchronization.

The second is using dedicated hardware, where you store the location on screen, color and memory address of a sprite, and the hardware draws it for you. There are a very limited number of sprites available, which limits how much can happen in a single line. (A rough cycle-budget sketch follows below.)
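
Here it is, using the commonly cited NTSC figures (treat them as approximate; the arithmetic just shows how tight the budget is):

  color_clocks_per_line = 228   # TIA color clocks per scanline (NTSC)
  cpu_divider = 3               # the 6502 runs at 1/3 of the TIA clock
  lines_per_frame = 262         # NTSC scanlines per frame

  cpu_cycles_per_line = color_clocks_per_line // cpu_divider    # 76
  cpu_cycles_per_frame = cpu_cycles_per_line * lines_per_frame  # 19,912

  print(cpu_cycles_per_line, "CPU cycles to feed the TIA on each scanline")
  print(cpu_cycles_per_frame, "CPU cycles per frame for display + game logic")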


There was no framebuffer in those consoles [1]. So you pretty much only have to store game state and some auxiliary data in those 128 bytes, which starts sounding a lot easier.

[1] https://en.wikipedia.org/wiki/Television_Interface_Adaptor


or, a lot harder, since your code can only draw a line at a time, not work with a whole frame buffer!


Modern games now have programmers deal with drawing a frame a pixel at a time when writing shaders. The GPUs themselves render a tile at a time and not the whole buffer.


Look up 'racing the beam' if you haven't before. The answer is... you can't! It didn't have a frame buffer and lines had to be written to the display one at a time. There was a lot of sprite flicker as many games had more on screen than the console could actually display in one frame.


Pacman was horrible.


There is no frame buffer. The graphics are all drawn by manipulating a few registers in the video chip.

Everything is scan lines and cycles. You get a one-bit-per-pixel, 40-pixel-wide background, a couple of single-color 8-bit sprites and a couple more two-bit-wide sprites, and that is pretty much it. A lot can be done by simply changing a color register at specific times too. Reusing sprites happens regularly as well. (A sprite drawn at the left of the screen can be repositioned to the right to be seen again. That is the "racing the beam" part you may have heard people mention.)

Most of the CPU run time available for each frame is spent generating the display a scan line at a time.

The real game happens during the vertical blanking period.

Almost everything comes from ROM, leaving RAM for game state and the few objects that may need to be dynamic, and even those are in ROM when there is room for all the states.

It is an odd feeling when you run out of room. The phrase, "I used every BIT of RAM" is literal! Happened to me once. No more bits and I had to either leave out a feature, or take a speed penalty by packing multiple states into single bytes.
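
For anyone curious what "packing multiple states into single bytes" looks like, a generic illustration (not the poster's actual game): two 4-bit values share one byte, at the cost of a few extra operations to pack and unpack.

  def pack(hi4, lo4):
      # store two 0..15 values in a single byte
      return ((hi4 & 0x0F) << 4) | (lo4 & 0x0F)

  def unpack(b):
      return (b >> 4) & 0x0F, b & 0x0F

  state = pack(9, 3)              # e.g. lives and level in one byte
  lives, level = unpack(state)    # (9, 3)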


It's basically only the state in the RAM; the game code is in ROM on the cartridge (you can have up to 4KB of ROM before you have to rely on bank-switching tricks). Video on the 2600 is weird: there isn't any video memory to speak of, you basically set up the video chip line by line in code.


Great video [1] on how some clever tricks are used to stay within memory constraints.

[1]: https://www.youtube.com/watch?v=sw0VfmXKq54



Would be more impressive if the 2600 had more than 128 bytes of RAM -- that's bytes, not KB.


Wow, I was surprised to see my article from 2013 appear on HN today! (It's hard to believe that the article is 10 years old.) In that time, the x86 documentation has expanded from four volumes to 10 volumes. Curiously, the number of pages has only grown by 21% (4181 pages to 5066). The x86 instruction set has added a bunch of features: AVX, FMA3, TSX, BMI, VNNI, and CET. But I guess the number of new instructions is relatively small compared to the existing ones.


Back in 2006, Intel's Montecito had a transistor budget for more ARM6 cores than the ARM6 had transistors, so:

Transistor : ARM6 = ARM6 : Montecito

And that was a long time ago: Montecito had a measly 1.72 billion transistors; Apple's M2 Ultra has 130 billion.

Makes the whole Transputer ( Transistor:Computer ) idea seem somewhat prescient.

https://blog.metaobject.com/2007/09/or-transistor.html

And an idea how you could use those transistors differently from now:

https://blog.metaobject.com/2015/08/what-happens-to-oo-when-...


I know it's not a straight conversion, but I've wondered what performance would be like on a multi-core, multi-gigahertz 6502 successor. Even a 486 had a million transistors, think how many 6502's could be on a die with that same count.


For an idea in that vein (many 6502's on a die), look at (the sadly defunct) Epiphany, in the Parallella [1] (the buy links are still there but I doubt any are still available).

I have two, and they're fun little toys; I wish they'd gotten further, with the caveat that actually making use of them in a way that'd be better than just resorting to a GPU is hard.

The 16-core Epiphany[2] in the Parallella is too small, but they were hoping for 1k or 4k core versions.

I'm saying it's a similar idea because the Epiphany cores had 32KB of on-die RAM per core, and a predictable "single cycle per core traversed" (I think; I haven't checked the docs in years) cost of accessing RAM on the other cores, arranged in a grid. Each core is very simple, though still nowhere near the simplicity of a 6502 (they're 32-bit, with a much more "complete" modern instruction set).

The challenge is finding a problem that 1) decomposes well enough to benefit from many cores despite low RAM per core (if you need to keep hitting main system memory or neighbouring core RAM, you lose performance fast), 2) does not decompose well into SIMD-style processing where a modern GPU would make mincemeat of it, and 3) is worth the hassle of figuring out how to fit your problem into the memory constraints.

I don't know if this is an idea that will ever work - I suspect the problem set where they could potentially compete is too small, but I really wish we'd see more weird hardware attempt like this anyway.

[1] https://parallella.org/

[2] https://www.adapteva.com/docs/e16g301_datasheet.pdf


That reminds me of the Green Arrays computer with 144 cores, programmed in Color Forth.

Chuck Moore has a talk on it here, very interesting stuff: https://www.youtube.com/watch?v=0PclgBd6_Zs

I'm not sure how far this has gone since then. The site is still up with buy links as well: https://www.greenarraychips.com


Yeah, I think they're even more minimalist. It's an interesting design space, but hard to see it taking off, especially with few developers comfortable with thinking about the kind of minimalism required to take advantage of these kinds of chips.


Completely agree, it's an almost total shift from everything that's in use these days. Very interesting to play with though, I'd love to see some real world use.


And 4) how to handle problems that don't map well onto a massively parallel machine. Some problems / algorithms are inherently serial by nature.

Some applications like games decompose relatively easily into separate jobs like audio, video, user input, physics, networking, prefetching game data etc. Further decomposing those tasks... not so easy. So eg. 4..8 cores are useful. 100 or 1k+ cores otoh... hmmm.


True, but I'd expect for a chip like that you'd do what the parallella did, or what we do with CPU + GPU and pair it with a chip with a smaller number of higher powered cores. E.g. the Parallella had 2x ARM cores along with the 16x Epiphany cores.


for control parallelism. for data parallelism you get to soak up as many cores as you want if you can structure your problem the right way.

personally I'd love to see more exploration of hybrid approaches. the Tera MTA supported both modes pretty well.


I'm not a hardware engineer, so take this with a grain of salt:

The 6502 doesn't have a cache, only external memory. So performance would probably be much worse than naively expected (except perhaps for workloads that fit in CPU registers; edit: not even those, due to the lack of a code cache as well). Memory latencies haven't improved nearly as much as CPU speeds have, which is why modern CPUs have big caches.

The CPU would be idly waiting for memory to respond most of the time, which completely kills performance.


That's because on (most) modern systems, main RAM is combined (shared memory space) and external to the CPU, connected through a fat but high-latency pipe.

A solution is to include RAM with each CPU core on-die. Afaik this is an uncommon approach because semiconductor fabrication processes for RAM vs. CPU don't match well? But it's not impossible - ICs like this exist, and eg. SoCs with integrated RAM, CPU caches etc. are a thing.

So imagine a 'compute module' consisting of 6502 class CPU core + a few (dozen?) KB of RAM directly attached + some peripheral I/O to talk to neighbouring CPU's.

Now imagine a matrix of 1000s of such compute modules, integrated on a single silicon die, all concurrently operating @ GHz clock speeds. Sounds like a GPU, with main RAM distributed across its compute units. :-)

Examples:

  GreenArrays GA144
  Cerebras Wafer Scale Engine
(not sure about nature of the 'CPU' cores on the latter. General purpose or minimalist, AI/neural network processing specialized?)

Afaik the 'problem' is more how to program such systems easily / securely, and how to arrange common peripherals like USB, storage, video output etc. in a practical manner so as to utilize the immense compute power + memory bandwidth in such a system.


> Now imagine a matrix of 1000s of such compute modules, integrated on a single silicon die, all concurrently operating @ GHz clock speeds. Sounds like a GPU, with main RAM distributed across its compute units. :-)

Sounds like a new version of the Connection Machine (the classic / original one).


The 6502 didn't need cache because its clock speed was slower than the DRAM connected to it. That made memory accesses very inexpensive.

The Apple II took advantage of the speed difference between CPU and DRAM and designed the video hardware to read from memory every other memory cycle, interleaved with CPU memory access.


> The Apple II took advantage of the speed difference between CPU and DRAM and designed the video hardware to read from memory every other memory cycle, interleaved with CPU memory access.

Same for the C64. Sometimes it was necessary to read slightly more than that for the video display though, so the VIC (video chip) had to pause the CPU for a bit, resulting in so-called "badlines".


I think they used SRAM and not DRAM, which is how it was faster than a clock cycle.


Both were faster. DRAM was a little cheaper, but required more circuitry to handle refresh on most CPUs, making it a wash cost-wise on some designs. Typical Woz engineering got the cost of the refresh circuitry down to where DRAM made sense economically on the Apple ][.

Interestingly, Z80s had DRAM refresh circuitry built in, which was one reason for their prevalence.


Screen output was the DRAM refresh.

And for the Z80: also that it only needed GND and 5V. The 8080 also needed 12V. And the Z80 only needed a single clock phase -- the 8080 needed two. The 6502 also only needed 5V and a single clock input (the 6800 needed two clock phases). The 6502 and Z80 were simply a lot easier to work with than most of the competition.


Nope, they used 16k x 1-bit DRAM ICs.


You'd probably want to add at least a small cache, sure. Putting the zero page (256 bytes) on-die on each core (because it's typically used as "extra registers" of sorts in 6502 code) plus a small additional instruction cache could do wonders. But you probably still wouldn't get anywhere near enough performance for it to be competitive with a more complex modern architecture.

It'd be a fun experiment to see someone do in an FPGA, though.


> The 6502 doesn't have a cache

I don't think CPU caches were much of a thing back then, at least in the segments involved. AFAIK Intel wouldn't get on-die cache until the 486 (the 386 did support external L1, I think IBM's licensed variants also had internal caches but they were only for IBM to use).

The most distinguishing feature of the 6502 is that it had almost no registers and used external memory for most of the working set (it had shorter encodings for the first page, and IIRC later integrators would use faster memory for the zero page).


> I don't think CPU caches were much of a thing back then

Indeed, because memory was fast enough to respond within a single cycle back then. Or alternatively, CPU cycles were slow enough for that to happen depending on how you want to look at it :)


6502 systems had RAM that ran at system speed. Everything was so much slower then so this was feasible... plus, some systems (like the Commodore VIC-20) used fast static RAM.

The 6502 even had an addressing mode wherein accesses to the first 256 bytes of RAM -- the "zero page" -- were much faster due to not needing address translation, to the point where it was like having 256 extra CPU registers. Critical state for hot loops and such was often placed in the zero page.

Do not presume to apply principles of modern systems to these old systems. They were very different.


The saving for the zero page is only the more compact encoding, which saves one cycle for the memory access (for the non-indexed variants, anyway).

E.g. LDA $42 is encoded as $A5 $42, and so takes 3 memory accesses (two to read the instruction, one to read from address $42). LDA $1234 is encoded as $AD $34 $12, and so takes 4 cycles: three to read $AD $34 $12, and one to read the contents of address $1234. Same for the other zero-page vs. equivalent absolute instructions.

See e.g. timing chart for LDA here:

http://www.romdetectives.com/Wiki/index.php?title=LDA


Now I have to wonder whether it would be possible to build a cache around an unmodified 6502. I.e. a component that just sits between the (unmodified) CPU and memory and does caching.


I don’t think an unmodified 6502 would get faster; it assumes memory is fast enough for its clock speed, and won’t, for example, prefetch code or data.

Also, if you add a modern-sized cache, you won’t need the memory at all in a single-CPU system; the cache would easily be larger than the addressable memory.


It'd only benefit if the latency of the external memory is bad enough to keep the cores waiting, sure, and certainly you're right regarding a single-CPU system. I think this fantasy experiment only "makes sense" (for low enough values of making sense) if you assume a massively parallel system. I don't think it'd be easy (or even doable) to get good performance out of it by modern standards - it'd be a fun weird experiment though.


If I were to build a system with lots of 6502s, I think I would build something transputer-like, so that each CPU has its own memory, and the total system can have more than 64 kB of RAM.

The alternative of a massively parallel system with shared memory, on 6502, would mean all CPUs have 64 kB of memory, combined. I think that will limit what you can do with such a system too much.


See: https://news.ycombinator.com/item?id=37011671

These were intended to get to 4k cores (but 32-bit), w/32K on-die memory, but with the ability to read from/write to other cores. A 6502-inspired variant would be cool, but to not get killed on the cycle cost of triggering reads/writes you'd probably want a modified core, or maybe a 65C816 (16-bit, mostly 6502 compatible).


Should be; the 6502 has a RDY signal to insert wait states in case accesses aren't ready yet. You'll need some way to configure the cache for address ranges and access types, but that would presumably just be more MMIO.


The Apple IIc+ did this with a 4MHz 65C02.


Back then, memory latency was shorter than a CPU clock cycle, so a cache was not needed.


No division, no floating point, only 8-bit multiplication and addition, no pipeline, no cache, no MMU, no preemptive multitasking, very inefficient for higher level languages (even C). But you would get about 450 of them for the number of transistors on a 486.


Multitasking is tricky because the stack pointer is hard-coded to use page 1. It first became movable on the 65CE02 and on the 16-bit 65C816.

In a code density comparison, the 6502 is as bad as some RISC processors where instructions are four bytes instead of one. <https://www.researchgate.net/publication/224114307_Code_dens...>

But it was fast, with the smallest instructions taking 2 cycles (one to read the op-code, one to execute). A 1 MHz 6502 is considered on par with a 3.5 MHz Z80, overall.


The 6502 does not have any multiplication instructions. It does have very fast interrupt latency for the processors of its time, though.


Not very good, I wouldn't think. It didn't have many instructions or registers. There would be a LOT of memory reads/writes, a lot of extra finagling of things that modern architectures can do in one or two instructions very cleanly.

Assuming no extra modern stuff was added on top of the 6502 and you just got a bog-standard 6502 just at a very high clock speed, then as the other comment mentioned there was no memory caching, no pipelining (though there have been lots of people interested in designing one with a pipeliner), and not a whole lot of memory space to boot as it had 16-bit wide registers.

Thus, most 6502 applications (including NES games, for example) had to use mappers to map in and out different physical memory regions around the memory bus, similar to modern day MMUs (just without the paging/virtualization). It would be hammering external memory like crazy.


> and not a whole lot of memory space to boot as it had 16-bit wide registers.

If only. Its program counter is 16 bits, but the other registers are 8 bits.

> It would be hammering external memory like crazy.

For some 6502s, like crazier. The NMOS 6502 has a ‘feature’ called ‘phantom reads’ where it will read memory more often than required.

https://www.bigmessowires.com/2022/10/23/nmos-6502-phantom-r...:

“On the 65C02, STA (indirect),Y requires five clock cycles to execute: two cycles to fetch the opcode and operand, two cycles to read two bytes in the 16-bit pointer, and one cycle to perform the write to the calculated address including the Y offset. All good. But for the NMOS 6502 the same instruction requires six clock cycles. During the extra clock cycle, which occurs after reading the 16-bit pointer but before doing the write, the CPU performs a useless and supposedly-harmless read from the unadjusted pointer address: $CFF8. This is simply a side-effect of how the CPU works, an implementation detail footnote, because this older version of the 6502 can’t read the pointer and apply the Y offset all in a single clock cycle.”

I don’t know whether it’s true, but https://www.applefritter.com/content/w65c02s-6502 (Comment 8) claims:

“The Woz machine on the Disk II card is controlled by the address lines (which set its mode) and exploits this "phantom read" to switch from SHIFT to LOAD mode during floppy disk write operations so in the CPU cycle immediately following the phantom read, which is the write cycle, to the same address as the phantom read (assuming no page crossing this time), the Woz machine will be in LOAD mode and grab the disk byte ("nibble") from the 6502 data bus. The Woz machine does a blind grab of that byte, it does not look at the R/W signal to determine if it's a write cycle. If the phantom read is missing, it just won't work because the data byte will be gone before the mode changes to LOAD.

I am quite sure that this is how the Disk II works. But this is where my own hands-on experience ends.”


> If only. Its program counter is 16 bits, but the other registers are 8 bits.

Oh wow, I didn't remember that. I had to go back to double check it! Maybe the NES had wider registers? Or maybe I'm just completely misremembering, which is definitely probably absolutely the case.

Also, thanks for all the other links! Really interesting stuff.


There's the mostly 6502 compatible 16-bit WDC 65C816 that'd probably be a much better, less painfully minimalist starting point.

Apart from an MMU, the biggest hurdle I think for modern dev on the original 6502 is the lack of multiply/divide and the pain of having to either forgo modern levels of stack use or emulate a bigger stack.


Back in the day, Rockwell had a dual 6502 variant in their databook. Two cores, each basically interleaving memory access.

The fun part about the 6502 and many friends from that era was that memory was direct access, bytes moving in and out every CPU cycle. In fact, the CPU was slow enough that this often did not happen, allowing for things like transparent DMA.

Clock a 6502 at some GHz frequency and would there even be RAM fast enough?

Would be a crazy interesting system though.

I have a 16 MHz one in my Apple and it is really fast! Someone made a 100 MHz one with an FPGA.

Most 6502 ops took 3 to 5, maybe 7 cycles to perform.

At 1 MHz it is roughly 250 to 300k instructions per second.

1 GHz would be the same assuming fast enough RAM, yielding 300M instructions per second.

I have always thought of it as adds and bit ops per second, and that is all a 6502 can really do.


This comment makes me want to read an article focusing on how performance-per-transistor has changed over time.


I'm trying to be wowed by this, but... yes, modern tech be complex. A modern website has more source code than the first few versions of Windows put together.


anecdote: most 'high volume source code' sites i've encountered are that way because of a need to meet standards on a large number of devices, compliance regulations, and scaffolding/boilerplate; 90% of it being dead-code boilerplate.

the logic comes at the very tail end, and often it is exceedingly limited, handing off most of the work to third party APIs/whatever.

I guess what i'm trying to say is that source-code volume is a lousy metric for ascertaining 'complexity'; something can be huge and cumbersome but still only use simple logic that's easy to follow once you get past the cruft.


What you describe is basically how DNA works. Most of it is inactive junk. Parts of it, however, get activated when... who knows what happens.

Basically we got coded by millions of interns.


And some of the code is never called, but removing it causes weird crashes because the cell.exe compiler is finicky. And some of the code shouldn't be there, but it was left behind by a virus infection that wasn't completely cleared out.


Most are done this way since they are outsourced to the cheapest bidders, who "glue" multiple technologies poorly on top of each other. And then add tons of ads / tracker / analytics code.

Lots of websites don't care that their technology is terrible.


It probably also has more bugs than the 6502 has transistors :)


All processors have bugs. Despite its simplicity, the 6502 was hardly bug-free: https://en.wikipedia.org/wiki/MOS_Technology_6502#Bugs_and_q...


There are transistors that have more than one page of documentation.


Brilliant!


Update to the 2013 numbers: As of today, the Intel SDM (https://cdrdv2-public.intel.com/782158/325462-sdm-vol-1-2abc...) has 5066 pages.


There are parts that should be read, but when you only need a reference a good lookup tool is invaluable:

https://github.com/skeeto/x86-lookup


That is an intriguing stat.


So what? The 6502 is stuck at what, 40 years ago?


50.

The 6502 was a famously simplistic and cheap processor, with about 2/3rds the transistors of an 8080 or 6800 (and like 20~25% the price at release), half that of a Z80.



