How fast can a 6502 transfer memory?

MarkusWandel · on June 16, 2022

This also misses loop unrolling, combined with an assembly language version of "Duff's device" to be able to do an arbitrary number of transfers even if your loop is unrolled to, say, 8 transfers.

This stuff used to matter! I had an NCR5380 chip on an Amiga, simple, memory mapped I/O, no DMA or interrupts. To get a tape drive to stream (remember that?) the byte transfer loop really had to be tweaked. But once fully tweaked, "whoooooooosh" instead of "chugga chugga chugga".

And truly heroic programming techniques had to be employed on the C64 to do X/Y smooth scrolling games. Often a static part of the screen, conveniently displaying scores etc, existed to make it work - there was just enough bandwidth to do 80% of the screen, say, so you find an excuse to keep the rest of it static.

I kinda miss those days, and I kinda don't. I guess it was good to have experienced them.

vidarh · on June 16, 2022

Most C64 games used character-based graphics (coupled with the smooth scrolling support in the VIC) which meant you'd at most move 2000 bytes to scroll the entire screen every 4 to 8 pixels scrolled.

You can easily scroll the entire screen on a C64 if that's all you're doing.

Some games did also scroll bitmaps. There the naive version requires moving 9000 bytes (40x25x8 for the bitmap data, 40x25 for the colour data) every time you need to scroll, and that indeed starts to bite. There are games which reduces the cost this using a trick called AGSP ("Any Given Screen Position").

But you're right static parts of the screen were often larger to reduce the dynamic part. That was rarely down to just scrolling the screen in isolation, though, but because the overall budget of cycles you had to work with was tiny. Often you might also have a lot of other stuff which consumed lots of cycles affected directly by the size of the playing field. E.g. if you did sprite multiplexing (moving a sprite after it had been partially or fully rendered to reuse the same hardware sprite), you might well be keeping the CPU busy throughout the full rendering of the playing field.

There was also the consideration of how much effort you wanted to go to in order to avoid glitches, since unless you could do the scrolling entirely while the VIC was rendering the parts of the screen outside the playing field, you'd need to make sure the rendering and copying didn't overlap, and of course just restricting playing field size was an easy workaround for that problem.

jimsmart · on June 16, 2022

Ex C64-games coder here! — If your sprite multiplexer was taking most of CPU during the screen draw time, then honestly it was not a particularly great multiplexer! ;)

Most decent multiplexers took just a scanline or two/three, multiple times down the screen (i.e. whenever relocating any already drawn sprites) — often with decent sized gaps (time when the CPU wasn't involved in manipulating sprites and could do other things), with a larger chunk during the offscreen period / at the bottom of the screen, when one was preping the data (mostly sorting the sprite's y-coords) for the next frame's screen draw.

— During debugging/etc, we'd often enable colour changes to the screen border, at the beginning and end of the multiplexer code (for both the interrupt stuff in the playfield, and the non-playfield section), so we could visually see how it was working/performing.

cesaref · on June 16, 2022

64 coder here too!

The border changing thing has just reminded me how bad the development process was using the commodore assembler with a 1541 drive which was horribly slow. assemble, dump image, reboot, crash, reboot, load assembler, try and work out what had happened :)

At some point I ended up with a PC running a system called, I think PDS, which was a cross assembler with dongle to push the image straight into the memory of the C64. I even think you could inspect and change memory on the running machine - it was amazing!

jimsmart · on June 16, 2022

Yeah, we all used PDS too, although not originally. Pretty good system, particularly for that era, and cost/capability-wise (though they weren't that cheap, and folk eventually started cloning the boards for them, IIRC).

I remember it was annoying to have only 8 main source files in PDS though, most big projects went past the 8 files of however many kb (although it could also handle include files, which was how one got around that limit).

Although when I actually started out as a C64 games dev, my dev system was a BBC Micro B, linked to a C64. Not quite a cool as PDS, but it could assemble code 2x the speed of the C64 (the processor clocked twice the speed on the Beeb), and it was great having a separate 'host' system for development.

jimsmart · on June 16, 2022

Here's a link to info about the PDS kit, in case anyone is interested:

https://www.cpcwiki.eu/index.php/PDS_development_system

vidarh · on June 16, 2022

Sure, the "nice" way of doing it is to rely on the raster interrupt. But I've also seen way too much C64 code where pretty much everything ran in the interrupt handler, with associated stupid busy waiting because it saved people from having to synchronise. I'd guess more commonly for cheap and cheerful ports from less capable machines, but it's been a couple of decades since I've actually looked at any of this code.

MarkusWandel · on June 16, 2022

I didn't get that fancy. I got as far as a horizontal smooth scroller, but with the "move the screen memory during one 1/60 second redraw cycle" mentality - racing the redraw, and when it was just about caught up, whoa, time for the static bar at the bottom.

Quite right, one could prepare the moved version in the background during the 7 steps where you're merely diddling the smooth scroll register, and then flip to it in an instant. But wait, was it possible to page flip the colour map? Also, always having the appropriate moved version ready even as the player is doing unpredictable things goes into the "heroic programming techniques" zone again.

As for glitches, it's amazing what can be done if perfection is sacrificed and there were plenty of good games that did have them, e.g. sprite multiplexing. But I did mean "effortless looking perfect" smooth scrolling.

jimsmart · on June 16, 2022

Ex-C64 games coder here: you are are correct - no, you couldn't relocate/page-flip the colour map, like you could the character map. So you had to update it all somehow on the required frame.

The fastest technique I saw for updating the colour map in a single go, was to have the whole thing as a huge block of immediate mode load-stores, then one could 'scroll' the data across the LDA instructions within that code, in advance, over n-frames, then call this self-modified code block when one did the character screen flip (immediate load-stores was faster than load-stores from colour ram). e.g.

  scroll_splat_colour:
    LDA #$00  # colour data for char
    STA $D800 # colour ram
    LDA #$00
    STA $D801
    # etc., for every visible char onscreen in scrolling area

And one would be updating/scrolling those values loaded into the A register, in chunks over previous frames, similar to:

    LDA scroll_splat_colour+6
    STA scroll_splat_colour+1
    LDA scroll_splat_colour+11
    STA scroll_splat_colour+6
    # etc., for every lda/sta in the above

Perhaps not the clearest explanation, but hopefully enough to communicate the idea.

FWIW, I didn't invent that technique, it was an improvement Jon Williams made to my code, whilst we both worked for Images Software (now Climax). Not sure where he got it from, maybe he invented it himself, maybe he cribbed it from elsewhere.

Related: I thought sprite multiplexing was awesome, and there were quite a few tricks there too to get it performant. But that's another far more complex topic.

vidarh · on June 16, 2022

Another "obvious" trick is to narrow the playing field but animate the rest, and then safe cycles by a combination of bands that requires less updates and sprites. E.g. Pole Position is a classic example, where the graphics covers most of the screen, but only about half actually has gameplay. The rest consists of a very narrow band of mountains, and a couple of bands of clouds. I haven't looked at what they did for Pole Position, but that pattern of the actual gameplay being constrained to a much smaller portion of the screen than what looks like the playing area is pretty common.

justinlloyd · on June 16, 2022

Yeah, compiled graphics and compiled colour tables, also, a routine that could self-modify code in regions of RAM to do the colour table writes. A slow set-up function at level start would build the code to be JSR'd later in the level. We did that on a few games on the C64 and the Speccy and Beeb and Atari. Later used the same techniques in DOS on PC. And of course, doing the same tricks but with D0 through D7 and A0 through A6 on Atari ST and Amiga. Also doing "stuff" in zero page because the address loads were shorter. And avoiding 256-byte page boundaries where possible because of the cycle penalty.

jimsmart · on June 16, 2022

> A slow set-up function at level start would build the code

Interesting, and good thinking :)

IIRC, when we used this technique on the C64, we didn't build the code during init at runtime, we actually built the code in the dev environment, using macros, so it got built at assembly/compile time. So we skipped the small time hit at runtime init, at the expense of a slightly longer load time for the user (and a tiny bit longer on our assembly/compile times, although that was fairly negligible cos we were building on PCs).

justinlloyd · on June 17, 2022

Yeah, we did macro builds in the development environment too, but I also had access to doing some nifty and complex macro and pre-compiled graphics code by making use of BBC BASIC to do the manipulations. Like if you were doing soft sprites, pre-compiled functions to auto-shift the sprites on the X axis without having to do any kind of ROL/ROR and handling carry bits. One of my colleagues wrote really nice set of functions where you could specify that "this sprite should have eight shifts created during setup, go create the necessary loops, toot-sweet!" or "this fast moving bullet sprite can only appear on four pixel boundaries, only needs two shifts, and can be pre-compiled at build time."

There were other functions that we all contributed to for doing horizontal and vertical flips and arbitrary rotations and a ton of stuff for collision detection boxes and bind points for weapon pickups and emission points for bullets being launched, muzzle flash, or shell casings ejected from the weapon. Tons of what was effectively dynamic macro code where you just set up a table of stuff you wanted included, and some data about how certain sprites should be handled.

My development environment was a BBC Micro Model B with a macro assembler and double-sided/double-density dual 400KB floppy drives attached - 800KB of storage, which let me assemble the code from multiple pieces. Later I had a 20MB HDD. The BBC Micro could squirt the code over to the C64 via the Beeb's 1MHz Tube IO bus straight in to the C64's cartridge port. Instant load on the C64 effectively. Also had the same setup for the ZX Spectrum, though when I worked at a different company, we used RML 380Z machines IIRC, and everything ran over a shared network.

jimsmart · on June 17, 2022

Nice tricks! I never did much with soft-sprites on any platform really.

One C64 game I worked upon did a bunch of interesting similar kinds of tricks — flips in both axis, plus the stored graphics only used a partial area of the h/w sprite, so I could pack the graphics in advance, and copy them out every frame at runtime — but only for the main character sprite.

My first 'proper' C64 dev environment was actually a BBC B, as I mentioned in another reply — almost identical setup to yours, but without the hard disk. A much better setup than just working on a single machine! After that, we used PCs with PDS cards.

justinlloyd · on June 17, 2022

Oddly enough Andy Glaister, creator of the PDS cards, and I, (Andy is two months younger than me) have very similar early careers of creating video games, heading in to local computer shops and buying C15 tapes, then duplicating them, photocopying the labels, and collecting some nice royalties for an 11-yr old. We had never met up until we both independently moved to the US and our paths crossed for a short time (Activision, other companies). Life is weird.

I'll go hunt for your other comments down thread, seems you and I also share some disconnected history of developing games.

jimsmart · on June 17, 2022

Interesting, cheers.

I did a brief stint at Software Studios / Electric Dreams (in Southampton) — who were owned by Activision — in the late 80s. And then I did some work (via Images Software) for Activision (in Reading), a little later in the 80s. Briefly met quite a lot of well-known (better known than I, anyhow) devs during that time.

I think I recall meeting Andy Glaister at some point (though not at Activision), but I might be wrong: his name was often mentioned in our office because we (Images s/w) were dealing with PDS quite a lot. But if I did meet him, it was only very briefly, and he likely wouldn't remember me. Him and our boss (Karl Jefferies) seemed to have quite a few meetings.

I think the early / home computer games dev world was actually kinda quite small (in the UK), well, fairly tightly networked — lots of folk knew one another, or were just a couple of friend-network links away. It seemed quite commonplace to meet other devs, usually during what is now called 'crunch time' towards the end of a project, in the publisher's office. I still have memories of sleeping under various desks, in various offices, on various projects.

— Life is indeed weird!

SyneRyder · on June 17, 2022

I thought I remembered the name Jon Williams, so I looked it up on Lemon64... and it seems that you and Jon worked on one of my favorite games as a kid, C64 Back To The Future 2! Somewhere around here I still have my genuine boxed retail copy. I remember it as one of the few games I was absolutely determined to complete to the very end. I might have to break out an emulator and give it another play for nostalgia (still have my C64 too, but haven't turned it on in a long time). Thanks for working on that, it made the little-kid version of me very happy!

jimsmart · on June 17, 2022

Thanks for the memory! And I'm glad to hear you liked the game :)

IDK if it's a game that really stands the test of time, but having worked on it always kinda skews one's view: it's hard to be objective.

Yeah, we both worked on C64 BTTF2. Although we occasionally talked and shared a bit of code when we were working on a few other separate projects too. He came to Images after I'd already been there for a while, so he got free rein to use any code I previously written for the company, so we talked quite a bit, and he'd always discuss how he was coding stuff, and share any improvements he'd made. Nice guy, and a good coder.

— Funny thing is, I never received a boxed copy of BTTF2 myself, heh. That was actually a really common thing, on a lot of games. Most of them in fact :/

6510 · on June 16, 2022

> # etc., for every visible char onscreen in scrolling area

For every changed char. (which is sometimes more and sometimes less)

You could do them in order but if you're using only a few characters you need only 1 LDA for each char. (How to do this is left as a creative exercise for the reader)

jimsmart · on June 16, 2022

But the overheads of tracking which characters might have been changed here completely outweighed simply scrolling / updating the whole thing. The code becomes too involved in tracking changes, and fudging about with- / rewriting- the splat code.

You can leave it as 'a creative exercise for the reader', but that's because you can't solve this for the generic case (i.e. any map the graphics artists might give you) in less cycles than simply dealing with each and every character, which is the worst case.

Processing that many bytes, and doing comparisons and extra branches, simply becomes overheads, and, very quickly, your code is slower than simply updating / scrolling everything simply.

For the colour splat routine, having a giant, pre-assmebled block of immediate-mode load store pairs for every character is as optimal as it gets — and handles all cases — on the C64, you only have a frame to update the colour RAM (because it cannot be relocated/paged), and you are generally chasing the scan beam to move that much data before the next frame.

You don't have the luxury of having extra cycles to re-write that block of code at runtime, and rewrite the code that scrolls the data within that code, nor do you have the luxury of having enough spare cycles to be comparing data, and branching conditionally depending on if it has changed or not.

Perhaps you misunderstand the technique I describe, or perhaps you under-estimate the overheads required to perform what you describe. Or perhaps both.

6510 · on June 17, 2022

for example:

say we use only 4 chars and repeat them in the same order

then we only have 4 pre-baked transitions.

You could then swap all 4 characters in all 4 positions for free.

I confess I was a bit amused when you wanted to give the graphics artist complete freedom.

vidarh · on June 17, 2022

I agree with you. So many C64 games have plenty of restrictions on the art that are very clearly a result of cycle and memory budgets.

Other tricks to reduce transitions too.

Say you have a cityscape where the map is a road with assorted buildings and street level stuff. The road itself might never change colour. Assorted stuff low down in the playing field might. You might want building stretching the full height of the screen, but you can avoid full height colour transitions by demanding that the graphics never go directly full height from one colour too another. Splitting the transitions by requiring set-backs and "air" or colour consistency for part of the buildings might well let you reduce the cost of setting the colours significantly.

E.g. I have no idea if Cobra took advantage of this, but on a hunch I just sped through a long-play video of it, and I don't think there are any full height transitions from one colour to another, and there are bands where the colour never changes within a level. Buildings overlap, so most colours are either "add a bit" or "remove a bit", and you if you needed (Cobra is simple enough they probably didn't) I'm pretty sure you could beat the generic case both on speed and size by baking a few small transition functions even with the cost of having to pick the transitions.

Akin to using fixed bands of same colour at the top or bottom, possibly accented with a couple of multiplexed sprites, another "cheap" trick that'd limit the graphics artists a bit but wouldn't do much in terms of appearance would simply be to require colour changes only every few rows near the top of the screen. E.g. let those buildings (or whatever) extend to the top, but require the top floors to stick to the same colour. Now you can reduce e.g LDA/STA/LDA/STA/LDA/STA to LDA/STA/STA/STA and similarly reduce your inlined scroll code.

E.g. Cobra again, I haven't taken the time to verify, but a quick scan suggests there are bands in each level where the graphic is formulaic enough that colour changes only happens in pairs of rows or more. It's certainly close enough that if not, and you wanted to save a few cycles it'd be trivial to make sure it actually did.

jimsmart · on June 17, 2022

> colour changes only every few rows near the top of the screen. E.g. let those buildings (or whatever) extend to the top, but require the top floors to stick to the same colour. Now you can reduce e.g LDA/STA/LDA/STA/LDA/STA to LDA/STA/STA/STA and similarly reduce your inlined scroll code.

Here's the thing: I've just looked at video of Cobra. It's 100% plain to my eye (having worked on a multitude of published C64 games, over a number of years, and played a hundreds of C64 games, and taken their code apart to see how they tick) that this game is using full colour scrolling — albeit in only one direction. It might look like it's not, particularly on the first level, because you think the graphics look too simple. But sadly that's just how it looks.

You can plainly see this when various accent colours for e.g. windows and such like scroll into view. Don't be fooled by the fact it's tile based and the first level has poor graphics. And the fact it's full colour scrolling is even more obvious on later maps, e.g. see 5mins into this video:

https://www.youtube.com/watch?v=eog6-zvts6g

— That's full colour scrolling. In one direction. With tiles of 2x2 chars, with each char in the tile being able to have its own colour. 100%. There's no tricks there. Your theoretical/toy code example simply won't do what is seen on at the 5min mark there (level 2 or whatever).

vidarh · on June 17, 2022

> It might look like it's not, particularly on the first level, because you think the graphics look too simple. But sadly that's just how it looks.

I specificaly did not say that Cobra isn't using full colour scrolling. You're probably right that it is.

But what you've described does not require it.

I am saying that Cobra is an example of the style of level design that could easily get savings without sacrificing the graphics - including the accent colours. Cobra is otherwise simple enough that it probably wouldn't be worth it, but that is entirely besides the point I was making.

I did note the windows etc., yes. But the point is that there is no place in the whole game where there's a lot of big or different transitions going on at once, and the number of transitions is very small, and notably a lot of the transitions only occur at specific rows. The graphics as is is very constrained.

The 5m mark is a perfect example of exactly what I'm saying - only about ~11 rows changes colour, and from what I saw that's one of the most extreme transitions in the game. But while it affects many rows, it also uses few colours and so you can offset even some of that cost.

The way it's structured in faux 3D "layers" seems ripe to "decompress" the levels into a set of transition functions inspired by painters algorithm to "paint" bottom up, front to back, and then JSR at an offset into the transition function for the object "behind" if it extends past. You'd want to flatten some of the "layers" (at the cost of increasing the number of transition functions).

Here's what I'd try:

Since the transition functions can hardcode LDA immediate's, and STA absolute gets X indexing for one extra cycle, you can have the transition functions X index and used that to reuse the transition functions for every column. So e.g. instead of this:

   scroll_splat_colour:
      LDA #$00  # colour data for char
      STA $D800 # colour ram
      LDA #$00
      STA $D801

You'd get this for each transition sequence (could be a slice of an object, but in practice you'd probably want to flatten it somewhat, at the cost of more transition functions but reducing the cycle cost because of fewer JSR/RTS pairs):

    some_transition:
       LDA #$col1  
       STA $D800+offset_of_last_row_start, X
       STA $D800+offset_of_second_to_last_row_start,X
       ... and so on until colour change
       LDA #$col2
       .... sta to next row.
       RTS

You then decompress the level into a series of JSR calls per column, right to left + DEX, and a set of offsets. Every time you scroll, you then overwrite the DEX after the last (leftmost) transition with an RTS (and fix up the previous one), and JSR to the first (rightmost) list of bottom-up JSRs.

[Beware the likely errors in counting; long time since I've looked at cycle counts..]

You need to save 12 cycles per transition function for the JSR/RTS plus 2 for the DEX per column plus a handful of cycles to do the setup per frame to break even, plus 1 cycle per STA.

An LDA immediate/STA absolute pair is 6 cycles, so moving, say a 39x20 field adds up to 4680 cycles that way. If "everything" fails and you have one JSR + one DEX for 39 columns + STA absolute X-indexed for all 780 characters, you pay a cost of an extra 1287 cycles plus setup. Let's call it 1320. But this is before taking into account that you now don't need to do this:

    LDA scroll_splat_colour+6
    STA scroll_splat_colour+1
    LDA scroll_splat_colour+11
    STA scroll_splat_colour+6

Assuming you split that in 7 segments of 111 (for 780/7) updates of 4+4 cycles each, for 888 cycles, we're 432 short in the pathological case where every colour is different to the one below it and different to the one to its right.

But each STA absolute X-indexed we save, because colours do not change right to left, saves us 5 cycles. Each LDA we save because colours often repeat bottom to top saves us 2 cycles. If we assume these are evenly spread, we need 62 of each every frame to break even. In other words only 8% need to repeat. Of course if it just breaks even it's pointless.

The only constraint on the artist to do this would be to set a "budget" for transitions to make sure there's always enough repetition to make it worthwhile, but it's certainly doable.

jimsmart · on June 17, 2022

The main issue here is that this doesn't really work for anything but horizontal scrolling. The code I explained works for all directions. One of the games I coded was an 8-way scroller. Another had horizontal sections seamlessly connected to diagonal sections.

Granted I didn't show different version of moving the data through the splat code, for different directions, I only showed a single direction. And I now realise that multi-direction was not mentioned in my OP. Which is most likely why we seem to be talking at odds with one another somewhat here.

The code I showed works for all of these cases (8-way, horizontal-diagonal transitions, pure horizontal or vertical) — sure, one needs extra variations on the actual scroll code, but not the splat code — plus it is easier to reason about, quicker to develop, and doesn't impose artificial constraints on the graphics/map/tile-usage — which means the graphics won't take as many iterations to get right (you can bet the artist will make mistakes in the map, 100% guaranteed — cos without also modifying the tooling to account for all of this, that's what artists always do. Speaking from experience!).

> The only constraint on the artist to do this would be to set a "budget" for transitions to make sure there's always enough repetition to make it worthwhile, but it's certainly doable.

Even for a single-way scroller, it's quite hard to tell if the benefits you claim here really outweigh all of the costs involved (not just CPU costs - but dev time, map design time, adjustments to tooling, etc). IDK if it'd get the go ahead in any of the places I've worked, without a fully working demo.

But yes, sure, your idea may well be more optimal at runtime, for single direction scrollers. But I don't see how it would / could work for all scroll directions. And the additional constraints would likely give the artists nightmares! ;)

And, as you say, worst case is still gonna be ~25% slower.

The biggest problem that one is trying to overcome here is the additional hit to CPU budget every n-frames (usually 8), when the color RAM update is needed. Having something that might, under certain conditions, be slower, could be problematic. Whereas having something that is constant time is, perhaps arguably, always going to be easier to deal with.

But thanks for taking the time to explain. I kinda thought that that's what you were implying.

Apologies I hadn't been clear that the code I showed (immediate load-stores) was a colour RAM splat technique that was applicable to scrolling in any direction. It's kind of a key point really, and I see I completely omitted it :/ I thought it was mentioned, but I now see that the 8-way scrolling conversation was a slightly different thread.

vidarh · on June 18, 2022

Sure, doing it for 8 way would require a different approach and would likely be a lot harder to make work (not impossible, but it would likely constrain level design quite a bit)

To be clear, the one you showed is a great baseline, and likely the best option for most cases unless you hit a wall where you absolutely need to save extra cycles.

Only then it'd be worth exploring something this much more complex.

I guess my main point is that if you hit that wall, then an approach specialised for the level design can give extra savings.

But of course you're probably absolutely right that it'd not fit the budget of a typical C64 game back in the day.

I'd also add that this is far easier to do today, so it's easy for me to propose as an alternative now, but it's unsurprising if nobody used it back then.

E.g. I could throw together a script to do near optimal arrangement (I'm pretty sure finding optimal arrangements reduces to bin packing, which is NP hard) of transition function to maximise the minimum saved cycles for any given level data on screen and run it fairly easily on my laptop, but the number of permutations of "flattening" of layers would have forced you to resort to educated guesses back in the day, and so your savings might be more marginal.

(But now I kinda wish I had the time to put together a working example and see how far it could be pushed)

In terms of the worst case, to be clear for this to be useful at all you'd need to verify it never hit it. Given how many random colour changes would be required to hit it, I think you could guarantee that simply with a few minor constraints on the tile set (e.g. every 2x2 tile on screen with the same colours in all four positions saves you 2xSTA's and 2xLDAs) coupled with a willingness to make small tweaks (e.g. deleting a window from an otherwise "busy" part of a level etc.).

You can relatively cheaply validate it, since calculating the cycles for any given screen contents is easy, so you could validate levels by "scrolling" through the whole thing and calculating a graph of cycle counts at any given point. Of course, this is again much easier to do today when we can hack up a trivial script to run through the level data in seconds than it would have been back in the day.

And of course as you point out, you'd have needed a very good reason to spend the extra time (and impose the extra limitations on the level design) to make such small savings, or you'd just not have been able to justify it.

So unless you had a game so complex you absolutely had no choice but finding ways of cutting it down, it'd have made no sense. Even then I'm guessing back then people would have made the static parts of the screen bigger instead of anything complex like this.

jimsmart · on June 18, 2022

Sure. But arguably, that's a lot of extra work, plus some arbitrary restrictions (tiles of single colour, scroll directions), for a system that, at worst case, is surely going to be doing at least a little bit more work than the simple optimal colour splat technique I described.

Whereas if one has a system where there is no worst-case because/and it runs in constant time, and is as optimal as possible (i.e. the fastest way of updating all the necessary data) — which requires no restraints on the graphics, can work no matter which way the games/player scrolls, doesn't require much extra work on other frames (i.e. when not updating the colour RAM), nor any extra work using custom-made external tooling — it would, with little argument from most folk, clearly seem to be the better choice.

There are reasons why simple solutions often win out over more complex ones.

Particularly when coding on limited hardware. Particularly when under time pressure to publish. Particularly when graphics designer's iteration time has a cost per hour.

> So unless you had a game so complex you absolutely had no choice but finding ways of cutting it down, it'd have made no sense.

I've worked on lots of games projects that never actually got published, some of them were because we came across a wall, and despite trying lots of novel solutions, often many of which were 'outside the box', those issues couldn't be overcome. These things are massive time sinks, and, often, solutions that seem like good ideas, simply aren't.

Sometimes one simply has to accept that you cannot squeeze an elephant into a matchbox! It becomes easier to spot when solutions might not work as one becomes more experienced. But sometimes one might still spend time trying to squeeze the elephant.

Solving the worst case, with the least restrictions, in both the simplest and the most optimal way, often gives one a sensible measure as to what is actually reasonably possible.

— If you can reduce the time needed to do a full-colour update when scrolling arbitrarily (including 8-ways), with unrestricted use of colour design on the maps/tiles, on the C64 platform, then I'm sure there's a reasonable handful of folk that would like to hear your suggestions. Otherwise, technically you're solving another problem, and it's not the one which we were dealing with back then.

vidarh · on June 19, 2022

We absolutely agree it'd be a lot of extra work, hence I agree with you that it'd not have been applicable in most instances back then. Most of the time if you faced a squeeze like that you'd be more likely to agree to reduce the playing field by a row or two instead, or ditch something else. But being freed from those considerations and considering what is actually technically possible is what is interesting to me.

> — If you can reduce the time needed to do a full-colour update when scrolling arbitrarily (including 8-ways), with unrestricted use of colour design on the maps/tiles, on the C64 platform, then I'm sure there's a reasonable handful of folk that would like to hear your suggestions. Otherwise, technically you're solving another problem, and it's not the one which we were dealing with back then.

They are different problems, yes, but plenty of games have gameplay where 8-way scrolling is not what you're dealing with. And additional constraints also often fall out of the level design. To me it's the way those design constraints create opportunities to use additional tricks that are interesting, not the problems you were dealing with back then.

Frankly it's an extra artificial problem, because the state of the art of scrolling on the C64 today involves using "dirty tricks" from the demo-scene like VSP/HSP/AGSP which no "hardscroll" based approach like the ones we're discussing can compete with.

jimsmart · on June 17, 2022

> I confess I was a bit amused when you wanted to give the graphics artist complete freedom.

That's because I'm speaking from a number of years of actual real-world experience working on all manner of released production games for the C64 (plus other platforms) — from arcade conversions, to film licenses, to cross-platform ports, to original projects — and in most of those situations, you simply don't get to call the shots.

It's hardly some unknown abnormal situation to allow the graphic artist to use the character colour. Loads of games had unrestricted use of colour RAM. It's just what you do in most cases, unless you have a monochrome game.

Your toy example here is all well and good, but is not a fully working system by any means — I don't see how it can generalise out at all. And it certainly doesn't seem to be a more optimal way to handle unrestricted scrolling, with a colour map, on the C64, which is what I was outlining the most optimal method of doing. Furthermore, it seems tangential to your originally mentioned idea.

Whereas the method I outline involves no restrictions, no overheads, and will handle 8-way scrolling — with full colour map. And it works 100%, and has been used in multiple published projects. It's not theoretical.

Do you have any actual examples of released games that used the technique which you describe? Or any articles you can point me at? Or perhaps a more detailed explanation of this technique and how it can be used across the whole screen (in whatever directions). I'm more than happy to take a look to try and see what I may have missed.

— Out of interest, have you ever coded any complete C64 games?

IDK, perhaps you're speaking about some trick used in demos/intros, in which fancy techniques often wouldn't generalise out into a full game.

Apologies if it seems that I am missing something that might be obvious to you. Clearly my work on a whole bunch of published C64 titles, and my years of programming the C64 commercially, in various teams with- or alongside- other experienced programmers, was some years ago now (I quit the game industry in the 90s), and clearly we must've overlooked the technique of which you speak of. Or dismissed it for some reason.

6510 · on June 18, 2022

> Whereas the method I outline involves no restrictions, no overheads, and will handle 8-way scrolling — with full colour map. And it works 100%, and has been used in multiple published projects. It's not theoretical.

yes, it was wonderful enough to give the reader a clue how one would go about squeezing more and more performance out of that good old box. Nothing comes for free of course but there are pretty much infinite interesting trade offs to be had that usually exponentially blow up the complexity.

I mean you started out with a pretty easy to understand bit of memory transfer.

If the goal is to update "every visible char onscreen in scrolling area" it seems pretty much the final solution, nothing better is to be had.

Rephrasing the goal into "every changed char" is simply a different perspective. Most likely less productive but one could explore it as it would be faster if we restrict the freedom of graphics a bit.

Say you have a space ship <ooo> moving left to right by at most a single char. You have to set the chars that make up the ship and you have to insert a space next to it or it would turn into <<<<ooo> or <ooo>>>> There is no need to fill the entire line with spaces.

> perhaps you're speaking about some trick used in demos/intros, in which fancy techniques often wouldn't generalize out into a full game.

I never wrote games, I just tinker and rewrite bits of my demos producing inferior results 99 our of 100 times.

How many chars are we updating really if we change

  _ _<ooo>_ _ 
  1234567890A

into

  _ <ooo> _ _ 
  1234567890A

I think 3,4,7,8 but not 1,2,5,6,8,9,0,A ?

That was the train of thought in those days at least for me :)

jimsmart · on June 18, 2022

Yeah, I get what you're saying. But it's somewhat of a moot point mostly, certainly when one looks at the bigger picture, but arguably perhaps even for the small picture.

Usually (for most games on C64), one simply assigns sprites for most moveable objects, and then one has a background scrolling in whatever directions. Many commercial games were generally produced within some restricted time frame, according to a publisher's date. So under that time constraint the whole game has to be built. Generally, setting up one's graphics / screen drawing code (sprite multiplexer and a scroll system, and a coordinate system to tie them together) is done right at the beginning, and is a tiny percentage of the total time one has allocated for the project. The bulk of the time is usually spent implementing the game itself. Most work on graphics tricks / optimisations is done outside of this, unless there is something the game specifically needs / depends upon — or unless one runs into performance issues.

Having generic re-usable code for one's graphics system — being able to flexibly handle lots of sprites, and scroll the screen in whatever directions are needed (the colour splat code I showed can be used for horizontal / vertical / diagonal / full 8-way scrolling, with the appropriate code to scroll the data through it) — even if that system as a whole might need a little customisation depending on the game — is a huge time saver, and almost a necessity.

Sure, as a toy example, for a single object, one can see there's not much in the way of changes when moving something like a paddle for breakout, when it is aligned to character boundaries. But writing a generic system for that, for multiple objects, on these lower powered systems, will likely see the technique fall on its arse - unless the game is mostly made from things with repeated characters, that move horizontally (or with otherwise limited movements and graphics). And even if that is applicable to the kind of game at hand, it's arguably still not a huge saving of cycles, so if one had already run into performance issues, one would likely be looking for cycle-savings that have bigger gains than that, i.e. bigger, lower-hanging fruit.

There's a lot to be said for simplicity, both at runtime and whilst coding / debugging / optimising.

Which is partly why having a generic and optimal colour-splat solution comes into the equation, because it was always a specific pain-point for C64 games, no matter which direction they scrolled. Having a generic solution means one simply doesn't have to worry about such tricks, and their various associated restrictions, the team can simply build games, using the system and its underlying h/w features. That solution also being as optimal as possible is a bonus, and usually comes after much time / many dev iterations, and the work of multiple bright minds. Hence me sharing the technique that Jon shared with me, back in the day.

Most C64 developers / dev teams back then had generic solutions for h/w sprite multiplexing and tile-based full-colour background scrolling (in whatever directions), and similar was needed on every other platform according to their given h/w, or lack thereof (e.g. Atari ST had no hardware sprites nor hardware scrolling, so one had a different bunch of problems to come up with generic solutions to). It was often pretty easy to swap out a scroll system or a sprite system for another on some given platform, to see if there were any real-world gains from changes in the implementation. Obviously, that possibility is reduced considerably the more one moves away from generic solutions, and into custom tricks, e.g. relying upon how the designer has used / has to use colour throughout the level, or the shape of moving objects, or which directions objects can move it.

Sorry if I come across as overly dismissive. Having spent so many hours on such things, over many years, with the input of multiple talented devs at the time, we've tried a lot of things to end up where we got to. We also came up with a million things that sound like nice ideas, but generally turn out not to be, due to overheads (hidden or otherwise) or other restrictions. That's not to say creative/novel solutions can't work or shouldn't be suggested. Just that often they've already been considered, and there are reasons they've not been used.

Your idea works fine — for things with both limited graphics (and/or colours, depending on where it is applied), and limited movement. But will likely have more overheads / take more cycles than one might imagine, if trying to turn it into a more generic system (for more objects, or for objects with less restricted graphics or movement).

It's not such a useful idea for background scrolling as one might initially think, because it either requires limiting the graphics (fine if you can do so: but these platforms had poor enough graphics, and gave graphics/map designers enough headaches, without additional restrictions), or having a bunch of code to calculate/track the changes.

Considering/proving the simplest case — here that's horizontal movement, over a horizontally aligned screen memory, for a single object — usually leaves a lot of issues unsolved. In general, and in most areas of coding.

vardump · on June 16, 2022

How to handle the case when player changes direction to exact opposite immediately after the frame color data was transferred? Double buffering splatting code? Although one copy of it is 5001 bytes, ouch.

jimsmart · on June 16, 2022

You generally don't handle that case! :) — Instead you let the player move within a rectangular area onscreen, and decide on which way you are going to scroll the screen in advance (or rather: after the fact, depending on how one looks at things), based upon where the player is inside that rectangle. So the screen catches up with where the player is moving/pushing.

Eight-way scrolling like this was always a massive pain on the C64 (and other systems that used buffered scrolling with no h/w, e.g. Atari ST), but that way (a box the player moved around inside) was the only realistic way of handling it if you had to do a bunch of work in advance before doing the actual scrolling. Turns out that having the player in a loose rectangle is also easier on the eye too, which is perhaps why it's also used on systems that don't suffer the same h/w restrictions.

Yeah, the colour RAM update was a lot of bytes to move. But dedicating a big chunk of code to it meant one could be a little freer to use slightly slower techniques elsewhere in the update cycle. Side note: the C64 actually only had 39 visible char across the screen when in 'scrolling' mode, because the borders where shrunk-in slightly (and slightly more than one expects). So one less char to worry about per line. That saved a tiny amount of code / memory / execution-time for the colour splat (and the scrolling of partial chunks - whether on back buffers, or the data within the colour splat code - over the other frames). Sure, it's only one less character. But it saved some cycles. And cycles mattered! Particularly when doing something with that much data to move / that took that much time.

vardump · on June 16, 2022

> C64 actually only had 39 visible char across the screen

But 40 color cells were still visible, unless horizontal scroll register was 0.

jimsmart · on June 16, 2022

No, but that's an easy enough mistake to make :) — It's called 38-column mode, and when enabled the VIC shrinks both borders in by 8 pixels, and then offsets the screen according to the x-scroll register bits.

[Edit: another source says it's actually 7-pixels hidden on the left, and 9 on the right. But whatever: same principle, the screen is shrunk by 16 pixels in total horizontally]

Which meant that at most only 39 characters were visible across the screen — with two of those, one at each end of the row, being partially visible — and that applies to both the character screen and its associated colour RAM. Only 38 characters were visible when the x scroll register was zero, and as soon as one shifted to a value of 1-7, the 39th column became visible (and the 1st one became partially offscreen). But the 40th column is never visible when in that mode.

For more info see:

http://www.devili.iki.fi/Computers/Commodore/C64/Programmers...

"When scrolling in the X direction, it is necessary to place the VIC-II chip into 38 column mode. This gives new data a place to scroll from. When scrolling LEFT, the new data should be placed on the right. When scrolling RIGHT the new data should be placed on the left. Please note that there are still 40 columns to screen memory, but only 38 are visible."

— But it's discussed on a handful of other pages too, if you google.

vardump · on June 16, 2022

Oh damn... and I did a fair bit of coding on C64 back in the day. :-D

Somehow I thought it hid 4 pixels both sides. Totally wrong.

PS. Then it's so unfair bad line still takes 40 cycles!

jimsmart · on June 16, 2022

Please stop giving the above comment downvotes because of this person's lack of knowledge: we all have to learn things — there was once a time I didn't know this either.

It's not like vardump here was being a dick about anything in their comment, cut them some slack!

vardump · on June 16, 2022

Thanks. Although I really should have known better, wrote scrolling routines 35 years ago.

It's scary time can corrupt memories we consider as facts.

Luc · on June 16, 2022

I made an 8-way full-screen scroller.

To avoid situations like this the player sprite at the center of the screen had momentum, i.e. the sprite had to rotate 180 degrees to change to the opposite direction, giving a few frames time to set everything up.

jimsmart · on June 16, 2022

That's a nice little trick, cheers for sharing. (Not that I'll get a chance to use it these days, but still)

egypturnash · on June 16, 2022

That's... that's horrible. Beautiful, but horrible.

Which kinda describes any advanced c64 technique, really.

jimsmart · on June 16, 2022

Indeed, I totally agree on all points :)

mgkimsal · on June 16, 2022

Just watched a video of C64 "Seven Cities of Gold" with a colleague yesterday, trying to convey just how... exciting that was in 1984. Watching on YouTube, I had forgotten just how small the playing 'viewport' was. It seems like possibly a more extreme example - I don't remember too many other games having an action viewport that small.

js2 · on June 16, 2022

> This also misses loop unrolling.

It's mentioned at the bottom of the article in the "Thoughts" section:

You could certainly use self modifying code and unroll this copy routine to get better performance at the price of flexibility and arguably understanding for the average casual 6502 assembly coder. Again, this was not a "how fast can we absolutely make it" but an everyday use examination.

hinkley · on June 16, 2022

Duff's device is a fixed size loop unrolling with an ugly hack to make it behave for arbitrary inputs. The assembly makes sense but the C code is rough.

It's not quite as fast as self-modifying or custom compiled code, but it's pretty close.

le-mark · on June 16, 2022

Crypto mining with fpga was in the weeds down this path, or 100 Gbps signal processing. Two examples where low level stuff is still relevant, just not commodity and widely available like the 8 bit micros were.

aerique · on June 16, 2022

I also read a high frequency trading blog that was posted here a few years ago. Same thing: hacking hardware and software so the first bytes of info could be grabbed from a stream and acted upon, instead of needing to wait for the whole package to have come in.

Also when I was in the demo scene on the Atari ST one had to do specific timings in the assembly code to be able to draw outside the screen's borders (so the borders were on screen but couldn't be drawn on by code).

djmips · on June 16, 2022

Those days still exist! If you want your mind blown watch the Epic Games Nanite talk from last year's SIGGRAPH where the core rendering of the dense vertex data is done directly in Compute , IE software rendered, instead of using the hardware rasterization hardware which has a minimum 4 pixel invocation overhead which gets expensive with very small triangles.

This is but one example of this that's happening every day, there is much much more like hair rendering in EA FIFA soccer or automatic trading financial software running on GPUs.

There's a whole world of applications where people are still concerned with every last cycle of performance just like in the C64 days.

djmips · on June 16, 2022

Here I'll save you the trouble of trying to find the video.

https://www.youtube.com/watch?v=eviSykqSUUw&list=PLabw4gCouT...

djmips · on June 16, 2022

Kind of off topic for 6502 memory copy speeds but with regard to scrolling, there became a pretty cool software hack for the C64 (called VSP) where you could trick the poor VIC chip into starting scanning out the screen position later in memory. Move the start one character and the whole screen shifts left by 8 pixels. You only need to repaint a vertical column for this 'course' scroll instead of moving the entire screen of characters. This is something that should have been built into the hardware and was very useful on other systems that had that ability (like the NES for example)

With it you can reduce the amount of memory you need to copy every 8 pixels (the 8 pixel part can be done with smooth scroll registers).

There's a thread and example code on github here. https://www.lemon64.com/forum/viewtopic.php?t=70539

Also note it's such a terrible hack on the DRAM that it doesn't work on all C64s and there's a technical discussion about that here. https://www.linusakesson.net/scene/safevsp/index.php

Hardware mod if VSP doesn't work on your C64 and more technical details. http://wiki.icomp.de/wiki/VSP-Fix

Also it makes mention of the C64-Reloaded which is a modern C64 product that includes the fix.

localhost · on June 16, 2022

The C64 VIC-II chip would grab the address bus from the CPU every 8 scan lines on the screen. Some of the early "fast load" cartridges like the Epyx FastLoad cartridge that would accelerate loading games from the floppy drive would blank the entire screen during load so that their async data transfer routines wouldn't get interrupted by the VIC-II chip grabbing the bus. I wrote a similar (better?) cartridge where I would need to use the register on the VIC-II chip that reported the scan line as a sync marker to transfer 3 bytes asynchronously from the 1541 down the clock and data lines of the serial bus. Good times.

MarkusWandel · on June 16, 2022

In my recollection Epyx Fastload did not blank the screen, though some earlier fast loaders did.

I also remember the software voice synthesizer "SAM" needing to blank the screen to render glitch-free sampled audio. Then along came, what was it "Impossible Mission" ("Another visitor! Stay a while...") doing pretty clean sampled audio with the screen on. Not that the C64 SID chip was even remotely intended to be able to play sampled audio in the first place!

The Amiga was unimaginably powerful by comparison. Even a basic configuration had 8x the memory a C64 had, and it had all those fancy DMA toys to offload the CPU.

jimsmart · on June 16, 2022

> Not that the C64 SID chip was even remotely intended to be able to play sampled audio in the first place!

I don't recall how SAM did it, but sample playing on the C64 SID chip was indeed a nice trick — it was actually done by modulating the main output volume, which made a slight click when changing.

Eventually this got used by some of the C64 musicians / music player libs, so one could play a channel of samples as well as the three regular synth channels on the SID. IIRC, Outrun used this particularly well in its title screen and/or loading music, having some vocal samples "O-O-Outrun!" (and skidding sound effects) as well a sampled drums.

Annoyingly, IIRC, some revisions of the SID chip behaved slightly differently, and had louder or softer sample playback when this hack was used. But still: clever stuff.

weinzierl · on June 16, 2022

... and main output volume had only 16 levels so the samples were 4 bit quantized. It is a wonder that we could get understandable vocal samples with this hack at all. I distinctly remember "Goal!" from Peter Shilton's Fottball and "Accolade presents" from Test Drive [1]. In the examples one can hear the amount of quantization noise that low bit depth caused.

[1] https://m.youtube.com/watch?v=L1u-WydiiCI

vardump · on June 16, 2022

You can actually get about 6-7 bits resolution out of same SID volume register. 4 bits from the volume register, channel 3 disable and 3 filter bits. Requires some setup to get SID in a particular state first.

For details, see: https://livet.se/mahoney/c64-files/Musik_RunStop_Technical_D...

weinzierl · on June 18, 2022

I was just wondering tonight, while lying awake, if you could get more depth by either using the other voices (probably not) or the filter. Now this paper and the demo are so cool, really made my day.

vardump · on June 18, 2022

> using the other voices

Yeah, you can do that as well, well kinda, but it's harder, provides a lower maximum frequency and wastes more CPU cycles.

See: https://csdb.dk/release/?id=72678&show=notes

There have also been efforts to do "sampled" sound synthesis with C64's 3 voices, speech, etc. I don't remember any links about that, unfortunately.

djmips · on June 17, 2022

>The Amiga was unimaginably powerful by comparison.

Interesting that the two systems were only about two years apart in development. The first C64 prototype would have been in 1981 and the Amiga 1983 but of course they were targeting different price points.

6510 · on June 16, 2022

I eventually managed to do a scroll texts fast enough to do them while the pixels of the scroll text were printed to the screen. It was even fast enough to have char combinations in the scroll text that modified its speed and direction with speeds like "one time scroll 1 pixel the next 2". One of the tricks was to do "poor" timing with NOP's by requiring an empty row of bits between each scroll text. (The text becomes unreadable anyway if there is no space between the lines)

Psyladine · on June 16, 2022

>I kinda miss those days, and I kinda don't. I guess it was good to have experienced them.

There's a certain reflective quality and even satisfaction from using a chainsaw after coming up using a tree saw by hand. It feels progressive, even if it is just optimizing for time at the expense of energy.

tialaramex · on June 16, 2022

Tom Duff's device was doing that because he's doing MMIO, you should not [I know you're not suggesting it, but just in case anybody reading thinks it's clever] do this today when you don't want MMIO, your compiler is very capable of just doing an actual copy quickly, so tell it that's what you want, don't write gymnastics like Duff's device.

However, expressing these partially unrolled loops nicely is a nice performance-not-safety feature of WUFFS called "Iterate loops":

https://github.com/google/wuffs/blob/main/doc/note/iterate-l...

Well, I say performance not safety, as always they want both, but you could safely just write the never unrolled case, while the existence of Iterate loops allows you to express a much faster special case but know the compiler will fix things up properly no matter what.

vardump · on June 16, 2022

We're talking about C64 (and maybe Amiga).

Compiler is not going to do absolutely anything for you on those retro platforms.

tialaramex · on June 16, 2022

Aw, just needs a better compiler (with a 6502 target) :D

Jason Turner's CppCon 2021 talk, "Your New Mental Model of constexpr" has half the presentation as a C64 program (though for practical reasons not actually running on a C64 but instead an emulator) because most of the heavy lifting is done by the C++ 20 compiler. https://youtu.be/MdrfPSUtMVM?t=1422

Now, Jason's approach is not going to beat hand-crafted 6502 machine code in a fair fight but he often doesn't need to fight fair and that's the point of his talk.

amelius · on June 16, 2022

We're now doing sort of the same tricks, but with power management on mobile devices.

joosters · on June 16, 2022

On the original ZX Spectrum, you could measure the write bandwidth visually, because on startup it would write the value 2 into each byte in memory (which included the graphics RAM). It would then re-read and decrease the value of each byte twice, to check for any faulty memory.

You could see these patterns on-screen as the reads and writes took place (I think it took about a couple of seconds to do this to 48k of RAM)

becurious · on June 16, 2022

You could change the stack pointer to the top of the area of memory you wanted to fill and then use PUSH to fill at I think 11 clock cycles per two bytes. It was faster than unrolled LDI or LD (HL),A followed by INC HL. It would be filling memory in the wrong direction for a Rainbow processor but you could use it for repeating patterns. I think I did a checkerboard pattern that would shift every frame and it was pretty smooth.

joosters · on June 16, 2022

To be really pedantic, there's a big difference between 'memory bandwidth' and 'memory transfer speed'. The former is just reading (or writing) to a block of memory, and the latter is copying data from one location to another. So a 'memory transfer speed' is going to be slower.

jmull · on June 16, 2022

I think it's actually not that pedantic.

"Memory bandwidth" is being used in marketing materials today, so it's a little useful to understand what it means. (The author of this article confuses it with memory transfer, probably others do as well.)

kken · on June 16, 2022

This doesn't even cover all the neat assembly tricks with self-modifying code that you would actually use on a 6502 to speed up memory transfer.

vikingerik · on June 16, 2022

I did this in a homebrew Atari 2600 game. For a Space Invaders grid of sprites. Each is triggered by writing to a register, as the electron beam scans through to display each sprite.

The interval between sprites on the same scanline is 3 cpu cycles. That's a single 6502 instruction, the write to that register. How do you do any kind of load or compare instruction along with that to decide whether to display that sprite?

The answer was to copy that stream of instructions to RAM ahead of time, and replace each write to a missing invader with a no-op. The code is here if anyone wants to see (the "inv3" demo): http://dos486.com/atari/

krallja · on June 16, 2022

> copy that stream of instructions to RAM ahead of time

Even this is easier said than done: there are only 128 bytes of RAM in the entire machine, and that has to suffice for global variables and stack memory in addition to storing modified code like this!

rasz · on June 16, 2022

Afaik its <120KB/s with all the tricks. 6502 was hand designed and brain optimized for clever use of available silicon real-estate, roughly 20% of CPU bus cycles are dead/bogus/useless. RTS wastes 3 of its 6 cycles, RTI 2 of 6 wasted, JSR 1 of 6 wasted , all increments at least 1 cycle wasted etc. Sad to think state machine handling DMA transfers in REU is probably less than 50 macrocells, and Commodore ran its own fab, they could have build-in REU DMA in C128 and it would cost cents.

mywittyname · on June 16, 2022

Is there a way to make a compatible 6502 variant that doesn't have this waste?

rasz · on June 16, 2022

https://en.wikipedia.org/wiki/CSG_65CE02#Pipeline_improvemen... fixed most painful ones, but afaik not all dead cycles. But it was 1988 and commodore didnt bother putting it into anything other than some IO card for the AMIGA, not to mention it still did nothing to cover slowness of moving data around. Japanese decided to do something about it for TurboGrafx-16 in 1987 Hu6502 http://shu.emuunlim.com/download/pcedocs/pce_cpu.html

Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII) - pretty much x86 'rep movsb', except not great at 6 cycles per byte (~160KB/s). For contrast 5 years older 80286 already did 'rep movsw' at 2 cycles per byte. 6 years later Pentium did 'rep movsd' at 4 bytes per cycle. Nowadays Cannonlake can do 'rep movsb' full cachelines at a time at full cache/memory controller speed.

JPLeRouzic · on June 16, 2022

I think there are tricks to rewrite the microcode on Pentium, does similar tricks exist for 80286, 386 or 68K?

It would be fun to reconfigure one as a high speed 6502.

krallja · on June 16, 2022

“The 100 MHz 6502” does a different clever thing - it copies all the dedicated RAM and ROM into its own FPGA copy. Then it can perform 7 to 25 instructions before the next external read/write cycle!

http://www.e-basteln.de/computing/65f02/65f02/

MatthiasWandel · on June 16, 2022

For games that scrolled the screen, those had to happen essentially between scans, so a lot of tricks were employed. Fixed addresses in the code, unrolled loops, and self modifying code to avoid the expensive zero page indicrect indexed addressing mode (the slowest instruction on the CPU). The other trick was to start moving the first line of screen just after it got displayed, which would give you nearly two jiffies to do it before the scan caught up to you on the next frame.

natly · on June 16, 2022

It's crazy how much work went into those old games. I have a feeling those programmers weren't even paid that well considering how few people owned computers back then (so the market can't have been large).

kabdib · on June 16, 2022

My first job out of college at Atari in 1982, writing game cartridges for the 400/800 computers, paid $25K a year. My first raise after a year was to $30K.

There were programmers in other divisions making royalties off of their games. Tod Frye famously got $700K or so for his terrible version of 2600 Pac-Man (it was terrible not because he was a bad programmer, but because marketing decided that 2K of ROM had to be enough, and he was smart enough to pull off a miracle . . . of sorts).

Also, the OP apparently doesn't know how to unroll loops, which is the first thing you do to your game's hot spots. (Never had to resort to self-modifying code).

DerekL · on June 17, 2022

The Pac-Man cart was 4k, not 2k. But maybe it should have been 8k, like Ms. Pac-Man later was.

kabdib · on June 17, 2022

You are correct on both accounts :-) (I mis-remembered the ROM size).

wkearney99 · on June 16, 2022

If you ever play(ed) the Atari 2600 version of River Raid, you got to witness some SERIOUS tweaking to work around the limits of that console. Every scanline processed on the fly during the vertical blanking interval. No screen buffer. The animation was soooo smooth.

vardump · on June 16, 2022

No need, it can easily happen during the scan. As long as the scan and update memory location never meet, there's absolutely no problem.

mmphosis · on June 16, 2022

This also misses a lot of other tricks: self-modifying unrolled code, keeping track of blocks of memory that don't need to be updated, or blocks of memory that have the same value. Memory moves may not be the fastest way.

  0 HIMEM: 5608
  1DATA5532563300922139021390213932135045238263536503037252627503845615305817312092293902932835845478875645836575037148845550351083966132653706165276445377489621322135821322334213502130051520282036
  2DATAQLNZQLNZQAQQDSAQRDSAQQDSAQVDSAQXCKDNAPFXANAQXXANANXNXNXQXNXCQXKXNZQXKXNZQCQODMAQODMAUUYQXCKATAUVQXCKMNXQXKANXUAUTCQXKENXUMUSQJHSAHFQZTXRDFQZTAOCQZTAOBQHDSAPDSAQADSANZNZKDSAQZDSAXZQZOZXZUAXZJ
  3 READ L$: READ H$
  4 FOR I = 1 TO  LEN (L$)
  5POKE767+I,10*(ASC(MID$(H$,I,1))-65)+VAL(MID$(L$,I,1))
  6 NEXT
  RUN
  HGR : CALL 768: CALL 5608

samstave · on June 17, 2022

Jeauzeus Chistmas!

I completely forgot about HIMEM... and what a difference it made when it happened...

God there is so much computer history knowledge thats going to die in the next 20 years.

danachow · on June 20, 2022

Perhaps I’m forgetting but was HIMEM control not something that was pretty well known since early on in the C64? I didn’t think it was a secret. There’s articles describing it’s use in January of 1983.

samstave · on June 20, 2022

Sure, but the fact is when was the last time you had a "startup.bat" file you had to edit that had HIMEM stated in it....?

Easy to forget that the velocity of understanding for people in tech has been ridiculous over the decades.

Ask a new hot shot programmer to do dip-switches and IRQs etc... if they are <40 years old... they may not know what youre even talking about, let alone be able to tell you a PIN switch on a mobo for IRQ settings (PRE IPv4)...

There was a generation of people brought up THROUGH THE STACK in a hardware manner...

So much history is lost through the advancement of history.

This is why we lost the tech to construct pyramids of scale.

"Kids these days just arent interested in levitating blocks with sound waves, and they just only want to domesticate Camels and plants for something called "agriculture""

danachow · on June 21, 2022

HIMEM here refers to a control mechanism for switching out the mapping of the C64 kernal (the ROM “OS”) to access more RAM.

The C64 is called that because it has 64k of memory and the 6502 is a 16 wide address bus so 64k of memory requires bank switching since there is no room for anything else. This was baked into the design of the system from day 1.

Anyway, I also really have a hard time understanding how an old DOS driver to overcome the limitations of PC architecture when it was pushed into service well past its sell by date is or should be relevant to anyone except hobbyist tinkerers with old hardware. Most people won’t be able to start a Model A Ford, either. I don’t think forgotten knowledge of DOS/PC shittiness is somehow such a great loss of domain knowledge.

> Ask a new hot shot programmer to do dip-switches and IRQs etc... if they are <40 years old... they may not know what youre even talking about

Meanwhile, for those that are interested it’s never been easier to get into low level development. I do not fucking miss the days of etching PCBs in my basement and always cursed the cost of small run manufacturing. The types of things the home hobbyist can work on/solve is far beyond the frankly waste of time futzing with IRQ and blech DMA on IBMs steaming POS. Home hobbyists have the tools for free to design their own multilayer PCBs or “just” integrate the myriad embedded hardware devices available to them. More young people are involved in robotics then ever, a far more interesting endeavor to me then moving jumpers around to make a crappy machine produce 8 bit audio cassette quality sound.

> PIN switch on a mobo for IRQ settings (PRE IPv4)

Yeah even I’m fucking totally lost on this one.

samstave · on June 22, 2022

>> PIN switch on a mobo for IRQ settings (PRE IPv4)

>Yeah even I’m fucking totally lost on this one

---

You dont recall on a 286? Having to swap DIP PIN switches on the motherboard to establish the IRQ port the card in said slot was going to use?

IRQ3 was typically the modem...

----

You were etching PCBs in your basement

I was flipping DIP SWITCHES.

Different era of education there...

danachow · on June 23, 2022

I know what DIP switches or configuration pin headers do. For one, I've never heard them referred to as DIP PIN switches like PIN is some kind of acronym - perhaps DIP pin switch is some kind of regionalism. But that was not the confusion:

>> "Ask a new hot shot programmer to do dip-switches and IRQs etc"...

Ok.

>> let alone be able to tell you a PIN switch on a mobo for IRQ settings (PRE IPv4)

After the first part about dip-switches and IRQs this makes it sound like "PIN switch" is something distinct. Further confusing, I still have no idea what PRE IPv4 means here. The only IPv4 I've heard of is internet protocol and not something I would associate with PC hardware configuration. (IPv4 was released pretty much at exactly the same time as the original IBM PC in 1981).

> IRQ3 was typically the modem

IRQ3 was more specifically the 2nd set of serial UARTs COM2/COM4 - with the IO base port assignment 2F8h and 2E8h. COM1/COM3 was IRQ 4 (base ports 3F8h, 3E8h). So a modem may very well been on IRQ 4. In practice it was often dependent on whether one was a using a serial vs PS/2 mouse.

scarface74 · on June 16, 2022

I immediately noticed that in one of the code samples that he is loading and storing data from memory that’s not on the first page - memory locations $00-$FF - memory access to and from the first page took one less clock cycle.

The LDA, STA, etc operators for zero page access are different opcodes than their two byte address equivalents

bluemax · on June 16, 2022

Wow, this brings back memories from more than 3 decades ago. I created a routine on the C64 to copy memory and calculated the performance then around 25KB/sec.

The first version contained a memory corrupting bug that took some time to figure out. Depending on the locations of the source and destination you have to start copying forwards from the beginning of the source, or backwards from the end. If there's an overlap you risk overwriting the source before it is copied to the destination.

Joyfield · on June 16, 2022

My DOCSIS 3.1 Internet connection has more download bandwidth than my Amiga 500 had to RAM. Latency, not so much.

cmrdporcupine · on June 16, 2022

The 65816's MVP/MVN opcodes can do bulk transfers a teeny bit faster.

lscharen · on June 16, 2022

For more 16-bit 65816 context -- other than for space-savings, these instructions are never used when performance is needed due to the low effective throughput of 7 cycles per byte. A basic unrolled loop using 16-bit instructions is 20 - 30% faster and specialized graphics routines that are able to use the stack can approach 3 cycles per byte using the PEA and PEI instructions.

cmrdporcupine · on June 16, 2022

I'll defer to you I guess, as you seem to know more about this than me. The only thing is searching through the 6502.org forums I don't see a consensus on this?Plenty of people talking about the advantages of MVN/MVP for bulk transfers. I seem to recall doing the cycle counting myself at one point, too, and finding it advantageous.

One neat trick (I remember reading about from Alan Cox I believe) if you have control over the hardware is to memory map I/O devices like serial input / output such that incrementing addresses starting at a given address all point to the same physical device/register. E.g. allocate 256 contiguous bytes in your memory map to point to the same thing. This way you can do bulk I/O transfers to/from memory using MVP/MVN instead of "get a byte, put a byte" instruction by instruction.

rasz · on June 16, 2022

The trick you describe was being used by Silicon Valley Computer ADP50L IDE controller from early nineties (1991). Memory mapped I/O instead of traditional x86 port access lets you skip doing manual loop for 'rep movsb', result can be 50% speed bump

https://forum.vcfed.org/index.php?threads/performance-of-lo-...

Port IO Read Speed : 219.39 KB/s

MMIO Read Speed : 310.77 KB/s

Some variants of XTIDE hardware also implement this, as does the free bios.

cmrdporcupine · on June 16, 2022

Ah here it is: http://forum.6502.org/viewtopic.php?f=2&t=5035 referencing a now-lost G+ post from Alan Cox:

"The emulator also has a fun hack for disk performance I'm hoping will get replicated in some of the upcoming retro 65C816 board design. Like the 6502 the 65C816 sucks at continually reading from an MMIO port and writing it to sequential memory locations. It sucks less than a 6502 because you've got 16bit index registers, but at the same clock it was doing about 100K/second that a Z80 can do 250K (with ini loops). The revised emulated disk interface has the same mmio port replicated across a chunk of address space and this allows a block move instruction (MVN) to do all the work at 6 clocks/byte. At that point the 65C816 suddenly jumps to twice as fast as the Z80 on disk I/O."

ksherlock · on June 16, 2022

MVP/MVN are 7 cycles per byte.

If you're moving memory around in bank 0 (or have memory mapping), you can use the direct page register to read/write anywhere in bank 0 and the stack to read/write anywhere in bank 0.

16-bit LDA dp, PHA is 4 + 4 = 8 cycles or 4 cycles per byte. Best case would be if you know it's constant data before hand, eg, LDA #0, PHA, PHA .... 2 cycles per byte!

For general purpose copying MVP and MVN are easier and have better code density.

mmphosis · on June 16, 2022

2 cycles per byte! It takes 4 cycles for PHA to push the 16-bit Accumulator, two bytes, onto the stack. There's also 16-bit PHD, PHX and PHY.

spc476 · on June 16, 2022

The 6502 doesn't have a pipeline, so it's quite easy to count instruction cycles and find out how much time a given piece of code will take. I did this technique to bit-bang a serial port back in the day (given two one-bit ports with the CPU doing all the work because the system was too cheap to have an actual UART).

vardump · on June 16, 2022

Two nits to pick from this article:

While the article does mention it ignores loop unrolling, it's a bit disingenuous, because that almost DOUBLES performance and it's what nearly all real world code is doing.

Also Sam's Journey PAL version does not need any kind of DMA transfer tricks. NTSC version is just a tiny bit behind in timing, so to be glitch free, REU is used. PAL version still works with minor glitches on an NTSC system.

This is because NTSC has 263 * 63 - 25 * 40 = 15569 cycles available per frame (ignoring those stolen by sprites) and PAL 263 * 63 - 25 * 40 = 18656 cycles (again, ignoring sprites).

The difference is enough that the NTSC version can't move required 2000 bytes of color RAM and character RAM in time in the worst case without REU.

Rediscover · on June 16, 2022

I'm remembering (possibly incorrectly) PAL being 312 (not 263).

Is that what You intended?

vardump · on June 16, 2022

Yeah. I accidentally left that value wrong after a copy paste. But the result is right. :-)

aidenn0 · on June 16, 2022

There's an article that shows up on HN from time to time about (ab)using the stack manipulation instructions on an 8-bit machine to improve times for filling a framebuffer for a video game, but my search skills are failing me today...

forinti · on June 16, 2022

I once thought about reading HD floppies on a BBC Micro (which can handle SD and DD). But it turns out it can't handle the speed at which the bits come in (500kbps).

SD and even DD are fine (125kbps and 250kbps), so you could read 360KB floppies from a PC.

metadat · on June 16, 2022

Is Mario Bros a ripoff of Sam's Journey? Or vice versa?

Either way, beautiful.

gs17 · on June 16, 2022

Sam's Journey looks to be a 2017 release, so it definitely didn't inspire Mario.

[0] https://www.knightsofbytes.games/

cldellow · on June 16, 2022

Vice versa, Sam's Journey was released in 2017: https://www.knightsofbytes.games/samsjourney

cmrdporcupine · on June 16, 2022

Anybody else interested in writing a WASM VM for the 6502 or 65816 etc? This was my brainwave this week. I think this would be a supremely nerdy fun thing to do.

jlundberg · on June 19, 2022

I have actually been contemplating this, either for 6502 or 8086 :)

DeathArrow · on June 16, 2022

Beat that, Apple!

cestith · on June 16, 2022

Apple used the very same processor family at one time. ;-)

They've come a long way.

NobodyNada · on June 16, 2022

For a very loose definition of “same professor family”, they still do — ARM is sort of a spiritual successor to the 6502: https://en.wikipedia.org/wiki/ARM_architecture_family#Histor...

ARM was designed by the team at Acorn that had worked on the BBC Micro, which used a 6502. They decided to design a custom processor because bit felt none of the 16- or 32-bit processors on the market at the time met the standard set by the 6502 for simplicity and low cost. So, they designed their own architecture which took cues from both the cutting-edge RISC research in academia, and the simple practicality of the 6502.

(On a similar note: the 6502’s main competitor, the Zilog Z80, is an early ancestor of x86! The Z80 is an enhanced clone of the Intel 8080, which of course the 8086 was heavily based on.)

This legacy still shows up today in the instruction mnemonics: ARM uses “branch” naming (BEQ - branch if equal, BCS - branch if carry set, etc) because that’s what the 6502 used, whereas x86 spells it “jump” (JEQ, JCS, etc.). ARM uses LDR/STR to load and store registers from memory (like the 6502’s LDA/LDX/LDY/STA/STX/STY), whereas x86 just spells everything “MOV”. ARM only uses memory-mapped I/O to access hardware, whereas x86 has separate input and output ports.

cestith · on June 16, 2022

The 6502 was a clone-ish of the Motorola 6800 made to be lower cost. The 6800 led to the 6809 (another competitor,used by the Tandy CoCo and IIRC the Dragon) and to the 68000 series, used by Apple in the Mac, Sun in its early systems, NeXT, Amiga, Atari in their later systems, and more. That led to the PowerPC partnership of Motorola, Apple, and IBM.

PowerPC was outliving its useful life due not to ISA, but manufacturing limitations. So Apple went to Intel, but that wasn't fit for mobile. Apple partnered with ARM to make their mobile chips. Then their mobile chips grew into the M1 and M2 along with ARM, bringing them back to a RISC-ish platform like they had with PowerPC. So it's sort of a dual path back to the same place.

cmrdporcupine · on June 16, 2022

I honestly don't think there's any kind of straight line from the 6809 to the 68000. They share little in common other than the '68' prefix and coming from the same company and being big endian. The instruction sets are very different. Designed by different teams. The peripheral chip set and bus management was different too.

The 68k shares more with 1970s minicomputers especially the PDP-11 and/or VAX architectures than any MPU that preceded it.

NobodyNada · on June 16, 2022

> So Apple went to Intel, but that wasn't fit for mobile. Apple partnered with ARM to make their mobile chips.

There’s a lot of interesting history there too: in 1990, after seeing the first-generation ARM CPU, Apple partnered with Acorn to co-found ARM Ltd and develop a mobile processor for the Apple Newton. Although the Newton was a failure, ARM was very successful and powered pretty much the entirety of the mobile device revolution — including of course the iPod and iPhone.

Apple’s co-founder status gives them a lot of influence over the ARM architecture — they led the AArch64 design process, and they seem to be allowed to do things that even other architectural licensees aren’t allowed to do, like implementing custom instructions in their ARM cores: https://news.ycombinator.com/item?id=29783549

And Apple’s iteration of ARM owes a lot to the PowerPC world as well — Apple’s processor design team was originally PA Semi, a company that designed PowerPC cores.

klelatti · on June 16, 2022

> they led the AArch64 design process

Interesting - is there a reference for this?

NobodyNada · on June 16, 2022

Here’s a Twitter thread from a former Apple engineer: https://twitter.com/stuntpants/status/1346470705446092811

> arm64 is the Apple ISA, it was designed to enable Apple’s microarchitecture plans. There’s a reason Apple’s first 64 bit core (Cyclone) was years ahead of everyone else, and it isn’t just caches.

> Arm64 didn’t appear out of nowhere, Apple contracted ARM to design a new ISA for its purposes. When Apple began selling iPhones containing arm64 chips, ARM hadn’t even finished their own core design to license to others.

> ARM designed a standard that serves its clients and gets feedback from them on ISA evolution. In 2010 few cared about a 64-bit ARM core. Samsung & Qualcomm, the biggest mobile vendors, were certainly caught unaware by it when Apple shipped in 2013.

> Apple planned to go super-wide with low clocks, highly OoO, highly speculative. They needed an ISA to enable that, which ARM provided.

> M1 performance is not so because of the ARM ISA, the ARM ISA is so because of Apple core performance plans a decade ago.

klelatti · on June 16, 2022

Very interesting - many thanks!

Edit: I’m a bit puzzled by the claim that Apple was selling Aarch64 before Arm had finished their first design - A7 announced at end 2013 but A53 appeared in 2012?

NobodyNada · on June 16, 2022

It looks like A53 was announced in October 2012, but I’ve found no indication of whether the design was actually finished by then [0]. And remember that ARM just sells IP and other companies are responsible for manufacturing it; it doesn’t look like anyone actually produced A53 cores until 2015 [1] — whereas Apple was shipping actual consumer products with A7’s in them by October 2013.

[0]: https://www.techspot.com/news/50656-arm-announces-64-bit-cor...

[1]: https://en.wikichip.org/wiki/arm_holdings/microarchitectures...

klelatti · on June 16, 2022

Very fair point. OTOH there was a lot of detailed info on the A53 available in 2013 and SoCs were being announced with it.

I suspect this thread may be slightly exaggerating the position but certainly the case that Apple were well ahead of all the competitors - and no doubt they were deeply involved in the ISA design.

DeathArrow · on June 16, 2022

>the Zilog Z80, is an early ancestor of x86! The Z80 is an enhanced clone of the Intel 8080, which of course the 8086 was heavily based on

I owned an Z80 based computer when I was 8 to 10 years old. Its instruction set and memory access does not have any resemblance for me with 8086.

They seem like very distant relatives.

bonzini · on June 17, 2022

Apart from unique functionality such as the separate I/O port bus and the ability to access the 16-bit registers' two 8-bit halves, there are quite a few instruction encoding quirks that reveal the ancestry of the 8086:

* the x86 encodes the first four registers in the order AX, CX, DX, BX. This roughly matches the Z80 ordering of AF (accumulator), BC (used as counter for string operations), DE (no particular purpose), HL (the "main" address register).

* PUSH/POP operate only on 16-bit registers

* the encoding of flags (SZ0H0P1C on x86, SZuHuPNC where u is undocumented on Z80). The "auxiliary carry" and instructions such as DAA, and the "parity" flag, are particularly weird and common to both Z80 and 8086. Flags exclusive to the 8086 (interrupt, direction, overflow) are kept in the high bit of the flag register, so that the LAHF instruction makes AX look like the Z80 AF register.

* the eight conditional jumps of the Z80 are encoded in the same order in the 8086 (C<NC<Z<NZ<S<NS<PE<PO; the 8086 fits 8 more conditions in the holes)

klelatti · on June 16, 2022

The 8086 was designed to allow automated translation of 8080 assembly to 8086 assembly - so the instruction set may ‘look’ different but in fact has a lot in common.

Not quite right too to call the Z80 an ancestor of the 8086 but certainly closely related due to the common inheritance from the 8080.

NobodyNada · on June 16, 2022

Yeah, perhaps more of an uncle than a direct ancestor :)

klelatti · on June 16, 2022

Indeed - someone should do a family tree of CPUs!

cestith · on June 16, 2022

We need to decide where the NEC v20, v30, v40, and v50 live.

krallja · on June 16, 2022

And the NSC-800, which is like a Z80 with 8085 half-interrupts!

klelatti · on June 16, 2022

It’s the offspring of the marriage of Z80 and 8085!

dusted · on June 16, 2022

I wouldn't call the 6502 a RISC CPU.. It was clearly designed for humans to program, it has multiple and complex addressing modes and instructions to make it easier for us..

Sure it is a small instruction set compared to modern CPUs, but RISC is an idea, not a number.

I'd venture to say that RISC is designing with the goal of making very efficient instructions and allow very efficient compilers to be written for it.. It's the idea to have one faster way of doing something rather than multiple convenient ways, because the compiler don't care, and compiler vendors appreciate not having to chose between multiple almost similar instructions that may or may not be faster in some particular case if they don't have to.

jsrcout · on June 16, 2022

Don't remember where I saw this, but someone said the 6502 was really an RTCC (Reduced Transistor Count Computer).

pvg · on June 16, 2022

It's a joke, as the article says.