This reminds me of a quote by a master magician, which I unfortunately can't locate right now. The gist was that one way magic tricks work is that the audience would never believe the sheer amount of work that goes into developing the skill to pull off the trick.
The C64 demoscene is at this point a pure case of magicians developing tricks for other magicians. Compare to 20 years ago, when the tricks were made to wow users familiar with the platform and its limitations.
Now someone without intimate knowledge of the C64 would not understand what the hard part of this demo part even is. Letters scroll up and sometimes expand or shrink a bit. We've seen lots of different scroll text parts. Is this hard?
Consider the brute force approach. The text scroll area is 192 by 200 pixels. If this were just a bitmap, that is 4,800 bytes to update (24 bytes per row × 200 rows). Pure unrolled code to move the bytes would do
LDA src    ; 4 cycles
STA dest   ; 4 cycles
for a minimum of 8 cycles per byte (more once you add the logic to pick out which letter to draw, or if you use an indexed loop instead of fully unrolled code), or 38,400 cycles to update that bitmap. But an entire PAL frame is only 19,656 cycles (63 per raster line × 312 lines), and not all of those are free. The best a brute force approach would get, then, is one update every 3 frames, or about 17fps.
So all the cleverness is in getting the machine to do at 50fps something it could naively manage at 17fps at best. It does this by racing the raster beam and playing tricks with hardware bugs in the CPU/video chip interface.
Magic tricks and con artists employ many of the same tricks. A convincing con is one so elaborate that it looks like it would be too expensive to perform, and would require far too much rehearsal, if it were really a con, so people are lulled into complacency.
Maybe the difference is that one form deceives to entertain and the other to steal, but they have a lot in common.
Holy cow! I was big into the demoscene as a kid - they consumed most of my hours spent on the C64... but WOW, that was an amazing demo: not only implementing some crazy-new tricks, but also aesthetically beautiful! Now I need to go back and watch all the other gems I've missed over the last 25 years.
Having watched the video of what he accomplished on the C64 I am now retroactively furious at every drop of framerate I have experienced on every platform ever.
When you work back through the causal chain enough, you end up having to be furious about 1. standardized architectures with many compatible hardware models; and 2. non-hard-real-time operating systems.
Interestingly, in a modern environment (unlike in the 80s-00s), both of these seem like choices that could be made either way.
Re #1 — all the major OS manufacturers (Microsoft, Apple, Google) now have their own flagship devices running their OS, and could standardize their app-store certification processes to involve perf testing on their own hardware if they wanted. Especially for the mobile hardware: there's no reason the QA process for e.g. releasing a game on iOS can't involve certifying zero stutter on a given iPhone, and then restricting the game to only be playable on that iPhone or newer (the way consoles effectively work right now.)
Re #2 — there's no reason, other than inertia, that personal computer (including mobile) OSes still put every single task on the system into one big (usually oversubscribed, from a realtime perspective) scheduling bag. We could—using the hardware hypervisor built into all modern architectures—split PC OSes into two virtual machines: one for the foreground "app", and one for everything else; and give the foreground-app VM fixed minimum allocations of all the computer's resources. Then you could make real guarantees† about an app or game's performance, as long as it was said foreground app. It'd be very similar to the way some game consoles (e.g. the Wii) have separate "application processors" and "OS service processors", to ensure nothing steals time from the application.
† And those guarantees would also be requirements: if the foreground app says it needs its VM to have 5GB of RAM, you literally won't be able to run it if you don't have 5GB of free physical memory to hand it (though the OS would probably first try to OOM-kill some sleeping apps to give it that memory, like iOS does.) Much clearer than the current "this game will be really slow if your computer is more than four years old, but it's not exactly clear which part of the computer is below-spec" we have today.
"there's no reason the QA process for e.g. releasing a game on iOS can't involve certifying zero stutter on a given iPhone"
To guarantee that, testing has to go through _all_ possible game states. For almost any non-trivial game, that's infeasible.
"and then restricting the game to only be playable on that iPhone or newer"
There is no guarantee that newer hardware would be faster for every possible program execution, and even if it were, timing differences could affect game play.
There also is no guarantee that newer hardware produces the exact same results. For example, better anti-aliasing or fonts drawn at double resolution could affect hit detection.
This isn't even guaranteed on the 'same' hardware. For example, there might be C64s that don't have the bugs that this demo exploits.
If only things were that simple. A demo video can show that there is an interaction that doesn't stutter, not that there is no interaction that stutters.
I realize it's imperfect, but it would raise confidence to see a video of something that looks fairly consistent with the advertised usage that satisfies the benchmarks.
The video doesn't guarantee bug-free play, but it can give us confidence that the benchmarks are an honest test since we can see what the test looked like.
If it's a text editor and the video just shows that opening a new blank file and typing a few words is fast, or a game that only shows itself running well in an empty room, then we know the demo is dishonest. And having a "boring" video on their store page won't win them many buyers. You want all the ad content to look as good as possible, so the demo video would ideally show off the most impressive features of your application, which is exactly where we'd expect performance problems.
That said, it's probably not feasible because creating a full benchmark script that also doubles as a real-world demo for an app would be too much burden to put on developers.
Every time I see a scroll text that jerks, the old demo coder in me wants to scream - I never released anything exciting, but figuring out how to do scrolling smoothly was pretty much the first thing you'd learn.
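For the uninitiated, the classic beginner version is: move the VIC's fine x-scroll one pixel per frame, and only shift the character rows in screen RAM when it wraps. A rough sketch from memory, using the standard VIC-II registers and an arbitrary zero-page scratch byte:

xs = $fb                ; zero-page scratch byte (arbitrary choice)

scroll: LDA $D016
        AND #$07        ; current fine x-scroll value (0-7)
        SEC
        SBC #$01        ; move the text one pixel to the left
        BPL store
        ; wrapped around: this is where you shift every character row in
        ; screen RAM one column to the left (the expensive part)
        LDA #$07        ; then restart the fine scroll at 7
store:  STA xs
        LDA $D016
        AND #$F8        ; keep the other control bits of $D016
        ORA xs
        STA $D016
        RTS

Call it once per frame, ideally from a raster interrupt, and the text glides instead of jerking.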
This is a PAL demo, so it's worth setting your monitor to 50Hz refresh rate if possible to see it as intended. Many monitors will sync to 50Hz even if they don't report it in the EDID. On Linux you can attempt to use arbitrary modelines with xrandr.
I particularly like Linus A's contributions to the C64 demoscene, because more likely than not he'll do an excellent writeup like this one about a newly discovered or perfected technique that was central to the demo.
It's a particularly big deal given how hard it used to be to get information about these things... When I finally learned how to do DYSPs, it was thanks to finding someone willing to photocopy something like a 3rd-generation photocopy of a cycle-count diagram that someone had drawn by hand.
So much effort was lost because of the communications barriers.
I used to program demos for the C64 back in my teens. A lot of the learning was simply reverse engineering the code from other demos, sometimes copying snippets verbatim, always trying to understand. At age 16 I was pretty confident that I had achieved the skill level of a wizard, but in reality my understanding was pretty sketchy.
An example: rendering graphics (i.e. sprites) at the far horizontal edges of the screen would require the CPU to perform some shenanigans to trick the video circuits; this would need to be done on every scanline and required the timing of the CPU to be in close sync with the video hardware. I understood that. I had also experienced how sprites and every eighth scanline (aka a "badline") would mess up the carefully planned timings. Eventually, I kinda understood that concept. I had also seen, from code that I copied, how triggering a badline could be used to force the CPU into sync with the raster beam, but it was akin to black magic for me. It wasn't until years later, programming on the Amiga, that the penny dropped for me.
And of course, grasping the concept and implications of DMA was pretty basic stuff compared to what's going on in this article. I don't think that I'll ever devote the time to understand it in detail but I find it fascinating how people keep discovering new unintended features in the old C64 architecture.
Linus, if you are reading: I just discovered your website via this post, and have really enjoyed checking out all the stuff on it. Lots of great fun on there, and also lots of very thought-provoking ideas about music and the compositional process. Thank you!
This looks like a pretty nutty technique, essentially exploiting an undocumented state (or bug?) in the VIC chip.
The Atari 2600's TIA had lots of sharp corners caused by the reliance on polynomial counters, which saved silicon but made for lots of seemingly-random edge cases (after all, polynomial counters are also used to generate pseudo-random noise!)
> essentially exploiting an undocumented state (or bug?) in the VIC chip.
You just described most of the new effects in C64 demos over the last 30+ years...
But, yes, this is one of the nuttier ones.
When I was a kid and trying to do demos, the "simple" stuff like tricking the VIC to keep the borders open was still amongst the more exciting things you could do (opening the top and bottom border is trivial once you know how; opening the left/right border was harder - especially if moving sprites in the Y-direction as it affects the number of bus cycles available to the CPU).
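For the curious, the top/bottom trick is roughly this (a sketch from memory, not anything taken from this demo): the VIC only sets its vertical border flip-flop when the raster reaches line 247 in 24-row mode or line 251 in 25-row mode, so you arrange for neither comparison to ever match:

wait:   LDA $D012
        CMP #$F9        ; wait for raster line 249, between the two checks
        BNE wait
        LDA $D011
        AND #$F7        ; clear RSEL: switch to 24-row mode
        STA $D011       ; the line-247 check has already passed and the
                        ; line-251 check no longer applies, so the border
                        ; flip-flop is never set this frame
        ; ...then switch RSEL back on before line 247 of the next frame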
That was quite literally child's play compared to this one.
1. On each raster scan line, at precisely cycle 15, you need to clear the Y-expand register (the vertical pixel-size doubler thingy that sprites can do). This throws the hardware into confusion: the internal counters keeping track of where in memory a sprite is being drawn from get scrambled, and you wind up with the interesting graph presented on the right showing how the indexes progress from scanline to scanline. Y-expand is a single byte register where each bit belongs to one of the eight sprites. Simply clearing a sprite's Y-expand bit on cycle 15 of every scan line is sufficient to introduce glitch pandemonium.
2. On some rows you want a sprite's Y-expand bit cleared, to trigger the glitch, and on other rows you want it set, so that the next row is read in sequence. So before the scanline ends, we need to set Y-expand back to 1 on a per-sprite basis. How did the author do this efficiently? ...
3. ... by using sprite-to-playfield character collision detection! He put the sprites in the background, behind the character graphics and placed a single pixel vertical bar using redefined characters or bitmap graphics to cover the right-most pixel of the sprite. In the sprite definition's right-most pixel for the current row, he would encode either a 0 or 1 to decide if the next row's Y-expand should be 0 or 1. The natural collision detection of the sprite hardware would transcribe a 0 or 1 into the sprite-to-playfield collision register for all 8 sprites when both the sprite and the playfield had a filled bit. You'd wind up with a byte that is ready-made for the Y-expand setting for the NEXT scanline for all 8 sprites. (I assume before the end of the scanline, the idea is to read the collision register and write its value to Y-expand.) What a clever way to save a memory fetch! Also by reading from the collision register before the end of the scanline, it resets the VIC chip's readiness to test collisions and collisions will be tested again on the next scan line, so the whole process can repeat.
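Putting 1-3 together, I imagine the core of the per-scanline routine boils down to something like this (my own sketch using the standard VIC-II register addresses, not the demo's actual code - and landing each write on exactly the right cycle is the real work, which is glossed over here):

        ; around cycle 15 of the scanline:
        LDA #$00
        STA $D017       ; clear the Y-expand bits: any sprite whose bit was
                        ; set gets crunched
        ; ...
        ; later, before the scanline ends:
        LDA $D01F       ; sprite-to-background collision bits, encoded by the
                        ; one-pixel bar (reading the register also clears it)
        STA $D017       ; and that byte becomes the Y-expand setting that will
                        ; be cleared again on the next line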
I find this hack to be really beautiful. That use of collisions to dynamically build the next scanline's Y-expand values kind of reminds me of how modern 3D games encode all kinds of scene information into various layer buffers and color channels as a frame is rendered over many passes.
As a kid I reused the 64 sprite hardware over and over to fill the screen with sprites. As I recall, I lost a lot of CPU time because the VIC chip was hogging access to the memory more than it normally would. This trick would have let me fill the screen with more sprites than ever. One of my dream goals as a kid was to reuse the sprite hardware faster, to change their colors and be able to get more colors on the screen.
I recall trying to find a way to get the sprites to be only one scanline high and spend all my time simply changing colors and memory locations on the sprites. It never worked; it was as if the VIC chip was locked in on the settings for a sprite until it was done drawing. And that, in fact, is the trick with this Y-expand thing - you can trick the sprite hardware into finishing a sprite in fewer scanlines than should be possible (apparently as few as 4, according to the author!). Once the hardware thinks the sprite is finished, it is relinquished and can be commanded to reuse that sprite on subsequent scanlines to paint more pictures.
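The ordinary version of that reuse, without any crunching, is just waiting for the beam to pass the sprite and then handing the same hardware sprite a new Y position and data pointer - something like this, with made-up positions and the screen at the default $0400:

wait:   LDA $D012
        CMP #$80        ; wait until the beam is below the sprite's first use
        BCC wait
        LDA #$C0
        STA $D001       ; new Y position for sprite 0, further down the screen
        LDA #$81
        STA $07F8       ; new data pointer for sprite 0 ($0400 + $03F8)

The crunch trick just lets you do that handoff after as few as four scanlines instead of the usual twenty-one.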
It seems the demoscene may achieve my dream someday - perhaps my fancy super-color-1-pixel-high-sprite bitmap display mode hack may one day become a reality!