Overhauling Mario 64's code to reach 30 FPS and render 6x faster on N64 [video] (youtube.com)
699 points by kibwen on April 20, 2022 | 260 comments



This is a good lesson about the benefit of time when it comes to software development. Mario 64 could have been a much much better game on the same hardware, but the time cost would have resulted in a release late enough that the entire game might have been considered irrelevant.

Game consoles have had the benefit of the platform freezing for a few years allowing improvements to accumulate and later titles utilizing knowledge gained from the earlier titles.

Much of the software we use every day could benefit from this level of scrutiny but the pressure to deliver on time means we won't get to see such benefits unless it becomes a crazy persons labour of love.


There are externalities to consider that make a later release less likely, too - it was a launch title, matching Super Mario Bros.' pack-in with the NES and Super Mario World being one of two games available when the SNES launched.

Wikipedia’s entry on SM64 tells me that the N64’s launch was delayed because Miyamoto wanted more time to polish the game. I don’t think a later release would have been possible. This was the flagship title, the system seller. Come play Mario. In 3d.


>This was the flagship title, the system seller. Come play Mario. In 3d.

And I think they were right. I bought it just because of Mario64. Otherwise I'd have gone for Playstation.


I decided to buy it for Shadows of the Empire, but Mario64 was actually great (while SotE was not).


You take that back. SotE was rough, but had some excellently varied gameplay and interesting, fun levels with cool enemies. Also some losers in there, but every great game has a bad level. :)


I watched the video through only once, but unless I'm mistaken (the video is a little confusing on this point), the claim is that the original game did run consistently at 30fps (the design framerate of the console), and the point of these hacks is to improve framerate for mods and devices that can play the game at higher rates. Unless I misunderstood something, it sounds like the original game was exactly as good as it needed to be, as released. And that's before you get to the fact that the mods require the RAM expansion pack.


The original Mario 64 would suffer lower framerates in certain areas (e.g. Dire Dire Docks). We can also imagine that if the artists had had more headroom to work with, then they would have spent it on making levels larger and more detailed.


Is it only a technical limitation concerning the size and detail of the levels?

Mario 64 was already pioneering basically everything in 3D platformers, and maybe even 3D games as a whole.

Given it was a launch title, they had to stop somewhere and ship the thing. So, like any software ever, they had to make tradeoffs.

And, well, here we are talking passionately about this game 26 years later. To me that's amazing proof that they made the right tradeoffs.

Because, yes, we now have plenty of time to analyse the thing and find issues with the game. But the six-year-old me, the player me, had nothing to criticize about this game.

For me it is still a perfect masterpiece given its era and technical context. I don't think the game could have been much better than it is without breaking something, and I'm pretty sure the engineers at Nintendo considered the game « finished » and liked the result themselves. It shows in the game. The better would have been the enemy of the good.

But in the meantime you are also right, because Mario 64 DS was a really good extension. I miss that game; it really deserves a remake. But that version hugely benefited from the evolution of the Mario universe and the technical superiority of the DS, so I'm not even sure it could have been done in the 90s.

I hope we'll one day be able to play a newer version of this game. Official or not, I crave SM64DS with the original controls.


I read somewhere that they didn't compile with compiler optimisations; doing so would have fixed the water level.


The video also addressed that the game shipped with the compiler in debug mode. That alone would have helped.


The whole "Nintendo forgot to set the O2 optimization flag" thing has been somewhat debunked, or at best it's misleading. ModernVintageGamer did a video discussing the topic that's pretty well-researched

https://www.youtube.com/watch?v=NKlbE2eROC0

To summarize what he says: since the game was a launch title, it was likely being compiled with an early, less-stable compiler/SDK, with known bugs that wouldn't be fixed until later (he even cites some official developer documentation that mentions this). So setting that flag could have caused known issues/instability at the time, and they just left the optimization out.

And the other thing is that the game uses a lot of libraries that were indeed compiled with the O2 flag, and the performance drop from removing those is far more significant than the tiny gain from adding it to the top level of the makefile, where the rumor suggested it should have been added.


Quake launched a day before Mario 64, so I think 3D games would be fine without M64.


I have nothing against Quake, it's also a wonderful game. But Mario 64 solved more complex problems of a 3D nature: huge scenes with lots of (moving) objects, movement in all directions, a complex move set, and the camera, camera, camera…

Quake is a nice game as it is, but it's much simpler when the player is the camera.


More complex problems?

I'm not sure. A sophisticated camera is quite nice, but quake introduced a visibility determination solution that was used in dozens, perhaps hundreds of games afterwards. It also helped pioneer modding (it has a compiled scripting language), and later on had online deathmatch with predictive netcode (so it actually worked over the internet!) and was one of the first games to support hardware accelerated 3d graphics on pc.

It also has lightmaps, all surfaces are textured, and it runs in software with no hardware acceleration: z-buffering, built-in perspective-correct texture mapping, etc., all done on the CPU.

Obviously, that doesn't make it more fun to play, but Quake was an incredible technical achievement for the time, I don't think anybody would put Mario 64 in the same basket.


Quake supported cameras, as you put it. People made movies in the Quake engine with literal camera shaped models floating around. Also, Quake levels could be huge, much larger than Mario 64. Take a look at Ziggurat Vertigo [0][1], for example. Mario 64 was much more colorful, though, Quake had an intentionally limited color palette.

[0]: https://quake.fandom.com/wiki/E1M8:_Ziggurat_Vertigo [1]: https://www.youtube.com/watch?v=a72k_XBi2ls


They practically bruteforced all technical problems, using dedicated hardware. And simply ignored some others, like lighting.


The epic music of Dire Dire Docks compensated for the framerate. https://www.youtube.com/watch?v=Zqa2mgjbOIM


The music practically eased you into accepting the frame rate, slowing the player down a little with its calming mysterious chime.



I see your Opus 1 and raise you Billy Cobham - Heather https://www.youtube.com/watch?v=e3E9vx5vVck


Wasn't that because it had compiler optimizations disabled, due to a bug in the compiler that was subsequently fixed after the initial cartridge release?


No, it wasn't compiler flags. Put simply, there are too many objects on the screen in that particular part of the level, and the engine has wasteful code that, IIRC, computes a certain piece of physics three times for every object on screen, among other things.


That's part of why performance in Dire Dire Docks was bad, but Kaze goes WAY beyond just enabling -O2 in the compiler.


It wasn't the only reason, but -O0 was a contributing factor. -O2 alone improves things, but doesn't bring the frame rate in DDD up to 30


Assuming they had the spare capacity on the cartridge and the time to develop additional content. Neither of which is a given.


I think the main thing is that Nintendo had splitscreen multiplayer planned, but performance prevented it.

Edit: Got farther into the video and he talks about this at 14:50 ( https://youtu.be/t_rzYnXEQlE?t=890 )


L is real 2401


The video is confusing on this point.

The focus is on all the cool improvements and it seems like there is a lot to like.

But at 6:40, the dev says their solution would not work on the stock 4MB of RAM included with the N64. It is kind of snuck in there.

It’s kind of an important detail, because like the GTA loading screen fix last year it’s way more interesting when stories like this have a What If?! slant to them.

Another commenter mentioned this below, but it’s buried.

https://news.ycombinator.com/item?id=31104112


That's only for one of the many optimisations he did, making use of different sections of memory to improve throughput. Many (all?) of the other ones don't increase the memory requirements. Quite a few (such as re-rolling unrolled loops) should decrease code size (though that won't improve memory use on a ROM cartridge based system) improving cache efficiency.


About midway into the N64's lifecycle a RAM expansion card (the Expansion Pak) was released. His modifications sound like they still run on a real N64, but the Expansion Pak is needed. This was fairly typical of higher-end games made by Rare. Even as a purist, this added detail doesn't really detract from the outcome of the project for me.


Almost all games worked without the Expansion Pak ("Pak" intentionally spelt without a 'c'). In fact there were only 3 released games that required it[1] and even some of those were still playable without it, you just lost some content (eg Perfect Dark).

If this Mario 64 mod requires the use of the Expansion Pak then it's definitely not typical of other N64 games and a point that, like the GP, I felt needed more emphasis rather than a passing remark that was very easy to miss.

I'm not taking anything away from the incredible work that developer has done though. It's impressive and really interesting to see. From a hacker perspective it makes total sense to use the Expansion Pak because most retro gamers who might be interested in this mod would likely already have one. But from an authenticity perspective, this is a little bit of a cheat.

[1] https://en.wikipedia.org/wiki/Nintendo_64_accessories#Expans...


For Perfect Dark, the content you lost was the main game, so it was pretty essential.


Indeed; however, somewhat counter-intuitively, the multiplayer game did work without the Expansion Pak. But arguments about the completeness of Perfect Dark aside, my point is that Perfect Dark is still the exception, in that the vast majority of games worked without the Expansion Pak. Which is contrary to the earlier commenter who said a lot of N64 games required it.


I believe the game can run at 60 FPS on a real N64 with the Expansion Pak installed using his modifications. I don't think all these changes, and mods, are limited to use on emulators.


Noob question, how does one get his code to load on real N64 hardware?


https://krikzz.com/our-products/cartridges/ed64x7.html

These things are amazing for retro gaming enthusiasts because they let us play patches, mods, homebrew, and hard-to-find games on real hardware. Not that emulation isn't also wonderful.


As an aside, Krikzz is a Ukrainian company. He moved to Spain when the war began and is slowly starting to ship stuff again.

Their hardware is amazing.


The framerate dropped in certain areas; it ran at 30fps mostly consistently, but not perfectly so. These mods make it so that even in those areas it still maintains 30fps.


Target FPS varied between N64 games. F-Zero X, Super Smash Bros. and some others ran at 60 FPS, while Zelda only ran at 20 FPS (17 FPS in PAL regions).


One reason for the gap in those specific titles is that FZX and SSB use the 1-cycle color combiner, while Zelda uses 2 cycles.

SSB also has very specific Z management. Levels are drawn without the Z buffer, from back to front, then the character section (1 meter wide) is drawn with the Z buffer enabled, and the rest of the level is drawn without the Z buffer, once again from back to front.

I never inspected FZX, but I suspect its Z management is handled in a similar way.


> ... unless it becomes a crazy persons labour of love

Or there are valid business reasons to keep chipping away, in particular on performance-per-dollar problems at scale. If you're spending $1M/month on hardware to run something, each 10% win is worth enough to employ a couple senior engineers.


Many senior engineers who operate systems which spend 1M/month make 500k yearly

1M * 12 * 10% = 1.2M

500k * 12 * 2 * taxes >= 1.2M

How many 10% wins do they have to achieve in what timeframe to have a good ROI on micro-optimization, considering that these seniors could be producing stuff that allows for multiple times their salary in revenue?


Improvements are one-time investments of engineering time, but ongoing savings in infra costs. Consider also that these savings also permit scaling to more customers or more work.

Suppose you have two engineers that total to $1M of personnel cost. If they deliver a cost savings of 1% per month (compounding) the savings-per-month will have surpassed their pay per month; you will see your infra costs at 85% of what they were after 16 months, and on month 17 you will have made a net profit on those engineers, where the savings they got you has totalled more than what you've paid them.

If your business is growing, you could project your infra costs to grow beyond $1M per month, and the numbers become even more favorable.
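
If you want to sanity-check that arithmetic, here's a minimal sketch (assuming $1M/month infra, roughly $1M/year total for the two engineers, and a compounding 1% reduction per month, as described above):

    #include <stdio.h>

    int main(void) {
        double infra  = 1000000.0;        /* monthly infra cost at the start       */
        double pay    = 1000000.0 / 12.0; /* two engineers, ~$1M/year total        */
        double factor = 1.0;              /* current infra as fraction of original */
        double saved  = 0.0, paid = 0.0;  /* cumulative savings vs cumulative pay  */

        for (int month = 1; month <= 24; month++) {
            factor *= 0.99;                   /* 1% compounding reduction per month */
            saved  += infra * (1.0 - factor); /* this month's savings, accumulated  */
            paid   += pay;
            printf("month %2d: infra at %5.1f%%  saved %9.0f  paid %9.0f\n",
                   month, factor * 100.0, saved, paid);
        }
        return 0;
    }

It prints infra at roughly 85% of the original after month 16, with cumulative savings overtaking cumulative pay around month 17, matching the numbers above.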


Unless a company is running an engineering team composed of monkeys, it will be hard to achieve 1% optimization per month, even without compounding. There are a lot of optimizations which look good on paper but in practice bring other unforeseen cons

Anyway, a profitable software company will be looking at getting a big multiple of 10x the salary of their engineers


I work in performance engineering, although in a client rather than backend role. We usually do better than 1% a month. In fact, we do this even as more than 1% worth of regressions makes it into the build from the software development process.


It depends what you're optimizing. Mature c++ codebase that is routinely profiled? Large legacy codebase with few tests? Mixed workloads?

All hard. But there's also a lot of easy cases out there - even good engineers might code like monkeys if they're under time pressure and trying to grow the business.

I've done a lot of performance work, and while of course no one individual could achieve a 1% savings per month on the entirety of FAANG compute loads, there is lots of code out there that nobody thought would run at scale or be on the critical path of something important, but now it is and it has gotten expensive.


You shouldn't multiply by 12 if the 500k is yearly. Also what's the 2? 2 engineers? The correct math would be:

1M * 12 * 10% = 1.2M

500k * 2 * taxes >= 1M


Someone said "a couple engineers" hence 2. The 12 was a bad copy paste, thanks for pointing that out


As a rule of thumb, I devote about 0.5% of my engineering org to cost optimization. At about 200 people it makes sense to invest a single engineer. By the time you're approaching 800 it makes sense to have a team dedicated to it.

Though cost opt generally involves way more than optimizing code. It's normally about auditing things, creating budget tooling, right sizing, broader architecture redesigns, etc.


I don't see how a rule of thumb could make sense, since the cost of operations can vary so much from company to company.


I mean in this case, I'm confident that the codebase for Mario64 was used as a basis for subsequent N64 titles; when pioneering something like this, you want to spend some extra time during or after development to tweak these codebases.


It was definitely used for at least Ocarina of Time and Majora's Mask, albeit tweaked. I don't doubt that they used it in other titles too.


Much of the software we use every day does benefit from this level of scrutiny. There was a popular post today about changing C++ std::sort for more perf. There are JavaScript engines running 100x faster for a given CPU than they were in the 90s. GCC and LLVM optimize more than ever.


But there’s tons of untapped potential that is rarely realized - compare some of the greatest Doom PWADs made today (on the vanilla Doom engine, not on limit removed engines) vs the original game.

They were technically possible but there wasn’t time to do them and still get the game out on schedule.


Or simply nobody thought things like that could be done (for example, the classic "invisible" sector bridge) before a PWAD did it. Plus, the game was released targeting an 80386, so the stock maps were limited to what ran at acceptable performance on that kind of CPU.


Yeah, “limits” are sometimes in hardware and sometimes conceptual. Look at the demoscene - most of what they do is designed to run on period hardware, but far surpasses anything anyone did at the time (it’s the whole point).


Check FastDoom too :).


Super Mario 64 was a launch title for the N64. One of two launch titles in North America. There really was no way to delay the game without delaying the console.

With that in mind, SM64 is a pretty impressive game for a console launch title. It makes good use of the N64 controller, performs well, and is pretty fun. It's not the best game of all time but it's definitely not a bad game.


This is a good reason to stop changing the core frameworks every 5 minutes, and stop building one off infrastructure to solve individual problems, unless there's a really good reason.

When something is stable and widely used, you can do optimization like this, and justify it for real production systems, and everything built on it benefits.


You clearly need to spend more time in a JavaScript team. Knock that nonsense right out of you :-D


JS: The mediocre language made wonderful by its great ecosystem.... with a dev community constantly trying to use random crap instead...

Seriously guys we have great utility libraries and bundle optimizers... why are we still piecing together micro libraries like this is an old UNIX mainframe?


This is completely true for non-cloud, on-prem hosted code/programs, but the notion disappears when software is hosted in the cloud and updates can occur continuously.


I don't know if there's a term for this: the limit on how far any task can reach. Everything has a deadline and we all live with partial results in a way. It's like civilization's internal resistance (with a cycle of survival on top).


It's not entirely about time; it's as much the benefit of 25 years of experience, of not writing for brand-new prototype hardware, and of not writing a brand-new 3D engine when 2D had been the norm for decades.


That's the part about batch rendering. Everything else there is architecture specific and typically not a problem on modern hardware anymore, unless you really force it to its limits.

The coders accidentally optimizing to reduce GPU compute load when there's GPU compute to spare and memory bandwidth is limited, for example.


This is why honestly developers need to stop seeing issues as technical problems and more as being social ones (which they often are).


This is why we have product software engineers, and UI designers. The UI is a social problem and the purpose/goal of the product usually is. The guts that make that UI and achieve that purpose are very rarely social.


"roll those loops back up and don't compile in debug mode to achieve significant performance gains" is not the takeaway I thought I'd have going in, heh.


Unrolled loops trade instruction cache for cycles. Whether that's good depends on what your bottleneck is; in this case RAM access is a bigger bottleneck than the CPU, due to being shared with the rendering coprocessor. I wonder if we'll hit this crossover point again in the future.
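
To make the tradeoff concrete, here's a toy sketch (not actual SM64 code) of the same loop rolled and hand-unrolled. The unrolled version saves branch and counter work but occupies several times as much instruction cache:

    /* Toy example only -- not from the actual game code. */

    /* Rolled: small code footprint, one branch per element. */
    void scale_rolled(float *v, int n, float s) {
        for (int i = 0; i < n; i++)
            v[i] *= s;
    }

    /* Hand-unrolled by 4: fewer branches and counter updates,
       but roughly 4x the instructions to fetch into the I-cache. */
    void scale_unrolled(float *v, int n, float s) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            v[i]     *= s;
            v[i + 1] *= s;
            v[i + 2] *= s;
            v[i + 3] *= s;
        }
        for (; i < n; i++)   /* leftover elements */
            v[i] *= s;
    }

On a machine where memory is the bottleneck (like the N64 with its shared RDRAM), the extra instruction fetches for the unrolled version can cost more than the branches it removes.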


And the N64 code was particularly unrolled, for whatever reason. IIRC, there was one game that bit me when writing a JIT N64 emulator because there was a single code stream with no branches that was larger than multiple MMU pages. My JIT kind of assumed that actual code in practice wouldn't be larger than the actual N64 I$ without branching. Womp womp.

But hey, I guess I don't want to be judged on my mid-90s code either.


> And the N64 code was particularly unrolled, for whatever reason.

My naive assumption is that they performed the majority of testing with debug code (as you would). And, since they had exactly one chance to get things right (being a physical cartridge), they couldn't afford to do all the testing again, with the new/different binaries generated with debug turned off.


My understanding is that the massively unrolled loops thing had to do with how branches felt different versus previous consoles. Compared to filling an I$ line (somewhere in the many dozens of cycles on an N64), a branch mispredict was very cheap. However, branch mispredicts were very visible. Delay slots were present in the ISA to decrease their cost, and the pipeline stalls were noted in pretty much all literature on RISC at the time. On top of that, the N64 really didn't have great performance monitoring tooling, but had a complex enough memory hierarchy to have really benefited from such tooling had it been available. It's pretty hard to understand the system effects pieces of code have without that.


From what I understand (as I didn't start writing code until the mid-2000s), loop unrolling was a very highly regarded optimisation of the early 90s.

And it was a very long time before programmers (and compilers) realised just how bad of an idea loop unrolling was once we entered the era of superscalar processors with instruction caches and half-decent branch prediction.


It never occurred to me that loop unrolling could end up being a de-optimization due to compiler optimizations!


The things the parent comment referred to (instruction caches and branch prediction) are in the processor hardware rather than other compiler optimisations.


To be fair early 90s branch prediction was garbage. Back then it seemed like every new processor release was accompanied by a note on how branch prediction was improved by 80-90% compared to its predecessor.


To be even more fair, counted constant sized loops never required particularly good branch prediction, especially if the code is well laid out so that your "miss" does literally one more cheap step.

Misprediction there means you computed one more vector or value, not a huge loss compared to a cache miss. Branch misprediction hurts a lot when it forces a cache miss mostly, whether it's for data or for instructions, or ties up a deep instruction pipeline completely with lousy abort, which most CPUs do not have. (See: Nehalem.)


He did so much more than that. The not compiling in debug mode was well known before this video, and he addresses that. His speedups are independent of not compiling in debug mode.


I know, and I hadn't appreciated all the ways that doing something less CPU-efficient was actually faster due to the way the hardware was architected.

I just actually shouted when he said that if you simply compiled for release mode it ran 30% faster.


It was common belief in that era that you made code faster by making it longer and more complicated. The opposite is true now (because of cache size and deep OoO) - simple predictable loops are great - but even then people tended to overdo it.

Of course, it's a way worse problem that they compiled at -O0.


>Of course, it's a way worse problem that they compiled at -O0.

There was a Modern Vintage Gamer video going into this. Apparently it wasn't a mistake or misunderstanding, but a bug that existed in the original version of the compiler that caused issues when compiling with optimisation. Even then, the main library that did most of the heavy lifting was compiled with -O2.

Super Mario 64 was a launch title, and they simply didn't have the time to work out the kinks in the compiler to get those few extra fps in Dire Dire Docks.


As I recall, that MVG video was incorrect—Nintendo could have compiled with optimizations and the game would have worked.

The most plausible explanation to me (and, IIRC, the one that Giles Goddard speculated was the case) is that they simply didn't want to take the chance that optimization could surface unknown bugs in the code, and they knew that the unoptimized version worked and played fine, so they played it safe and shipped the known-good unoptimized version. This is a perfectly understandable decision in terms of risk management, even if in retrospect it was too conservative. Less likely, but still very plausible, was that turning off optimization was just a mistake; mistakes happen when you're rushing a release for a deadline, and build deployment was not as orderly and automated back then as it is today.


Sounds like a good call to me. If you have something mission critical that's tried and tested, and reliably does 99% of what you need it to, you don't mess with it. It's very easy in software dev to get caught up in software optimizations, and it's a lot harder to quantify the risk of continuing to mess with something the closer you get to ship date.

Of course it "shouldn't make any difference" but who among us hasn't at some point been bitten by a "that couldn't possibly affect... ohhh. Ohhhhhhh." style bug?

When you're dealing not just with a single piece of software, but an entire hardware+software ecosystem, you don't risk jeopardizing the whole stack for an extra few FPS in some back corner.


Especially in an era where a buggy release could require an expensive return/reprinting step.

Back in those days it had to ship as close to perfect as possible because you couldn’t just send out a patch layer.


-O2 was considered "OK-ish". With -O3 you often get a can of worms.


Yeah, and most development houses would test and develop with -O2 on. The problems came if you ended up thinking you might need to switch near the end.


No, that's not more plausible, because it's only the NTSC release that was compiled in debug mode. The Japanese and PAL releases were -O2.


But those are different versions. Just because they worked with -O2 doesn't mean NTSC would.

I'm not just saying this in a theoretical sense either; I've seen much weirder bugs.


Yeah it's hard to remember the time before you could get 2 near-perfect C/C++ compilers on the 5-ish most popular architectures.

And the architectures have consolidated since then.


It's always been a mix. Inlining, unrolling, and otherwise choosing longer sequences of instructions that are faster is still a major theme of optimization and optimizing compilers.

Even on today's high performance CPUs, which are far more sophisticated than the primitive 5-stage scalar in-order R4300 (not sure it even had any branch prediction actually).


> Even on today's high performance CPUs, which are far more sophisticated than the primitive 5-stage scalar in-order R4300 (not sure it even had any branch prediction actually).

I believe it didn't; with a short scalar pipeline (only 5 stages) the cost of waiting for a conditional branch to be resolved is relatively small, and MIPS of that era actually exposed the idea of a "delay slot": rather than pausing the pipeline while calculating the jump, it still executed the next instruction, which was always executed whether the jump was actually taken or not. Filling this with a NOP makes it equivalent to a pipeline bubble, but I guess the intention was that in some cases it could actually be used for something.

I think a few RISC architectures of that era did similar things, their entire goal was simplicity after all, the argument being that something like a branch predictor and the logic required to reverse speculatively executed instructions "should" instead be used to make the CPU smaller or higher performing in the best case (either as implementing better performing but larger logic, or higher frequency due to shorter critical paths).

I think this was dropped in later ISAs as it confused a lot of people, it wasn't well used by compilers (which is interesting, as much of the MIPS ISA was explicitly designed around what would be easy for a compiler rather than hand-rolled asm), and improving process technology made the transistors needed for expensive superscalar speculative-execution architectures relatively cheaper.


Now that's a 10x programmer if I ever saw one. I'm sure the compiler helped too. The generated code of C compilers from the 80's and 90's versus the generated code created by something like modern GCC -- it's like night and day. We're able to build faster tinier software than we ever have before, and it really helps to illuminate the impact modern tools have had when we apply that advantage to something old we already know.


Ironically, part of the problem was that today's optimising compilers make tradeoffs between space/memory throughput and CPU execution speed that don't hold for the slower memory architecture of the N64. Namely, optimisations like loop unrolling that benefit modern CPUs are a detriment on the N64.

I think most of the cool stuff he did here is around restructuring the code to remove contention on the RAMBUS as it can do operations on different banks in parallel when servicing reads/writes from different components vs causing contention if they wish to read from the same bank. Also writing directly into the bank that will be used by the graphics unit from the CPU while disabling cache features of the RAMBUS, freeing up the cache for more code pages for the CPU to use.

Moving more of the code to execute on the "maxed out" GPU and actually cutting execution time because of the reduced memory bandwidth requirements was also pretty neat.


After 20 years not coding (or coding in python, ruby, R, that sort of thing), I started up writing code in rust that needs performance (emulator). I was absolutely amazed by the quality of the compiler today compared to what I had 20 years ago. It's really amazing to see how clever it has become.


10x in one dimension. But not by the modern definition of 10x since this actually took a lot of time. This is more like a craftsman or artisan who toils away until they are ready and the result is perfect.


I really like the way you put that into words; the creator of the video clearly has a deep passion for his work.


I always wonder if the original devs ever find these videos and are like oh shit we should've done that.


It would certainly be interesting to see their reaction.

Mario 64 is a bit of a worst case as the game was being developed at the same time as the hardware and had to get out on day one no matter what. Between the Japanese and US releases I believe they even fixed some bugs.

Some of the things he did seem quite obvious, like cutting down the repeated calculations for every single coin on the screen. You have to think they would've thought of that if they'd had the time.

It would be interesting to see a series like this on a number of games over the console’s lifetime. Especially if it highlighted the tips and tricks the developers started to learn to get more out of the hardware over time.


> You have to think they would've thought of that if they'd had the time.

Stuck forever in the issue tracker as "priority: low"


Speedrunners famously need to hunt down first press cartridges on the US release because the Shindou release fixed the bug that allows BLJs (backwards long jumps).


This explains a lot for me, since I kept trying to achieve a BLJ but for whatever reason I couldn't.

Now, years later, your comment explains a lot O_O


it's worth noting that the virtual console (wii, wii u) releases are built off of the original ntsc edition, meaning backwards long jumping is in play. but the most recent 3d all stars edition is based off of shindou, meaning blj's are not possible.


Can't even imagine being a dev back when you only had one shot at your game going gold and couldn't update it after the fact to patch bugs.


Tbh it made things much riskier for the game companies, but a lot better for the players.

For example: It was almost impossible to profit on the “make a killer trailer, presell, then deliver junk and ‘patch it later’” model that’s unfortunately popular these days.


These days, I see this as a subsidy for us older folks who wait a few years to buy a game. The clueless pre-order folks are funding my gaming experience!


The subreddit /r/patientgamers [0] is exactly about this philosophy. Games get simultaneously better and cheaper over time and all I need to do is wait? Sign me up.

[0] https://old.reddit.com/r/patientgamers/


[flagged]


Responsible adults waste their time with respectable hobbies like mindlessly watching TV and going to sports games! /s


> mindlessly watching TV

Also childish.

> going to sports games

More teenagy than childish, but still just as bad.


What do you do for entertainment in your spare time?


Real men die on the battlefield. Staying alive is for children.


> playing video games is for children

Geez, that's a pretty narrow-minded opinion


You mean like the old-as-hell adults playing tabletop games with poker or Spanish decks, or dominoes, in a tavern in any European village or town?

Colossal Cave back in the day was an adventure for children. Today it's a great game because the puzzles can be pretty hard even for adults.


Why don't you like fun?


There's always /r/stopgaming


Eh, it's just the agile development process coming to gaming. Deliver the minimum viable product, work out which parts of the game people care about most, and perfect those bits, rather than trying to guess what the market wants and spending the entire budget before the game actually gets real usage. If you just wait 6 months after release you end up with a better product than if the game had just come out 6 months later.


There's definitely a balance. Yes, they needed to test their games a lot more, but no amount of testing will equal millions of people trying all sorts of weird things you would've never thought about. That being said, games have been also getting a lot more complex in multiple ways. Lastly, PC games also have to deal with an exponential number of hardware variations that is very hard to test exhaustively in house.


At launch, yes. But revisions were quite commonly slipped into newer production runs in a general industry sense.


One thing mentioned in the video is his memory tricks require the RAM expansion module. Mario 64 being a launch title couldn't have used the RAM expansion. The RAM module came out two years after the console launched if I remember right.


The vast majority of what he does is a combination of benefitting from hindsight with 30 years of industry learned best practices, a vastly superior development environment (including compiler), and just having the time, with minimal pressure or demands, to dedicate to these problems. What he did was impressive, but something someone experienced with developing on these platforms could achieve reasonably well.


Just to be clear for anyone that doesn't watch it, he explicitly acknowledges most (if not all) that in his commentary. His overall tone is quite deferential to the original devs and the time, tool and technique constraints they were operating under.


I didn’t work on Mario 64 but I did work on other N64 games years ago. And yes I did have that reaction :)


Oh good sir can you please share your story?


there are often dev reaction videos on youtube as well!

in response to mods, as well as speedruns (that may/may not use mods). sometimes the commentary from them is interesting.


Do you have any favourite dev reaction videos?


Psychonauts' dev reaction is my favorite one because they sat down one-on-one with a speedrunner talking about every exploit used to get the fastest time possible.[1]

[1]:https://www.youtube.com/watch?v=lsDc1YVxHA0


The Cuphead devs commenting on speedrun tricks they didn't anticipate being possible.


Not the same game, but there is this for half-life https://youtu.be/sK_PdwL5Y8g


Or do they just sigh in exhaustion that, even so many years later, people are still demanding more out of them, despite them being underpaid and overworked already? Then take some heart medication and go back to work?


None of the games I play these days are CPU intensive so while I understand FPS is important to a large group of gamers I have never paid attention to it. Before clicking I thought "does frame rate really matter for Mario 64? How big of a difference can it make?" I'm absolutely blown away, I can't believe how smooth this looks. I don't know what else to say, I'm speechless. Stellar job. Vroom vroom.


High framerates are cool but often I wish more games had the option to cap framerate at 30/40/45 and use less battery/produce less heat instead of racing to 120fps.


I'm the opposite - I need the framerate to be as fast as possible. I can't even look at a 30fps game!


Ever played on a 144hz+ monitor?


pretty amazing what you can do when you push the release deadline back by 26 years.


Heh, but even then, 6x is still impressive. I doubt in 26 years you could optimize a triple-A title from today to the same extent. It seems that the vast majority of optimizations were actually achieved manually and not through compiler magic and other modern tools. But then again, maybe in 26 years we will have magic tools that will get the job done. ;)


> I doubt in 26 years you could optimize a triple-A title from today to the same extent.

Well... https://news.ycombinator.com/item?id=26296339


Wow Rockstar released his optimization. That's awesome.


I knew this would come up :-)

Doesn't improve overall game performance though.


Loading times were however the main reason I couldn't enjoy that game. Took like 15 minutes from starting it to joining some MP game with friends, then 5-10 minutes between each game looking at loading screens.


> I doubt in 26 years you could optimize a triple-A title from today to the same extent.

It helps that in this very particular case it was made much easier for him by basically enabling optimizations in gameplay code (it's fairly unlikely you'd get that one for free these days). But look at the difference in quality of launch titles for the PS4 vs what's being released today - that's 7-8 years worth of development. It's not unreasonable to expect that with another 2x that we could make similar gains. I've worked on a few AAA games and seen the codebases (and more importantly the issue trackers and profiling data). Today lots of these problems are "known" but it's a tradeoff - do you spend a person-week on fixing a frame drop in an awkward area or do you spend it on an incremental improvement across the board.


And have access to the ram pack that the original devs didn't have.


Wow, this is amazing! Plus he's adding split-screen co-op mode??! Mind blown. Impressive work, seriously.


Apparently the original game was meant to include it, but it was scrapped because it made the game run too slowly. Now that the game has been almost entirely rewritten for performance, this feature can be backported in :)


How would one learn about this depth of C and its optimizations? I'm interested in it as an art form, something to do in my spare time.


C is not that deep. In the end it's just something that produces assembly/machine code: jumps, reads, writes, etc. You need to get a feeling for what the compiler outputs for a given input (try Godbolt!). Then you need to know how those outputs will perform on your target hardware. This video is a good example: it states that modern compilers unroll loops because they usually perform better on modern systems, but not on the N64. So knowing your hardware is just as important.
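
As a concrete starting point, you can paste something as small as this into Godbolt (or run `gcc -S -O0` vs `gcc -S -O2` locally) and compare the assembly the compiler emits:

    /* Tiny function for comparing -O0 and -O2 output. */
    int sum(const int *a, int n) {
        int total = 0;
        for (int i = 0; i < n; i++)
            total += a[i];
        return total;
    }

At -O0 you'll typically see every variable spilled to the stack on each iteration; at -O2 the loop stays in registers (and may get unrolled or vectorised), which is the kind of difference the video is talking about.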


> C is not that deep.

It absolutely is. Look at its memory model, or look at any StackOverflow question or HackerNews discussion about edge-cases of undefined behaviour. Example: [0]. Alternatively, look at challenging interview questions that require a deep understanding of the C language.

C gives the appearance of being a simple language, but has many dark corners that many programmers are unaware of.

[0] https://news.ycombinator.com/item?id=22867059
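
A classic example of the kind of dark corner meant here (a sketch, not taken from the linked discussion): signed integer overflow is undefined behaviour, so the compiler is allowed to assume it never happens and optimise accordingly:

    #include <limits.h>

    /* Looks like a sensible overflow check, but signed overflow is
       undefined behaviour, so at -O2 a compiler may assume x + 1 never
       overflows and fold the whole test to "return 1". */
    int will_not_overflow(int x) {
        return x + 1 > x;
    }

    /* A well-defined way to write the same check. */
    int will_not_overflow_safe(int x) {
        return x < INT_MAX;
    }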


Very little of this is relevant for performance. It’s important to know what’s UB so the compiler doesn’t miscompile your code, but most performance improvements don’t come from “oh the compiler can run aliasing analysis on this better so it’s 10x faster” but “this loop is O(n^3)” or “I should convert this linked list to a flat array”.
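
For example (a toy sketch), the kind of change meant by "convert this linked list to a flat array" looks like this: the same sum, but the array version walks contiguous memory instead of chasing pointers all over the heap:

    #include <stddef.h>

    struct node { float value; struct node *next; };

    /* Pointer chasing: every element can be a cache miss. */
    float sum_list(const struct node *head) {
        float total = 0.0f;
        for (const struct node *n = head; n != NULL; n = n->next)
            total += n->value;
        return total;
    }

    /* Contiguous array: the hardware prefetcher can stream this. */
    float sum_array(const float *values, size_t count) {
        float total = 0.0f;
        for (size_t i = 0; i < count; i++)
            total += values[i];
        return total;
    }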


That's basically what I meant. But even taking all the ugly parts of C into consideration, they are shallow. You don't need to dig into 10 layers of abstraction to get to the bottom of things. And the example posted is about what others say about how to interpret the C standard. I don't care. At the end of the day you can look at the asm output and understand what's happening for the compiler you're using.


Familiarize yourself with the tools for measuring performance on whatever platform you want to optimize for. Then you can just start messing around and seeing what kinds of things make the number go down.


> Then you can just start messing around and seeing what kinds of things make the number go down.

And this will almost certainly require a deep understanding of the instruction set, and dropping into inline ASM, periodically.


Optimizing memory layout and reducing allocations are pretty ISA-independent. But yeah, at some point you will have to start looking at the assembly to optimize further (even if only to know when to tell the compiler to stop inlining a function that causes tons of register spilling inside a tight loop).
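
For the "stop inlining" part: on GCC/Clang the usual knob looks something like this (a sketch with made-up function names, not from any real codebase):

    /* Hypothetical cold helper: rarely called, but if inlined it bloats
       the hot loop and forces extra register spills there.
       __attribute__((noinline)) is the GCC/Clang spelling; MSVC uses
       __declspec(noinline). */
    __attribute__((noinline))
    static void handle_rare_case(int *value)
    {
        *value = 0; /* ...imagine a large, cold error-handling path here... */
    }

    static void hot_loop(int *data, int n)
    {
        for (int i = 0; i < n; i++) {
            if (data[i] < 0)              /* almost never taken */
                handle_rare_case(&data[i]);
            else
                data[i] *= 2;             /* hot path stays small */
        }
    }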


In most cases there’s a lot you can do by just peeking at the source code. Almost all programs ship without having been run through a profiler at all.


    apt-get install linux-tools
    perf record -g myapp
    perf report

Play with the output for a few weeks. "Why is this slow?" is a never-ending question leading down infinite corridors and weird specialist tools.

Cachegrind and VTune are also instructive to play with


Some of that doesn't have to do with C even. Like, changing memory layout of variables, or the way you access data.

A simple example is if you have a 2D array, and the data from the individual rows is laid out consecutively in memory, but you loop over the columns in your outer for-loop and over the rows in your inner for-loop. This means you access the first element of the first row, then the first element of the second row, then the first element of the third row, and so on. All these elements are far apart in memory, but every time you access one element, let's assume it's a uint32_t, the CPU fetches a whole cache line of e.g. 64 bytes and puts it in the CPU cache in anticipation that you will access data close to it in the near future. But you don't, so the CPU has to fetch another 64-byte block for the first element of the second row, uses only 4 bytes from that, and so on. If your 2D array is large enough, by the time you finish the first pass of the inner loop and start reading the second element of every row, the 64-byte cache line that was fetched when you read the first element of the first row was already evicted from the CPU cache somewhere around the time you read the first element of row 2000, so the same 64-byte block has to be fetched from RAM again. This makes a huge performance difference, and is applicable to pretty much every programming language.
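
In code, the two traversal orders described above look like this (assuming a row-major uint32_t array, as in the example):

    #include <stdint.h>
    #include <stddef.h>

    #define ROWS 2000
    #define COLS 2000

    /* Cache-friendly: walks memory in the order it is laid out,
       so each 64-byte line fetched is fully used before moving on. */
    uint64_t sum_row_major(const uint32_t a[ROWS][COLS]) {
        uint64_t total = 0;
        for (size_t r = 0; r < ROWS; r++)
            for (size_t c = 0; c < COLS; c++)
                total += a[r][c];
        return total;
    }

    /* Cache-hostile: jumps COLS * 4 bytes between consecutive reads,
       so each fetched cache line contributes only 4 useful bytes
       before it is likely evicted again. */
    uint64_t sum_col_major(const uint32_t a[ROWS][COLS]) {
        uint64_t total = 0;
        for (size_t c = 0; c < COLS; c++)
            for (size_t r = 0; r < ROWS; r++)
                total += a[r][c];
        return total;
    }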


A big part of it is learning computer architecture. If you have a solid grasp of things like how memory access works and how the CPU pipeline works, you can start to get a better mental picture of how fast a particular piece of code will run based on what instructions the CPU is executing and what the memory layout is.

There are textbooks and classes on computer architecture. Funny enough, many of them use MIPS, which is the architecture used in the N64.

Optimizing for N64 also requires understanding how the RCP works, which is a separate topic.


Old-skool book recommendation. Not necessarily C, more ASM, but still a good book on early-nineties optimization with a mentality that remains valid. I believe that people today would be more concerned about how fast their code runs if it were visible, that is, if they profiled it with good tools!

https://www.amazon.com/Inner-Loops-Sourcebook-Software-Devel...


Read Agner Fog's optimisation manuals.

https://www.agner.org/optimize/#manuals


Start with assembly


Machine code is only moderately less convenient.


It's absolutely amazing the level of expertise and dedication a single person can have on the inner workings of something.


Mario 64 has to be one of the most dissected games ever. It’s kind of wild what speed runners, modders, etc. have uncovered.


Naive question: how did we get access to the source? Is that from a disassembler, or did the source code leak at some point?


From what I understand he used Mario 64 decomp code.

The game was reverse engineered to be compilable 1:1 to the original ROM.

[1]:https://github.com/n64decomp/sm64

Fun fact: The people who did this moved on to Zelda 64 Ocarina of time and recently completed that project as well

[2]:https://zelda64.dev/

There are ongoing projects by other people for Mario Kart 64, Goldeneye, Perfect-Dark, Banjo-Kazooie and others although I think the next target for completion may probably be either Majora's Mask or The Minish Cap.


PD will definitely be the next game to be fully (or near enough) decompiled. The dev behind that one is great.

If you're bored, his write up on the "Challenge 7 Bug" is great read: https://gitlab.com/ryandwyer/perfect-dark/-/blob/master/docs...


Thank you. I was honestly thinking the Minish Cap would be done first (why, I don't know). It sits at 89% complete while PD is at ~74%, but I guess I'm not drilling into what is actually left in each project. I CAN'T wait to see what will be done with the PD source code!


A 30+ FPS Perfect Dark would be amazing. The RAM extension pack wasn't enough for that game. The AI was pretty good IIRC (although definitely went in loops). Had lots of fun playing 1 vs 11 medium difficulty. I miss throwing N-bomb grenades.


What is "SH" region mentioned in [1]?


https://github.com/n64decomp/sm64/blob/master/Makefile#L34

An updated version that was released later


IIRC from an earlier read about this, the code is from a disassembler.


The code is not from a disassembler. The developers of the sm64 disassembly project used a disassembler to look at the machine instruction output of the game’s functions, and then wrote their own implementation in C that results in the same output.


Why wouldn't Nintendo release the source, considering there's little commercial value in this particular game now?


Nintendo is still selling Mario 64 in various ways. You can buy it in "Super Mario 3D All Stars" for $60 on the Switch.


Nintendo, like many other Japanese companies, is very protective of its IP.


That assumes they even still have it. Not sure about Nintendo in particular, but many other games of the era are known to have lost source code.

Nintendo did (supposedly) use an emulator for the N64 games in Nintendo Online (for the Switch); so it's possible that they don't have the source (or maybe it had enough machine-specific code that doing a source port would be more effort).

Of course, if they can re-sell it as a subscription service they probably would want to avoid releasing the source anyway…


Virtual Console (or whatever Nintendo calls it now) releases mean that the game can be a moneymaker decades past the end of retail sales. You also see this with MS-DOS games, sometimes modders will voluntarily stop making their fan patches if an official rerelease makes it to Steam.


> I don’t fault them for making most of these mistakes. C was a rather new language at the time.

In the 90s? It was 20 years old by then and mostly ruling the world already.


Not on game consoles, and perhaps not even for PC games more generally (hedging my bets in case Walter Bright waltzes in and starts talking about how he wrote Empire in C in 1972 or something :P ). It wasn't until the mid-90s that C ate the game industry. Recall that C wasn't standardized until 1989, that C compilers weren't free or ubiquitous or good, that plenty of now-forgotten contemporaneous languages existed, and that making entire games in assembly was still quite popular and liable to get you better performance anyway.


Interesting. I didn’t know console’s market was so different. 90s PCs were dominated by pascal and C variants on the OS and app front, with a tiny bit of assembly sprinkled here and there.


I believe the Nintendo 64 is the first Nintendo console where using C started to become prevalent. I’m not sure about the Saturn or PlayStation. But I know the 16 bit consoles were usually assembly because they had such a limited resources.


I don’t know about the Saturn, but the PlayStation is the first mainline console that I know that had a C SDK and games were written in C on it.

The first Nintendo handheld to use C was the GameBoy Advance which was also the first Nintendo console that had an ARM processor.

It’s true that C only really took off in the 90s while the 70s and 80s are more of a proto-C era which still built the entire Unix system.


I know they don't "owe" anyone anything. But it is disappointing they've apparently declined to release their code changes

It's also apparently unclear if the overhauled version will run on a real N64 or if it only runs in an emulator


He is planning to release the source code at a later date. He is waiting because the source code is intertwined with a ROM Hack he made, and he is unable to release the improvements without releasing the ROM Hack in an unfinished state.

The overhauled version will run on a real N64 as long as it has the RAM expansion pack.


It will run on a real N64 but it needs the memory expansion. So that's one thing he mentioned but people do gloss over. It (the extra memory) helped a lot in reducing memory contention between the CPU and the GPU.


He has patrons backing this project on Patreon, so he may actually owe them something.


He owes them a credit at the end of the video, which they have received.


Well, then as they say "a sucker is born everyday"


I'm a patron of Kaze, and all I want is to keep seeing amazing N64 romhacks. Though having my name appear in the video credits is kinda nice too.


Great work, but a significant fraction of the improvement comes from installing and using the 4MB RAM expansion pak, which isn’t really a fair comparison.


Installing and optimizing the code to make use of the 4MB expansion pak*


My understanding, which may be incorrect, is that while one of the base optimizations he performed requires the second 4MB, most of the rest of the optimizations definitely do not.

So while I would not be surprised if he made some changes that assume more RAM is available in total beyond what was mentioned in the video, I believe most of the refinements covered could still be applied without.


Would be nice if it became the norm that the source code and assets for games were available as an open source purchase after X years


25 years ought to be enough, no?


Very interesting video, and it brought back a lot of nostalgia. At the time this came out I was blown away by it. I remember watching some kid play the Bob-omb level at the department store. Looking back as an adult I no longer feel the awe, but I have the greater satisfaction of learning about the RAM bus and low-level programming considerations.


I remember playing it at the store thinking we had reached the pinnacle of computer graphics, and that there was no way they could get better. Boy was I wrong! Still one of my favorite games of all time.


Does this mean that Nintendo could have theoretically shipped Mario 64 as a 60 Hz game on N64? I know some of this stuff requires the RAM expansion, but not all of it.


This is a serious amount of effort for what I imagine is little payout. I really enjoy work like this and hope that he is able to make a living off it.

Anyone have thoughts on how to make projects like this more sustainable?


Someone on HN once said, if you want to build a reputation and a brand, do interesting things and talk about it. This is the perfect example. No, this will not make him money because there is not a large market for superior Mario 64 performance, but you sure as hell know who he is now, don't you?

And I will say his youtube skills are top notch. The video is very entertaining and informative. I just cannot imagine this person will have trouble making money in the future.

EDIT: Added a missing word


This seems to clearly be a passion project. The author is being paid in the satisfaction that he gets in improving this famous game way beyond what the original creators managed to do, all while being bound by the same technical limitations. He also gets to show off his skills to a fairly large (in absolute numbers at least) global audience of people who are also really into this particular game. I doubt you would see someone going to such great lengths for an obscure title that few other people would appreciate or care about.


People often do things for their own merits and not just in exchange for money.

But the guy has a Patreon and this video already has a half a million views which should be at least $1000 in revenue, possibly 2-3x and it's still quite new.


You can give him money using Patreon: https://www.patreon.com/Kazestuff


He does have a Patreon so it's not entirely volunteer work.


Art doesn't need to have a payout or be sustainable.


Submitted this a few days ago but it did not get any points. I have no idea how HN works.


Time of day and day of the week can impact such things. Lastly, that's just the fickleness of social media sometimes. Sometimes a thing will take off and sometimes it won't.

A recent example is Among Us, which was out for over a year as almost a 'failure' with very few players. It received no major updates and went by unnoticed until the right people pushed for it, and now it's seeped into so many other things. e.g. Among Us player count: https://i.imgur.com/M1j0UOw.png


is there any part of this video that is a side by side comparison? didn't see any obvious marks in the description or comments


At some point, these older games should be open source. They can follow the doom/quake model where textures/art aren't included, and Nintendo can also still sell the games on newer consoles for convenience because not everyone will want to compile it themselves

I guess you could argue a mario 64 custom levels mod competes directly with a new mario game. But halo and cod sell fine. Maybe I have a bias that thinks Nintendo is fearful that custom mods for old games will be more popular than anything new and official


Copyright terms should be limited to 20 years. Open source or not, this project would be legal in that case.


I think about my fundamentalist feelings about copyright (somewhere between complete abolition and, say, the 20 years you suggest) and the practical reality of how that would fundamentally change content industries today. For example, I'm thinking of anime franchises, also Disney, etc., who essentially create whole industries out of remakes and repackaging of their old IP. Essentially, millions of workers hired for the less creative recreation/maintenance of these IPs would be reshuffled and/or lost. Btw, this isn't just creative workers like writers, animators, and the like, but also things like merchandise creation, manufacturing, live events, and so on; there'd be much less of an incentive to maintain a franchise over the mere creation of works that might be separate from each other.

Apart from the business/worker side, the ability of an IP holder to release continuing sequels to works (like manga, movie installments, etc.) and the legal force against fan works essentially gives the original creator more power to shape what is "canon" in their work. Perhaps fans could still decide the original creator's later installments are canon, but the extra legal force sort of makes this de facto, while in a world of limited copyright the power of the original creator is less so. Thus, the way fans consume and experience media will change completely as well.

It would just be a completely different world, one where a lot of these ills (like the ones mentioned by the GP) are addressed. It's hard to say from my vantage point whether the differences would bring their own "ills" (and that might be my bias), but it would certainly be a different world with different rules and different expectations from fans and audiences.


I actually think we would see Disney remaking some of its old IP (or at least the popular properties) more often if copyright were only 20 years.

If they released remakes at the 20-year mark, anyone trying to take advantage of the now-expired copyrights would have to compete with the new "official" versions of the IP.


I'm not sure copyright protects you much against someone continuing a story. Trademark protection might.

At any rate, fan fiction is nothing new, and while it might be a bit iffy to do commercially, I see nobody even trying anything remotely similar without at least the blessing of the original author.

So really I think society has decided who is allowed to continue a story and as far as I can tell it has little to do with copyright and everything to do with public perception.


> Essentially, millions of workers hired to maintain the less creative recreation/maintenance of these IPs would be reshuffled and/or lost.

The other side of that coin is millions of workers now available for new and original projects that have to take real chances and break new ground.

> Thus, the way fans consume media and experience it will change completely as well.

Should we prevent creators from selling their copyrights as well, or do we simply make the "canonical" rights non transferable in that case?

> but it would certainly be a different world with different rules and different expectations from fans and audiences.

Vaudeville and burlesque operated under those rules. Audiences didn't seem to mind. "Who's on first" was not an original act.


You are confusing copyright with trademarks. You can't release another official Harry Potter movie just because it would be legal to copy the old ones.


My controversial opinion is that anything that cannot be bought for a reasonable amount of money also cannot be "pirated" (the act of copying it should not be considered illegal piracy).

The main reason piracy is illegal is that studios/producers/writers lose money because people pirate instead of buying. If I cannot buy, e.g., Super Mario 64 from Nintendo in my country, and I copy the game (and emulate it or whatever), I never actually caused any monetary loss for Nintendo by doing that.


You can still buy Super Mario 64. It’s part of the paid subscription virtual console service on Nintendo Switch


So, no copying Super Mario 64 then.

But my opinion stands for things that are not available this way.


For a long time Super Mario 64 was not available this way, but now it is. How do you handle that case? It's a hell of a lot easier to just have a fixed duration after which copyright expires.


If someone sells nude photographs of themselves to a rich person for $1 million, does that mean you deserve to have them for free?


Under current law, you absolutely can get them for free after the copyright expires (if copies are still around). And you can make as many copies as you want.


If you already saw them... and had a chance to photocopy them... why not? How much money did the artist/model lose by you doing that (assuming a normal person would never pay $1 million for that)?


Nintendo games have much higher replay value than the others you mentioned. They are mostly offline, they have distinct mechanics beyond graphics, and people will happily buy them on multiple consoles. They are not rotting; these are active revenue streams.

Nintendo isn't afraid of mods; they would rather just sell you Mario Maker and create the experience themselves.

Quake and CoD are mostly online, and there's no community anymore. There are over 30 CoD games, so there's no scarcity of the experience since they're so similar. The one mostly offline game series you mentioned, Halo, is closed source and still sold today. Quake 3 is far less likely to be bought and played today than Mario 64. Those games are not active revenue streams and don't have a good path to become one again.


I'm not sure what Nintendo "thinks", but they've been protective ever since the NES days, which arguably allowed them to pull the video game industry out of the funk that befell the other 8-bit consoles. I'm not sure there's any philosophy behind it; it just seems to be corporate culture to be controlling of their consoles and their IP. A while back, I believe a game with an interpreter in it was removed from the Switch's store, even though sanctioned BASIC interpreters exist there. They've always been control freaks, to the point where it doesn't even make much dollar sense.


Btw, that video game crash was limited to the US - it didn't happen in Japan or European markets. I remember complaints that other language Wikipedias covered it as a global thing anyway because they were translated from English.


This reminds me, a little bit, of this fact about Encarta: https://www.tampabay.com/archive/1999/07/19/encarta-differen... (first link I found)


Interesting piece of history! I am too young to have experienced it; I mostly know about it from reading video game history, and yes, the only makers cited were Atari, Commodore, and other American companies.


In my opinion, by far the biggest mistake with copyright is that it depends on the lifespan of the author; it should depend on the release date.

The whole "70 years after death" nonsense is just an ugly hack to compensate for the fact that a purely lifetime-based copyright would last fewer years if you are older. It doesn't even solve that problem: if you are younger you still get more copyright years.


Only under an economic system that abolishes private property. Won't ever happen under capitalist governments, unfortunately.


Intellectual property is imaginary property, private or not. It's not a real thing until committed to a medium.

Upkeep of the fiction that intellectual property is real property has societal benefits for a limited time insofar as it encourages works of art - and the original idea of copyright was reasonably limited. Over 100 years is inhumane.


One might argue that enforcement of traditional property rights is inhumane in many cases too.


I certainly would :)


No, it has benefits only for the private property owner, never for society at large. All private property is this way. Capitalism is a system that deifies private property, so it will only ever get more and more 'private'.


If capitalism is the nail, private property is the hammer. Quite a lot of development that the United States wanted to see happen in the west (which enriched it immensely) was done because it was willing to make grants of private property to the companies (largely railroads) doing the development.

You can debate whether capitalism is (or should have been) the only method for getting things done, but the use of private property as a viable incentive for driving innovation within that system is not really debatable.


This is pretty much the take you get right out of an American high school: full of propaganda. It is extremely simplified, ignores massive amounts of genocide and slavery, and ignores the historical events it could be compared to. The development of America as a nation was built on the corpses of hundreds of millions of Africans, Chinese, and Native Americans, who were notably stripped of their 'right' to private property so that someone else's 'right' to private property could exist.

You do not in any way need to enslave people in order to develop land; only to develop land AND enrich a bourgeois class of people.


I never said any of that. I think you need to take a step back and re-read my comment. I'm not endorsing capitalism; I'm merely saying that so long as capitalism is the one way you can think of to get things done, giving people property in exchange for development is almost always the way that development happens.


This somehow didn't work for ISPs at all.


Wow, so, ok. Former N64 programmer here, so my take is certainly biased but here goes.

This guy has clearly done a lot of work, building on top of the existing decompilation of the Mario 64 codebase, and has a great understanding of the heavily dissected 25-year-old hardware. And that's where my praise stops.

The level of snark here is outstanding. Maybe it's a cultural thing, maybe he's trying to hype up his Patreon, but he comes across to me as a complete egotistical a-hole. Does he really think the Nintendo programmers were 'smelling the grass' rather than working 24/7 to get a ground-breaking game shipped? He does his best to call the team that made several of the most influential and important video games of all time incompetent.

Breaking down his snark:

- The RSP/RDP co-processor had microcode that the game programmers interacted with (you generated a rendering command list on the CPU and called a library function to start the RSP processing that scene). So any criticisms of things the Mario team could have moved to the RSP (i.e. billboard calculations) are shortcomings of the microcode made available by the NCL driver/hardware team - post-launch there were regular revisions to the microcode and optimizations made available.

- A lot of the matrix operations were provided in library form to game teams, and you were encouraged (forced) to use them rather than rolling your own - there was probably a concern that an overly aggressive optimization would send bad data to the RSP/RDP and cause issues on current or future hardware revisions. The divide-by-zero checks were likely there for similar reasons (a rough sketch of the kind of guard I mean is at the end of this comment) - the RSP microcode could run in a mode that skipped near-plane clipping of triangles, which was faster but would blow up spectacularly (huge jaggie triangles all over the screen) if you had triangles that did indeed clip into the near plane.

- I have not seen the Mario 64 source, but I imagine some of the lowest-level code was written in MIPS assembly and has been decompiled into C. If that is the case, there were likely branch-delay-slot optimizations and cycle-counted operations that have been trampled by the reverse engineering. I may be off base here, so I would love to know if that is the case.

- Yes, the gameplay code has unused variables, duplicated code, etc. But this was a brand-new genre of game; there would have been a lot of experimentation and tuning, things would have been in flux until shipping, and there certainly was no time to refactor and run the risk of breaking something.

- I suspect the debug flags in the gameplay code were to avoid compiler bugs. The MIPS compiler would sometimes emit back-to-back floating-point operations that caused the N64's MIPS CPU to hang (but were fine on all other MIPS processors). That may have affected the gameplay code with optimizations enabled. Later there was a command-line tool you had to run on your game ELF that looked for this bad combination of instructions (and you would then change the offending code until the compiler no longer emitted the 'bad' instruction sequence). GCC was not allowed by Nintendo until much later, and had its own issues.

- Older floating-point units were less predictable with respect to numerical accuracy, rounding errors, etc. If early testing showed differences in game control/feel between the debug and optimized builds, then I can see why the team would be more comfortable shipping the debug build rather than risking a bug that was not picked up until after a million or two cartridges had been manufactured.

- The memory bank conflicts were eventually well documented and games worked with/around that knowledge, but Mario 64 was likely well into development before that information was known. As others have pointed out, using the memory expansion module is cheating: it gives you a whole bunch of un-conflicted memory banks to use, and 2x the memory also opens up a lot of optimization opportunities.

- I don't see a 6x speedup. I see ~10% gains in most systems and a best-case frame-time improvement from 29 fps to 41 fps. Which is pointless. An NTSC TV runs at 60Hz: you either hit 60 fps or 30 fps; anything in between results in horrible screen tearing, or your frames bounce between 60 and 30 and movement feels horrific. The game was optimized to hit its targets and to ship.

- His split-screen optimization is crap! Reducing load on the RDP (rasterizer) by killing rendering area (big black bars on the screen) is lazier than most of what he snarked about. The slightly better approach would be to abuse the TV output timing registers so that the screen buffer can be smaller than 320x240 but still fill the screen. You trade black bars for smeary sludge, but on the target NTSC TV that was generally the way to go if you didn't want your game rejected by Nintendo QA. Or actually rewrite the renderer and reduce the amount of pixel fill.

- And a minor nitpick: the sound processing was not done on the CPU (as he claimed). The CPU built audio command lists that the RCP then used to generate the output waveforms.
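
To illustrate what I mean by a defensive divide-by-zero guard, here's a rough sketch (hypothetical code, not the actual library source, so the function name and fallback value are made up):

    #include <math.h>

    /* Hypothetical guard, not real libultra code: bail out on a zero-length
       vector instead of sending NaN/Inf vertex data on to the RSP/RDP. */
    static void vec3_normalize_guarded(float v[3]) {
        float mag = sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
        if (mag == 0.0f) {
            v[0] = 0.0f;
            v[1] = 0.0f;
            v[2] = 1.0f;   /* arbitrary safe default direction */
            return;
        }
        v[0] /= mag;
        v[1] /= mag;
        v[2] /= mag;
    }

The check costs a branch per call, which is exactly the kind of thing the video optimizes away, but it keeps garbage data from ever reaching the microcode.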


> The level of snark here is outstanding. Maybe it's a cultural thing, maybe he's trying to hype up his Patreon, but he comes across to me as a complete egotistical a-hole. Does he really think the Nintendo programmers were 'smelling the grass' rather than working 24/7 to get a ground-breaking game shipped? He does his best to call the team that made several of the most influential and important video games of all time incompetent.

I just rewatched this video and didn't get any of that, like, at all. There's a little banter to be entertaining, but I think you're reading too much into it.

>- I don't see a 6x speedup. I see ~10% gains in most systems and a best-case frame-time improvement from 29 fps to 41 fps. Which is pointless. An NTSC TV runs at 60Hz: you either hit 60 fps or 30 fps; anything in between results in horrible screen tearing, or your frames bounce between 60 and 30 and movement feels horrific. The game was optimized to hit its targets and to ship.

The idea is to get more headroom for his mods.


>Which is pointless.

Nope. You can target 30 FPS and use the spare cycles to do much more.


I read his snark as being self-deprecating. Rather than insulting the Mario devs (who, as you note, were building a ground-breaking game on a deadline), he's making a joke about how obsessive someone would have to be to dig deep into a 26-year-old game's codebase and hardware internals in order to get acceptable performance on a mod that does things to the game engine that the devs had no intention of implementing.

Thanks for providing more context for what's going on - all of this stuff is neato.


I guess I got the snark from things like this... "I don't fault them for making MOST of the mistakes, C was a rather new language at the time, they didn't know the hardware they were working with and EVEN programmers in 2022 would make many of these MISTAKES. Besides you either get to know about these things or spend time learning about these things or you get to go out and touch grass it's an either or, you can't do both".

Emphasis mine.

He regularly trashes the quality of other people's code in his twitter https://twitter.com/KazeEmanuar/status/1343569532976304131 despite building his following off the years of work of those same people.


Yeah, he definitely comes off as an ass in that thread of Tweets.


"The level of snark here is outstanding."

I didn't interpret it as snark but as attempts to be funny - which for me were just irritating. I can't imagine how much time it took to put in all those "funny" images but for me it was wasted time. I could have done with 20-30 fewer "vroom vroom"'s too.


> The level of snark here is outstanding. [...] 'smelling the grass'

I thought his snark wasn't too off the charts in this video. I'd bet it's more a cultural thing. If you were working on the N64, you're probably like two decades older than Kaze's target audience for his YouTube videos, haha. Like, he said that you can either be a weirdo who spends all day optimizing nearly-30-year-old source code, or someone who "touches grass"[1]. He was, of course, referencing his own lack of a life, not Nintendo's programmers. But you'd have to be pretty familiar with current online memes and insults to get his joke.

> I have not seen the Mario 64 source but I imagine some of the lowest level code was written in mips assembly but has been decompiled into C.

No, we only decompiled C code back into C code. The decompilation matches 1:1 with the released ROM images.[2] IIRC, the only non-libultra ASM code was the decompressor.[3] The C code was pretty obvious due to the "wonders" of IDO -O0 optimization. We were forced to get surprisingly close to what Nintendo wrote. We were stuck with an `f32 [3]` 3-vec instead of a `struct`...
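
(Roughly this kind of thing; I'm sketching from memory rather than quoting the repo, so treat the names as illustrative:

    typedef float f32;
    typedef f32 Vec3f[3];               /* a bare f32[3] "3-vec", no struct */

    void vec3f_copy(Vec3f dest, Vec3f src) {
        dest[0] = src[0];
        dest[1] = src[1];
        dest[2] = src[2];
    }

Matching the original codegen meant keeping plain arrays like this rather than tidying them up into structs.)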

> I suspect the debug flags in gameplay code were to avoid compiler bugs.

As far as we could tell, this wasn't an issue. IDO 5.3 had no issues that we could find when compiling at -O2, but then again it was decompiled code. I personally subscribe to the '-O2 -g' IDO meme, but Giles Goddard thought it was out of fear of the unknown. So, I'd trust him.

We had the `mulmul` patch for IDO. I don't know the chronology for the release of that patch to devs, though. We also haven't been able to find any N64s that have the mulmul issue.

> An NTSC tv runs at 60Hz, you either hit 60fps or 30fps, anything in-between results in horrible tearing of the screen or your frames bounce between 60 and 30 and movement feels horrific.

I wonder what Kaze is doing. I know that there was work on a variable framerate mod/revamp for SM64. I'd guess it'd be outputting at the N64's 60 fps mode with duplicated frames, but you'd have much better luck asking one of the modders/coders who were working on that a couple years ago.

----------

[1] which is Twitter's insult du jour: https://knowyourmeme.com/memes/touch-grass

[2] https://github.com/n64decomp/sm64

[3] https://github.com/n64decomp/sm64/blob/master/asm/decompress...


One tidbit that may be of interest (or not): if you see doubles being used in N64 C code, they are almost certainly accidental. Doubles were slow on the N64's MIPS CPU, but the MIPS C compiler would happily emit them and promote the surrounding floats to keep the precision it assumed you wanted.

    float myValue = x_float + y_float * y_float * 3.0;

Looks ok, except the constant is a double. Performance killer.
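
For comparison, the fix is a single character; the trailing f keeps the constant (and therefore the whole expression) in single precision:

    float myValue = x_float + y_float * y_float * 3.0f;  /* float constant, no double promotion */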

You'd get pretty good at seeing the missing f (and we eventually wrote some simple tooling to find functions using doubles) but I'm sure a number of games shipped with accidental double calculations.


hahah yes the "0.0f" is different from "0.f" is different from "0.0" is different from "0" is a great IDO meme as well. One of the changes we had to make when we started to support the PAL release, which was compiled at -O2, was to go back and change some constants from explicit "1.0f" to "1" to get the code-gen to match. IDO -O0 didn't care so long as you got the f32 vs f64 constant correct. But, IDO -O2 has different codegen.

The wonders of C constants. Sometimes IDO can do the conversion at compile time, and sometimes it can't. But the compiler will treat "0.0f" and a compile-time "(f32)0.0" as different constants for codegen purposes.
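
A contrived example of the kind of constant juggling involved (assuming the usual f32 typedef; the exact codegen differences depend on the IDO version and -O level, so treat this as a sketch):

    typedef float f32;     /* the usual SDK-style typedef */

    f32 a = 1.0f;          /* explicit single-precision constant */
    f32 b = 1;             /* integer constant converted at compile time */
    f32 c = (f32)0.0;      /* double constant cast down; IDO may track this as a separate constant */

All three end up as floats, but IDO at -O2 doesn't necessarily emit the same code for them, which is why we sometimes had to rewrite "1.0f" as "1" just to match the PAL ROM.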

We have a kinda outdated document of weird IDO behavior which is worth a read if you want to trigger some PTSD: https://hackmd.io/vPmcgdaFSlq4R2mfkq4bJg#Compiler-behavior

Also, since SGI/MIPS/IDO came out of Stanford (I think...?), there is actually a surprising number of published papers on IDO's compilation model. Which comes in handy if you are trying for the ultimate joke of decompiling the written-in-Pascal compiler into C: https://github.com/n64decomp/ido


Thanks for filling in the details on the decompilation. I'm surprised the matrix functions weren't in asm, but I believe you. If I remember correctly, NOA did the CPU/library side and NCL did the RSP code (definitely in asm) and the hardware side.

The back-to-back multiplies definitely could cause hangs on real hardware (actually, I'll back up and say they could certainly hang a dev board, and I think the hardware was the same as production). I don't know if particular registers or data caused the hang, but I remember QA picked up hangs with builds that had them in there.

Adding extra frame buffers lets you output at 60Hz without tearing, but you'll bounce between duplicated frames and single frames. You technically get 40 fps (or whatever), but smooth camera movements feel 'off'. That extra frame buffer eats pretty badly into the remaining space in the 4MB base memory (at 640x480 you could have just one frame buffer and render interlaced, so long as your frame rate always hit 60Hz).
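
For a rough sense of the numbers (assuming 16-bit color):

    320 x 240 x 2 bytes = ~150 KB per framebuffer
    640 x 480 x 2 bytes = ~600 KB for a single interlaced hi-res buffer

So double- or triple-buffering at 320x240 already ties up 300-450 KB before you've loaded any code, geometry, or audio.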


If you think this video was snarky, I also watched the older one he references and I think it was worse (despite the initial caveats): https://youtu.be/uYPH-NH3B6k


Small thing, but I hate how much of the modding community (and retro gaming) has turned to videos and YouTube. I just don't have time to watch a 20-minute video on these things. I'm often searching high and low for basic documentation and descriptions, and Google always recommends some 20-minute video; it's infuriating.


- Speeding up the video saves you time

- Listen to the audio only and occasionally take a peek; don't tell me you are being productive 16 hours a day

- Video games are visual, so it's natural to present them with video; you can hardly blame the authors for that


He's not presenting a video game -- he's presenting the technical details of speeding up a video game.

90% of the video is just stock footage in the background while he talks about the technical details. A better approach would be a written article supported by a video demonstrating the speed up.


The good thing is that a platform like YouTube can support creators like him spending so much time on things that perhaps wouldn't otherwise be financially feasible, even if his contributions to the modding community aren't as easily digestible.


At least for Super Mario 64, Ukikipedia is a good text-based resource: https://ukikipedia.net/wiki/Main_Page


Just a note: these wikis are great. nesdev really saved me, for example. I am thankful for people who put things like this out there.


Just the modding and retro communities? YouTube has done that to practically every other community, as was to be expected whenever you suddenly introduce the dangling carrot of money.

I really hate the type of guy who is lurking around enthusiast forums looking for topics to "cover" on his next YouTube stream. Yes, these guys exist and yes they do that.

It's not really an improvement in quality, but it is a degradation: more wasted time and less cooperation, since communication becomes mostly one-sided again.


>I really hate the type of guy who is lurking around enthusiast forums looking for topics to "cover" on his next YouTube stream. Yes, these guys exist.

Actually, that sounds like a decent idea. You lose nothing by not watching the video and focusing on the forums instead, and others who would not otherwise spend time on the forums can be made aware of exciting developments.


You get infuriated that talented modders and hackers perform research and entertain you for free?


Given how much competition there is in "the scene" for clout alone, there is no need to dismiss any criticism with "but it's free". Consumers/end users/spectators can absolutely complain and don't have to put up with anything, and the production value increases as a result.


Idk, for content that doesn't teach me anything important I prefer video over text, as I can play it on the side while doing something else (that doesn't require too much focusing) and if something sounds interesting/important I might pay attention for a while.


This is a generic complaint about any information nowadays. I suppose you could read the transcript of the YouTube video! That's available.


Yeah, being to the point is crucial in these days of abundant content. Sometimes I wish they reinstated the 10-minute limit.


What I do is skip relentlessly through videos (using left/right keys on PC or tap-tap on mobile) and adjust the playback speed to be as fast as can be intelligible, I can get the gist of a 20 minute video in 2-3 minutes this way. Then, if something catches my attention, I go back and watch that at normal speed.


I agree. So often I'm looking for how to do something in Photoshop, and the video is 10 minutes or more when the answer is just "use this tool".



