Finding the best sine function for Nintendo 64 [video] (youtube.com)
184 points by codetrotter on June 27, 2023 | 46 comments



It's almost criminal that Kaze isn't really welcome at most tech conferences because he plays with fire where Nintendo's IP is concerned.

A huge audience is missing out on learning to do these things themselves, so my plug is that we're running an indie conference [0] with Kaze as the featured speaker.

We should follow in his footsteps.

[0] https://handmadecities.com/boston


I find him a bit abrasive even though I enjoy his videos (and actually submitted this a couple of days ago).

It's a lot of cool work, but the way he presents it to laymen is kind of annoying. Even if he had never heard of data-oriented design in the original video series (where he claims not to know what it's called when you organise data to improve throughput), he should have by now, because he clearly does huge amounts of research. (Case in point: at 9:30 [0] he talks about localised memory patterns as though he's the first to come up with it.)

Again, I love his work, watch every video, and even submit them to HN because I think others would enjoy them. But that doesn't mean he's not very annoying.

[0] https://youtu.be/xFKFoGiGlXQ?t=570


> Case in point at 9:30[0] he talks about localised memory patterns as though he's the first to come up with it

This seems like an uncharitable interpretation to me. When he says "I'm not sure if anyone else has ever used this approach," it's pretty clear from context that the "approach" he's referring to is interleaving the sine and cosine tables to improve cache usage. And he doesn't even claim to be the first to come up with it, just to have independently discovered it.
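
For illustration, a minimal sketch in C of what an interleaved table might look like (the names and table size here are made up, not taken from Kaze's code):

  #include <math.h>

  #define TABLE_SIZE 1024 /* hypothetical resolution */

  /* Entries are stored as [sin0, cos0, sin1, cos1, ...], so fetching
     both values for one angle touches adjacent memory, and a single
     cache line can serve both lookups. */
  static float sincos_table[TABLE_SIZE * 2];

  void build_sincos_table(void) {
      for (int i = 0; i < TABLE_SIZE; i++) {
          double a = 2.0 * 3.14159265358979323846 * i / TABLE_SIZE;
          sincos_table[2 * i + 0] = (float)sin(a);
          sincos_table[2 * i + 1] = (float)cos(a);
      }
  }

  void lookup_sincos(int i, float *s, float *c) {
      *s = sincos_table[2 * i + 0];
      *c = sincos_table[2 * i + 1];
  }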


I'm just telling you why, even as a fan, I can see how he rubs people the wrong way.

If you look at previous HN threads on his content many have similar complaints.

Everyone loves the content and the premise of revisiting SM64.

Many don't like the presentation.

If anything, I think I'm being charitable in assuming the presentation is optimized for the lay public.


Now that SM64 has been fully reverse-engineered and can be compiled back to a bit-perfect ROM, I wonder if there would be any merit in replacing the default lookup table with Kaze's new function(s):

- 200 microseconds saved

- 33333 microseconds per frame -> 0.6% of time saved

- 1KB of RAM saved

- Looked really cool doing it (Kaze's words & I agree)


Kaze has already done so, and picking the low-hanging fruit results in much greater improvements than the optimizations shown here (which free up less than one millisecond per frame). Here's the previous video with explanations and a video demonstration of the improvements: https://www.youtube.com/watch?v=t_rzYnXEQlE


Kaze has essentially been doing that for his own ROM hack: not just this improvement but a bunch of others that have enabled the game to run at a consistent 60fps.

I believe he's mainly working on the ROM hack at the moment but has talked about releasing a patch to backport all the 100% SM64-compatible fixes to the main game.


The question here is whether the new, optimized SM64 implementation is CPU-bound in the first place. Games on the N64 are often limited by memory bandwidth, which is taken up by rasterization, and SM64 was something of a special case—compiled with optimizations turned off due to GCC bugs. After recompiling with optimizations enabled, and maybe after swapping in newer versions of the RSP microcode, my guess is that further improvements to CPU efficiency would have very limited returns.


> Games on the N64 are often limited by memory bandwidth, which is taken up by rasterization

So a lot of that is overstated, IMO.

The N64 was in most cases the first system with a modern memory hierarchy that game developers had come across. From my experiments, RDRAM is at least dozens of cycles away from the CPU, so it's pretty easy to be memory bound without coming close to saturating the little more than ~200MB/sec of memory bandwidth the system is capable of sustaining. Remember too the context of the mid '90s, where the previous Nintendo console had single-cycle access to main memory [0]. So you'd see crazy stuff like loops unrolled into 16KB of straight code with no branches, despite the CPU only having a 16KB instruction cache, guaranteeing that you just flushed everything else.

Similarly, the GPU seemed really hampered by its small FIFO at basically the ROP stage, which meant the read-modify-write of both the color and Z buffers caused a lot of pipeline stalls. I think a lot of the benefit of switching to Z-sorting late in the system's life had less to do with overall memory bandwidth and more to do with the fact that you don't have to wait on memory just to then blit out the pixel. There's no Z to check against, so you know that as soon as the pixel hits the ROP stage, the GPU can just write it to memory (assuming alpha isn't involved).

And that's not to shit on the devs in question; it wasn't until probably the second half of the PS2 era that the industry as a whole really internalized how to approach true long-tail memory hierarchies. I certainly hadn't, despite being able to regurgitate the textbook definitions involved.

I guess what I'm saying is: hey, demoscene coders, there's a bunch of untapped power in this system with weird hangups. :)

[0] Yes, the memory system of the SNES is hard to talk about in broad strokes like this with FastROM/SlowROM, wait states on cart mem accesses, etc, but 'single cycle' is the right order of magnitude for this discussion.


Maybe that's right, and memory bandwidth is not the right explanation. But rasterization itself is still often a big bottleneck, and maybe chalking it up to fill-rate limitations explains more. This is based on my limited observations writing my own N64 code and trying different scenes that put different amounts of load on the RDP; varying the RDP load was by far the easiest way to get framerate drops. If the CPU load is relatively constant, and the RSP load is relatively constant, then explaining it as running into fill-rate limitations seems like a good theory to me.

I’m a little skeptical that there’s “huge” untapped power in this system, given the performance of some later games like World Driver Championship (1999). There’s certainly a lot of processing power across the CPU, RSP, and RDP, but given how heterogeneous the system is, and how many weird hangups there are, I have doubts that we’re going to see something much better come out of the demoscene community. You need a lot of appetite for a long-term project in order to make something impressive on the N64, and while we have better emulators and compilers now, it’s hard to compete against someone in the 1990s who got to spend multiple years on the system full-time, with the support of a team and from the console developers.

There are some tricks I can imagine using, like spending more time with the RDP in single-cycle mode, or rendering just the fields to get 480i at the cost of 240p, but there are just so many thorny problems to deal with.

This is speaking as someone who participates both in the demoscene (I was just in Boston for @party), and N64 homebrew (you can find me on the Discord).


In one of Kaze's older videos [1], he tracks both CPU and RCP time across various optimisations. The numbers make it pretty apparent that you can get significant RDP performance gains by reducing the CPU's memory traffic and/or improving its access locality.

The N64 is a unified memory system, and memory stalls triggered by the CPU will slow the RDP down.

The N64's memory controller appears to be very simple. As I understand it, if one sub-component attempts to access a DRAM row that is closed (DRAM rows are 2KB long), the entire memory controller stalls for ~100ns while the RDRAM chip closes the previous row (writing out any dirty changes) and opens the new one.

The controller doesn't appear to do any reordering to optimise access patterns. If multiple components are accessing the same 1MB bank simultaneously, you can hit pathologically bad cases where the entire system slows down, as the memory controller continually stalls for 100ns at a time while the memory chips close and reopen rows.

Which is why some games can enable a high-resolution mode when the memory expansion pack is present. They often don't need the extra 4MB of RAM; simply being able to strategically spread their data across eight banks instead of four can massively improve performance.

[1] https://www.youtube.com/watch?v=t_rzYnXEQlE


The bank organization is relatively well known, so maybe you put the framebuffer in one bank and the Z-buffer in another, and you still have two banks left over without touching expansion memory. I mentioned World Driver Championship specifically because it runs at 640x480, and runs well, without the expansion pack.
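
As a purely illustrative sketch of that placement (addresses and names made up, assuming the 1MB banks mentioned upthread): choosing base addresses one bank apart is all it takes to keep the two buffers from contending for the same rows.

  #include <stdint.h>

  /* Illustrative only: put the framebuffer and Z-buffer in different
     1MB RDRAM banks so accesses to one never force row closes in the
     other. These addresses are invented for the example. */
  #define BANK_SIZE 0x100000u /* 1MB */

  static uint16_t *const framebuffer = (uint16_t *)(0x80000000u + 1u * BANK_SIZE);
  static uint16_t *const zbuffer     = (uint16_t *)(0x80000000u + 2u * BANK_SIZE);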

Yes, memory access by the CPU will slow the RDP down. But the RDP is plenty slow even when the CPU is idle. The reason that we are seeing such improvements with SM64 is because SM64 was in such bad shape to begin with—something to be expected, given the novelty of 3D hardware in 1996 and problems with compiler bugs.


Kaze explained once that a big part of the performance issues was related to fill rate, which I guess is part of the memory bandwidth issue. Essentially, polygon counts aren't a huge problem provided the polygons don't use a large amount of screen space. While the Z-buffer helps with larger polygons, it isn't a silver bullet.

That said, I don't doubt SM64 was in bad shape; this was really Nintendo's first attempt at doing 3D at that scale, and almost everything was experimental. Even so, it is amazing to see just how well a lot of it works. But seeing some of the later stuff released on the N64, the system did seem like a really tapped-out resource. Mind you, looking at Portal 64 coming along, it is neat to see that folks are still trying to push it just a little bit more.

Back to SM64, one thing I still find really cool is that when you get Mario on a rotating platform, his position and facing angle are both updated correctly as the platform rotates. That is just slick to see on something that old.


Your example of World Driver Championship is important; it achieved that with a custom RSP microcode implementation. The microcode implementations Nintendo distributed were pretty mediocre, either very slow but accurate, or fast but too inaccurate, even for 3D games. Some of the most impressive games on the console were only achieved by implementing custom microcode, which was excessively difficult because Nintendo refused to distribute resources to help do so, despite that being an intended use of the console's hardware.

Another huge problem in rasterization was that pretty much all texture data had to sit in a 4KB texture cache to be rasterized, and unless you micromanaged that cache really effectively (and it was cut in half if you wanted certain RDP features!), it constantly held the wrong data, which slowed everything down. With just a bit more texture cache, the console likely would have come much closer to its claimed peak performance in normal usage.


Isn't World Driver Championship fast because it doesn't use a Z-buffer in the first place, freeing half of the memory bandwidth normally used by the GPU? AFAIR their custom microcode sorted triangles back to front and just hoped you wouldn't notice the glitches.


One of the big optimizations Kaze makes is to remove a huge lookup table, which greatly reduces cache/memory usage.


That's kind of interesting: I imagine the lookup table was introduced as an optimisation in the first place. I'm at work and have had meetings, so I haven't seen the linked video yet (or the one where Kaze patched some functions in to get SM64 running at a constant 60fps), but I'm really keen to once the meetings subside :) These little self-contained optimisation problems are really appealing, especially when someone smart has done something cool and innovative.


> The question here is whether the new, optimized SM64 implementation is CPU-bound in the first place. Games on the N64 are often limited by memory bandwidth, which is taken up by rasterization,

Your premise is correct but your conclusion is wrong.

The CPU and the "GPU" (the RSP, Reality Signal Processor, was not actually a GPU, but for the purposes of this discussion it's close enough) shared memory bandwidth. If the CPU was using memory bandwidth, it was stealing memory bandwidth from the GPU, slowing down rasterization.

The thing to optimize was the total number of bytes the CPU moved to and from main memory. Every byte you could shave off the CPU's RAM traffic gave the GPU more memory bandwidth, and that improved GPU performance. CPU time was irrelevant.

This changes a lot about how you think about performance optimization. -Os is substantially faster than -O2, for instance. Loop unrolling is always a loss, even for small fixed-size loops. Temporary variables are bad: if you can do a thing by manipulating an existing variable and then de-manipulating it, and that prevents a spill onto the stack, it will improve performance. Lots of optimizations a normal programmer and/or the compiler will make speed up the CPU, but the extra RAM bandwidth they use slows down the GPU.
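
A toy example of the manipulate/de-manipulate idea (all names hypothetical; note the round trip is only bit-exact for integers, floats can pick up rounding error):

  /* Hypothetical sketch: offset a coordinate in place for a
     camera-local test, then restore it, instead of copying it into a
     temporary that might spill to the stack and cost RAM traffic. */
  void do_visibility_test(const float *x); /* assumed helper */

  void test_against_camera(float *x, float cam_x) {
      *x -= cam_x;           /* manipulate the existing variable */
      do_visibility_test(x); /* work in camera-local space       */
      *x += cam_x;           /* de-manipulate: restore the value */
  }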

> After recompiling with optimizations enabled, and maybe after swapping in newer versions of the RSP microcode, my guess is that further improvements to CPU efficiency would have very limited returns.

He's actually done this; it is not hypothetical. The original game got 20 FPS (ballpark). He recompiled with optimizations enabled and was getting 30 FPS (ballpark). He then performed a bunch of other code optimizations and got it up to 60 FPS. So there were enormous returns still on the table.


Was SM64 compiled with GCC or MIPSpro?


Kaze has expressed intent to produce a patch for SM64 with all the optimizations he's found, after he releases his overhaul hack.


I like Kaze's content. There is something fun about how hacky and ridiculous it is to put that much effort into Super Mario 64 optimizations and brand new levels.

This video is more technical than the others, but I suggest checking out his other videos if the idea of Super Mario 64 development piques your interest.


A lot of calculations and informed guesses and words like "could", "would", "should", but no benchmarks. Disappointing. With things like this, especially once you start to involve not-really-deterministic things like caches, you need to do integration testing of performance through benchmarks, not back-of-the-napkin calculations of how many instructions you are executing.

> Mario "could" walk in a different direction than he is pointing in

But did he? Was that a real, noticeable problem or just theoretical?


Another cool N64 project, Portal N64: https://youtu.be/eN_-V7PDKAo


The logical next step is creating a virtual machine, brute-force generating functions with fewer than 27 instructions, and seeing if any of them produces something close to sine. Let it run for a few months and surely it will find something.
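
A toy sketch of the scaffolding such a search would need (everything here is made up; the op set is far too weak to ever fit sine, and a real attempt would enumerate actual MIPS instruction sequences):

  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  typedef enum { ADD_X, MUL_X, NEGATE, HALVE, N_OPS } Op;

  /* Interpret a random "program" over a single accumulator. */
  static double run(const Op *prog, int len, double x) {
      double r = x;
      for (int i = 0; i < len; i++) {
          if      (prog[i] == ADD_X)  r += x;
          else if (prog[i] == MUL_X)  r *= x;
          else if (prog[i] == NEGATE) r = -r;
          else                        r *= 0.5;
      }
      return r;
  }

  /* Score a program by its worst-case error against sine on [0, pi/2]. */
  static double score(const Op *prog, int len) {
      double worst = 0.0;
      for (double x = 0.0; x <= 1.5707963; x += 0.01)
          worst = fmax(worst, fabs(run(prog, len, x) - sin(x)));
      return worst;
  }

  int main(void) {
      enum { LEN = 8 };
      Op best[LEN] = {ADD_X};
      double best_err = 1e30;
      for (long t = 0; t < 1000000; t++) {
          Op prog[LEN];
          for (int i = 0; i < LEN; i++) prog[i] = (Op)(rand() % N_OPS);
          double err = score(prog, LEN);
          if (err < best_err) { best_err = err; memcpy(best, prog, sizeof best); }
      }
      printf("best max error found: %f\n", best_err);
      return 0;
  }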


If you're using a square root anyway, it's also possible to fit only the part between 0 and pi/4, and use the square root to calculate the sine or cosine (whichever is greater).

However, you need to be a bit careful to return exactly sqrt(1/2) at pi/4, otherwise you get a discontinuity.
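
A sketch of that scheme (poly_sin is an assumed fit that's only accurate on [0, pi/4]; both branches must agree on sqrt(1/2) at exactly pi/4 to avoid the discontinuity):

  #include <math.h>

  double poly_sin(double x); /* assumed: accurate only on [0, pi/4] */

  /* For x in [0, pi/2]: compute the smaller of sin/cos directly from
     the fit and recover the larger one via sin^2 + cos^2 = 1. */
  void sincos_octant(double x, double *s, double *c) {
      if (x <= 0.78539816339744831) {            /* pi/4 */
          *s = poly_sin(x);                      /* sin is smaller here */
          *c = sqrt(1.0 - *s * *s);              /* cos from the sqrt   */
      } else {
          *c = poly_sin(1.5707963267948966 - x); /* now cos is smaller  */
          *s = sqrt(1.0 - *c * *c);
      }
  }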


Fantastic video


The video is very well produced and engaging; however (from skimming it), it seems that all the implementation methods it presents are suboptimal.

The N64 CPU had an FPU, so it seems obvious that the best method, from both a speed and an accuracy standpoint (or whichever is deemed more important), would be the same method that's used in all the libms in libcs like glibc, musl, or LLVM-libc, and in the standard libraries of languages like Go or Java: approximation using a polynomial. This polynomial is not derived analytically; rather, it's found using an iterative algorithm, traditionally the Remez algorithm, but also see my WIP project here: https://gitlab.com/nsajko/FindMinimaxPolynomial.jl.
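
The runtime side of that approach is tiny. As illustration, here is a degree-7 odd polynomial in Horner form; the coefficients shown are plain Taylor terms so the example stays honest, whereas a real libm substitutes minimax-fitted values for a lower worst-case error:

  /* Evaluate an odd degree-7 sine approximation with Horner's rule.
     Assumes x has already been range-reduced to [-pi/4, pi/4].
     Taylor coefficients for illustration; a minimax fit would use
     slightly different values. */
  double sin_poly(double x) {
      double z = x * x;
      return x * (1.0 + z * (-1.0 / 6 + z * (1.0 / 120 + z * (-1.0 / 5040))));
  }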


> This polynomial is not derived analytically, rather it's found using an iterative algorithm, traditionally using the Remez algorithm, but also see my WIP project here: https://gitlab.com/nsajko/FindMinimaxPolynomial.jl.

Is this your independent attempt? Because I think RLIBM did essentially the same thing recently [1].

[1] https://arxiv.org/pdf/2104.04043.pdf


One difference is that their approach doesn't scale to 64-bit floats AFAIK. My code and their approach are actually quite different, with different goals, different places where they're smart and different places where they could be improved.

I hope to improve my code based on their ideas when I finally read their papers; however, I still have plenty of ideas of my own that I want to implement.

See also this thread from twenty days ago, where someone asked me basically the same thing, linking an earlier paper by Lim and Nagarakatte (et al.):

https://news.ycombinator.com/item?id=36201197


Cool! I think the RLIBM approach is indeed fresh and has potential, but I was never sure whether it can be generalized to domains past binary32 when we are looking for correctly rounded results. I guess it's a fundamental difference between RLIBM and your algorithm---you don't guarantee correctly rounded results, right?


> I guess it's a fundamental difference between RLIBM and your algorithm---you don't guarantee correctly rounded results, right?

Yes, but note that the linked package doesn't even attempt to implement a polynomial approximation (for use in software); rather, it just finds the coefficients of a polynomial. Another WIP package of mine will use FindMinimaxPolynomial to actually implement the approximation (with source code output, etc.); however, even that package was never meant to guarantee correctly rounded results.

In particular their work is more ambitious than mine in that they seem to account for the range reduction and output compensation while looking for the coefficients with LP, which enables them to guarantee correctly rounded results.


I really don't think you can claim that the solution is suboptimal. Kaze has clearly spent a lot of time on this and has real code to show for it. You're just theorizing that it's bad without providing any real evidence.


I'm not claiming anything extraordinary, though; most or all of what I said is well known in the field.


The burden of proof is on you, though. If you think you can do better, then do it. It's ridiculous to post a dismissal without even watching the full video.


A good reference is the Handbook of Floating-Point Arithmetic. If you can be more specific about what it is that's not clear, I'll give more specific "proof", even though you're being a bit "sealion"-y.

> It's ridiculous to post a dismissal without even watching the full video.

The video is jarring to watch. It's well-produced, engaging, etc.; but also pretentious and overconfident. It seems to be chock-full of irrelevant "developments". So it's not reasonable to expect me to watch it end to end.


> The video is jarring to watch. It's well-produced, engaging, etc.; but also pretentious and overconfident. It seems to be chock-full of irrelevant "developments".

Don't even try to read the YouTube comments :-) “Kaze has to be the best assembly programmer of all time”, “this is the greatest genius since the inverse square root trick”, “this SM64 hack is the most optimized Nintendo 64 game of all time”, etc…


Polynomial approximation was used throughout the video, and minimax was one of the final methods; the timestamp is somewhere around 18:00. There seemed to be numerical issues with the approximation producing sine values above 1.0.


Yeah, it turns out I missed that part of the video. So the author, in fact, knew that minimax approximation is a thing; however, he only tried to do it by hand! It's a bit silly to optimize a polynomial for minimizing the maximum error and then complain when the result predictably does worse in a completely different metric. What he should have tried is taking a tool for minimax approximation, like Sollya (Remez-based, and offering other functionality too), and then looking for a minimax polynomial that additionally satisfies his other constraints. Sollya allows weighted approximation, so he could have tried constraining the polynomial to be more accurate specifically around zero, π/2, and π. NB: My WIP package will be even more flexible than Sollya's remez and offer more direct ways to specify such constraints, due to the completely different approach.


Why does he even care about pi/2? Shouldn't the wrapping of the input (due to the use of fixed-point) make sure that sin(pi/2) == sin(0)?


True, the author of the video should probably have done even more range reduction.
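
For reference, full range reduction with 16-bit fixed-point angles is only a few operations. A sketch (names hypothetical; quarter_sin is an assumed routine valid on a single quadrant):

  #include <stdint.h>

  float quarter_sin(uint16_t a); /* assumed: valid for a in [0, 16384] */

  /* 65536 angle units per turn, 16384 per quadrant. Two bit tests
     fold any angle onto the first quadrant using the symmetries
     sin(pi - x) = sin(x) and sin(pi + x) = -sin(x). */
  float full_sin(uint16_t angle) {
      uint16_t a = angle & 0x3FFF;         /* offset within the quadrant */
      if (angle & 0x4000) a = 0x4000 - a;  /* quadrants 1 and 3: mirror  */
      float s = quarter_sin(a);
      return (angle & 0x8000) ? -s : s;    /* quadrants 2 and 3: negate  */
  }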


By itself, Remez is not usually helpful for these kinds of low-accuracy polynomial approximations, because getting the minimax error beneath the quantization threshold is too strong a constraint. Usually one ends up doing some brute-force polishing at the end to try to reduce the polynomial degree and/or the bits necessary to represent the coefficients.


FWIW, Sollya's Remez algorithm takes intermediate floating-point accuracy into account. At the very least, one should start from such a polynomial and then try to tweak it by hand, not just drag coefficients around from scratch. Numerics isn't "just messing around with numbers" as he claims; it's a large field with lots of cleverness.

As a quick example (using Maple, though, not Sollya): f(x) = x * (1.020875931 + x * (-0.0683760152 - x * 0.1122036320)) gives f(0) = 0 (exactly), f(Pi/2) = 1 (to at least nine decimal digits) and has a max error around 0.0019, significantly better than his hand-tweaked one (max error 0.007). For comparison, his fourth-order function has 0.002787 and the Bhaskara formula has about 0.0016. I don't know how many bytes it needs in MIPS assembly, but I normally count function speed in cycles :-)

If you allow a divide between two second-degree polynomials, like in the Bhaskara function, you can go down an order of magnitude in error, of course at extra cycle cost.


I looked at the video a bit more closely, and it seems he requires symmetry around 0 (he doesn't explicitly do range reduction to positive values), so you want only odd-powered coefficients. If so, you can use a formula of the type x * (c + x*x*(b + x*x*a)) (i.e. ax⁵ + bx³ + cx; essentially the same as his formula with c=1 and x prescaled by a constant), which Sollya readily gives for single precision:

  > fpminimax(sin(x*2*Pi/65536.0),[|x,x^3,x^5|],[|SG...|],[0;16384]);
  Warning: For at least 2 of the constants displayed in decimal, rounding has happened.
  x * (9.58634263952262699604034423828125e-5 + x^2 * (-1.46252571143166976153082714517950080335140228271484e-13 + x^2 * 6.1585575698811928868593794069233315902067715796875e-23))
This has a maximum error of 0.000108 over that range. This is pretty close to optimal, but you can squeeze it ever so slightly lower (just below 0.0001) if you're willing to spend a lot of CPU time. :-)


How are you computing the max error here? What would the max error be for a Taylor series?


Take an interval; for sine it might be, for example, [0,π/4] (I think it was [0,π] in the video, most probably a mistake). You're interested in the maximum approximation error over the interval, so you could, for example, sample ten thousand points uniformly from the interval, evaluate the error at each, and take the maximum.

Technically, what you're really looking for is the supremum, not the maximum. The former is the limit of the latter.

Sollya does something smarter than just sampling uniformly, though: it knows about the derivatives of all the functions it works with (and the derivatives of those derivatives, etc.), and uses that information to compute an arbitrarily small interval that is guaranteed to contain the supremum: https://www.sollya.org/sollya-weekly/help.php#supnorm

My package also does something more complicated than uniform sampling, but it doesn't know anything about the derivatives of the relevant functions, so the maximum it finds is only approximate.
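
As a concrete example, here's a quick C harness that estimates the max error of the cubic fit quoted upthread by uniform sampling (the interval and sample count are arbitrary choices):

  #include <math.h>
  #include <stdio.h>

  /* The cubic fit quoted upthread, assumed to target [0, pi/2]. */
  static double f(double x) {
      return x * (1.020875931 + x * (-0.0683760152 - x * 0.1122036320));
  }

  int main(void) {
      double max_err = 0.0;
      for (int i = 0; i <= 10000; i++) {              /* uniform samples */
          double x = 1.5707963267948966 * i / 10000.0;
          double e = fabs(f(x) - sin(x));
          if (e > max_err) max_err = e;
      }
      printf("approximate max error: %g\n", max_err); /* ~0.0019 */
      return 0;
  }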


In this case, you know the input is one of 16384 distinct values (originally 65536, but due to range reduction and symmetry you only need to consider a quarter of them), so you can simply test them all.

But in the case of the minimax polynomial, I just got the error out of Maple (it can report it as part of computing the polynomial). That doesn't account for floating-point inaccuracies, though.



