Hacker News new | past | comments | ask | show | jobs | submit login
Cosine Implementation in C (github.com/ifduyue)
322 points by opisthenar84 on March 31, 2023 | hide | past | favorite | 158 comments



I love these things! Approximation is art.

I have a similar example in my book Geometry for Programmers (https://www.manning.com/books/geometry-for-programmers). There I make a polynomial approximation for sine that is: 1) odd (meaning antisymmetric); 2) precise at 0 and Pi/2 (and -Pi/2 automatically); 3) have precise derivatives and 0 and Pi/2 (and -Pi/2); 4) have better overall precision than the equivalent approximation with Taylor series; and it only consists of 4 members! Adding more members makes it impractical in the modern world since we have hardware implementations for sine on CPUs and GPUs.

Spoiler alert, the secret sauce is the integral equation ;-)



That was a really fun read. Thanks for sharing!


The realization that a 6th order approximation is good enough given our definition of float is beautiful.


You can approximate 8 to 8 significant figures using 987654321/123456789 = 8.0000000729


Interestingly, this pattern ([all digits reversed] / [all digits] ~= [base] - 2) holds in base 16 as well: 0xfedcba987654321 / 0x123456789abcdef = 1147797409030816500 / 81985529216486900 ~= 13.99999999999999878027


This pattern holds in many bases:

1.0 1 1 1.4 12 21 2.111111111111111 123 321 3.020618556701031 1234 4321 4.002680965147453 12345 54321 5.000262295081967 123456 654321 6.000020444462617 1234567 7654321 7.000001321561743 12345678 87654321 8.0000000729 123456789 987654321 9.000000003504939 123456789a a987654321 10.00000000014928 123456789ab ba987654321 11.000000000005706 123456789abc cba987654321 12.000000000000197 123456789abcd dcba987654321 13.000000000000007 123456789abcde edcba987654321 14.0 123456789abcdef fedcba987654321

https://ideone.com/DvljON


It is technically 12th order, but every other coefficient is zero, so it calculates a polynomial approximation in x^2 space


CPUs do have a 'hardware' (I guess microcode) implementation in the 80387 era FPU. But it isn't used. One reason is that the software implementation is faster in modern floating point code. I don't know about GPUs but I would guess compilers transparently take care of that.


It depends on the precision and accuracy and data types of the arguments and result. The 80387 was absolutely faster than software emulation for F32/64/80 by hooking real-mode interrupt handlers the way it was back to the 8086/8087. For I8/16, lookup tables were faster. For I32, a quality result would be very expensive. There were no GPUs in the 387 era, and there wasn't SIMD (MMX) either. In the x86 era, compilers weren't that smart. If they were, then they were slow.


the real problem is Intel screwed up their sin implementation by only using 66 bits of pi and then refused to fix it.


Which is just a support function for cos (which doesn't do argument reduction). See [1] for the actual function, and [2] & [3] for the argument reduction code.

[1] https://github.com/ifduyue/musl/blob/master/src/math/cos.c

[2] https://github.com/ifduyue/musl/blob/master/src/math/__rem_p...

[3] https://github.com/ifduyue/musl/blob/master/src/math/__rem_p...


To give a bit more background on this:

You really only need to compute one eighth of a circle segment (so, zero to pi/4), everything else you can do by symmetry by switching signs and flipping x, y. (And if you supported a larger range, you’d waste precision).

The function supports doubledouble precision (which is a trick, different from long double or quadruple precision, in which you express a number as an unevaluated sum of two doubles, the head (large) and the tail (much smaller)).

So, the “real” cosine function first takes its argument, and then reduces it modulo pi/2 to the desired range -pi/4, +pi/4 (with minimal loss of precision - you utilise the sign bit, so that’s better than using the range 0, +pi/2) in double double precision (rem_pio2 returns an int denoting which quadrant the input was in, and two doubles), and those two doubles are then passed to this function (or its cousin __sin, with the sign flipped as appropriate, depending on the quadrant).

Many people really have put in quite some care and thought to come up with this neat and precise solution.

ETA: Double double (TIL that it’s also something in basketball?): https://en.m.wikipedia.org/wiki/Quadruple-precision_floating...

Range reduction: https://redirect.cs.umbc.edu/~phatak/645/supl/Ng-ArgReductio...


You really only need to compute one eighth of a circle segment (so, zero to pi/4), everything else you can do by symmetry by switching signs and flipping x, y. (And if you supported a larger range, you’d waste precision).

In high school I was writing a paint program in BASIC on the Apple ][. Drawing circles was mind-numbingly slow until I read an article somewhere that pointed this out. It was like magic.


Interesting idea to compare the absolute value by simply not doing a float comparison.

I see they do a bit more than that now. Wonder how much time that saves.


This is a faster implementation of cosine that utilises six static const double variables (C1, C2, C3, C4, C5, C6) to speed things up. These variables represent the coefficients of a polynomial approximation for the cosine function.

For the genuine article, check out: https://github.com/ifduyue/musl/blob/master/src/math/cos.c


It’s the core implementation for a specific and very narrow range of arguments. The “genuine article” moves the argument into that range (via the rem_pio2 function performing range reduction, which is worth an article in itself), and then calls this core implementation.


Why do you return x-x for inf or nan? I can't test now, but I imagine they both return nan?

I run a high performance math library in C#, and I'm always looking to improve.

https://github.com/matthewkolbe/LitMath


It's an easy way to return nan and set FE_INVALID. If you're not worried about floating point exceptions you can just return NaN directly. Idk how much C# even guarantees about the floating point operations and if it's applicable there.

https://en.cppreference.com/w/cpp/numeric/fenv/FE_exceptions


Using the expression ensures that the FP exceptions are properly set. It also does quieting of sNaN or NaN payload propagation as appropriate.


I remember an approximation function used in vacuum-tubes based surface-to-air system of the 60s: cos(x) = 0.7. As explained to us rookies, that precision was enough to hit the target.


I can see it now.

  float cos(x)
  {
      return 0.7; // Chosen by fair dice roll
  }


That can't be good for all x. You mean x=π/4 or to 1dp.


Eh, it's close enough. After all, in the time of war the value of sine can go up to 4.


Only 4 ?

sin(1.5708) = 1

sin(1.5708 - 1.3170 i) = 2

sin(1.5708 - 2.0634 i) = 4

...


wait what? ELI grad student, I know you can plug i into the sine taylor series expansion but what is happening here


Sin/cos are tightly connected to the complex exponential and exhibit a lot of such behavior.

In particular, sin(x+it) = sin(x)cosh(y) + i cos(x)sinh(y). Set x something to zero out cosine, and rely on the exponential nature of cosh to find a target y value.


Oops, autocorrect, sin(x+iy). Also x and y should be real.


sin(x + pi/2) = cos(x) = cosh(ix)

Therefore, sin (pi/2 + i arcosh(2)) = cosh(-arcosh(2)) = 2

where arcosh(y) is the familiar inverse of the hyperbolic cosine, i.e. log(y + sqrt(y^2 - 1)), so we find sin(pi/2 + i log(2 + sqrt(3))) = 2.


The above should be intelligible to someone with high school math, no? I'd be interested in the grad student version, just out of curiosity


Most highschool students mainly use sin to calculate the lengths of sides of triangles. So it feels very counterintuitive that sin can be greater than 1.

If you define sin as the solution to a particular differential equation, then it's relationship to the hyperbolic functions is much more obvious.


This is radians, which can go from 0 to tau (2 * pi, about 6.28).


It's probably a pretty good 1st order guess that a surface to air missile launcher is aimed at 45 degrees. It's not aimed straight up and it's not flat on the ground.


> It's not aimed straight up

Well that certainly depends!

The VLS in ships are exactly that, vertical launch. As are, I believe, most Soviet/Russian large SAM systems like the S300 and S400


Well, cos(π/4) happens to be the RMS value of the function.


> That can't be good for all x.

It's within 1.3.


My first experience with "Use the source Luke" was wondering as a young lad how computers calculated trig functions. This led me to the Unix V7 math lib source code[1], and thence to a book by Hart & Cheney which by luck my university library had on the shelf. Good times.

[1] https://github.com/v7unix/v7unix/blob/master/v7/usr/src/libm...


A like minded reader may want to have a look at the Go lang. math standard library source.

* https://pkg.go.dev/math


Woah why do they have s390x assembly versions of everything?

https://cs.opensource.google/go/go/+/refs/tags/go1.20.2:src/...

That's for a mainframe?


What are the functions P and Q defined here? https://cs.opensource.google/go/go/+/refs/tags/go1.20.2:src/...


Polynomials with the coefficients defined in the sin and cos arrays (and likely in the referenced literature).


Implementing low precision transcendental functions is much more interesting IMO, as it allows for a wider variety of algorithms and approaches. When you are limited to the IEEE definitions, there really isn't much room for creativity and you have to handle all the obscure edge cases and chase least significant bits, bleh. That's not something you want to do if you need to calculate a trillion atan's. If one is curious how these algorithms work, check out the book "Elementary Functions".


In reality though, a typical stock libm is not that accurate; their error generally ranges from 1 to 5 ulps with no actual guarantees. I'm more interested in a hybrid libm with configurable errors, from the most accurate possible (0.5 ulps) to something like 2^20 ulps, and all in between.


This function is completely constant in running time, so I don’t think these considerations are important here.


You can always have a smaller constant!


You sure can, but if you can beat a single IEEE multiplication on an arbitrary processor my god that’s a program I hope to never ever in my life have to reverse engineer.


Never quite got this argument. You can frequently do 1000 fold improvements or do even better by optimizing within the same algorithmic complexity. That frequently flips the bit of "is this software usable".

Sometimes you can even make real world performance significantly better by ignoring the technically better algorithmic complexity, especially when you almost always with low n's


The IEEE standard doesn’t really require any particular precision for transcendental functions IIUC? The original standard doesn’t even define them. The 2008 version includes optional correctly-rounded transcendentals, but there seems to be a total of one implementation, CRlibm, which is an academic project of more than a decade (and a dead home page because of the INRIA Forge shutdown). Your run-of-the-mill libm has little to do with this.


The latest revision IEEE 754-2019 do require correctly rounded for transcendental functions: https://en.wikipedia.org/wiki/IEEE_754#Recommended_operation...


Right, it requires correct rounding for the optional correctly-rounded functions, which ISO C, for example, does not require to correspond to your usual libm ones, or even to be provided at all (even with __STDC_IEC_559__ or __STDC_IEC_60559_BFP__). I think I saw a spec for correctly rounded functions for C somewhere, but even the latest draft[1] of C23 is content with just reserving the cr_* namespace and not mandating much for the usual functions outside it (F.10/16):

> IEC 60559 specifies correct rounding for the operations in the F.3 table of operations recommended by IEC 60559, and thereby preserves useful mathematical properties such as symmetry, monotonicity, and periodicity. The corresponding functions with (potentially) reserved `cr_`-prefixed names (7.33.8) do the same. The C functions in the table, however, are not required to be correctly rounded, but implementations should still preserve as many of these useful mathematical properties as possible.

[1]: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf


There are recent efforts to provide a correctly rounded and performant C23 libm, such as https://core-math.gitlabpages.inria.fr/

Maybe correct rounding (and hence consistency) will be made into the standard requirements (at least for single and double precisions) in the near future?


Does anyone know of a good reference on signals and systems for people with a pure math background rather than an engineering or physics background?

So, the implementation linked above uses the polynomial of best approximation to cosine on an interval. That polynomial was presumably obtained using the Remez algorithm, which, given a function f and a set of n+2 points in your desired interval, yields a polynomial of degree n best approximating f in the interval (in the uniform norm sense)[1] by performing a sequence of linear solves

I have a math and specifically approximation theory background, and the wiki article on the Remez algorithm makes a lot of sense to me. Like, this is the language I speak: https://en.wikipedia.org/wiki/Remez_algorithm

I'm less comfortable though when I try to understand the documentation for the remez implementation in SciPy because it is using the language of signals and systems, which is how engineers think about this stuff: https://docs.scipy.org/doc/scipy/reference/generated/scipy.s...

Signals seems to use a lot of analogies to sound waves, and it does map in some way or another to the mathematical language I'm comfortable with. I've spent a little bit of time reading papers and introductory texts on signals, but the definitions feel a little loosey goosey to me.

Anyway, just opportunistically using this thread to ask if anyone knows of some book or papers or blog posts that can help me translate this stuff into math language :)


I feel your pain.

I've spent most of my career working at the intersection of multiple highly technical domains, both theoretical and applied, and a) the "jargon switch" b) the methodology differences and worst c) the switch of conceptual framework required to work on a problem that spans multiple domains is initially very irritating.

I'm pointing that out because if you are only looking for a rosetta stone to only tackle a), even if you find one, it'll leave you wanting because you'll still need maps to tackle b) and c)

Example: analog electronics. From a mathematician's POV, an analog circuit is just a system of differential equations.

But very few analog electronics specialist ever think about a circuit that way.

The words they use are the least important part of the "translation" problem. The way they approach and think about the system will be utterly alien to a pure or even applied math person.

Eventually, after learning the trade, you'll start to see familiar patterns emerge that can be mapped back to stuff you're familiar with, but a simple "this EE jargon means that in math language" won't be enough: you're going to have to get your hands dirty and learn how they think of the problem.


This mirrors my experience (failing) to learn a second language.

The words are the easy bit. It's the constructs, genders etc that's the hard bit.


I don't know much about numpy.signal package, but there are other python packages that have good implementation for generating polynomial approximation. Among them are pythonsollya (a Python wrapper for Sollya), and SageMath.

Other than that, some of the overview references about Remez algorithm (should be a bit closer to your taste) that I like are:

- J.M. Muller et. al., "Handbook of Floating-Point Arithmetic", section 10.3.2 (and follow references in that section).

- N. Brisebarre and S. Chevillard, "Efficient polynomial L-approximations," 18th IEEE Symposium on Computer Arithmetic (ARITH '07), which is implemented in Sollya tool: https://www.sollya.org/sollya-2.0/help.php?name=fpminimax


The scipy documentation you linked is specifically for generating a FIR filter using the Remez algorithm, which is why it is written in that style. It is a very specific use case of the algorithm and probably not the one I would recommend learning from. Another top level comment has a link to a decent overview of the Remez algorithm written without the "Signals and Systems" language.


Thanks, I think my request was unclear, but I'm specifically trying to understand the documentation linked, not the Remex algorithm itself. I'm trying to reconcile my knowledge of it with the language used in the documentation.

I have a pretty comfortable understanding of the Remez algorithm and generalizations and have written my own implementations for a research project


Take a look at the 2 listed references by Parks and McClellan. They are considered seminal papers in the field of signal processing and are the basis of the function you linked. And they will probably appeal to your math background more.

The name of the algorithm is Parks McClellan filter design, or sometimes you will see it named firpm, as popularized by MATLAB.

https://en.wikipedia.org/wiki/Parks%E2%80%93McClellan_filter....


Just noting that your link to docs.scipy is broken. I'm also in a similar boat to you and would be interested in an answer to this question.


Thanks, fixed it!


Some of the more recent implementations of cosine for single precision that are correctly rounded to all rounding modes according to IEEE 754 standard:

- The CORE-MATH project: https://gitlab.inria.fr/core-math/core-math/-/blob/master/sr...

- The RLIBM project: https://github.com/rutgers-apl/The-RLIBM-Project/blob/main/l...

- The LLVM libc project: https://github.com/llvm/llvm-project/blob/main/libc/src/math...


This takes me back. I did the transcendental functions for a CPM C compiler in the early 80s using Chebyshev polynomials.

My tutor failed my implementation because my code formatting was utterly crap. :) I am still laughing as there was no sin, cos in the maths library at all at the time.


Bit shifted sin values in a pre-made table of integers could be used so the maths was all integer based.

Commonly used on processors like the 6502 or 68000 for vector graphics.

It was accurate enough for games.


> It was accurate enough for games.

Unless like me you made a typo when copying the table from a book and had a vector routine that mostly looked alright but would jerk slightly when that one value that was incorrect was chosen! <facepalm />


It was floating point math. There was nothing in the CPU to support it so it was ridiculously slow.

But, sometimes you just want to have those functions to do some complex geometry.


I'm pretty sure that indent was around even in the early 80's. Even if you couldn't be bothered to run it, your tutor could have. Must not have been a very good tutor to fail you for style.


How sure?

indent was written in 76, but didn't make it to BSD until 82.

CP/M was first released in 74!

It was a pretty different world.


I was using CP/M since about 1978, but you're right -- AFAIK there never was a port of indent to CP/M. I did make some assumptions, but developing for CP/M on CP/M was somewhat painful. There just wasn't enough memory space to work efficiently, or (often) even fit the compiled image into memory (<~60k after BIOS/BDOS/CCP on CP/M 2.2). Chebyshev polynomials certainly imply floating point, which could be a BIG memory hog with so little room.

If I had the resources at the time, I would have undertaken such a development task on a more capable (Unix) platform. Such an environment would improve the ability for developing/debugging the optimizations required. As you say, indent was available in BSD since '82, and I know it was present on the Sun2 systems I used in '83.


I had an Osborne 1. I had to swap disks multiple times to compile anything. I remember the C compiler by some guy, 'Walt Bolisvki'?, had the arguments passed in reverse on the stack (last on the top) and printf and friends ended up crazy hacks.

I didn't have access to a Unix at the time or very limited open source so ident was not a thing.


Personally I like Universal CORDIC, you can calculate much more than Cosine, but it tends to be slower. See https://en.wikibooks.org/wiki/Digital_Circuits/CORDIC#The_Un... and a somewhat janky fixed point library that implements it https://github.com/howerj/q


The CORDIC algorithm is great! Just a few months ago, I implemented a sine generator with it and also used fixed point arithmetic. It is a real eye opener to see that you can substitute multiplications by bit shifting when you store the numbers in the “right“ format.


CORDIC is what many digital calculators use (along with decimal floating point).


> along with decimal floating point

Source? I'd find this a bit strange because:

1. I suspect decimal floating point is quite expensive to implement computationally.

2. Not an expert, but I think the floating-point algorithms and proofs for bases other than 2 are less well researched and less well known.

3. A calculator is a resource-constrained device.

4. Are there even any real benefits? Yes, some human-friendly numbers have an exact finite representation that is missing in base-2, but I can't see why would that matter, compared to just using 32-bit or 64-bit IEEE-754.


The TI-83 series of calculators not only uses decimal floating-point, but the format is documented:

http://merthsoft.com/linkguide/ti83+/vars.html

In practice, the cost is not that much different because four-function calculators do not do many calculations at once. Even in a graphing calculator, drawing a plot is never slow enough to make the device useless.


One of the more memorable examples: https://news.ycombinator.com/item?id=28559442

Decimal, and particularly BCD floating point, is much simpler than binary for a calculator since the numbers don't need any additional conversion steps to display.

And the fact that 0.1 + 0.1 works as expected is indeed a huge benefit.


Decimal floating point is pretty easy to impliment in hardware.


You may be confusing floating-point with fixed-point?

Hardware with decimal floating-point only appeared in 2007, and still doesn't seem to be easily available:

https://en.wikipedia.org/wiki/Decimal_floating_point

https://en.wikipedia.org/wiki/POWER6


I expect that's because the cost versus effort for having two kinds of floating point isn't very good. Not because decimal floating point is notably more expensive than binary floating point.


Very good explanation about Remez algorithm: https://xn--2-umb.com/22/approximation/index.html


Personally I go with the recursive definition. Use the trig identity cos(x)^2 + 1 = cos(2x).

    2cos(x)^2 - 1 = cos(2x).
    2(2cos(x)^2 - 1)^2 - 1 = 2cos(2x)^2 - 1 = cos(4x)
    2(2(2cos(x)^2 - 1)^2 - 1)^2 - 1 = 2cos(4x)^2 - 1 = cos(8x)
Keep squaring, doubling and subtracting one one the left. Keep raising by powers of 2 in the argument on the right. Eventually, if you're working with rational numbers, multiplication by 2^k moves all k fractional bits over to the integer side of the decimal, and then integer times pi is the base case of the recursion. If you rescale the units so that pi = 2^(n-1) for some n, you don't even have to think about the non-fractional part. The register value circling back to 0 is supposed to happen anyway.

You could Just as easily recursively define cos to "spiral downwards in x rather than outwards", each recursive layer reducing the argument of x by a factor of 2 until it reduces down to the precomputed value cos(machine precision).

This was all just circular reasoning


> This was all just circular reasoning.

Nice :)


Here is a naive question. A lot of effort goes into dealing with floating point imprecision and instabilities, right. So why aren’t these algorithms implemented with integers? A 64 bit integer could represent the region from 0 to 1 with 2*64 evenly spaced slices, for example.

Would that make things more accurate across operations? Is this something that is done?


That is called 'fixed point', and it is useful for some applications. What makes floating point interesting is that it can represent values with very different magnitudes; your 64-bit fixed point number can't represent any number greater than 1, for instance. Whereas a 32-bit float can represent numbers with magnitudes between ~2^-127 and ~2^127.


Maybe Cosine could be fixed and converter to float before use. Using identities to calculate Sine and Tangent functions.


Seems like a lot of effort for no actual gain.


Just an option if magnitude is an issue. You only needs 0-1 plus a sign bit.

Fixed to float is easy enough. It'd let you use a fixed point implementation for 0 - pi/2 benefiting from smaller look up table and simpler implementation.


> It'd let you use a fixed point implementation for 0 - pi/2 benefiting from smaller look up table

Why would that table be smaller if you use 2^64 values in that range instead of the fewer than 2^55 doubles in that range?

> and simpler implementation.

Why would it be simpler? With doubles, the hardware will do scaling for you. With fixed point, you need a helper function or macro to do multiplications, and those have to use an intermediate result with more than 64 bits (in this special case, some shortcuts are possible, but those won’t make the implementation simpler).


One or both of us have lost the thread here.

There was an exchange about using fixed point (which apparently is being used according to other comments for easier correctness verification, speed and simplicity). Someone pointed out that float gives you a wider range of values. But those values are not needed for cosine which has a range of 0-1 in the interval 0 to pi/2.


> But those values are not needed for cosine which has a range of 0-1 in the interval 0 to pi/2

Ah not quite true though; the “range” is also referring to the exponent range of representable numbers, so don’t forget the small numbers near 0. You can get much smaller numbers with much higher precision out of a single-precision float than a fixed-point 64 bit [0,1] number.


Ah makes sense. I imagine depending one how you intend to use the numbers you need that precision at the steeper parts too.


There should be plenty of C trigonometric algorithms made for fixed point. They usually use lookup tables and interpolation. Used a lot in embedded microcontrollers that don't have hardware floating point or hardware trig functions, and in automotive where floats were (still are?) a no-go due to the dangerous float math pitfalls and the safety critical nature of ECUs.

I remember writing a fast, small footprint, but very accurate fixed point trig function for the microcontroller used in my batcher project, many lives ago. I'll try to find it and put it on github.

The neat thing is that you can use the native overflow wrap around property enabled by C integers and processor registers to rotate back and fourth around the trigonometric circle without needing to account for the edge cases of the 0/360 degree point. You do need to suppress MISRA warning here though as it will go nuts seeing that.


I'd love to see some implementations of fixed point trig functions, but I haven't found any! Do you have any links by any chance?



Nice, I didn't realize Forth used them - thank you!


I don't have the code at hand to show you, nor do I have any links similar trig stuff as I wrote everything from scratch, first in Matlab to prover the concept, then in C, as there wasn't anything similar online at the time, but I used the principles of fixed point fractional math of representing floating point numbers as scaled integers in the Q notation, which is well documented for use in microcontrollers and DSPs of the era (early to late '00s)

Here's the wiki gist with some C code snippet examples [1]. Here's a more detailed math paper that goes in the weeds[2]. Here's a nice presentation from Motorola-Freescale-NXP I just found which I wish I had back then as this is awesome and explains the basics for everyone to understand, and seeing those Motorola Windows 95 GUI tools brings a nostalgic smile on my face.

[1] https://en.wikipedia.org/wiki/Q_(number_format)

[2] http://darcy.rsgc.on.ca/ACES/ICE4M/FixedPoint/FixedPointRepr...

[3] https://www.nxp.com/files-static/training_presentation/TP_CO...


What you're describing is fixed point. Fixed point absolutely exists in practice, but it has an important limitation: its accuracy is scale-dependent, unlike floating-point.

Suppose for example you were doing a simulation of the solar system. What units do you use for coordinates--AUs, light-seconds, kilometers, meters, even miles? With floating-point, it doesn't matter: you get the same relative accuracy at the end. With fixed point, you get vastly different relative accuracy if your numbers are ~10¯⁶ versus ~10⁶.


Re: "imprecision and instabilities" on standard hardware there really are neither. Floating point numbers have an exact meaning (a particular fraction with an integer numerator and a denominator that is a power of 2). There are exactly specified rules for basic operations so that adding, subtracting, multiplying, or dividing these fractions will give you the nearest representable fraction to the exact result (assuming the most common round-to-nearest mode).

The bit that appears as "imprecision and instabilities" is a mismatch between three things: (1) what the hardware provides (2) what controls or compiler optimisation settings the programming language exposes (3) what the user expects.

There are several sticky points with this. One mentioned in the parent article is higher precision. Starting from the original 8087 floating point coprocessor x86 has supported both 80-bit and 64-bit floating point in hardware, but programming language support has mostly ignored this and only shows the user "double" precison with 64 bits of storage allocated. But with some compiler settings parts of the compuation would be done at 80 bit precision with no user visibility or control over which parts had that extra precision.

Newer hardware modes stick to 64 bit computation so you're unlikely to run into that particular stickiness on a modern programming stack. There is another that comes up more now: fused multiply and add. Modern hardware can calculate

   round(a*x+b)
in one step where a,x,b are floating point numbers (i.e. fractions) and the "round" operation produces the nearest representable fraction to the correct result. This is faster and more accurate to calculate than doing the multiplication and addition seperately, which gives

    round(round(a*x)+b)
But again programming language control is weak over whether and when fused-multiply-and-add is used, and this is a detail that almost all programmers almost always just want to ignore anyway.


Speaking from no experience, but it should be doable. But since in most applications you're going to use floating point anyway (or, if you're using fixed point, maybe right bit shift or maybe even do integer division until it fits the range of your value), the extra precision would be discarded in the end.


So why is that extra term needed here?

w = 1.0-hz;

return w + (((1.0-w)-hz) + (zr-xy));

------------------^ here


The idea is to be able to compute 1.0 - h * z + (z * r - x * y) in high precision.

If h * z is slightly less than 1.0, then w = 1.0 - h * z can lose a lot of bits of precision. To recover the bits that were dropped in subtracting h * z from 1.0, we compute which bits of h * z were used (the high bits) and then subtract those bits from h * z to get only the low bits. We find the bits that were used by simply undoing the operation: 1.0 - w is mathematically the same as h * z, but is not the same as implemented due to limited precision.

So (1.0 - w) - h * z is the extra bits that were dropped from the computation of w. Add that to the final term z * r - x * y before that result is added to w, and we have a more precise computation of w + z * r - x * y.


Also why does it do r = z * ..., and then does another multiplication z * r at the end? Why not leave that z out and do w * r at the end?


Mathematically, we have (1 - (1 - hz)) - hz = (1 + hz) - hz = 1, so the computation seems redundant. But due to floating-point rounding, it is not, and still creates information. As the comment says:

> cos(x+y) = 1 - (x*x/2 - (r - x*y))

> For better accuracy, rearrange to

> cos(x+y) ~ w + (tmp + (r-x*y))

> where w = 1 - x*x/2 and tmp is a tiny correction term

> (1 - x*x/2 == w + tmp exactly in infinite precision)

(hz is x*x/2)

See double-double arithmetic - https://en.wikipedia.org/wiki/Quadruple-precision_floating-p...


Why does he link to his lame fork of musl for the official FreeBSD sources for libm, taken from Sun, used by musl?

https://github.com/freebsd/freebsd-src/blob/main/lib/msun/sr... or at least https://git.musl-libc.org/cgit/musl/tree/src/math/__cos.c


The history of the freebsd file is interesting in itself


This is the kind of thing that triggers my imposter syndrome big time.

I've been coding professionally for almost 30 years, and I see stuff like this and wonder if I'll ever be a real programmer.


This is Numerical Analysis, not "programming". It's a topic in Applied Mathematics that you would need to study separately from programming. https://ocw.mit.edu/courses/18-330-introduction-to-numerical...


You're probably more of a "real programmer" than whomever wrote this code--it's numerics code, after all.

There's no real tricks or anything to this code, it's just a polynomial evaluation of a Taylor series (which you probably learned, and subsequently forgot, from calculus), with some steps done using techniques for getting higher precision out of a floating-point value, and (in some other functions in libm, not here) occasional use of the bit representation of floating-point values. It's standard techniques, but in a field that is incredibly niche and where there is very little value in delving into if you don't have a need for it.


It's not a Taylor series. It's a series, but a minimax approximation, not Taylor.

Taylor series are essentially “Give me a polynomial p(x). At x=a, I want the value to be b, the first derivative to be c, the second derivative to be d, etc. etc.”. (Usually, a=0, also known as a MacLaurin series.) This gives you a _great_ approximation in the immediate vicinity of a, but usually a pretty horrible one the further away you get; in a sense, you spend all your approximation energy on that single point.

Minimax approximations are “Give me a polynomial p(x). Over the area x=[a,b], I want it to never stray far from f(x); give me the one that minimizes the worst-case distance”. (The distance function is usually either the absolute error |p(x)-f(x)|, or the relative error |p(x)-f(x)|/|f(x)|, but you can have an arbitrary weight function in some implementations.) So you consider everything in that area to be important, not just one point.

Taylor series are simple to calculate (which can help if you need to calculate the coefficients on the-fly, or by hand), good for analysis, and easy to understand, so most engineers will have heard of them in calculus. The Remez algorithm, on the other hand, essensially requires a computer, and is significantly more complicated to understand and implement. (Remez is not the only way to generate minimax polynomials; Chebychev polynomials come close, and I believe Sollya uses some other algorithm.) Both will converge to the actual function giving infinite terms, but minimax polynomials will be _much_ closer in most real-life situations. Also, if you are really pressed for it, you can have rational minimax polynomials (p(x)/q(x)) that approximate tricky functions _really_ well even at low degree, but of course at the cost of a division.


If you don't care about "performance":

double my_cos(double x) {

    double sum = 0;

    for (int n = 0; n < 10; n++) { // 10 terms of Taylor Series

        sum += pow(-1, n) * pow(x, 2 * n) / factorial(2 * n);

    }

    return sum;
}

Edit: code formatting


/* This code is silly, not super accurate, but fun /

double cos2(double x, int n) {

    double numerator = 1.0;

    double denominator = 1.0;

    double pi = 3.14159265358979323846;

    int i;

    for (i = 1; i <= n; ++i) {
        numerator   *= pow((x - n * pi / 2.0), 2);
        denominator *= pow((n * pi / 2.0), 2);
    }

    return numerator / denominator;
}

(Edit for format)


Also if you don't care about accuracy at all. my_cos(2 * M_PI) = 0.9965. my_cos(4 * M_PI) = -2917.7144 (!).


Agreed with this. Also outside of very specific applications like graphics libraries, robotics, game engines you would never want to write something like this from scratch and instead rely on an existing library or game engine. So unless you’re working in that area it’s not imposter syndrome to find it foreign, it’s probably good!

I’ve never worked directly on floating point libraries like this, but having dealt with similar stuff, I suspect if this were in production it would end up peppered with IFDEF to support various implementation differences across hardware types, even if it is compliant with eg IEEE standards.


This is a fascinating topic indeed. And the math behind polynomial approximation is not that deep or unapproachable. You just haven't found an explanation that works for you. Some time ago I tried to make one with a sine as an example: https://wordsandbuttons.online/sympy_makes_math_fun_again.ht...

It doesn't mention Taylor series anywhere but if you only make your system out of N differential equations, you'll get N members of the power series.


Just slow down and look at it line by line. Take detours to make sure you understand every part.


I don’t think this would help anyone’s imposter syndrome. Taking hardcore math line by line will yield confusion.

A better approach might be to learn how cosine can be approximated with a Taylor expansion, then realize this is just an optimized version of that. (Even if it’s a different algorithm, it’s still an approximation algorithm, which is the point.)

Staring at math constants won’t get you that insight.


Sure, but that's not what causes it.

I can reason through the code and understand it.

The thing that makes me wonder if I'll ever be a real programmer when I grow up is that someone wrote this, probably on a Thursday after lunch (at least, that's how my brain imagines it).

They were able to conceptualize the math solution to a major optimization problem, deal with all sorts of precision issues with bit manipulation, and come up with a blazing fast bedrock algorithm that's dazzling both in it's simplicity and complexity. And then likely popped out to a pub for a pint and didn't think much about it again.

Intellectually, I understand that this person was specialized, and may struggle mightily navigating the complexity of modern web frameworks, something that I find straightforward because I do it every day.

But for some reason, this stuff always strikes me as being "real" programming, and what I'm doing as just plumbing, piping together the bits that the "real" programmers write.


Your mind is inventing the part about them doing it on a thursday afternoon before popping out to the pub.

This could have taken weeks of research and refinement. You can't tell from just looking at the final result.

Give yourself some credit. If you were tasked with producing a cosine implementation, you'd pull out some textbooks, read specs, research algorithms, try a few ideas, test, iterate. You could come up with this. The only reason you didn't is that you didn't need to make a cos function yet; you had other stuff to do.


Do you ever feel a burning desire to just solve the hell out of a particular problem? Usually this happens in the context of performance optimization, but it can be anything. Maybe you're supposed to be working on something else, but all you can think about is your pet problem. You throw a bunch of stuff at it, and mostly it doesn't work, but the failures don't discourage you at all; making any sort of progress is addicting. You look at other people's code. You read blog posts and research papers. You fill up notepads. None of this feels remotely like work. It's fun, and you simply cannot stop until you finally grind the problem into the dust.

My guess is that the author of this code was possessed by a spirit like this. Maybe I'm wrong; maybe he was just assigned by his manager to write a cosine function that met a certain performance threshold, and he immediately had a general idea of how to do it, and banged it out in an afternoon. But I doubt it.

My point is that, if you have felt this way about programming, you are doing "real" programming. It's not about whether you're writing a web framework or a cosine function. It's about the thrill of the chase.


If it's any consolation, I worked with one of these teams for a few years. Brilliant at the math stuff, and many could not handle git. Or unit tests.

You've spent your time one X, they've spent their time on Y. That's it. Most of these people are math people. They may have CS degrees, but they've spent the bulk of their time in math.

Actually, now that I recall, one of them was a nuclear physicist and the other was an industrial engineer - neither had CS degrees. I don't know about the others.


IMO the part of programming that interfaces with humans and requirements is just as tricky (in a different way) and just as important.

Also, have you ever formally studied maths? People don't just come up with this kind of thing. They read about algorithms in textbooks. They might not have read this specific algorithm (which may be original), but they'll have read lots of similar ones.


I have some math background so it is not difficult to understand.

If you ask me to write a simple API for a CRUD app though I'd cry like babies


There are 1000's of different ways to be an expert in CS. implementing high performance functions in numerical analysis is only one.


Here's someone else's fast-sine code snippet I found and used about a decade ago when I was working on an additive synthesizer I never finished. It basically uses the same method, Taylor series approximation.

Driving it as an oscillator relies on just a single integer addition operation per sample to increment the phase, and relies on integer overflow to reset the phase.

I like the comment from someone asking if it could really be so fast since it has 7 whole multiplications in it. Ah I love the realtime dsp world, so different from my python/JS day job.

https://www.musicdsp.org/en/latest/Synthesis/261-fast-sin-ap...


You might be interested to know about mod 2^m -1. Look at these numbers:

32640, 15420, 39321, 43690

In binary they look like: 1111111100000000, 0111100001111000, 0011001100110011, 0101010101010101

How cute. They are like little one bit precision sine waves.

I assure you that these numbers multiplied by each other mod (2^16 -1) are all equal to zero. When multiplied by themselves they shift position. test it for yourself. Long story short, in the field of mod 2^16 -1, 2^x works just like e^pi i, these numbers function just like sine and cosine, and you can even make linear superposition out of them and use them as a basis to decompose other ints into "frequency space."

Even better, the number i is represented here. There's a number int this modulo system which squares to 1111111111111110 (aka -1). Once you see it, you can implement fast sine and cosine directly through 2^{ix} = cos(x) + i sin(x).

Cool right.

Have fun.


Whoa that is awesome. You and Sesse__ are really inspiring me to get a VST dev environment set up again this weekend.


If you're doing a synth and want a sine wave, you usually won't need to compute cos(x + at) from scratch every time. There's a cheaper and more accurate trick.

First, figure out how much the phase (in radians) is going to increase every sample. E.g., for 440 Hz at 44100 Hz sample rate, you want 2pi * 440 / 44100 = 0.06268937721 (or so). Precompute sin(a) and cos(a), once.

Now your oscillator is represented as a point in 2D space. Start it at [1,0]. Every sample, you rotate this point a radians around the origin by a radians (this is a 2x2 matrix multiplication, using the aforemented sin(a) and cos(a) constants, so four muladds instead of seven). You can now directly read off your desired cos value; it's the x coordinate. You get sin for free if you need it (in the y coordinate).

If you run this for millions or billions of samples (most likely not; few notes are longer than a few seconds), the accumulated numerical errors will cause your [x,y] vector to be of length different from 1; if so, just renormalize every now and then.

Of course, if you need to change a during the course of a note, e.g. for frequency or phase modulation (FM/PM), this won't work. But if so, you probably already have aliasing problems and usually can't rely on a simple cos() call anyway.


That is a really cool trick, I think I'll give this a try over the weekend.

I'm a bit confused about pre-computing the table though, wouldn't it have to be massive since the sample rate and note frequency don't evenly line up? On my phone at the minute but with a 440hz note under 44.1k sr, doesn't one cycle equal 100.227~ samples? I'm guessing the idea is to make a smaller table and interpolate? And I'm guessing you compute these tables for each musical note (or a limited but realistic range) at compile or load time, and not with each note press.

Not being able to change the frequency while the note is running seems like a big deal for a synth though, no? Even without high frequency FM/PM which is often okay to not support, you would still be giving up on things like legato, lfo pitch mod (vibrato) and pitch wheel bending. Or is this only a problem in the higher frequencies of modulation? You know, it sounds like a fun thing to find out myself lol.

The accumulating errors could be an issue with free-running oscillators, especially in a studio system left on for a few days. Free-running oscillators aren't that common in additive synths I believe, but it could be neat to see if it makes a difference like it does in a subtractive synth. Either way, periodic renormalization should be pretty straightforward.

Also I do somewhat remember when I was doing my original research for an additive synth, the word on KVR was a lookup table based sin was fastest as long as you could fit it into L1 cache. That was over a decade ago and caches are a lot bigger now.


I think I can explain the trick even easier and hopefully that clears up your confusion. Simply store a complex number s which represents the current output of the synth (cos on the real part, sin on the imag). At each time update (the sample rate), multiply s by a pre-computed constant c (which is also complex valued). You just need for c^(i/sample_rate) to equal e^(i*freq/2pi). It does not matter that freq/2pi doesn't line up with the sample rate. The trick is wholly dependent on just keeping track of the state s and reliably multiplying it at a known update frequency.


You don't need a table for a simple sine oscillator. You just compute sin(a) and cos(a) on note-on, and that's it. You don't store the output in a table, you just apply that 2x2 matrix every time.

Tables can definitely have their place, depending on your synthesis method. If you want something like pitch bend, going for a sine table may very well be the fastest (it depends a bit on how much cache you want to burn).


If you care about this kind of stuff, the theory lies in numerical methods, for example,

http://www.sci.brooklyn.cuny.edu/~mate/nml/numanal.pdf


I recently had to implement some Bessel functions numerically. It's fantastic to see that someone at Sun Microsystems implemented all this procedures in the early 90's (I'm sure even long before that). You will find this implementations in R, Julia, musl, you name it!

I like it specially the use the "GET_HIGH_WORD" and friends:

https://git.musl-libc.org/cgit/musl/tree/src/math/j0.c

It's a lot of fun to figure out what they meant by things like:

> if (ix >= 0x7ff00000) return 1/(x*x);


you should check out Bessels.jl which has really high performance implications.


https://github.com/JuliaMath/Bessels.jl/blob/master/src/bess...

Thanks! I love it, so easy to understand and follow.

My favourite work on the subject is Fredrik Johansson's:

https://github.com/fredrik-johansson/arb

Whenever I feel down and without energy I just read something in there


yeah, Fredrik's papers are great. A lot of the reason Bessels.jl is so good is because ARB makes it easy for us to have test against a known source of truth (which make things like minimax polynomials and picking optimal tradeoffs a ton easier).


You work with Julia? Nice! I actually use Julia to generate my test cases:

  using Nemo

  CC = AcbField(100)

  values = [1, 2, 3, -2, 5, 30, 2e-8]

  for value in values
      y_acb = besseli(CC(1), CC(value))
      real64 = convert(Float64, real(y_acb))
      im64 = convert(Float64, real(y_acb))
      println("(", value, ", ", real64, "),")
  end


https://www.atwillys.de/content/cc/sine-lookup-for-embedded-...

Have you checked out this?

Its an alternative implementation which uses interpolated table lookup and requires only 66 bytes of constant memory with an error of less than 0.05% compared to the floating point sine.


The (presumably relative) error of 0.05% implies only ~11 significant bits, while 32-bit and 64-bit floating point formats have 24 and 53 significant bits respectively. If your sine has only 11 correct bits and you are okay with that, why do you have much more than 11 bits of input precision after all?

EDIT: Someone (sorry, the comment was removed before I could cite one) mentioned that it is actually for Q1.15 fixed point. So it was not even an "alternative" implementation...


Can someone shed light on these two steps in the approximation?

  *         since cos(x+y) ~ cos(x) - sin(x)*y
  *                        ~ cos(x) - x*y,
For the first, maybe it's an approximate identity I'd never heard of?

For the second, it seems like it relies on X being small, but in this code... Y might be small enough for sin==identity approximation, but X isn't, is it?


The input is a double-double where y << x by definition. So the first line is just using the standard cos identity cos(x+y) = cos(x)cos(y) - sin(x)sin(y). Since y is very small, we can approximate cos(y) ~= 1, and sin(y) ~= y. This yields the first line.

The second line, I believe, is not relying on x being small, but rather relying on sin(x)*y being small. The whole result of the sin(x)*y term is approximately on the order of the LSB of the result, and the relative error of approximating sin(x) using 1 is better than using 0, so this term will help produce more accurate rounding of the bottom bits.


I think the function inputs are in double-double format, so the assumption is that the magnitude of y is significantly smaller than the magnitude of x, ideally |y| < 2^(-52) |x|. So that 1 is a very good approximation for cos(y), since |cos(y) - 1| < |y|^2 < 2^(-104) |x|^2. Similarly, if we assume |x| is also small, |sin(x) - x| < |x|^3 is also a very good approximation (same with sin(y) ~ y).

So using the cosine of sum formula:

  *  cos(x + y) = cos(x) cos(y) - sin(x) sin(y)
  *             ~ cos(x) - x*y.


Looks like angle sum identity.

    cos(x+y) = cos(x)*cos(y) - sin(x)*sin(y)     (always)
             ~ cos(x)*1      - sin(x)*y          (if y small)
Can't speak for the validity of small angle assumptions though.


It is possible to draw a line through n points with an n-1 degree polynomial, and since the sine/cosine curve is so simple you don't need to take that many samples to get a good approximate.

Would it not be faster to just store a lookup table and interpolate is my question. I recall hearing that's how it's done.


CPUs can do dozens of calculations for the cost of one L1 cache fetch so this is still going to be faster probably


Its been a while since I've dealt with this, is this something similar to the CORDIC algorithm ? Couple of years ago I implemented it in VHDL, very fun exercise.

https://en.wikipedia.org/wiki/CORDIC


https://github.com/RobTillaart/FastTrig/blob/master/FastTrig...

If absolute precision is not required, there is many ways to compute cosinus !


Programming accurate floating point calculations is where you need a maths college/uni degree. 99% or the "other programming" (including "AI") does not require skills higher than those from high-school.


AI / Machine Learning requires university level mathematics https://www.deeplearningbook.org


As far as I can remember, most probability functions they use in some neural net models were taught in high-school.

When I coded a neural net, the most important was the size of its connectom and not really related to the mathematical fancy-ness of neurons and synapses operations.


Memory model experts: if you define the constants in the way this code has, will they be contiguous in memory? Or would it be better to allocate these in an aligned(64) array?


C provides no guarantees about the memory layout of distinct globals. If you want a specific contiguous layout, you need to define it as a buffer. In practice though, most linkers will place them contiguously since they are simply appended to the constants section in the order they are parsed.


What is y, "the tail of x"?


You pass an argument, let's call it a, to cos. Because the polynomial approximation is only valid for a relatively small value of the argument, the first step is to compute the remainder of a when divided by pi/2. The result, let's call it x, is as small as possible but will (or should) satisfy cos(x)=cos(a).

For example, let's say you're calculating cos(4). To reduce the argument you want to compute 4-n*pi/2 for some n. In this case n=2 so you want 4-pi, the smallest possible number whose cosine is equal to cos(4). In double-precision floating point, pi is not irrationall it's exactly 7074237752028440/2^51. Thus, x=4-pi will be 7731846010850208/2^53.

However, that's not enough to calculate cos(4) precisely, because pi is actually 1.224... * 10^-16 more than the fraction I gave before and thus cos(x) will not be exactly cos(4). To fix this, the remainder is computed in multiple precision, and that's where y comes in.

y is the result of computing (a-n*pi/2)-x with a value of pi that has a higher precision than just 7074237752028440/2^51. It is very small---so small that you don't need a polynomial to compute the remaining digits of cos(4), you can just use the linear approximation -x*y. But it's still very important in order to get the best possible approximation of cos(a), even if a is outside the interval -pi/2 to pi/2.


Fascinating. But why are cos and sin implemented with different polynomials?


cosine and sine are easy thanks to the taylor series. They are both one liners in haskell. tangent on the other hand....


sincos is better. Two for one deal. :)


double_t hz,z,r,w;

Wtf?

*clicks request changes*


Same vibes:

  float Q_rsqrt( float number )
   {
   long i;
   float x2, y;
   const float threehalfs = 1.5F;

   x2 = number * 0.5F;
   y  = number;
   i  = * ( long * ) &y;                       // evil floating point bit level hacking
   i  = 0x5f3759df - ( i >> 1 );               // what the fuck? 
   y  = * ( float * ) &i;
   y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
   // y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed
   return y;
 }
https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...


It may be possible to rewrite this in a more readable way, with fewer magic numbers, using constant expressions.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: