
Just this week I watched someone discover that computing summary statistics in 32-bit on a large dataset is a bad idea. Computer science curricula need to incorporate more computational science. It's a shame to charge someone tens of thousands of USD and not warn them that floating point has some obvious footguns.



> Just this week I watched someone discover that computing summary statistics in 32-bit on a large dataset is a bad idea. Computer science curricula need to incorporate more computational science.

Sadly, I suspect too many "computer science" courses have turned into "vocational coding" courses, and now those people are computing summary statistics on large datasets in JavaScript...


Could you shed some light on what they did wrong, and what would be a better way to do it?


Not OP, but the hint is in “computing summary statistics in 32-bit on a large dataset”.

A large dataset means lots of values; we can assume the number of values is way bigger than any individual value. Think of McDonald's purchases nationwide: billions of values, but each individual value is probably less than $10.

The simplest summary statistic would be a grand total (sum). If you have a good mental model of floats, you immediately see the problem!

The mental model of floats which I use is 1) floats are not numbers, they are buckets, and 2) as you get further away from zero, the buckets get bigger.

So let’s say you are calculating the sum, and it is already at 1 billion, and the next purchase is $3.57. You take 1 billion, you add 3.57 to it, and you get... 1 billion. And this happens for all of the rest of the purchases as well.

Remember: 1 billion is not a number, it is a bucket, and it turns out that in 32-bit floats, when you are that far from zero, the bucket is 64 wide. So 3.57 is simply not big enough to reach the next bucket.
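
If you want to see the buckets for yourself, here is a quick sketch in Python/NumPy (assuming 32-bit floats; in 64-bit the buckets near 1 billion are far smaller):

    import numpy as np

    total = np.float32(1_000_000_000.0)   # running sum already at 1 billion
    purchase = np.float32(3.57)           # the next purchase

    print(total + purchase == total)      # True: the 3.57 is swallowed entirely
    print(np.spacing(total))              # 64.0: the width of the "bucket" around 1 billion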


Well explained! All of the later contributions to the sum are effectively ignored, or at least severely damaged, in 32-bit because the "buckets" are big.

It was precisely this problem. The individual had done all data preparation/normalization in 32-bit because the model training used 32-bit on the GPU. It's a very reasonable mistake if one hasn't been exposed to floating point woes. I was pleased to see that the individual ultimately caught it when observing that 2 libraries disagreed about the mean.
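
For anyone curious what that kind of disagreement looks like, here is a rough sketch with synthetic data (not the actual dataset): a naive left-to-right 32-bit accumulation versus a 64-bit mean.

    import numpy as np

    rng = np.random.default_rng(0)
    # 50 million synthetic "purchases", each under $10
    prices = rng.uniform(0.0, 10.0, 50_000_000).astype(np.float32)

    # naive sequential accumulation in 32-bit, the way a plain loop would do it
    naive_mean = np.cumsum(prices, dtype=np.float32)[-1] / prices.size

    # same data, accumulated in 64-bit
    mean64 = prices.mean(dtype=np.float64)

    print(naive_mean)   # drifts well below 5.0 once the running sum outgrows the values
    print(mean64)       # ~5.0, the true mean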

Computing a 64-bit mean was enough. Compensated (i.e. Kahan) summation would have worked too.
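
For reference, a minimal sketch of compensated summation in Python, keeping the accumulator in 32-bit on purpose so you can see where the correction comes in:

    import numpy as np

    def kahan_sum(values):
        # Kahan/compensated summation: carry the rounding error of each addition forward.
        total = np.float32(0.0)
        comp = np.float32(0.0)        # running compensation for lost low-order bits
        for v in values:
            y = np.float32(v) - comp  # apply the correction left over from the last step
            t = total + y             # big + small: the low-order bits of y get dropped here...
            comp = (t - total) - y    # ...and this recovers (most of) what was dropped
            total = t
        return total

In practice, just accumulating in 64-bit (or reaching for Python's math.fsum) is the simpler fix, which is exactly what the 64-bit mean amounts to.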


Thanks for the explanation!



