Optimizing Ray Tracing in Haskell (medium.com/swlh)
127 points by jose_zap on July 18, 2020 | 37 comments



Optimizing Haskell for performance can be tricky. When I wrote a functional image library (updated at [0]) I tried to get the frame draw time down by using strictness, unpacked records, and so on, but one of the changes that made the biggest difference came from changing round to truncate[1]: I went from 120 ms/frame to 65 ms/frame! The reason is that round has quite a bit of logic associated with it[2] in Haskell's standard library, versus a simple one-liner for truncate.

It sort of shows that even in simple functions such as this, underlying implementation details can have an impact on performance.
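To make the difference concrete, here's a minimal sketch (the helper names are hypothetical, not from the library) of the kind of pixel conversion where swapping round for truncate pays off:

    -- Converting a Double colour channel in [0, 1] to an Int in [0, 255].
    toByteRound, toByteTruncate :: Double -> Int
    toByteRound    x = round    (x * 255) -- round-half-to-even via properFraction: extra comparisons
    toByteTruncate x = truncate (x * 255) -- lowers to a single double-to-int conversion

For non-negative channel values the two differ by at most one intensity level, which is invisible in practice, so the cheaper conversion is an easy win.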

Also, for some reason filling a C array from Haskell was faster than doing it in C, probably due to the overhead of switching between the RTS and C code.

[0] https://github.com/siraben/functional-images

[1] https://github.com/sbond75/OpenGLTesting/commit/116384edd6a3...

[2] https://hackage.haskell.org/package/base-4.14.0.0/docs/src/G...


I just watched a video on branchless programming, and it blew my mind that the guy got a 3.36 s function to return in 0.48 s by using more code. The idea of trying to outsmart a compiler or runtime is motivating me to pick coding back up after 8 years away.
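For anyone curious, here's a minimal sketch of the idea in Haskell (purely illustrative; whether GHC and the hardware actually emit branch-free code here isn't guaranteed):

    -- A branchy minimum.
    minBranchy :: Int -> Int -> Int
    minBranchy a b = if a < b then a else b

    -- The same function written as arithmetic on 0/1 flags, so there is no
    -- conditional jump for the branch predictor to mispredict.
    minBranchless :: Int -> Int -> Int
    minBranchless a b = a * fromEnum (a < b) + b * fromEnum (a >= b)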


I think it’s a sign our chipsets and tooling have become too complicated. We’re at this weird point where both sides are trying to guess what the other one is doing.


The Mill (which seems to be chugging along, if anyone is looking for an update) is an interesting take on the problem.

If you're not familiar, it's an (in-development) exposed-pipeline (statically scheduled) VLIW ISA - not unheard of in DSP land, but genuinely alien compared to a conventional superscalar processor a la x86 or ARM. I'm not particularly optimistic about their chances of making it into the shops, but they haven't failed yet.


> seems to be chugging along if any one is looking for an update

Cite? I've been looking for recent updates and haven't found any.


https://millcomputing.com/topic/news/#post-3534

Not exactly busy but there you are


There are some forum posts from a few months ago, and a patent filing from this year (IIRC).


Do you have a link to that video?


https://youtu.be/bVJ-mWWL7cE

Just googled branchless programming :)


I love this guy's channel; it's awesome.


I find this article a nice counterpoint to “it is difficult to reason about performance in Haskell”. A person who had never programmed in the language found all the tools and resources to implement an algorithm from a reference and make it fast.


Indeed, the main problem is that many of these beliefs are based on cargo-culting or urban myths. Most people who make the effort to actually dive into new areas with an open mind, instead of looking for reasons to dismiss them, come out with new perspectives on software development.


I certainly didn't expect it to be faster than the Rust implementation. Quite a surprise there.


To be fair, the Rust program isn't yet fully optimized. We tweaked a few things here and there, and the speed improved from 26s to 19s. I believe it can be improved further (especially if the issues in the Rayon crate are addressed). At the same time, I think the Haskell version can be improved further too. I'll update the blog post, or might create a new one, once I have some interesting data to share.


To be fair, the first sentence on the Rust implementation's GitHub repo is

> A *straightforward* Rust implementation

(emphasis mine) and the implementation in the article is heavily optimized.


I don't think this Haskell implementation is heavily optimised per se. The only places where readability is sacrificed for performance are the use of INLINE and the custom tail-recursive function, which is not where most of the performance improvement comes from. The other improvements (an efficient RNG, an efficient vector representation, and extra strictness) are something you get by default in Rust.

I would say that if an experienced Haskell programmer set out to write this program with the intent of making it fast, they'd never use lists in the first place, and not consider their absence an optimisation. Haskell lists are designed for lazily-generated arbitrary-length sequences, and are a bad fit for representing small finite-length sequences, such as vectors.
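For illustration, here's a minimal sketch (not the author's code) of the kind of strict, unpacked vector type an experienced Haskeller would reach for instead of a list:

    -- Three unboxed Doubles stored flat in a single heap object (or in
    -- registers), instead of a lazy cons list of boxed Doubles.
    data Vec3 = Vec3 {-# UNPACK #-} !Double
                     {-# UNPACK #-} !Double
                     {-# UNPACK #-} !Double

    dot :: Vec3 -> Vec3 -> Double
    dot (Vec3 ax ay az) (Vec3 bx by bz) = ax * bx + ay * by + az * bz
    {-# INLINE dot #-}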


The rust implementation has a lot that is very naive from a performance point of view. Every sphere is allocated separately on the heap, so every sphere hit would have to hop through that sphere's pointer.

There is actually a fairly simple C++ version that was made as part of the original book. It also does a lot that is naive from a performance standpoint, but that's what I would think people would want to use as a baseline.


GHC has to walk a tight line between A) applying enough optimization passes to reliably remove abstraction overhead and B) compiling fast enough that people won't complain about it.

It's pretty awesome what GHC can pull off in terms of reducing complex programs to a minimal normal form (you can see this with e.g. https://hackage.haskell.org/package/ghc-proofs-0.1.1/docs/GH... ), but oftentimes you do have to convince the optimizer to be more aggressive than it wants to be.


B is "compile fast enough that people will tolerate it".

"Fast enough to not be worth complaining about" is nowhere near anyone's TODO list.


I'm not sure it was his first Haskell program; if it was, it's pretty impressive that he used INLINE, understood minute differences between random-number libraries, profiling, ThreadScope, etc.


It was indeed my first Haskell program, so I'd take your comment as a compliment. I have been a long-time C++ programmer (https://stackoverflow.com/users/415784/nawaz), so I know what kind of tools one should look for in another language when it comes to making code performant. So that helped me.


Pretty amazing how you were able to use the appropriate tooling to great effect right off the bat!


I love how the author provides each optimising commit diff – that really helped me understand the improvements.


The hardest thing about optimizing is knowing whether something is fast. It helps if you have another implementation of the same thing that is already fast, or seems so. When you match it, are you now fast?

You are, until somebody makes one faster. Then you're slow again. That is our deal with the Devil: our machines are filled with dodgy gimcracks that often make our program faster, but we can hardly ever know whether it is as fast as it could be, or whether some gimcrack that would make it faster hasn't engaged yet; or, often, even how to go about engaging it.

Usually we stop optimizing when it seems fast enough, usually because something else is (still) slower.

It turns out to be pretty simple to make quicksort go twice as fast on current hardware. Standard libraries and compilers generally don't do those things, often out of fear of making a few programs slower than before. People complain a lot more about a slower program than they thank you for making a hundred others faster--if they even notice the latter.


It's interesting that halfway through the optimizations Haskell is doing 3.3 terabytes of total heap allocation.

When it gets to 93 seconds it is doing 208 GB of heap allocation and in the final version it is still doing over 20 GB of heap allocations. This seems like quite a bit for a 384x216 image.


That is puzzling to me as well, particularly because `top` and `htop` on my machine don't even show it touching `50 MB`. That means the first figure must be something else, and I can't find a good explanation for it. I came across people who suggested ignoring it, e.g. https://stackoverflow.com/questions/61666819/haskell-how-to-...

If anyone knows the exact reason for that, then please enlighten me as well. Even this otherwise awesome tutorial ignores that value: https://youtu.be/R47959rD2yw?t=1306


The runtime allocates and frees heap memory while performing local calculations.


> The runtime allocates and frees heap memory while performing local calculations.

1) If it allocates, why does `top` not show it?

2) Why does it allocate SO MUCH when it uses only a fraction? Allocating 2800 times more memory than is actually used looks like a very bad strategy.


You are looking at the current amount of memory that the process is using in top, but your profiling is probably showing the total allocated while the process ran. This means there are probably LOTS of memory allocations that get freed immediately instead of the program allocating memory once and reusing that heap memory. In the case of an image of this size you could probably just put everything on the stack.

When optimizing, excess heap allocations are the first thing to look at, since they are usually inside inner loops, which almost always means they are not necessary. They definitely should not be ignored by someone looking for speed.
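To see the two numbers side by side, here's a hedged sketch (not from the article) using GHC's own statistics; it has to be run with +RTS -T so the RTS collects them:

    import GHC.Stats (RTSStats (..), getRTSStats)

    main :: IO ()
    main = do
      -- Throwaway work that allocates transiently but retains almost nothing.
      print (sum [x * y | x <- [1 .. 384], y <- [1 .. 216 :: Int]])
      s <- getRTSStats
      putStrLn ("total allocated (cumulative): " ++ show (allocated_bytes s))
      putStrLn ("peak live data:               " ++ show (max_live_bytes s))

The first number grows with every temporary allocation the program ever makes; the second is much closer to what top reports.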


> In the case of an image of this size you could probably just put everything on the stack.

That sounds interesting and would probably make it more efficient. But how do I put things on the stack? As per this answer on Stack Overflow, the runtime allocates memory on the heap to call functions (and programmers probably don't have any control over that?):

https://stackoverflow.com/a/5132862/415784


I don't know how to avoid heap memory allocations in haskell or make sure memory is on the stack, but both are very easy to do in C++.


>I don't know how to avoid heap memory allocations in haskell or make sure memory is on the stack, but both are very easy to do in C++.

Yup. I know it's easy and the default behavior in C++.

This Haskell page says it is pretty common to allocate and deallocate memory immediately, and that GHC handles this pretty efficiently, which probably explains why Haskellers don't seem to care about this (though it seems weird to me):

https://wiki.haskell.org/GHC/Memory_Management


I wonder if the author would have achieved similar results just by enabling the -XStrict language extension from the start.


I didn't know that. I'll try that. Thanks for mentioning it.


-XStrict is almost never optimal in Haskell, as a lot of the optimizations ultimately rely on lazy evaluation. I think it would be more appropriate to try out -XStrictData instead.
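A minimal sketch of what -XStrictData buys you (the type is made up for illustration):

    {-# LANGUAGE StrictData #-}

    -- With StrictData, every constructor field declared in this module is
    -- strict by default, as if each were written with a leading bang
    -- (!Double), while ordinary bindings and arguments stay lazy.
    data Sphere = Sphere
      { centreX :: Double
      , centreY :: Double
      , centreZ :: Double
      , radius  :: Double
      }

    main :: IO ()
    main = print (radius (Sphere 0 0 0 1))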


I would be interested in a follow-up regarding whatever you find.


That's really cool. The next step would be using explicit SIMD instructions. Is there a SIMD library for Haskell? Or a library for automatically doing AoS->SoA transformation on some compute code?



