Optimizing Ray Tracing in Haskell (medium.com/swlh)
127 points by jose_zap on July 18, 2020 | 37 comments



Optimizing Haskell for performance can be tricky. When I wrote a functional image library (updated at [0]) I tried to get the frame draw time down by using strictness, unpacked records, and so on, but one of the changes that made the biggest difference came from changing round to truncate[1]: I went from 120 ms/frame to 65 ms/frame! The reason is that round has quite a bit of logic associated with it[2] in Haskell's standard library, versus a simple one-liner for truncate.

It sort of shows that even in simple functions such as this, underlying implementation details can have an impact on performance.
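To make the difference concrete, here's a minimal sketch (the helper names are hypothetical, not from the library) of the kind of pixel conversion where swapping round for truncate pays off:

    -- Converting a Double colour channel in [0, 1] to an Int in [0, 255].
    toByteRound, toByteTruncate :: Double -> Int
    toByteRound    x = round    (x * 255) -- round-half-to-even via properFraction: extra comparisons
    toByteTruncate x = truncate (x * 255) -- lowers to a single double-to-int conversion

For non-negative channel values the two differ by at most one intensity level, which is invisible in practice, so the cheaper conversion is an easy win.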

Also, for some reason filling a C array from Haskell was faster than doing it in C, probably due to the overhead of switching between the RTS and C code.

[0] https://github.com/siraben/functional-images

[1] https://github.com/sbond75/OpenGLTesting/commit/116384edd6a3...

[2] https://hackage.haskell.org/package/base-4.14.0.0/docs/src/G...


I just watched a video on branchless programming, and it blew my mind that the guy got a 3.36 s function to return in 0.48 s by using more code. The idea of trying to outsmart a compiler or runtime is motivating me to pick coding back up after 8 years away.
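For anyone curious, here's a minimal sketch of the idea in Haskell (purely illustrative; whether GHC and the hardware actually emit branch-free code here isn't guaranteed):

    -- A branchy minimum.
    minBranchy :: Int -> Int -> Int
    minBranchy a b = if a < b then a else b

    -- The same function written as arithmetic on 0/1 flags, so there is no
    -- conditional jump for the branch predictor to mispredict.
    minBranchless :: Int -> Int -> Int
    minBranchless a b = a * fromEnum (a < b) + b * fromEnum (a >= b)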


I think it’s a sign our chipsets and tooling have become too complicated. We’re at this weird point where both sides are trying to guess what the other one is doing.


The Mill (which seems to be chugging along, if anyone is looking for an update) is an interesting take on the problem.

If you're not familiar, it's an (in-development) exposed-pipeline (statically scheduled) VLIW ISA - not unheard of in DSP land, but genuinely alien compared to a conventional superscalar processor a la x86 or ARM. I'm not particularly optimistic about their chances of making it into the shops, but they haven't failed yet.


> seems to be chugging along if any one is looking for an update

Cite? I've been looking for recent updates and haven't found any.


https://millcomputing.com/topic/news/#post-3534

Not exactly busy but there you are


There are some forum posts from a few months ago, and a patent filing from this year (IIRC).


Do you have a link to that video?


https://youtu.be/bVJ-mWWL7cE

Just googled branchless programming :)


I love this guy's channel; it's awesome.


I find this article a nice counterpoint to “it is difficult to reason about performance in Haskell”. A person who had never programmed in the language found all the tools and resources to implement an algorithm from a reference and make it fast.


Indeed, the main problem is that many of these beliefs are based on cargo-culting or urban myths. Most people who make the effort to actually dive into new areas with an open mind, instead of looking for reasons to dismiss them, come out with new perspectives on software development.


I certainly didn't expect it to be faster than the Rust implementation. Quite a surprise there.


To be fair, the Rust program isn't yet fully optimized. We tweaked a few things here and there, and the speed improved from 26s to 19s. I believe it can be improved further (especially if the issues in the Rayon crate are addressed). At the same time, I think the Haskell version can be improved further too. I'll update the blog post, or might create a new one, once I have some interesting data to share.


To be fair, the first sentence on the Rust implementation's GitHub repo is

> A *straightforward* Rust implementation

(emphasis mine) and the implementation in the article is heavily optimized.


I don't think this Haskell implementation is heavily optimised per se. The only places where readability is sacrificed for performance are the use of INLINE and the custom tail-recursive function, which is not where most of the performance improvement comes from. The other improvements (an efficient RNG, an efficient vector representation, and extra strictness) are something you get by default in Rust.

I would say that if an experienced Haskell programmer set out to write this program with the intent of making it fast, they'd never use lists in the first place, and not consider their absence an optimisation. Haskell lists are designed for lazily-generated arbitrary-length sequences, and are a bad fit for representing small finite-length sequences, such as vectors.
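For illustration, here's a minimal sketch (not the author's code) of the kind of strict, unpacked vector type an experienced Haskeller would reach for instead of a list:

    -- Three unboxed Doubles stored flat in a single heap object (or in
    -- registers), instead of a lazy cons list of boxed Doubles.
    data Vec3 = Vec3 {-# UNPACK #-} !Double
                     {-# UNPACK #-} !Double
                     {-# UNPACK #-} !Double

    dot :: Vec3 -> Vec3 -> Double
    dot (Vec3 ax ay az) (Vec3 bx by bz) = ax * bx + ay * by + az * bz
    {-# INLINE dot #-}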


The rust implementation has a lot that is very naive from a performance point of view. Every sphere is allocated separately on the heap, so every sphere hit would have to hop through that sphere's pointer.

There is actually a fairly simple C++ version that was made as part of the original book. It also does a lot that is naive from a performance standpoint, but that's what I would think people would want to use as a baseline.


GHC has to walk a tight line between A) applying enough optimization passes to reliably remove abstraction overhead and B) compiling fast enough that people won't complain about it.

It's pretty awesome what GHC can pull off in terms of reducing complex programs to a minimal normal form (you can see this with e.g. https://hackage.haskell.org/package/ghc-proofs-0.1.1/docs/GH... ), but oftentimes you do have to convince the optimizer to be more aggressive than it wants to be.


B is "compile fast enough that people will tolerate it".

"Fast enough to not be worth complaining about" is nowhere near anyone's TODO list.


I'm not sure it was his first Haskell program; if it was, it's pretty impressive that he used INLINE, understood minute differences between random-number libraries, profiling, ThreadScope, etc.


It was indeed my first Haskell program, so I'd take your comment as a compliment. I have been a long-time C++ programmer (https://stackoverflow.com/users/415784/nawaz), so I know what kind of tools one should look for in another language when it comes to making code performant. So that helped me.


Pretty amazing how you were able to use the appropriate tooling to great effect right off the bat!


I love how the author provides each optimising commit diff – that really helped me understand the improvements.


The hardest thing about optimizing is knowing whether something is fast. It helps if you have another implementation of the same thing that is already fast, or seems so. When you match it, are you now fast?

You are, until somebody makes one faster. Then you're slow again. That is our deal with the Devil: our machines are filled with dodgy gimcracks that often make our program faster, but we can hardly ever know whether it is as fast as it could be, or whether some gimcrack that would make it faster hasn't engaged yet; or, often, even how to go about engaging it.

Usually we stop optimizing when it seems fast enough, usually because something else is (still) slower.

It turns out to be pretty simple to make quicksort go twice as fast on current hardware. Standard libraries and compilers generally don't do those things, often out of fear of making a few programs slower than before. People complain a lot more about a slower program than they thank you for making a hundred others faster--if they even notice the latter.


It's interesting that halfway through the optimizations Haskell is doing 3.3 terabytes of total heap allocation.

When it gets to 93 seconds it is doing 208 GB of heap allocation and in the final version it is still doing over 20 GB of heap allocations. This seems like quite a bit for a 384x216 image.


That is puzzling to me as well, particularly because `top` and `htop` on my machine don't even show it touching `50 MB`. That means the first figure must be something else, and I can't find a good explanation for it. I came across people who suggested ignoring it, e.g. https://stackoverflow.com/questions/61666819/haskell-how-to-...

If anyone knows the exact reason for that, then please enlighten me as well. Even this otherwise awesome tutorial ignores that value: https://youtu.be/R47959rD2yw?t=1306


The runtime allocates and frees heap memory while performing local calculations.


> The runtime allocates and frees heap memory while performing local calculations.

1) If it allocates, why does `top` not show it?

2) Why does it allocate SO MUCH when it uses only a fraction? Allocating 2800 times more memory than is actually used looks like a very bad strategy.


You are looking at the current amount of memory that the process is using in top, but your profiling is probably showing the total allocated while the process ran. This means there are probably LOTS of memory allocations that get freed immediately instead of the program allocating memory once and reusing that heap memory. In the case of an image of this size you could probably just put everything on the stack.

When optimizing, excess heap allocations are the first thing to look at, since they are usually inside inner loops, which almost always means they are not necessary. They definitely should not be ignored by someone looking for speed.
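To see the two numbers side by side, here's a hedged sketch (not from the article) using GHC's own statistics; it has to be run with +RTS -T so the RTS collects them:

    import GHC.Stats (RTSStats (..), getRTSStats)

    main :: IO ()
    main = do
      -- Throwaway work that allocates transiently but retains almost nothing.
      print (sum [x * y | x <- [1 .. 384], y <- [1 .. 216 :: Int]])
      s <- getRTSStats
      putStrLn ("total allocated (cumulative): " ++ show (allocated_bytes s))
      putStrLn ("peak live data:               " ++ show (max_live_bytes s))

The first number grows with every temporary allocation the program ever makes; the second is much closer to what top reports.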


> In the case of an image of this size you could probably just put everything on the stack.

That sounds interesting and would probably make it more efficient. But how do I put things on the stack? As per this answer on Stack Overflow, the runtime allocates memory on the heap to call functions (and programmers probably don't have any control over that?):

https://stackoverflow.com/a/5132862/415784


I don't know how to avoid heap memory allocations in haskell or make sure memory is on the stack, but both are very easy to do in C++.


>I don't know how to avoid heap memory allocations in haskell or make sure memory is on the stack, but both are very easy to do in C++.

Yup. I know it's easy and the default behavior in C++.

This Haskell page says it is pretty common to allocate and deallocate memory immediately, and that GHC handles this pretty efficiently, which probably explains why Haskellers don't seem to care about this (though it seems weird to me):

https://wiki.haskell.org/GHC/Memory_Management


I wonder if the author would have achieved similar results just by enabling the -XStrict language extension from the start.


I didn't know that. I'll try that. Thanks for mentioning it.


-XStrict is almost never optimal in Haskell, as a lot of the optimizations ultimately rely on lazy evaluation. I think it would be more appropriate to try out -XStrictData instead.
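A minimal sketch of what -XStrictData buys you (the type is made up for illustration):

    {-# LANGUAGE StrictData #-}

    -- With StrictData, every constructor field declared in this module is
    -- strict by default, as if each were written with a leading bang
    -- (!Double), while ordinary bindings and arguments stay lazy.
    data Sphere = Sphere
      { centreX :: Double
      , centreY :: Double
      , centreZ :: Double
      , radius  :: Double
      }

    main :: IO ()
    main = print (radius (Sphere 0 0 0 1))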


I would be interested in a follow-up regarding whatever you find.


That's really cool. The next step would be using explicit SIMD instructions. Is there a SIMD library for Haskell? Or a library for automatically doing AoS->SoA transformation on some compute code?



