From 60 Frames per Second to 500 in Haskell (keera.co.uk)
103 points by coolsunglasses on Oct 15, 2014 | 68 comments



Oh dear. I love F# and would really like to motivate myself to Learn me some Haskell, but this sort of article discourages me enormously.

Please, if there are any people here who have successfully written production code in Haskell that needs as low a memory footprint and as high performance as possible, and have reached something like 1.5x C-level memory consumption and 0.75x C-level performance, I would really be grateful for any references. I cannot wait to shed the chains of C++, but stories like these, which follow the pattern below, give me the impression we're not there yet:

1. Awesome language <3 <3 <3

2. Write production code

3. Hit a mysterious performance problem (too much memory, too slow, or both)

Note: My interest is in dense computational kernels that are not I/O bound.

Edit: The language is obviously great for problems akin to syntax translation; I'm just constantly hoping someone will do something industrial-strength with it and tell us about it. Jane Street is extremely bullish about OCaml (https://blogs.janestreet.com/). I've yet to see a similar perpetual love letter to Haskell as a language from industrial users.


There is a simple way to implement a dense computational kernel in Haskell: write it in C++ and call it using the foreign function interface.

Seriously, this is not (as I understand it) a problem that Haskell is designed to solve. Haskell is expressive, functional, lazy, etc., and that is all great, but if you care about the actual instructions that get executed on the CPU, Haskell doesn't care, and that's by design. For me, writing performant Haskell always descends into a game of figuring out why the compiler didn't unbox something or other, or why it didn't understand that a certain variable can be made strict.
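
For a taste of that game, here is a minimal sketch (the type and field names are invented for illustration): the bangs make the fields strict, and the UNPACK pragmas ask GHC to store the raw Doubles inline in the constructor rather than behind pointers. Whether the unboxing actually fires is exactly the kind of thing one ends up checking in the generated Core.

    -- Hypothetical particle type; names are illustrative only.
    data Particle = Particle
      { mass     :: {-# UNPACK #-} !Double
      , velocity :: {-# UNPACK #-} !Double
      }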

My advice would be: if performance is really the main thing you need, study C++ and learn how to write clean code in C++. [0] And in the end, you can always call C++ functions from Haskell.

If you really need to use Haskell, then just learn in detail how the runtime and the compiler work, so that their output is predictable.

The chains of C++ are what ensure that C++ code can be mapped almost directly to machine-level instructions.

[0] Maybe you already know C++ better than I do, in which case ignore me.


Thanks. This was what I was expecting.

Although I was hoping FFI was not the answer: http://rwmj.wordpress.com/2011/09/21/which-foreign-function-...

Richard Jones is fairly erudite in the art and industry (one of the authors of http://gchandbook.org/ and quite proficient in various programming languages in general), and if he says the Haskell FFI is painful to work with, I get a really bad feeling.

I was hoping the answer would have been "There is this stream computation library that is really neat and just works...".


That's strange to read. The Haskell community believes GHC's FFI is very good. Perhaps things have changed in the last three years. (I've never used the FFI myself, so I cannot respond directly.)


> There is this stream computation library

Sometimes Haskell allows you to easily implement algorithms that would be, comparatively, much more painful to write in C++. There are stream fusion libraries, and it's important to know them. You can do a lot with plain Haskell before you need to start worrying about CPU instructions; things like linear algebra and matrix multiplication kernels are the exceptions.
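
As a minimal sketch of what fusion buys you (using the vector package; the function itself is invented for illustration), the pipeline below compiles to a single allocation-free loop at -O2, with no intermediate vectors ever materialized:

    import qualified Data.Vector.Unboxed as U

    -- enumFromTo, map, and sum fuse into one loop at -O2;
    -- no intermediate vector is ever built.
    sumOfSquares :: Int -> Int
    sumOfSquares n = U.sum (U.map (\x -> x * x) (U.enumFromTo 1 n))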

> Haskell FFI is painful to work

That link is three years old, and I think I disagree with some of the things he says. I agree that the Haskell FFI is poorly documented. But he also says it is "deeply strange". Without using any extra tools, I would say it is a little laborious, but conceptually very simple. He also says he couldn't figure out how to return anything other than integers.

Think about it this way: Haskell makes no guarantees about how its objects are laid out in memory. If you pass a native Haskell object (a boxed int, or a struct) to C, C code will never be able to unwrap it without making guesses about memory layout. Haskell expects you to do this yourself: if a structure consists of an integer and a double, you define (yourself) a function that takes a pointer and lays out the integer and the double in memory as your C code would expect to find them.

Then every FFI call takes the form

1. Allocate a chunk of memory.

2. Store your Haskell object into that memory, using your layout. This is what the `poke` function does.

3. Pass the pointer to a C function.

4. C function reads and writes to that chunk of memory.

5. When C function returns, read from the memory and pack the results into a Haskell object. This is the `peek` function.

5a. Quite often a C function would return a success code anyway, and the actual return value will be written to a pointer passed to the C function. This is really common.

6. Deallocate the chunk of memory.

6a. Notice that no assumptions were made about internal object layout, except that you have to know what layout the corresponding C structure is expected to have.

This might also be why it's unintuitive how to return complicated objects from C: they need to be explicitly unpacked and stored into a Haskell object. There are tools that automate this process, but the tool-less FFI is straightforward once you figure it out.
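
A minimal sketch of those steps, assuming a hypothetical C function `process_pair` that fills in results and returns a status code (the byte offsets assume a typical x86-64 layout: a 4-byte CInt padded to 8, then an 8-byte CDouble):

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Foreign
    import Foreign.C.Types

    -- The Haskell-side object; names are illustrative.
    data Pair = Pair Int Double

    instance Storable Pair where
      sizeOf    _ = 16
      alignment _ = 8
      poke ptr (Pair n x) = do                       -- step 2: lay out for C
        pokeByteOff ptr 0 (fromIntegral n :: CInt)
        pokeByteOff ptr 8 (realToFrac x :: CDouble)
      peek ptr = do                                  -- step 5: read back
        n <- peekByteOff ptr 0 :: IO CInt
        x <- peekByteOff ptr 8 :: IO CDouble
        return (Pair (fromIntegral n) (realToFrac x))

    foreign import ccall "process_pair"              -- hypothetical C function
      c_processPair :: Ptr Pair -> IO CInt           -- step 5a: status code

    callProcessPair :: Pair -> IO (Either CInt Pair)
    callProcessPair p =
      alloca $ \ptr -> do           -- steps 1 and 6: alloca frees on return
        poke ptr p                  -- step 2
        rc <- c_processPair ptr     -- steps 3 and 4
        if rc == 0
          then fmap Right (peek ptr)  -- step 5
          else return (Left rc)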

Here is the layout GHC actually uses: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage...

This is maybe a bit unintuitive, but it makes sense to me.


Thanks for your overview. I had not yet done a deep analysis of the FFI myself; your summary clarified things.


http://benchmarksgame.alioth.debian.org/ has some pretty performant Haskell, though it uses some real dirty tricks.

It's generally more straightforward to optimize CPU-bound Haskell because there are fewer external moving parts to wrestle with (i.e. the OS's graphics / networking stack).

Any code that doesn't allocate ought to be about as fast as C compiled with clang, since with GHC's LLVM backend it's pretty much up to LLVM to optimize those inner loops. GHC's deforestation is the big killer feature for making number-crunching code more elegant.

Once one goes the way of fancy tree structures and all that, one is writing very different code than one would in C, so expect different performance tradeoffs. GHC should still do a very good job, however, and Haskell's various synchronization primitives are quite good too, if your domain requires them.


Thanks... but the alioth benchmarks do not really measure production readiness, although they are very educational.

To me it seems that for writing high-performance dense numerical code the tradeoff is still: either write idiomatic code in a horrible language, or write horrible kludges in a beautiful language. Sigh.

Are there any "Scientific Haskell" resources available that straightforwardly explain how to write performant code that wrangles data from one data sequence to another?


Have you checked out ATS[0] and the ATS book[1]?

0: http://www.ats-lang.org/

1: Sorry, I can't find the link at the moment. I'll try to update this later.


Is this the book you are referring to? http://www.ats-lang.org/DOCUMENT/INT2PROGINATS/HTML/book1.ht...

Thanks, interesting. Hopefully it will not become yet one more dead and forgotten ML dialect.


Perhaps this will interest you: https://www.youtube.com/watch?v=McFNkLPTOSY


Thanks for the reference! Seems to be precisely the sort of material I was looking for.


Isn't Rust pretty much designed exactly to fulfill what you're hoping to achieve?

Functional programming language features, with predictable performance.


That is the design intent, but the design effort itself seems to be ongoing along with the implementation. So I would say its design, once complete, will probably be aligned with the constraints I implied. But it's not ready yet...


I write performance-sensitive Haskell all the time.

In fact, my main focus for the past 2 years has been designing better array computation tools (for dense and sparse matrix computation), some of which I'm in the process of finally open sourcing.

I've built tools and tricks that would drive humans mad before they could figure out how to port the full awesomeness of what I've done to C++. The main place where I'll use C in my Haskell is for writing SIMD kernels or wrapping up some unrolled ASM from OpenBLAS/BLIS, like this monstrosity: https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_6...

For CPU/RAM-throughput-bound workloads that aren't allocation heavy, that's pretty much the only time I'll break out C: for wrapping up those crazy kernels that are tuned to a specific CPU microarchitecture.

I also know quite a few people who use Haskell as a meta-language for some domain-specific EDSL they compile using LLVM as a library. I know of people who've built proprietary tool chains using modern GHC Haskell for really, really performance-sensitive workloads in this fashion in high-frequency trading, computer vision, and scientific computing (and I've had the fortune to consider them my friends).

BTW, over on reddit, the amazing Austin Seipp (who does a LOT of work on GHC) has a super articulate post about performance engineering in Haskell: http://www.reddit.com/r/haskell/comments/2jbl78/from_60_fram...

Anyways, at the end of the day, engineering is about building software that works in finite time. The performance bits will only bite in your innermost loop, and the FFI overhead in Haskell for calling into C should be about 2-5ns. When an FFI call takes less than ~10µs, it's safe to use the "unsafe" FFI; for operations that may take > 10µs, you should always use the default/"safe" FFI convention, or someone will come knocking at your door late at night, quite angry.
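
To make that concrete, a minimal sketch of the two conventions (the C function names are invented); the only difference is the unsafe/safe keyword in the import:

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Foreign
    import Foreign.C.Types

    -- "unsafe": minimal call overhead (a few ns), but the call blocks the
    -- GHC runtime, so only use it for C functions that return quickly.
    foreign import ccall unsafe "dot_kernel"
      c_dot :: Ptr CDouble -> Ptr CDouble -> CInt -> IO CDouble

    -- "safe" (the default): more overhead, but the runtime can keep
    -- scheduling other Haskell threads while the C call runs.
    foreign import ccall safe "solve_system"
      c_solve :: Ptr CDouble -> CInt -> IO CInt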

I should also disclose that I'm slowly working out a few crazy extensions to GHC for low-level performance engineering, though those might not make it into GHC till 7.12 at the current rate things are going in my life :)

The one area where GHC doesn't shine is mega-core, allocation-heavy workloads, but I've yet to see anyone do well by default in that regime in any language! :)

EDIT: also, profile before doing performance engineering. And before that choose the right algorithms! :)

EDIT: to further elaborate on some of the tech I've got: I have a way of doing array computation that gives me a very nice blend of guaranteed good memory locality + extensibility that I've not seen in any other array computation tooling I've been able to lay my hands on, at least on this planet :)


Thank you, your post was most encouraging! I think I'll have to presume that the adept Haskell devs are mostly busy developing and not doing much to popularize their advancements.


Well, they do whatever they have to do to pay the bills! I think a lot of these tool chains, or fragments thereof, are going to be open sourced eventually, but all engineering (open or not) is being paid for by someone (at least indirectly). So a lot of these tools are only going to make that move once the authors can "pay" for the time to do so. ('Cause life is more than just software, I'm told!)

You can talk with a lot of people doing numerical-computing-y things in Haskell on the #numerical-haskell channel on freenode.


Is it concerning that most Haskell optimization guides talk about adding ! to force evaluation? It may be a sign that pure functional programming is not quite there yet in the real-time world (e.g. real-time rendering) and that some abstractions (declarative instead of imperative) break down for these use cases.

What I want to say is that you really have to know a lot about Haskell to predict a program's runtime performance, which is essential in a real-time context. In my opinion these optimizations seem a little bit hacky: unsafePerformIO is by definition hacky, bang patterns sometimes seem like guesswork, and concurrency is always hard to get right (even with MVars).

I really like Haskell but do not yet consider it for latency-sensitive tasks, as it feels like writing unidiomatic Haskell. In particular, I find it concerning that one has to exploit concurrency and parallelism for such a simple game.

Are there ambitions to make GHC (hard) real-time friendly? This would probably require a real-time GC optimized for latency. I would like to read more about this, in particular about using pure (strongly typed) functional languages in a low-level/real-time context. Would it be possible to exploit the type system there?


In a conventional language you have to decide when something gets evaluated, and whether to cache the value or recompute it every time. These decisions get baked into your code early on because the control and data flow are intertwingled.

In Haskell you just specify the data flow and the compiler looks after the control flow. Sometimes the compiler's default idea is less than optimal, so you can tweak it by adding bangs and other strictness annotations to the code. The semantics change slightly because maybe now you evaluate an expression that throws an exception or goes into an endless loop, whereas before it wasn't evaluated, but apart from that the changes are guaranteed not to affect the meaning of your code.
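
A minimal sketch of the usual tweak (the function is invented for illustration): forcing a lazy accumulator with a bang pattern, so the loop evaluates it each iteration instead of building a chain of thunks.

    {-# LANGUAGE BangPatterns #-}

    -- Without the bang, acc accumulates a chain of (+) thunks that is
    -- only forced at the end; with it, each step evaluates immediately.
    sumTo :: Int -> Double
    sumTo n = go 0 0
      where
        go !acc i
          | i > n     = acc
          | otherwise = go (acc + fromIntegral i) (i + 1)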

Sure, garbage collected languages are not suitable for hard real time, but interactive apps and games are soft real time (a 100 ms delay once per hour isn't going to break the game). GHC is already there.


Your first paragraph is about laziness-vs-strictness, not functional vs imperative. Most Haskellers would agree that laziness is probably the least important benefit of the language.

It is true that the flip side of GHC's impressive optimizations is that the performance of code is harder to reason about just by looking at it. One could argue that this is simply an unavoidable downside of having powerful optimizations -- the simpler the compilation process, the easier it is to reason about performance.

Last I checked, one can simply import a foreign function as pure, so you wouldn't need unsafePerformIO. Still, I wouldn't consider applying unsafePerformIO to a function that actually is pure a hack.
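
For example, using sin from math.h (a real, referentially transparent C function): giving the import a non-IO result type declares it pure to the compiler, with no unsafePerformIO involved.

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Foreign.C.Types (CDouble)

    -- Pure foreign import: the result is CDouble, not IO CDouble,
    -- which is fine because sin really is a pure function.
    foreign import ccall unsafe "math.h sin"
      c_sin :: CDouble -> CDouble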

Regarding (hard) real time, check out http://stackoverflow.com/a/1267814 . Keep in mind that that answer is from 2009, so the state of things today could well be a lot better.


As it turns out, in practice, you usually want your data structures to be as strict as possible, but your control flow to be as lazy as possible. Further, some code just isn't well-suited to laziness -- so much so that there's work afoot to have a "strict-by-default" LANGUAGE pragma added in GHC 7.10.x[1].

[1] https://ghc.haskell.org/trac/ghc/wiki/StrictPragma


> you usually want your data structures to be as strict as possible, but your control flow to be as lazy as possible

That's a very nice slogan. I'll have to remember that. It certainly covers the case of generally wanting your record fields to be strict and lists to be lazy when used for control flow.
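
A minimal sketch of the slogan in code (names are illustrative): strict fields in the record, with a lazy (possibly infinite) list driving the control flow, so only as much of it is built as is actually consumed.

    import Data.Maybe (listToMaybe)

    data Point = Point { px :: !Double, py :: !Double }  -- strict data

    -- The list is control flow: laziness means we stop generating
    -- points as soon as the first match is found.
    firstInside :: Double -> [Point] -> Maybe Point
    firstInside r = listToMaybe . filter inside
      where inside p = px p * px p + py p * py p < r * r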


> That's a very nice slogan.

Yeah, I borrowed it from some Haskell luminary. My search-fu failed me so I didn't source it as I should have. :)


It's not a sign about pure versus impure functional programming. It may be a sign about strict versus lazy evaluation.


Nevertheless, I heard (I should investigate this rumour) that being lazy enabled an easier integration of the IO system, so there seems to be a connection between purity (i.e. freedom from side effects) and laziness.


Perhaps you're thinking of the observation that laziness essentially forced Haskell to be pure (otherwise the order of side effects is very difficult to control) and thus led to the invention of monadic IO.

If you're thinking of something else please do report back if you investigate as I would be interested to know.


They chose to make Haskell a lazy language and that forced them to make it pure. You cannot have laziness without purity.


You can, but you'll really hate it :)


> Is it concerning that most Haskell optimization guides talk about adding ! to enforce evaluation?

No, it's not. It's a perfectly reasonable consequence of Haskell's design choices.

> What I want to say is that you really have to know a lot about Haskell to predict a program's runtime performance

And you really have to know a lot about CPUs to predict cache locality issues in C/C++ programs. This is to be expected.

Austin Seipp, one of the core contributors to GHC, has a really good writeup about this in the comments over at the haskell subreddit for this same article.

http://www.reddit.com/r/haskell/comments/2jbl78/from_60_fram...


I'm also pursuing mobile games in Haskell, so I'm very happy to see them working on this. That said, less than 30 fps for a simple game seems very poor to me. For comparison, my previous game ran easily at 60fps on an iPhone 5 w/Gambit Scheme.

So, I'm wondering where the culprit is. Is the Android phone fairly old? What does a basic rendering test run at, i.e., without Yampa or SDL or other game logic? I hope to have my own answers to these questions soon, but bravo to Keera for blazing the trail!


It is somewhat mind-blowing that the original version of the demo game was done with ridiculously rudimentary logic gates. If you haven't read the history of how Wozniak did this with a little over 40 chips, you should look it up.

And yes, I realize the graphics are different. Probably even the game play.


At UC Berkeley one of the classes (EECS150) focused on creating an old-school arcade game in an FPGA using (virtual) logic gates. It is one of the best classes I ever took and really gives you an understanding of how these things were possible. It seems impossible when you first see it, but by the end it all clicks and you realize how much is possible in pure silicon.


Sounds incredibly fun. Know if any of the resources you used are available in the wild?


Not who you're replying to, but there's a similar class at Cambridge where you implement pong/game of life on an FPGA board. Some of the practical course notes are available publicly, though the full computer design notes referenced are not. There may be some interest in the basic sources and approach however: http://www.cl.cam.ac.uk/teaching/0910/ECAD+Arch/

(Linked to the course I know from a few years ago, the course has since changed)


As a long-time Haskeller and lover of the language, I find this article really brings out the warts in the language.

I'm super excited about Rust (http://www.rust-lang.org) because it retains much of the safety of Haskell, excluding effect tracking, while potentially offering C/C++-like, bare-metal speed.


Which warts, exactly? It’s a happy account of simple code changes that led to dramatic speed improvements, found using GHC’s great profiler. I’m excited about Rust as well (and working on a similar language) but GHC’s performance is quite competitive. And I don’t find a few strictness annotations and MVars so onerous; in fact, I usually make my data structures strict by default, making fields lazy when I actually need knot-tying or streaming or what have you.

This is also a good example of how easy concurrency and parallelism are in Haskell—I didn’t really get how to effectively program with concurrency until I learned Haskell, then ported the knowledge to more challenging environments in imperative-land.


If you're interested in game programming with Haskell I invite you to the #haskell-game IRC channel (irc.freenode.net) and our sub-reddit http://reddit.com/r/haskellgamedev :)


I appreciate this article for demonstrating Haskell optimization techniques. But honestly, I kind of expected a 3D game with impressive graphics to finally prove FP can be used for these purposes. This is just a Breakout clone that runs below 30fps on Android.


Yeah, it seems really impressive until you realize that you could duplicate it, faster, with only a small bit of C or C++.

Until I see a real-time software rendering project in Haskell doing full 3D, I don't really think it's worth half the fanboyism it gets.


I have to agree. I wanted to see something spectacular, and I wanted to see Haskell explored as a real option. But running at 25fps on Android?! Seriously? It can't hit 30fps, or today's accepted video game speed of 60fps, on an extremely trivial Breakout game?? As I was reading, all I could think was that ONLY after a lot of optimizations requiring hacks or deep knowledge of how Haskell operates do you hit a slightly higher but still low benchmark... that's... nice(?). Not the 'Wow, I should start using Haskell' argument at all.

I don't need to see full 3D to be convinced, but the average game on any platform is tens to hundreds of times more asset-heavy, resource-heavy, and computation-heavy than this example. It's just not convincing to see it perform UNDER the expected 30-60fps.


So, the reason I mention full software 3D, is this:

Computers are, at heart, numerical engines. 3D rendering, in software, is perhaps the purest form of exploiting that. If your language can't do that well, then it's a high-level language and is suited for gluing stuff together. I specify software 3D because any fool can pull in some OpenGL bindings and talk about framerates, but in that case their language is only glue for hardware APIs.

And there's nothing wrong with that, mind you, but then you can't get by on "But think of the performance this could have, one day, with magical compiler technology from the futuuuuuure!". You have to get by only on claims of safety, or developer ergonomics, or provability, or something else.


That's because you seem to be focused on extremely performance-sensitive coding. The vast majority of code out there is not that performance-sensitive. This is a big part of where your perceived "fanboyism" comes from. In this domain Haskell's vastly higher level abstractive abilities, safety, code conciseness, etc are light years beyond what C and C++ have to offer.


I believe functional programming does have its place in games and graphics, but I don't think it should completely replace imperative/OO programming (yet). Some aspects of game/graphics development map better to OOP (e.g. graphics APIs) and should be handled with such tools.

I think, for now, a hybrid OOP/FP approach would be best. Right now, I'm prototyping a game engine in F# and C# to see if one could be viable. Incorporating FP has definitely simplified parts of the code and let me reason better about them, but I'm not sure the tools and languages work well enough with each other yet.

Also see [0] for some thoughts about FP and game development.

[0] - https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced...


Do you know about LambdaCube, for example?

http://lambdacube3d.wordpress.com/2012/09/08/some-eye-candy/



Sorry about this. Thanks for posting this link. The server had KeepAlive on and couldn't handle the load. Solved.


For the love of god, use frame time to measure improvements, not FPS. "A 15 FPS increase" is entirely meaningless; it can mean you doubled your performance or gained 1%.


Someone else pointed this out. You are right: for comparing each improvement independently, it's better to use frame time.

In this particular case, we were after a very specific framerate, and it helps to think in FPS. The game outputs both frame time and framerate (currently of each subsystem independently).


When profiling games, using FPS to measure performance increases/decreases is kinda silly. For example, going from 25 to 30 fps (a 5 fps increase) is a bigger performance improvement than going from 90 to 120 fps (a 30 fps increase). Much better to use frame times.
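
A quick sketch of the arithmetic behind that: converting FPS to per-frame time in milliseconds shows the 25-to-30 jump saves more than twice as much time per frame as the 90-to-120 one.

    -- Frame time in milliseconds for a given framerate.
    frameTimeMs :: Double -> Double
    frameTimeMs fps = 1000 / fps

    -- Time saved per frame when going from f1 to f2 FPS.
    savedMs :: Double -> Double -> Double
    savedMs f1 f2 = frameTimeMs f1 - frameTimeMs f2

    -- savedMs 25 30  = ~6.67 ms per frame
    -- savedMs 90 120 = ~2.78 ms per frame, despite the bigger FPS jump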


Agreed. The game outputs both. In this particular case, we are after a very specific framerate.


    > C does not annotate pure functions as such, but the
    > semantics of SDL make some data conversions
    > referentially transparent
FWIW, GCC supports such annotations with `__attribute__ ((pure))`.


But does that information end up in the lib*.so's?

I can see that possibly helping optimization within GCC (or LLVM, if it supports it), but I have no clue whether it applies to shared/static library objects.


The annotation would go in the header files, so that it works with lib*.so's. For example, glibc's strlen is marked this way (as well as a whole host of other functions).


Haskell and Clojure are both functional languages, with Haskell being the purer of the two. That being said, I fairly regularly see posts about how fast Haskell is (Bryan O'Sullivan has a great blog[1]), and I rarely if ever see posts about how fast Clojure is.

What is/are the reason(s) for this? Is it the fault of the JVM, or the bytecode the Clojure compiler emits?

[1] http://www.serpentine.com/blog/


Haskell is strongly and statically typed, which means its compiler can generate more optimized code. The compiler also creates programs that run without a virtual machine. Clojure is dynamically typed and runs on the JVM. It is still really fast compared to many other dynamic languages.


> The compiler also creates programs that run without a virtual machine.

To be honest, the whole GHC runtime is sort of a VM: it manages thunks, evaluates them, etc. It may not have the same overhead as the JVM, but there definitely is some.


Static typing is mostly orthogonal; Haskell is mostly blazing fast because of all the aggressive compiler optimizations and the last decade of hard work by Simon Marlow and others on the parallel runtime.


Static types provide guarantees that help the compiler know when optimizations are valid.


The second part is true, but that optimization work would have been much harder, if not impossible, to implement had they had to rely on some sort of flow analysis instead of static typing.


So is Typed Clojure much faster than regular dynamic Clojure?


I would think that few are interested in speed tests of Clojure directly.

Specifically, I don't think anyone believes that Clojure makes things slower than not using Clojure on the JVM. At that point, the concern is JVM versus native, and there are plenty of those comparisons.

Quickly googling, it looks like I may be wrong, though. So, tough to say.


An aside: this article has a lot of bold text, which I find quite distracting. I've seen it said that over-reliance on formatting to make your point is a sign of poor writing.

Most of the bold text could have been removed without any drop in impact (in fact, it would have greater impact because whatever bold text is left would actually stand out).


Am I missing something here? Breakout ran in real time without overlaps etc. on the Apple ][+ with a 1 MHz 6502.


The link just shows a blank page. Zero frames per second here.


The web server is having trouble. Reloading worked for me.


Sorry to both of you for this. Thanks for pointing it out.

The server was misconfigured (KeepAlive) and couldn't handle the load. It's been fixed now.


Apparently, the web server doesn't share the speed gain. Switching to the cached version...


A game that simple should run at 1000 fps in any modern language.



