What's new in purely functional data structures since Okasaki? (2010) (cstheory.stackexchange.com)
75 points by profquail on Jan 18, 2014 | 17 comments



Out of curiosity, how do purely functional data structures compare with classic mutable data structures in terms of real hardware performance? From what I hear, they are useful if you care about functional purity, and sometimes for certain parallel programming tasks, but suffer significant overhead outside that domain. I'd like to hear any specific evidence people have in either direction.


Since the other comments are platitudes, I'll try to give some concrete information. Well-implemented purely functional data structures are usually anywhere from just as fast to about 20x slower. If the imperative version is tree-like, the cost of lookups in the functional version is the same, but the cost of updating is a bit higher because a whole path needs to be copied rather than a single node modified in place. For a data structure like an unordered map, you can use a hash table backed by an array for the imperative version, but you'll have to use a tree for the purely functional one. This can cause about a 10x-20x slowdown.
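To make the path-copying point concrete, here's a minimal sketch of my own (not from the linked article) of inserting into a persistent binary search tree in Haskell: only the O(log n) nodes along the search path are rebuilt, and everything off the path is shared with the old version.

    data Tree a = Leaf | Node (Tree a) a (Tree a)

    -- Inserting rebuilds only the nodes on the search path;
    -- the untouched subtree at each step is shared with the old tree.
    insert :: Ord a => a -> Tree a -> Tree a
    insert x Leaf = Node Leaf x Leaf
    insert x t@(Node l y r)
      | x < y     = Node (insert x l) y r   -- r is shared as-is
      | x > y     = Node l y (insert x r)   -- l is shared as-is
      | otherwise = t                       -- already present: share the whole tree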

For example in this post http://donsbot.wordpress.com/2009/09/26/very-fast-scalable-m... Haskell's Data.Map is compared with Judy arrays, and found to be 11x slower. Judy arrays themselves are about 2x slower than a basic hash table http://preshing.com/20130107/this-hash-table-is-faster-than-...

Is this slow? Perhaps, but in most cases the difference is smaller. Compare this to the speed of Ruby vs. C, which can easily be a 100x difference, yet people use Ruby a lot for productivity reasons.


Importantly, comparing "ephemeral" structures - those that destroy previous states through in-place mutation - with persistent structures (which all purely functional data structures are) is a fraught topic.

You can surely gain speedups on real hardware by directly mutating memory you know to be reusable (linear) - but proving that precondition is very hard, and we often screw it up.

Persistent data is a safe default. Mutation/destruction is an optimisation.
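A tiny Haskell illustration of that safe default (my own example, assuming Data.Map from the containers package): "updating" a map gives you a new version while the old one stays valid, and the two versions share most of their structure.

    import qualified Data.Map.Strict as Map

    main :: IO ()
    main = do
      let m0 = Map.fromList [(1, "one"), (2, "two")]
          m1 = Map.insert 3 "three" m0   -- a new version; m0 is untouched
      print (Map.toList m0)  -- [(1,"one"),(2,"two")]
      print (Map.toList m1)  -- [(1,"one"),(2,"two"),(3,"three")]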


The biggest benefit of a purely functional data structure for performance is the same as functional programming in general - they capture the problem in a way that is easier for a compiler to reason about and rewrite to hardware-optimized forms; a computation problem addressed in functional style is going to have fewer ordering dependencies, and (depending on the language) a potentially richer, more expressive type system enabling more aggressive optimizations. Functional code happens to be particularly amenable to parallel optimization, so it's one of the go-to examples of what can be done.
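As a rough illustration of that parallel amenability (my own sketch, using Control.Parallel.Strategies from the parallel package, not anything from the thread): because the function and the list are pure, the elements can be evaluated in parallel without any worry about ordering or aliasing.

    import Control.Parallel.Strategies (parMap, rdeepseq)

    expensive :: Int -> Int
    expensive n = sum [1 .. n]   -- stand-in for real per-element work

    -- Pure code means the runtime is free to evaluate these in parallel.
    results :: [Int]
    results = parMap rdeepseq expensive [100000, 200000 .. 1000000]

    main :: IO ()
    main = print (sum results)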

In a practical benchmarking scenario like gcc vs. GHC on the benchmarks game[0], gcc is still going to win most of the time. This is mainly a reflection of where the bulk of effort in compiler technology has gone, though, not a definitive "a purely functional data structure is going to be slower" statement. You're leaning a lot more on your compiler when trying to make functional languages run fast, and the biggest benefits of compiler optimizers are going to appear in the overall performance of applications, not benchmarks of an individual data structure.

That said, there are also plenty of examples where a mutable structure just wins and nothing with equivalent performance is likely to appear in the functional world anytime soon. If you want to get into the details of that you have to look at each category of data structure case-by-case.

[0] http://benchmarksgame.alioth.debian.org/u64q/benchmark.php?t...


The tremendous win in avoiding mutable data structures is in implementing persistent data structures, where appropriate. This isn't something you get automatically (usually), but the leap from "efficient purely-functional data structure" to "persistent data structure" is usually much smaller than the leap from "efficient mutable data structure". Persistent data structures retain the same amortized performance even if you replay them from arbitrary old versions. There is also sharing of unchanged portions of the data, which can shrink your memory footprint if you're saving some old versions, and which you'll usually get some of in any moderately efficient functional data structure, even if it's not truly persistent.
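For instance (a sketch of my own, again assuming Data.Map), keeping every intermediate version of a map around is cheap because successive versions share their unchanged subtrees:

    import qualified Data.Map.Strict as Map

    -- Every intermediate map in this history stays valid and queryable,
    -- and the versions share structure rather than being n full copies.
    history :: [Map.Map Int String]
    history = scanl (\m (k, v) -> Map.insert k v m) Map.empty
                    [(1, "a"), (2, "b"), (3, "c")]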

This, of course, doesn't apply to all situations - if all you care about is the latest version and you're working in a single thread, a mutable data structure with destructive updates will quite frequently be fastest.


If I may make a sweeping overgeneralisation and phrase it poorly, immutable data structures tend to be logarithmically worse (esp. in terms of allocations) than mutable structures. But if your language or runtime can prove that a reference is unique, then that overhead can be eliminated because the runtime can simply mutate with no one the wiser. And these languages tend to use GCs in which allocations are cheap, besides the fact that sharing saves you on copying. There are many factors at play.
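One way to see the "mutate when the reference is provably unique" idea in Haskell (my own illustration, using GHC's ST monad rather than any automatic uniqueness analysis): the mutable accumulator cannot escape runST, so it can be updated in place while the function stays pure from the outside.

    import Control.Monad.ST (runST)
    import Data.STRef (newSTRef, modifySTRef', readSTRef)

    -- Pure interface, in-place mutation inside: the STRef is local to runST,
    -- so no other code can observe the destructive updates.
    sumInPlace :: [Int] -> Int
    sumInPlace xs = runST $ do
      acc <- newSTRef 0
      mapM_ (\x -> modifySTRef' acc (+ x)) xs
      readSTRef acc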

Source: professional Haskell and C++ programmer extraordinaire.


For people who've read it, do you have a review of Okasaki's book? Also, what background should someone have before reading it (e.g., should you be able to read it after reading Learn You a Haskell? Should you already be familiar with the imperative equivalents?)


Short answer: I don't think LYAH would be entirely sufficient background, and I think familiarity with the imperative equivalents is a prerequisite, along with some other background.

Long version: I stumbled a bit when trying to read this book, because I'm not super comfortable with the formal techniques for analyzing algorithms. I'm not talking about the kind of fluffy-tech-company-interview-"knows what big Oh is and can estimate it roughly on a whiteboard" ability but actually something much more concrete and formal, the kind taught in an undergrad algorithms class or in CLRS. I've taken those classes, I have CLRS on my shelf and have read some of it, but it's been a long time and it's not at my fingertips. There are exercises in the book ("prove that this data structure has this amortized performance", "prove that it does if you change this one thing", etc...) that challenged me quite a bit, and I ended up putting it down after a while thinking I would revisit it after refreshing my fundamentals.

The ML code is easy enough to follow if you know Haskell, but since the point is often to understand the subtlety of the performance characteristics of the data structures and algorithms, just reading and playing with the code wasn't enough for me. After all you can trivially implement functional data structures naively just by copying everything on every operation.
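For contrast, the naive "copy everything" version the parent mentions looks something like this (my own sketch): it's trivially persistent and trivially correct, but every update is O(n), which is exactly the cost the book's structures are designed to avoid.

    -- Rebuilds the whole list on every update: O(n) per operation,
    -- versus the O(log n) or better that the book's structures achieve.
    setAt :: Int -> a -> [a] -> [a]
    setAt i x xs = [ if j == i then x else y | (j, y) <- zip [0 ..] xs ]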

Frankly, Rich Hickey's videos on Clojure's data structures have done a lot more for me in terms of understanding how functional data structures work, even though they are much less formal in presentation. To be clear: I think Okasaki's book is probably a very good book and an important work, but am not so sure it is crucial for the working functional programmer to internalize all the mathematical details.

Anyway, YMMV. Check out his thesis and see what you think before buying it.


It's the best book about data structures in functional programming, as relevant today as when it was published almost fifteen (gasp) years ago.

You can get a flavor of the book by reading Okasaki's PhD thesis: http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf

The book is essentially a fleshed-out version of this thesis.


Why is there no updated book out there (in the cold!)?



"That changed one day during a walk in the snow."

The paragraph that follows this phrase in the OA is straight out of The Psychology of Invention in the Mathematical Field by Jacques Hadamard.


It's outstanding. My favorite bit is the section on building data structures with various characteristics inspired by obscure number systems (zeroless binary numbers, skew binary numbers...).
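To give a flavour of that numeric-representation idea, here's a rough Haskell sketch of a skew-binary random-access list in the spirit of the book (my own transcription, not Okasaki's SML): the list is a sequence of complete trees whose sizes follow the skew binary number system, so cons is O(1) and indexing is O(log n).

    data Tree a = Leaf a | Node a (Tree a) (Tree a)

    -- A list of (size, complete tree) pairs; each tree holds 2^k - 1 elements.
    newtype RList a = RList [(Int, Tree a)]

    cons :: a -> RList a -> RList a
    cons x (RList ((w1, t1) : (w2, t2) : rest))
      | w1 == w2 = RList ((1 + w1 + w2, Node x t1 t2) : rest)  -- O(1) "carry"
    cons x (RList ts) = RList ((1, Leaf x) : ts)

    index :: RList a -> Int -> a
    index (RList ((w, t) : ts)) i
      | i < w     = indexTree w t i
      | otherwise = index (RList ts) (i - w)
    index _ _ = error "index out of bounds"

    indexTree :: Int -> Tree a -> Int -> a
    indexTree _ (Leaf x) 0 = x
    indexTree w (Node x l r) i
      | i == 0    = x
      | i <= w'   = indexTree w' l (i - 1)
      | otherwise = indexTree w' r (i - 1 - w')
      where w' = w `div` 2
    indexTree _ _ _ = error "index out of bounds"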

I'd recommend some experience with complexity analysis (though "ages ago in school" is probably fine). I found most of the SML examples to be reasonably transparent, though I did some OCaml ages ago and have been working with Haskell more recently. He uses laziness heavily, but it's explicit. Knowing your way around a few imperative data structures is almost certainly worthwhile; I'm not positive it's strictly necessary.


Note that this thread has been submitted many times to HN, but it never generated many comments:

https://www.hnsearch.com/search#request/submissions&q=what's...


I think posting one algorithm you really like and explaining why works better than a laundry list.


The importance of the Phil Bagwell/Rich Hickey* vector can't be overstated. AFAIK, pretty much all standard immutable lists provided by platforms use this work.

*Bagwell designed the tries as a mutable data structure. Hickey saw how to apply that to a workable immutable list implementation.
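For anyone curious what that structure looks like, here's a very rough sketch of just the indexing scheme (my own simplification, nothing like Clojure's actual implementation): a 32-way branching trie addressed by 5-bit slices of the index, so lookups touch about log32 n nodes and an update only has to copy one short path.

    import Data.Bits (shiftR, (.&.))
    import qualified Data.Vector as V   -- from the `vector` package

    -- Internal nodes hold up to 32 children; leaves hold up to 32 elements.
    data PVec a = Node (V.Vector (PVec a)) | Leaf (V.Vector a)

    -- `shift` is 5 * (depth of the trie - 1); each level consumes 5 bits of i.
    index :: Int -> Int -> PVec a -> a
    index shift i (Node children) =
      index (shift - 5) i (children V.! ((i `shiftR` shift) .&. 31))
    index _ i (Leaf xs) = xs V.! (i .&. 31)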


Even as a rabid user of Scala and sometimes Erlang, it strikes me that "exciting purely functional data structures" may be an oxymoron.



