R uses a pass-by-value semantic for function calls. This means that when a function modifies the contents of one of its arguments, it is a local copy of the value which is changed, not the original value. This has many desirable properties, including aiding reasoning about and debugging code, and ensuring precious data is not corrupted. However, it is very expensive as many more computations need to be done to copy the data, and many computations require excessive memory due to the large number of copies needed to guarantee these semantics.
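For contrast, here is roughly what the opposite (reference) semantics discussed later in this thread looks like in Julia -- a modification made inside a function is visible to the caller, since no copy is made:

    function zero_first!(v::Vector{Int})
        v[1] = 0        # mutates the caller's array; no copy is made
        return v
    end

    x = [1, 2, 3]
    zero_first!(x)
    x                   # => [0, 2, 3] -- the original was changed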
I don't know what implementation R uses for its data frames, but building efficient collections with functional semantics is now a solved problem, as demonstrated by the functional collections libraries supplied by or available for many languages -- Clojure, Scala, ML, Haskell, and no doubt many others I'm unaware of (along with my own for Common Lisp [0] and Java [1]). It is no longer necessary for the important benefits cited by the author to be considered at odds with an efficient implementation.
I think it's unfortunate that the Julia creators didn't realize this, and have gone back to imperative semantics for their collections. Imperative collections are harder to reason about, especially for inexperienced programmers, and are more bug-prone.
Your work on FSet looks excellent! Working with purely functional collections is certainly a lovely, coherent design (and, for what it's worth, something we were aware of when we started Julia). I can't find any benchmarks on the FSet site comparing performance of functional collections with their standard mutable counterparts. As the most basic example, I'd be interested in seeing a comparison of getindex and setindex on functional vectors as compared with the same operations on simple contiguous-memory arrays.
The asymptotic performance of FSet operations is well documented – and as the site says, quite "acceptable". However, every time I've investigated purely functional data structures in the past, I've found that while the asymptotic behavior is fine, the constant-factor overhead for some critical operations is just too high. Retrieving a single element from an array, for example, was about 10x slower using Clojure's functional vectors than Java arrays when I benchmarked it some years ago. Perhaps the gap has been closed much further since then, and if so I would definitely love to see benchmarks to that effect. Did I miss benchmarks on the FSet site?
Thanks! No, I don't have benchmarks. My guess is that FSet is not as fast as the Clojure collections, which use a newer data structure (Phil Bagwell's Hash Array-Mapped Trie) that I didn't know about when I wrote FSet. -- Um, that's for sets and maps. For seqs (functional vectors), I'm not sure but I think the Clojure data structure is very similar to FSet's. I think the odds that FSet is significantly faster than Clojure are small, though I suppose it's worth checking.
For the domain I work in -- compilers and static analysis -- the additional overhead of functional collections has only rarely been an issue in practice. It's different from numerics, though, in that a lot of our collections are maps keyed by nonnumeric objects; the time spent hashing the object and testing it for equality with other objects is relatively much larger than when the keys are simply integers, and therefore the time spent traversing the data structure itself is a smaller part of the total. I concede that designing functional collections for use in high-performance numeric work is a harder problem, though I still suspect that solutions may exist. See my reply to your other post.
I'm not one of the creators of Julia (though I am a contributor), but I would argue that there are many algorithms which are much simpler when one can mutate in place. For example:
- many low-level linear algebra operations, such as Householder reflections or Givens rotations.
- sorting: a true implementation of QuickSort relies on being able to mutate the list. This is why naive implementations in Haskell are punitively slow (or at least they were last time I checked).
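To make the point concrete, here's a minimal in-place quicksort sketch in Julia (Lomuto partition; illustrative only) -- the algorithm is built around swapping elements of a mutable array:

    function quicksort!(a::AbstractVector, lo::Int = firstindex(a), hi::Int = lastindex(a))
        lo >= hi && return a
        pivot = a[hi]
        i = lo
        for j in lo:hi-1
            if a[j] <= pivot
                a[i], a[j] = a[j], a[i]   # swap in place
                i += 1
            end
        end
        a[i], a[hi] = a[hi], a[i]         # move the pivot into position
        quicksort!(a, lo, i - 1)
        quicksort!(a, i + 1, hi)
        return a
    end

    quicksort!([3, 1, 2])   # => [1, 2, 3]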
Now of course, a good compiler may be able to identify when it can safely mutate an existing structure vs copy to a new one. But this is far from easy to do, and can result in users writing complicated code in order to optimise against the compiler (which may no longer be optimal with the next version of the compiler).
In the specific case of Julia, I think it has settled to a fairly good compromise. It provides lots of immutable structures (structs, tuples, etc.), and has a convention of suffixing mutating functions with an exclamation mark (e.g. sort!(X)). I haven't seen too many cases where users have had problems with accidental mutation.
I agree that a language with Julia's goals needs to provide mutable vectors, matrices, etc. I just don't agree that these should be the most commonly used versions, or the ones that a newcomer to Julia learns about first.
Certainly, in my own work (admittedly in a different domain) I occasionally need a mutable map or set or list for some particular algorithm, but this is the relatively rare case. Most of the time, functional collections work just fine and make the algorithms more elegant rather than less so.
The most commonly used functions/methods are the non-mutating versions, which newcomers to Julia do learn about first. (The exception being assigning values to vectors and dictionaries using `setindex!`, as in `vector[i] = x` or `dict[key] = value`.)
A lot of user code just uses functional methods on types which have the potential to be mutated. OTOH, library methods that are called by the user often involve optimized code that might be doing mutation internally but presents a functional interface, such as `sort(vector)` (which basically uses `sort!(copy(vector))` internally).
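For example (plain base Julia):

    v = [3, 1, 2]
    w = sort(v)     # non-mutating: returns a new sorted vector
    (v, w)          # => ([3, 1, 2], [1, 2, 3])

    sort!(v)        # mutating variant, flagged by the `!` convention
    v               # => [1, 2, 3]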
In principle, I'm very much predisposed to collections that are purely functional. I know that Apache Arrow (https://arrow.apache.org/) aims to create an interoperable in-memory dataframe structure that tries to take into account CPU caches and other dark-magic-from-my-perspective things.
Any reason why a purely functional data structure shouldn't be able to achieve the aims that something like Arrow is attempting?
"I think it's unfortunate that the Julia creators didn't realize this, and have gone back to imperative semantics for their collections."
I'm a fan of immutable containers. Makes all sorts of things just easier. But I was under the impression there is no known way to implement efficient graph algorithms using immutable containers - with performance on par with a mutable implementation that is.
This alone is a sufficient reason not to default to immutable for a language with performance as one of the key requirements, IMO.
The research I did was from a few years back and things might have changed. I'd like to know if this is the case :)
I disagree, since it's supposed to be a scientific language. IMO it should aim to be a DSL to that effect; ergonomic for the purpose of analysing and calculating. Kind of like a declarative language; the hard stuff should be left to the compiler where it's possible. Again, I think Clojure does a good job at this.
"Kind of like a declarative language; the hard stuff should be left to the compiler where it's possible. "
That's the holy grail of numerics. I don't think that's feasible if they want to cater to researchers who need to hand tune their algorithms.
We are quite far off from "do what I mean" semantics. Clojure's approach works for a grammar of a very limited set of very well understood container operations.
If you want symbolic computations use e.g. Mathematica, Macsyma, etc.
A while back I remember someone tried to write an alternative R implementation.
Can you call R from Julia yet? You can do it from Python, but converting between R and Python data structures can be painful. I imagine it would be less so with Julia being "more vectorized" than Python.
There's also the issue that vectorized operations tend to be very _fast_ but algorithmically inefficient. Even if expressions are evaluated lazily, there is no compiler to recognize that computing the value of
any(is.na(x))
can short-circuit as soon as a single NA value is found. Instead all of `is.na(x)` is computed first, and then `any()` is computed on the result.
I imagine this is a problem in any non-compiled (ahead-of-time or just-in-time) language. Wondering what your thoughts were, since it seems like a related issue.
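For what it's worth, the short-circuiting version doesn't strictly need a clever compiler, just a higher-order `any` that takes the predicate; e.g. in Julia (using `isnan` as the stand-in for `is.na` on a Float64 array):

    x = randn(10^7)
    x[1] = NaN

    any(isnan.(x))   # allocates a full Bool array, then scans it
    any(isnan, x)    # applies the predicate element by element and stops at the first NaN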
> Imperative collections are harder to reason about, especially for inexperienced programmers.
That's a blanket statement that is completely unsubstantiated.
I expect that, in reality, both have strengths and weaknesses, and it depends very much on the person. Myself, I always find imperative style much more manageable when programming in the small. And before you claim I'm biased: I was exposed to functional programming almost from the start (and frankly I'm quite fond of it, but it's just another tool, not a religion).
Thanks, I had seen the original link before, but not this. As a Julia developer, I like that his 5 lessons are ones that Julia has successfully addressed. In particular:
1. Julia had the benefit of clever design, as well as not needing to support legacy interfaces that are difficult to optimise. Python has had similar problems (it needs to support all the different low-level interfaces), whereas JavaScript, which provides fewer ways to "mess with the internals", now has several high-performance engines.
4. Although Julia does provide the ability for type annotation, it is actually rather rarely necessary, due again to good language design which makes automatic type inference feasible.
5. Julia originally discouraged use of vectorised operations for this reason. But now we've gone the other way, with special "dot" syntax which avoids allocating intermediate arrays. See https://julialang.org/blog/2017/01/moredots for more details.
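A tiny illustration of the fused "dot" syntax:

    x = rand(1_000)
    y = rand(1_000)

    z = 2 .* x .+ sin.(y)    # all the dots fuse into a single loop: one pass, one output array
    y .= 2 .* x .+ sin.(y)   # with .= the result is written in place, with no new allocation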
As a Julia developer, and for the benefit of someone (me!) who doesn't know anything about such matters, are you in a position to respond to ScottBurson's post (https://news.ycombinator.com/item?id=14801186) on efficient collections with functional semantics?
No reply from Simon, but for the record, here's how I would do it. The Julia representation of a matrix would be a pair of an array and an override map, a functional map which is initially empty. Pointwise functional update returns a new matrix with a new value in one cell; the implementation returns a new pair of the same array and a new override map, computed as the previous override map functionally updated with the new mapping ((i, j) -> x). Pointwise access looks in the override map first, and if it finds no mapping there, looks in the array.
When you want to perform a matrix operation, like multiplication, that for efficiency needs to operate on a plain array, here's what you do. First, if the override map is empty, you can use the array as it is. Otherwise, you make a new copy of the array, then iterate through the override map, storing each value in the array at the specified coordinates. Then you write the pointer to the copied and updated array into the array slot of the original pair. Yes, this modifies a nominally immutable object, but critically, it does so in a way that doesn't change its semantics. Finally you write, into the override-map slot of the original pair, a pointer to the canonical empty map. Now you can pass the array to your matrix multiply routine, or whatever.
The two writes don't even require locking in a multithreaded situation, though on many architectures there will need to be a memory barrier instruction between them, so another thread can't read the two writes in the wrong order. Given that, the worst that can happen is that another thread sees the updated array but the original override map, whereupon it will make a redundant copy of the array and redundantly apply the updates to it, doing a little extra work but producing an identical array.
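To make that concrete, here's a rough, deliberately simplified sketch in Julia. The names `FMatrix`, `setcell`, and `flatten` are mine (not anything Julia provides), `Base.ImmutableDict` is just a stand-in for a proper functional map, and the write-back caching and memory-barrier details described above are left out:

    # Hypothetical sketch of the override-map idea -- illustrative only.
    struct FMatrix{T} <: AbstractMatrix{T}
        data::Matrix{T}                                  # shared; treated as read-only
        overrides::Base.ImmutableDict{Tuple{Int,Int},T}  # functional map of updated cells
    end

    FMatrix(a::Matrix{T}) where {T} =
        FMatrix(a, Base.ImmutableDict{Tuple{Int,Int},T}())

    Base.size(m::FMatrix) = size(m.data)

    # Pointwise access: consult the override map first, then fall back to the array.
    Base.getindex(m::FMatrix, i::Int, j::Int) =
        get(m.overrides, (i, j), m.data[i, j])

    # Pointwise functional update: same array, extended override map.
    setcell(m::FMatrix{T}, x, i::Integer, j::Integer) where {T} =
        FMatrix(m.data, Base.ImmutableDict(m.overrides, (Int(i), Int(j)) => convert(T, x)))

    # "Flatten" before a bulk operation: copy the array and apply the overrides.
    function flatten(m::FMatrix)
        isempty(m.overrides) && return m.data
        a = copy(m.data)
        # Apply oldest entries first, so the newest value for a twice-updated
        # cell wins (ImmutableDict keeps shadowed entries in its chain).
        for ((i, j), x) in reverse(collect(m.overrides))
            a[i, j] = x
        end
        return a
    end

    m  = FMatrix(zeros(3, 3))
    m2 = setcell(m, 1.0, 2, 2)     # m2[2, 2] == 1.0, while m is unchanged
    flatten(m2) * flatten(m2)      # plain Matrix{Float64}, ready for fast matmul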
This design seems workable for vectorized operations where the overhead of the override map can be amortized away (like matmul). The problem lies with scalar operations on individual array elements. Most high-level, dynamic languages don't bother themselves with those operations and instead encourage programmers only to use predefined, vectorized kernels. Julia isn't like that, however. A key property of the language is that you can write low-level C-style code with for/while loops, operating on individual array elements – and get C-like performance. The overhead of an overlaid structure like you describe becomes fatal in this situation since you pay for it with every element you touch. There may be clever compiler optimizations that can claw back some of that overhead – but one of the design principles of Julia is to avoid needing clever tricks in the first place. Instead, do the straightforward fast thing. In this case, that dictates standard C-style contiguous-memory mutable arrays – there's just nothing that can match their raw performance.
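For concreteness, this is the kind of plain element-wise loop being referred to; any per-element indirection (like an override-map lookup) would be paid inside it:

    # Scale-and-add over individual elements of contiguous mutable arrays.
    function axpy!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
        @inbounds for i in eachindex(x, y)
            y[i] += a * x[i]
        end
        return y
    end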
Ah, I see. Yes, I was imagining something more like Numpy.
So -- just thinking out loud here -- in order to use a design like the one I've proposed, you would have to make the "flatten" operation user-visible; it would look like a type conversion from a functional matrix to a bare array. Then the user's algorithm could run on the array, and once they were out of the hot section, they could, if they so chose, convert it back to a functional matrix.
If I personally were going to write Julia code, I think I would like things to work this way. Not only am I accustomed to functional semantics, and generally prefer it, but I'm also used to switching modes (even writing mixed algorithms in which some collections are functional and some are imperative). But I can see the argument that that might be a little too much complexity for your target audience.
> Ah, I see. Yes, I was imagining something more like Numpy.
Understandable. The approach and philosophy are very different, however. Julia feels much more like dynamically typed C++ than it does like Python with heterogeneous arrays added on.
> you would have to make the "flatten" operation user-visible
This seems like it would be a method of the `convert` function. Built-in types are barely special at all in Julia, so you could – and people have [1] – make purely functional analogues of any data structures you like and use all the usual operations on them. One can even allow mutation when necessary, but just make the non-mutating operations sufficiently efficient that mutation is rarely necessary.
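E.g., continuing the hypothetical `FMatrix`/`flatten` sketch from upthread (those names are made up, not part of Julia), the user-visible flatten could just be a `convert` method:

    # Hypothetical: expose flattening as an ordinary conversion to a plain Matrix.
    Base.convert(::Type{Matrix{T}}, m::FMatrix{T}) where {T} = flatten(m)

    A = convert(Matrix{Float64}, fm)   # fm::FMatrix{Float64}; plain contiguous array for the hot loop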
A fair amount of mathematical code does typically work in two distinct phases:
1. construction/mutation phase – vectors and matrices are constructed and mutated to set up the problem;
2. computation phase – vectors and matrices are used in computations, typically in a purely functional manner, where each operation creates a new object (e.g. `Z = sqrtm(2X + Y^2)`).
Unfortunately, purely functional collections don't help much with either of these phases: the first phase is inherently full of mutation and typically quite local, so the drawbacks of mutation are limited; the second phase is already inherently functional but doesn't need functional data structures. What you really need in the second phase is escape analysis that is clever enough to figure out that it can reuse temporaries.
Julia has been designed for single-core performance, full stop. Functional collections may work well with a state-of-the-art GC; with Julia's, not so much. The fact that Julia can interop seamlessly (and easily) with C code kind of bounds the design of the GC.
I think it is a little disingenuous to say that a Julia programmer does not have to worry about types. Type inference alleviates many burdens, but correct typing of arguments is essential (and hidden promotions or casting can kill performance). So while you can write correct programs easily, for efficient programs you end up worrying about this quite a lot.
> Functional collections may work well with a state of the art GC, with Julia's not so much.
I don't know the specifics of Julia's GC, but this seems a strange thing to say in 2017. Douglas Crosher's conservative generational collector for CMUCL (also used in SBCL AFAIK) supports C interoperation and is entirely adequate for handling the extra garbage that (admittedly) is generated when using functional collections. I don't recall exactly when he wrote that collector, but it must have been 20 years ago at least. It would be strange if Julia weren't using something at least as sophisticated.
I'm not sure I understand: why would the C interop limit the design of the GC?
I didn't say Julia programmers don't have to worry about types, my comment was only about type annotations, though they are obviously related. Perhaps a better way to phrase it is: by worrying a little bit about types, you can generally avoid the need for type annotations. Hopefully as the tooling around the language matures, we can reduce the amount of worrying required.
I'm a big R user; but it's true what the guy says. When you step off the happy path because the data is an awkward size or because you can't apply, aggregate or tidyr your way to a good solution, what results is often slow and hideous.
I've switched to Clojure for data work. It's fast, data-centric, and you can usually work out an elegant solution no matter how gnarly the problem, and in truth, despite the huge number of R packages, I use only a tiny portion.
I still use R lots and lots but mostly in the way that I use awk, and for shorter scripts. I don't use Incanter because it's been abandoned for so long. I do feel that Clojure could definitely use a better notebook story than Gorilla REPL.
I've been considering Clojure recently for data science. In your experience, could you tell me how Clojure stacks up against the likes of R and Python?
I guess the performance should not be much better (thanks to the JVM), and the learning curve is a bit steep. Oh, and one question about Lisp macros: in your line of work, have you ever confronted a situation where a macro saved the day?
I am not the addressee of your question, but I use Clojure as well for DS, and I thought I'd expand on a few aspects of my sibling post.
- Performance 1: It's really fast iff you avoid boxing, reflection and so on. Even without this optimization it's mostly fast enough. However, 1 out of 10 times it gets too slow and you need to change a few things in the code so it gets faster. It's easy and never requires a lot of time (after the first time :P)
- Performance 2: Memory consumption is sometimes quite high; in these cases you should use arrays and records instead of vectors of maps. Also, JVM args (e.g. -Xmx8g) are your friend.
- Macros: I don't ever use them.
- Filtering, aggregating, etc. are a breeze and usually a line of code. E.g.:
    (->> a-vector-of-records
         (filter #(> (:id %) 100))            ;; keep only ids above 100
         (remove (comp #{:a :b} :category))   ;; drop categories :a and :b
         (map :value)                         ;; take the value
         (reduce + 0))                        ;; and sum it
- Tools: I use incanter.stats (for statistical things I didn't yet implement), incanter.charts for the visualization and incanter.optimize for linear models. Other stuff (GLM, FFT, ...) I implemented myself or directly use the libs incanter uses.
- For report generation (notebook-like) I typically use clj-pdf which is usually the first deliverable of every DS task I have.
- For learning: I came from Javaland, and I benefited heavily from the 4clojure koans for the practical stuff and "The Joy of Clojure" for the "clojure way of thinking".
- For using it: I had to "learn emacs" (and paredit and so on) at the same time, which complicated everything and was also a drain on my motivation. It's good if you have a colleague who already uses the setup; alternatively, YouTube has many videos on this. Today I'd never switch back, because just using Eclipse (or something else) makes me feel that I need 10s to execute a thought instead of a keyboard shortcut. When you're there you also might want to switch to i3wm (if you don't have it already).
RE performance: it's JVM vs R interpreter vs Python interpreter. JVM takes it by far.
RE ETL: lists and maps are basically the main data structures in Clojure, and consequently you have a very large collection of functions for working with them efficiently. ETL is generally easy as a result. I don't think either Python or R can compare in terms of expressiveness and LOC efficiency. Further, lazy sequences make it really easy to work with massive datasets because they don't require buffering everything in memory. Working in parallel is really easy in Clojure. In particular, fold, pmap and async channels are extremely easy to use and understand.
RE analysis: Clojure is not vectorised like R, for example, meaning the syntax is not as terse for subsetting, aggregating, etc. It doesn't support formulae as first-class citizens like R does.
E.g. filtering in R vs Clojure.
X <- D[D$type == "Seismic",]
vs
(def X (dt/where (comp #(= % "Seismic") :type) D))
E.g. aggregating in R vs Clojure.
X <- aggregate(count ~ type, data=D, sum)
vs
(def X (dt/aggregate :type #(reduce + (map :count %)) D))
Having said that, it's functional and doesn't require one to loop or mutate; it's just not as terse.
Clojure sucks at Viz. There's no compelling notebook solution (Gorilla REPL...) that supports mod-cons like exporting or easy sharing.
However, I really like org-mode + Babel. It makes it easy to grab data from Clojure and push it into an R block that generates graphs, for example, and it exports to everything.
RE macros: I haven't written any custom macros at the moment tbh. If I do in the near term it'll probably be for logging or timing. I find that the function format is extremely flexible.
RE general: Clojure has a steep but short learning curve. As in, you cry for a few days but then suddenly understand loads. I also think the way of thinking it encourages is more mathematical. If you stay away from swap!, refs, agents, etc. (which you can), then your world is immutable, which makes testing so much easier. True, error messages suck, but as soon as you get a handle on how to write code interactively with the REPL, it's pretty easy to track down what causes an issue.
Then there's also the Axiom system, which I guess is primarily for symbolic computation, but I don't know if it might make sense to use it as a building block for statistical software?
When discussing Python, the authors missed the elephant in the room: NumPy.
For almost any statistical work done in Python, we use NumPy arrays, sometimes via libraries like pandas or statsmodels. We seldom use native Python types.
This thread is fascinating to me as someone who learned stats as an undergrad using Lisp, then learned S, then learned about and started using R, and read all of these forewarnings by Ihaka and Tierney.
The current era is both exciting and dispiriting from this perspective. It seems like there's a lot of traction in this area with languages like Julia, OCaml, Nim, to name just a few, which is wonderful. The discussions here are great in this regard. However, it's somewhat frustrating that warnings like the linked piece -- that have been around for a long time -- seem to have been ignored. My personal experience, too, is that many of the claims of recent languages regarding "C-like" speed are maybe overstated; my sense is that the slowest toy benchmarks are accurate reflections of the bottlenecks that will slow down a large program. For small programs, they are C-like; as program/library length increases, you start to approach Python or R speeds, which makes me wonder if it's better to just use C to begin with.
Relying on wrapped C, like in Numpy, is also misleading because eventually you end up bumping up against part of the program that can't be pushed down into lower-level code.
I often wish that x-lisp-stat had taken off rather than R. I love the syntax of both, but in retrospect it seems like there was a fork in the path of numerical computing: one branch toward a more "native" approach, the other toward a model where a high-level language is used to interface with a low-level language, to abstract away some of the complexity. I understand the rationale for both, but I kinda feel like the latter approach, which has become more dominant, isn't really sustainable. Moreover, this is all occurring while I've watched C++ become impressively abstracted -- if things like Eigen had been around earlier, I'm not sure Python or R would have ascended as much as they have.
The "new" issue that seems to be arising all the time in these discussions is parallel GC models and implementations. Not sure where this will all lead. If it's lisp I'm going to spit out my drink.
[0] https://github.com/slburson/fset
[1] https://github.com/slburson/fset-java