I recently ported a reinforcement learning algorithm from PyTorch to Julia. I did my best to keep the implementations the same, with the same hyperparameters, network sizes, etc. I think I did a pretty good job, because the performance was similar, solving the CartPole environment in a similar number of steps, etc.
The Julia implementation ended up being about 2 to 3 times faster. I timed the core learning loops, the network evaluations and gradient calculations and applications, and PyTorch and Julia performed similarly here. So it wasn't that Julia was faster at learning. Instead, it was all the in-between: all the "bookkeeping" done in Python ended up being much faster in Julia, enough so that overall it was 2 to 3 times faster.
(I was training on a CPU though. Things may be different if you're using a GPU, I don't know.)
Similar experience over here. (G)ARCH models are severely underserved in Python, and I could not be bothered to learn a Probabilistic programming abstraction like Pyro or Stan just to build a quick prototype myself.
Chose Julia instead. Took 4 hours to get everything sorted out (including getting IT to allow Julia's package manager to actually download stuff) and have the first model running, just putting a paper into code. Since the code is essentially just the math written out, this is a vast communication improvement.
After fiddling around with it at home for a week, this was my first professional experience with it, and I'm blown away.
Julia is such a wonderful language. There are many design decisions that I like, but most importantly to me, its ingenious idea of combining multiple dispatch with JIT compilation still leaves me in awe. It is such an elegant solution to achieving efficient multiple dispatch.
Thanks to everyone who is working on this language!
Julia is the first language to really show that multiple dispatch can be efficient in performance-critical code, but I'm not really sure why: JIT concepts were certainly familiar to implementors of Common Lisp and Dylan.
The combination. E.g. multiple dispatch without JIT would be really slow, as you are picking a method to run at runtime based on the types of all the function arguments.
That requires a linear search through a list of all possible combinations of input arguments.
In a single dispatch language like most object oriented languages, you can do a simple dictionary/hash table lookup. Much faster.
With the JIT, Julia is able to optimize away most of these super slow lookups at runtime. Hence you get multiple dispatch for all functions but with fantastic performance. Nobody had done that before.
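To make that concrete, here's a rough sketch (the types and the `area` function are just illustrative, not from any particular package): once the compiler knows the concrete argument type, the method choice is resolved at compile time and usually inlined, so no lookup happens at runtime.

struct Circle; r::Float64; end
struct Square; s::Float64; end

area(c::Circle) = pi * c.r^2   # one method per concrete type
area(s::Square) = s.s^2

# For a concretely-typed call, dispatch is resolved statically:
@code_typed area(Circle(1.0))  # shows a direct, devirtualized method body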
FWIW, Julia does segment its method tables into multiple layers depending upon size and type. Multiple dispatch is a strict superset of single-dispatch, and indeed the first layer is just a dictionary/hash table lookup on the first argument. If there's only one result there, you're done (and have the same ~cost for the same ~complexity).
Thanks, I didn't know that! Has it always been like that? I wonder where I got the idea that it was always a linear search. Maybe that is just the conceptual way of explaining it.
It's like C++ template specialisation, but it happens when the compiler realises you need a particular version. Which may be at runtime, if you changed something.
Except the language can choose from suitable templates (eg instead of a generic matrix multiply template for floats, it can use a library like LAPACK) and does so in a systematic way.
It also has a feature (I can’t recall the name) which is a bit like fexprs (let’s say macros whose inputs are the types of the arguments of a function) that can generate customised code (eg an FFT depending on the input size) on the fly.
(but I don't find it helpful to compare to fexprs, which I think of as more about deferring evaluation, whereas generated functions are about "staged programming".)
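For reference, a minimal sketch of a generated function (the names are made up for illustration): the body runs at compile time, sees only the argument's type, and returns the expression that becomes the specialized method body.

@generated function sumfields(x)
    # here `x` is bound to the argument's *type*
    terms = [:(getfield(x, $i)) for i in 1:fieldcount(x)]
    :(+($(terms...)))    # in the returned expression, `x` is the runtime value
end

struct Point3; a::Int; b::Int; c::Int; end
sumfields(Point3(1, 2, 3))   # 6, computed by a body generated specifically for Point3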
What the OP is talking about is julia's method-based JIT strategy coupling very well to multiple dispatch.
JIT is not new, multiple dispatch is not new, and multiple dispatch + JIT also isn't new, but no existing languages combined them in a way that allows for the fantastic, efficient devirtualization of generic methods that julia is so good at.
This is why things like addition and multiplication are not generic functions in Common Lisp, it's too slow in CL because the CLOS is not able to efficiently devirtualize the dispatch. In julia, everything is a generic function, and we use this fact to great effect.
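A quick way to see this from the REPL (just illustrative, but these macros are standard):

length(methods(+))    # `+` is an ordinary generic function with hundreds of methods
@code_native 1 + 2    # yet a call on concrete types compiles to a plain integer add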
CLOS and Dylan laid a ton of important groundwork for these developments, but they're also not the same.
It’s not true that CLOS generic dispatch is slow: Robert Strandh and others have done a bunch of work showing that it’s possible to implement it efficiently without giving up the dynamic redefinition features that make CLOS such a nice system. There’s at least one video game project (Kandria) that’s been funding an implementation of these ideas so that generic functions can be used in a soft real-time system like a video game.
The really nice thing about CLOS, though, is that the meta-object protocol lets you choose an implementation of OOP that makes sense for your use-case.
Yes, I don't doubt at all that it's possible to make CLOS dispatch fast. What I'm saying is that because historically people using CLOS had to pay an (often negligible) runtime cost for dispatch, it limited the number of places developers were willing to allow generic dispatch.
Julia makes the runtime cost of (type stable) dispatch zero, and hence does not even give julia programmers an *option* to write non-generic functions (though it can be hacked in like with FunctionWrappers.jl). I'm not familiar with Strandh's work, but has it made the overhead of generic functions low, or has it completely eliminated it?
Another thing I'll mention is that Julia's type system is parametric, and we allow values (not just types) in our type parameters, which is immensely useful for writing generic high performance code. You can specialize methods on matrices of complex integers separately from rank-5 arrays of rational Int8s, for instance. This is not a capability that CLOS or Dylan has as far as I'm aware. The common refrain is that you can do it with macros, but that neglects that it's rather hard to get right, and it will have limited use because such macro implementations of type parameters won't be ubiquitous.
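A rough sketch of what I mean (the `kind` function is made up); note the value 5 appearing as a type parameter in the second method:

kind(x::Matrix{Complex{Int}}) = "matrix of complex integers"
kind(x::Array{Rational{Int8},5}) = "rank-5 array of rational Int8s"
kind(x::AbstractArray{T,N}) where {T,N} = "rank-$N array of $T"

kind(zeros(Complex{Int}, 2, 2))   # hits the specialized method
kind(rand(3, 3))                  # falls back to the generic one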
________________________________
To be clear though, I'm not hating on Common Lisp or CLOS. The Common Lisp ecosystem is awesome and can do all sorts of really cool things that I wish we had in julia. I'm mostly just pushing back on the notion that Julia doesn't do anything new or interesting.
In this context, I think there might be an argument to be made that Julia is to multiple dispatch (or multiple dispatch + JAOT) as the iPhone is to “touchscreen computers that can make phone calls”.
It’s not that it’s the first, but it seems to be the first where the use of multiple dispatch throughout the community was sufficiently pervasive to kick-start the emergence of the strong network effects we’re now seeing w/r/t composability.
I would not be surprised to see more languages working to emulate this kind of combination of multiple dispatch and JAOT compilation in the future.
Except everyone is forgetting that I also mentioned Dylan, from Apple, whose goal was to be a systems programming language for the Newton OS. The Dylan team won out over the C++ one, but internal politics made the decision to go with the C++ team's outcome alongside NewtonScript.
I directed most of my comments towards CL because I know more about it than Dylan. My understanding is that Dylan lacks parametric types, so that comment can be straightforwardly applied to Dylan. IMO Parametric types are a really important part of this.
Regarding performance, I don't know much about this in Dylan. Was Dylan able to completely remove the runtime overhead of multiple dispatch for type stable code?
I think Dylan would be (perhaps ironically) the Newton in this analogy. (or maybe General Magic?)
Pioneering and ahead of its time in many ways, but for whatever reason the use of multiple dispatch in Dylan seems to have not (yet?) led to the same level of ecosystem-wide composability.
Languages with multiple dispatch aren't rare, but a language having it as the core language paradigm, combined with a compiler capable of completely resolving the method calls during compile time, and therefore able to remove all runtime costs of the dispatch, and a community that fully embraced the idea of creating composable ecosystems is something unique to Julia. I don't think anyone has scaled multiple dispatch to the level of Julia's ecosystem before.
Common Lisp’s type system is just not really as useful for this sort of thing. In particular it doesn’t have parameterized types, so you can’t make eg a matrix of complex numbers. This breaks (1) a lot of the opportunity for optimisation by inlining (because you can’t assume that all the multiplications in your matrix{float} multiplication are regular float multiplications) or for generic code (because you can’t have a generic matrix type and instead need a special float-matrix); and (2) opportunities for saving memory with generic data structures, because the types must be associated with the smallest units rather than the container (eg every object in a float matrix must be tagged with the fact that it is a float, because in theory you could put a complex number in there and then you’d need to know to do a different multiplication operation).
I guess you could try to hack together some kind of templating feature to make new type-specific classes on the fly, but this won’t work well with subtyping. Your templating system could probably have (matrix float) as a subclass of matrix, but not of (matrix real) or (matrix number). I think you’d lose too much in Common Lisp’s hodge-podge type system.
A big innovation of Julia was figuring out how to make generic functions and multiple dispatch work in a good way with the kind of generic data structures you need for good performance. And this was not a trivial problem at all. Julia’s system lets you write generic numeric matrix code while still having float matrix multiplication done by LAPACK, which seems desirable.
The other thing is that Julia is a language where generic functions are a low-level thing all over the standard library whereas Common Lisp has a mix of a few generic functions (er, documentation is one; there are more in cltl2), a few “pre-clos” generic functions like mathematical functions, sequence functions and to some extent some array functions, and a whole lot of non-generic functions.
Wikipedia has a nice table [1] on the Multiple Dispatch page that describes one study's findings about how multiple dispatch is used in practice in languages supporting it.
Although CLOS and others do support it, Julia seems to take the cake by most metrics, highlighting that it is a core paradigm of the language, more so than in the others.
Even better, check out the Stanza language for a modern version and interpretation of Lisp, Scheme and Dylan. It supports multi-methods/multiple dispatch, hybrid dynamic and static typing, and high- and low-level programming, to name a few productive features.
I've been running the 1.6 release candidates, and the compilation speed improvements have been massive. There have been plenty of instances in the past where I've tried to 'quickly' show off some Julia code, and I end up waiting ~45 seconds for a plot to show or a minute for a Pluto notebook to run, and that's not to mention waiting for my imports to finish. It's still slower than Matlab for the first run, but it's at least in the same ballpark now.
I agree, this is a game changer. Previously time to first plot (TTFP) was >1 minute for me, which made julia completely unusable for my day-to-day exploratory data analysis, visualisation, quick random number experiments etc. Now TTFP is less than 10 seconds. I'm now ready (and excited) to jump ship from R and python!
I wonder how much Julia could be helped with some uneval/image-saving magic. So when you run the repl you instead get a pre-built binary with plot already loaded and several common specialisations already compiled.
We call these "system images" and you can generate them with PackageCompiler [0]. Unfortunately, it's still a little cumbersome to create them, but this is something that we're improving from release to release. One possible future is where an environment can be "baked", such that when you start Julia pointing to that environment (via `--project`) it loads all the packages more or less instantaneously.
The downside is that generating system images can be quite slow, so we're still working on ways to generate them incrementally. In any case, if you're inspired to work on this kind of stuff, it's definitely something the entire community is interested in!
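For anyone who wants to try it today, the workflow is roughly this (a sketch only; check the PackageCompiler docs for the exact, current API):

using PackageCompiler
create_sysimage([:Plots]; sysimage_path="sys_plots.so")
# then launch Julia with the baked image:
#   julia --sysimage sys_plots.so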
I’ve also been running the release candidates, and I get something like 6 seconds to first plot on my 2013 laptop, including the time for `using Plots` and the time to actually draw the first plot. A huge improvement; kudos to the developers.
On the package ecosystem side, 1.6 is required for JET.jl [0]. Despite being a dynamic language, the Julia compiler does a lot of static analysis (or "abstract interpretation" in Julia lingo). JET.jl exposes some of this to the user, opening a path for additional static analysis tools (or maybe even custom compilers).
Whatever improves loading times is more than welcome. It's not really acceptable to have to wait just because you import some libraries. I understand Julia does lots of things under the hood and that there's a price to pay for that, but coming from Python, it's a bit inconvenient.
But I'll sure give it a try, because Julia hits a sweet spot between expressiveness and speed (at least for the kind of stuff I do: matrix, algorithm, and graph computations).
I like Julia (mostly because of multiple dispatch). The only thing that's lacking is an industrial-strength garbage collector, something like what can be found in the JVM.
I know that you shouldn't produce garbage, but I happen to like immutable data structures and those work better with optimised GCs.
> I know that you shouldn't produce garbage, but I happen to like immutable data structures and those work better with optimised GCs.
If you use immutable data-structures in julia, you're rather unlikely to end up with any heap allocations at all. Unlike Java, Julia is very capable of stack allocating user defined types.
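A small sketch of what that looks like in practice (illustrative types; exact numbers depend on the compiler version):

struct Point           # immutable by default: no `mutable` keyword
    x::Float64
    y::Float64
end

pts = [Point(rand(), rand()) for _ in 1:10_000]   # one array, Points stored inline
total(ps) = sum(p.x + p.y for p in ps)

total(pts)             # first call compiles
@allocated total(pts)  # typically 0 bytes: no per-Point heap allocation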
Not just floats, and I'm not sure they have to be that small. All sorts of structs containing bitstypes/value types can be stack allocated. In fact, even some structs with pointers to heap-allocated memory can be stack-allocated (such as array views.)
It doesn’t, it just doesn’t have a $100B GC like Java does. Rather than spending that kind of money trying to compensate for a language design that generates massive amounts of garbage (ie Java), Julia takes the approach of making it easier to avoid generating garbage in the first place, eg by using immutable structures that can be stack allocated and having nice APIs for modifying pre-allocated data structures in place.
Obviously a 10-million-element array doesn't get stack allocated. But if individual objects of some type are immutable, then they can be stack allocated, or maybe not allocated at all (kept in registers).
Edit: reading your other post, it seems like you may mean persistent data structures, a la Clojure, rather than immutable structures, which are quite different. The former would indeed always be heap-allocated (it's necessary since they are quite pointer-heavy). Immutable structures, on the other hand are detached from any particular location in memory.
Moreover, if the elements in an array are mutable, eg Java objects, then each one needs to be individually heap allocated with a vtable pointer and the array has to be an array of pointers to those individually allocated objects. For pointer-sized objects (say an object that has a single pointer-sized field), that takes 3x memory to store x objects, so that's already brutal, but worse is that since the objects are all individually allocated, the GC needs to look at every single one, and freeing the space is a fragmentation nightmare. If the objects are immutable (and the type is final; btw all concrete types are final in Julia), then you can store them inline with no overhead and GC can deal with them a single big block.
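To illustrate the layout point with a concrete (made-up) type:

struct Pixel                 # immutable, with concrete (and therefore final) field types
    r::UInt8; g::UInt8; b::UInt8
end

isbitstype(Pixel)                   # true: instances can be stored inline
v = Vector{Pixel}(undef, 1_000)
sizeof(v)                           # 3000 bytes in one contiguous block,
                                    # not 1000 individually heap-allocated objects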
Btw, I had to vouch for you to undead your posts in order to reply. Looks like you got downvoted a bunch.
I mean I’m not trying to hate on Java — pointer-heavy programming was all the rage when it was designed, and GC was a hot research topic, so there was good reason to be optimistic about that approach. But it turns out that it’s very hard to make up for generating tons of garbage and pointer-heavy programming hasn’t aged well given the way hardware has evolved (pointers are large and indirection is expensive).
You're mixing two things here: memory management and memory layout. This "pointer-heavy programming" is, indeed, a bad fit for modern hardware in terms of processing speed due to cache misses, which is why even Java is now getting user-defined primitive types (aka inline types, aka value types), but in terms of memory management, in recent versions OpenJDK is pretty spectacular, not only in throughput but also latency (ZGC in JDK 16 has sub millisecond maximum pause time for any size of heap and up to a very respectable allocation rate: https://malloc.se/blog/zgc-jdk16 and both throughput and max allocation rate are expected to grow drastically in the coming year with ZGC becoming generational). As far as performance is concerned, GC can now be considered a solved problem (albeit one that requires a complex implementation); the only real price you pay is in footprint overhead.
I'm not — memory layout and memory management are (fairly obviously, I would think) intimately related. In particular, pointer-heavy memory layouts put way more stress on the garbage collector. Java's choice of making objects mutable, subtypeable and have reference semantics, basically forces them to be individually heap-allocated and accessed via pointers. On the other hand, if you design your language so that you can avoid heap allocating lots of individual objects, then you can get away with a much simpler garbage collector. Java only needs spectacular GC technology because the language is designed in such a way that it generates a spectacular amount of garbage.
I would say no. To have stellar performance, you'll need compaction, you'll need parallelism (of GC threads), and you'll need concurrency between the GC threads and mutator threads; and for good throughput/footprint tradeoff you'll need generational collection. True, you might not need to contend with allocation rates that are that high, but getting, say, concurrent compaction (as in ZGC) and/or partial collections (as in G1), requires a sophisticated GC. E.g. Go isn't as pointer-heavy as (pre-Valhalla) Java, and its GC is simple and offers very good latency, but it doesn't compact and it throttles, leading to lower throughput (I mean total program sluggishness) than you'd see in Java, even with a much higher allocation rate. The thing is that even with a low allocation rate, you'd get some challenging heaps, only later, say, every 10 seconds instead of every 5.
It's true that a simpler GC might get you acceptable performance for your requirements if your allocation rate is relatively low, but you still won't get OpenJDK performance. So I'd say that if you design your language to require fewer objects, then you can get by with a simple GC if your performance requirements aren't too demanding.
All that dereferencing puts a higher load on data structure traversal (which is why Java is getting "flattenable" types) than on the GC. The main reason for Java's particular GC challenges isn't its pointer-heavy (pre-Valhalla) design but the mere fact that it is the GCed platform that sees the heaviest workloads and most challenging requirements by far. Java's GC needs to work hard mostly for the simple reason that Java is asked to do a lot (and the better some automated mechanism works, the more people push it).
Of the features you talk about, Go has only concurrency, but it is often competitive with Java in benchmarks. In my experience I’ll take administration of Go processes any day of the week over Java — I’ve lost count of the number of hours lost to debugging GC stalls, runaway heaps and other garbage collector related issues, and I've never once had those problems in Go.
Go even reverted a generational collector because it had no performance benefits, since most short-lived objects would be stack allocated anyway — Julia’s JIT and way more advanced LLVM backend should do even better than Go in keeping objects stack-local and inline.
It's competitive in pretty forgiving benchmarks. And LLVM is way more advanced than Go's compiler, but not OpenJDK's. I'm not saying you have to prefer Java to Go, but its throughput is better. As to the stack-allocation claim, young generations might be hundreds of MBs; that might correspond to the stacks of 100K goroutines on some server workloads, but not of a few threads.
So I'm not saying you must prefer Java to Go (even though GC tuning is a thing of the past as of JDK 15 or 16), or that Go's performance isn't adequate for many reasonable workloads, only that 1. a flatter object landscape might still not match Java's memory management performance without sophisticated GCs, and 2. I wouldn't extrapolate from Go to Julia, as they are languages targeting very different workloads. E.g. Julia might well prefer higher throughput over lower latency, and Go's GC's throughput is not great.
Having a Lamborghini racing a Toyota Corolla is of course going to show the Lambo winning. But if I need to maintain a fleet of them to move 1000 passengers around a city with certain availability guarantees, I'm going with the Toyotas every time.
In other posts you actually argue that GCs help you reduce complexity because manual memory management is too much of a hassle.
Maybe immutable is not the correct term - persistent data structures are what I'd like support for: that is my use-case.
I think you can have efficient persistent data structures without a GC, but that requires fast reference counting and in turn, that requires a lot of work to be competitive with the JVM.
I also understand that my use-case is not Julia's focus. That's perfectly fine.
That's a major oversimplification. GC is good for ease of use and safety of a high level language. GC is never as performant as not requiring heap allocations at all. Julia has a GC, but also provides a lot of tools to avoid needing the GC in high performance computations. This combination gives ease of use and performance.
Java sacrifices some performance for having this "one paradigm" of all objects, and then heavily invested in the GC, but in many cases, like writing a BLAS, it still just will not give performance exactly matching a highly tuned code, whereas in Julia, for example, you can write really fast BLAS codes like Octavian.jl.
Julia is multi-paradigm in a way that is purposely designed for how these features compose. I think it's important to appreciate that design choice, in both its pros and cons.
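For context, Octavian.jl is a pure-Julia matmul; a rough usage sketch (assuming its exported in-place `matmul!`) looks like:

using Octavian
A = rand(200, 300)
B = rand(300, 100)
C = similar(A, 200, 100)
Octavian.matmul!(C, A, B)   # multi-threaded pure-Julia GEMM, no BLAS library call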
To tie Octavian.jl into this memory allocation discussion:
Octavian uses stack-allocated temporaries when "packing" the left matrix ("A" in "A*B").
These temporaries can have tens of thousands of elements, so that's a non-trivial stack allocation (the memory is mutable to boot). No heap allocations or GC activity needed (just a GC.@preserve to mark its lifetime).
If I understand correctly, this isn't something that'd be possible in Java?
To be fair, you can also just use preallocated global memory for your temporaries, since the maximum amount of memory needed is known ahead of time.
I don't know that the object model is why writing a BLAS in Java doesn't make sense. After all they special case `float` and `double` as primitives, which bifurcates the whole type system and is its own whole issue, but means that you can store them efficiently inline. I'm actually not sure what stops someone from writing a BLAS in Java except that it would be hard and there's no point.
I like your response, and yes, it was a major oversimplification and I'm sorry for that.
Indeed, it is always about design choices and trade-offs. I can see why BLAS code is important and why Julia is an optimal choice for computation heavy problems.
I love GC — it solves a ton of nasty problems in a programming language design with a single feature that users mostly don't have to worry about. Just because you have a GC, however, doesn't mean that it's a good idea to generate as much garbage as you can — garbage collection isn't free. That's where Java IMO went wrong. Java's design — objects are subtypeable (by default) and mutable with reference semantics — generates an absolutely epic amount of garbage. It seems like the hope was that improvements in GC technology would make this a non-issue in the future, but we're in the future and it hasn't turned out that way: even with vast amounts of money that have been spent on JVM GCs, garbage is still often an issue in Java. And this has given GC in general a bad name, IMO quite unfairly. It just happens that Java simultaneously popularized GC and gave it a bad name by having a design that made it virtually impossible for the GC to keep up with the amount of garbage that was generated.
It is entirely possible to design a garbage collected language that doesn't generate so goddamned much garbage — and this works much, much better because a relatively simple GC can easily keep up. Julia and Go are good examples of this. Julia uses immutable types extensively and by default, while Go uses value semantics, which has a similar effect on garbage (but has other issues). With a language design that doesn't spew so much garbage, if you only care about throughput, a relatively simple generational mark-and-sweep collector is totally fine. This is what Julia has. If you also want to minimize GC pause latency, then you need to get fancier like Go (I think they have a concurrent collector that can be paused when its time slice is up and resumed later).
Persistent data structures are a whole different question that I haven't really spent much time thinking about. Clojure seems to be the state of the art there but I have no idea if that's because of the JVM or despite it.
> If you also want to minimize GC pause latency, then you need to get fancier like Go (I think they have a concurrent collector that can be paused when its time slice is up and resumed later).
How possible would it be for Julia to add this? I keep thinking Julia would be great for graphical environments and gaming, but high GC latency won't work there.
Very doable, “just” a bunch of moderately tricky compiler work. Will happen at some point. Things that would make it happen sooner: someone interested in compiler work decides to do it; some company decides to fund it.
About the refcounting approach, you may want to look at the Perceus paper. It's refcounting with dynamic reuse of memory that isn't shared (like a sort of runtime linear typing), and it's used in Koka for functional programming.
No, but there's finally a JEP, which normally means that release is imminent (I'm not involved with that project, so I have no inside information): https://openjdk.java.net/jeps/401
As part of this change, existing built-in primitive types (like int or double) will retroactively become instances of these more general objects.
There are all sorts of limitations right now. E.g. you can't allocate an array, dynamic dispatch is prohibited, and there are some bugs too.
Most of this is just a relic from StaticCompiler.jl being a very straightforward slight repurposing of GPUCompiler.jl. It will take some work to make it robust on CPU code, but the path to doing so is pretty straightforward. It just requires dedicated work, and it's not a top priority for anyone who has the know-how currently.
I think this isn't really a great place for beginners though unfortunately. This project is tightly coupled to undocumented internals of not only julia's compiler, but also LLVM.jl and GPUCompiler.jl. It'll require a lot of learning to be able to meaningfully contribute at this stage.
Currently you can make a relocatable “bundle” / “app” with PackageCompiler.jl, but the bundle itself includes a Julia runtime.
Making a nice small static binary is technically possible using an approach similar to what GPUCompiler.jl does, but the CPU equivalent of that isn’t quite ready for primetime.
I think something to that effect was implicit in "the bundle itself includes a Julia runtime," but I vouched for this comment anyway since it's an important limitation and the parent comment evidently wasn't explicit enough to prevent confusion.
They are talking about two different systems. Static compilation is a separate project which is trying to include only the compiled code that is required. That isn't ready yet for normal people like me, but if you have the know-how and your program meets certain requirements, you can get a tiny binary.
PackageCompiler.jl just compiles everything and packages it up. It generates huge files, because it doesn't discriminate on which compiled stuff to include.
I see, thanks. Looks like static compilation will only work if the entire program is “type stable”, which AFAICT means that the type of every variable can be deduced statically.
> Its a suggestion to fix the awkwardness, one that will never get approved
You were courageous to even try :-)
From their refusal to see any use in explicit variable declarations, to their (somewhat related) huge scope debacle and its strange and irregular 'resolution', not to mention the original absurdly weird propositions they had made to resolve it: the scope and variable declaration subject is pretty hopeless in Julia land. I quickly gave up on it years ago (long before the scope debacle), as I had no intention of losing my time once I saw the arguments and the logic they used.
This is just a disagreement over basic design: should variable declarations be explicit or not. It is a choice, and something that reasonable people can disagree on.
Framing this as a case of irrational and illogical behaviour is unnecessary and unreasonable in my opinion. A lot of serious thought and debate went into the resolution. There is no need to disrespect and badmouth people because they have different priorities than you.
I don't have a strong opinion on this, one way or the other. It's less verbose (which I like), and more familiar to those used to dynamic languages like python, matlab, etc. But this isn't my decision, I'm ok with either.
The feature I'm most excited about is the parallel — and automatic — precompilation. Combined with the iterative latency improvements, Julia 1.6 has far fewer coffee breaks.
Ohh, is that what the programmers were doing all through Halt and Catch Fire? Waiting for compilation? I couldn't understand how they got away with acting like naughty 5 year olds, throwing things at each other constantly.
I think so - the Julia master branch (1.7 precursor) works on M1, but not all the dependencies that some packages require have been built for M1 yet. Though, I understand that the folks behind the wonderful packaging system are working on it.
Yeah, we've managed to get Julia itself running pretty well on the M1, there are still a few outstanding issues such as backtraces not being as high-quality as on other platforms. You can see the overall tracking issue [0] for a more granular status on the platform support.
For the package ecosystem as a whole, we will be slowly increasing the number of third-party packages that are built for aarch64-darwin, but this is a major undertaking, so I don't expect it to be truly "finished" for 3-6 months. This is due to both technical issues (packages may not build cleanly on aarch64-darwin and may need some patching/updating especially since some of our compilers like gfortran are prerelease testing builds, building for aarch64-darwin means that the packages must be marked as compatible with Julia 1.6+ only--due to a limitation in Julia 1.5-, etc...) as well as practical (Our packaging team is primarily volunteers and they only have so much bandwidth to help fix compilation issues).
I think it's more interesting to see what people do with the language instead of focusing on microbenchmarks. There's for instance this great package https://github.com/JuliaSIMD/LoopVectorization.jl which exports a simple macro `@avx` which you can stick to loops to vectorize them in ways better than the compiler (=LLVM). It's quite remarkable you can implement this in the language as a package as opposed to having LLVM improve or the julia compiler team figure this out.
And then replacing the matmul.jl with the following:
@avx for i = 1:m, j = 1:p
    z = 0.0
    for k = 1:n
        z += a[i, k] * b[k, j]
    end
    out[i, j] = z
end
I get a 4x speedup, from 2.72s to 0.63s. And with @avxt (threaded) using 8 threads it goes down to 0.082s on my AMD Ryzen CPU. (So this is not dispatching to MKL/OpenBLAS/etc.) Doing the same in native Python takes 403.781s on this system -- haven't tried the others.
I've rewritten two major pipelines from numpy-heavy, fairly optimized Python to Julia and gotten a 30x performance improvement in one, and 10x in the other. It's pretty fast!
Looks like they're just multiplying two 100x100 matrices, once? (Maybe I'm reading it wrong?) In Julia, runtime would be dominated by compilation + startup time.
A fair comparison with C++ would be to at least include the compilation/linking time into the time reported.
Ditto for Java or any JVM language (you'd have JVM startup cost but that doesn't count the compilation time for bytecode).
Generally, for stuff like this (scientific computing benchmarks) you want to run a lot of computation precisely to avoid such effects (i.e. you want to let the cost of compilation & startup amortize fairly).
This appears to be a set of benchmarks of how fast a brainfuck interpreter implemented in different programming languages is on a small set of brainfuck programs? What a bizarre thing to care about benchmarks for. Are you planning on using Julia by writing brainfuck code and then running it through an interpreter written in Julia?
Seems like you're the founder of Julia. Why such a knee jerk reaction? Did you read the benchmark page? The table of content is right at the top.
This type of reaction is seen everywhere in the Julia community, and the optics aren't good. My advice is to embrace negativity around the language, try to understand whether it is fabrication or legitimate, and address the shortcomings.
Julia is a beautiful language, and I hope some of its warts get fixed.
When I wrote that I was under the impression that the brainfuck interpreter implementations were the only benchmarks in the repo. There are, however (I now realize), also benchmarks for base64 decoding, JSON parsing, and writing your own matmul (rather than calling a BLAS matmul, which is not generally recommended), so this is more reasonable than I thought but still a somewhat odd collection of tasks to benchmark. Of course, microbenchmarks are hard — they are all fairly arbitrary and random.
In a delightful twist, it seems that there is a Julia implementation of a Brainfuck JIT that is much faster than the fastest interpreter that is benchmarked here, so even by this somewhat esoteric benchmark, Julia ends up being absurdly fast.
I'm a daily Julia user but tbh I've gotta agree with parent commenter. I think Jeff's attitude in the "What's bad about Julia" talk is the right way to handle criticism: listen to the person, ask about their use cases, understand how Julia could be improved for that user. Accepting criticism makes a good product, and seeing project leaders do it makes a good impression.
It's not that we're delicate, it's that poor communication between users and maintainers causes problems. As for "one comment", OP already mentioned that defensiveness is becoming an issue in the community.
Idk, but just a few weeks ago I started looking at Julia, partly because of the performance claims. I wanted to write a program a bit heavier than your average starter program, so I wrote a back-tracker (automatic layout for stripboards, to be precise). It was
* interesting (not fun) to find out how Julia works
* annoying AF to discover that much of the teaching material was hidden behind some 3rd party website, presumably in videos (I didn't bother to register, but started browsing the manual instead). What's wrong with text?
* unnecessarily complex because the documentation for the basic functions is nearly inaccessible to beginners.
But, I managed to get a simple layout system up and running, and it wasn't fast. I rewrote it in Go (the language in which I'm currently working most), and it was literally >100x faster. And that should not be due to the startup costs, because a backtracker shouldn't have that much overhead JIT-ing.
I think I can now say that I can't see the use case for Julia. "Faster than Python" is simply not good enough, and for the rest there are no redeeming features. Perhaps the fabled partial differential equation module is worth it, but that can get ported to other languages, I guess.
Your relative skill and time invested in Julia vs Go makes that a not very fair comparison, I think. A 100x difference in performance is probably a sign of something that could be fixed in your code (common one: type instability). In general, Julia is being used to implement things like competitive versions of BLAS. Your Julia code can almost certainly be made much faster.
Coming from a Python and C++ background, I found it sufficient to just read the docs and do some Advent of Code problems to get productive in Julia. What videos are you talking about? https://docs.julialang.org/en/v1/manual/performance-tips/ I found to be a pretty good document on why and when Julia can be slow.
I simply do not understand how some people are able to form such strong opinions in such a short time, and spew out disdain and negativity on the most flimsy basis. It's a matter of temperament, I guess.
Julia performance should be on par with Go, if it's slower, read the performance tips in the manual. As for teaching material on 3rd party websites, I don't know what you mean. The Julia manual is available from the julialang.org website.
As for re-writing DifferentialEquations, that is extremely strongly tied to the multiple dispatch paradigm, re-writing it would be hard. What you can get is wrappers like diffeqpy and diffeqr, which call out to Julia.
You can verify that the teaching materials are not really up to scratch. Even nim and zig, which have less resources behind them, I think, do a better job there. The manual is a reference manual, and it was difficult to find all the operations on arrays. E.g., the difference between Array{Int} and Array{Int,1} is not clarified from the start.
And as I said: I wrote a straightforward backtracker. It's just recursive function calls: check a possible state for the current item, and when successful, update the overall state and move on to the next item; on return, try another state for the current item, until the search space is exhausted. There's not a lot to optimize, nor is there a lot of work for a JIT compiler.
> on the most flimsy basis
I've got more gripes. Forward type declaration to name one. But I'm not spewing disdain: I just don't see Julia take a larger role in general software development.
I have no particular opinion on the teaching materials, I just use the manual and the discussion fora, so I don't know. But if a third party offers teaching materials, it's not so strange if it resides on their third party website.
As for performance, I'm not really talking about 'optimization'. Your implementation may simply have used some pattern that should be avoided, such as global variables, type instabilities, abstract types in structs, or some inappropriate data structures. If it's a microbenchmark, then there are some things to keep in mind.
These are not really optimizations, but basic performance principles. I cannot know that you are unaware of them, but your statement that 'there's not a lot to optimize' make me suspect that this could be the case. The unusual thing about Julia is that it's both dynamic and compiled, so that code that would simply not compile in static languages instead ends up slow.
If I had to guess, your problem is type stability. Are you using NamedTuples to store your state and the items you’re iterating over? If the keys are not all the same and the value types don’t stay the same (e.g. something initialized as zero(Int) and then accumulated into with Float64s), then performance will suffer. Another possibility is that you have a data type that is not concrete in an inner loop. For example, Array{Real} will be slower than Array{Float64} because an array of Reals has to support arrays mixing Float32 and Float64. If you had this in a function definition, the likely correct thing to do is Array{<:Real}, which means the element type of the array must be a subtype of Real. Maybe even better, just drop the type annotations. They very, very rarely improve performance because Julia relies on compile time type inference.
Failed or bad type inference is almost always the cause of performance issues in Julia. Getting a feel for when the compiler can infer things or not takes practice, but it’s a lot easier than the semantics of generic programming systems IMO.
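Here's a minimal sketch of that accumulator pitfall (function names made up):

# unstable: `acc` starts as an Int, then becomes a Float64 inside the loop
function sum_unstable(xs)
    acc = 0
    for x in xs
        acc += x
    end
    acc
end

# stable: the accumulator keeps one concrete type throughout
function sum_stable(xs)
    acc = zero(eltype(xs))
    for x in xs
        acc += x
    end
    acc
end

# @code_warntype sum_unstable(rand(100)) highlights the Union{Float64, Int64} inference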
The REPL is really great for learning. If you type “Array{Int} == Array{Int, 1}” the result is false. If you type “?Array” it prints the docstring which gives some guidance on how to use one versus the other.
I think this particular Julia code is pretty misleading, and I'm (probably) one of the most qualified people in this particular neck of the woods. I wrote a transpiler for Julia that converts a Brainfuck program to a native Julia function at parse time, which you can then call like you would any other julia function.
Here's code I ran, with results:
julia> using GalaxyBrain, BenchmarkTools
julia> bench = bf"""
>++[<+++++++++++++>-]<[[>+>+<<-]>[<+>-]++++++++
[>++++++++<-]>.[-]<<>++++++++++[>++++++++++[>++
++++++++[>++++++++++[>++++++++++[>++++++++++[>+
+++++++++[-]<-]<-]<-]<-]<-]<-]<-]++++++++++."""
julia> @benchmark $(bench)(; output=devnull, memory_size=100)
BenchmarkTools.Trial:
memory estimate: 352 bytes
allocs estimate: 3
--------------
minimum time: 96.706 ms (0.00% GC)
median time: 97.633 ms (0.00% GC)
mean time: 98.347 ms (0.00% GC)
maximum time: 102.814 ms (0.00% GC)
--------------
samples: 51
evals/sample: 1
julia> mandel = bf"(not printing for brevity's sake)"
julia> @benchmark $(mandel)(; output=devnull, memory_size=500)
BenchmarkTools.Trial:
memory estimate: 784 bytes
allocs estimate: 3
--------------
minimum time: 1.006 s (0.00% GC)
median time: 1.009 s (0.00% GC)
mean time: 1.011 s (0.00% GC)
maximum time: 1.022 s (0.00% GC)
--------------
samples: 5
evals/sample: 1
Note that, conservatively, GalaxyBrain is about 8 times faster than C++ on "bench.b" and 13 times faster than C on "mandel.b," with each being the fastest language for the respective benchmarks. In addition, it allocates almost no memory relative to the other programs, which measure memory usage in MiB.
You could argue that I might see similar speedup for other languages on my machine, assuming I have a spectacularly fast setup, but this person ran their benchmarks on a tenth generation Intel CPU, whereas mine's an eighth generation Intel CPU:
julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
But note that OP uses larger cells (`int` = 32 bit in the C version, `Int` = 64 bit in the Julia version) while GalaxyBrain seems to use 8 bit cells. Not that I expect this to make a major difference (but perhaps a minor one?)
The real issue is that the original brainfuck spec (as given by the Wikipedia entry) explicitly sets the size of each cell to a single byte, which means many of the interpreters used for this benchmark are using incorrect cell sizes!
Is that truly accurate though? I could see them comparing, say, load time of data files plus execution time, but combining compile times in there doesn't make much sense. You always have to pay for it in Julia, but not with a statically compiled binary.
I'm not a huge Julia user, but typically if they don't specifically mention they're segmenting runtime from compilation time with Julia, that's a bit of a red flag, because unlike Rust, Go, or C++ the compilation step isn't separate in Julia. To the user it just looks like it's running, when in reality it's compiling, then running, without really letting you know in between.
In the matrix multiplication example, the measurement is done via a simple
t = time()
results = calc(n)
elapsed = time() - t
So startup time at least isn't included.
One might argue that this is still biased against Julia due to its compilation strategy, but fixing that would mean you'd have to figure out what the appropriate way to get 'equivalent' timings for any of the other languages would be as well - something far more involved than just slapping a timer around a block of code in all cases...
edit: As pointed out below, the Julia code should indeed already have been 'warmed up' due to a preceding sanity check. My apologies for 'lying'...
Ah, I have to take that back, since benchmarks run in the order of seconds and they use sockets to start and stop the timer, which likely means compilation time is not included.
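For what it's worth, the usual way to keep JIT compilation out of a measurement is BenchmarkTools; a sketch, reusing the `calc(n)` call from the snippet above:

using BenchmarkTools
@btime calc($n)   # warms up, then samples many runs, so compile time isn't counted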
I think I can answer that. First of all, Julia isn't as fast as C/C++/Nim etc. in most cases; Julia is just fast in scientific computing, that's all. (There is only one "scientific" benchmark in the kostya benchmarks.)
Second, to write very fast Julia you need to know a lot of "tricks", and in most cases it won't be as easy as writing normal code.
And all the people claiming this benchmark measures compilation time (XD?) or doesn't account for JIT time could have just looked at the code/readme for 5 seconds before commenting.
Julia is fast and can be as fast as C, but not in all cases and not as easily as it seems.
> Second, to write very fast Julia you need to know a lot of "tricks", and in most cases it won't be as easy as writing normal code.
That's true in literally any language. Some languages require inlined assembly. Others require preprocessor directives. In almost all languages, you need to understand the difference between stack and heap, know how to minimize allocations, know how to minimize dynamic dispatch, know how to efficiently structure cache-friendly memory layouts. And of course, data structures & algorithms 101.
In terms of performance, Julia provides the following:
1. Zero-cost abstractions. And since it has homoiconic macros, users can create their own zero-cost abstractions, e.g. AoS to SoA conversions, auto-vectorization. Managing the complexity-performance trade-off is critical. But you don't see that in micro-benchmarks.
2. Fast iteration speed. Julia is optimized for interactive computing. I can compile any function and look at its typed SSA form, LLVM IR, or native assembly. And I can inspect this in a Pluto notebook. Optimizing Julia is fun, which is less true in other languages.
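Concretely, those inspection tools are one-liners (illustrative function):

f(x) = 2x + 1

@code_typed  f(1.0)   # typed SSA-form IR
@code_llvm   f(1.0)   # LLVM IR
@code_native f(1.0)   # native assembly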
> That's true in literally any language. Some languages require inlined assembly. Others require preprocessor directives. In almost all languages, you need to understand the difference between stack and heap, know how to minimize allocations, know how to minimize dynamic dispatch, know how to efficiently structure cache-friendly memory layouts. And of course, data structures & algorithms 101.
I think what s/he meant to say is that Julia is not "magically" faster than other languages. The real questions are:
1. Can unoptimised Julia code run as fast as unoptimised c/c++ code? I think the linked benchmark suggests this is not really the case.
2. Can optimised Julia code run faster than comparably (i.e. requiring similar amount of effort and expertise) optimised c/c++ code? If not, then why use Julia?
> Julia is not "magically" faster than other languages
That's somewhat true, and is at the end-point of some mismatched expectations when folks come to Julia. Julia is a high-level dynamic language whose semantics are conducive to creating the ~same performance as static languages.
So if your unoptimized Julia program relies upon traditional "dynamic" features like `Any[]` arrays, then you should expect to see dynamic- (read: python-) like performance out of Julia. Julia should match the performance of other dynamic languages here, but the compiler doesn't have all the typical dynamic optimizations because, well, it's often easy to write your code in a manner that ends up hitting the happy path that gets the static-like performance.
Conversely, if your dynamic language baseline is just glue to an optimized static library, then you should expect to see static-like (read: C/C++-like) performance out of your dynamic language. Julia really should match performance here, and if it doesn't, open an issue: it's a bug.
Where Julia truly excels are the cases where you don't have a library implementation (like numpy) to lean on and find yourself writing a hot `for` loop in a dynamic language. Further, it excels at facilitating library creation, leading to more and more first-class ecosystems that are best-in-class like DiffEq.
> So if your unoptimized Julia program relies upon traditional "dynamic" features like
Dynamic dispatch is slow in any language, including C/C++ (provided that the compiler can't devirtualize the method). This is why such things are never done in an inner loop.
In C++, it's harder to "accidentally" use dynamic dispatch because you have to explicitly annotate a function as being virtual. In Julia, which is much more concise, type stability or instability is implicit. But it can be inspected statically via @code_warntype. Good IDE plug-ins can make it easier.
Julia optimizes for a different thing. You can get your result, as in the actual useful thing that the code does/produces, much faster than with C/C++. You can skip type annotations, not worry about the memory usage, and write your code interactively using REPL or the excellent Revise.jl package.
If you have saved a couple of minutes or hours of coding and are only going to run that code a handful of times, it should not matter if it runs a second or two slower than C/C++. This is the same rationale that Python and other scripting languages have. But unlike Python, you should be able to match the speed of C/C++ or get pretty close by optimizing your code.
Yes I get your point. I guess I should have phrased my first question like the following
1. Can unoptimised Julia code run faster than unoptimised Python code (with numpy being used to do the heavy lifting)?
Let's say one is prototyping some algorithm so iteration speed is more relevant than running speed. Then one can choose either Julia or Python (with the help of numpy perhaps) and get an implementation in similar timeframes. So Julia won't necessarily be more attractive here.
Now if the prototype proved that running speed is very critical to the successful application of the algorithm, then it would mean the developer now has to optimise the hell out of it. One can either:
1. Optimise the Julia codebase, if Julia was used to prototype, following the many tips and tricks available (e.g. type stability, various macros, etc.).
2. Port the algorithm to C/C++, applying the many performance best practices that people have accumulated over the years.
So if the optimised C/C++ port is capable of being any faster than the optimised Julia code, then the rational choice would be to port the implementation using C/C++; it would also mean Python would have some advantage over Julia in the prototyping phase too due to its popularity. Otherwise I'd agree that using a single language to both do prototyping and production is the best.
This depends on what you mean by '(un)optimized code'. Because there's a difference between unoptimized and naive code.
'Unoptimized' code should still observe most of the performance tips in the manual (such as avoiding globals and type instability), while 'naive' code frequently does not. With some experience, you never write naive code, even for quick prototypes.
In those cases, Julia should outperform other dynamic languages significantly, and approach static languages in most cases.
Proper optimization means going in and removing allocations, ensuring that operations vectorize (simd), tailoring data structures for performance, adding parallelism etc. In the latter case Julia should virtually _always_ match static languages closely, otherwise it merits investigation.
Well, there are no type stability or scoping rules to worry about in Python, so just for the sake of comparing the two, I was indeed thinking of 'naive' Julia code vs 'naive' Python code.
The thing with Python is that 'naive' Python code is already pretty close to 'unoptimised' Python code, so one can write naive Python code with numpy and still ends up with not-too-shabby performance, provided they chose an efficient algorithm, of course. In other words, there are not as many performance mistakes one can make with Python (perhaps because it can't get any worse). I imagine that's also why so many Python users who tried Julia were disappointed that direct translations of their Python program fail to perform as fast as advertised.
The point is, once you've gotten used to Julia you tend to write good code most of the time without even thinking about it. And that good code still "looks good," meaning it takes advantage of Julia's expressiveness and brevity. Understandably, newcomers make many more performance mistakes.
So there's often a huge difference between "unoptimized code" (something written by an experienced developer who's deliberately taking the easy way out) and "naive code" (something a newcomer might write). There can literally be orders-of-magnitude performance difference.
I agree that there isn't as much to learn about Python. But of course that's largely because of the gap in opportunities.
To be fair, Julia gives you better tools to analyze your code and figure out how to write it more efficiently. Being able to look at all the steps the JIT compiler will perform on an individual function helps a lot in building an intuition about what you should and should not do while writing high-performance Julia code.
Is there a per-project way to manage dependencies yet? I find global package installation to be the biggest weakness of all the R projects out there. Anaconda can help, but it’s not widely used for R projects. And Docker... well, don’t get me started.
Yeah. Julia's had that since (at least) 1.0. Environments are built-in, and you specify project dependencies in a Project.toml file https://pkgdocs.julialang.org/v1/toml-files/.
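A rough sketch of the workflow from the REPL (the package name is just an example):

julia> ]                          # enter Pkg mode
(@v1.6) pkg> activate .           # use ./Project.toml in the current directory
(MyProject) pkg> add DataFrames   # records the dependency in Project.toml / Manifest.toml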
Since 0.7 (which was 1.0 with deprecations)
In julia 0.6 and before it was exactly as bad as described.
(though there were things like Playground.jl to kind of work around it)
I have heard it has to do with how Windows antivirus works.
Since the registry is like 10,000 separate files, it chokes on them.
I have heard there is an upcoming feature to allow the package manager to work with the registry kept inside a tarball, which is specifically being added to deal with this.
This is a Windows issue, I'm pretty sure. My solution is easy: install Julia under WSL. In fact, after moving from MacOS to Windows, this is my go-to solution: install as much as I can under WSL.
Many years of being on the receiving end of issues shows that there are a large number of different potential problems which manifest themselves in similar ways. Some are fixed, some persist, some may be new. If people don’t file issues describing exactly what they tried and what happened, this vague complaint just goes into the bucket of “Who knows? Hopefully someone else files a proper bug report.”
I’m not sure what issue you think you identified by googling such a short problem description, but it seems like it could be any of:
- slow internet connection
- firewall / proxy issues
- antivirus gumming things up
- file system being slow when dealing with lots of small files (Windows mostly)
- precompiling Plots took longer than expected
- precompiling Plots hit a deadlock
- loading Plots took longer than expected
- loading Plots hit a deadlock
- something else?
Worse, what “stuck” means is also ambiguous. Does that mean it failed with an error? Does that mean a download started but then was too slow for the user’s taste? Does it mean that a download started but never got any data at all? How long did the user wait?
My best guess is that git cloning the registry on Windows is taking a long time and isn’t actually stuck. There’s a fix for that being worked on for 1.7 (don’t unpack registries).
Maybe I misread this, but the milestone "1.6 blockers" still has 3 open issues with "1.6 now considered feature-complete. This milestone tracks release-blocking issues." - so how can 1.6 be ready?
You guys are doing great. Julia is really taking shape. Can't wait to jump ship from Python.
It's also nice to see that you (personally) are sponsoring zig development. There is so much more room for improvement in the arena of programming languages. Infrastructure like this is a huge multiplier.
I'm excited about Zig and I've spent many evenings looking at Andrew's streams. I haven't gotten the time to try it out properly myself but I'm looking forward to it.