Why Julia's DataFrames Are Still Slow (johnmyleswhite.com)
60 points by johnmyleswhite on Nov 29, 2015 | 21 comments



Relevant video from juliacon on the type system core: https://www.youtube.com/watch?v=xUP3cSKb8sI

Julia's type system is geared towards JIT compilation. Methods are compiled when they are first called. Then, of course, full type information for the arguments is available. That's quite enough for Matlab-style code with the occasional JITted method. But Julia has one glaring disadvantage for everything beyond that: The return type of methods cannot be specified / enforced.

1) With "black box" methods (where the return type cannot be inferred as in this DataFrames article) the code becomes horribly slow. And you have to dig into the internal method representations to become aware of the type inference results.

2) It hurts the ability of Julia to produce binary executables. When the types are not 100% inferrable, the entire JIT infrastructure needs to be dragged along.

3) Types are not only an aid for the compiler, but also an aid for the programmer. With SIUnits [1] and method return types, Julia could even tell when the physics represented in the code is flawed!!
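
To make point 1 concrete, here is a small made-up example: nothing about `unstable` below is special, but because its return type depends on a runtime value, inference gives up, and the only way to notice is to inspect the compiler's view with @code_warntype (or code_typed):

    # return type depends on a runtime value, so inference sees Union{Float64,Int64}
    unstable(x) = x > 0 ? 1.0 : 1
    caller(x) = unstable(x) + 1      # the uncertainty propagates to every caller

    @code_warntype caller(2)         # you only find out by looking here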

If Julia's type system were stronger, it could become a prime platform to develop Computer Algebra Systems (CAS). That could lead to a great unification of symbolic and numerical "computation platforms". However, current Julia is unable to represent the mathematics encoded in the type system of open source CAS like Axiom [2]. Also note the github issue on Julia and dependent typing [3].

Imho, there is still great potential to strengthen the Julia type system without breaking existing code.

[1] https://github.com/Keno/SIUnits.jl

[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27....

[3] https://github.com/JuliaLang/julia/issues/6113


> The return type of methods cannot be specified / enforced.

That's true in one sense - typed functions aren't really a thing yet (functor types are, but they're more clumsy to use in a lot of cases).

With regards to your first point, you can actually specify the types of returned values by utilizing assertions:

  julia> h{T}(x::T) = f(g(x))::Complex{T}
  h (generic function with 1 method)
Type inference should totally pick up on the return type declared here, and an error is thrown if `f(x)` does not return something of type `Complex{T}`.

Assertions obviously don't address all of the concerns you listed, but I find they still help a lot with type inference "failures" in some places, and they seem to be generally underutilized.

This doesn't help with any boxing or slowness inherent to `f` or `g`, but downstream methods (e.g. callers of `h`) can definitely benefit from the explicit information.
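
For example (`f` and `g` here are made-up stand-ins, not anything from the article), the assertion lets inference give callers of `h` a concrete type even though `g` itself stays a black box:

    g(x) = first(Any[x + 1])            # inference only sees Any coming out of this
    f(y) = y * im
    h{T}(x::T) = f(g(x))::Complex{T}    # the assertion narrows the result for inference

    caller(x) = abs(h(x))               # downstream code now gets a concrete type
    @code_warntype caller(2)            # the h(x) call should show up as Complex{Int64}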


> if `f(x)` does not return something of type `Complex{T}`

Sorry, meant to say "if `f(g(x))` does not return something of type `Complex{T}`"


Personally I would like to see Julia go full static typing. The dynamism of its current type system seems to offer me no advantages and constantly causes me performance problems. My code ends up covered in type declarations because I'm never sure when types are concrete or not, which kind of defeats the point of implicit typing. A lot of the time a seemingly innocuous change will result in an Array{Float64,1} getting changed to an Array{Any,1}, which then gets propagated automatically through the whole program, since functions can't have return types.

Julia claims to be the future of numerical computing, but they seem to put more effort into coming up with fancy abstractions that no engineer or scientist will ever understand than into working on basic numerics. For example, the array implementation is very hit or miss: it is much slower than, and has much more complicated semantics than, arrays in Fortran. I think this is because arrays are not first-class objects in the language, and probably because the authors don't do much multidimensional high-performance work.


It would be very helpful if you provided some specific examples of code in both Julia and Fortran that is under-performant in Julia. As it stands, your concerns about things like "fancy abstractions that no engineer or scientist will ever understand" are rather too vague to be effectively resolved.


The "fancy abstractions" criticism is kind of orthogonal to the performance one.

My specific complaint about arrays is the design choice not to include the dimensions (sizes) in the type parameters. It means that simple errors only show up at runtime, sometimes hours into a simulation. The fact that you can index with too few or too many indices is a problem for the same reason.

It's hard to provide specific examples because a lot of the time I don't understand why something is happening. For example,

    n = 10; [i for i=1:n]
gives an Array{Any,1}, but when I try to figure out why in the interactive prompt,

    [i for i=1:10]
gives an Array{Int64,1}. In my experience the Any version is slower, and these crop up all over, deep in my programs. So I have to manually use

    n = 10; [i::Int64 for i=1:n]
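A typed comprehension seems to be another way to pin the element type (assuming I have the syntax right):

    n = 10; Int64[i for i=1:n]    # eltype forced to Int64 regardless of what inference says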
The type system is generally quite nice and I like working with it a lot. But none of my colleagues know what it is or care how it works; they just want to add doubles together. So the fact that things like

    x = Array(1)
give error messages like

    ERROR: MethodError: `convert` has no method matching convert(::Type{Array{T,N}}, ::Int64)
    This may have arisen from a call to the constructor Array{T,N}(...),
    since type constructors fall back to convert methods.
    Closest candidates are:
      call{T}(::Type{T}, ::Any)
      convert(::Type{Array{T,N}}, ::SharedArray{T,N})
      convert{T,N}(::Type{Array{T,N}}, ::AbstractArray{T,N})
      ...
     in call at essentials.jl:56
scares the living bejesus out of anyone who just wants to use arrays without knowing about types, parametric types, constructors, and the call and convert functions (not to mention SharedArray and AbstractArray). All of this is too much abstraction for a normal scientist or engineer to want to deal with. What the error message really should say is

    Missing type parameter in method: Array(<type>, <dimension>)
Array is a parametric type (left incompletely specified, it's effectively abstract), and constructors now fall back to convert, which is all very clever and fits the grand type-system view, but the result is that the debugging output is hugely complicated.
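
For reference, what I actually wanted there was something like:

    x = Array(Float64, 1)    # 1-element Array{Float64,1}, uninitialized
    y = zeros(Float64, 1)    # same shape, zero-filled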

As for the performance, I've been trying to implement my own FixedSizeArrays type thing, where arrays have dimensions as type parameters. My code for 1D looks like this:

    immutable Float1D{nx}
        data :: Array{Float64,1}
        Float1D() = new(Array(Float64,nx))
        function Float1D(xs::Array{Float64,1})
            size(xs) == (nx,) || error("Size of initialization array ($(size(xs))) does not match type parameter ($nx)")
            new(xs)
        end
    end
    import Base: size, getindex, setindex!   # needed so size(xs) and xs[i] dispatch to these
    size{nx}(xs::Float1D{nx}) = (nx,)
    getindex(xs::Float1D, i::Int) = xs.data[i]
    setindex!(xs::Float1D, v::Real, i::Int) = xs.data[i] = v
And its performance is much worse than the native arrays'. I don't know why, since the methods seem simple enough to inline, but I'm sure there is a lot I'm missing. I think maybe it is partly because the Julia Array type and its methods seem to be implemented in C. Either way, having thought about this a lot and understanding the compiler and type system reasonably well (for an end user), I still don't know how to make it go faster. I think that is unreasonably complex for people who just want their code to work fast.


I understand your concerns better now. Here's my input for what it's worth:

(1) List comprehensions are a mess and a recognized problem with the existing compiler. A lot of work has already been done to fix the erratic typing of comprehensions, so I am sure this concern will be resolved before 1.0; I would bet it will be resolved much sooner than that.

(2) I don't really believe that scientists can't learn that Array(Int, 1) and Array(Float64, 1) do different things. This seems like a place where education can be very effective.

(3) Julia's error messages are indeed generally poor. I'm hopeful that we'll see a push towards Elm-quality error messages in the future: http://elm-lang.org/blog/compilers-as-assistants

(4) The core problem with your fixed-size array example is that you're building it on top of the wrong primitives. Julia's Array isn't a suitable data structure for building up fixed-length arrays: Julia's Array is more like C++'s std::vector than like C's arrays. To build a real fixed-length array, you need to build it on top of lower-level primitives (rough sketch below). Many people are already working on that problem.
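
To make the "lower-level primitives" point a bit more concrete: a very rough, untested sketch (names made up) is to keep the data in a tuple, so the length is a type parameter and the elements live inline instead of behind a heap-allocated Array:

    immutable FixedVec{N}
        data::NTuple{N,Float64}   # elements stored inline; N is part of the type
    end

    import Base: length, getindex
    length{N}(::FixedVec{N}) = N
    getindex(v::FixedVec, i::Int) = v.data[i]

    # elementwise add that never allocates an Array
    add{N}(a::FixedVec{N}, b::FixedVec{N}) =
        FixedVec{N}(ntuple(i -> a[i] + b[i], N))

    v = FixedVec{3}((1.0, 2.0, 3.0))
    w = add(v, v)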

Putting it all together, I think your concerns all reflect real problems with the language. You may be better off waiting until Julia 1.0 comes out.


Waiting for 1.0 isn't an option for me; I love the language too much already, even if that wasn't apparent from my generally negative comments. Currently I can work around all the corner cases myself, but encouraging/convincing others of Julia's merits is a difficult task at the minute. Thanks for your feedback on my fixed size arrays code.


Is there no way to take advantage of the fact that most columns, most of the time, are filled with doubles? This is both the expected case and the thing we want to go faster. I don't know compiler design, which is why I ask.


I don't see how this could be done without putting information about the internal representation of DataFrames into the compiler. Such an approach would seem to require breaking important abstraction barriers and would couple the language to a specific back-end for data representation that might need to be replaced in the future to handle things like out-of-core data frames.


One thing I would really like to see happen is out-of-core data handling and statistics in Julia, just like SAS. That's not possible in either R or Python.


If I understand you, this is indeed possible and has been recently implemented in Python. Take a look at dask: http://dask.pydata.org/en/latest/

> Users interact with dask either by making graphs directly or through the dask collections which provide larger-than-memory counterparts to existing popular libraries:


Just to clarify, does the OP's statement about "out of core" mean concurrent processes?


I think he means operating on data that doesn't fit in memory, and as mentioned I think R and Python both have packages for that.


Dask looks quite interesting. Though I am running into some issues with merges and such.

My temporary fix is to use the chunking functionality of Pandas.


There's work being done to wrap Dato's SFrames library, which would solve that problem.


That's just the data structure. Fitting models would be much tougher.


That really depends on how you fit models. Maximum likelihood estimation (or MAP estimation) for models of IID database rows is often trivial to implement using an iterative algorithm that applies to both in-memory and out-of-core data stores.


You mean IRLS or SGD or something like that?


I would use SGD when I can tolerate an inexact estimator. For problems where I need an exact MLE (because, for example, I need to invoke asymptotic normality to get CIs), I can usually do exact computations using less memory than IRLS would require.

In particular, what I had in mind is that most black-box optimization tools (e.g. L-BFGS) only need an implementation of a function and its gradient -- both of which can be accumulated iteratively using O(P) memory and O(N * P) time per iteration for any data store that can stream all N observations. Given code that can accumulate function values and gradients, an optimization method like L-BFGS can estimate parameters for many kinds of models. In my experience, there's often no need to materialize the kinds of dense matrices that are typically used in R's default approach to GLM model fitting.
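
For what it's worth, here is a rough sketch of the accumulation step I mean, written for logistic regression over a source that yields (X, y) chunks; `chunks` stands for whatever iterator your data store provides, and the resulting function can be handed to something like Optim.jl's L-BFGS:

    # negative log-likelihood and gradient accumulated over streamed chunks;
    # memory stays O(P) no matter how many rows N the store holds
    function nll_and_grad(beta, chunks)
        nll, grad = 0.0, zeros(length(beta))
        for (X, y) in chunks               # X: a chunk of rows, y: its 0/1 labels
            mu = 1.0 ./ (1.0 + exp(-(X * beta)))
            nll -= sum(y .* log(mu) + (1.0 - y) .* log(1.0 - mu))
            grad -= X' * (y - mu)
        end
        return nll, grad
    end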


Is there any good alternative library to DataFrames?



