While this is a good concept in theory, I'd be skeptical about building on top of such a system. The primary reason is the slowness of R. I built heavy-duty data mining systems on a stack of kdb/q and R. In my experience, using R for simple clustering algorithms like k-means and k-medoids slowed my system down by a factor of nearly 70, even when running parallelized versions of these algorithms (via the SPRINT R package) under mpiexec.
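For illustration, here's the shape of that workflow as a minimal base-R sketch. The kmeans() call is plain base R; the mpiexec launch line at the bottom is from memory and hedged accordingly, since SPRINT swaps in its own MPI-backed implementations:

    # cluster.R -- single-node baseline with base R's kmeans()
    set.seed(42)
    x <- matrix(rnorm(1e5 * 4), ncol = 4)          # 100k rows, 4 features
    fit <- kmeans(x, centers = 8, iter.max = 50)
    print(table(fit$cluster))

    # With SPRINT, the same script is launched under MPI, roughly:
    #   mpiexec -n 8 R --no-save -f cluster.R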

IMO, there is a very big gap in this space: there is an urgent need for high-performance data inference languages. MATLAB is decent, but still too clunky for my taste. Plus, I prefer the simplicity of a file-mapped, column-oriented database like the one offered by kdb. As kdb is too expensive for me right now, I'm considering building on top of the excellent J language/JDB database stack for my big data needs.
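To make the column-store appeal concrete, here's a toy sketch of the storage pattern in base R (not kdb or JDB, just the underlying idea): each column lives in its own flat binary file, so a query only reads the columns it touches.

    # Write two "columns" as separate flat binary files
    writeBin(rnorm(1e6), "price.bin")
    writeBin(runif(1e6, 1, 1000), "size.bin")

    # A query over price never reads size.bin at all
    price <- readBin("price.bin", what = "double", n = 1e6)
    print(mean(price))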




Blog post author here. SQL Server will be embedding Revolution R Enterprise, which eliminates the slowness and memory constraints of base R for many machine learning and statistical modeling algorithms. It runs parallel (not single-threaded) algorithms in a dedicated sandbox that streams data from SQL Server, eliminating RAM constraints. You can see some benchmarks in this white paper http://www.revolutionanalytics.com/sites/default/files/revol... -- for example, it can run stepwise linear regression on 5M rows and 100 variables in 14 seconds. (Benchmark scripts at: https://github.com/RevolutionAnalytics/Benchmark )
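For a flavour of the API, here's a minimal sketch using RevoScaleR's rxLinMod, which fits a linear model by streaming over the data in chunks (the file name and variable names are illustrative, and the SQL Server data-source plumbing is elided):

    library(RevoScaleR)                  # ships with Revolution R Enterprise
    # Fit a linear model chunk by chunk from an .xdf file, so the
    # full dataset never has to fit in RAM
    fit <- rxLinMod(y ~ x1 + x2, data = "mydata.xdf")
    summary(fit)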


Julia fills this need. So does Python with Numba. Julia is the future: it takes the best syntax from Python and MATLAB, couples that with a beautiful and extensible type system, and uses this to drive LLVM JIT compilation with speed on the order of C.


This will all come down to the implementation of the DataFrame, no?

In other words, if you put q on top of an R DataFrame, would you expect the same performance? Or if you ran R on top of k?


I'm not sure if I understand your question correctly. kx provides shared libraries to make R and q talk to each other. The specific approach I followed used a shared library that brings R into the kdb+ memory space, meaning all the R statistical routines and graphing capabilities can be invoked directly from kdb+. Using this method means data is not passed between remote processes.


Ah, I assumed you were loading a vanilla DataFrame from the column store. My bad :-)


How does Julia work in this space? What's it lacking?


I do not have experience with Julia. I'm interested in languages built specifically for time series data. APL derivatives shine in this regard: they provide a bare-metal language composed of a few very powerful verbs and adverbs. For example, I could whip up a simple linear regression package in about 10 lines of tight code. These languages also lend themselves to parallelism in a very straightforward way via the "parallel apply" verb. In addition, the brevity that languages like q offer is simply unparalleled.
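(For reference, the fit such a package computes reduces to the normal-equations solve, beta = (X'X)^-1 X'y, which is itself only a couple of lines; a base-R sketch with toy data:)

    set.seed(1)
    x <- matrix(rnorm(200), ncol = 2)          # 100 rows, 2 predictors
    y <- 1 + x %*% c(2, -3) + rnorm(100) * 0.1
    X <- cbind(1, x)                           # add intercept column
    beta <- solve(t(X) %*% X, t(X) %*% y)      # normal equations
    print(beta)                                # approximately (1, 2, -3)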


Julia is a little friendlier for people who aren't used to APL's point-free style, but it is designed for performance, going as far as showing you how functions are compiled to LLVM IR and native assembly in the REPL.

http://www.evanmiller.org/why-im-betting-on-julia.html



