While this is a good concept in theory, I'd be skeptical about building on top of such a system. The primary reason is the slowness of R. I built heavy-duty data-mining systems on a stack of kdb/q and R. In my experience, R, when used for simple clustering algorithms like k-means and k-medoids, slowed my system down by a factor of nearly 70. This was despite running parallelized versions of these algorithms (via the SPRINT R package) under mpiexec.
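For reference, the R side of the call is trivial; the cost is in R's execution speed and the data movement, not in the code. A minimal sketch using base R's kmeans (SPRINT wraps an MPI-parallel variant of the same idea; the data here is simulated, whereas in my setup the matrix came out of kdb+):

    # Minimal sketch: k-means in base R on a numeric matrix (illustrative data).
    set.seed(42)
    x <- matrix(rnorm(1e5 * 4), ncol = 4)        # 100k rows, 4 features

    fit <- kmeans(x, centers = 8, iter.max = 100, nstart = 5)
    table(fit$cluster)                            # cluster sizes
    fit$centers                                   # cluster centroids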
IMO, there is a very big gap in this space. There is an urgent need for high-performance data-inference languages. MATLAB is decent, but still too clunky for my taste. Plus, I prefer the simplicity of a file-mapped, column-oriented database like the one kdb offers. As kdb+ is too expensive for me right now, I'm considering building on top of the excellent J language/JDB database stack for my big data needs.
Blog post author here. SQL Server will be embedding Revolution R Enterprise, which eliminates the slowness and memory constraints of base R for many machine learning and statistical modeling algorithms. It runs parallel (not single-threaded) algorithms in a dedicated sandbox that streams data from SQL Server (thus eliminating RAM constraints). You can see some benchmarks in this white paper http://www.revolutionanalytics.com/sites/default/files/revol... -- for example, it can run stepwise linear regression on 5M rows and 100 variables in 14 seconds. (Benchmark scripts at: https://github.com/RevolutionAnalytics/Benchmark )
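For anyone wanting a rough feel for the benchmark task: stepwise linear regression is expressible in a couple of lines of base R, which is exactly why the in-database, parallel execution is the interesting part at 5M-row scale. A small-scale sketch in base R only (the benchmark itself uses Revolution's parallel, external-memory implementations rather than the base functions shown here, and the data below is made up):

    # Small-scale sketch of stepwise linear regression in base R (illustrative data).
    set.seed(1)
    n <- 10000; p <- 20
    df <- as.data.frame(matrix(rnorm(n * p), ncol = p))
    df$y <- 3 * df$V1 - 2 * df$V5 + rnorm(n)

    full <- lm(y ~ ., data = df)
    fit  <- step(full, direction = "both", trace = 0)   # AIC-based stepwise selection
    summary(fit)$coefficients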
Julia fills this need, as does Python with Numba. Julia is the future: it takes the best syntax from Python and MATLAB, couples that with a beautiful and extensible type system, and uses it to drive an LLVM JIT with speed on the order of C.
I'm not sure if I understand your question correctly. kx provides shared libraries to make R and q talk to each other. The specific approach I followed used a shared library that brings R into the kdb+ memory space, meaning all the R statistical routines and graphing capabilities can be invoked directly from kdb+. Using this method means data is not passed between remote processes.
I do not have experience with Julia. I'm interested in languages built specifically for time-series data. APL derivatives shine in this regard, in the sense that they provide a bare-metal language composed of a few very powerful verbs and adverbs. For example, I could whip up a simple linear regression package in about 10 lines of tight code. These languages also lend themselves to parallelism in a very straightforward manner by means of the "parallel apply" verb. In addition, the brevity that languages like q offer is simply unparalleled.
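To give a sense of what such a package boils down to (without writing out the q version here), this is the core of an ordinary least-squares fit written in plain R; the q/J equivalent collapses the same linear algebra into a line or two. Purely illustrative:

    # Core of a simple linear-regression "package": OLS via the normal equations.
    ols <- function(X, y) {
      X <- cbind(Intercept = 1, X)                 # add intercept column
      beta <- solve(crossprod(X), crossprod(X, y)) # (X'X)^-1 X'y
      fitted <- X %*% beta
      list(coef = drop(beta), resid = drop(y - fitted))
    }

    X <- matrix(rnorm(200), ncol = 2)
    y <- 1 + 2 * X[, 1] - 0.5 * X[, 2] + rnorm(100, sd = 0.1)
    ols(X, y)$coef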
Julia is a little friendlier for people who aren't used to APL's point-free style, but it is designed for performance, going as far as showing you how functions compile to LLVM IR and native assembly right in the REPL.
IMHO, scripts running inside a database server never work all that well - debugging is a nightmare. At least that has been my experience from trying Postgres's PL/Python a few years ago.
This article [0] shows exactly this problem and links to patches that address it. The article is 4.5 years old, though, so I'm not sure whether the patches have been upstreamed.
It's not actually in the database. It's a lousy implementation that does not exploit the distributed nature of HANA. You fetch the data first and then run your analyses in R, which is hosted near the DB. If you want parallel processing in R, you're on your own.