While this is a good concept in theory, I'd be skeptical about building on top of such a system. The primary reason is the slowness of R. I built heavy-duty data-mining systems on a stack of kdb/q and R. In my experience, R, when used for simple clustering algorithms like k-means and k-medoids, slowed my system down by a factor of nearly 70. This was despite running parallelized versions of these algorithms (via the SPRINT R package) under mpiexec.
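For reference, the R side of the call is trivial; the cost is in R's execution speed and the data movement, not in the code. A minimal sketch using base R's kmeans (SPRINT wraps an MPI-parallel variant of the same idea; the data here is simulated, whereas in my setup the matrix came out of kdb+):

    # Minimal sketch: k-means in base R on a numeric matrix (illustrative data).
    set.seed(42)
    x <- matrix(rnorm(1e5 * 4), ncol = 4)        # 100k rows, 4 features

    fit <- kmeans(x, centers = 8, iter.max = 100, nstart = 5)
    table(fit$cluster)                            # cluster sizes
    fit$centers                                   # cluster centroids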
IMO, there is a very big gap in this space. There is an urgent need for high-performance data-inference languages. MATLAB is decent, but still too clunky for my taste. Plus, I prefer the simplicity of a file-mapped, column-oriented database like the one kdb offers. As kdb+ is too expensive for me right now, I'm considering building on top of the excellent J language/JDB database stack for my big data needs.
Blog post author here. SQL Server will be embedding Revolution R Enterprise, which eliminates the slowness and memory constraints of base R for many machine learning and statistical modeling algorithms. It runs parallel (not single-threaded) algorithms in a dedicated sandbox that streams data from SQL Server (thus eliminating RAM constraints). You can see some benchmarks in this white paper http://www.revolutionanalytics.com/sites/default/files/revol... -- for example, it can run stepwise linear regression on 5M rows and 100 variables in 14 seconds. (Benchmark scripts at: https://github.com/RevolutionAnalytics/Benchmark )
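For anyone wanting a rough feel for the benchmark task: stepwise linear regression is expressible in a couple of lines of base R, which is exactly why the in-database, parallel execution is the interesting part at 5M-row scale. A small-scale sketch in base R only (the benchmark itself uses Revolution's parallel, external-memory implementations rather than the base functions shown here, and the data below is made up):

    # Small-scale sketch of stepwise linear regression in base R (illustrative data).
    set.seed(1)
    n <- 10000; p <- 20
    df <- as.data.frame(matrix(rnorm(n * p), ncol = p))
    df$y <- 3 * df$V1 - 2 * df$V5 + rnorm(n)

    full <- lm(y ~ ., data = df)
    fit  <- step(full, direction = "both", trace = 0)   # AIC-based stepwise selection
    summary(fit)$coefficients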
Julia fills this need, as does Python with Numba. Julia is the future: it takes the best syntax from Python and MATLAB, couples that with a beautiful and extensible type system, and uses it to drive an LLVM JIT with speed on the order of C.
I'm not sure if I understand your question correctly. kx provides shared libraries to make R and q talk to each other. The specific approach I followed used a shared library that brings R into the kdb+ memory space, meaning all the R statistical routines and graphing capabilities can be invoked directly from kdb+. Using this method means data is not passed between remote processes.
I do not have experience with Julia. I'm interested in languages built specifically for time-series data. APL derivatives shine in this regard, in the sense that they provide a bare-metal language composed of a few very powerful verbs and adverbs. For example, I could whip up a simple linear regression package in about 10 lines of tight code. These languages also lend themselves to parallelism in a very straightforward manner by means of the "parallel apply" verb. In addition, the brevity that languages like q offer is simply unparalleled.
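To give a sense of what such a package boils down to (without writing out the q version here), this is the core of an ordinary least-squares fit written in plain R; the q/J equivalent collapses the same linear algebra into a line or two. Purely illustrative:

    # Core of a simple linear-regression "package": OLS via the normal equations.
    ols <- function(X, y) {
      X <- cbind(Intercept = 1, X)                 # add intercept column
      beta <- solve(crossprod(X), crossprod(X, y)) # (X'X)^-1 X'y
      fitted <- X %*% beta
      list(coef = drop(beta), resid = drop(y - fitted))
    }

    X <- matrix(rnorm(200), ncol = 2)
    y <- 1 + 2 * X[, 1] - 0.5 * X[, 2] + rnorm(100, sd = 0.1)
    ols(X, y)$coef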
Julia is a little friendlier for people who aren't used to APL's point-free style, but it is designed for performance, going as far as showing you how functions compile to LLVM IR and native assembly right in the REPL.
IMHO, scripts running inside a database server never work all that well - debugging is a nightmare. At least that has been my experience from trying Postgres's PL/Python a few years ago.
This article [0] shows exactly this problem and links to patches that address it. The article is 4.5 years old, though, so I'm not sure whether the patches have been upstreamed.
It's not actually in the database. It's a lousy implementation that does not exploit the distributed nature of HANA. You fetch the data first and then run your analyses in R, which is hosted near the DB. If you want parallel processing in R, you're on your own.