In-database R coming to SQL Server 2016 (revolutionanalytics.com)
64 points by Hansi on May 16, 2015 | 31 comments



While this is a good concept in theory, I'd be skeptical about building on top of such a system. The primary reason is the slowness of R. I built heavy-duty data mining systems using a stack of kdb/q and R. In my experience, R, when used for simple clustering algorithms like k-means and k-medoids, slowed down my system by a factor of nearly 70. This is despite running parallelized versions of these algorithms (by means of the SPRINT R package) using mpiexec.

IMO, there is a very big gap in this space. There is an urgent need for high-performance data inference languages. MATLAB is decent, but is still too clunky for my taste. Plus, I prefer the simplicity of a file-mapped, column-oriented database like the one offered by kdb. As kdb is too expensive for me right now, I'm considering building on top of the excellent J language/JDB database stack for my big data needs.


Blog post author here. SQL Server will be embedding Revolution R Enterprise, which eliminates the slowness and memory constraints of base R for many machine learning and statistical modeling algorithms. It runs parallel (not single-threaded) algorithms in a dedicated sandbox that streams data from SQL Server, thereby eliminating RAM constraints. You can see some benchmarks in this white paper: http://www.revolutionanalytics.com/sites/default/files/revol... For example, it can run stepwise linear regression on 5M rows and 100 variables in 14 seconds. (Benchmark scripts at: https://github.com/RevolutionAnalytics/Benchmark )
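To make the streaming idea concrete, here is a minimal Python sketch of fitting a one-variable linear regression from data that arrives in chunks, so that only a handful of running sums stay in memory. It is an illustration only and is not how Revolution R Enterprise is implemented; the function names `streaming_linear_fit` and `make_chunks` and the synthetic data are assumptions made up for this example.

```python
from typing import Iterable, Iterator, List, Tuple

Chunk = List[Tuple[float, float]]


def streaming_linear_fit(chunks: Iterable[Chunk]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept from chunks of (x, y) rows.

    Only five running totals are kept, so memory use does not grow
    with the number of rows. (Hypothetical helper for illustration.)
    """
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sum_x += x
            sum_y += y
            sum_xx += x * x
            sum_xy += x * y
    denom = n * sum_xx - sum_x * sum_x
    if n < 2 or denom == 0:
        raise ValueError("need at least two rows with distinct x values")
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


def make_chunks(n_rows: int, chunk_size: int) -> Iterator[Chunk]:
    """Yield synthetic rows on the line y = 2x + 1, one chunk at a time."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        yield [(float(x), 2.0 * x + 1.0) for x in range(start, stop)]


if __name__ == "__main__":
    slope, intercept = streaming_linear_fit(make_chunks(1000, 100))
    print(slope, intercept)
```

Raw running sums like these can lose precision on badly scaled data, so a production implementation would likely use a more numerically stable update. The point here is only that the fit needs one pass and constant memory.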


Julia fills this need, as does Python with Numba. Julia is the future: it takes the best syntax from Python and MATLAB, couples that with a beautiful and extensible type system, and uses this to drive an LLVM-based JIT with speed on the order of C.


This will all be down to the implementation of DataFrame, no?

In other words, if you put q on top of an R DataFrame, would you expect the same performance? Or if you ran R on top of K?


I'm not sure if I understand your question correctly. kx provides shared libraries to make R and q talk to each other. The specific approach I followed used a shared library that brings R into the kdb+ memory space, meaning all the R statistical routines and graphing capabilities can be invoked directly from kdb+. Using this method means data is not passed between remote processes.


Ah I assumed you were loading a vanilla DataFrame from the column store. My bad :-)


How does Julia work in this space? What's it lacking?


I do not have experience with Julia. I'm interested in languages built for time series data specifically. APL derivatives shine in this regard, in the sense that they provide a bare-metal language composed of a few very powerful verbs and adverbs. For example, I could whip up a simple linear regression package in about 10 lines of tight code. These languages also lend themselves to parallelism in a very straightforward manner by means of the "parallel apply" verb. In addition, the brevity that languages like q offer is simply unparalleled.
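For readers who have not used an APL-family language, the "parallel apply" idea mentioned above can be sketched roughly in Python: take one function and apply it to each of several time series concurrently. This is an analogy only, not q code; `moving_average` and `parallel_apply` are hypothetical helpers written for this example. Threads are used just to keep the sketch self-contained, and CPU-bound work in CPython would usually need processes to get a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Simple moving average using a running total over the window."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    total = sum(series[:window])
    out = [total / window]
    for i in range(window, len(series)):
        # Slide the window: add the new value, drop the oldest one.
        total += series[i] - series[i - window]
        out.append(total / window)
    return out


def parallel_apply(func: Callable[[T], R], items: Sequence[T],
                   max_workers: int = 4) -> List[R]:
    """Apply func to every item concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


if __name__ == "__main__":
    # Hypothetical data: one short series per instrument.
    all_series = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
    print(parallel_apply(lambda s: moving_average(s, 2), all_series))
```

The Python version is noticeably longer than the q one-liner the comment has in mind, which is rather the commenter's point about brevity.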


Julia is a little friendlier for people who aren't used to APL's point-free style, but it is designed for performance, going as far as showing you how functions are compiled to LLVM and native assembler in the REPL.

http://www.evanmiller.org/why-im-betting-on-julia.html


For what it's worth, PostgreSQL has had this since 2003. http://www.joeconway.com/plr/

IMHO scripts running in a database server never work all that well - debugging is a nightmare. At least this has been my experience from trying PG plpython a few years ago.

Link to the original announcement email: http://www.postgresql.org/message-id/3E514A46.2040604@joecon...


R does have some pretty cool tools for after-the-fact debugging, like dump.frames, but few people know about them.


How does that work when R is GPL licensed? Doesn't that make SQL server a derived work?


Revolution have their own implementation of R called OpenR.


No, that's not their own implementation. It's GNU R bundled with some extras.


Not the same R implementation.


I see. Did they re-implement it from scratch?


MS owns Revolution Analytics, can't they just ignore the license [MS now] provided for public use?


Revolution Analytics isn't a copyright holder of R, so no.


I'm always wary of these kinds of ideas, of executing code on/within my database server.

I know it's apparently sandboxed, but that didn't work out too well for ElasticSearch recently: https://jordan-wright.github.io/blog/2015/03/08/elasticsearc....


SQL is code running in the database server. Most SQL implementations (T-SQL, PL/SQL, PL/pgSQL) are Turing-complete as well.
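As a small illustration of SQL doing real computation inside the database engine, here is a sketch that computes factorials with a recursive common table expression. It uses SQLite through Python's standard `sqlite3` module only because that runs anywhere; the dialects named above have their own procedural syntax. The function name `factorials` is made up for this example, and a recursive query is a demonstration of looping in SQL, not a proof of Turing-completeness.

```python
import sqlite3
from typing import List, Tuple


def factorials(limit: int) -> List[Tuple[int, int]]:
    """Return (n, n!) for n = 0..limit, computed entirely in SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        rows = conn.execute(
            """
            WITH RECURSIVE fact(n, value) AS (
                SELECT 0, 1                      -- base case: 0! = 1
                UNION ALL
                SELECT n + 1, value * (n + 1)    -- recursive step
                FROM fact
                WHERE n < ?
            )
            SELECT n, value FROM fact ORDER BY n
            """,
            (limit,),
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    for n, value in factorials(5):
        print(n, value)
```

All of the iteration happens inside the engine; Python only submits the query and reads back the rows.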


I hope it's not as bad as PL/R. Seemed like a good idea, but the performance was so terrible that it was essentially useless.


Any idea what makes PL/R slow? Is it time spent doing data conversions from PostgreSQL types to R types?


This article [0] shows exactly this problem, and it links to patches that address the issue. The article is 4.5 years old; I'm not sure whether the patches have been upstreamed.

[0] http://www.credativ.co.uk/credativ-blog/2010/07/postgresql-t...


How is Java inside Oracle DB doing nowadays?


Shudder


Just so you know, it is already in SAP HANA... It is nice to see programming languages being part of a database and not just an extension.


It's not actually in the database. It's a lousy implementation which does not use the distributed nature of HANA. You fetch the data first and then you do your analyses in R which is hosted near the DB. If you want parallel processing in R, you're on your own.


Why Microsoft? Why? Why not python?


They are not gluing layers, they are adding statistical capabilities. Each language serves its purpose.


I would say why not improve the CLR support that has been there since SQL 2005?


> I would say why not improve the CLR support that has been there since SQL 2005?

Because it won't cause new license sales, whereas integrating R might.



