
I would highly recommend the use of the package data.table over tibble or the basic data.frame if you are doing any type of modeling in R with larger datasets. Yes, R has many data structures, but knowing how to use data.table will blow your mind in terms of efficiency. Matt and the other contributors have built something extremely fast and flexible.
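
For anyone who hasn't seen the syntax, here is a minimal sketch (toy tables and column names, not our actual rating logic) of the keyed join plus update-by-reference style that makes it so fast:

    library(data.table)

    # toy policy and rate tables
    policies <- data.table(policy_id = 1:4,
                           class     = c("A", "B", "A", "C"),
                           exposure  = c(1, 0.5, 2, 1))
    rates <- data.table(class = c("A", "B", "C"),
                        base_rate = c(100, 150, 200))

    # keyed join, then add the premium column by reference (no copy)
    setkey(policies, class)
    setkey(rates, class)
    onlevel <- rates[policies][, premium := base_rate * exposure]

    # grouped aggregation in one expression
    onlevel[, .(total_premium = sum(premium)), by = class]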

I get that R is not for everyone but used correctly it is a beast.

Now this is anecdotal, but in the insurance industry we have what we call on-level premium calculators. It is basically a program that rerates all policies with the current set of rates.

Our current R program can rate 41,000 policies a second, fully vectorized, on a user's laptop with an i5 from 2015.

In contrast, the previous SAS program could do 231 policies a minute on a 64-core Xeon processor from 2017.

For our workload and type of work, R has been a godsend.

Bonus: we can put what our data scientists develop in R directly into production (after peer review, testing, etc., no different from any other production code).

Back when I started in 2005, we modeled in some proprietary software like Emblem, used Excel to build a first draft of the premium calculator, rebuilt the computation in SAS for the on-level program, and sent specs to IT to rebuild the program yet again for production. All three had to produce the same results.

I've tried Python, Go, Rust, and Julia. I'd say Python could be a good alternative, but the speed of data.table, the RStudio IDE, and the ease of package management in R make R an obvious choice for us. I believe Julia to be the future, but so far the in-house adoption rate has been low.




As someone "fully fluent" in both, for many workflows that can be properly implemented in SAS, you would expect on a technical level that the SAS program could be faster. It's a fully compiled language, it's a "simple" compilation model (compared to R), and the interaction between incremental compilation and the macro system allows you to do some really good blurring between run time and compilation when performance matters. Plus, by abusing the fact that you can define both SQL and data step views to further minimise disk reads/writes, using database pass-through on certain procedures, and allowing for in-memory operations (like R) with the sasfile command, an experienced user of both should, from a purely technical point of view, be able to beat R in SAS.

But... and here's the big but... these days I almost never meet anyone capable of putting all these steps together in SAS who actually understands the SAS computation model end to end.

And SAS's strength, a computation model that is not limited by memory by default, becomes a performance weakness when everyone reads/writes every step out to disk and programs without understanding all those little intricacies. SAS hasn't helped any of this by trying to move its ecosystem away from "programmers" towards "application users", so now "programmers" can pick up an interpreted language like R, with in-memory vectorised operations by default, and beat SAS.

Of course, I'd still recommend places move to Python/R these days because of the broader ecosystems, the university talent pool, and avoiding the extensive lock-in of proprietary software, but I still feel I have to reflexively respond to "R faster than SAS" claims :p


Believe me, I know. The code just becomes unreadable when you put all execution inside the same data step and use hash tables to do fast small-to-big merging. And not to mention debugging that mess when you have a macro layer on top of it. Not having access to function source code, the installation process being what it was... I do not miss it.

And yes, technically SAS is faster than R, but part of the equation is how many people can actually make SAS code faster than R/Python. I had maybe 1-2 people who could write efficient SAS code.

One version we had was a bunch of macros producing hash merges, plus the whole "how can I do this without having to get out of the data step" mindset. Just horrible. Limits on the number of characters in a line of code? You forgot a quote somewhere and now you have to run the magic line.

I hope I'm not too emotional when I say I hope SAS disappears from my industry and we embrace less adversarial licensing.


I don't think that's being emotional at all.

I'm being emotional when I say I have a soft spot for it because of some nostalgia and occasionally dropping in to do some "rock star" programming moments with it. But that's the opposite of what I'd want if/when I was running my own ship.

I too almost always try to steer myself and others away from it now because of the licensing/customer hostility. It's absolutely ridiculous...


Do you have any resources that help explain these SAS performance measures? A book perhaps?

I have been trying to help with exactly this (and your breadcrumbs help), but it is tricky for me since I am used to an open source/*nix environment where you can use very different tools, and where information and tutorials are distributed much more widely.


Unfortunately not. With SAS I never used books and relied solely on having access to the fully licensed system at a previous job, plus all of the SAS PDFs floating around the internet that are findable with specific searches.

Combine that with a general computer science background and you can start to put the whole thing together.

I'd be lying if I said I hadn't considered writing one, but at my age I'd honestly ask why write one for an old proprietary system and make business for someone else when, if I ever go back long term, they can pay me an exorbitant amount as a consultant. Might as well start writing 'the dark arts of COBOL' :p


Just use https://diskframe.com and you will not be limited by memory!!


For larger-than-RAM data I would recommend diskframe.com

It uses dplyr and data.table syntax to manipulate data on disk
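
Roughly, the workflow looks like this (file name and columns are made up; check the disk.frame docs for the exact setup):

    library(disk.frame)
    library(dplyr)

    # parallel workers that process the on-disk chunks
    setup_disk.frame(workers = 4)

    # chunk a (hypothetical) CSV out to disk instead of loading it into RAM
    policies.df <- csv_to_disk.frame("policies.csv", outdir = "policies.df")

    policies.df %>%
      filter(year == 2019) %>%
      select(policy_id, premium) %>%
      collect()   # only the filtered result is pulled into memory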


I've not used diskframe.com, but from experience can recommend the 'fst'[1] file format with 'fsttable'[2] for reading on-disk data tables (rough sketch below).

[1] https://github.com/fstpackage/fst

[2] https://github.com/fstpackage/fsttable
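
For anyone curious, the basic fst read/write looks roughly like this (toy data; fsttable then layers a data.table-like interface on top of such files):

    library(fst)

    policies <- data.frame(policy_id = 1:1e6,
                           premium   = runif(1e6, 100, 1000))

    # write to a compressed fst file
    write_fst(policies, "policies.fst", compress = 50)

    # read back only the columns/rows you need, without loading the whole file
    read_fst("policies.fst",
             columns = c("policy_id", "premium"),
             from = 1, to = 10000)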


disk.frame uses fst as the underlying format


Thanks, so far we have just scaled up our VM RAM, but I might find a use for it.


This may be useful. I prefer dplyr's syntax. https://github.com/tidyverse/dtplyr
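
A rough sketch of what that looks like (dplyr verbs on top, data.table doing the work underneath):

    library(dtplyr)
    library(dplyr)

    # wrap a data.frame (or data.table) in a lazy translation layer
    dt <- lazy_dt(mtcars)

    dt %>%
      group_by(cyl) %>%
      summarise(mean_mpg = mean(mpg)) %>%
      as_tibble()   # forces evaluation; printing the pipeline instead
                    # shows the data.table code it was translated into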


One of the reasons we use data.table is that it reduces the dependencies when building custom images, and its stability has been better than the tidyverse's in the past. That might not be the case in the future, but it is how we made our choice initially.


> I believe Julia to be the future but so far the adoption rate in house has been low.

Why do you believe it will be the future, and what do you see as the barriers to roll-out? I ask as someone who is curious about when/whether to start investing in Julia competence


> We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled. (Did we mention it should be as fast as C?)

https://julialang.org/blog/2012/02/why-we-created-julia

I've been playing around with it. As a Python/MATLAB guy, I find the syntax very friendly. I can see it displacing Python in production code where you need speed and might want to avoid some of the heavy Python DS libraries. Overall it seems like a thoughtful combination of a lot of good numerical programming features.


Debugging is something I could not do as easily in Julia compared with debugonce and trace in R. Compilation takes time.
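
For reference, the R side of that (rate_policy here is just a made-up stand-in for one of our own functions):

    # stand-in for one of our rating functions
    rate_policy <- function(exposure, base_rate) {
      premium <- exposure * base_rate
      sum(premium)
    }

    # drop into the browser the next time rate_policy is called
    debugonce(rate_policy)
    rate_policy(c(1, 2), c(100, 150))

    # or inject a browser at a specific step without editing the source
    trace(rate_policy, tracer = browser, at = 2)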

But so far we have seen great development. Flux is a truly beautiful ML library. Being a compiled language removes a lot of headaches when building production images. The syntax, the full UTF support in variable names. Package management is great. Having that abstraction layer between CPU and GPU so you don't have to rewrite code. Dispatch based on signature, type management. I don't see it going away soon. It took me 13 years to make them transition out of SAS; good thing cloud computing came around and someone realised the clusterfuck of having to manage SAS licences in the cloud.


Both R and pandas force you to wrap your problem around dataframes and vectorized operations. But sometimes you really do just want to write a loop that iterates over the data.
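
To make the contrast concrete, a toy R example:

    x <- runif(1e7)

    # the vectorised idiom the language wants you to write
    y <- x * 2 + 1

    # the explicit loop you sometimes want, which is far slower in plain R
    y2 <- numeric(length(x))
    for (i in seq_along(x)) {
      y2[i] <- x[i] * 2 + 1
    }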

Right now the only way to do that without significant performance costs is to drop down into C or avoid the problem completely by using Julia.

Having worked with both R and Python on large datasets, I think both languages are really easy until they aren’t. Eventually you hit a performance wall.


You can increase the speed of loops in Python using Numba. It's really a great performance booster with just a few decorators added.


You can drop down to the underlying NumPy values array in pandas to get a performance gain when iteration is otherwise slow.



