
I would highly recommend the use of the package data.table over tibble or the basic data.frame if you are doing any type of modeling in R with larger datasets. Yes, R has many data structures, but knowing how to use data.table will blow your mind in terms of efficiency. Matt and the other contributors have built something extremely fast and flexible.
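
For anyone who hasn't seen the syntax, here is a minimal sketch (toy tables and column names, not our actual rating logic) of the keyed join plus update-by-reference style that makes it so fast:

    library(data.table)

    # toy policy and rate tables
    policies <- data.table(policy_id = 1:4,
                           class     = c("A", "B", "A", "C"),
                           exposure  = c(1, 0.5, 2, 1))
    rates <- data.table(class = c("A", "B", "C"),
                        base_rate = c(100, 150, 200))

    # keyed join, then add the premium column by reference (no copy)
    setkey(policies, class)
    setkey(rates, class)
    onlevel <- rates[policies][, premium := base_rate * exposure]

    # grouped aggregation in one expression
    onlevel[, .(total_premium = sum(premium)), by = class]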

I get that R is not for everyone but used correctly it is a beast.

Now this is anecdotal, but in the insurance industry we have what we call on-level premium calculators. It is basically a program that rerates all policies with the current set of rates.

Our current R program can rate 41,000 policies a second, fully vectorized, on a user's laptop with an i5 from 2015.

In contrast, the previous SAS program could do 231 policies a minute on a 64-core Xeon processor from 2017.

For our workload and type of work, R has been a godsend.

Bonus: we can put what our data scientists develop in R directly into production (after peer review, testing, etc., no different from any other production code).

Back when I started in 2005, we modeled in some proprietary software like Emblem, used Excel to build a first draft of the premium calculator, rebuilt the computation in SAS for the on-level program, and sent specs to IT to rebuild the program yet again for production. All three had to produce the same results.

I've tried Python, Go, Rust, and Julia. I'd say Python could be a good alternative, but the speed of data.table, the RStudio IDE, and the ease of package management in R make R an obvious choice for us. I believe Julia to be the future, but so far the in-house adoption rate has been low.




As someone "fully fluent" in both, for many workflows that can be properly implemented in SAS, you would expect on a technical level that the SAS program could be faster. It's a fully compiled language, it's a "simple" compilation model (compared to R), and the interaction between incremental compilation and the macro system allows you to do some really good blurring between run time and compilation when performance matters. Plus, by abusing the fact that you can define both SQL and data step views to further minimise disk reads/writes, using database pass-through on certain procedures, and allowing for in-memory operations (like R) with the sasfile command, an experienced user of both should, from a purely technical point of view, be able to beat R in SAS.

But... and here's the big but... these days I almost never meet anyone capable of putting all these steps together in SAS who actually understands the SAS computation model end to end.

And SAS's strength, a computation model that is not limited by memory by default, becomes a performance weakness when everyone reads/writes every step out to disk and programs without understanding all those little intricacies. SAS hasn't helped any of this by trying to move its ecosystem away from "programmers" towards "application users", so now "programmers" can pick up an interpreted language like R, with in-memory vectorised operations by default, and beat SAS.

Of course, I'd still recommend places move to Python/R these days because of the broader ecosystems, the university talent pool, and avoiding the extensive lock-in of proprietary software, but I still feel I have to reflexively respond to "R faster than SAS" claims :p


Believe me, I know. The code just becomes unreadable when you put all execution inside the same data step and use hash tables to do fast small-to-big merging. And not to mention debugging that mess when you have a macro layer on top of it. Not having access to function source code, the installation process being what it was... I do not miss it.

And yes, technically SAS is faster than R, but part of the equation is how many people can actually make SAS code faster than R/Python. I had maybe 1-2 people who could write efficient SAS code.

One version we had was a bunch of macros producing hash merges, plus the whole "how can I do this without having to get out of the data step" mindset. Just horrible. Limits on the number of characters in a line of code? You forgot a quote somewhere and now you have to run the magic line.

I hope I'm not too emotional when I say I hope SAS disappears from my industry and we embrace less adversarial licensing.


I don't think that's being emotional at all.

I'm being emotional when I say I have a soft spot for it because of some nostalgia and occasionally dropping in to do some "rock star" programming moments with it. But that's the opposite of what I'd want if/when I was running my own ship.

I too almost always try to steer myself and others away from it now because of the licensing/customer hostility. It's absolutely ridiculous...


Do you have any resources that help explain these SAS performance measures? A book perhaps?

I have been trying to help with exactly this (and your breadcrumbs help), but it is tricky for me since I am used to an open source/*nix environment where you can use very different tools, and where information and tutorials are distributed much more widely.


Unfortunately not. With SAS I never used books and relied solely on having access to the fully licensed system at a previous job, plus all of the SAS PDFs floating around the internet that are findable with specific searches.

Combine that with a general computer science background and you can start to put the whole thing together.

I'd be lying if I said I hadn't considered writing one, but at my age I'd honestly ask why write one for an old proprietary system and make business for someone else when, if I ever go back long term, they can pay me an exorbitant amount as a consultant. Might as well start writing 'the dark arts of COBOL' :p


Just use https://diskframe.com and you will not be limited by memory!!


For larger-than-RAM data I would recommend diskframe.com

It uses dplyr and data.table syntax to manipulate data on disk
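
Roughly, the workflow looks like this (file name and columns are made up; check the disk.frame docs for the exact setup):

    library(disk.frame)
    library(dplyr)

    # parallel workers that process the on-disk chunks
    setup_disk.frame(workers = 4)

    # chunk a (hypothetical) CSV out to disk instead of loading it into RAM
    policies.df <- csv_to_disk.frame("policies.csv", outdir = "policies.df")

    policies.df %>%
      filter(year == 2019) %>%
      select(policy_id, premium) %>%
      collect()   # only the filtered result is pulled into memory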


I've not used diskframe.com, but from experience can recommend the 'fst'[1] file format with 'fsttable'[2] for reading on-disk data tables (rough sketch below).

[1] https://github.com/fstpackage/fst

[2] https://github.com/fstpackage/fsttable
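
For anyone curious, the basic fst read/write looks roughly like this (toy data; fsttable then layers a data.table-like interface on top of such files):

    library(fst)

    policies <- data.frame(policy_id = 1:1e6,
                           premium   = runif(1e6, 100, 1000))

    # write to a compressed fst file
    write_fst(policies, "policies.fst", compress = 50)

    # read back only the columns/rows you need, without loading the whole file
    read_fst("policies.fst",
             columns = c("policy_id", "premium"),
             from = 1, to = 10000)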


disk.frame uses fst as the underlying format


Thanks, so far we have just scaled up our VM RAM, but I might find a use for it.


This may be useful. I prefer dplyr's syntax. https://github.com/tidyverse/dtplyr
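
A rough sketch of what that looks like (dplyr verbs on top, data.table doing the work underneath):

    library(dtplyr)
    library(dplyr)

    # wrap a data.frame (or data.table) in a lazy translation layer
    dt <- lazy_dt(mtcars)

    dt %>%
      group_by(cyl) %>%
      summarise(mean_mpg = mean(mpg)) %>%
      as_tibble()   # forces evaluation; printing the pipeline instead
                    # shows the data.table code it was translated into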


One of the reasons we use data.table is that it reduces the dependencies when building custom images, and its stability has been better than the tidyverse's in the past. That might not be the case in the future, but it is how we made our choice initially.


> I believe Julia to be the future but so far the adoption rate in house has been low.

Why do you believe it will be the future, and what do you see as the barriers to roll-out? I ask as someone who is curious about when/whether to start investing in Julia competence


> We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled. (Did we mention it should be as fast as C?)

https://julialang.org/blog/2012/02/why-we-created-julia

I've been playing around with it. As a Python/MATLAB guy, I find the syntax very friendly. I can see it displacing Python in production code where you need speed and might want to avoid some of the heavy Python DS libraries. Overall it seems like a thoughtful combination of a lot of good numerical programming features.


Debugging is something I could not do as easily in Julia compared with debugonce and trace in R. Compilation takes time.
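
For reference, the R side of that (rate_policy here is just a made-up stand-in for one of our own functions):

    # stand-in for one of our rating functions
    rate_policy <- function(exposure, base_rate) {
      premium <- exposure * base_rate
      sum(premium)
    }

    # drop into the browser the next time rate_policy is called
    debugonce(rate_policy)
    rate_policy(c(1, 2), c(100, 150))

    # or inject a browser at a specific step without editing the source
    trace(rate_policy, tracer = browser, at = 2)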

But so far we have seen great development. Flux is a truly beautiful ML library. Being a compiled language removes a lot of headaches when building production images. The syntax, the full UTF support in variable names. Package management is great. Having that abstraction layer between CPU and GPU so you don't have to rewrite code. Dispatch based on signature, type management. I don't see it going away soon. It took me 13 years to make them transition out of SAS; good thing cloud computing came around and someone realised the clusterfuck of having to manage SAS licences in the cloud.


Both R and pandas force you to wrap your problem around dataframes and vectorized operations. But sometimes you really do just want to write a loop that iterates over the data.
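
To make the contrast concrete, a toy R example:

    x <- runif(1e7)

    # the vectorised idiom the language wants you to write
    y <- x * 2 + 1

    # the explicit loop you sometimes want, which is far slower in plain R
    y2 <- numeric(length(x))
    for (i in seq_along(x)) {
      y2[i] <- x[i] * 2 + 1
    }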

Right now the only way to do that without significant performance costs is to drop down into C or avoid the problem completely by using Julia.

Having worked with both R and Python on large datasets, I think both languages are really easy until they aren’t. Eventually you hit a performance wall.


You can increase the speed of loops in Python using Numba. It's really a great performance booster with just a few decorators added.


You can drop down to the underlying NumPy values array in pandas to get a performance gain when iteration is otherwise slow.



