Goodbye R scripts, hello R packages (statwonk.github.io)
51 points by RA_Fisher on Nov 4, 2013 | 32 comments



I love R, and it was actually the first language I really learned to program with (for obvious reasons, I wouldn't ever recommend this). I can identify a lot with the "one-off scripts" problem of R. When I look back at some of my R code, I find that it is just a giant mess of commands mixed together semi-randomly.

As I learned to program properly, I solved the reuse and "good code" problem more by moving towards using Python than by making R packages. I occasionally use R for data exploration and visualization, but Python has support for almost all of the machine learning and statistical functions that I need.

I am very interested to hear how other people have solved this problem. Do you only use R, or R in combination with Python/Julia/Java? If you only use R, are you using it purely academically?


IMO the Unix philosophy of reuse is vital for data analysis. I recommend that people in this space learn to use the interactive shell and shell scripts well. I saw a comment that said "Shell is a REPL for C", which I think is quite pithy.

People think too much about reusing R packages or Python packages. In my experience you can get things done a lot faster if you factor things into programs and not libraries. This lets you use multiple languages, and all real world analysis problems need multiple languages. (If you're only using one language, then you're likely only working on part of the problem).

Right now my toolset is Python, C++, and R, coordinated with shell scripts. I still like R for quick plotting, and of course data frames are essential, but I'm playing with Pandas now, which seems impressive. I don't think Python will ever catch up to R in terms of statistical functions, but in terms of plotting/munging/data frames it might.

In the future there will be more languages, not fewer (Julia will add to the number of languages, not replace any). So being able to decompose a problem into separate, reasonably generic programs is an important skill IMO. Most data analysis pipelines are a huge mess, but they don't have to be.


That's an interesting way to go about it. Any reason for using shell scripts to coordinate the flow instead of things like Cython and RPy? (I don't shell script a lot, so this may be a silly question)

These days, I mostly seem to be able to get away with using a single language for applied machine learning, like a Python webserver that runs background machine learning tasks, or an Android app in Java that connects to a webserver written in Python. But back when I was doing less applied work, I was similar to you in that things were spread across R and Python.


In my experience, RPy is annoying to get working, because you're dealing with Python versions and R versions together. It's brittle and usually unnecessary, as you can just serialize to CSV or JSON.
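For instance (just a sketch; the script names and flags here are made up), the hand-off can be as plain as one side writing a CSV and the other side being told where it is:

    python extract.py --out features.csv    # hypothetical: dumps one tidy CSV
    Rscript plot.R features.csv              # hypothetical: reads that CSV and plots

No version coupling between the two runtimes, just a file format both sides understand.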

If you can get away with using a single language that's good, and Python is probably the only one where that is possible (i.e. can write both prototype and production code, and data ingestion and machine learning). But I often have to use C++ because of the data size, and I think R's plotting is more convenient than anything in Python now.

The shell scripts have their messiness and sharp edges, but they definitely save me many many lines of code. There is always some weird thing that needs to be integrated/automated and shell is almost always the right tool for that.


The Unix shell is basically an IPC language. Its sole purpose is to combine programs and allow them to communicate. Instead of trying to make a package in one language work with another, it is a lot simpler to use the shell to connect together programs in any language.
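A minimal sketch of the idea (the program names are invented for illustration): each stage is a standalone program in whichever language fits, and the shell just wires their stdin/stdout together:

    # Invented names: clean.py (Python) parses/normalizes records, aggregate is a
    # small C++ binary for the heavy lifting, summarise.R (R) fits a model or plots.
    cat raw_logs.csv | python clean.py | ./aggregate | Rscript summarise.R > report.csv

Each program stays small and testable on its own, and swapping one stage out doesn't touch the others.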


This is an amazing suggestion. I hadn't thought of using shell scripts to modularize cross-language. Thank you.


There have been some posts on Hacker News describing this philosophy (see below, can't find the HN comment links unfortunately).

Another advantage is that it encourages separating policy from mechanism. For example, you can make your sampling rate or number of buckets a command line flag, and have all your parameters in the shell scripts, rather than having various constants strewn about the code.
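A sketch of what that can look like (the flag names and programs are invented): the shell script owns the knobs, and the programs just accept flags:

    #!/bin/sh
    # All tunable parameters live in one place, not strewn about the analysis code.
    SAMPLE_RATE=0.05
    N_BUCKETS=20

    # Hypothetical programs that take their parameters as flags
    python sample.py --rate "$SAMPLE_RATE" < events.csv > sample.csv
    Rscript histogram.R --buckets "$N_BUCKETS" sample.csv

Changing the experiment then means editing the driver script, not hunting for constants inside the code.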

And obviously you can use each language where it's stronger. Python is better (and much faster) for the data cleaning part, whereas you might want to use ggplot in R.

http://jeroenjanssens.com/2013/09/19/seven-command-line-tool...

http://www.drbunsen.org/explorations-in-unix/

http://strata.oreilly.com/2011/04/data-hand-tools.html


I'm slowly moving my elaborate data-cleaning system from R to python. I never got the hang of R packages and ended up rolling my own "import" scheme, but now it seems like I've passed some critical complexity point.

I'm still doing all of my model estimation in R, and data.table + ggplot2 is still my go-to solution for interactive exploration and plotting, but I'm putting more and more of the process-csv, write-hdf5, remove out-of-bounds values, backfill, stratified sampling/bootstrap, etc., logic into python.

I may end up moving more of the estimation itself into python, now that statsmodels and sklearn are more mature, but R does that pretty well.

I've not tried Julia. I have experimented with Java for some of the data processing, but it never seemed worth it over python. Basic CSV processing is faster in python than in Java (although the python version uses much more CPU), so I haven't felt the need. (I'm probably limited more by disk bandwidth than CPU, but I've not done the tests required to prove that.)

Julia seems nice, but it seems intent on copying all the questionable design choices of matlab, rather than using ideas from the numpy/kdb+/apl worlds.


I'm just now getting into Python. Primarily because I'm abandoning Fisherian methods for Bayesian methods and all of the books seem to have examples in Python.

I use R in a business setting. I'm a data scientist at Treehouse (teamtreehouse.com)


There are great tools that interface R with BUGS that may be worth checking out.

I would highly recommend getting this for the shelf: http://www.amazon.com/Analysis-Regression-Multilevel-Hierarc...

It's one of the most readable books on data analysis I've come across and does a great job presenting both frequentist and Bayesian techniques with tons of R sample code.

There are a lot of advantages and nice things in Python, but I do think folks tend to toss out R a bit too casually. Each tool has areas where it excels. I don't even do particularly complex analysis, but I have run into areas where Python is woefully lacking in fairly common (social science) models.


Thanks! I'm planning on digging into Cam Davidson-Pilon's Bayesian Methods for Hackers soon: https://github.com/CamDavidsonPilon/Probabilistic-Programmin...

Thanks for the link to the book. I could see it being really helpful to me to have frequentist examples alongside Bayesian examples.

I'm hoping to learn Python so I have both tools. R is so domain-specific that I doubt I'll ever completely stop using it in my career. Who knows, though!?


I think R is actually a rather great first language, as the focus is usually on doing cool stuff, rather than developing a rigorous framework for programming. It's more important to put motivation before rigour, since if you focus only on rigour most people will lose interest before getting to the fun stuff.


Great point. I still wish I coded in R more, just because the MTTC (mean time to cool) is so much lower.

But, I feel like you can do the same with sklearn/ipython notebook/pandas if you ignore the parts of the "rigorous framework", and just focus on the syntax for interesting things.


Should one learn R if one already has some experience in data analysis with Python?


I would like to know the same.

I work with a lot of bioinformaticians. The ones from a biology background (i.e. code quality is not usually important) seem to like R. The ones with a computer science background seem to go for Python, or Perl if they are old school. So far that has put me off R.


Purely out of interest, what's so wrong with Windows as a development environment? I've seen this claim touted before on HN, but I've never seen an explanation (not doubting that there is one; I just haven't seen one). I develop on Windows every day, and I'm always happy to learn a new way to improve my workflow, so does Ubuntu (or Mac) actually make it easier to develop software, and if so then how?

I'd also be interested to hear what kinds of software you've found it hard to develop on Windows.


>> what's so wrong with Windows as a development environment?

Command line.

When you're doing a task on a computer, some paradigms are better suited than others depending on the task.

E.g. imagine delicately touching up pixels in a photo, guided by your artistic eye yet instructing the computer by laboriously typing many commands into a command line. There's no reason this couldn't work; it's just way more natural to use a pen & digitiser, or a mouse.

When we're developing we're editing text and invoking commands. Editing text is covered by any major platform just fine, a lot of popular editors & IDEs are even cross platform.

Invoking commands is sometimes covered by the IDE, and sometimes the commands covered are pretty comprehensive (I'm thinking of Tom Christiansen's quip "Emacs is a nice operating system, but I prefer UNIX").

There's a limit to what tasks your IDE can cover, even though some do try really hard to cover all your needs (Eclipse has a web browser in it!), and when you hit that limit, as we so often do in development, you need a command line.

At that point, Windows kind of whistles while sheepishly looking off to one side.

There's a related point about development dependencies: some libraries are hard and/or time-consuming to build. A great package manager is a brilliant productivity booster. Mostly, though, it's the point about the command line.

(Just to be clear, I have given PowerShell a good shot; I estimate I've authored almost 1k lines of it.)


Thanks for the reply.

What sort of command line tasks are you trying to invoke on both systems that are easier to invoke on unix? I can't think of any commands off the top of my head that I've had to invoke recently, other than IIS commands. Maybe it's just the fact that the C# developer's workflow is so heavily IDE based. I can imagine that if you were having to manually invoke the compiler etc, then a useful command line would be a must.


No problem at all!

Searching is a big one. Looking through my shell history, I frequently use egrep, although egrep is a bad example here since I could use the IDE to do a search. The problem is that I frequently want to do something with the results - maybe make a substitution in each of the result files, or compress the files, or copy them to another host.
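For example (hypothetical patterns and paths, and this assumes GNU grep/sed/tar):

    # substitute in every file that matched
    egrep -rl 'old_name' src/ | xargs sed -i 's/old_name/new_name/g'

    # or bundle the matching files up and ship them to another host
    egrep -rl 'ERROR' logs/ | tar czf matches.tgz -T -
    scp matches.tgz otherhost:/tmp/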

I overuse the find command; locate would run faster in many cases, but find is just a reflex. E.g. answering:

    find data/ -type f -mtime -1 # get me the working dataset from today
I see a lot of source control commands. This is something all the IDEs do, but in my experience it's much more robust from the command line than from an IDE.

There are a few 3- or 4-line scripts for various tasks I was doing manually. To give an example, one is to kick off a rebuild of my KVM virtual machine (I currently write a lot of CFEngine code, so this is a frequent thing when the unit tests of the CFEngine code don't go to plan and leave the KVM borked!).

The biggest use case in my shell history is simple navigation to look at or operate on various files. Fuzzy search in ST2 is slowly winning my heart here, right enough, but it doesn't work on remote hosts or even local dirs not in my project.


Very interesting, thanks for the in-depth explanation.

I think that this does appear to be a fundamental difference between Windows devs and Unix devs. All of the things you've listed there, I'd do through the GUI on Windows.

For searching, obviously, I'd Windows+E (or Super+E in OS-non-specific keyboard terms) to open up the explorer, then tab+tab to navigate to find, then type the query. I can't think of an easy way to pipe a set of files into some sort of function (e.g. for substitutions) outside of F#; maybe something exists for that in PowerShell, but I find PowerShell to be a bit of a poorly documented mess.

Source control, again, I'd do through the GUI. With TortoiseHG, I've got my hg commands available with a right click on a file/folder under a repository. I can see that this'd seem significantly slower than doing it through the command line, but the Windows shortcuts actually make it pretty snappy to navigate through the file system in the GUI.

For kicking off a VM, I'd usually attempt to shortcut the command I'm looking to execute regularly and then Windows+D my way to the desktop to execute it.

Windows does seem to have made the decision early on (i.e. back in '95, when I first started using it) to provide a clean and concise GUI first and foremost, and then to add programmer/automation-friendly terminal APIs as an afterthought. I wonder if the issues that Unix users on windows have are caused by the fact that Unix systems seemed to go the other way, and that attempting to replicate a Unix experience on Windows would just lead to frustration?

This is interesting, because I rarely get a chance to use Unix at work, nor do I get much of a chance to speak to Unix devs (we're a Microsoft shop -- WPF, ASP.NET MVC, and SharePoint if we're feeling masochistic). Thanks again for an interesting snapshot into your workflow.


Likewise, I've just realised that I don't make effective use of the desktop on Windows, but when you mention Win+D to get to a screenful of cherry-picked shortcuts, it makes sense.

I never really put anything on my desktop; I do use the Win7 taskbar a lot, but I reckon I'll start using the desktop too.

EDIT: meant to add, regarding piping in Windows, someone showed me something similarly useful: if you drag a bunch of selected files onto a program icon, the program can often make use of those files. E.g. drag some files onto the Outlook icon and it'll compose a new email with them attached.


With regard to piping, yes, this can sometimes be true. It's very dependent on the actual program. Programs on Windows take a string array as the argument to their main function, so if the program is written in such a way that it iterates across every element of that array and does something with each one, then this'd work. Most of the Microsoft programs behave sensibly with this and execute an Open command against each of the files dragged onto them, but this isn't guaranteed to be true.

I'd still like a nicer, general purpose way of manipulating multiple files in Windows. That does seem to be one place where it's definitely lacking, but it could just be that I don't know how to do it.


In my (anecdotal) experience: a lot of tools (everything from editors to compilers, etc.) that are open source are from the Unix world. They're designed to work in a POSIX way, and the ports to Windows (if they exist, or you're forced to use Cygwin) are a bit kludgy: you might just as well have used some kind of Unix in the first place. Windows dev tools are available, of course, but you often have to pay quite a lot for them. Good OSS exists for Windows, but it's a subset of what's readily available under *nix, which doesn't encourage the hacker/hobbyist spirit.


Thanks for the reply.

That does make sense. As a C# developer, I haven't really had call to use OSS IDEs, so I can't really comment, but I do remember having an ugly time trying to use an OSS interactive disassembler on Windows.


Interesting, thanks. (Like your username too, despite him being slightly denigrated in the linked article.) I personally think that it is good to have a basic grounding in some kind of fundamental "real" programming language before you get into S-PLUS (R) or any other more domain-specific language. The main reason is that you will better understand the compromises, limitations and shortcomings. I like R, but as a programmer I've seen some awful code written in it!


Yes, if you read further on the blog, you'll see I'm more than slightly denigrating. I read a book recently called The Cult of Statistical Significance and it's really turned me against Fisher. Not only do I see his methods as inferior to Bayesian analysis, he was also really mean and disparaging to those folks himself. It's interesting: my username is what it is because he used to be my hero! Just 6 months ago, I really considered him to be the father of modern science. Now I basically see his work as mostly enabling mediocre scientists to rise through the ranks. This is what I mean by that: http://www.economist.com/blogs/graphicdetail/2013/10/daily-c...


I've spent my career thinking about these same issues. And certainly Fisher was, well, abrasive, to put it mildly. I've read much of his early work and I think the intended context has mostly been lost. Many modern statisticians (of the "frequentist" persuasion) use a strange and awkward combination of Fisherian and Neyman-Pearson methods. We talk about "p-values" but then interpret them as hypothesis tests with long-term error probabilities (Fisher disavowed this interpretation of p-values, and insisted they did not have a long-run probabilistic interpretation but were a measure only of evidence against the null in the particular experiment). I think Fisher gets a bad rap for a lot of later bastardization of his work. (I'm sure it did not help that he was not a likable person, or his valid but misguided testimony about tobacco.)

Still, I'm sympathetic to your position and can understand how you'd come to that way of thinking. I'm not convinced even a diehard Bayesian would completely disagree with Fisher's more restricted and stringent interpretation of p-values. Or at least they would see it as a big step up from the more common usage you are referencing.


When did ideas get muddied by personality? I see this in the whole Ender's Game debate and in this Fisher debate. We need to compare ideas outside of the personality; THEN we can separate the idea from the person.

In our history (I was a Historical Philosophy/Theology major), you would be absolutely shocked at the lives of great thinkers and disapprove so much of their lives, but MOST people don't know. This is the issue with open lives: we know so much more about people. It is never just a book or an idea; maybe we can learn too much???


What a silly thing to think. I think you'll find that Fisher's development of likelihood and explicit probabilistic modeling is rather important in Bayesian statistics. You'll also find that Fisher was not only among the top 3 (if not the single most important) theoretical statisticians of the 20th century, but also among the top 3 (if not the most important) theoretical geneticists. This sort of fanboy Bayesianism is really rather embarrassing.


Yes, I really do value and see the genius in likelihood, but Fisher is idolized in a really unhealthy way in statistics IMO. I'm a Bayesian fanboy because it works. I spent > 1 yr. trying to get Fisher's testing framework to yield business success. Positive results came, but they were always a slog, and more often than not its use failed. As soon as I switched to Bayesian methods, consistent success started happening. It's exciting.


I don't think saying R is not a "real" language is a fair characterisation. It is not a domain-specific language in the sense the term is commonly used (SQL and ggplot2 are domain-specific languages; R is not). R is a real programming language with inspirations from Scheme and Common Lisp; just because there's a lot of bad R code doesn't mean the language is bad.


If you are responding to my comment, I agree with you. I like the R language very much and use it. I may not have said what I meant well enough. What I mean is that there is a tendency with R programmers, and much more so with SAS or Stata, not to think about the guts of the program and what is really happening. The data frame in R is a nice framework for thinking and programming, but it does not promote thinking about whether i/o is to disk or memory, whether iteration, loops, or recursion are efficient, or overall program efficiency, reuse, etc. Programmers tend to rely on the underlying system to worry about memory allocation, disk i/o, etc. There are many problems where these details matter. So much time is spent on the statistical problem and what procedure or function to call, rather than on the algorithm at a deeper level. These details can be handled, but are not so "front and center". Matlab and others are similar. I have seen enterprise-class servers brought to their knees by a poorly written R program... (you can say lots of things about how that should not be possible, or how that can happen with other languages, and I'd agree, but I still think it is more likely in a context that does not encourage programmers to think "closer to the metal" about where data is and what is happening to it).

I've seen some great R code too... so this is a complicated topic and has to do with who is writing the code as much as R itself.



