The code that makes him say "what a mess" is, I think, beautiful:
    from operator import itemgetter
    from itertools import groupby

    def summary(data, key=itemgetter(0), value=itemgetter(1)):
        for k, group in groupby(data, key):
            yield (k, sum(value(row) for row in group))
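For concreteness, here's that two-liner run on a tiny made-up sales table. One thing worth noting: groupby only merges *adjacent* rows, so the data has to be sorted by the key first.

```python
from operator import itemgetter
from itertools import groupby

def summary(data, key=itemgetter(0), value=itemgetter(1)):
    for k, group in groupby(data, key):
        yield (k, sum(value(row) for row in group))

# groupby only merges adjacent rows, so sort by key first
sales = sorted([("East", 120), ("West", 200), ("East", 80), ("West", 50)])
print(list(summary(sales)))  # [('East', 200), ('West', 250)]
```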
Perhaps that's because I'm a programmer, and Python is a general purpose programming language. But I think that's what his complaint boils down to: the Python statistical code looks too much like Python. Which, yeah, it does. Python is a general purpose programming language, not a domain specific language for statistical programming.
However, I don't think the programming concepts one needs to understand to make effective use of a well designed Python library are too much to ask. I've only dabbled in R, but when I did, it required me to exercise my general programming knowledge to understand lists, matrices, and functions. I think the author is also falling into the trap of assuming that what is obvious to him is obvious to everyone. I'm actually not sure what the SAS code is doing, and much prefer the Python.
> Perhaps that's because I'm a programmer, and Python is a general purpose programming language.
Exactly. You shouldn't have to be a programmer to do statistics, just like you shouldn't have to be a network engineer to share files. What if Dropbox had stuff in there about http, ports, levels of service, bandwidth, etc.? You'd probably say, "Great! I always wanted to specify that Dropbox use SSL4.7 draft B over CDMA EvoX.1 -- who wouldn't?"
When you're doing a DSL, make it as simple as possible. And if you have time, in v2, give it hooks to just break out and do crazy stuff... but the 90% case should be simple as pi.
There are GUI statistics apps for people who just want the common case, Dropbox-style: packages like Weka for data mining / predictive statistics, SPSS for descriptive statistics, and a dozen other such things.
The statisticians who choose to use a programming language like R or Python typically do it because they actually do want a programming language. I mean, that's why Bell Labs statisticians invented S (R's predecessor) to begin with.
I am a statistician that does both research and applied work.
I use R for three reasons: (1) It's Free Software; (2) It's a programming language; (3) Other statisticians use it so it's easier for me to collaborate.
There are the usual supporting arguments for (1). For (2), I've only used SAS a little bit, and it was extremely unpleasant to use for non-built-in stuff, which makes research harder for no good reason. For (3), I have nothing against Python, but most other statisticians don't use it. If I want to share my work in R, it's easy (statisticians know how to install R packages). If I want to share my work in Python, I first have to teach [most] other statisticians how to use Python. There's nothing wrong with that, but why raise the start-up cost for them?
tl;dr I conjecture that most statisticians don't want what the author is suggesting. Also, there are plenty of companies that are trying to do what the author is asking for, but most of them seem to miss the desired sweet spot, or charge lots of money, or both. I haven't taken a survey of the available software in quite some time.
No, but I can give some suggestions. It would help to know what you want to do.
First of all, you need to decide if you want a language reference, or an application guide, as R books fall into those two categories.
If you have a specific type of work in mind (bio-informatics, data mining, data visualization, ...) I'd say to find a book that focuses on that topic. I haven't looked in a while, but I haven't seen a general R book that I like, anything I suggest there would be guessing on my part.
There are plenty of good references on the web. I'd start by looking at the material available from the R web site:
R's core manuals [1] are typically correct and reasonable to use. The "Introduction to R" guide will get you up to speed fairly well if you already know another programming language. There is also the contributed documentation [2]. I haven't gone through these, so I can't say much about them, or promise that they are up-to-date. I suspect not, as R develops rapidly. The one reference I can recommend highly is "The R Inferno" by Patrick Burns [3]. This is not a starter guide, but something you read after one. It gives excellent advice on avoiding common pitfalls in R.
Thanks. I do biology with limited amount of data and my needs are very basic. Here is a software I wrote to do sleep analysis in Drosophila: http://www.pysolo.net
So far I could satisfy most of my statistics needs with the function in numpy and scipy but occasionally I need to do something slightly more fancy and R I guess is the way to go.
Possibly. R is really great at doing "fancy" statistical analyses. It's very lousy at doing things like text manipulation. When I have a project that needs some text manipulation on the front end, I frequently use other tools (Python, vi, sed, ...) to beat the text data into a nicer form for R. I couldn't say without knowing more about your project.
I always seem to come back to "Introductory Statistics With R".[1] It gives a lot of examples of how to do "the day-to-day stuff". Also, since, as the title suggests, the statistical contents are mostly (very) introductory in nature, it's really easy for me as a reader to decipher what's going on in each example: it's easy to tell which parts are specific to the example itself and which parts are generic to R, if that makes any sense.
Right. I wasn't saying that there didn't exist such packages; of course there are. I was pointing out that the reason a programming language looks good to a programmer and not a statistician is due to domain expertise. And of course the common trap programmers fall into is assuming the domain is programming.
And don't lump R in with Python. Any good statistician would have your neck. You mention S, but again, S doesn't look anything like Python either.
I only see him "lumping R in with Python" in that they're both full-blown programming languages and TFAA apparently hates them both because they're programming languages.
_delirium is merely pointing out that there are push-button packages for statistics, and that statisticians using programming languages (be they statistics-oriented or not) usually do so because they want to or because they need to (as the push-button stuff is not sufficient for their needs, for instance)
I'm pretty sure that Python makes a lot more sense to mathematicians than the special purpose syntax of SAS. List comprehensions: mathematicians use set comprehensions all the time. First class functions: same.
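For instance, the set-builder notation {x² : x ∈ S, x even} maps almost symbol-for-symbol onto a list comprehension:

```python
S = range(10)

# {x² : x ∈ S, x even} reads off directly as:
squares_of_evens = [x**2 for x in S if x % 2 == 0]
print(squares_of_evens)  # [0, 4, 16, 36, 64]
```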
If you just need graphs and pivot tables, use some GUI tool.
That looks close to as simple as possible, if you assume Python is to be used. My point about R was that even in a language designed for statistics, I saw dependence on common programming concepts.
I also prefer this Python code to the SAS example listed. I have been trying to brush up on statistics over the last couple of years, and I think this article points to an issue that occurred to me: namely, that when somebody says they "know statistics" it sort of has to mean that they know one of the big stats packages out there. It doesn't appear that anybody is really doing stats from first principles anymore.
It seems like there are differences in terminology between one author and another, and now with the different programming models there is a whole new level of incompatibility.
>Python statistical code looks too much like Python. Which, yeah, it does. Python is a general purpose programming language, not a domain specific language for statistical programming.
I have to agree (with your criticism). I spend most of my day in SAS and R, and my Python is limited to tweaking code from my colleagues, but I don't see how either the SAS or Python listed is better or worse than the other.
I actually like the quote in the article regarding Dropbox's simplicity, but I don't get the relationship to statistical programming languages.
Picking on Python for not having simpler built-in ways to do domain-specific statistical operations seems rather silly to me.
I've been involved the last few years with creating better data structures and tools for doing statistics in Python, with excellent results (http://pandas.sourceforge.net and http://statsmodels.sourceforge.net). So I think the author should take a closer look at some of the libraries and tools out there.
I think his point here is that the most visible aspects of the code are the structures built up to do the computation, rather than the computation itself. As a description of a generator loop, it reads quite nicely. But the language does not give much ground to the topic it's describing, in the way (to use the obvious example) Lisp would. I think that's what he is getting at.
Yes, I like the Python also, but you have missed the point. For MBA-types, business types, and scientists the programming concepts are too much to learn. Why should they have to learn programming when their needs are simple? It is not just "keep it simple", it is "keep it simple" for non-programmers.
Maybe I'm missing the point too, because I don't understand why he's arguing that Python and R should cater to people that don't want to use a programming language. Isn't that akin to arguing that C is too complicated because it allows you to directly access memory rather than abstracting that away?
MBA- and business types have Excel. As a researcher, I use both Python and R regularly -- but I want the full power of a programming language, not a couple of macros to generate a pivot table.
Agree with this point. I'm both MBA/bizdev and a software engineer. When putting on my MBA hat and working on sales forecasts or decision-making models, a spreadsheet is all I use. It is quick, tweakable, and super easy to share. Whereas when building my site, which focuses on market research services, I resorted to C and existing stats packages because they are powerful, more flexible, and basically programmable. To me, what MBA/bizdev people need is significantly different from what a software developer writing stats-related code needs. It is a very different scenario from the Dropbox story...
I understand that point. My point was that a Python library for statistics is not the right tool for them, but that in no way makes that statistic library or Python "bad." Python is a programming language. If you think that the users you have in mind can't handle programming, then don't give them a programming language.
A) Any half-competent scientist is comfortable programming.
B) Some programming is even required in a lot of undergrad business/MBA programs
C) What the author really means by "MBA-types" are morons. So, yes, there is a market for a user-friendly domain specific statistical language. It's called SAS. It's expensive. But it does the thinking for you... if you're a moron.
Also, none of this has anything to do with Python, which is an absolutely beautiful language.
To be honest I find the whole repeated "no, shut up" thing to be a bit crass and it makes me unsympathetic if anything. I hope this doesn't become a catchphrase in blogs.
I thought it was perfect in the original dropbox quora post, but I agree that it doesn't quite fit here, and there is certainly a danger of it becoming a meme.
The original post wasn't insightful at all. That arrogant, know-it-all attitude is not how DropBox got their interface right. They got their interface right through careful attention to their users, by being humble enough to trust the user data and throw away features they had thought would be useful.
EDIT: Downvoted, great. This must be the ultimate triumph of snark: we are now perpetuating the myth that common sense and a sassy attitude is how DropBox created a breakthrough product, instead of careful beta testing and analysis of usage data.
If 90% of usage boils down to a small number of rigid patterns, then there is a simple solution: a handful of convenience functions. Often these functions are missing, because the demand for convenience functions is obscured by the fact that every experienced user defined them for himself years ago. That forces newbies to suffer through the unnecessary task of understanding the fully generalized API before they can accomplish simple tasks.
Languages that have good support for optional arguments, such as Python and Lisp, also make it possible to create APIs that are elegant and concise for experts but extremely intimidating for beginners. It may be more elegant to have a single function with a slew of optional arguments, and an experienced user may be able to accomplish any task quite concisely by specifying a few arguments, but a beginner would be better served by a handful of specific functions with specific names. API writers should consider providing those functions as simple wrappers to the general API, in order to provide a simpler learning curve for users who might never need more complex functionality. Examining how those wrapper functions are implemented can help intermediate users figure out the general API, too.
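A sketch of that idea (all names here are hypothetical, not from any particular library): one general function with optional arguments for the experts, plus thin wrappers with specific names for everyone else.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical general API: one function, many optional arguments.
def aggregate(data, key=0, value=1, func=sum, sort=False):
    rows = sorted(data, key=itemgetter(key)) if sort else data
    return [(k, func(r[value] for r in g))
            for k, g in groupby(rows, itemgetter(key))]

# Beginner-facing wrappers: specific names, nothing optional to learn.
def total_by_group(data):
    return aggregate(data, func=sum, sort=True)

def count_by_group(data):
    return aggregate(data, func=lambda values: sum(1 for _ in values), sort=True)
```

A beginner can call total_by_group and be done; an intermediate user who reads its one-line body has just learned the general API for free.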
Exactly. It's trivial to write a prettied-up interface to those Python functions that would make as much sense (?) as PROC MEANS. Good luck trying to extend SAS to do anything the designers didn't implement as a procedure, though. Having had to navigate through a complex SAS macro or two in my day, I can assure you that it was the single worst experience I have had in 20 years of programming.
Actually in my experience business users want one thing on that list, and if they have that thing they don't care about whether you provide the other two. They won't ask for what they want because they don't know that they can get it. But they will be happy if they get it.
They want to get at nicely organized data easily from inside of Excel. Excel is a toolbox that they already know from which they can do their own pivot tables and graphs. And they'd prefer to do that because then they can just do it instead of less efficiently having someone else do it for them.
They want it to arrive nicely aggregated and organized, since Excel is not very good at that. But they are more than happy to do the pretty reports themselves. Just get them the data.
What Dropbox does is eliminate the programmer from the equation. You don't need your IT staff to do any special setup or to follow a special process.
I think the truth is business users don't want to work with you - they just want their data relationships to be discovered in a simple and intuitive way. If your data crawling is sufficiently good, maybe you can do that.
I understand the point the article is making, but I feel rather strongly that "(y)ou'll want a third thing -- to read in and parse data" translates to this: whoever builds a tool that does that nicely and automatically creates pivot tables, graphs, and other dashboard-y things will probably have to hire a team to shovel the money off so he/she can breathe.
I suppose it's an OK example to talk about this re: statistical programming languages, but my own experience with the three requirements that preface the whole discussion (pivot tables, graphs, data parsing) is that they're a big example of something just screaming for a new solution, not a new programming language...
    import numpy as np
    import tabular

    # CSV with Region, City and Sales columns
    data = tabular.tabarray(SVfile='data.csv')

    # Calculate the total sales within each region
    summary = data.aggregate(On=['Region'], AggFuncDict={'Sales': np.sum}, AggFunc=len)
    summary.saveSV('summary.csv')
I can't speak for the creators of R, but I have a strong suspicion that it wasn't intended for erehweb or MBAs. It's not Microsoft Excel, it's R. It's used by PhD researchers in Mathematics, Statistics, Economics and Political Science.
Maybe I'm just a simpleton, but it seems odd to attack something designed to make statistics analysis easy for statisticians, because it doesn't meet the needs of mid-level managers.
Managers and other folks who need to make pivot tables, graphs and related things without programming have a great tool to do that: Excel. For these people, Excel is Dropbox.
> People don’t use that crap.
> But they do want pivot tables, ...
That's what the cookbook recipe provides, a function called summary() that makes a pivot table. Problem solved :-)
> I should be clear that my complaint is with
> Python rather than the code as such.
There are plenty of ways to write the summary() function with plain, straight-forward Python code that doesn't use generators, itertools, or any other advanced feature.
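For example, a dictionary-based sketch of the same pivot (assuming the recipe's (key, value) row layout):

```python
def summary_plain(data):
    # Sum the value column grouped by the key column; no itertools needed.
    totals = {}
    for key, value in data:
        totals[key] = totals.get(key, 0) + value
    return totals

print(summary_plain([("East", 120), ("West", 200), ("East", 80)]))
# {'East': 200, 'West': 200}
```

Unlike the groupby recipe, this version doesn't even require the input to be sorted.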
So, why does the recipe author use itertools? Because it provides a way to get C speed without having to write extension modules. Had the author used map() instead of a generator expression, the inner loop would run entirely at C speed (with no trips around the Python eval loop):
    for k, group in groupby(data, key):
        yield k, sum(map(value, group))
I think it's wonderful that a two-line helper function is all it takes to implement pivot tables efficiently.
There are quite a few really nice tools out there for this kind of data analysis (generally called Business Intelligence, or BI for short). None of them that I have found are procedural, they are all based on interactive dashboards. I'm sure that there is some scripting or XML formatting required behind the scenes to get the system set up to accept data, but after that it's all point-and-click.
The systems that I've seen/evaluated are Needlebase, Birst and Spotfire. None of them are particularly cheap, but if you're in a business where real-time access to data would help your team make better decisions, they could be very valuable.
For the record, the Lisp dialect that is mentioned in the comments is http://lush.sourceforge.net/
With respect to the discussion, it is on the R/Python side, i.e., a powerful general-purpose (Lisp) language with built-in statistics facilities.
Anytime I hear a discussion about designing computational tools for "non-programmers" I'm reminded of the subway in Mexico City, Mexico. The subway stops have nicely detailed pictures that are descriptive of the locations around the stops. This is because many people are illiterate. It's about time people realized that programming is literacy.
Kirix Strata looks like the dropbox equivalent for this space (http://www.kirix.com/). All it does is suck in data -> Create relationships -> pivot, graph and report. The people I know that use it swear by it because it fills one small gap instead of many.
    # group by Species (INDICES can be multivalued; see ?by)
    # sums of Sepal.Length and Sepal.Width,
    # means of Petal.Length and Petal.Width
    by(data = iris, INDICES = iris$Species, FUN = function(x) {
      y <- colSums(x[, 1:2])
      z <- colMeans(x[, 3:4])
      list(y, z)
    })
I've been using http://tablib.org/ for a while now to read in tabular data. With this summary fn and a few other functions to simplify the process of aggregating data into useful views, I think you've got a winning solution to the author's complaint.
"Dropbox uses Python on the client-side and server side as well. This talk will give an overview of the first two years of Dropbox, the team formation, our early guiding principles and philosophies, what worked for us and what we learned while building the company and engineering infrastructure. It will also cover why Python was essential to the success of the project and the rough edges we had to overcome to make it our long term programming environment and runtime."
There's also Tableau (http://www.tableausoftware.com/) for people interested in just pivoting data and charting. It's kind of expensive and PC-only, but it serves that function well.
As a Statistician, I used to use SAS, Stata, R, Excel and of course SQL to extract data but for the purposes of pretty, pretty charts, Tableau is king.