I think open source eventually replaces commercial products, in the same way that proprietary products become commoditized. The response for commercial products is also the same: continual differentiation, adding new features, benefits, support, documentation, etc. The exceptions are also the same: natural monopolies (e.g. strong network effects).
Open source is great at hill-climbing, by tapping the collective intelligence of its users: it excels where there are clear directions for improvement, and especially at features that users obviously need (provided the structure of the project is sufficiently modular to facilitate it).
It's not great at "hill-hopping": originating radically different products.
Counter-examples abound. Can you name even one open source app that has displaced a mature, user-facing desktop app with a non-trivial UI, other than a web browser?
Open source only seems to win in domains in which it makes sense for companies to share work in order to compete at a higher tier of functionality.
It has by no means "displaced" its proprietary equivalent, but Inkscape is one of the most user-friendly open source apps I've ever used. I find it far more intuitive than Illustrator. An incredible amount of power and complexity is presented in a way that makes it quite intuitive and a joy to use. It's also easily extensible if you're a programmer.
You skipped the bit about differentiation: if, for example, Photoshop didn't keep improving, do you think the GIMP would never catch up? I think you could find lots of examples where today's open source version is better than an x-year-old proprietary product.
The only way it can realistically happen is if the commercial product is not being improved (i.e. differentiated) any more. Your example is one of these: standardization is related to commoditization.
I suspect also that user-facing apps are easier to keep improving, because the user is right there and always has more needs that could be served (e.g. text editors will evolve until they can read mail; those that can't will be replaced by those that can). Non-user-facing apps tend to be defined by their environment rather than by users, although any component that creates a benefit the user wants more of will keep being improved (from Clayton Christensen), e.g. databases, CPUs.
You probably don't remember this, but Emacs did that in the 1980s. And then of course there's Android, but I guess you might not consider it a "desktop app". And then there's Wikipedia, which has completely displaced Encarta.
I don't think it makes sense to make generalizations about where "open source seems to win". Things are changing too fast; the circumstances that made it possible for Mozilla to beat IE in the mid-2000s no longer exist, for example.
I don't think it's obvious that open source displaces commercial for scientific computing. For every example like R, which has in many places displaced S-Plus, there are counterexamples like Matlab, for which the open source clone Octave is a bad joke (at least the last time I tried using it: missing functions, slowness, extreme difficulty installing), or Mathematica, or EViews, or GAUSS, or Maple.
One other potential factor: a lot of this software is driven by academic use, either because academics use it or because that's where people are first exposed to it, and academics often receive large discounts.
Today's Octave installs need no more than click-click-OK-done, or apt-get install octave.
> for which the open source clone Octave is a bad joke
I think you're not giving Octave enough credit. Considering that they have only a few part-time developers and that nobody sponsors them, they have accomplished a respectable amount of functionality over the last 20 years, and it is extremely unfair to call them a "bad joke".
Of course, with those limited resources they are not able to match the output of Mathworks, but what they can already do is usually more than universities teach, and _still_ many departments are essentially married to Matlab: they only mention Matlab to students, and give only Matlab examples, Matlab labs, Matlab exercises, etc. Even very respected people like Gilbert Strang, who taught MIT's introductory Linear Algebra and Computational Science and Engineering classes, seem to have enough of a vested interest in Mathworks not to mention Octave to students even briefly as something they can download and work with at home. Octave is extremely powerful and capable for what you pay for it, and deserves at least a mention.
It is probably no different at other universities and in other departments. Several professors I had to deal with were similar: either not even aware that open source packages like Octave, Scilab, Maxima, and SciPy exist at all, or extremely faithfully married to the companies behind proprietary packages like Matlab/Maple/Mathematica.
This is not a new issue. Institutionalised education mostly produces "knowledge workers", as MS put it a while back. But without the knowledge, of course.
When did you last try Octave? Professor Andrew Ng recommended using Octave (probably because it's free) for the online Stanford machine learning class (http://ml-class.org/).
Probably not for a while. At this point Octave has a TON of Matlab compatibility:
The grammar is pretty spot-on, although there is usually some release latency when Mathworks changes it (obviously, since their plans are not made known ahead of time).
Octave even has Matlab source-level compatibility for MEX files, although they are slower than Octave's own C interface.
If you start drifting away from Matlab core needs into the specialized add-ins Mathworks provides (Simulink, financial packages, etc.), then Octave can't help. If you need those, then I find that Matlab is rarely the tool for the job either (you just don't know it yet ;))
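For the core language, though, here's a rough, made-up illustration of the compatibility (the function name and data are invented for this comment): a plain .m file like the one below runs unchanged under both interpreters.

    % moving_average.m -- runs unchanged in Matlab and Octave
    % (illustrative only; nothing here is toolbox-specific)
    function y = moving_average(x, k)
      % smooth x with a length-k box filter, keeping the original length
      y = conv(x, ones(1, k) / k, 'same');
    end

    % at either prompt:
    % >> x = sin(linspace(0, 2*pi, 100)) + 0.1 * randn(1, 100);
    % >> y = moving_average(x, 5);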
You can't even compare R, a statistical engine, to Matlab (Matrix Laboratory), a numerical matrix-manipulation engine. They are different software packages designed for different purposes; e.g. plugging R into a high-grade NMR magnet and processing signals is probably not a good idea.
R is sick, though, and I am always pleasantly surprised at what clever people are doing with it. Octave, on the other hand, shouldn't be used until someone writes a proper interface and decent graphing. Matlab is a million light-years ahead of Octave in that regard.
I still hope there will be some syntax improvements to Python, then. Matlab's way of working with matrices is simply excellent: formulas on paper map almost one-to-one to the code. Apart from that, I'd take Python over Matlab any day, but I am forced to work with Matlab for some of my classes.
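To make the formulas-map-to-code point concrete, a tiny made-up example (the variable names are mine): ordinary least squares on paper is beta = inv(X'X) X'y, and the Matlab/Octave version is essentially a transliteration.

    % X is an n-by-p design matrix, y an n-by-1 response (made-up names)
    beta = (X' * X) \ (X' * y);   % literal transcription of the normal equations
    % or, numerically nicer, let backslash do the whole least-squares solve:
    beta = X \ y;

NumPy can do the same, of course (numpy.linalg.lstsq), just with a bit more ceremony, which is exactly the kind of thing the parent is wishing away.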
Anecdotally, NumPy (Python) has some traction. Similarly they don't consider SQL libraries. And I'm sure there are statistical analysis libraries for Java. According to the bar chart below R is mentioned by 45%, SQL by 32%, Python by 25%, Java by 24%. This seems a more reasonable comparison to me than the graphs earlier (higher up) in the post.
I use R as my primary data-analysis tool for almost all of my work, with occasional recourse to SAS for certain specialized models (e.g., PROC GLIMMIX for generalized mixed models).
My only complaints are the awful default IDE, which can be mitigated to a large extent by scripting elsewhere and source()ing the script, and some odd edge behaviors: the mystifying row names of data frames, the difficulty of dropping unused factor levels from aggregated or sliced data (another data frame issue), and the perhaps unnecessary obscurity of some of the plotting functions (although holding R responsible for the lattice library is unfair).
All that said, for a free tool, it's extraordinary, and the authors of the base language and the many packages that I use have my gratitude.
I love R, but I end up using Stata more often because it is easier to produce vector graphics that can be imported into Illustrator. I wish the R community would start to focus on graphics.
I've had some success with output from lattice using Cairo's SVG option, although you're right that it's never easy. Self-citing, the plots in these pubs were generated as above (JoCN may be behind a paywall):
> Robert A. Muenchen is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of r4stats.com, a popular web site devoted to helping people learn R. Bob is a consulting statistician with 30 years of experience
Disclaimer: I hate R's syntax, but my company's analytics group uses R for just about everything.
Unfortunately, it's almost impossible to work with very large datasets in R because of the speed limitations. Many researchers I know use Matlab because of this.
My recollection is that Octave is significantly slower than Matlab, and some quick googling on benchmarks [1] suggests that it is (was?) as slow or slower than R.
I've complained before that Octave is the wrong solution to the Matlab problem, and if you aren't attached to one of the many fine Matlab toolkits, you're likely better served translating to a more expressive language, like Python+NumPy+SciPy.
Octave is a Matlab clone; in fact, the Octave developers openly say that, except for some special cases, any difference between Octave and Matlab is a bug.
The biggest difference between Matlab and Octave is the JIT compiler in Matlab, which does an incredibly good job at vectorizing simple (or sometimes even not-so-simple) loops.
I think it's fair to say that Octave's performance is very close to Matlab's in the pre-JIT era.
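As a minimal, made-up sketch of why that matters (exact timings will vary with versions and hardware): both snippets below compute the same thing, but without a JIT the loop pays interpreter overhead on every iteration, while the vectorized form stays inside compiled built-ins.

    n = 1e6;
    x = rand(n, 1);

    % loop version: every iteration goes through the interpreter
    tic
    y1 = zeros(n, 1);
    for i = 1:n
      y1(i) = x(i)^2 + 2*x(i) + 1;
    end
    toc

    % vectorized version: a couple of calls into compiled built-ins
    tic
    y2 = x.^2 + 2*x + 1;
    toc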
There's also a huge difference in toolboxes, profiling, sparse matrix operations, parallel computing, and many, many other things. In these areas, I'm afraid, Octave is light-years behind Matlab.
However, you can still do a lot of useful simple stuff with Octave, and it's free! Matlab-like syntax is really, really cool when it comes to vectorized operations. So probably these two reasons determined Andrew Ng's choice of Octave as the main environment for ml-class. A huge win for Octave, I guess. This might spur some interest in the development and attract new people to the product. I think it's a well-deserved success for John W. Eaton and the other people who have developed Octave over all these years.
I agree with your take on Octave performance relative to Matlab. The Matlab parallel toolbox is getting more and more useful in a multicore world.
As you note, the Matlab profiler is very nice. You can zero in on the 80% of the 80/20 tradeoff very fast, during your usual development cycle. It's as simple as:
>> profile on
>> do_something
>> profile report
and you get a nice graphical/textual report on time usage in everything do_something called.
This is not true. They strive for Matlab language compatibility, but none of them refers to Octave as a "Matlab clone", nor are they working on cloning Matlab, nor was the project started to become a Matlab clone. It is like calling Linux a "Unix clone".
It's probably more an issue of easily pre-filtering/aggregating the data before analysing it with R. I like this approach of moving the calculation to the data, but we must be very late on the adoption curve if Oracle are doing it already.
For statistical genetics at least, it's common to process much of the data in parallel, so the RAM limitations on one R instance are not the gating factor.
Having seen and heard about what Bioconductor had to do to process genetic data, memory is a huge issue. It is even more so with next-generation sequencing data.
Yes, I guess I've always operated under the assumption that I've needed to parallelize dramatically. I usually operate on data from families of ~40 people with next-gen sequencing data, and the tools that I use generally finish within about an hour.
I use R every day for my research (social simulations, sometimes based on sample surveys). An additional limitation of R is memory: R cannot use virtual memory, so the maximum amount of data it can handle is limited.
There are two ways to deal with that: one is to load datasets through a SQL database (using a SQL library), which IMHO is a "dirty hack". The other (what I usually do) is to load the huge datasets in Stata (or any other stats package) and filter the data down to a set that is small enough to work with in R.
Other than that, the available libraries in R are crazy good. For example, stuff like Approximate Bayesian Computation or survey analysis (taking weight factors into account) is straightforward with the available libraries.
The core libraries available in R are some of the most well-reviewed, carefully written, and correct code available.
There is a huge number of available libraries (thousands!) of variable quality, thanks to the open nature of the project. But commercial software has problems too, especially with new and niche products. And when something goes wrong in those cases, you can't see why for yourself; worse, independent experts wouldn't have the chance to either.
He is probably comparing R to SAS (which are the two most popular statistical programming languages). SAS doesn't really have libraries; instead, you buy additional packages from SAS, which are very reliable and well supported, but expensive.
My company shuns R (although I personally like it), primarily because of this issue. If we need to run a rare or uncommon statistical procedure, it is a lot easier to trust the SAS procedure, rather than an open source R package written by some grad student.
True, though if you need to run a rare or uncommon stat procedure, SAS is not likely to have it in the core, and then you are back to using what "some grad student" wrote.
I am shunning SciPy, and to a lesser extent NumPy, for the same reason. I have reason to believe the developers are not experts in numerical linear algebra, and some of the documentation also does not inspire confidence.
Yes, but for less-adopted or emerging platforms, you have to be more conscious of the source of the library, and you should look at the source to verify its functionality.
Having worked on and off with SAS in recent years I'm aware it has its limitations, but round here we like constructive contributions please. Would you like to expand upon your remarks?