
> idiosyncratic scientist’s decision to write their analysis in Perl.

In my experience looking at "scientist"-produced code, the programming language matters very little. It is not hard to produce something completely inscrutable and non-replicable in Python and R, the same way it's been done for ages using SAS, Stata, MATLAB, etc.

I still see people rolling their own regression computations via matrix inversion and calculating averages as `sum(x)/n`.
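For anyone wondering why the matrix-inversion route is a problem: forming the normal equations squares the condition number of the design matrix, which is why library least-squares routines avoid the explicit inverse. A small NumPy sketch with made-up, nearly collinear data (not from any real analysis):

```python
import numpy as np

# Made-up design matrix with two nearly collinear columns.
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 + 1e-6 * rng.normal(size=200)   # almost a copy of x1
X = np.column_stack([np.ones(200), x1, x2])
y = X @ np.array([1.0, 2.0, 3.0])

# Forming X'X squares the condition number of the problem...
print(np.linalg.cond(X))        # already large
print(np.linalg.cond(X.T @ X))  # roughly its square

# ...so prefer a QR/SVD-based solver over the textbook (X'X)^{-1} X'y:
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The fitted values from the solver reproduce `y` even when `X'X` is too ill-conditioned to invert reliably.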

I really like PDL when I can use it. I have had problems building it from source on Windows in the past, but it is actually a very well thought out library.

Also worth mentioning, you can get a lot of mileage out of GSL[1].

[1]: https://www.gnu.org/software/gsl/




> It is not hard to produce something completely inscrutable and non-replicable in Python and R the same way it's been done for ages using SAS, Stata, MatLab etc.

It’s possible to write inscrutable code in any language, but some languages sure make it easier.

Syntax issues aside, the main advantage of Julia/Python/R (the latter’s syntax might even be worse than Perl’s) for scientific computing is their ecosystems. A language for a particular use case is only as good as the packages available for that use case. The scientific package ecosystems for Ju/Py/R are far richer than that of Perl, simply because their userbases are much larger. Thus, a scientist using Perl would likely be forced to roll a lot of their own functions, which makes the code idiosyncratic and more likely to contain bugs. (To use one of your examples, people might implement OLS regression by manually computing the hat matrix because no stats package exists for the language they’re using. Now imagine their language lacks something more complicated, like a robust MCMC sampler package à la PyMC3 or STAN, and they have to roll that themselves. Yikes.)

And that’s not even getting into the value of these languages for interactive scientific computing, which is how most of it gets done these days. For instance, there’s no official Jupyter notebook support for Perl (unofficial kernels exist, but they don’t support inline graphics/dataframes/other widgets), and the REPLs for Julia/Python/R are much more modern and fully featured than PDL’s pdl2 shell.

BTW, I agree the GSL is great for building standalone tools, but it’s totally irrelevant for any interactive work.


> For instance, there’s no official Jupyter notebook support for Perl

Not sure how official support would work in the Jupyter Project since anybody can write a kernel. I wrote the Perl one (IPerl) and that has existed since 2014 (when Jupyter was spun off from IPython). It supports graphics and has APIs for working with all other output types.

Now I do need to help make it work with Binder, but it does work.

---

The other point about MCMC samplers is valid. This is why I wrote a binding to R, to access everything available in R, and why I use Inline::Python sometimes. I should create a binding for Stan (it should not be hard), targeting CmdStan at first, then the Stan C++ API next.


> Now imagine their language lacks something more complicated, like a robust MCMC sampler package à la PyMC3 or STAN, and they have to roll that themselves. Yikes

CPAN predates all of those.


How is that relevant? CPAN is a package repository, not an MCMC sampling library. Can you point me to a Perl library that implements an API for constructing a probabilistic graphical model and then performs inference on it via MCMC, like PyMC3 or STAN? Is it as robust and fully featured as either of those?


Stan isn’t really written in any of those languages either.

Python's PyStan is a wrapper that ships data to and from the Stan binary and marshals it into a Python-friendly form; I think Julia's is similar.

I’m not exactly volunteering to do it, but a PerlStan would not be that hard to implement. As for scientific communication, a point you raised above, I don’t think it’d be too bad. Most readers of a paper would be interested in the model itself, and that would be written in Stan’s DSL regardless.


Fine, STAN is a bad example since it’s written as a DSL parsed by a standalone interpreter.

But tons of other numerical methods are also missing from Perl. To use another stats example, in another comment I noted that PDL only supports random variate generation for common distributions (e.g. normal, gamma, Poisson). Anything beyond stats-101 level and you’re on your own.


In bringing up CPAN, the other poster's point might have been that Matlab/Python/Octave don't generally contain native implementations of these either. Much of Matlab and NumPy wraps BLAS/ATLAS, for example.

One could do the same with Perl, and in fact, people have. If you need random variates from a Type 2 Gumbel distribution, for example, Math::GSL::Randist has you covered https://metacpan.org/pod/Math::GSL::Randist#Gumbel

Honestly, I'm not rushing to convert our stuff to PDL, but I did want to push back a little on the idea that python is The One True Way to do scientific computing. It's a fine language, but I think a lot of its specific benefits are overstated (or mixed in with the general idea of taking computing seriously).


Yep, there's more than one way to do things and PDL wraps all the same GSL functions <https://metacpan.org/pod/PDL::GSL::RNG#ran_gumbel1>.

Also note that PDL does automatic broadcasting of input variables so it does an entire C loop for an array of values being evaluated. See this example <https://gist.github.com/zmughal/fd79961a166d653a7316aef2f010...> for how that applies to all GSL functions that are available in PDL. Though I do notice that some of the distributions available at <https://docs.scipy.org/doc/scipy/reference/stats.html#contin...> are not in GSL.

Though when I do stats, I often reach for R and have done some work in the past to make PDL work with the R interpreter (it currently has some build bitrot and I need to fix that).


tbf, sum(x)/n is much faster than using Python's built-in mean function.


> sum(x)/n is much faster than using Python's built-in mean function.

`statistics.mean` uses `_sum` which tries to avoid some basic round-off errors[1]. I think the implementation of `_sum` is needlessly baroque because the implementors are trying to handle multiple types in the same code in a not so type-aware language. Regardless, using `statistics.mean` instead of `sum(x)/len(x)` would eliminate the most common rounding error source.

As for statistical modelling handled by directly inverting matrices, there the problem is singular matrices that appear to be non-singular due to the vagaries of floating point arithmetic, in addition to the failure to use stable numerical techniques.

The point remains: the detriment to science is people who convert textbook formulas directly to code instead of using implementations with good numerical properties.

Note:

    >>> x = [1e9 + .1, 1.1] * 50000
    >>> sum(x)/len(x)
    500000000.60091573
whereas

    >>> import statistics
    >>> statistics.mean(x)
    500000000.6
See also my blog post "How you average numbers matters"[2].

> Now, in the real world, you have programs that ingest untold amounts of data. They sum numbers, divide them, multiply them, do unspeakable things to them in the name of “big data”. Very few of the people who consider themselves C++ wizards, or F# philosophers, or C# ninjas actually know that one needs to pay attention to how you torture the data. Otherwise, by the time you add, divide, multiply, subtract, and raise to the nth power you might be reporting mush and not data.

> One saving grace of the real world is the fact that a given variable is unlikely to contain values with such an extreme range. On the other hand, in the real world, one hardly ever works with just a single variable, and one can hardly ever verify the results of individual summations independently.

Correct algorithms may be slower, but I hope it is easy to understand why they ought to be preferred.
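To make the tradeoff concrete, here is a compensated (Kahan) summation sketch. It is not what CPython's `_sum` does internally (that one, as I recall, works with exact ratios), but it captures the idea of paying a little extra work per element for a well-behaved result:

```python
import math

def kahan_mean(xs):
    # Compensated (Kahan) summation: carry a correction term holding
    # the low-order bits that a plain running sum throws away.
    total = comp = 0.0
    for v in xs:
        t = v - comp
        s = total + t
        comp = (s - total) - t  # rounding error of `total + t`
        total = s
    return total / len(xs)

x = [1e9 + .1, 1.1] * 50000
print(sum(x) / len(x))        # 500000000.60091573, as above
print(math.fsum(x) / len(x))  # exactly rounded: 500000000.6
print(kahan_mean(x))          # within a rounding error or two of that
```

`math.fsum` is the fully accurate (and slower) option in the standard library; Kahan summation is the cheap middle ground when you control the loop yourself.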

[1]: https://github.com/python/cpython/blob/5571cabf1b3385087aba2...

[2]: https://www.nu42.com/2015/03/how-you-average-numbers.html


Thanks for this post.

As this deserves to be better known, I submitted it here: https://news.ycombinator.com/item?id=27470323



