Will R Work on Apple Silicon? (r-project.org)
259 points by nojito on Nov 11, 2020 | 224 comments



A similar issue has come up for codes that I write. Among other things, I write low level mathematical optimization codes that need fast linear algebra to run effectively. While there's a lot of emphasis on BLAS/LAPACK, those libraries work on dense linear algebra. In the sparse world, there are fewer good options. For things like sparse QR and Cholesky, the two fastest codes that I know about are out of SuiteSparse and Intel MKL. I've not tried it, but the SuiteSparse routines will probably work fine on ARM chips; however, they're dual licensed GPL/commercial and the commercial license is incredibly expensive. MKL has faster routines and is completely free, but it won't work on ARM. Note, it works fantastically well on AMD chips. Anyway, it's not that I can't make my codes work on the new Apple chips, but I'd have to explain to my commercial clients that there's another $50-100k upcharge due to the architecture change and licensing costs due to GPL restrictions. That's a lot to stomach.


Apple's own Accelerate Framework offers both BLAS/LAPACK and a set of sparse solvers that include Cholesky and QR.

https://developer.apple.com/documentation/accelerate/sparse_...

Accelerate is highly performant on Apple hardware (the current Intel arch). I expect Apple to ensure the same for their M-series CPUs, potentially even taking advantage of the tensor and GPGPU capabilities available in the SoC.
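
If it helps anyone evaluate it, here's a minimal sketch of what the sparse API looks like from C, pieced together from the linked docs (untested; structure and values are illustrative). It factors a 3x3 SPD matrix stored as its lower triangle in compressed-column form, then solves Ax = b in place:

    #include <Accelerate/Accelerate.h>

    /* Build with: clang demo.c -framework Accelerate */
    int main(void) {
        int rowIndices[]    = { 0, 1, 2,   1, 2,   2 };
        long columnStarts[] = { 0, 3, 5, 6 };
        double values[]     = { 4.0, 1.0, 1.0,   4.0, 1.0,   4.0 };

        SparseAttributes_t attributes = {
            .triangle = SparseLowerTriangle,
            .kind = SparseSymmetric,
        };
        SparseMatrixStructure structure = {
            .rowCount = 3, .columnCount = 3,
            .columnStarts = columnStarts, .rowIndices = rowIndices,
            .attributes = attributes, .blockSize = 1,
        };
        SparseMatrix_Double A = { .structure = structure, .data = values };

        /* Factor once, solve as many times as needed. */
        SparseOpaqueFactorization_Double llt =
            SparseFactor(SparseFactorizationCholesky, A);

        double xb[] = { 1.0, 2.0, 3.0 };
        DenseVector_Double b = { .count = 3, .data = xb };
        SparseSolve(llt, b);    /* b is overwritten with the solution x */

        SparseCleanup(llt);
        return 0;
    }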


Huh, this actually may end up solving many of my issues, so thanks for finding that! Outside of their documentation being terrible, they do claim the correct algorithms, so it's something to at least investigate.

By the way, if anyone at Apple reads this, thanks for the library, but, you know, calling conventions, algorithms, and options would really help on pages like this:

https://developer.apple.com/documentation/accelerate/sparsef...


That's the documentation page for an enumeration value, not a factorization routine (hence there are no calling conventions, etc, to document; it's just a constant).

Start here: https://developer.apple.com/documentation/accelerate/solving... and also watch the WWDC session from 2017 https://developer.apple.com/videos/play/wwdc2017/711/ (the section on sparse begins around 21:00).

There is also _extensive_ documentation in the Accelerate headers, maintained by the Accelerate team rather than a documentation team, which should always be considered ground truth. Start with Accelerate/vecLib/Sparse/Solve.h (for a normal Xcode install, that's in the file system here):

    /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Headers/Sparse/Solve.h


Numpy and SciPy reject use of Accelerate due to faulty implementations of some routines. https://github.com/scipy/scipy/wiki/Dropping-support-for-Acc... We have never received any feedback from Apple about these bugs.


I noticed that SciPy has dropped support. I believe it wasn't only related to bugs, but also to a very dated LAPACK implementation (circa 2009). I can't tell from Apple's developer docs whether this has changed.

My sense is that Apple's focus is less on scientific computing and more on enabling developers to build computation-heavy multimedia applications.


Accelerate is also available (and highly performant) on ARM as well. I was not able to beat it with anything on ARM, including hand-coded assembly, at least for sgemm and simple dot products, which are the bread and butter of deep learning. It actually baffles me that Microsoft is not offering linear algebra and DSP acceleration in Windows out of the box. This creates friction, and most devs don't give a shit, so Windows users end up with worse perf on essentially the same hardware.
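
For anyone curious, the call being benchmarked is just plain CBLAS through the interface Accelerate bundles; a minimal sketch (assumes row-major inputs; C = A*B with A m-by-k and B k-by-n):

    #include <Accelerate/Accelerate.h>

    void sgemm_rowmajor(const float *A, const float *B, float *C,
                        int m, int n, int k) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0f, A, k,    /* lda = k: row-major, not transposed */
                          B, n,    /* ldb = n */
                    0.0f, C, n);   /* ldc = n */
    }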


ARM themselves made a half-hearted attempt at addressing this with their Ne10 project (https://github.com/projectNe10/Ne10), but as far as I could see from the outside they never committed any real resources to it, and it now seems to be abandoned (no public commits for three years).


There's also https://github.com/ARM-software/ComputeLibrary, but Accelerate easily blows the doors off it, on the same hardware.


It worked well on PowerPC too and helped with the Intel transition.


> I'd have to explain to my commercial clients that there's another $50-100k upcharge due to the architecture change and licensing costs due to GPL restrictions.

Your complaint is kind of strange. You're blaming "GPL restrictions" but the cost is for a commercial license.


Well, if the FOSS license used were e.g. MIT he wouldn't have to buy a commercial license; that's the parent's point. With GPL, he does, because otherwise his clients would have to make their own code/project conformant...


Yes, that's correct. I write open source software as well and I don't begrudge anyone for licensing under GPL. And, I'm perfectly willing to obtain a commercial license, but I'm going to pass that cost on to my customers. In this particular case, though, the question for them is whether they want Apple silicon bad enough to pay an additional $50-100k in software licensing costs to keep their code private or to just buy an Intel or AMD chip. I know where I'd spend my money.


How do these types of licenses deal with software updates in general? Presumably, at some point they'll need to buy a new license anyway, and the issue will be moot, right?

And Rosetta will probably be around for a while...


> How do these types of licenses deal with software updates in general? Presumably, at some point they'll need to buy a new license anyway, and the issue will be moot, right?

It sounds like Intel produces an implementation of this thing that works on Intel and makes it available for free, whereas ARM don't (although another comment suggests Apple actually do), so you have to buy an expensive third-party implementation instead. That's not a difference that'll go away in the short term, and you can see why a processor company might legitimately choose one or the other approach.


Apple released the first Intel Macs to consumers in 2006, and in 2011 removed Rosetta from Mac OS X, so I guess it depends on what you mean by a while.


You were pretty specific that it was entirely the fault of the GPL:

> I'd have to explain to my commercial clients that there's another $50-100k upcharge due to the architecture change and licensing costs due to GPL restrictions.


What point are you trying to make here? The poster has been very clear on the mechanics, which are quite understandable, but I don't understand what you are trying to say. Is it just that you think it does not put the GPL in a positive enough light? I don't mean to put words in your mouth, but that's my current best guess.


Apple forced them into a situation that gives them fewer options. That isn't a statement about how good or bad each option is. It's a statement about the consequences that Apple's choices have for developers.

If I'm a travel agent and an affordable hotel near a travel destination closes down, I might have to book my clients in a nicer but more expensive hotel. Their trip will be a bit more expensive. Or maybe they'll travel to a different city. It doesn't mean I dislike the nicer hotel.


It seems clear enough from context that the "GPL restrictions" are that if they used the GPL-licensed library, the commercial clients might run into legal issues with their use of it, necessitating that they purchase the commercial license. It's not uncommon for businesses to have a prohibition against using GPL software in not only their shipping products but anywhere in their toolchain. (You can argue that's a counterproductive prohibition, but "your legal department just needs to change their mind on this" may not be an argument a vendor can effectively make.)


I would not make an argument even if I thought a client would accept it. If they are incompetent they will decide to use the GPL code with sloppy oversight, violate the terms of the GPL, then they will hold a little grudge against you for the advice that got them in trouble. Sloppy companies have no internal accountability, so it's your fault.

I use GPL code all the time at home and I would license many things GPL, but there's no reason to push GPL software at corporations. They should have limited options and spend money, possibly expanding MIT code, possibly just raising the cost of engineering by keeping engineers occupied.


No, he was pretty clear that it was due to needing to use that solver due to it being the only one that works on ARM right now. The dual licensing was only relevant in that the client would have to pay for the commercial license (due to the GPL restrictions).

> MKL has faster routines and is completely free, but it won't work on ARM


That's still pretty silly. If the thing wasn't open source at all, you would still have to buy a license.

If your complaint is boo hoo, some people charge for software...well consider me unsympathetic.


Oh, so it's terrible to pay for software? How awful! Especially ironic because I'm sure the parent isn't working for free.


We all pay for software, but it's the amount that really shapes decisions. Most organizations have a dollar limit below which we can just charge a purchase card and above which we have to seek approval. In this particular case, the software costs are higher than what can likely go onto a p-card, so now it becomes a real pain to acquire. In fact, the software is so expensive that its cost would likely eclipse the cost of the computer itself. So basically, we're looking at a decision where the client can use a more performant library and save $100k as long as they stay off of Apple silicon.

That's really the point I'm trying to make and not to criticize anyone for using a GPL license. Moving to these new chips, in many cases, will be a much larger cost to an organization than just the cost of the computer.


>Oh, so it's terrible to pay for software?

Compared to not paying for it? Yes.

>Especially ironic because I'm sure the parent isn't working for free.

So? Who said that when you get paid yourself it stops being awful to have to pay for things?


I imagine the conversation with the clients will go like this:

- Here is a quote for 100k for adding SuiteSparse to the code.

- 100k‽ But I have found on the internet that SuiteSparse is free! Justify your quote.

At that point, they will have to explain to the client what GPL is and why they cannot use the free version.


> optimization codes

I'm curious: do people in numerical specialties say "codes" (instead of "code")? I don't often hear it that way, but I'm not in that specialty.


Really common usage in science/numerical computing.

I was trying to identify when, in normal usage, you'd say "numerical codes" rather than "numerical software" or just "numerical code". It seems a bit slippery!

Some contexts where it's prevalent: supercomputing, Fortran, national labs, large or multifaceted software. I also associate it with manager-speak ("our team has ported 77% of the simulation codes to HPSS").


Yes, this is a Fortran-ism which persists unto the present day.


Yes. e.g., "I work on multiphysics codes."

software => codes


Have you tried PETSc? It does sparse (and dense) LU and Cholesky, plus a wide variety of Krylov methods with preconditioners.

It can be compiled to use MKL, MUMPS, or SuiteSparse if available, but also has its own implementations. So you could easily use it as a wrapper to give you freedom to write code that you could compile on many targets with varying degree of library support.
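
A rough sketch of the "PETSc as a wrapper" idea (untested; minimal error checking omitted): assemble a small sparse system in code, then let run-time options pick the backend, e.g. `-ksp_type preonly -pc_type cholesky -pc_factor_mat_solver_type cholmod` if PETSc was built against SuiteSparse, with no code changes:

    #include <petsc.h>

    int main(int argc, char **argv) {
        Mat A; Vec x, b; KSP ksp;
        PetscInitialize(&argc, &argv, NULL, NULL);

        MatCreateSeqAIJ(PETSC_COMM_SELF, 3, 3, 3, NULL, &A);
        for (PetscInt i = 0; i < 3; i++)
            MatSetValue(A, i, i, 4.0, INSERT_VALUES);  /* diagonal toy system */
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        MatCreateVecs(A, &x, &b);
        VecSet(b, 1.0);

        KSPCreate(PETSC_COMM_SELF, &ksp);
        KSPSetOperators(ksp, A, A);
        KSPSetFromOptions(ksp);   /* solver/preconditioner chosen at run time */
        KSPSolve(ksp, b, x);

        KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
        return PetscFinalize();
    }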


I like PETSc, but how do its internal algorithms compare on shared memory architectures? I'd be curious if anyone has updated benchmarks between the libraries. I suppose I ought to run some in my copious amount of free time.

Sadly, the factorization I personally need the most is a sparse QR factorization and PETSc doesn't really support that according to their documentation [1]. Or, really, if anyone knows a good rank-revealing factorization of A A'. I don't really need Q in the QR factorization, but I do need the rank-revealing feature.

[1] https://www.mcs.anl.gov/petsc/documentation/linearsolvertabl...


PETSc developer here. You're correct that we don't have a sparse QR. I'm curious about the shapes in your problem and how you use the rank-revealed factors.

If you're a heavy user of SuiteSparse and upset about the license, you might want to check out Catamari (https://gitlab.com/hodge_star/catamari), which is MPLv2 and on par with or faster than CHOLMOD (especially in multithreaded performance).

As for PETSc's preference for processes over threads, we've found it to be every bit as fast as threads while offering more reliable placement/affinity and less opportunity for confusing user errors. OpenMP fork-join/barriers incur a similar latency cost to messaging, but accidental sharing is a concern and OpenMP applications are rarely written to minimize synchronization overhead as effectively as is common with MPI. PETSc can share memory between processes internally (e.g., MPI_Win_allocate_shared) to bypass the MPI stack within a node.


I'll have a look at Catamari and thanks for the link. Maybe you'll have a better idea, but essentially I need a generalized inverse of AA' where A has more columns than rows (short and fat). Often, A becomes underdetermined enough that AA' no longer has full rank, but I need a generalized inverse nonetheless. If A' were full rank, then the R in the QR factorization of A' is upper triangular. If A' is not full rank, but we can permute the columns so that the R in the QR factorization of A' has the form [RR S] where RR is upper triangular and S is rectangular, we can still find the generalized inverse. As far as I know, the permutation that ensures this form requires a rank-revealing QR factorization.

For dense matrices, I believe GEQP3 in LAPACK pivots so that the diagonal elements of R are decreasing, so we can just threshold and figure out when to cut things off. For sparse, the only code I've tried that's done this properly is SPQR with its rank-revealing features.
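
For reference, the dense version is a one-call affair via LAPACKE; a hedged sketch (the wrapper function and its name are mine, untested):

    #include <stdlib.h>
    #include <lapacke.h>

    /* Column-pivoted QR (dgeqp3). On exit, the upper triangle of a holds R
     * with diagonal entries of non-increasing magnitude, and jpvt holds the
     * column permutation, so thresholding the diagonal estimates the rank. */
    void rank_revealing_qr(double *a, lapack_int m, lapack_int n) {
        lapack_int *jpvt = calloc((size_t)n, sizeof *jpvt); /* 0 => free column */
        double *tau = malloc((size_t)(m < n ? m : n) * sizeof *tau);
        LAPACKE_dgeqp3(LAPACK_COL_MAJOR, m, n, a, m, jpvt, tau);
        free(jpvt);
        free(tau);
    }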

In truth, there may be a better way to do this, so I might as well ask: Is there a good way to find the generalized inverse of AA' where A is rank-deficient as well as short and fat?

As far as where they come from, it's related to finding minimum norm solutions to Ax=b even when A is rank-deficient. In my case, I know the solution exists for a given b, even though the solution may not exist in general.
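
Spelling the construction out in symbols, for anyone following along (my paraphrase in LaTeX, not necessarily the exact formulation intended; Pi is the permutation from the rank-revealing QR):

    \min_x \|x\|_2 \ \text{s.t.}\ Ax = b
        \;\Longrightarrow\; x = A^\top (A A^\top)^{+} b

    A^\top \Pi = Q R, \qquad
    R = \begin{pmatrix} R_{11} & S \\ 0 & 0 \end{pmatrix},
    \qquad R_{11} \in \mathbb{R}^{r \times r} \text{ upper triangular},\ r = \mathrm{rank}(A)

    A A^\top = \Pi \, (R^\top R) \, \Pi^\top
        \;\Longrightarrow\; (A A^\top)^{+} = \Pi \, (R^\top R)^{+} \, \Pi^\top

so only the triangular block R_{11} and the rectangular block S are needed to apply the generalized inverse, which is why the pivoting (and hence the rank-revealing factorization) matters.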


If you have one (or a small number of) right-hand sides, I would try to make LSQR work. It can find a minimum norm solution even if A is rank-deficient, and you can use preconditioning.

Also, if your problem is a good fit for a method like this, it could be impetus to add it to PETSc. https://epubs.siam.org/doi/pdf/10.1137/120866580


Unfortunately, in my case, the generalized inverse of AA' is the preconditioner for the system, which is why I need the factorization of A'. Essentially, I take this factorization and then run it through my own iterative method. When I run tests in MATLAB, SPQR scales fine for matrices of at least a few hundred thousand rows and columns. For larger, it would be nice to essentially have an incomplete Q-less QR factorization, which I don't think exists, but should be an extension of the incomplete Cholesky work.

But, yes, LSQR or more fitting LSMR solves a similar problem, but they're the iterative solver and I need the preconditioner, which I'm using the factorization for.


I've made the point that GCC and free linear algebra is infinitely faster on platforms of interest (geometric mean of x86_64, aarch64, ppc64le) while still having similar performance on x86_64. I thought MKL used SuiteSparse, or is that just MATLAB?


As far as I know, MKL has its own implementation. As some evidence of this, here's an article comparing their sparse QR factorization to SPQR, which is part of SuiteSparse [1]. As far as MATLAB goes, I believe it uses both. I've a MATLAB license and it definitely contains a copy of MKL along with the other libraries. At the same time, their sparse QR factorization definitely uses SPQR, which is part of SuiteSparse. In fact, there are some undocumented options to tune that algorithm directly from MATLAB such as spparms('spqrtol', tol). As a minor aside, this is actually one of the benefits of a MATLAB license: since MathWorks has purchased the requisite commercial licenses for the SuiteSparse codes, it makes it easier to deal with some commercial clients who need this capability at a lower price than a direct license itself. This, of course, means using MATLAB and not calling the library directly. It's one of the challenges to using, for example, Julia, which I believe does not bundle the commercial license, but instead relies on the GPL.

[1] https://software.intel.com/content/www/us/en/develop/article...


Just a note in support of Matlab's sparse capabilities. For the last couple of years, I used Matlab successfully on large, sparse multiplication and factorization problems. A friend who was using R simply could not approach the scale I was able to work at, and I assume it's due to weak sparse support.

I was multiplying and inverting sparse triangular matrices of size 650K x 650K with Matlab, on a laptop. Just amazing.


I'm surprised there doesn't seem to be anything in CRAN using SuiteSparse. It could presumably run at petascale, similarly to the dense support, if someone did similar work.


I doubtless mis-remembered about MKL, thanks.

I'm baffled why there would be a problem with commercial users running a free software program like Julia or GNU Octave+SuiteSparse; that's Freedom 0. (And commercial /= proprietary, of course.)


Most of the time, you're absolutely right, especially with how Octave or Julia code is normally distributed. The code is delivered to the client and the client runs the code on their system. No GPL violations have occurred.

That said, I believe it gets trickier once we start compiling the code. Say I want to develop a piece of software for my client and I don't want them to have the source. Octave doesn't really have a way to do this, but MATLAB does, and since MATLAB has purchased all of the requisite licenses, we're good to go. Julia makes me more uncomfortable. We can make binaries with PackageCompiler.jl, but if we do, we should be subject to the provisions in the GPL. That's no different than any other piece of software, but Julia, Octave, and MATLAB all use these libraries and most people don't know that something like the chol command hooks into SuiteSparse in the backend.


Yeah, the Julia devs are quite interested in removing our last few GPL dependencies and replacing them with something in pure julia. It'll take time though.


SuiteSparse switched from GPL to LGPL about a year ago if that makes a difference (for the couple of components I was looking at anyway).


Very cool and thanks for the heads up. I just went and checked and here's where it's at:

  SLIP_LU: GPL or LGPL
  AMD: BSD3
  BTF: LGPL
  CAMD: BSD3
  CCOLAMD: BSD3
  CHOLMOD Check: LGPL
  CHOLMOD Cholesky: LGPL
  CHOLMOD Core: LGPL
  CHOLMOD Demo: GPL
  CHOLMOD Include: Various (mostly LGPL)
  CHOLMOD MATLAB: GPL
  CHOLMOD MatrixOps: GPL
  CHOLMOD Modify: GPL
  CHOLMOD Partition: LGPL
  CHOLMOD Supernodal: GPL
  CHOLMOD Tcov: GPL
  CHOLMOD Valgrind: GPL
  CHOLMOD COLAMD: BSD3
  CSparse: LGPL
  CXSparse: LGPL
  GPUQREngine: GPL
  KLU: LGPL
  LDL: LGPL
  MATLAB_Tools: BSD3
  SuiteSparseCollection: GPL
  SSMULT: GPL
  RBio: GPL
  SPQR: GPL
  SuiteSparse_GPURuntime: GPL
  UMFPACK: GPL
  CSparse/ssget: BSD3
  CXSparse/ssget: BSD3
  GraphBLAS: Apache2
  Mongoose: GPL
There's probably a bunch of mistakes in there, but that's what I found scraping things moderately quickly. Selfishly, I'd love SPQR to be LGPL, but everyone is free to choose a license as they see fit.


Would their workflow allow just keeping a server on hand to do the number crunching, and still getting to use Apple Silicon on a relatively thin client?


>MKL has faster routines and is completely free, but it won't work on ARM.

It will probably be ported though, if there's a demand...


Maybe, but note that this is the Intel MKL. A library developed and maintained by Intel. It is not a secret that Intel does this to support their ecosystem, and they have been caught intentionally crippling support for AMD processors in the past [1]. Intel has recently been adding better support for AMD processors [2], but many suspect that is intended to help x86 as a whole better compete with ARM. If it does get ported, it is highly unlikely to have competitive performance.

[1] https://news.ycombinator.com/item?id=24307596

[2] https://news.ycombinator.com/item?id=24332825


Thanks for the links. If anyone is wondering about some of the hoops that need to be jumped through to make it work, here's another guide [1].

One question in case you or anyone else knows: what's the story behind AMD's apparent lack of math library development? Years ago, AMD had ACML as their high-performance BLAS competitor to MKL. Eventually, it hit end of life and became AOCL [2]. I've not tried it, but I'm sure it's fine. That said, Intel has done steady, consistent work on MKL and added a huge amount of really important functionality such as its sparse libraries. When it works, AMD has benefited from this work as well, but I've been surprised that they haven't made similar investments.

Also, in case anyone is wondering, ARM's competing library is called the Arm Performance Libraries. Not sure how well it works and it's only available under a commercial license. I just went to check and pricing is not immediately available. All that said, it looks to be dense BLAS/LAPACK along with FFT and no sparse.

[1] https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AM...

[2] https://developer.amd.com/amd-aocl/


> Eventually, it hit end of life and became AOCL [2]. I've not tried it, but I'm sure it's fine.

It's ok. I did some experiments with transformer networks using libtorch. The numbers on a Ryzen 3700X were (sentences per second, 4 threads):

OpenBLAS: 83, BLIS: 69, AMD BLIS: 80, MKL: 119

On a Xeon Gold 6138:

OpenBLAS: 88, BLIS: 52, AMD BLIS: 59, MKL: 128

OpenBLAS was faster than AMD BLIS. But MKL beats everyone else by a wide margin because it has a special batched GEMM operation. Not only do they have very optimized kernels, they actively participate in the various ecosystems (such as PyTorch) and provide specialized implementations.
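
The batched entry point being referred to is presumably cblas_dgemm_batch; a hedged sketch of one group of identical multiplies (signature as in recent MKL versions; untested):

    #include <mkl.h>

    /* One "group" of `batch` identical m-by-k times k-by-n multiplies,
     * row-major, no transposes: Cs[i] = As[i] * Bs[i]. */
    void batched_matmul(const double **As, const double **Bs, double **Cs,
                        MKL_INT m, MKL_INT n, MKL_INT k, MKL_INT batch) {
        CBLAS_TRANSPOSE transA = CblasNoTrans, transB = CblasNoTrans;
        double alpha = 1.0, beta = 0.0;
        cblas_dgemm_batch(CblasRowMajor, &transA, &transB,
                          &m, &n, &k,
                          &alpha, As, &k,   /* lda = k */
                                  Bs, &n,   /* ldb = n */
                          &beta,  Cs, &n,   /* ldc = n */
                          1, &batch);       /* 1 group of `batch` GEMMs */
    }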

AMD is doing well with hardware, but it's surprising how much they drop the ball with ROCm and the CPU software ecosystem. (Of course, they are doing great work with open sourcing GPU drivers, AMDVLK, etc.)


If you care about small matrices on x86_64, you should look at libxsmm, which is the reason MKL now does well in that regime. (Those numbers aren't representative of large BLAS.)


A free version of the Arm Performance Libraries is available at:

https://developer.arm.com/tools-and-software/server-and-hpc/...


> What's the story behind AMD's apparent lack of math library development?

I don't see a story. AMD supports a proper libm for gcc and llvm, has its own libm, BLAS, LAPACK, ... at https://developer.amd.com/amd-aocl/

Just their rdrand intrinsic is broken on most Ryzens if you didn't patch it. Fedora firmware doesn't patch it for you.


You just run MKL from the oneapi distribution, and it gives decent performance on EPYC2, but basically only for double precision, and I don't remember if that includes complex.

ACML was never competitive in my comparisons with Goto/OpenBLAS on a variety of opterons. It's been discarded, and AMD now use a somewhat enhanced version of BLIS.

BLIS is similar to, sometimes better than, ARMPL on aarch64, like thunderx2.


In what world will Intel port MKL - Intel intellectual property - to ARM? The whole purpose of Intel's software tools is as an enabler and differentiator for their architecture and specifically their parts.


I don't know about this proprietary technology specifically, but Intel is a huge company with some FOSS friendliness. USB 4 is based on Thunderbolt 3, so I guess they licensed that one.


In a world where Intel already had licensed ARM and built it in the past:

https://newsroom.intel.com/editorials/accelerating-foundry-i...


That linked article from 2016 is about Intel's Custom Foundry program, which I'm fairly sure is for building chips under contract to other companies. It promotes that they have "access to ARM Artisan IP," but doesn't specifically mention an ARM version of MKL that I see. Intel's page on MKL itself lists compatible processors, and ARM is conspicuously absent:

https://software.intel.com/content/www/us/en/develop/tools/m...

And, this question on Intel's own forums from 2016 at least suggests that there wasn't an MKL version for ARM in the time frame of the article you're linking to, either:

https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Libr...

So, from what I can tell, while Intel is an ARM licensee and made ARM CPUs in the past, they haven't made their own ARM CPUs for years and there's no sign they ever made MKL for any ARM platform. Never say never, but I think the OP is basically right -- there's not a lot of incentive for Intel to produce one.


Intel had sold most of the relevant ARM IP and product lines to Marvell in 2006.


MKL is heavily optimized for Intel microarchs and purposely crippled on AMD (I believe dgemm is fast, sgemm slow). I don't think MKL benefits from optimizing it for Apple Silicon, especially considering Apple ditched Intel's hardware.


No it won't. MKL is an Intel toolkit, so they will surely not support Apple's move to dump Intel processors.


> The ARM architecture floating point units (VFP, NEON) support RunFast mode, which includes flush-to-zero and default NaN. The latter means that payload of NaN operands is not propagated, all result NaNs have the default payload, so in R, even NA * 1 is NaN. Luckily, RunFast mode can be disabled, and when it is, the NaN payload propagation is friendlier to R NAs than with Intel SSE (NaN + NA is NA). We have therefore updated R to disable RunFast mode on ARM on startup, which resolved all the issues observed.

Hmm. ELF object files for Arm can represent this with build attributes [1]:

    Tag_ABI_FP_denormal, (=20), uleb128
        0  The user built this code knowing that denormal numbers might be flushed to (+) zero
        1  The user permitted this code to depend on IEEE 754 denormal numbers
        2  The user permitted this code to depend on the sign of a flushed-to-zero number being
           preserved in the sign of 0

    Tag_ABI_FP_number_model, (=23), uleb128
        0  The user intended that this code should not use floating point numbers
        1  The user permitted this code to use IEEE 754 format normal numbers only
        2  The user permitted numbers, infinities, and one quiet NaN (see [RTABI32_])
        3  The user permitted this code to use all the IEEE 754-defined FP encodings
Seems like their code should be tagged Tag_ABI_FP_denormal = 1, Tag_ABI_FP_number_model = 3 if it were an ELF .o, .so, or executable, in which case <waves hands> some other part of the toolchain or system would automatically configure the floating point unit to provide the required behavior.

Does Mach-O have a similar mechanism?

[1] https://github.com/ARM-software/abi-aa/blob/master/addenda32...
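
For concreteness, what R does at startup (per the quoted article) is flip the FPU control bits directly. A hedged AArch64 sketch of what "disabling RunFast mode" amounts to (bit positions per the Arm architecture manual; illustrative, not R's actual startup code):

    #include <stdint.h>

    static inline uint64_t read_fpcr(void) {
        uint64_t v;
        __asm__ volatile("mrs %0, fpcr" : "=r"(v));
        return v;
    }

    static inline void write_fpcr(uint64_t v) {
        __asm__ volatile("msr fpcr, %0" : : "r"(v));
    }

    void disable_runfast(void) {
        uint64_t fpcr = read_fpcr();
        fpcr &= ~((1ULL << 24)    /* FZ: flush denormals to zero */
                | (1ULL << 25));  /* DN: replace NaN results with default NaN */
        write_fpcr(fpcr);
    }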


I wonder what happens if you `dlopen` a shared object that wants stricter behavior than the current executable and loaded shared objects. Does it somehow coordinate changing the state for all existing threads?


From GP's link:

> Procedure call-related attributes describe compatibility with the ABI. They summarize the features and facilities that must be agreed in an interface contract between functions defined in this relocatable file and elsewhere.

Seems like it might be reasonable to reject mismatched combinations.


Note that build attributes are only supported on AArch32, not AArch64.


Does that second setting imply that those NaNs need to be propagated? If not, then those settings aren't great. Sure, there are lots of chips where denormal behavior and NaN preservation are the same setting, but those could and probably should be split up in the future.


Brushing up on the difference between NA and NaN makes the article considerably easier to read.

NA - Not available
NaN - Not a number

See this short and concise article on the differences: https://jameshoward.us/2016/07/18/nan-versus-na-r/
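
To connect this to the article: R's NA_real_ is an ordinary double NaN carrying the payload 1954 in its low word (per R's arithmetic.c), and "default NaN" mode discards exactly this payload, collapsing NA into plain NaN. A small C sketch:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Construct R's NA_real_: high word 0x7ff00000, low word 1954. */
        uint64_t na_bits = ((uint64_t)0x7ff00000 << 32) | 1954;
        double na, y;
        memcpy(&na, &na_bits, sizeof na);

        y = na * 1.0;   /* IEEE 754 says the result "should" keep the payload */

        uint64_t y_bits;
        memcpy(&y_bits, &y, sizeof y);
        /* With payload propagation this prints 1954, so R can still tell NA
         * from NaN; under default-NaN mode the payload would be lost. */
        printf("low word after NA * 1.0: %llu\n",
               (unsigned long long)(y_bits & 0xffffffffu));
        return 0;
    }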


FORTRAN can be compiled on ARM Macs, but only commercially for now. https://www.nag.com/news/first-fortran-compiler-apple-silico...


I am probably the last person to talk about the difference between Fortran versions, but isn't the linked compiler for FORTRAN 2003 and 2008, whereas R needs a FORTRAN 90 compiler?


Fortran versions are additive. I.e., F03 is a strict superset of F90, and thus an F08 compiler can do F03 and F90.


Only roughly additive. Some obsolete features have been dropped, though that doesn't mean compilers have dropped them.


Which means they support everything from Fortran 90 and then some.


Can the R project use that compiler to build binaries and then upload the release to the public?


GCC/gfortran should soon follow, no?


The current effort is done with this out-of-tree repository: https://github.com/iains/gcc-darwin-arm64

It supports macOS on Apple Silicon quite well and will hopefully be merged into mainline GCC soon.


Soon and soon. Problem is that macOS/ARM64 has a new ABI [1], and nobody has implemented that in GCC. A couple of people are working on it apparently on their own time, but it's a fairly significant undertaking. Might be ready for GCC 11 which if history is a guide should be released in spring 2021. Or then it might not be ready.

[1] Why not use the standard ARM64 ABI as published by ARM? Well shits and giggles.


It’s the same as the iOS ABI.


It is possible that Apple Inc will forcefully eliminate GCC from their platforms, replaced with Clang/LLVM and other non-GPL tools only.


They're certainly well on their way to remove all GPL code from macOS itself if they haven't yet, but there's not much they can do to prevent you from installing GPL software yourself (nor much of a motivation to do so for that matter).


MacOS already doesn’t include GCC. The GCC command on MacOS is an alias for Clang unless you install gcc separately (unless there is a hidden GCC install I’m unaware of).


No, it really isn't.


R doesn't even work that well on Intel, at least in Ubuntu. Recompiling the package with AVX support often leads to a 30% performance increase on modern CPUs.

IMO the R base package should dynlink different shared libraries for different processors since vector extensions are mostly tailored to the kind of floating point numerical work that R does.


That's deliberate though, when you distribute software you choose the lowest common denominator in general, and that's SSE2 for 64 bit machines.


Many linear algebra libraries will handle this at runtime for you (OpenBLAS & MKL do this for example). You generally only need to use specialized builds of these if you don't want to have to ship extra code paths you won't use.
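
For anyone unfamiliar with the pattern, runtime dispatch looks roughly like this: ship several kernels in one binary and pick one after probing the CPU at load time. The kernel names below are hypothetical; __builtin_cpu_supports is a real GCC/Clang builtin on x86:

    #include <stdio.h>

    static void dgemm_sse2(void) { puts("baseline SSE2 kernel"); }
    static void dgemm_avx2(void) { puts("AVX2 kernel"); }

    static void (*dgemm_kernel)(void) = dgemm_sse2;

    __attribute__((constructor))
    static void pick_kernel(void) {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx2"))
            dgemm_kernel = dgemm_avx2;
    }

    int main(void) {
        dgemm_kernel();   /* dispatches to the best kernel for this CPU */
        return 0;
    }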


See, this is why Gentoo is the right way to manage an OS. /s


The SIMD makes little difference to the bulk of the system, but you want dynamic dispatch where it does matter, at least on a typical HPC system (which won't still be homogeneous five years down the line) and in things like libc generally.


When you install Python's numpy, I'm pretty sure it chooses a pre-built package based on your hardware, and if it doesn't have one I think it's pretty easy to get it to build the best one from scratch.


It generally installs MKL if you install a wheel (i.e. 'pip install numpy') these days, which dynamic dispatches based on the processor. It's been criticised a bit though because MKL doesn't perform as well on AMD hardware without setting some environment variables in the older versions, although it looks like they've added kernels that target AMD hardware recently.


The thing that makes a significant difference is BLAS, and it's easy to substitute. There are some old numbers at https://loveshack.fedorapeople.org/blas-subversion.html#_add... Most of it is unlikely to benefit much from -mavx and vectorization, but I have no numbers. -fno-semantic-interposition is probably a better candidate, which I've not got round to trying.


Looks like I need to compile R and my libraries with AVX2 support.

# list available instructions

    $ gcc -march=native -dM -E - < /dev/null | egrep "SSE|AVX" | sort 
This post has some info, but no benchmarks. https://stackoverflow.com/questions/37213060/does-r-leverage...


The same can be said for Python; TensorFlow jumps to mind.


Ask HN: Does the FORTRAN issue also affect Numpy/Scipy for Python?


Yes, this will impact Numpy/Scipy as much as R, as they depend on many of the same linear algebra libraries.

Edit: actually, looks like I'm wrong; emghost appears to know more about this than me.


Some of the NaN-handling issues are unique, but the compiler availability itself is still a problem.

https://twitter.com/StefanKarpinski/status/12929837172128931...

Since then, GCC has stepped up with this out-of-tree build, but it's still not 100% to my knowledge.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96168


How popular is R in general?

I started learning it because I want to make an attempt to do some projects on Kaggle. Most people use Pandas, Seaborn, etc., which I will also use.

However, to me R appears like a little better Swiss Army Knife to do initial analysis. ggplot2, tidyverse, ...

Any help leveling up would be appreciated.


As a data scientist who is proficient in both Python and R ecosystems, in my opinion R/tidyverse is substantially better for ad hoc EDA and data visualization.

However, Python is better for nearly everything else in the field (namely, working with nontabular data, external APIs, deep learning, and productionization).

It's about knowing which tool to use.


You forgot time series analysis where Python is years behind R. Robust regression methods too. But most of all, Shiny! Python's Dash for creating interactive data web apps is absolutely horrible compared to Shiny.

Basically if you were to go down any unbeaten path when it comes to statistical models, you're better off using R. But if your main goal is pushing something to prod, then you're better off with Python. The only exception is Shiny; they've put a lot of effort into making it production-ready.


> You forgot time series analysis where Python is years behind R

What does R offer?

In Python there's SARIMAX and Prophet, interested in what R has to offer.

Also interested in a decent Grid Search for time series.


R also has prophet, with some native ggplot2 integrations. https://facebook.github.io/prophet/docs/quick_start.html#r-a...


Thanks, but I'm more interested in what R has on time series, which Python does not.

Just checked Prophet source code and it looks like there are actually 2 implementations in the same repo: one for R and one for Python. I thought it would have been the usual C/C++ implementation with 2 bindings. I wonder why they chose to develop it this way.


R has the builtin ts library, and also see this Task View with literally hundreds of packages: https://cran.r-project.org/web/views/TimeSeries.html



tidyverts and fable are very nice, though I'd also be interested in seeing what others have to say, as I don't do much time series analysis. Prophet pretty much covers all my use cases thus far.


You can use Prophet in R for sure


Hyndman is our king!


Ditto (not that I am proficient, but my experience matches).

However, because the rest is easier in python, and my mental gears grind when I switch from one to the other, I end up using Python for the adhoc EDA and viz, and with Spyder, it is a pretty decent experience.


R is 7th on this list:

https://pypl.github.io/PYPL.html


> (namely, working with nontabular data, external APIs, deep learning, and productionization).

I agree with all of that except for productionisation. I would have agreed before dealing with issues around getting consistent versions of python + libraries to run.

The issues I see with Python are as follows:

- pip doesn't actually check to make sure your dependencies are compatible, which causes real problems with numpy et al

- conda isn't available by default, and running it on remote boxes is non-trivial (I spent a whole week figuring out how to get it running in a remote non-login shell).

- This makes it really, really difficult to actually get a standard set of libraries to depend upon, which is really important for production.

R, on the other hand, actually resolves dependencies in its package manager, and R CMD build for packages, while super annoying, helps you produce (more) portable code (did you know that conda doesn't provide cross-platform yml files unless invoked specifically?).

In terms of handing it over to engineering/non data science people though, Python is much much much better.

tl;dr Python's an ace language with a terrible production story.


I agree and disagree. We deploy our models via Cloud Foundry which has support for Anaconda.

Model building is done in AWS with access to Anaconda.

Usually we have an environment.yml for the REST API and one for model building.

This makes modeling -> deployment cycle fairly easy, if not perfect.

You can also use pip and venv, but you have to make sure that all important dependencies are pinned precisely enough. But that's also the case for Anaconda. (For instance, we had a problem in the API with an x.x.y release of greenlet or gevent since we only specified x.x.)

For R, we use packrat. R IMHO has the problem of many different algorithms with different APIs. Yes, there are tools like caret, but 'you' will run into problems with the underlying implementations eventually. sklearn makes things easier here, at least most of the time.

I would also prefer R for EDA. But I don't like splitting EDA and modeling that way, since there can be subtle differences in how data is read, which can lead to hard-to-find problems later on. (Yes, you could use something like feather.)

I also think that tooling for Python is much nicer; pytest, black, and the VSCode Python integration just seem more mature.


> R, on the other hand, actually resolves dependencies in its package manager

And you have packages like renv which also help isolate specific versions of packages to make portable environments even more reliable.


‘renv’ (only) does more or less what pyenv does. Contrary to what the parent comment says, R doesn’t actually do any dependency resolution at all, and the official package repository (CRAN) doesn’t even archive many old versions (though MRAN does).

I strongly prefer R for data science, but its dependency management story is poor, even compared to Python’s (which, in turn, is poor compared to Rust/Ruby/…).


That's simply not true. R doesn't store old versions, which is actually brilliant because your code breaks when your dependencies rot.

Python will silently upgrade numpy as a transitive dependency and break everything, which is much worse. MRAN also has daily snapshots, which is normally how I handle stuff that will never be updated.

I also specified building an R package which does handle dependencies versus the python equivalent which does not.

I'm not saying R is good, I'm just saying Python is way worse.


> even compared to Python’s (which, in turn, is poor compared to Rust/Ruby/…)

Compared to Rust? Sure. Compared to Ruby? Maybe in the way that a lockfile isn't automatically generated when using pip.

Hating on Python's dependency management is a meme at this point. You could do a lot worse than the current pip + venv, and upgrading to something like poetry or pipenv is pretty painless. I'm pretty sure 99% of problems occur because people don't pin stuff.


And what libraries do you use in R to generate a REST api that has Swagger 3 documentation? Authentication with JWT tokens? Monitoring, e.g. ApplicationInsights?


For that, R would not be the right choice. I stand by my comments on how difficult it is to productionise python ML applications.

I think all of those things you mentioned are JVM stuff, right? There's a version of R called renjin that could be used in that scenario.

Don't get me wrong, I'd love if this was better in python but right now it is far more difficult than it needs to be.


Nope, none of it is JVM stuff. It's pretty standard stuff if you want to ship an API into a production environment and expect other developers/services to interact with your model. How do you know your model is failing/slow to serve requests? You need monitoring/logging. How do add security? I'm talking API security, like JWT tokens with scopes and claims.

Maybe we mean different things by "productionising ML applications" but building a docker container with an R runtime and the correct package versions is not all, or even half, of what's required for production.


Why would you have any of this tightly coupled to your model?

Set up a separate API gateway, which covers all your points (REST endpoints, monitoring, security) - there's plenty of off-the-shelf options. Route authenticated requests to the backend that runs your model.


Depends on your model. Mine score users daily, so I don't need to worry about building an API.

Logging is pretty available in both (though better in Python to be fair).

I don't really see how building my model in Python would make it easier to add this API functionality either, so it's a bit irrelevant. Like my docker container (which appears to be almost essential in Python but nice in R) can call predict in any language, and then pass through to the API using the tools noted above.


Probably the most popular non-software engineer language for working with data.

Millions and millions of users that have no idea what this blog post is technically about (but is interesting nonetheless)


I think this nails it. Given both python and R, these people will pick R because the motivation for using R is very clear: it does data analysis and statistics. Whereas python kinda does everything and that makes it a bit more tricky to understand.


Second after Excel?


Fair, although I would say Excel and R are serving two separate purposes. But yes, Excel is of course #1


> Fair, although I would say Excel and R are serving two separate purposes.

It's true that they target different use cases overall (obviously with some overlap), but Excel tends to be used for lots of things that would be better handled with a different tool, because it's what people know.


Matlab has a substantial userbase, too.


Agreed, but IMO R >> MatLab and SAS over the past ~10 years. Both Matlab (physics??) and SAS (pharma/financial) seem to have further sunk into deep niches.


Matlab is still big in engineering as well. Matlab+Simulink in particular seems to have a fairly entrenched niche.


> However, to me R appears like a little better Swiss Army Knife to do initial analysis. ggplot2, tidyverse, ...

R is far superior for interactive exploration/analysis and report writing. However Python is far superior if you are writing a program that does other things too.

My rule of thumb is that if a Python program is 70% or more Numpy/Pandas/Matplotlib etc then it should be R. Whereas if an R program does comparatively little analysis and a lot of logic and integration, it should be Python. No one size fits all.


Can you "import R" in python? Sounds like that would be best.


Call Python from R: https://cran.r-project.org/web/packages/reticulate/vignettes...

Calling R from Python: https://pypi.org/project/rpy2/

I'm not sure I'd ever mix the two directly myself; I'd compose my application as separate R and Python communicating somehow. It seems cleaner.


Haha, what you said made me think: sagemath!

And sure enough it has something to tie in R too, truly connecting everything in a big nest: https://doc.sagemath.org/html/en/reference/interfaces/sage/i...


Extremely popular and widely used. Pandas etc are Python implementations of R constructs


think of it like shell scripting for statistics, although not nearly as limited as bash is compared to other programming languages.

it works best if it's used semi-interactively, as a glue language between statistical packages which may be written in other languages. or to write simple "batch" scripts that basically just run a bunch of procedures in a row.

RStudio makes the whole experience much nicer in terms of plotting, and RMarkdown is great for preparing documents.

of course like shell scripting you can write fairly complicated programs in it, and sometimes people do, but due to backwards compatibility and weird design choices meant to make interactive use easier, programming "in the large" can get weird.

the analogy works for Python too -- it is definitely reasonable to use Python for shell scripting, but using Python interactively to pipe things from one program to another is slightly more frustrating than doing it in the shell, although might be preferred due to its other advantages.


In my experience I’ve seen R used in more exploratory/ad hoc type analysis and algorithm development by “non-developers”—-statisticians, scientists, etc. usually without performance consideration—-and that code is then turned into production code with the dev team using Python or C or something more performant or maintainable.


R is a nasty, nasty language for productionalizing things; honestly, it's just too flexible and lets you do the craziest things.

But being so flexible makes it really expressive for doing ad-hoc analysis where you really don't know what you're looking for yet.


It’s not just the language/runtime though. It’s the entire ecosystem around that that’s required to productionize it in any modern sense of the word. It’s hard to get right even in Python. I mean Flask (even with restx) still doesn’t even generate Swagger 3 documentation.


I work with people who mostly have a background in the social sciences or humanities and who work in R pretty much every day. They don't see themselves as programmers and Python is complete gibberish for them, while R just makes sense. When I meet people from other companies in roughly the same space (I work in healthcare doing data analysis), it's mostly the same. I actually meet more people who use SAS/SPSS than Python.

For data analysis, R is in my opinion better than Python. It's when you have to integrate it in existing workflows that Python quickly becomes a better choice.


Similar story for me. I am an Engineer (the non software type). I work at an industrial plant. We use SAS pretty extensively for data analysis, time series analysis, multivariate regressions etc. As well as for BI type stuff (reports, graphing, adhoc queries).

For a while R was being pushed pretty heavily as a SAS alternative. My org paid for R training courses etc. I found R and SAS pretty comparable, at least for the R packages we looked at (dplyr, ggplot2 etc).

I know about Python, the programming language; I used PyGTK back in the day to build GUI apps. But it would not be my first thought for doing data analysis work. Does Python even offer something like RStudio / SAS Enterprise Guide, and does it have a trending package?


It's not so much that Python is gibberish but that it is written, as far as I know, predominantly by engineers, who aren't really experts in statistics, or experts in science for that matter. A scientist will tend to trust code written by another scientist more than code written by an engineer. At least, I would.


I find this statement interesting. Historically scientists have a reputation for writing relatively poor code. Code that runs really slowly due to things like unintended nested loops, or striding values (x,y vs y,x). And code that doesn't handle non-happy path cases very well.

Are you saying that you trust the code more because the domain knowledge makes it more likely to get the right answer, then? Has general knowledge increased such that scientists' code isn't as painful as it was 20 years ago?


Yeah, exactly, even though it might be terribly inefficient, I still trust scientific code more when it's written by scientists than when it's written by non-scientists, in terms of getting the right answer.


For what it is worth (not at all clear), TIOBE ranked R as the 9th most popular programming language in the world this month: https://www.tiobe.com/tiobe-index/. For comparison, Python is ranked number 2.


Very popular in academia, moderately popular in industry when it comes to data science/analysis. In any case, it's very powerful, while Python certainly has numerous advantages over it.


More popular than I wish it was. It is the bash of the data science world. Totally ubiquitous and kind of a dumpster fire.


I disagree, the language is extremely powerful for interactive data exploration. A terse one-liner is all it takes to compute something like "what's the correlation between number of children and home size for people over 45 who live in counties with income variance at the 90th percentile weighted by population".

Not that pandas/scipy/numpy don't do an admirable job. You can do something like this, but it's nowhere near as ergonomic as it is in R. At the end of the day, R is fundamentally a language for data exploration, whereas with python those facilities are bolted on top of a general purpose environment.


R is a bit of a horror show under the hood, I agree, but if you're just an end user doing data analysis, consider:

    flights.iloc[0:10, flights.columns.get_indexer(['year', 'month', 'day'])]
versus

    flights %>% select("year", "month", "day") %>% head(10)
I could go on...


You could have done

    flights[['year', 'month', 'day']].head(10)
which is not so different from standard R

    head(flights[c("year", "month", "day")], 10)
but it's true that the following may be nicer

    flights[1:10, c("year", "month", "day")]
(by the way using head(10) is not the same as indexing 1:10 if there are less than 10 rows)


I totally agree that it is a very efficient and powerful tool for ad-hoc data analysis. It's just not what I would view as a responsible choice for production / publication code.


    flights[["year", "month", "day"]].head(10)


It’s pretty sad that both are worse than

     SELECT year, month, day
       FROM flights
      LIMIT 10


SELECT before FROM isn't really a good thing.


That's a very dishonest example. Yes, terribly written pandas code looks terrible.


flights[1:10, .(year, month, day)]

for the data.table fans

Which is arguably the superior way to handle tabular data in 2020.


This is the best description of R that I've ever come across, and I say that as someone who learned R as their first programming language.

Big mistake, btw. It took me years to unlearn all of the terrible habits I picked up from the R world. Do yourself a favor and start with python, if only to learn proper programming practices and techniques before diving into R.


While I agree that Python is better than R for programming etiquette, I would argue that proper programming practices and techniques are better learned in languages with static typing and proper variable scoping. Do yourself another favor and also look into C#, Swift or even Java.


If R is a reasonable tool for a given problem, C# or Swift or Java almost certainly will not be. The realistic alternatives to R are other numerical analysis packages, Julia, and Python. “The” answer for any given person or project is likely to be a function of your colleagues and peer group, your problem domain, your library needs.

One of course is allowed to learn more than one thing. Maybe play with a bondage and discipline language to expose yourself to the concepts the parent comment is advocating for.


They're not saying use Swift/C# for those problems, they're saying learn good programming practices from those languages and tools and then go do things in R/Python with that expertise under your belt.


A lot of people don’t have the luxury of doing both those things. They’re confronted with a problem and need to solve it, and solving it requires choosing and learning how to use a tool. If you have plenty of free time, choosing C#, Swift, and Java seem like odd choices for a pedagogic programming language. For learning about type safety, spending a couple weeks playing with SML or Haskell would be a good idea, though they’re both functional.

As a student I constantly complained that we were being taught these useless languages. As a grownup I realize that while some of the Comp Sci faculty may’ve been out of touch, their goal was not teaching us commercially viable skills. They were endeavoring to teach us how to think. Once you know how to think you can express those thoughts in nearly any language, no matter how hostile to those thoughts it may be.

But maybe you just want to get things done, and if that’s so, the answer for data problems is basically one or more of R, Python, Julia, etc.


As a programming language, I don't love R. I didn't get it till I took a biostatistics class. (We could use Stata/Excel or R for the class.) It really shines analyzing data. It's loved by statisticians and has some programmable attributes too.

Biologists like it for single cell analysis. They use Seurat, save the data as an object, and load it up / pass it around for analysis. It's actually kinda neat.

R's ggplot2 library is top tier in making graphs.

RStudio makes it very accessible.


>How popular is R in general?

Very popular. To the point of even having quite a lot of Microsoft support, lots of books, etc.


I love using R for exploratory work. Hadley Wickham's TidyVerse of packages make everything so ergonomic to use.


Lots of scientists like it, psychologists and statisticians and such.


In finance and fintech it's pretty standard.


no self respecting person who calls themselves a data scientist, ai researcher or ml engineer would touch R.. it's a toy for making pretty plots and fitting traditional stats models to small data.. it is not a proper programming language but a horrible old scripting whatever that was unfortunately saved by Hadley and his persistence on creating an ecosystem around it..


So I guess the authors of the Elements of Statistical Learning aren't "real" researchers then?

For reference, the authors of that book (the best book about ML in general) were all involved in the development of S and R.


clicked on link expecting R shitposting. was not disappointed.


I reckon they'd finally have to get R working natively on the new chip. I don't foresee Apple offering the fat binary support in the long term. It's probably only an intermediate solution for the transitional period. Also, does it mean the native version of R will finally work on the iPad? I know Apple doesn't allow compilers, but there are a few examples like Pythonista and Apple's own Swift Playgrounds. It'd be cool to get RStudio on the iPad.


Just to be clear, PPC-Intel fat/universal binaries are still supported even on Big Sur, the PPC portion is just ignored. I don't expect Intel-Arm binaries to go away any time soon.

I believe what you're really thinking of is Rosetta though. That, indeed, is sadly unlikely to be around forever. We have history as an indication of that.


Yes, you're right. I meant Rosetta2. It's good to have native binaries nevertheless.


When Apple transitioned from PowerPC to Intel the fat binary support (Rosetta) lasted 3 OS updates or about 3 years. Definitely won't be a super long term thing, but there's plenty of time I guess.


FWIW, RISC-V explicitly doesn't support NaN payload propagation so R will have a problem there as well.


Question: Would using R inside Docker on one of these Macs work somewhat well?

Previous benchmarks[0] show that the overhead on Intel Macbooks for the Docker Linux VM is quite low for scientific computing.

Would the x86 emulation hurt performance substantially or is there some other issue with this approach?

[0]: https://lemire.me/blog/2020/06/19/computational-overhead-due...


Why would you run R on a Mac in Docker? Docker isn’t an emulator. You’re still going to need ARM code.


Not that this wasn't a known caveat of Docker, but I think a lot of people are going to realize this in the next year or two.


I would imagine that Docker for Mac will likely get native support for ARM macOS soon enough, in which case there'd be no x86 emulation involved and you could run the ARM Linux version of R in a container just fine.

My understanding is Rosetta 2 does not support x86 virtualization.


Does anybody know what the status is of Accelerate [1]? Is it implemented for Apple Silicon? Is it optimized for it? To me it seems very few people use this framework.

[1] https://developer.apple.com/documentation/accelerate


From WWDC 2020, it has been "highly tuned" for Apple Silicon:

https://developer.apple.com/videos/play/wwdc2020/10686/?time...

Makes sense, since Accelerate has been available on iOS for many years.


Similarly, Matlab is not initially available natively for Apple Silicon either; they are preparing an update to let Matlab run under Rosetta 2 until the development cycle of the native version completes.


Hi everyone, did any of you try out R or SPSS on a new M1 MacBook? Do either of these work fine under Rosetta 2? I suppose neither has a native ARM version yet.

In addition, did anyone try CorelDraw as well?

I am asking these questions because I think a lot of us working in data science have second thoughts about moving to ARM, at least for the next year or so....


Apple is obviously a contrast to the usual state of affairs, where one of the first signs of a new CPU is support in the GNU toolchain.


Is this a new cpu in that sense? I thought this was ARM64?


Yes, though I don't know what version. Maybe I should have said new system, but it's a new micro-architecture, as I understand it, with an unsupported ABI.


Is there a reason R can't use the BLAS/LAPACK implementation that comes with macOS in the Accelerate framework?


It should run fine under Rosetta; if anyone encounters issues, please submit bug reports.


An inability to use R Studio would be a deal breaker for me.


There's always R Cloud, accessible from a browser.


It sounds like R's design decision to use a non-standard NaN value to represent NA is an obscenely bad one. Wasn't it obvious that this would become a problem someday?


It's not a "non-standard NaN". It's just a particular one, out of many possible quiet NaN values. If the Apple silicon isn't propagating the payload of the input NaN value to output, that's a violation of IEEE 754.

(IIUC, that is. It may be something like a "should" not a "must".)


It is a "should", not a "shall".

Quoth the standard (emphasis mine):

> For an operation with quiet NaN inputs, other than maximum and minimum operations, if a floating-point result is to be delivered the result shall be a quiet NaN which should be one of the input NaNs.


The article contradicts your assertion. Did you read it?


Wow, a little hostile here?

My assertion, that R's NaN is not "non-standard", seems upheld by the article. It's a quiet NaN with a payload, which is well-defined by the IEEE 754 standard.

As other posters pointed out, it's relying on a "should" behavior from the spec, which is risky but common. It sounds like disabling the "RunFast" mode cleared up their issues, which seems quite far from it being an "obscenely bad" design decision.

It's not terribly unusual to require IEEE 754 compliance in numerical code, like the usual options for avoiding -ffast-math-style stuff.
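
You can even inspect the payload from R itself. A sketch: on a little-endian machine the bytes below decode to the bit pattern 0x7FF00000000007A2, i.e. a NaN whose payload is 1954, and the propagation check assumes the hardware keeps that payload (which is exactly what the article is about):

    writeBin(NA_real_, raw())
    # [1] a2 07 00 00 00 00 f0 7f   (payload 0x7A2 == 1954)
    x <- NA_real_ + 1
    is.na(x)    # TRUE either way, since NaNs also count as missing
    is.nan(x)   # FALSE if the payload survived: x is still NA, not plain NaN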


Fair enough. My snide question was uncalled for. Sorry. Thanks for the additional info.


Does R work on Amazon's ARM chip (Graviton)?


Yes, if it's available for the OS you run, like EL or Debian. It's aarch64.


Installing R on a MacBook, even the older 2019 ones, was nothing short of a fucking nightmare.

Ended up installing on a Vagrant machine instead.


I think maybe once in the past 6yrs I’ve had an issue with `brew install R` and I’m a power R user (upgrade regularly).

How were you attempting to install? Build from source?


Installing R through conda is a PITA esp. for packages that aren’t in conda-forge yet, I’ll give you that.

Installing through Homebrew or using the R project builds is very smooth in my experience.


I've had few problems with R on my 2020 13" MacBook, but several of my coworkers have struggled with R on theirs. Some of them are very new to programming and likely get stumped by what I would consider "simple" bugs.


Installing even common packages like data.table requires mucking around with R's Makevars. There's no common set of variables that "just works", since different packages need different compilers to install.
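
For what it's worth, you can generate the file from R itself. A sketch of one common pattern for OpenMP-enabled builds on macOS; the Homebrew LLVM paths are illustrative, not canonical, so adjust them for your setup:

    # point R at a compiler that ships OpenMP support (Apple's clang doesn't)
    dir.create("~/.R", showWarnings = FALSE)
    writeLines(c(
      "CC = /usr/local/opt/llvm/bin/clang",
      "CXX = /usr/local/opt/llvm/bin/clang++",
      "LDFLAGS = -L/usr/local/opt/llvm/lib"
    ), "~/.R/Makevars")
    install.packages("data.table")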


This is only true if you insist on installing with OpenMP support.

I can think of maybe 3-5 packages, most relatively low-use, that have installation intricacies.


Will Apple even allow compilers that aren't Apple's own? They're not allowed on any other Apple silicon devices.


As long as they support LLVM, shouldn't it be mostly painless?


Are you thinking of JITs? And yes those are allowed.


The question is: why, in 2020, does R still use a Fortran compiler?


Tons of software in the scientific/numerical world still use Fortran. NumPy and SciPy, for example, make extremely heavy use of it, as would many things that rely on BLAS/LAPACK.

It's not just legacy code, either... Fortran is still very active in its own little niche in the numerical world.


Yep, I think the issue with R is that they're using a customised version of BLAS/LAPACK – Python has been running these things on Raspberry Pis for ages now, I suspect using a more standard implementation.


No, the issues are different, please read the article. R also runs fine on ARM64 Linux. But macOS is not Linux; as mentioned in the article, it has a different ABI and no free Fortran 90 compiler is available yet.

The other issue is that R distinguishes between NA values and NaN values (NumPy doesn't), which are propagated differently on ARM64.
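
To make the distinction concrete (plain R, nothing platform-specific):

    is.na(NA_real_)    # TRUE
    is.nan(NA_real_)   # FALSE -- NA is "missing", not a computational NaN
    is.na(NaN)         # TRUE  -- NaN still counts as missing
    is.nan(NaN)        # TRUE
    # both are stored as IEEE NaNs; R tells them apart by the NaN payload,
    # which is why hardware payload propagation matters here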


Based on the article I don't think those are the problem. I think the new Apple silicon is distinct enough that it needs a bit of porting effort to get a Fortran compiler running, along with the issue of quirks in handling NaN payloads and some other (seemingly rather minor) differences.


Fortran is still a foundation of many important libraries for numeric computing. A fairly large number of implementations of BLAS are written in Fortran (including the reference implementation), LAPACK is written in Fortran.

One of the first domains tackled by programming was efficient implementation of common linear algebra computations. Fortran was the original language of choice for many of those projects. When you care about absolutely optimal performance for these computations, you're not going to mess with finely tuned code that has been slowly tweaked and improved for over 50 years.


Fortran is dominant in HPC (maybe not dominant, but there is a lot of HPC software written in Fortran). R uses some performance-oriented libraries which are most likely implemented in Fortran.


Fortran is the standard implementation language for scientific computing in 2020. Try compiling NumPy and SciPy from source sometime.


This is answered in the article. Chunks of R are written in Fortran 90, which can’t be converted to C easily right now.

You might ask why it’s written in Fortran at all. Probably has something to do with its history coming out of the S language at Bell labs in the 70s and 80s.


If I recall correctly, C and C++ allow some types of pointer aliasing that Fortran forbids. If you're reading from one buffer and writing to another, those buffers can overlap in C or C++, but can't in Fortran, so a Fortran compiler is allowed more leeway in how instructions are ordered (and maybe elided? Not sure). In compute-intensive workloads, every little bit helps.


It's also worth noting that although modern C/C++ has the `restrict` keyword, compilers for those languages are generally worse at actually using that information. For example, there's been a long-running series of LLVM bugs that has (several times) required Rust to stop emitting that metadata, because LLVM would actually miscompile fairly simple vector operations that used `restrict`[1]. I'm hopeful that Flang (the Fortran LLVM compiler) will shake most of those out, since there's a large body of Fortran code that relies on good aliasing optimizations.

[1]: https://github.com/rust-lang/rust/issues/54878


Just C. The various “restrict equivalents” in C++ are non-standard.


Correct. Back in the day, you could run Fortran 77 code through f2c and then compile with gcc using the assume-no-aliasing option, and you would get roughly the same performance as if you had compiled with f77.


Fortran is crazy fast and the standard language in many scientific domains where performance matters.


> The question is why in 2020 R still uses a Fortran compiler?

You will find that almost any scientific codebase of any size includes or relies on at least some Fortran code.


I'm sure someone will come along with a Rust rewrite any day now.


I admire your sense of humor.


Fortran is berserkly fast.


If it works well, why change it?


Because it's R; it doesn't even have 64-bit integers yet.


It does with the bit64 package, but I agree, I wish it were directly supported.
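
A quick sketch of what bit64 buys you; doubles stop being exact past 2^53:

    library(bit64)
    x <- as.integer64("9007199254740993")  # 2^53 + 1
    x + 1L                                 # exact integer64 arithmetic
    x == as.integer64(as.double(x))        # FALSE: the round trip through
                                           # double drops the low bit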



