Ask HN: How to design an array lang (ex: APL or J) LLVM compiler?
32 points by throwaway7645 on July 1, 2017 | hide | past | favorite | 19 comments
Are the standard compiler books all still relevant here? Anything else one should know?



I'm on a similar path, but with a relational language. However, I have found it has a lot in common with arrays; in fact, I think it makes sense, as in kdb+, to model relations as columnar tables.

You can read about J here:

https://sblom.github.io/openj-core/ioj.htm

(From my long list of resources: https://gist.githubusercontent.com/mamcx/e1743571b9a1ea163a7...)

The main gist is to understand that you can generalize arrays and scalars into one single structure (aka an array!)

A scalar is an array of 1. Other lists are of N. You tag the structure with the type, like:

   [Type; N, N2, N3...]
and then you can do things like:

   for i in 1:
because it is totally legal here!

I extend the idea to say that a relation is a multi-array (aka columns), an array is a relation of 1 column, and a scalar is a relation of 1 column and 1 value.

This means it is potentially easy to exploit SIMD and things like that, but I haven't explored that yet.
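A minimal sketch of that unification in Python (the names `Arr`, `scalar`, and `vector` are mine, not from any real system):

```python
from dataclasses import dataclass

@dataclass
class Arr:
    """One structure for scalars and arrays: [Type; N1, N2, ...]."""
    typ: type          # element type tag
    shape: tuple       # (1,) for a scalar, (N,) for a vector, ...
    data: list         # flat storage

def scalar(x):
    # A scalar is just an array of 1.
    return Arr(type(x), (1,), [x])

def vector(xs):
    return Arr(type(xs[0]), (len(xs),), list(xs))

def add(a, b):
    # Length-1 arrays extend (broadcast) over longer ones,
    # so a single loop covers scalar+vector and vector+vector.
    n = max(a.shape[0], b.shape[0])
    return Arr(a.typ, (n,), [a.data[i % a.shape[0]] + b.data[i % b.shape[0]]
                             for i in range(n)])

print(add(scalar(10), vector([1, 2, 3])).data)  # [11, 12, 13]
```

Because a scalar is just a length-1 array, the same `add` loop handles both cases with no special-casing, which is the point of the unification.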


Thanks for the help! Does your project have a public presence?


Not yet.

If you're interested, we can share ideas.

My site with contact details: http://elmalabarista.com


The standard books (e.g. Engineering a Compiler, or Appel's stuff) would be fine.

I've never implemented an array language myself, but one way of doing this while minimizing cognitive overhead would be to lower it to more C-like IRs (similar to how Haskell is compiled by GHC). The advantage here would be that by the time the AST has been lowered into these IRs, translating directly to LLVM IR would be mentally trivial. Whether this is worth doing or not, I have no idea.
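To illustrate that lowering step (this is a toy sketch, not GHC's actual pipeline), here is an expression AST flattened into a C-like three-address IR, from which emitting LLVM IR would be near-mechanical:

```python
import itertools

def lower(node, code, fresh):
    """Flatten a nested expression into three-address form.

    Returns the name of the temporary holding the node's value;
    appends one IR instruction per operation to `code`.
    """
    if isinstance(node, (int, float)):
        return str(node)                 # literals need no temporary
    op, lhs, rhs = node                  # e.g. ('+', ('*', 2, 3), 4)
    a = lower(lhs, code, fresh)
    b = lower(rhs, code, fresh)
    t = f"t{next(fresh)}"                # fresh SSA-style temporary
    code.append(f"{t} = {a} {op} {b}")   # one op per line, like LLVM IR
    return t

code = []
lower(('+', ('*', 2, 3), 4), code, itertools.count())
print(code)  # ['t0 = 2 * 3', 't1 = t0 + 4']
```

Each line of the resulting IR maps almost one-to-one onto an LLVM instruction (`mul`, `add`, ...), which is why this intermediate step makes the final translation mentally trivial.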

I'd have a look at some LLVM APL compilers and see how they do it.


I don't think there are any LLVM APL compilers or I would definitely look. I think Dyalog, J, and K all use regular C as the VM. Arthur Whitney is said to have written the first K compiler in only a few (like 3-5) pages of C. Do you think it'd be easier to start with a custom C VM over LLVM? I have a little C experience and no LLVM, but I see it being used for a lot these days (Julia, D, Rust, Swift, Crystal...etc).


Are you implying writing one's own JIT? This would involve either writing a new LLVM backend or implementing the same optimizations. Easier to start with, yes, but that doesn't really justify the design decision. (In short: easier != better.)

You could compile to C, then use Clang or GCC to compile to a final executable. This should, if possible, be avoided, as it limits what you can do (C is not a good IR for a compiler). I suspect this is what you meant by "use regular C as the VM".

As these things go, LLVM is fairly sane/clean. The point of using LLVM would be to produce an optimized executable, rather than bytecode to be run on a VM.

Also, LLVM is C++ (complex C++, in fact). If you don't already know, I thoroughly recommend reading about how optimizing compilers work (it's either very useful or fascinating, depending on whether you're the kind of person who finds this stuff interesting). Using LLVM is, after getting over the C++ build times, actually quite simple (take a look at http://llvm.org/docs/tutorial/).


Thanks for the explanations and advice. APL isn't as fast as I'd like, so I thought LLVM would be nice, but then you lose REPL functionality, I believe, which is vital for this kind of language. Julia JITs every expression to LLVM (I think), which is how it has a REPL, so that seems like a good way to get said REPL. So maybe the Julia route would be smart. It appears to be taking off in the areas where APL, J, MATLAB, NumPy, Octave, and Mathematica have typically reigned supreme.


If the compiler is fast enough, then one could make a REPL out of it (by modifying it slightly). LLVM also has interpreters. The trick would be finding a consistent way of representing high level state in the REPL (You can't just lob bits of LLVM IR together), unless you can manage without (of course).


I think APL/J/K are typically implemented as interpreters rather than compilers, because the languages contain some ambiguity that you might not always be able to resolve statically. See for example: http://t3x.org/klong/ambiguity.html

The J Incunabulum provides a glimpse of a toy interpreter and of Arthur Whitney's style: http://code.jsoftware.com/wiki/Essays/Incunabulum

I've found it quite interesting in terms of data structures, parsing, evaluation, and how the primitives are implemented. I reformatted it into a more readable style (more readable to me, anyway). Memory management is through reference counting, I think (maybe not in this example, but I vaguely remember reading that somewhere).

Interpreters for array languages are quite efficient because the primitives operate on arrays and you tend to not have explicit loops in the scripts themselves. The loops are "hidden" inside the optimized primitives in the interpreter. You can add further optimizations by recognizing certain idioms like +/ for "sum" and providing a special optimized implementation. Obviously that doesn't work for user-defined functions. I don't know if K etc can optimize those as well.
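A toy sketch of both points, assuming a made-up `eval_node` evaluator (not any real APL/J/K internals): the elementwise loop lives inside one primitive, and the `+/` idiom is special-cased to a builtin sum rather than evaluated as a reduction step by step:

```python
def eval_node(node, env):
    """Tiny expression evaluator for a made-up array language."""
    if isinstance(node, str):
        return env[node]                    # variable lookup
    op, *args = node
    if op == '+/':                          # idiom recognition: +/ means "sum",
        return sum(eval_node(args[0], env)) # dispatch to an optimized builtin
    if op == '+':                           # elementwise add: the loop is
        a, b = (eval_node(x, env) for x in args)  # "hidden" in the primitive
        return [i + j for i, j in zip(a, b)]
    raise ValueError(f"unknown op {op!r}")

env = {'x': [1, 2, 3], 'y': [10, 20, 30]}
print(eval_node(('+/', ('+', 'x', 'y')), env))  # 66
```

The script itself contains no explicit loop; each primitive does one pass over whole arrays, which is why interpreter dispatch overhead is amortized so well in these languages.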

Another thing to consider is how you represent intermediate values. If you're operating on large arrays, you don't want to create a lot of large intermediate arrays that you'll throw away immediately. You can probably detect situations where you can modify arrays in-place.
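One common way to expose that choice is the NumPy-style `out=` convention; this sketch is illustrative, not any real interpreter's code:

```python
def add(a, b, out=None):
    """Elementwise add; pass out=a to overwrite a in place instead of
    allocating a throwaway intermediate array."""
    if out is None:
        out = [0] * len(a)      # fresh result buffer
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

x = [1, 2, 3]
add(x, [10, 20, 30], out=x)     # x is updated in place: [11, 22, 33]
```

An interpreter can make this decision automatically: if the reference count of an operand shows it is about to die, its buffer can be reused as the output, avoiding the allocation entirely.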

This paper provides a different solution: http://dl.acm.org/citation.cfm?id=512761 Unfortunately I don't know if there's a free copy available anywhere, so I haven't actually read the paper. I'd be interested to know if there's a free legal copy available somewhere.

It might be similar to stream fusion for lists in Haskell: http://fun.cs.tufts.edu/stream-fusion.pdf Or how you might optimize SQL/relational queries for databases by pushing down projections through selects and joins to reduce the amount of data that you need to retrieve and work on.

These approaches might also allow you to optimize user-defined functions.

Finally, Timothy Budd implemented a compiler for a slightly modified version of APL and documented the result: http://www.springer.com/de/book/9780387966434 Again, I don't know if there's a free legal PDF anywhere, so I haven't read the book. I'd be interested in a free legal copy of this book as well.


I knew there had to be some kind of catch to not see some form of successful compiler in the wild. Your comment was quite helpful!


I once read the first J interpreter (I think it was J).

It's in a VERY alien dialect of C, but quite readable with some effort:

http://code.jsoftware.com/wiki/Essays/Incunabulum


Even though you'll be outputting LLVM as the intermediate representation, I think you'll still have to think a lot about the design of the internal structures and the operation of the "virtual machine" so to speak (although there isn't one in the traditional sense). You'll still need to think about how functions interact, what anonymous functions/lambdas boil down to, etc.
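For the "what lambdas boil down to" part, the standard answer is closure conversion: a nested function becomes a flat code pointer plus an explicit environment record. A minimal sketch (the names here are mine, purely illustrative):

```python
# Source-level closure: the lambda captures n from its enclosing scope.
def make_adder_src(n):
    return lambda x: x + n

# After closure conversion: free variables arrive through an explicit
# environment record, so the code itself is a flat top-level function.
def adder_code(env, x):
    return x + env['n']

def make_adder_lowered(n):
    return (adder_code, {'n': n})   # closure = (code pointer, environment)

fn, env = make_adder_lowered(5)
print(fn(env, 3))                   # 8
```

In an LLVM-targeted compiler, the environment record becomes a heap-allocated struct and the code pointer a plain function, which is exactly the kind of design decision that has to be settled before IR emission.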

To that end, it may be interesting to study the source of J, Klong, Kona, etc. There are a lot of older papers about representation and performance in APL as well.


Good idea, I started reading Hui's book on J's internals before. Maybe it's time to take another look.


Why LLVM?

Aaron's Co-dfns[1] is an APL compiler that compiles to C++, and there are some advantages to doing this.

[1]: https://github.com/arcfide/Co-dfns


I think Hsu's Co-dfns compiler compiles to GPU code for running very high performance programs if you use the parallel Co-dfns paradigm. I'm not sure if you have to use GPUs or if it gives you a native binary you can use for other purposes. The other problem is you have to have Dyalog APL, which ain't cheap. I think a new array language that has a REPL that JITs to LLVM (good for prototyping, debugging, and testing) and AOT compilation to LLVM for scripts would be amazing for sharing code and deployment.


> I think Hsu's Co-dfn compiler compiles to GPU ... [I'm not sure] if it gives you a native binary you can use for other purposes.

It compiles to C++.

> The other problem is you have to have Dyalog APL, which ain't cheap.

Good software isn't.

However, you can download Dyalog APL for free and use it for noncommercial purposes, including trying it out and personally learning how to build an APL compiler.

I think this is pretty clear from the website.

> I think a new array language that has a REPL that JITs to LLVM (good for prototyping, debugging, and testing) and AOT compilation to LLVM for scripts would be amazing for sharing code and deployment.

Why do you think that?

I don't see how LLVM helps with any of that. The hard part of writing an APL compiler is parsing and analysis, and that means finding an algebraic morphism between APL and CPUs (or GPUs) that you can do meaningful things with. It isn't making executables.



Thank you! Will do!


I should have mentioned the Programming Language Zoo [1]. Beware of the LLVM bog of dead compiler writers.

http://plzoo.andrej.com/



