Show HN: Prometeo – a Python-to-C transpiler for high-performance computing (github.com/zanellia)
166 points by zanellia on Nov 17, 2021 | 140 comments



Hi all,

prometeo is an experimental modeling tool for embedded high-performance computing. prometeo provides a domain specific language (DSL) based on a subset of the Python language that allows one to conveniently write scientific computing programs in a high-level language (Python itself) that can be transpiled to high-performance self-contained C code easily deployable on embedded devices.
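For a feel of the DSL, here is a small example in the style of the ones in the README (take the exact names, like pmat and dims, as illustrative of the typed-Python flavor rather than as authoritative API documentation):

    from prometeo import *

    n: dims = 10

    def main() -> int:
        # statically typed matrices, backed by BLASFEO in the generated C code
        A: pmat = pmat(n, n)
        B: pmat = pmat(n, n)
        for i in range(n):
            for j in range(n):
                A[i, j] = 1.0
                B[i, j] = 2.0
        C: pmat = A * B   # transpiled to a high-performance matrix-matrix multiply
        pmat_print(C)
        return 0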

The package is still rather experimental, but I hope this concept could help make the development of software for high-performance computing (especially for embedded applications) a little easier.

What do you think of it? Looking forward to receiving comments/suggestions/criticism :)


Nice project.

Small comment related to the benchmarks:

- Julia has a newer Riccati solver (in the MatrixEquations.jl package):

https://github.com/andreasvarga/MatrixEquations.jl/blob/mast...

https://github.com/andreasvarga/MatrixEquations.jl


The benchmark in prometeo is a discrete-time Riccati recursion (as opposed to a continuous-time Riccati equation) algorithm. And it is the exact same algorithm implemented in all languages, which makes the comparison fairer since the only variable is the implementation itself.
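For reference, a minimal NumPy sketch of a discrete-time Riccati recursion of this kind (generic LQR notation, not the exact benchmark code):

    import numpy as np

    def riccati_recursion(A, B, Q, R, N):
        P = Q.copy()            # terminal cost-to-go
        for _ in range(N):      # sweep backwards in time
            K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
            P = Q + A.T @ P @ A - A.T @ P @ B @ K
        return P

    n, m = 4, 2
    rng = np.random.default_rng(0)
    A = 0.5 * rng.standard_normal((n, n))
    B = rng.standard_normal((n, m))
    print(riccati_recursion(A, B, np.eye(n), np.eye(m), 100))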


Kneejerk reaction as an enthusiastic Cython developer: "bah, another crappy (subset of Python)-to-C compiler."

After reading: this is really cool. If I understand this, I think you should be able to beat Cython without breaking a sweat. I'm quite excited to use this.


hahaha thanks!


Very interesting! What are the similarities/differences compared to RPython (as used by PyPy)?

https://rpython.readthedocs.io/en/latest/rpython.html


Looks like RPython is a bigger language that doesn’t target an embedded use case without a Python runtime. Though I may be mistaken - I am not super familiar with RPython.


RPython programs can be compiled to a standalone executable without a Python runtime - it's what PyPy is written in, for example.


To get ahead of the obvious question I had and I’m sure others will, this is from the README:

> Cython is a programming language whose goal is to facilitate writing C extensions for the Python language. In particular, it can translate (optionally) statically typed Python-like code into C code that relies on CPython. Similarly to the considerations made for Nuitka, this makes it a powerful tool whenever it is possible to rely on libpython (and when its overhead is negligible, i.e., when dealing with sufficiently large scale computations), but not in the context of interest here.

I.e., it’s a Python-like DSL that does not depend on the Python runtime.

Thanks for sharing OP, this is pretty cool.


Right, that's indeed the main reason I could not simply use Cython or Nuitka (or Julia?). The Python runtime library will do all kinds of non-real-time/embedded-friendly operations in the background, such as garbage collection, memory allocation/deallocation, and so on.


This is awesome! The direction of using a subset of Python, while leveraging its user base and static typing, to generate code in a different language is very legit IMO.

I took a cursory look at:

https://github.com/zanellia/prometeo/blob/master/prometeo/cg...

It seems quite similar in spirit to

https://github.com/adsharma/py2many/blob/main/pyrs/transpile...

I haven't been spending much time on py2many the last few months (started a new job). Let me know if any of it sounds useful - especially the ability to transpile to 7-8 languages including Julia, C++ and Rust.


Pretty cool! How do you manage multiple target languages with a single AST parser? Do you use an intermediate AST?


No intermediate AST. To understand the various stages of transpilation and separation of language specific and independent rewriters, this file is a good starting point:

https://github.com/adsharma/py2many/blob/main/py2many/cli.py...


That's a good question to put in the discussion tab on their GitHub repo, that way others that are interested can find it too :)


I'm curious what an example use case is for scientific computing on an embedded device. Is this for real-time analysis on a data logger or something?

Many of us think of clusters as high-performance scientific computing, which are about as far from embedded as it gets.

Please note that I am not being snarky, just curious!


Thanks for the question! My background is in numerical optimization for optimal control. Projects like this https://github.com/acados/acados motivated the development of prometeo. It's mostly about solving optimization problems as fast as possible to make optimal decisions in real-time.


Have you considered using Nim? It's a great language that has some similarities to python and compiles to C code. It is a very convenient and powerful language to use, coming from Python.


I did look into Nim, but, given Python's maturity/popularity, I decided to stick with it as host language for the DSL.


Well, you give yourself a whole lot of extra work by attempting to create a Python-to-C transpiler; you're basically creating an even less mature and less popular language ecosystem than Nim has. You could instead use Nim, get most of the benefits (statically typed, compiled to C) out of the box, and enjoy its growing ecosystem. It seems that Prometeo is currently more limited than Nim.

Maybe I'm not understanding your use case well enough and maybe your approach is actually a locally optimal solution, and I honestly greatly respect your effort and want to see your project succeed, cuz I love Python too. I even wanted to make a faster python myself at some point. Just, the more I learn about Nim, the more I appreciate its design decisions and think to myself: this is the faster version of python I'm looking for. Now, we just need the large package ecosystem that python has, but I'm willing to both wait and participate in making it come about.

This presentation about writing keyboard firmware in Nim may be helpful, if you're willing to give Nim some more consideration:

https://youtu.be/dcHEhO4J29U

There is also another talk about embedded programming in Nim from the same conference, here:

https://youtu.be/rlZ4ALGAU1M


How about zig? From what I've heard, zig is supposed to be a "better" drop in replacement for C.


Looks like this could be pretty nice.

I noticed your disclaimer at the bottom of the linked page [0], and wanted to get an idea of how far you were looking to take this. Will it go beyond maths into normal functions (string handling, etc.)? Do you eventually plan on supporting most of Python - for instance, do you think I could write a web server using your tool in the future?

[0] - "Disclaimer: prometeo is still at a very preliminary stage and only few linear algebra operations and Python constructs are supported for the time being."


Unfortunately, I think that writing a transpiler for general Python programs might be rather difficult without resorting to approaches used, e.g., in Cython/Nuitka. Among other things, computing the worst-case heap usage could be quite cumbersome/computationally heavy for a general program without "constraints". I'd be happy to hear what others think about the topic though.


Soo... it takes Python syntax and produces a C program, with no links back to Python - is that right? It uses a strict subset of Python, so that Prometeo programs are valid Python, but not necessarily the opposite. Is that fair?

Do you envisage this being a conduit for tight loop optimisation in Python? Or is it rather "you'd like a C program but can't write C good"?

And if the former, how do you compare to Nuitka and Cython? I read your README but couldn't quite make sense of it :)


> Soo... it takes Python syntax and produces a C program, with no links back to Python - is that right? It uses a strict subset of Python, so that Prometeo programs are valid Python, but not necessarily the opposite. Is that fair?

yep

> Do you envisage this being a conduit for tight loop optimisation in Python? Or is it rather "you'd like a C program but can't write C good"?

There are already plenty of options for calling high-performance libraries from Python. Now: 1) interpreting Python programs that use, e.g., NumPy can be slow; 2) compiling these programs using, e.g., Cython or Nuitka can speed up the code between calls to high-performance libraries, but the resulting code will still rely on the Python runtime library, which can be slow/unreliable in an embedded context.
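To illustrate point 1) with a quick (hypothetical) micro-benchmark: for small matrices, the per-call interpreter and dispatch overhead can dominate the actual floating-point work:

    import time
    import numpy as np

    A = np.random.rand(4, 4)
    B = np.random.rand(4, 4)

    t0 = time.perf_counter()
    for _ in range(1_000_000):
        C = A @ B  # each call pays Python-level overhead before reaching BLAS
    print(f"{time.perf_counter() - t0:.3f} s")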

Coming to the second part of the question, writing C code directly is definitely an option, but, after doing a bit of that, I realized how tedious/error-prone it is to develop/maintain/extend relatively complex code bases for embedded scientific computing (e.g. this one https://github.com/acados/acados). Or, to put it as Bjarne Stroustrup once said, "fiddling with machine addresses and memory is rather unpleasant and not very productive". The good news seemed to be that many of the code structures necessary to write that type of code are rather repetitive and can hopefully be generated automatically to some extent.

> And if the former, how do you compare to Nuitka and Cython? I read your README but couldn't quite make sense of it :)

This table (from the README) shows some computation times for Nuitka, prometeo, Python and PyPy.

CPU times in [s]:

Python 3.7 (CPython): 11.787
Nuitka:               10.039
PyPy:                  1.78
prometeo:              0.657

Other than performance, the main difference is, again, the runtime library dependency.


Right. Gotcha. So Prometeo isn't another "make Python fast again" project, but rather an orthogonal effort to write fast (C) programs, but in a high-level Python-like language. Thanks.


yep, that's right.


And Cython? (Not CPython)


Each programming language has its purpose.

C code is performant and that is a fact. Python code is not.

When building mission critical systems why don't programmers just use C itself instead of coding in another programming language and having it transpiled for them? Why introduce such tools all the time?

I am against this because the tools programmers use are becoming too bloated compared to 10-20 years ago.

Want to build an Android App? Use Java/Kotlin.

Want to build an iOS App? Use Swift.

Want to build a Web App? Use a Single JS Framework (Why millions of frameworks?)

Want to build a Windows Desktop App? Use C#/.NET, either with WinForms or WPF.

I really see tools and technologies coming up all the time to solve a problem that most of the time doesn't exist.


You've never been near a lab environment, clearly. Python is a dominant language in university labs and runs a lot more real-time systems than you think. Grad students rarely have industry experience and don't necessarily have the know-how to write C code effectively, so it's a question of resources and ecosystem. NumPy, matplotlib, pandas, scikit, TensorFlow, etc. are all huge draws for the scientific and ML communities.


> runs a lot more real-time systems than you think

soft real time systems, for sure. if it runs any hard real time systems, get out of the lab.

We could of course debate the boundary between hard and soft, but I'd rather not.


No, I mean nanosecond precision real-time systems. Exhibit A: https://github.com/m-labs/artiq

To save some reading, this uses a combination of an FPGA and a kernel (which I think is in Rust) to generate a real-time buffer that is programmed using a mix of compiled real-time Python and plain Python via an RPC system.


Python does not have nanosecond precision.


3.7 has time.time_ns()
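For reference, a quick sketch of what time.time_ns() does and doesn't give you (a minimal example, CPython on a desktop OS):

    import time

    # time.time_ns() (Python 3.7+) returns a nanosecond-resolution timestamp,
    # but scheduling is still at the mercy of the OS and the interpreter:
    t0 = time.time_ns()
    time.sleep(0.001)            # ask for 1 ms
    print(time.time_ns() - t0)   # typically well over 1_000_000 ns, with jitter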


"nanosecond precision" means that you can schedule something for a time in the future, and it will happen within 1 nsec of that time.


We are doing exactly that with Python and that system I linked.


We can talk instead about how the requirements to run slightly less simple near-hard real-time controllers are really heavy in terms of money, effort and... weight. This tool may actually help to streamline the software part, potentially being a substitute in certain cases for, e.g., MATLAB Coder or similar tools.


Right, MATLAB Coder is a very related tool.


I think many people who have at least once first prototyped a numerical algorithm in a high-level language (say Python, Julia, MATLAB?) and then implemented it in C, can relate to the experience of transitioning from error messages of the type: "dimension mismatch for XYZ" to "segmentation fault". That's in my opinion a strong motivation to build tools that can automate certain parts of the development process.

Writing C code directly is a good option, as long as your code is not too complex to develop, maintain and extend.

And, again, Python here is intended to be the host language for an embedded domain-specific language that gets compiled into C. It does not need to be efficient; it needs to be expressive and easy to analyse and transpile.


Note that the whole point of Julia is that it saves you the rewrite. There is Julia code running on top supercomputers that gives speed competitive to C/C++/Fortran. You will have to put in some work to get Julia code to be that fast, but it is usually dramatically easier than a rewrite in a different language.


It's not so easy to deploy an algorithm written in Julia on an embedded platform though, is it?


Probably not :)


yes, but "speed" in 'top supercomputers' is not "speed" in 'embedded systems'.

I do think Julia potentially can crack this space, but the runtime at least historically has not been tailored for it.

It does seem like Julia has become more modular lately, especially being able to disconnect the JIT (or LLVM ORC). Hopefully you'll be able to either slim down or completely remove the runtime dependencies (a la Rust in no-std mode). Each of these is important for different use cases.


The problem that this language solves is that it automatically sorts out the memory usage for you. That isn't a problem for me; I've been programming in C for decades. But it is a problem for most python programmers who don't have a lick of C experience, but want to get C performance. It drastically lowers the barrier of entry.


For what it's worth, I have developed code for this kind of application exclusively in C for ~5 years (let's say 20% of my working time). I still think that debugging a segfault that you could have avoided is not very productive, and that motivated me to look into possible alternatives.


FWIW, I almost always use valgrind before a debugger, when tracking down segfaults. It doesn't catch everything, but 90% of the time, it gets me to the right region of code in a single run.


sure I use valgrind and gdb too - still hard to argue that a segfault is pleasant to debug though?


Good good, just wanted to advocate for my favorite tool there. But, in my experience, segfaults are usually the easiest bugs to resolve. Unlike a sign error in my math, they're impossible to miss!

That said, tooling to get rid of them entirely is not to be sneezed at :)


99% of the time a stack trace shows the culprit for a segfault straight away. No different than debugging Python.


if we are arguing that implementing a numerical algorithm in C is as easy as implementing it in Python - I would disagree. But maybe I am just wrong :)


The issue is debugging crashes, not productivity.


Yes and it will also prevent common memory management bugs that can lead to code injection.


It doesn't have to be production. Maybe it's for a research project where you just need the extra performance.

Everything has a cost. This may not be ideal but learning to do C properly as an experienced Python dev will have a time cost as well. This may just be the best way to get from A to B.

I remember when I did a one-off project with a PIC microcontroller. I only had an assembler and I spent 2 days getting nowhere.

Then I found a C compiler and I had the whole thing running in 2 hours. The compiler turned out to be much more efficient in speed as well as code size than my hand-written assembly.


> When building mission critical systems why don't programmers just use C itself instead of coding in another programming language and having it transpiled for them?

Why C and not assembler?

> Why introduce such tools all the time?

C compilers are one of those tools.


Looks like a cool project!

I can't speak much about the code itself or the aims of the projects. Personally I would recommend more informative commit messages.

I do this myself, especially working on personal stuff, but writing commit messages that succinctly explain what each commit does is a good practice and gives a serious impression.

I often find myself hacking away and periodically going back to flesh out messages using rebase.


Thanks for the suggestion. Until now it's been a lot of discussion with friends and colleagues and much less actual collaboration on code writing - I might have drifted into bad practices.


For a matrix of size 50 it beats Julia by a factor of 10, wow.

https://github.com/zanellia/prometeo/blob/master/benchmarks/...


This isn't that meaningful, since the "Julia" version is just calls to OpenBLAS/LAPACK, which are known to have relatively high overhead for small matrices. I'd be much more interested in seeing a comparison vs. LoopVectorization/StaticArrays, which are Julia libraries specifically optimized for small matrices.


Good point, it should be easy to add Julia to the Fibonacci benchmark. Here is the Python code https://github.com/zanellia/prometeo/blob/master/examples/fi...
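(Roughly, the usual naive recursive version that such micro-benchmarks use - illustrative only, see the link for the actual code:)

    def fib(n: int) -> int:
        if n <= 1:
            return n
        return fib(n - 1) + fib(n - 2)

    print(fib(30))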


Does prometeo change Python to use int64?


The Julia code has some performance flaws in the hot loop (like making a superfluous copy of a matrix before just reading it).


Seems like a Python-to-C++ transpiler would translate more of the language to like-for-like concepts more easily.


Right, for sure I would not need to re-invent the machinery to translate a class into a glorified C struct. The whole thing started with C in mind for portability reasons, but it might be a good idea to keep an eye on C++ as an option.


But some "embedded platform" toolchains do not support C++.


Have you thought of targeting WebAssembly? If you're going from Python/prometeo -> C you could always make the extra step of Python/prometeo -> C -> WASM, but I wonder if there would be an advantage to skipping the intermediate C.


Python to ASM would actually be really cool and would guarantee performance gains for small matrices, but it would require quite some implementation effort. Not sure about WASM.


Why WASM? It would be a pessimization compared to just transpiling to C if performance is the goal. WASM also is restricted to 128-bit vector instructions.


Because wasm doesn't support Python, and it might be nice to be able to write WASM in a Python-like language.


Hasn't cython been ported to wasm (iodide), or perhaps one of the "rewrite in Rust" Python impls? rustc can output wasm pretty naturally.


That would be running Python in wasm, i.e. Python (script) -> Pyodide (CPython/wasm), which is pretty heavyweight. This would be prometeo -> wasm, which I'd imagine would be fairly lightweight. If someone is willing to write in AssemblyScript, I figured that something like prometeo might be a welcome alternative. As I said, since it transpiles to C, there's nothing stopping someone from using Emscripten to then go to wasm.


One major problem is that the error messages need a lot of work. Why aren't class variables and static methods accepted? I can't know that if your code just throws an exception while iterating some dictionary.


I agree that error handling is one of the main things to be improved. The problem is that in some cases the AST walker ends up in unhandled states and prometeo throws a generic exception with a line number only. Are you looking at something in particular? With basically 0 users at the moment, this kind of feedback is quite useful.


It seems a convenient/high-level way to use highly optimized C libraries with minimal overhead, both in terms of execution time (i.e., vs. standard interpreted Python) and in terms of runtime size/complexity (see Julia).


That's correct. I'd say one of the fundamental differences between the two lies in the fact that the code generated by prometeo does not depend on a runtime library (which is somewhat fundamental for embedded applications, e.g., embedded optimization). From prometeo's README:

Finally, although it does not use Python as source language, we should mention that Julia too is just-in-time (and partially ahead-of-time) compiled into LLVM code. The emitted LLVM code relies however on the Julia runtime library such that considerations similar to the one made for Cython and Nuitka apply.


Great project, but terrible name, considering how popular Prometheus is.


fair enough :p I might change it in the future.


Nice job! Is this aimed at single-core/thread computations, or is the prometeo layer also a way to write basic parallel code in a more "user-friendly" way?


For the time being, it targets single core/thread applications only.


Do you have access to builtins and intrinsics? Are there any plans?

The single threaded thing is not an issue because you can still call the same function on each CPU and use the CPU ID to target parts of the computation, like a compute kernel function.
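A rough sketch of that pattern (hypothetical, using Python's multiprocessing in place of raw CPU IDs): run the same kernel everywhere and let each worker's id select its slice of the problem:

    from multiprocessing import Pool

    def kernel(worker_id: int, n_workers: int = 4, n: int = 1_000_000) -> int:
        # each worker handles the slice [lo, hi) of the iteration space
        lo = worker_id * n // n_workers
        hi = (worker_id + 1) * n // n_workers
        return sum(i * i for i in range(lo, hi))

    if __name__ == "__main__":
        with Pool(4) as pool:
            print(sum(pool.starmap(kernel, [(i,) for i in range(4)])))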


Intrinsics (or directly assembly) are used in BLASFEO (https://github.com/giaf/blasfeo), the linear algebra package used by prometeo. It would be cool to generate assembly directly for a few things, but that would require quite a bit of work!


Indeed, and that would make the code machine-dependent too. On top of that, BLASFEO is already used for the most computationally intensive operations (e.g. matrix-matrix operations and factorizations, involving O(n^3) flops on O(n^2) data), providing high performance and low overhead, and the other operations on vectors and matrices are usually memory-bound anyway and would see little to no benefit from directly using assembly.


On top of linear algebra, is it also possible to interface it directly to standalone, well-optimized existing QP solvers, like HPIPM (https://github.com/giaf/hpipm)? In case one is interested in implementing MPC controllers.


You are not the only one requesting this :p So, if I understand correctly, what you would like in the end is a tool to write your MPC problem more conveniently in a high-level language and still be able to deploy your code using HPIPM on an embedded device? [BTW, I happen to be the main HPIPM developer, so I'm very glad about your comment!]


Interfacing existing C code would be an important thing to be addressed. It is currently a missing feature, unfortunately.


Stand-alone is a very useful concept. I don’t like deploying Python stacks much. Wouldn’t that additionally mean you could target CL, CUDA or Sycl variants of C?


I'd say that's possible in principle - definitely not there at the moment though (and not even planned).


Cython, pypy, micropython, nuitka, shedskin, ironpython, graalpython, jython, mypyc, pyjs, skulpt, brython, activepython, stackless, transcrypt, cinder and many more I don't remember.

They're all practically useless or relegated to specific tasks. At this point you'd need to present incredible evidence that an alternative compiler can be useful. Personally I find it comical how many developers are still deluded by the promise of performant Python. I hope you achieve your goals, good luck.


The point of prometeo is not to obtain a "performant Python". Python is used merely as a host language for an embedded domain specific language. You could do the same thing with any other language with a mature library for AST analysis :)


Which makes this thread's title at least confusing.


I'd argue it's downright misleading. "Python-to-C transpiler" means Python, not "a DSL based on a subset of Python".

An accurate title would be "a DSL embedded in Python for high-performance scientific computing" or something similar.


fair enough - could not cram "embedded" into it :)


Most of the ones you list require dynamic linking and so are hard to make use of in specialized environments.

His project seems to be generating generic C code which is much easier to port to any weird platforms. In fact, it might be perfect for my use-case where dynamic linking is just extra attack surface.

I understand that the project is still in the early stages, but I will be paying close attention to it. If at some point it will be possible to write "regular" Python in it (minus most of the standard library and imports), then it could be a candidate for an edge computing platform.


Numpy, Numba and PyTorch seem to be doing ok.


Cython is also pretty successful obviously, though I don't think it quite fits in OPs list given that it's more about writing extensions than replacing your entire Python code/stack. But I do agree with OPs sentiment even as someone who writes a lot of Python.


Not sure what you mean by "eluded by a promise of performant python". Tensorflow and Pytorch do a great job thank you very much.


so much effort to match the performance of lower level languages that it would have actually been easier to use those directly :)


I'm not sure, most people aren't writing ASM these days because the compilers are good enough for most cases. Compilers are great.


I think most HPC people would disagree with this statement. State-of-the-art HPC code is still written in ASM (see e.g., https://github.com/xianyi/OpenBLAS) [that's what Intel is doing too]


ASM makes sense when the time spent in a specific routine exceeds the time it takes to write the ASM, which makes a lot of sense for BLAS, less so for other HPC projects that are more speculative or less fundamental. CVODES, for instance, doesn't need to be written in ASM, and I think Julia makes a strong case that it could have been written in Julia.


I don't think they would. I think they realize that state-of-the-art HPC code is a small fraction of all the code written. I doubt that these people write ASM instead of Python or JS or C or whatever when doing simple scripts.


That ASM code is however not necessarily constructed manually. You'd think for high performance code with limited scope, a superoptimizer would be used.


Not sure what a "superoptimizer" would look like in this context. For a reference, I know for sure that this https://github.com/giaf/blasfeo (which beats Intel MKL) was coded entirely by hand.


There is more and more effort in the automatic development of high-performance linear algebra kernels. But based on my experience, it would certainly be a big challenge to have a tool able to exploit the subtle differences in the assembly languages of different architectures, if the aim is to match or even exceed an expert-crafted assembly kernel.

Anyway, that's surely a very promising active research direction.


Good point. And you don't have to go that low. Maybe go use Object Pascal, Nim, or Vlang. I know... the libraries. But a lot of them are bindings of C libraries. So, you can create bindings in other languages too or use Python from those languages. There are various options.


I would disagree on "easier" :) Ever spent half a day debugging a segfault?


SpaCy is pretty incredible evidence.


Regarding all the questions about Julia:

There's ongoing work to reduce runtime dependencies of Julia (for example in 1.8, you can strip out the compiler and metadata), but then it's only approaching Go/Swift and other static languages with runtimes.

Generating standalone runtime free LLVM is another path, that is actually already pretty mature as it's what is being done for the GPU stack.

Someone just has to retarget that to cpu LLVM, and there's a start here: https://github.com/tshort/StaticCompiler.jl/issues/43


That's quite cool. Maybe the whole thing can be rewritten in Julia too at some point. I just know too little about Julia to judge.


Well, IMO it can definitely be rewritten in Julia, and to an easier degree than Python, since Julia allows hooking into the compiler pipeline in many areas of the stack. It's lispy and built from the ground up for codegen, with libraries like Metatheory.jl (https://github.com/JuliaSymbolics/Metatheory.jl) that provide high-level pattern matching with e-graphs. The question is whether it's worth your time to learn Julia to do so.

You could also do it at the LLVM level: https://github.com/JuliaComputingOSS/llvm-cbe

One cool use case is https://github.com/JuliaLinearAlgebra/Octavian.jl, which relies on LoopVectorization.jl to do transforms on Julia ASTs beyond what LLVM does. Because of that, Octavian.jl, a pure Julia linalg library, beats OpenBLAS on many benchmarks.


Octavian isn't just better than OpenBlas. It beats MKL by about a factor of 2 up to 100x100, and is roughly tied with MKL up to around 3000x3000 (OpenBlas is 2-3x slower up to around 500x500)


Thanks for mentioning Octavian, I didn't know about this interesting project. Are you referring to single- or multi-threaded applications?

In the context of embedded optimal control applications (i.e. the original framework motivating the Prometeo development), applications are typically single-threaded, and in this case for matrices of size 100x100 MKL is _very_ close to peak performance already, there is no way something can be 2x faster without breaking the laws of physics. [Trust that I know what I'm saying here, as the main BLASFEO developer, I check MKL performance often enough ;) ] Just for reference, MKL has special flags MKL_DIRECT_CALL and MKL_DIRECT_CALL_SEQ which enable extra optimizations improving performance for small matrices (e.g. turn off most input arguments checks), these should definitely be used in a fair comparison.

On top of that, linear algebra is much more than matrix-matrix multiplication, and e.g. in embedded optimal control the performance of factorization routines plays a key role.


Octavian is absolutely early in its development (currently I think it only supports matmul, including all the transposed versions). https://raw.githubusercontent.com/JuliaLinearAlgebra/Octavia... is the benchmark. It uses automatic threading from both MKL and Octavian (although for these sizes, it will only use a few threads). With only one thread, MKL is much closer and is only behind by about 20% at n=25 and roughly equal by n=60. I haven't done timings with MKL_DIRECT_CALL or MKL_DIRECT_CALL_SEQ, but I think that's unfair since Octavian has the same overhead of figuring out how many threads to use.


Looking forward to see Octavian development then, it looks exciting! Dealing with triangular matrices and data dependencies in other linear algebra routines such as triangular solves and factorization will surely be an interesting benchmark for the approach, since such difficulties do not arise in matrix-matrix multiplication. Anyway, that's surely a good starting point for Octavian.

Just one clarification: MKL_DIRECT_CALL or MKL_DIRECT_CALL_SEQ is not about figuring out how many threads to use, it's about turning off checks on input arguments sizes, e.g. if m>lda, or negative lda or m or stuff like that. All these pedantic checks (which comply with the reference BLAS implementation in Netlib) are often times not done anyway in experimental linear algebra packages that do not aim at providing a compliant implementation of the standard Fortran BLAS.


How does this compare with Nim and MicroPython?


I thought about using Nim as a host language for the DSL for a while, but then decided to rely on Python simply because it is more mature (and I had already partially figured out how to manipulate Python ASTs to generate C code).


That's a compiler. I don't understand the desire to create a new word when the old one is fine.


"A program that translates between high-level languages is usually called a source-to-source compiler or transpiler" from https://en.wikipedia.org/wiki/Compiler.


Yes, a compiler. The fact there's a note in Wikipedia doesn't change my view.


Anyway, the main point is to give it a name that is as informative as possible. If I read "transpiler" I immediately make the connection with the fact that it translates a high-level language into another high-level language (which is what prometeo does) - but maybe I am biased.


You'd need to know what both languages are to know what it does, and with that knowledge you can just call it a compiler.


If you have a better reference I'd be happy to change my mind.


Not everything benefits from citation.


hmm I tend to be a bit skeptical about "ipse dixit"-like statements, but I think I got your point about compilers/transpilers :)


How does it compare to Pythran? Except for the fact that it's C and not C++?


Not sure how easy it would be to make the code generated by Pythran standalone, i.e., no dependency on the Python runtime library. Any Pythran expert? :)


Pythran code is standalone, i.e. no dependency on the Python runtime AFAIK.


It generates a Python extension, doesn't it? Would not know how to run it outside of Python.


It can generate Python extensions, but doesn't have to. Here is a blog post by the author about using it to generate self-contained C++ code: https://serge-sans-paille.github.io/pythran-stories/pythran-...

BTW, very cool project nevertheless, just wanted to see the differences/parallels to Pythran. There might even be room for collaboration on some features.


Pretty cool! I should check how it works in more detail.


Do you have benchmarks against NumPy on big computations (10-1000 s)?


No, that's not the timescale of interest, I would say. However, if the big chunk of the computation is delegated to HPC libraries, I would say that NumPy could be rather competitive there (although still not easy to embed). If instead you need to run the same piece of code many times, where a large fraction is pure Python, of course, it would not change the picture with respect to the "small" computations scenario.


Yes, I'm gonna talk about Julia...

It's kind of sad how much effort is put into the creation of new Python compilers to make it slightly faster, while the compile-latency problem that people hate about Julia is not tackled because of the lack of manpower to improve Julia's interpreter.

https://youtu.be/IlFVwabDh6Q?t=2530 (tldr: the Julia interpreter is currently about 500x slower than JIT code and there is a lot of low-hanging-fruit work there that could easily give it a 10x speedup - this could make it more viable to switch between compiler and interpreter depending on the work)


Personally, I think Julia is great - just don't know it well enough to write a package that takes Julia ASTs and generate C code from them :) There could totally be a Julia implementation of the main idea behind prometeo (Julia per se does not solve the problem that prometeo aims at solving).


You can just use `@code_llvm` to generate LLVM code, or `@code_native` to generate assembly. Does that do what you need?


hmm not sure, the compiled LLVM code would still depend on the runtime library?


The LLVM code will only call into the runtime for allocation or dynamic dispatch, both of which are avoidable. Lots of real Julia code will never touch it.


The problem with Julia in the use case of OP is really the fact that it is garbage collected (and perhaps also how its GC is tuned). You can work to eliminate allocations, but the memory-determinism problem is more important in real-time control and embedded systems. See, for example, this video: https://www.youtube.com/watch?v=dmWQtI3DFFo

It's kind of why C is still king in this space.


This is really cool. Just a bit of pedantry: is Python higher-level than C? If so this isn't a transpiler but a compiler :)


Fair enough - it's blurred I'd say. I see C as a lower-level, and yet still high-level, language, if compared to Python :)


Just write your code in C or C++ and be done with it. If you need math libs, there are plenty out there for anything you can imagine. Python will go the way Java went many years ago.


A brief comparison distinguishing it from Cython would be most welcome.



