IR is better than assembly (2013) (popcount.org)
153 points by oherrala on Jan 1, 2017 | hide | past | favorite | 58 comments



> If you really can write better assembly than LLVM, please: Don't write any more assembler by hand - write IR and create new LLVM optimisers instead. It'll benefit everybody, not only you. Think about it - you won't need to write the same optimisations all over again on your next project!

Very noble goal, but I can imagine it taking a lot more time to do that than just writing a bunch of assembler instructions.

Perhaps there could be some intermediate approach, where LLVM can learn from an IR/assembly pair and improve itself (?)


Sure, you can file a bug with a snippet and expected assembly and make the devs do the work :P

They generally seem pretty willing if it's simple, and if it isn't then it'd probably be a pretty involved (but interesting!) side project for you anyway.


>LLVM can learn from an IR/assembly pair and improve itself

Oof, Reflections on Trusting Trust just got more interesting...


If I wanted to do this sort of thing, I'd probably use either intrinsics or C directly -- the compiler is already good at dealing with both, and will probably do a better job than LLVM IR.

The biggest reason to drop to assembly is that there are high-level constructs that the compiler is very unlikely to recognize and optimize effectively - things like AES-NI, hardware RNGs, and similar.
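
For example, a minimal sketch of the intrinsics route (assuming an x86-64 target with AES-NI, compiled with -maes) - the compiler lowers this straight to the AESENC instruction, with no hand-written assembly needed:

    #include <wmmintrin.h>

    /* One AES encryption round via the compiler intrinsic. */
    __m128i aes_encrypt_round(__m128i state, __m128i round_key)
    {
        return _mm_aesenc_si128(state, round_key);
    }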


This is neat; I had no idea there was an intermediate language like this which is cross-platform. It would seem that I could decompile binaries using LLVM tooling and then recompile for other platforms.

http://stackoverflow.com/questions/6981810/translation-of-ma...

Obviously not cross-OS, but it might be good for bare-metal stuff. In the past I've gotten libraries compiled with weird ABIs. This sounds really neat.


I'm afraid not, because LLVM IR does not abstract over things like endianness, word size, header file contents, or many other things that are platform dependent.


LLVM IR is pretty close to portable - the problems you mention affect assembler code as well. Even the examples in the article show the same IR being compiled to different CPU architectures.

I'm not sure I'd go hand-writing IR code, though. It's pretty easy to just write C code with vector extensions, etc. to produce the IR I'd be after. When I do need to write assembler code these days, it's typically to get access to some privileged instructions in kernel space. Most other instructions are available in C code via __builtin_foo_bar functions.
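
As a sketch of the vector-extension route (GCC/Clang's __attribute__((vector_size)); the typedef name v4si is just the convention from the GCC docs), this is plain C that produces native vector types in the IR:

    /* The compiler emits a single SIMD add on targets that have one. */
    typedef int v4si __attribute__((vector_size(16)));

    v4si add4(v4si a, v4si b)
    {
        return a + b;
    }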


You have to be very, very careful when writing platform agnostic IR.


Sure, but you have to care about word size, endianness, etc. when writing C code too.
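
For instance, a minimal sketch of that platform dependence in plain C:

    #include <stdint.h>
    #include <stdio.h>

    /* The same source observes a different first byte on
       little- vs big-endian targets. */
    int main(void)
    {
        uint32_t x = 0x01020304;
        uint8_t first = *(uint8_t *)&x;  /* 0x04 little-endian, 0x01 big-endian */
        printf("first byte: 0x%02x\n", first);
        return 0;
    }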

You would not go writing entire apps with it anyway, just a few inner loops or so.

I'd still use assembly-looking C with extensions and intrinsics for that, though.


The other thought I had here is that, AFAICT, IR is not a standard. There is no requirement that it remain compatible in 50 years or 5 months. There is no standard IR, and there shouldn't be, as that would become an impediment to compiler evolution and fit/optimization to newer architectures.

Doesn't AS/400 use an IR approach as well, which let IBM seamlessly migrate the underlying CPU a few(?) times now?


Almost every modern compiler uses some form of intermediate representation. The choice of IR is shaped by history and design. As the posted article shows, LLVM uses an SSA-based IR to describe programs. GCC, in contrast, uses two IRs: GIMPLE and a LISP-style IR called RTL. GHC uses Core (Haskell without the syntactic sugar).

The purpose of every IR is to remove the ambiguities and language complexities of programs. By simplifying programs into a series of statements such as "%3 = op $type %1, %2" (concretely, "%3 = add i32 %1, %2"), generic optimisers can be built easily. Certain language-specific optimizations can be written in the frontend of the compiler, as it has knowledge of the language being compiled; generic LLVM IR may not be equipped to deal with issues such as devirtualization in C++ (though there is work being done in that area).

LLVM's IR undergoes fairly frequent changes to better handle "new" problems.


> The other thought I had here is that, AFAICT, IR is not a standard. There is no requirement that it remain compatible in 50 years or 5 months.

Correct. Libfirm[0] is the only compiler I am aware of that attempts to use a "firm" IR.

[0] http://pp.ipd.kit.edu/firm/Index


The ACK's IR (which it calls EM code) has been stable for 34 years.

http://tack.sourceforge.net/olddocs/em.pdf


Neato! Thanks for the link.

Sorry if the tone of "only compiler I am aware of" came off as snotty; it was meant as an expression of my naivety on the subject.


Nah, don't worry; nobody's heard of the ACK! (Mostly because it's so old that it doesn't really believe that registers exist, which means it doesn't get on well with modern architectures.)


> Doesn't AS/400 use an IR approach as well, which let IBM seamlessly migrate the underlying CPU a few(?) times now?

Yes, it uses an IR approach. MI code is essentially a bytecode to which programs are compiled; the OS in turn compiles it to the underlying machine code. AS/400 (or IBM i, as it is now officially called) did use this to help in the switch from a proprietary/undocumented CISC CPU architecture (apparently similar but not identical to the IBM mainframe instruction set) to PowerPC/POWER. However, binaries are stored with both MI code and machine code together, and there is an option, "Delete Program Observability", which removes the MI code section and makes that migration strategy impossible - but in practice many people didn't choose that option, and even if you did, as long as you or the vendor still have the source code it's just a recompile to fix.

AS/400 has two program models - OPM (Original Program Model) and ILE (Integrated Language Environment). Basically, the original object code format, runtime library, etc. were designed for use with RPG/COBOL/PL/I, and they didn't work very well with languages such as C and C++, so ILE was created to remedy that deficiency. The relevance to IR is that OPM and ILE actually use two different MI formats - Old MI (OMI) for OPM and New MI (NMI) for ILE. (NMI is also called "W-code".) IBM has publicly documented most of the details of OMI and provides a public API to convert OMI code to machine code. By contrast, my understanding is that they've chosen to keep the details of NMI confidential, and there are no public APIs to convert it to machine code.


> Doesn't AS/400 use an IR approach as well, which let IBM seamlessly migrate the underlying CPU a few(?) times now?

In these systems, both something like IR and the final machine code are stored in the binary. The OS recompiles the IR for the current CPU if necessary and replaces the machine code in the binary. That way there is no compilation overhead unless the architecture changes.


IR isn't a standard but there are efforts to standardize a stable subset. This would be useful for projects like Google's pnacl and Apple's bitcode. I'm guessing they currently must invest lots of effort to stay in sync with upstream llvm.



The Open64 / PathScale compiler suite has had intrinsics written in its IR (WHIRL) for a long time. WHIRL is stable enough that this isn't a maintenance problem. Being written in IR means that the full power of inlining, function specialization, etc. gets applied to them, even when whole-program optimization isn't being used.



Is there an actively maintained fork of Open64 anywhere? Also, I recall PathScale being open-sourced a while back, but it looks like newer releases are proprietary again and I can't find the source code for the open-source one.


Most of the guides I've seen for LLVM recommend you use the LLVM libs to generate the IR. Why? I feel like it would be much easier to generate the IR directly like the author has done.

It also wouldn't tie me to any particular library - I think the only actively maintained one is the C++ one.


The IR is highly unstable, especially the text representation, which is basically just a debugging tool.

The bitcode format is rather complicated, but at least there is backwards compatibility.

LLVM is the library, so you are bound to it anyway.

Further, by writing your own code for emitting IR you are basically just duplicating functionality that already exists.


> The IR is highly unstable, especially the text representation, which is basically just a debugging tool.

It's not guaranteed to be stable, but it's not "highly unstable" either. Not too many breaking changes have been introduced in the past few years and it's unlikely that you'd hit those parts when hand writing it.

Not that I'd recommend hand writing LLVM IR.

It's a shame that we don't have an actual portable IR, which would not be tied to toolchain version or contain target specifics. LLVM IR can't be used as such for a portable IR, which is why we've seen efforts like SPIR-V (Vulkan GPU IR) and WebAssembly (IR for browsers), both of which are very similar to LLVM IR (and a lot of work was duplicated).


WebAssembly is very different from LLVM IR, because it is essentially just the lowest common denominator of what existing JavaScript VMs handle in the backend. For instance, WebAssembly allows only reducible control flow, has very limited support for function pointers, and has a particularly nasty memory model (linear memory, which precludes any kind of non-trivial alias analysis and optimization).

All of these things may change in the future (apart from the reducible control flow restriction), but, at the moment, the goal of WebAssembly is just to have a viable compilation target for C++ in the browser. Google was championing LLVM IR for this purpose in the form of pnacl, but that was not enough of a compromise to work on the web platform. :(


You can make a perfectly stable IR from that Kaleidoscope example in no time and still avoid tying yourself to C++-ish LLVM. I'm not sure how big a deal that stability thing is, though - you still have to cover everything with tests, so who cares if it breaks a bit in future releases?


Because LLVM makes no guarantees about IR compatibility between releases. With the API at least, your compiler can catch the larger breaks when you upgrade, and they're likely to come with more doc/comments and change notes than the IR tweaks.


To be more precise, there is no guarantee at all about textual IR compatibility. Compatibility of bitcode, which is a binary representation of IR, is meant to be preserved across a very generous range of LLVM releases (but not forever).


Presumably because it allows a workflow that skips parsing (and possibly error-prone string generation). It also allows generators to e.g. lint things without a bananas workflow.


The C API is probably the best bet.

Most projects that aren't written in C++ themselves use the C API. It has fewer features than the full C++ API, but is more stable.


I object to the article just associating the term "IR" with LLVM IR.

IR is short for "Internal Representation". Most complex compilers have at least one level of IR, usually multiple ones that are progressively lower level.

The point is that an IR carries more information than machine code, and so potentially allows more specific optimizations.

This should at least be mentioned in the article.

Rust and Swift, for example, both use LLVM, but have their own intermediate levels of IR (MIR for Rust, SIL for Swift).

LLVM IR is already stripped of a lot of information that might be important for certain higher-level optimizations. For example, integer types carry no sign of their own; signedness is instead expressed by the operations, which come in signed and unsigned variants (e.g. sdiv vs udiv).


In this context, IR stands for Intermediate Representation.

Many if not most compilers have one or more intermediate representations, but most of them are not as rigorously specified as LLVM's.


In this way, IR would fulfil the same role MACRO-32/64 did for porting VMS to Alpha and beyond. However, my understanding (sorry, I was still crawling when VAXes were on the way out) is that the benefit there was retaining "VAX" syntax to avoid massive rewrites.

If you're starting from a clean slate, what's the benefit of writing IR? Why not use C? After all, IR won't really give you complete control over the generated code, and it's still an abstract VM (albeit one that obviously allows writing IR that will only sensibly compile on a specific arch - e.g. system register accesses and so on).


Why the hell are people down-voting this? It's a purely technical comment. Seriously people, someone making an argument against a technology you like shouldn't make you reach for the down-vote button. I've noticed I get down-voted too if I post something critical of certain technologies, and it's really sad. Make a response or an argument.


(Replying here, so my upvote on the parent doesn't get lost...)

Working in C means you get restricted to only doing things which C can do, and you're out of luck if you want to do things that C can't do: unaligned accesses, tail calls, saturating arithmetic, overflow detection, exceptions, stuff like that. IR allows you to do all this, at the expense of being considerably more complex and painful to use.

Of course, if your compiler doesn't want to do any of this, then C is a perfectly valid choice, as it doesn't tie you to the LLVM toolchain --- see Nim, for example. But as soon as you venture outside C's comfort zone, working in IR starts paying off.


But LLVM-C (i.e. C + llvm extensions) does do all the things you just described, plus it's a stable interface for programming. IR is unstable, and therefore unsuitable as a source language for most projects. I say this as someone who writes an enormous amount of both assembly and llvm-c, and has considered and dismissed IR for exactly this reason.
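
For example, overflow detection (a minimal sketch assuming GCC/Clang's checked-arithmetic builtins) - plain C, no raw IR or assembly required:

    #include <stdio.h>

    int main(void)
    {
        int sum;
        /* __builtin_add_overflow returns nonzero on overflow. */
        if (__builtin_add_overflow(2000000000, 2000000000, &sum))
            puts("overflow detected");
        else
            printf("sum = %d\n", sum);
        return 0;
    }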


That's all fair and correct, but the amount of assembly - native or IR - written for, say, an OS is quite minimal and well-controlled, and it's usually assembly for a good reason (exception handlers, for example). At that point it's just more obvious to write native assembly, given that you're writing architectural code, and for a different architecture that code would be sufficiently different anyway.

So again, I see all the value in IR for compilers. Can someone give me at least one reasonable use case for starting to write in IR, or even for using IR to optimize certain code paths?


LLVM IR can detect overflow, has exceptions, tail call optimisations etc? I thought it was lower level than C itself. I have kind of a strict hierarchy of abstraction in my head, with LLVM IR sitting a bit below C, but maybe it's not that simple.



Sitting below C is exactly what you want in this case. It's not about tail calls as an automatic optimization - it's about having a tailcall opcode of some sort that guarantees a tail call when used (but leaves you responsible for lifetimes etc). That is an inherently low-level facility - lower than what the C standard provides. C implementations can do tail calls, but because this behavior isn't actually guaranteed and is purely a quality-of-implementation issue, a language that needs tail calls for its idiomatic code to not blow up the stack cannot target portable C, unless it redoes the stack itself.
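
A minimal sketch of that quality-of-implementation problem in C:

    /* gcc/clang at -O2 will usually compile this tail call into a jump,
       but nothing in the C standard guarantees it, so code relying on
       unbounded tail recursion can't safely target portable C. */
    static long count_down(long n, long acc)
    {
        if (n == 0)
            return acc;
        return count_down(n - 1, acc + 1);  /* tail position */
    }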


It's interesting, since I proposed using LLVM in place of inline assembly a while back. I got this counterpoint when I asked about it on ESR's blog:

http://lists.llvm.org/pipermail/llvm-dev/2011-October/043724...

Any LLVM experts have thoughts on this or my original goal within the context of LLVM's current situation?


I'm the opposite of an LLVM expert, but I'm having a hard time seeing the relevance of that email. It's about how LLVM IR isn't cross platform, but inline assembly isn't either, right?


This may be a little off-topic, but does anyone know a good, up-to-date tutorial for using LLVM from C?


LLVM's C compiler is clang. You can find it at

http://clang.llvm.org/

In many cases it's just a drop-in replacement for gcc.


I rather understood it as them wanting to know whether there is a C API for using LLVM, so perhaps something like this: https://pauladamsmith.com/blog/2015/01/how-to-get-started-wi...


That's exactly what I'm looking for, but it's almost two years old, and I think I tried it a few months back and got lots of errors. I was hoping to find a more up-to-date tutorial. Unfortunately the LLVM people don't pay much attention to documentation and tutorials for their product.


I've been there.

Pretty much the only way is to just read the C API header and figure things out from there.

It does contain comments, which are mostly helpful, but sometimes incomplete or out of date.
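
To give a flavour, the core of a minimal build-and-dump program looks roughly like this (a sketch - the function names are from llvm-c/Core.h, but exact signatures can drift between releases):

    #include <llvm-c/Core.h>

    int main(void)
    {
        /* A module holding one function: i32 add(i32, i32). */
        LLVMModuleRef mod = LLVMModuleCreateWithName("demo");

        LLVMTypeRef params[] = { LLVMInt32Type(), LLVMInt32Type() };
        LLVMTypeRef fn_type = LLVMFunctionType(LLVMInt32Type(), params, 2, 0);
        LLVMValueRef fn = LLVMAddFunction(mod, "add", fn_type);

        LLVMBuilderRef builder = LLVMCreateBuilder();
        LLVMPositionBuilderAtEnd(builder, LLVMAppendBasicBlock(fn, "entry"));

        LLVMValueRef sum = LLVMBuildAdd(builder, LLVMGetParam(fn, 0),
                                        LLVMGetParam(fn, 1), "sum");
        LLVMBuildRet(builder, sum);

        LLVMDumpModule(mod);  /* prints the textual IR */

        LLVMDisposeBuilder(builder);
        LLVMDisposeModule(mod);
        return 0;
    }

Building is typically a matter of compiling with the flags from llvm-config --cflags and linking with llvm-config --libs core.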


This series may also be relevant, but I never read it in full: https://msm.runhello.com/p/category/plt/no-compiler


I don't have projects that fit this, but it sounds like a no-brainer. Abstracting away the specifics and keeping the timeless parts is a beautiful move!


LLVM IR is not really suitable because of compatibility. Someone should create a portable assembler on top of LLVM instead.


What's the difference between LLVM IR and the various GCC IRs?


TL;DR: vendor lock-in.


Vendor lock-in comparable to writing in Rust or Go.


Sure. My point was that it would be interesting to have a standard IR, instead of a vendor-specific one :-)


Yes. Let's all write IR instead of assembly. Let's encourage others to do the same, making assembly even more of an ivory tower (or, more accurately, a grimy sub-basement) than it already is, further discouraging newcomers from learning it and thus keeping them from properly understanding the machines they program on. Eventually, nobody will understand any of these machines.

Use whatever you want in production, I don't care. But don't discourage people from learning assembly. It's a worthwhile task.


Has anyone told the LLVM team that the Tower of Babel is a myth and that it ends badly?

Some CPUs have specific idioms that are not only hard to translate but need to be used fluently - like a natural language.

Btw, I never use any software named after a myth that was a pure failure, such as Babel or the Death Star. It makes me feel like the people behind it intend to fail.


Not really. IR is exactly a good fit for compiled languages, because the languages targeted by LLVM have no "hard to translate" idioms that need wholly different translations on different architectures - which, by the way, aren't really all that different at a high level.

However I am still looking for a use case to write IR directly, or in place of bits of inline assembly.




