Copy-and-Patch: Fast compilation for high-level languages and bytecode (2020) (arxiv.org)
182 points by tosh 30 days ago | 50 comments



We found these ideas very useful for doing micropatches (post hoc, intra-function binary patches) using off-the-shelf compilers. At the least, consider this a voice of support that this kind of calling-convention control is useful.

- Copy and Micropatch: Writing Binary Patches in C with Clang preserve_none https://www.philipzucker.com/permutation_compile/

- https://github.com/purseclab/Patcherex2/pull/31 A pull request to the PatcherEx2 library
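
For a flavor of what that control looks like, here is a hedged sketch (it assumes a recent Clang with the preserve_none attribute; the function name and parameters are made up): preserve_none passes arguments in nearly all general-purpose registers, so a micropatch written as a plain C function can bind the values that are live at the patch site to its parameters.

    /* hypothetical micropatch: parameters stand in for values that are
       live in registers at the patch site; preserve_none frees the
       function from the usual callee-saved register obligations */
    __attribute__((preserve_none))
    long micropatch(long live0, long live1, long live2) {
        return live0 + live1 * live2;  /* the replacement logic */
    }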


This is interesting. I came up with a very similar idea in the late 1990s to JIT a lazy functional language. The idea was to only compile the code that you needed to execute next, so you could run lazy code without needing to compile all of it (recursing across infinite lists, etc.). It had templates with holes (for things like cons).

I stuck my paper on it under Prof Phil Wadler's office door and ran away. I have no idea if he ever read it :D


If you post a link, we all can read it :)


If only I still had it. It was literally printed out and put in a plastic binder. I think it was written on a Sun SPARCstation in LaTeX.


There's some related discussion here: https://news.ycombinator.com/item?id=40406194


Thanks! Also related:

A copy-and-patch JIT compiler for CPython - https://news.ycombinator.com/item?id=38769874 - Dec 2023 (68 comments)

Copy-and-Patch: Fast JIT Compilation for SQL, WebAssembly, and Others - https://news.ycombinator.com/item?id=28547057 - Sept 2021 (7 comments)


Is this the technique they are using in Python's new JIT?


Yes. This is the algorithm chosen for the CPython JIT under development, scheduled for inclusion from 3.13 onwards. See https://m.youtube.com/watch?v=HxSHIpEQRjs


It’s kind of annoying how static compilers insist on figuring out everything from scratch. Even in an LTO build, the amount of code that changes between two consecutive commits is typically minimal. Rather than the very coarse ccache style, it would be nice if compilers had ways to cache state across executions, so that unchanged code constructs could just emit the same assembly as last time without needing to do anything else.


The very first objection that sprang to my mind is that this approach gives up on (chances of) inlining a lot of stuff (or de-inlining, for that matter). The second thought is that it also heavily interferes with any code motion transformations, including instruction scheduling, loop unrolling and auto-vectorization. Granted, this second objection can probably be dealt with by using some clever fine-grained techniques... but at that point I'd wager they'd a) be about as slow as doing all the work from a clean slate and b) have an incredibly large surface for obscure bugs; and those steps are already among the most finicky ones.

And back in the 70s, when microcomputers and interactive computing were becoming the norm and people really cared about compilation times, nobody bothered to implement that kind of caching, AFAIK, even though compilers performed far less complex optimizations than they do today (heck, even register allocation wasn't done as graph colouring back then).


Here's an incredibly minimalistic example:

    int f(double x, int y) {
        return x * y;
    }

    int g(int x, int y) {
        return x * y;
    }
These two functions differ by a single token, yet their assembly translations have only one common instruction, and this instruction is "ret":

    f:
        movapd      xmm1, xmm0      ; stash x (arrives in xmm0)
        pxor        xmm0, xmm0      ; zero xmm0 (breaks a false dependency)
        cvtsi2sd    xmm0, edi       ; convert y to double
        mulsd       xmm0, xmm1      ; x * y, in double
        cvttsd2si   eax, xmm0       ; truncate the product back to int
        ret
    g:
        mov         eax, edi        ; x
        imul        eax, esi        ; x * y, plain integer multiply
        ret


> The very first objection that sprang to my mind is that this approach gives up on (chances of) inlining a lot of stuff (or de-inlining, for that matter). The second thought is that it also heavily interferes with any code motion transformations, including instruction scheduling, loop unrolling and auto-vectorization.

I feel like a lot of these sorts of optimizations are best done at the language level, before the code gets to the instruction level; e.g. fusion, supercompilation, and partial evaluation are all more general and far more effective than these low-level rewrites. If you've already done those optimizations, then you just want to spit out code as fast as possible.
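
A toy sketch of what I mean (hypothetical f and g): fusion collapses two traversals into one before the backend ever sees them, eliminating the intermediate array entirely, and no amount of instruction-level rewriting recovers that as cheaply.

    static int f(int x) { return x + 1; }
    static int g(int x) { return 2 * x; }

    /* before fusion: two loops and a temporary array
           for (i...) tmp[i] = f(a[i]);
           for (i...) out[i] = g(tmp[i]);
       after fusion: one loop, no temporary */
    static void run(const int *a, int *out, int n) {
        for (int i = 0; i < n; i++)
            out[i] = g(f(a[i]));
    }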

The low-level transformation that matters the most is probably register allocation. Good register allocation is responsible for a lot of OCaml's speed. I'm not clear whether that would be significantly impacted with copy and patch.


No matter how you slice it, almost all time spent in a compiler is wasted recreating optimizations for things that haven't changed. Back in the 70s disk IO was slow and memory was scarce. Compilers couldn't cache much for that reason. Compilers today are designed for constraints that no longer exist.


> Back in the 70s disk IO was slow and memory was scarce. Compilers couldn't cache much for that reason.

That's... not exactly true? Yes, memory was scarce but that's precisely why the compilers in the 70s used disk I/O for temporary storage way more extensively than the modern ones do. Let's take Unix V6, for example: the C compiler used two temporary disk files to keep data between its three passes, the assembler used three temporary files to keep data between its two passes, and I can't quite wrap my head around how many temporaries its linker used but it's at least 4. Nobody bothered to try to keep and reuse those temporaries, however.

Compilers today, instead, keep almost everything in memory and don't generally bother with temporary files at all because why would they? The growth of the computing speed has outpaced the memory/disk speeds anyhow.


Overlay[1] is another related technique that is today completely obsolete.

[1] https://en.wikipedia.org/wiki/Overlay_(programming)


If you’re going for very fast codegen, sure. But in a production build you typically have problems like “this shifted by one instruction, so every non-relative branch beyond it needs to be patched”.


Hmm... if only there were some scholarly article about how to patch binary snippets in a fast way?


Rust's incremental compilation does this. Unfortunately, it doesn't help with linking, but it speeds up compilation by reusing everything that didn't change.


I could be wrong but I think Rust’s incremental compilation is still pretty coarse. LLVM is where this would need to be implemented and that’s a tall ask (i.e. skipping expensive computation when the AST is the same or close enough to a previously computed result). And Rust’s incremental compilation doesn’t apply to LTO. There’s only so much you can do from outside the compiler toolchain.


I would still call this a traditional template JIT. The new things have nothing to do with the copy&patch itself, but with using continuations and TCE (tail-call elimination), which I haven't seen in a simple template JIT (or what's sometimes called a method JIT) before. Plus the register variants, to avoid moving values around too much, but this is also very pedestrian, low tech. Low tech is good!
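
For the curious, a minimal sketch of the continuation/TCE style (hypothetical opcode handlers, assuming Clang's musttail attribute): each handler ends by tail-calling the next one, so execution threads through handlers without returning to a central dispatch loop or growing the stack.

    #include <stdint.h>

    typedef struct Inst Inst;
    typedef void Handler(const Inst *ip, int64_t *sp);
    struct Inst { Handler *fn; int64_t imm; };

    static void op_push(const Inst *ip, int64_t *sp) {
        *sp++ = ip->imm;                          /* push immediate */
        __attribute__((musttail)) return ip[1].fn(ip + 1, sp);
    }

    static void op_add(const Inst *ip, int64_t *sp) {
        sp[-2] += sp[-1]; sp--;                   /* pop two, push sum */
        __attribute__((musttail)) return ip[1].fn(ip + 1, sp);
    }

    static void op_halt(const Inst *ip, int64_t *sp) { (void)ip; (void)sp; }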


Do you have any links to the traditional ones for those interested (i.e. me)? I only found out about the approach due to the author's articles a few months ago.


I wonder if the comparisons with Cranelift would still show the same build-time/run-time gap four years later.

From what I understand, Cranelift's current instruction selector is very close to copy-and-patch conceptually (it maps patterns of bytecode instructions to sequences of machine instructions), and the pass Cranelift spends the most time in, register allocation, still has to happen in copy-and-patch.

So if copy-and-patch still has a meaningful advantage over Cranelift, I'm very curious to know where it comes from.


> At runtime, C&P generates executable code by copying the object code and patching the holes with runtime known values.

how would this work on OSs under hardened runtime rules?


The same as with any other JIT runtime: you do your transformations first, and then you do the `mprotect` call that turns write permissions off and execution permissions on. The only caveats I can think of (`pledge`d not to use `mprotect`, marked most of the address space with `mimmutable`) apply to all other JITs too. The gist is that you operate on a copy of code, and that copy is in a writable page until it's ready to run, so you never violate the W^X rule.
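
A minimal sketch of that flow (POSIX mmap/mprotect; error handling and page-size rounding elided):

    #include <string.h>
    #include <sys/mman.h>

    void *emit(const unsigned char *stencil, size_t len) {
        /* 1. get a writable, non-executable scratch page */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        /* 2. copy and patch while the page is still writable */
        memcpy(buf, stencil, len);
        /* ... patch holes with runtime-known values here ... */
        /* 3. flip to read+execute; the page is never W and X at once */
        mprotect(buf, len, PROT_READ | PROT_EXEC);
        return buf;
    }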


Or you do what V8 does with WebAssembly and just use WX pages, because doing it correctly without losing performance is "too hard".


Does that even work on W^X platforms? The context for my response assumed W^X; we can't simply throw that assumption out the window, right? I think I read somewhere about making two mappings to the same physical page (one W, one X); are you referring to that? (I'd still need to know how that works, as it kind of defeats the protection; the OS should prohibit it, right?)


Oh, for sure what I said wouldn't work on a W^X system. I was just pointing out that one of the most widely used JIT software uses WX pages.

What OSes prohibit that? Linux doesn't (well, I think it can with SELinux maybe?). OpenBSD might?


The question was about OSes with hardened runtime protections. The most basic of them all is W^X. All BSDs use it, and IIRC Linux is able to enforce it as well. I'd be surprised if it isn't the default in most distros, but I guess it's not impossible. I need to go for lunch so I won't check right now.



Aren't you just better off using GNU Lightning? It's basically a set of binary stencils defined by C macros, except because they're macros, your interpreter/compiler will only embed the stencils you actually use.


I love this approach. Has any independent toolkit evolved to help do this? Or have all projects rolled their own?


There's Deegen, which is being used to build a LuaJIT successor. https://github.com/luajit-remake/luajit-remake

From what I gather, Deegen is independent of the VM in that repo, but lives there for now.


I mailed the authors a few months ago and indeed, it lives there but is independent-ish. They don't consider it ready to use for anything else other than experiments, though.


Blog post about the approach here: https://sillycross.github.io/2023/05/12/2023-05-12/


It is utterly baffling that people are downvoting this comment.

The repo I link to here is by Haoran Xu, one of the authors of the paper.

Y'all need to chill out.


I think libgccjit uses this technique, or can use it; it reminds me of things I read about the Elisp JIT in newer versions of Emacs.


CPython's is in beta


Sure but from looking at the patches, it seems like theirs is tied into the CPython build and infrastructure. I am talking about a dedicated library which can be dropped into a variety of superprojects to help them implement a JIT.


(2020)


Link to the HTML version for those who need it and to help spread the knowledge of the wonderful arxiv ar5iv tool: https://ar5iv.labs.arxiv.org/html/2011.13127


Afaik, this technique is called threading (https://en.wikipedia.org/wiki/Threaded_code) and has been used in most bootstrap compilers for decades.


I don't think this counts as threading, though it makes use of it. Threading mostly removes the dispatch overhead, but you still have a general function per instruction, a function call, and an inability to inline constants. Copy and patch could be thought of as a generalization of threading in the sense that you still precompile code for your instructions, but instead of calling that code you poke holes in it to replace constants you only know at runtime, then make a single callable program out of those instead of jumping around for every instruction.
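
A rough sketch of the poke-holes part (hypothetical stencil names and hole offset; in the paper the holes are located via relocation entries emitted at build time):

    #include <stdint.h>
    #include <string.h>

    /* precompiled object code for a "push constant" op, containing a
       placeholder immediate; its length and the byte offset of the
       placeholder come from relocation info at build time */
    extern const unsigned char push_const_stencil[];
    extern const size_t push_const_len, push_const_hole;

    static unsigned char *emit_push_const(unsigned char *out, uint64_t k) {
        memcpy(out, push_const_stencil, push_const_len);  /* copy  */
        memcpy(out + push_const_hole, &k, sizeof k);      /* patch */
        return out + push_const_len;
    }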

There is a prior, very similar approach in GNU Jitter, but it uses only the compiler and some magic, rather than the linker, for marking the spots to replace. I read about it via a mention by moonchild in a thread[0] linked by foota here.

[0]: https://news.ycombinator.com/item?id=40410420


Threading doesn't necessarily mean following the full function-call convention on every transition. A switch/case approach is one way, but it's also possible to jump intra-function without following the calling convention, i.e. a goto with a pointer-based target.

I say possible in the sense that hardware supports it. The standard C language doesn't support this, but GCC's GNU C does support it as one of its extensions.
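
A tiny example of what that extension looks like (GNU C labels-as-values, i.e. computed goto; a toy accumulator program): dispatch is a single indirect jump per virtual instruction, with no call/return.

    #include <stdio.h>

    int run(void) {
        /* the "thread": a list of label addresses to jump through */
        static void *prog[] = { &&op_inc, &&op_inc, &&op_print, &&op_halt };
        void **ip = prog;
        int acc = 0;
        goto *(*ip++);                                /* start the thread  */
    op_inc:   acc++;               goto *(*ip++);     /* one jump per inst */
    op_print: printf("%d\n", acc); goto *(*ip++);
    op_halt:  return acc;
    }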

The GNU Jitter author wrote about this, see page 48, referred to as page '17', of this PDF: https://ageinghacker.net/talks/jitter-slides--saiu--ghm2017-...

See also Gforth writing about the same technique: https://gforth.org/manual/Threading.html


That is not the point though. I know threading allows using a custom convention and just jumping. What it doesn't do, IIUC, is allow specializing your code for runtime data. Copy-and-patch and Jitter do. It's essentially static vs. dynamic. I'm new to this so I may very well be misunderstanding threading, but I don't think it allows for modifying the functions at runtime by itself.


There's a bit of a mushy terminology problem here. Generally, when we say threading we mean schemes that don't rely on generation of new native-code sequences, sure, but people who are serious about efficient threading don't necessarily view it that way.

The term context threading refers to generation of native code that behaves much like conventional threaded-code interpretation, except specialised to the particular input code. [0][1]

See also dynamic superinstructions, a term used by gforth to refer to its runtime generation of native code for input Forth code. [2]

Disclaimer: I'm no Forth expert. I should give [1] a proper read, it looks interesting.

[0] https://webkit.org/blog/214/introducing-squirrelfish-extreme...

[1] (PDF) https://csng.cs.toronto.edu/publication_files/0000/0162/demk...

[2] https://www.complang.tuwien.ac.at/forth/gforth/Docs-html-his...


I see. I'll try to read the links over the weekend; they seem really interesting. It's sad that the terms are so overloaded, though, as it makes it hard to find preexisting approaches and leads to rediscoveries of the same idea (I see that both Jitter and C&P seem to implement this very same idea, except maybe in how the templates are created, and both seem to believe their invention is novel).


Regarding threaded-code strategies I think the terminology is reasonable. There's a real sense in which context threading is just another kind of threaded code, although it does make it awkward to describe conventional direct or indirect threading.

The creator of Jitter has read a lot about Gforth, as evidenced in [0][1].

You're not the only one to suspect there's some reinvention going on here somewhere though. [2] I'm not familiar enough with these topics (copy-and-patch and Jitter) to weigh in.

[0] (PDF) https://ageinghacker.net/talks/jitter-slides--saiu--ghm2017-...

[1] (PDF) https://ageinghacker.net/talks/jitter-slides--saiu--bts-2022...

[2] https://news.ycombinator.com/item?id=40410420


They are both super close in the latent space. I think you could design for threading, but then copy into place and NOP out what you don't need, or wire in the right constants, etc.


That's true. But it's still a stretch to say they are the same thing. That's why I said it can be seen as generalizing the approach.




