How Tail Call Optimization Works (eklitzke.org)
220 points by g42gregory on Jan 8, 2021 | 208 comments



Inspired by my observation that tail calls can be seen as "goto with arguments"[1], in 2013 I wrote a Common Lisp macro called argtags which is like tags/prog, but with labels that have parameters. Giving them arguments causes the corresponding variables to be assigned as the transfer takes place. It compiles to tagbody.

Then, I used argtags as the basis for tlet: a macro that looks like it is defining local functions using syntax similar to labels or flet, but actually compiles to argtags, so everything ends up as a tagbody.

The calls are always tail calls, whether or not in tail position. You cannot accidentally blow the stack, but you can accidentally expect a call to return, which it won't!

This is all found here:

http://www.kylheku.com/cgit/lisp-snippets/tree/tail-recursio...

There is a complementary cross-module tail calling system in here which provides a defun-like construct deftail for defining tail calling functions. That's based on a trampoline dispatch loop: tail calls unwind up to the loop, passing their arguments to it, along with the next function to be called.

[1] Later I found out that Steele also offered the viewpoint back in the 1970's that tail calls can be seen as goto with arguments.


I'd like to thank you for helping me understand what non-techies must feel listening to me talk sometimes, with that penultimate sentence.

I understood none of it, but I know it has deep meaning and I respect it.


That’s amazing!


Tail call optimization was added to the Javascript language specification in ES6 (ES2015). However, most Javascript runtimes don't support it. This is a shame, because recursive operations like Haskell's foldl/r would be more straightforward to translate to Javascript. Minus the automatic currying and lazy evaluation, of course.
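For illustration, a foldl written directly as a tail-recursive function (an untested sketch, not from the article) looks like this; without tail call elimination the recursive call costs one stack frame per element:

  // Left fold with the recursive call in tail position.
  // With proper tail calls this runs in constant stack space;
  // without them, a long enough array overflows the stack.
  function foldl(f, acc, arr, i = 0) {
    if (i >= arr.length) return acc;
    return foldl(f, f(acc, arr[i]), arr, i + 1);
  }

  console.log(foldl((a, b) => a + b, 0, [1, 2, 3, 4]));  // 10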

More info: https://2ality.com/2015/06/tail-call-optimization.html

Edit: Babel may have tried, but couldn't do it for all cases, so they stopped trying? https://github.com/babel/babel/issues/256


Safari supports it. Chrome had it ready, behind a feature flag. It was then dumped for FUDdy reasons centered around user concerns that turned out not to be warranted (Safari has proper tail calls per the ES6 spec, and no one complains about it).

The truth is that it would have been very expensive for MS to add it to Edge (which was built around incompatible calling conventions), and that cross-realm tail calls (calling into code from another iframe) would be hard to implement in Firefox.

Now that Edge switched to Chromium, the only problem left is with FF. It could be spec’ed around, but the consensus in the TC-39 isn’t there anymore :-(


For folds over lists you don't need tail call optimization:

You just implement them any which way (eg with loops), and the user of those functions never has to know. That's how eg reduce or map or filter work in Python.

That holds in general for combinators.
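A minimal sketch of what I mean (plain JavaScript, nothing clever): the fold is a loop internally, and callers can't tell that no recursion, and hence no TCO, is involved:

  // A left fold implemented with an ordinary loop.
  function reduce(f, acc, arr) {
    for (const x of arr) {
      acc = f(acc, x);
    }
    return acc;
  }

  console.log(reduce((a, b) => a + b, 0, [1, 2, 3, 4]));  // 10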

Haskell is actually a bad example to use here, because the tail recursive foldl is rarely used. Thanks to laziness, foldr is usually preferred, and it is not tail recursive.

OCaml or Scheme are perhaps better references? Especially given the history of Javascript: the author originally wanted to write a Lisp, but got told by management to add curly braces.


> Thanks to laziness, foldr is usually preferred, and it is not tail recursive.

With guarded recursion. foldl' has its place.


Yes, hence my weasel-y expression of 'usually preferred'.


It's sad that browsers decided to do this. In my opinion, this just means that Chrome, Firefox & co. simply don't support ES6. It's not that they support it with a small footnote; they just don't support it. Period.

I just wish we'd live in a world where people were willing to write code using TCE and just say "wontfix" when users of non-compliant browsers complain that "the code crashes", but alas, in 2021 browsers can just choose to not fix non-compliances and blame the website instead. This is just fucked up.


Hm what does TCO have to do with a language spec? It impacts neither syntax nor semantics of a language, which is what a language spec should be about. It's a detail of the compiler.

So I'm curious what this has to do with the language spec? Couldn't a smart compiler/interpreter do this before ES6 already?


"Optimization" is a misnomer with TCO. Optimization implies that only execution speed is affected without the optimization. Without TCO, potentially infinite mutually tail-recursive algorithms (or code transformed into continuation-passing style) can't be implemented without emulating tail recursion using a heap-allocated stack/linked list of states or manually transforming the mutually recursive functions into a trampoline variant of continuation passing style.

  function factorial(n, accumulator=1) {
    if (n <= 1) {
      if (n < 0) throw "Arument must not be negative";
      return accumulator;
    }
    return factorial(n-1, n * accumulator);
  }
TCO doesn't just affect the speed of this factorial implementation, it greatly affects the largest value that can be computed.

It's not too tricky to manually convert tail recursion into a loop, but for mutually recursive functions, you'd probably be best off manually transforming them into a trampolined variant of continuation passing, where each function returns a tuple of function and arguments to a trampoline loop that just repeatedly invokes the returned function on the returned arguments.

  function trampoline(f, ...args) {
    while (true) {
      [f, args] = f(...args);
      if (args == null) { return f; }
    }
  }

  function factorial_impl(n, accumulator) {
    if (n <= 1) {
      if (n < 0) throw new Error("Argument must not be negative");
      return [accumulator, null];
    }
    return [factorial_impl, [n - 1, n * accumulator]];
  }

  function factorial(n) {
    return trampoline(factorial_impl, n, 1);
  }
It's all very mechanical, and you'd prefer that the compiler does it. The main argument against TCO is that it's easy for an unobservant programmer to accidentally make a change that prevents TCO. Ideally, the language would have a construct allowing a programmer to cause a given return site to be a syntactic error if it's not a TCO-able tail call.


"optimization" does not imply that only execution speed is impacted.


> "optimization" does not imply that only execution speed is impacted.

No but "optimisation" implies the normally observable semantics are not altered, which is not the case at all with TCE (aka "guaranteed TCO"). Removing TCE from a language makes it a completely different language e.g. Erlang would be severely restricted as it has no imperative loops.


Fair enough, but in most people's minds, it suggests that the correctness of an implementation doesn't depend on which optimizations are applied, and in this sense, TCO isn't an optimization.

TCO isn't merely an optimisation in the same sense as local/global value numbering, common sub-expression elimination, dead-code elimination, loop-invariant expression hoisting, etc.


> It impacts neither syntax nor semantics of a language

It impacts the correctness of algorithms. For some algorithms it's the difference between using O(1) stack space and O(n) or more, where the latter would blow the stack and effectively crash the program.
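The classic illustration is mutual recursion (a sketch with hypothetical isEven/isOdd helpers): with tail call elimination this runs in O(1) stack for any n, without it the stack grows with n and eventually the program crashes:

  "use strict";  // proper tail calls only apply in strict mode per ES2015
  // Mutually tail-recursive parity check: O(1) stack with TCE,
  // O(n) stack (and an eventual stack overflow) without it.
  function isEven(n) {
    if (n === 0) return true;
    return isOdd(n - 1);
  }
  function isOdd(n) {
    if (n === 0) return false;
    return isEven(n - 1);
  }

  console.log(isEven(1000000));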


Automatic TCO affects stack traces, which programs might rely on at runtime, requiring that browsers agree on what the right behavior should be; on the flip side, manual TCO (with annotations) would need to be a language feature.


This is the correct answer per the linked article.

In particular this is runtime inspectable via func.caller so it really is a behavioral language change to allow the elision of stack frames.
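A rough sketch of that observability, using Error().stack (non-standard but widely implemented) rather than func.caller:

  "use strict";
  function leaf() {
    // With proper tail calls the middle() frame has been replaced by this
    // one, so the trace jumps from leaf() straight to outer(); without
    // them, middle() shows up in between.
    return new Error().stack;
  }
  function middle() {
    return leaf();           // a tail call: its frame can be elided
  }
  function outer() {
    const trace = middle();  // not a tail call, so outer() stays on the stack
    return trace;
  }

  console.log(outer());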


Safari supports proper tail calls per the ES6 spec. If your code relies on the non-elimination of tail calls it will not work in Safari, and thus not be Web compatible.


One could argue that writing programs which rely on the stack trace is bad practice. Handling of exceptions should take care of unexpected situations. I don't know of any scenario where a program needs its own stack trace that could not be solved by making proper use of exception handling. Perhaps someone can give an example?


> Automatic TCO affects stack traces, which programs might rely on at runtime, requiring that browsers agree on what the right behavior should be;

Browsers don't have a say in this. They're making the decision to not be compliant with the specification, so they effectively don't support ES6.


It does have effects on the semantics: without TCO, simple recursion will blow the stack and is therefore a very sketchy technique (few if any specs describe the minimum size of the stack).


The C standard does not mention a stack. One can implement a C compiler that puts activation records on the heap or does other magic to avoid a stack that can overflow. A call stack is an implementation detail. Expecting it to exist or not exist is assuming things about the compiler. Those assumptions may be reasonable but are not part of the semantics of the language itself. They are about particular properties of your compiler.


You are entirely correct[1]. Which is why the Scheme standard says,

"Implementations of Scheme must be properly tail-recursive. Procedure calls that occur in certain syntactic contexts called tail contexts are tail calls. A Scheme implementation is properly tail-recursive if it supports an unbounded number of active tail calls. A call is active if the called procedure may still return. Note that this includes regular returns as well as returns through continuations captured earlier by call-with-current-continuation that are later invoked. In the absence of captured continuations, calls could return at most once and the active calls would be those that had not yet returned. A formal definition of proper tail recursion can be found in Clinger's paper [5]. The rules for identifying tail calls in constructs from the (rnrs base (6)) library are described in section 11.20."

in Chapter 5, Semantic Concepts (http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-8.html#node_cha...). Clinger's paper is "Proper Tail Recursion and Space Efficiency" (https://www.cs.tufts.edu/~nr/cs257/archive/will-clinger/prop...).

The Rationale for "properly tail recursive" is,

"Intuitively, no space is needed for an active tail call, because the continuation that is used in the tail call has the same semantics as the continuation passed to the procedure containing the call. Although an improper implementation might use a new continuation in the call, a return to this new continuation would be followed immediately by a return to the continuation passed to the procedure. A properly tail-recursive implementation returns to that continuation directly.

"Proper tail recursion was one of the central ideas in Steele and Sussman's original version of Scheme. Their first Scheme interpreter implemented both functions and actors. Control flow was expressed using actors, which differed from functions in that they passed their results on to another actor instead of returning to a caller. In the terminology of the report, each actor finished with a tail call to another actor.

"Steele and Sussman later observed that in their interpreter the code for dealing with actors was identical to that for functions and thus there was no need to include both in the language.

"While a proper tail recursion has been a cornerstone property of Scheme since its inception, it is difficult to implement efficiently on some architectures, specifically those compiling to higher-level intermediate languages such as C or to certain virtual-machine architectures such as JVM or CIL.

"Nevertheless, abandoning proper tail recursion as a language property and relegating it to optional optimizations would have far-reaching consequences: Many programs written with the assumption of proper tail recursion would no longer work. Moreover, the lack of proper tail recursion would prevent the natural expression of certain programming styles such as Actors-style message-passing systems, self-replacing servers, or automata written as mutually recursive procedures. Furthermore, if they did not exist, special “loop” constructs would have to be added to the language to compensate for the lack of a general iteration construct. Consequently, proper tail recursion remains an essential aspect of the Scheme language."

(http://www.r6rs.org/final/html/r6rs-rationale/r6rs-rationale...)

The language specification describes, indirectly, all of the valid programs in the language. With proper tail call elimination as part of the language semantics, simple recursion is a valid iteration technique in those programs. Without proper tail call elimination as part of the language semantics, it is not.

[1] Technically, a hardware stack is not required. However, any programming language with a 'function' abstraction will be required to record the return location somewhere (if not function arguments and other stuff); without tail call elimination, the space for those records grows linearly with function call depth, no matter how it is stored.


Thank you! This was the well-informed answer that I was hoping for!


People want guaranteed TCO so that they can write code relying on it. If TCO ends up not happening such code will leak stack memory or hit max-recursion-limits.


As far as I understand, TCO can't be applied in all situations, so it may be risky to rely on it. If you need to rely on it (meaning that your program breaks if it's not applied, as opposed to just being a bit slower), then shouldn't the language provide a way to enforce it? That is, the language semantics would specify O(1) memory for tail recursion, for example. If it does not, you are just up for a guessing game and things will break in unexpected ways and at unexpected moments.


If it is not guaranteed by the language specification (as it is in e.g. Scheme), you cannot rely on it for certain kinds of algorithms, because it will just blow up in your face when you change implementations.


It would simplify things a great deal for people compiling to JS. Forcing one looping facility into the behaviour of another can be hard. Implementing just about any looping facility as recursion is a matter of thinking for a while and then sitting down and doing it.
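For example (a toy sketch, not any particular compiler's output), a while loop maps naturally onto a tail-recursive helper, which is only safe to emit if the target guarantees tail call elimination:

  // "while (cond(state)) state = body(state); return state;"
  // expressed as tail recursion.
  function whileLoop(cond, body, state) {
    if (!cond(state)) return state;
    return whileLoop(cond, body, body(state));  // tail call
  }

  // Counts to 1e6: fine under TCE, a stack overflow risk without it.
  console.log(whileLoop(n => n < 1000000, n => n + 1, 0));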


If you want to play with TCO in a node-type environment, the built-in JSC on MacOS is one option. It doesn't include system integration tools (file operations, etc.) and isn't suitable for production (at least not in any way I'm comfortable with).

https://stackoverflow.com/questions/6830192/how-to-run-javas...
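A quick way to check is to feed it a deep tail-recursive loop that would overflow the stack without proper tail calls (a sketch; print() is the jsc shell's built-in, and the location of the jsc binary varies by macOS version):

  // count.js -- prints 1000000 if proper tail calls work,
  // otherwise dies with a stack overflow.
  "use strict";  // ES2015 proper tail calls apply only in strict mode
  function count(n, limit) {
    if (n >= limit) return n;
    return count(n + 1, limit);
  }
  print(count(0, 1000000));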


Chrome supports tail calls on desktop and mobile for Wasm. Not the same as JS, but it is definitely thought of.

Obviously this is just support in the instructions.


There's a funny interaction between TCO and memory management. If you have scoped destructors (C++, Rust) then you might accidentally defeat TCO by simply declaring a variable. GCs solve this problem by deferring collection to allocation points, but now TCO requires GC.

What's the modern view here? I think it's bimodal: either TCO is a defining feature of the language, or a best-effort optimization that nobody should rely on.


> either TCO is a defining feature of the language, or a best-effort optimization that nobody should rely on.

I think a third option is that tail calls are only guaranteed if you add a special attribute (eg. [[musttail]]) and that attribute fails to compile if your code is written in such a way that a true tail call cannot be guaranteed.

The LLVM backend supports a "musttail" marker on function calls (https://llvm.org/docs/LangRef.html#call-instruction), but it specifies a set of constraints that must be met for "musttail" to be valid. I have been experimenting with a Clang change that would plumb a C++ [[musttail]] attribute through the C++ frontend to LLVM, but it requires some care to make sure we satisfy LLVM's constraints.


I nominate the syntax

    goto function(parameter1, parameter2);
Also, may I suggest that it runs destructors prior to the jump rather than throwing up its hands and saying you can't TCO because you have destructors?


I don't think it will be a new syntax, just an attribute on existing syntax, eg.

    [[clang::musttail]] f(a, b);
or:

    [[clang::musttail]] return f(a, b);
> Also, may I suggest that it runs destructors prior to the jump rather than throwing up its hands and saying you can't TCO because you have destructors?

I don't think that is feasible, it would change the semantics of the language. You can always put your destructors in an inner scope if you want them to run first:

  int Func() {
    int ret;
    {
      TypeWithDestructor a;
      ret = a.get();
    }
    [[clang::musttail]] return f(ret);  // 'a' has already been destroyed here
  }


What about the destructors for the temporaries used in the computation or the function call parameters, or the parameters themselves?


Handle it the same way as a return I think. Evaluate the temporaries, move / copy into the right place, run destructors, jump? Copy elision might be hard to pull off I suppose.

Edit: Hmm there is an issue in handling the stack space I suppose, because there are live old parameter objects already in the place you want to put your new parameters. An ugly but possibly workable way of handling it is to have two areas for parameters. One used for odd recursion depths and another used for even ones.


Then you changed the semantics of the code. Now a program that e.g. prints a message in its destructor, which gives this output:

  Message 1
  Message 2
  Message 3
Will give this output:

  Message 3
  Message 2
  Message 1
Your odd/even fix also won't work, because a function may push an argument to a list, and pass the list in another argument, making the original temporary accessible from any recursion level.

A call to a destructor is a normal call. If you need to call destructors, then it means your tail call is a call to a destructor and not what you're seeing in the code. It's a fundamental semantical problem. You can't have "make the last call in this function be an implicit call to some function" and "make the last call be the explicit call to this other function" at the same time.


If it's a different syntax, then you can expect people to know that it's semantically different. That's the whole point of using a different syntax, I think.


I thought that the point of this syntax was to make the compiler warn you when what you think is a tail call in fact isn't one, and to demand that the compiler optimize it; not to introduce a completely new control flow construct.


> push an argument to a list, and pass the list in another argument, making the original temporary accessible from any recursion level.

Either you push a copy of the object to the list, in which case it is fine. Or you push a pointer/reference to the object, in which case it becomes dangling.


Exactly. It wouldn't be dangling without the transformation you proposed.


Parent comment only proposed a syntax. From grandparent comment:

> fails to compile if your code is written in such a way that a true tail call cannot be guaranteed


Make it an error unless all parameters passed are destructor-free (either because the type has no destructor, like an int or a pointer, or because they were passed in by reference).


That doesn't help, the parameter might be invalidated by a destructor (e.g. passing foo.c_str(), where foo is a std::string on the stack)


Then it isn't destructor-free (and not passed by reference if it's on the stack).


No. Arguments that are trivial types can still depend on other objects not being destroyed.

  void foo(const char* str) {
    if (*str == '\0') return;
    printf("%c", str[0]);
    std::string x = str + 1;
    return foo(x.c_str());
  }


This syntax is used by Perl; perversely, if you use it you have to dereference the function.


Speaking of TCO constraints, all of the common C/C++ calling conventions[0] have fixed-size stack cleanup. Some are caller-cleanup and some are callee-cleanup, but they all have amounts of stack cleanup that are constant (cdecl, stdcall, thiscall, etc.) or at least fixed at call time (varargs).

This means that TCO can't be done across calls where the amount of stack space used for arguments in the callee is larger than that of the caller. (In cases where the stack space used by the callee is less than the caller's, the caller just needs to leave "wasted" stack space as if the amount of stack space used by arguments were the same.) It wouldn't be much of a point on architectures where the cdecl calling convention passes enough arguments in registers to cover the majority of functions, except that some of these ABIs (notably the Windows x64 calling convention, but not the Linux x86_64 SysV ABI) require the caller to allocate shadow space on the stack for all register-passed arguments. (Edit: I was wrong, the Windows shadow space on the stack is a fixed 32 bytes, regardless of the number of register-passed arguments.)

This motivates a couple of ABI questions:

1. Why does the Windows x64 ABI require the caller to pre-allocate "shadow space" to potentially spill register-passed arguments? It's wasteful if it's not needed (especially in ABIs with a redzone), and it reduces opportunities for TCO. (Edit: ahh, unlike Linux, the Windows x64 calling convention has no redzone. I guess this then becomes "Why doesn't Windows x64 provide a redzone?")

2. Why not define a calling convention that is callee-cleanup where the post-cleanup stack pointer is passed in a designated register (or at the top of the stack) to the callee? I understand that it might not make sense to pay the cost (fewer arguments passed in registers, and often an extra stack spill) for the majority of functions, but it seems an oversight that there's not a calling convention that (in the absence of destructors) always allows tail calls to be optimized.

I guess the answer to both questions is that most TCO opportunities are within a single DLL, so the compiler is free to create non-exported versions of functions with custom calling conventions. Is this right?

[0] https://en.wikipedia.org/wiki/X86_calling_conventions plus their variants on other architectures


> I guess the answer to both questions is that most TCO opportunities are within a single DLL, so the compiler is free to create non-exported versions of functions with custom calling conventions. Is this right?

LLVM mostly follows this path: it allows full TCO only when using the “fastcc” (i.e. “whatever LLVM feels like generating”) and some calling conventions from functional languages (like GHC) that have a specific carve out for it.


Correct me if I'm saying something stupid, but:

Shouldn't TCE be trivial with callee cleanup?

When I (the function) get called, I get X bytes of stack in arguments, then start pushing variables and whatnot. When I return, I simply put my return value at the start of my stack space, pop everything else, then jump back.

In a tail call, all I need to do is put the next function's arguments at the start of my stack space, pop everything else, then jump to its address.

Obviously in IRL CPUs some of this data would be kept in registers instead of the stack, but the mechanism should be the same.


You're correct, I did overgeneralize in the case of callee-cleanup calling conventions. Though, I'm still not aware of any callee-cleanup variadic calling conventions. printf, execl, etc. don't actually get passed anything that directly indicates the number/size of arguments passed.

In the article's printf example, the compiler is able to perform TCO only because it uses at most 5 pointer/integer arguments (with number of fp arguments passed in al) and at most 6 fp arguments, so it's caller-cleanup, but with no cleanup to perform.

Under the SysV (Linux, etc.) x86_64 ABI, for varargs calls the number of fp arguments passed in XMM registers is passed in al, but nothing directly indicates how much stack space is used by arguments (and it's also obviously caller-cleanup). The printf example from the article works because all of the arguments fit in registers.


This is what OCaml does: you can annotate function calls with [@tailcall] to result in compiler diagnostics if the call in fact isn't in tail call position.


Scala too with @tailrec


tailrec is more restricted: it only supports simple tail recursion (i.e. calling the same function in tail position) that can trivially be rewritten into a loop. Scala isn’t really in a position to optimize mutual tail calls, since those don’t in general collapse to loops.


Alternatively, compilers could generate warnings when they find something like `return some_func(args...)`, and can't eliminate the tail-call because of something that happens elsewhere in the code.


I don't see how that's an issue, at least for Rust.

If a variable is captured by the tail-called function (i.e. the function receives the unique or borrowed pointer) then you can do TCO with an extra argument (said variable, owned/unique) and clean up at the end of the "loop".

If the variable isn't captured by the tail-called function, then it's unobservable (except through side effects) if it's deallocated before the tail call.


It's not that it requires GC, but that it is disabled by implicit calls at the end of scope. Scoped destructors aren't only for memory management. They are also widely used for e.g. closing files or sockets, basically anything that you may think of as a resource. Go has the defer statement, which would also disable TCO, even though Go has a GC. You can call just about any function in a defer statement. You can have TCO in a language that has scoped constructors/defer statement and doesn't have a GC. It's just that the things that at first look like tail calls often aren't. But the same thing can be said about operator overloading. TCO will still be in this language, but it will only be applied to tail calls, not calls which the programmer mistakenly believed were tail calls.


You could change the drop semantics to drop as soon as a variable is no longer used/borrowed (as opposed to now when it goes out of scope).

But IMO that does not seem like it's worth it considering how less predictable it would make drop.


Yep. Matz disabled TCO in Ruby because it was incompatible with some alternative implementations like JRuby.


Disabled by default but it can be enabled at runtime https://ruby-doc.org/core-3.0.0/RubyVM/InstructionSequence.h...


TCO is an optimization.

When it’s a semantic language feature like in Scheme, it should be called “Tail Call Elimination” because lack of it changes program semantics (thus not merely optimization)


In Scheme jargon, it has come to be that a "tail call" is in fact the name of the goto-like control transfer which replaces the regular subroutine call.

Tail calls are not eliminated; rather it is ensured that a function call in tail position is a tail call.

This is a useful view/terminology. Without the code transformation that implements tail calls, there is nothing special about a function call in tail position; it proceeds exactly like a call not in a tail position. Therefore, it is not a different kind of call. It becomes one after the code transformation: it becomes a tail call.


> ... goto-like control transfer which replaces the regular subroutine call.

That very much sounds like a [regular subroutine] call has been eliminated.

> Tail calls are not eliminated

This sounds like rather delicate semantics. A call has been eliminated. It was in the tail position. It was replaced by something else, which you call the "tail call". Maybe you'd prefer RSCITLEIOTC (regular subroutine call in tail location elimination in favour of tail call)?

I think tail call elimination is good enough to communicate the concept, whereas (when it's not purely an optimisation) tail call optimisation is not good enough, because calling it an "optimisation" is such a dangerous trap.


I don’t like that it has no special mention.

In a factorial function, fac(n-1) would be a tail call, but 0+fac(n-1) would not. It would be nice if one could annotate (and thus make the compiler verify) tail calls.
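In JavaScript terms (just a sketch), the difference the compiler would have to verify looks like this:

  // Not a tail call: the multiplication still has to run after the
  // recursive call returns, so its stack frame must be kept alive.
  function fac(n) {
    if (n <= 1) return 1;
    return n * fac(n - 1);
  }

  // Tail call: nothing is left to do after the recursive call,
  // so the frame can be reused.
  function facAcc(n, acc = 1) {
    if (n <= 1) return acc;
    return facAcc(n - 1, n * acc);
  }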


> It would be nice if one could annotate (and thus make the compiler verify) tail calls.

This always comes up, but honestly, I've never really run into the problem where I mistook a non-tail call for a tail call. (And I've programmed the majority of my career in languages like Haskell, OCaml and Erlang.)

Of course, I'm always in favour of the compiler verifying more stuff. It just doesn't seem very pressing.


It might be easier in those languages, but C (and C++), Scheme, Lisp and Nim have macros that could obscure a tail call's missing elimination.

It was never a problem for me because I never trust the compiler to do TCE and structure my code to be independent of it. But it IS a problem.


I've not yet heavily used macros in my project, but I have surely used macros others wrote, sometimes perhaps even without knowing, aside from the Scheme implementation's standard forms. However, I have yet to hit a case where I thought something was a tail call but in the end it was not. It just does not seem to come up for me and has never been a problem.

Perhaps I have only used "good macros" or something, which do not make seemingly tail calls into non-tail calls?


Is it possible you actually did get a non-tail call where you thought it would be one, but just didn't go deep enough to blow up the stack, so you weren't aware? It could be possible the call recursed "only" 10,000 times and didn't blow up any stack.

Or perhaps you've only used good macros. But it is incredibly easy for a macro to end up with (+ 0 x) where you think it will end up with "x".

It's also possible that you have used a macro that ended with (+ 0 x), the optimizer turned it into "x" and you got the tail call eliminated but were not guaranteed to get that, and it may blow up on another Scheme implementation.

That's what I don't like about it; The compiler definitely knows if it did TCE or cannot, and it does change program semantics if it did or didn't. So why can't I just annotate the code saying "well, I intend this to TCE, please complain if you can't?"


I see your point about being explicit. I think ideally the specification would be precise enough for the user to conclude: "If this is a specification-conforming Scheme implementation, then this must be a tail call." I was under the impression that this is the case with the Scheme specification, but I might be wrong, as I am not very knowledgeable in terms of the specification. If it is not so specific about what exactly becomes a tail call, then I would say it would be great to have an explicit way to tell the interpreter or compiler that you want something to definitely be a tail call.


... and if you have that declaration mechanism, then it can easily work for any call in any position whatsoever!!!

So that further separates the concept "tail call" from "call in tail position".

Any position can be the tail position if we just wave the right magic wand at it, so the concept becomes meaningless. All we are left with is the style of call: returning versus goto-like.


Well, in Scheme you need to rely on TCE if you want your code to run for more than some specific finite time. There are no other looping constructs.

Haskell does have several macro systems, but they are not nearly as pervasively used as in Lisp languages. (Mostly because laziness makes it easier to add something as foreign as, say, Prolog as an embedded DSL without macros. And, of course, adding to Haskell syntax feels less natural than in Lisp.)


Exactly. The call in tail position in the second case is to '+', not 'fac', so you can't replace the recursion with iteration.


A tail call is just any call that's at the tail of a function. It might be used to refer to TCE specifically, but that's a colloquialism and would be wrong in a more technical context.


If we have no plan to treat any ordinary calls as tail calls, there is no point in identifying which are in the tail position, or using any terminology related to tail calling.

The conversion of a regular call to a tail call could easily be applied to non-tail positions (e.g. it could be requested by some "please make this a tail call" annotation). The tail position is where that can be done without changing the meaning (of programs that meet certain restrictions, at least).

The semantics of not being able to return, and not acquiring any stack-like space is what makes the call a tail call, whether or not it is in a tail position. If we make a call that doesn't return, that call is then the last action the function performs, which puts it at the tail of its activities.

The terminology also gives us nice ways to talk about vaguely related things, also, like:

"execvp is a kind of big tail call in Unix; it calls a new program, such that it's impossible to return to the original, which is entirely replaced (not just the stack frame of the caller of execvp)".

"A boot-loader branching into firmware is a kind of tail call."


Scheme 'optimizes' any tail call. Whether that's a recursive call or something else.


It can't not change the semantics of a language. It's basically automatic recursion-to-loop transformation (whether you call it "optimisation" or not is just a semantic issue). So obviously will turn "infinite recursion but not really because it blows the stack" into "infinite loop".

But then many languages are quite loose with stack size requirements anyway, as well as with stack traces (e.g. the JVM's can include details about the internal implementation and/or automatic compiler transforms, e.g. concerning lambdas, MethodHandles, invokedynamic, etc).


If you need an infinite loop (or very large iteration count), and rely on TCE but it is not available, the semantics of your program will change.

In the "best" case, you'll get a stack overflow exception. In worse cases, your exception/condition system will kick in at a time which is essentially non-deterministic (it's usually deterministic but depends on a lost of things you may have no visibility into).

But if TCE is available, you might even have no allocations after some startup. That's different semantics which are part of the language (e.g. in the case of Scheme) and thus not an "optimization".


How does it change the semantics of Scheme? It avoids blowing the stack for sure, but the stack size isn’t part of the semantics of Scheme. If you imagine a computer with arbitrary amounts of memory, then the language behaves identically with or without tail call elimination, right? The only difference is performance and memory usage, but the meaning of programs is still the same.


The Scheme R5RS [1] (essentially, an edition of the language standard) defines in section 3.5 what it means for an implementation to be “properly tail recursive”: it must support an unbounded number of active (yet-to-return) tail calls. I think “proper tail call” is a good terminology for a tail call that replaces the caller's stack frame (or some equivalent, depending on implementation).

It doesn’t necessarily change the semantics, but it does change what kind of programs are sure to crash your machine, and what kind are going to run fast, which is also an important property of programming languages.

[1]: https://schemers.org/Documents/Standards/R5RS/


> It doesn’t necessarily change the semantics, but it does change what kind of programs are sure to crash your machine, and what kind are going to run fast, which is also an important property of programming languages.

Yeah, this is exactly what I meant. The semantics of the language are the same regardless (i.e. the meanings of programs doesn’t change), but in practice it makes a huge difference.


it changes it from 'all functions can recurse only up to some limit' to 'functions with tail calls can recurse indefinitely' (i.e. you can write an infinite loop recursively). It's not just a performance improvement but a change in capability.


Aren't you adding some assumptions about the implementation in your comment? Common Lisp doesn't require TCO, but it doesn't forbid it either, and SBCL chooses to do it. Your first sentence would be incorrect for a Common Lisp program someone executes with SBCL.


Well, then the statement 'functions may only recurse up to some limit' applies to Common Lisp as specified but not to this implementation of it. This isn't particularly unusual, since language implementations often have capabilities which aren't part of the specifications (see the many extensions to C, for example; e.g. in standard C you can't assume that type punning using a union will be defined, but almost all implementations explicitly allow for it). These are differences in the semantics of the language (namely in what is and isn't allowed), not just optimisations.


> Aren't you adding some assumptions about the implementation in your comment?

You'd have a point if they were talking about implementation semantics, but they're talking about language semantics. A Scheme program can assume TCE, a CL program can not, an SBCL program can.


> an SBCL program can.

SBCL will only perform TCO at specific optimization levels, so you can't even rely on it in SBCL.


It's not clear to me that the person I replied to was talking about language semantics. If they had been, the phrasing of that comment would have taken the form "may not" or "can't assume"; instead it asserted that you can write some code that's guaranteed to run differently, which is implementation semantics.


It says “semantics of Scheme” directly (and IMO unambiguously) upthread. What would that more likely be than the language?


That depends on whether you are claiming your code is written in Common Lisp (the language) or for SBCL (the implementation).

The former will run on other implementations.


Yes. You generally just should not write a recursive function in a language without TCO/TCE.


Recursion is nothing but a loop with a stack. So yes, you can certainly rewrite things that way, but all you're doing is trading the call stack for a stack structure on the heap.


My sense is that the only unique thing about Scheme here is the fact that it's specified in the standard. That means that developers are going to develop software that expects to be able to write tail-recursive code without fear of blowing the stack.

Other language standards keep this from affecting the semantics of the language by simply leaving it out of the semantics of the language. When the expected behavior is undefined, the implementors can do anything they want without, strictly speaking, affecting the semantics of the language.


Some languages (Python, Tcl, R and possibly others) expose the stack as part of the language/standard-library, and thus an implementation that DOES do TCE cannot be compatible.

In python, it's mostly visible through exception objects.

In Tcl and R, it's in common use because it is standard practice to do things in your caller's scope instead of your own.

In C++ (and any language with destructors), many routines are invisibly not TCEable because the language semantics require the destructor to run after the tail call returns - which means that a change to a different part of the code determines if a specific function is TCEable or not.

So, despite your claim being technically true, it is in practice often informally specified by the language or standard library.


Can you observe the difference as a programmer? Through things like the stack-trace? Then it's not an optimization.


There are many types of observation we might define.

The most fundamental is whether we can write a program which halts in one case and doesn't in the other.

An alternative is whether we can write an if/then/else branch to distinguish between the two cases. Note that this is less general than the above, since we can always use an if/then/else to implement a halt/no-halt program, but we can't always go the other way (due to the Halting Problem). This definition is useful, since if/then/else can cause arbitrarily-large changes to a program's behaviour; whereas we can't "use" a difference which affect halting.

Another definition, which is more subjective, is whether we can do the above "reliably". For example, any optimisation that makes a program faster could be observed by a language which allows access to a clock. Whilst hackers and debuggers might find these useful, it's not the sort of thing that a "reasonable" programmer would do (i.e. code which relies on this sort of thing shouldn't pass code review). I'd probably count inspecting stack traces, or running a debugger on ourselves, in this category.

The messy subjectivness of the latter is why I like to limit the capabilities of what I'm developing with: I don't have to care about timing altering the behaviour of a program which can't access a clock.


Right - but you can observe the differences reliably without a clock, through looking at the stack trace.


Clocks and timing were one example. I generalised to these other cases when I said:

> I'd probably count inspecting stack traces, or running a debugger on ourselves, in this category.

I would put such things in the "unreliable" category, since it's the sort of thing that may vary between compilers (including future versions), and may depend on internal details of dependencies which are subject to change.

I certainly wouldn't let them through code review; it can be hard trying to understand 'foo(bar)' in terms of 'the function "foo" applied to the input "bar"', I'd rather avoid the extra complications of having to think about 'pushing a fresh stack frame, clearing registers A, setting the instruction pointer to B, adding C instructions to the ALU's adder pipeline, forcing a flush of D cache lines, ...'


That's (a small) part of the reason why Haskell didn't have stack traces for the longest time.


Or even more simply, by blowing up the stack. If you manage to go 2^40 calls deep without blowing up the stack, you have TCE; if you do blow up the stack, you don't.

And you can tell if you blew up the stack by setting up an exception handler (or lisp condition or whatever equivalent).
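Something like this sketch (JavaScript, where a blown stack surfaces as a catchable RangeError in the common engines):

  "use strict";  // proper tail calls only apply in strict mode
  function dive(n) {
    if (n === 0) return "tail calls appear to be eliminated";
    return dive(n - 1);  // tail call
  }

  let result;
  try {
    result = dive(1 << 25);        // far deeper than any real call stack
  } catch (e) {
    result = "no TCE: " + e.name;  // typically RangeError
  }
  console.log(result);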


> you can tell if you blew up the stack by setting up an exception handler (or lisp condition or whatever equivalent).

That only works if there's some way to tell whether the stack blew up (as you say: catching exceptions, lisp conditions, or whatever equivalent). I'd put those language features in the 'messy' category: useful on occasion; not used by most code; makes all code harder to reason about (hence the subjective "reasonably-unobservable" category above)


There are lots of optimisations which have ultimately observable side-effects e.g. in most JS implementations the closed-over environment is optimised to just the free variables (and there's no closure at all if there's nothing being closed over statically), this is extremely noticeable if you drop into a debugger in the inner function because you will not be able to access any variable which wasn't closed over. Likewise, if the compiler decides to reuse part of the stackframe because it sees two variables are only used consecutively, you won't be able to access the first variable despite it still purportedly being "in scope".


> this is extremely noticeable if you drop into a debugger

Is the debugger part of the language specification? No.

When I say 'can you observe the difference as a programmer' you can mentally expand that to 'can you observe the difference as a programmer who is working off a copy of the language specification'. Otherwise someone armed with a debugger can look at the machine code and then we might as well say all implementation decisions are observable by the user, and then I think we're being a bit silly.

> you won't be able to access the first variable

I'm not familiar with this part of JavaScript off the top of my head sorry - do you mean using a debugger again here?


> Is the debugger part of the language specification? No.

Neither is "the stack trace".


You're mistaken - in many language it is.

(I've worked professionally with multiple language specifications and implementing them.)


So you’re saying inlining doesn’t count as an optimization, then?


No, that doesn't follow.

Inlining doesn't change what a call stack looks like - the call stack frames are still there, possibly just as metadata annotations on real call stack frames, recovered when needed.

With tail calls, the call stack frames are completely gone, and cannot be recovered.

Try this thought experiment - can you implement tail calls in Java and meet the same language specification? I think you cannot. Therefore it is not an optimisation. If you can please email me how and I will implement it and you can publish the paper and get the glory!


> can you implement tail calls in Java and meet the same language specification? I think you cannot.

Is this because Java exposes stack traces in a way most languages don't?


That's one reason - there are also a few more minor technical places in the spec where the depth of the call stack is observable in semantics such as I think the security manager.

But I don't think it's unusual for a language to expose call stacks? Ruby, Python, C#, etc all do it.


I think a Java implementation could do tail calls, but it would have to create a fake stack trace when needed.

(I haven’t looked, but I would expect modern Java implementations already make up quite a bit of the stack trace only when it is needed, for example when inlining calls)

And, traditionally, it _is_ unusual to expose call stacks to calling programs for Algol-style languages (implementations typically dumped a call stack and might add a way for a program to walk its stack, but that would be implementation-dependent)

Without exceptions or a repl, where would you use them, other than in a crash dump?


> (I haven’t looked, but I would expect modern Java implementations already make up quite a bit of the stack trace only when it is needed, for example when inlining calls)

Yes they do. But they can't make up the stack for tail calls because the information is simply gone. If you've got a set of mutually recursive functions that are tail calling, and you suddenly need to generate a stack trace, how do you know how many times the functions have been called and in which order they were called? That info was never recorded - it's just gone. Inlining is static info and much simpler - it is recorded and isn't gone.


You’re not going to handle the Ackermann function that way, but for both simple recursive and mutually recursive calls, I would think you can often infer up front how many loops there will be, or just maintain a single counter counting the number of recursive calls, instead of counting using a linked list of stack frames.

Now, whether that’s worth the implementation effort, of course, would depend on whether the construct is used a lot, and that’s a catch-22.

For functions jumping into other functions, I guess that, when the program counter says you’re in baz while the first stack frame says foo got called, you could infer that foo called bar and bar called baz from metadata that the compiler would have to add.

Now, if foo can get into baz along multiple routes, you would have to record something, but I would guess there are plenty of those simpler cases. For those, the main challenge would be to infer recursion count for a recursive call.


If you only need to know how many times things were called, you could probably record that in much less space. Something like O(log n) of the depth of the stack, instead of O(n) that you need for the actual stack frames.

But I assume that kind of information is probably not enough for Java.


I thought C#/.Net was much more permissive than Java/JVM regarding 'lying' about the call stack, in optimised release environments. [0] From a quick search it looks like tail-call optimisation is supported in the CLR but only when the IR is tagged appropriately, which the F# compiler takes advantage of, but the C# compiler does not. [1]

[0] https://www.hanselman.com/blog/release-is-not-debug-64bit-op... , ctrl-f for "this stack trace is showing the runtime reality".

[1] https://github.com/dotnet/runtime/issues/2191


> With tail calls, the call stack frames are completely gone, and cannot be recovered.

You can absolutely record these things under TCO, in much the same way you'd record virtual stackframes when inlining.


Well I don't know what to say apart from I don't think this is true.

If you think it is, then you should go ahead and publish it because you've discovered a new language implementation technique.


Of course you can, you just add a record whenever you do the jump, exactly like what you’re describing with inlined calls. It’s not hard to implement, it’s just that nobody does it, because it’s dumb.

You seem to have a very Java-centered view of the universe, because it is absolutely not true in most languages that inlined calls preserve call stacks. Call stacks are debug information, and there’s absolutely no requirement for compilers to preserve debug information while doing optimization. Requiring that would be lunacy, and it would prohibit a huge number of optimizations (or require costly bookkeeping).

I don’t know what the situation is in Java-land, but if Java preserves inlining information in call stacks using bookkeeping, then good for it, I guess. It can’t come for free, so you’re taking a performance hit for that pleasure. It’s certainly not something that is common, or commonly understood as a requirement for optimization.


> you just add a record whenever you do the jump, exactly like what you’re describing with inlined calls

Add another record where? On some kind of stack of records from calls? That will grow linearly with the number of calls? So that's a call-stack is it? The whole point of TCO was to eliminate the stack and you've re-invented it there! If you have an infinite tail-call (like a web server accepting requests in a loop) you'll have an infinitely long list of records with your idea.

> It’s certainly not something that is common, or commonly understood as a requirement for optimization.

In many languages (Python, Ruby, Java, etc) they are not debug information - they're part of the program semantics and (from experience!) if you change them you'll break programs.


I don't know if that counts... I can read the 1s and 0s of a compiled binary in any language. Does that mean nothing a C compiler does is an optimization?

I wouldn't consider the memory layout of something like a stack to be part of the semantics of the language, unless that language actually includes features to directly refer to the stack (get its contents or size, etc).


> I can read the 1s and 0s of a compiled binary in any language

But the 1s and 0s are not specified by the language (in almost all cases.) They're out of scope.


But where is the line? Is the stack size specified by any languages? Many languages' compilers will eliminate variable assignments as an optimization. Since I can view a stack trace in some of those languages, would that make that not an optimization? Like, if I just inserted `let x = 3;` in the middle of my function, but never used `x`, it would disappear from the compiled code. If I look at the stack trace, I won't see my `x` variable. Is that an optimization or a semantic language feature? Does it matter?


> But where is the line?

The line is: what has been formally specified by the language spec as being part of the language.


What about copy-elision 'optimisations' in C++, which can result in changes to observable behaviour?


Sorry I'm not an expert in C++ so I couldn't say. But on the face of it yeah if it changes observable behaviour and it isn't covered by the spec then I don't see how it can be an optimisation. Maybe the spec covers what you're talking about?


The C++ standards documents are famously paywalled (and I have to admit to having never read them), but here are some summaries. [0][1][2] The kicker is that according to the spec, the compiler is permitted, but not obligated, to elide object-copy operations, even when doing so results in the elision of a call to a copy-constructor function with side-effects.

Of course, the optimizer is still constrained by the spec in the obvious ways: copy operations can only be elided when the compiler can show that they are redundant (in some precise sense that the spec presumably lays out).

It follows of course that this 'optimisation' may impact observable behaviour. This makes some sense as C++ code shouldn't be doing 'real work' in copy-constructors, but it's still a rather ugly concession to pragmatism. Edit: which, come to think of it, describes the whole C++ language ;-P

[0] https://stackoverflow.com/a/12953129/

[1] https://en.wikipedia.org/wiki/Copy_elision#Return_value_opti...

[2] https://en.cppreference.com/w/cpp/language/copy_elision


The specs are a bit weird.

For example, they don't mention stack overflows at all. But since standards compliant compilers spit out code that suffers from them, I assume they are allowed.

So I assume you could make a standards-compliant compiler that does nothing but immediately overflow the stack.


> So I assume you could make a standards-compliant compiler that does nothing but immediately overflow the stack.

Interesting thought. I imagine just about every language spec (with the possible exception of assembly languages) states something like If memory is exhausted, a handler is executed or If memory is exhausted, the behaviour is undefined, and it's always going to be up to the compiler/interpreter to make reasonable use of the available memory on the specific target platform.

Would a C compiler be non-compliant if it generated code that used 100x the memory that would be used by code generated by typical C compiler? How about 1,000,000,000x so that its generated programs always failed immediately?

Java is an interesting case for this, as it famously doesn't require that a garbage collector needs to be included at all (unlike .Net which does). In Java, not only are you permitted to have a conservative GC, you're permitted to have no GC whatsoever (something OpenJDK now offers [0]). Apparently [1] Java requires that the collector (if it exists) will run before the JVM throws OutOfMemoryError.

I imagine the formal methods folks must have done some thinking on this topic, as their whole field is about doing better than just go ahead and test it out. Could a standards-compliant C compiler used in a safety-critical domain generate code that, only very occasionally, uses a billion times the memory it typically uses?

Somewhat related: one of the motivations behind the Zig language seems to have been a frustration with clumsy handling of memory-exhaustion. [2][3]

[0] https://openjdk.java.net/jeps/318

[1] https://www.kdgregory.com/index.php?page=java.outOfMemory

[2] https://youtu.be/Z4oYSByyRak?t=300

[3] https://news.ycombinator.com/item?id=18422631


> Interesting thought. I imagine just about every language spec (with the possible exception of assembly languages) states something like If memory is exhausted, a handler is executed or If memory is exhausted, the behaviour is undefined, and it's always going to be up to the compiler/interpreter to make reasonable use of the available memory on the specific target platform.

I would have expected some words to those effects, but extensive browsing and grepping in the C++ spec did not reveal such language. (They do not mention the call stack at all, which is a fair enough decision.)

I was first looking into this, because I had hoped the standard would specify a way to check for whether the next call would blow up the stack before you actually engaged in the call.

> Would a C compiler be non-compliant if it generated code that used 100x the memory that would be used by code generated by typical C compiler? How about 1,000,000,000x so that its generated programs always failed immediately?

You could ask the same for speed of execution. I think it would be standards compliant, but not very useful. Standard compliance ain't the only thing people look for in a compiler.

In practice, you could make a C++ compiler that does the standard compliant thing to virtually all of people's programs, by just emitting the nethack executable regardless of input. After all, virtually all real world C++ programs have undefined behaviour somewhere, and undefined behaviour is allowed to 'travel backwards in time'. (Ie undefined behaviour anywhere makes the whole execution undefined, not just after it is encountered.)

Hey, you could even just start nethack instead of producing any files with your compiler at all. Some undefined behaviour even applies at compile time, e.g. not closing your quotes, I think.

> I imagine the formal methods folks must have done some thinking on this topic, as their whole field is about doing better than just go ahead and test it out. Could a standards-compliant C compiler used in a safety-critical domain generate code that, only very occasionally, uses a billion times the memory it typically uses?

I think for safety-critical code, you solve this conundrum by relying on extra guarantees that your compiler implementation makes, not just on the standard.


RVO is guaranteed as of C++17 and is mentioned in your cppreference link.

As for the standards, the drafts are public and usually 98% like the final ones.

The missing 2% are clarifications only relevant to compiler vendors and language lawyers, whose employers can afford to sponsor ISO work anyway.


> RVO is guaranteed as of C++17 and is mentioned in your cppreference link.

Yes, but other instances of copy-elision are still at the discretion of the compiler, right?


Yes, just RVO is guaranteed, because what happens in practice is that the return location is already prepared before the call.

So even if move semantics aren't supported for the specific type, the target object gets built in place.


It only changes semantics of Scheme programs that are recursive in appearance. In non-recursive Scheme functions (those that don't loop and are not mutually recursive), it's merely an optimization, just as it is in C.

Edit: Added mutual recursion explicitly.


Except unlike C, it is a required feature.


Well, any program that neither loops nor is recursive (the same thing in Scheme), only has a very finite runtime. So those are not very interesting Scheme programs, I guess?


There are only two hard things in computer science: cache invalidation and naming things.


> There are only two hard things in computer science: cache invalidation and naming things.

There are only two hard things in computer science: off-by-one errors, cache invalidation, and naming things.


It is misleading to suggest that an implementation of TCO follows directly from the instruction set architecture: it depends even more on what calling convention is followed [1]. gcc supports several calling conventions for x86 [2].

In the Scheme world, often a hybrid stack-heap calling convention called Cheney on the MTA is used in which all calls are tail calls [3].

[1]: https://en.m.wikipedia.org/wiki/Calling_convention

[2]: http://redstack.net/blog/x86-calling-conventions.html

[3]: https://news.ycombinator.com/item?id=12776198


Doesn't that part of Scheme implementation interact with continuations?


It is essentially continuation-passing style, which makes implementing first-class continuations very easy.


One sentence summary: "(in the x86 ISA) Tail call optimization happens when the compiler transforms a call immediately followed by a ret into a single jmp"
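
For readers who want to see the shape, here is a minimal C sketch (the function names are mine): each function ends with a call whose result is immediately returned, which is exactly the call-followed-by-ret pattern an optimizing compiler can lower to a single jmp. Whether it actually happens depends on the compiler, the optimization level, and the calling convention; C doesn't require it.

    #include <stdbool.h>
    #include <stdio.h>

    static bool is_odd(unsigned n);

    /* Mutually recursive even/odd test. Each call below is in tail position,
       so instead of "call is_odd; ret" the compiler can emit "jmp is_odd",
       reusing the current stack frame. */
    static bool is_even(unsigned n) {
        if (n == 0)
            return true;
        return is_odd(n - 1);   /* candidate for call+ret -> jmp */
    }

    static bool is_odd(unsigned n) {
        if (n == 0)
            return false;
        return is_even(n - 1);  /* likewise */
    }

    int main(void) {
        /* With the optimization applied this runs in constant stack space;
           without it, a million nested frames may well overflow the stack. */
        printf("%d\n", is_even(1000000));
        return 0;
    }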


So, what do we call it when the compiler transforms a call immediately followed by a ret into a fall through into the to-be called function? And which compilers ever do that?


That's called inlining, unless I misunderstood you :)

Most compilers do that, for most compiled languages.


He means that if you have two functions 'a' and 'b', and a tail-calls b, the call instruction could be eliminated because it's just going to the next instruction anyway.

This is not just inlining, it's a function with multiple entry points, which is not popular.


I would call it two functions with multiple tails.

And that was very popular when memory was scarce. The typical BASIC for the 6502 used it. See for example http://hackzapple.org/scripts_php/index.php?menu=14&mod=8517... (search for “The CHARGET Subroutine”). Here, it is used for speed, too (otherwise, at least one of the functions would have to do an indirect load, which is cumbersome and slow on a 6502. The self-modifying code is a lot faster)

It also often was used when one had, say, a function to output a character and another function that printed a specific character. Combined with a creative use of the BIT instruction, one could even have functions printing _any_ character, a space, a question mark, or a CR share their tails (https://retrocomputing.stackexchange.com/a/11132)

But yes, nowadays I guess it is rare, although, with an ABI designed for it, it could be used to make a call with a default argument value fall through into the more generic code.


One issue I can think of is that compilers/linkers assume they can reorder and delete unused symbols in a C program.

You'd turn it off at the object file level if you're writing asm and doing tricks there (x264 asm has some fallthroughs like this), but there isn't a way to say that only these two functions need to stay in the same order.


You wouldn’t have to say it. Linkers could figure it out on their own (but compilers dropping hints would make it easier)

Some linkers already can merge functions that happen to compile to the same code, even though that corrupts stack frames (foo calls bar, but the linker makes it call baz instead. https://stackoverflow.com/a/61865960)

It’s ‘just’ a matter of teaching the build system new tricks.

I found a short discussion on this at https://gcc.gnu.org/legacy-ml/gcc/2000-02/msg00575.html that contains https://gcc.gnu.org/legacy-ml/gcc/2000-02/msg00599.html, which says

“The P3 SSE stuff we've done generates multiple entry points, but it does that within the backend prologue expander. There's nothing to generate multiple regular prologues. Though it shouldn't be that hard to do.

It's not impossible to believe that the bulk of the compiler would work with them, as long as flow knows how to properly create the CFG. LABEL_ALTERNATE_NAME was invented for this, though it appears that the code to properly deal with it is sitting on a branch waiting for accounting to say it has been paid for.“

That was in February 2000. I wouldn’t know whether that comment was close to the truth or what gcc currently supports.

For llvm, the thread at https://lists.llvm.org/pipermail/llvm-dev/2018-October/12715... gives some cases where this could be used, but reading it, it doesn’t look like it is solved in the way I described (see for example https://lists.llvm.org/pipermail/llvm-dev/2018-October/12717...). https://lists.llvm.org/pipermail/llvm-dev/2018-October/12715... calls it “more like separate functions with a common tail”, so I’m not alone in that.

That thread also indicates that DWARF has support for helping debuggers figure out the correct name of such a function.


Inlining?


makes perfect sense, if you happen to have just read the article :)


Joe Armstrong explaining the concept on the erlang mailing list in 2016, https://erlang.org/pipermail/erlang-questions/2016-October/0...


I really like how Joe Armstrong speaks and writes so plainly. Learning from him always makes me feel good rather than dumb.


Wow. I wish everything was explained so clearly. I understood everything in that post except this paragraph.

""" Languages like Erlang must implement tail call optimizations, since persisted data is stored as "loop variables" in infinite loops. This happens when we write code like this:

     loop(Data) ->
          ....
 ...
 loop(newData).
When I see code like this I mentally "see" the last call as a "jump" to the start of the code, rather than a recursive call to loop. """

What is the loop(Data), loop(newData) doing? Would be great if someone could elaborate on this point.


For a high-level overview, I found this talk [0] at !!Con 2019 both entertaining and really informative.

[0] Tail Call Optimization: The Musical: https://www.youtube.com/watch?v=-PX0BV9hGZY


There really should be more talks in musical form!


ha, that video is amazing!


I'm doing the CS:APP3e course and out of curiosity searched about this exact same topic yesterday, crazy!

I'm learning a great deal from the responses here, thank you.

If you are in the same position I was (self-taught dev looking to improve your general understanding) I can't recommend this course/book highly enough:

- Book: http://csapp.cs.cmu.edu/3e/home.html

- Lectures: https://scs.hosted.panopto.com/Panopto/Pages/Sessions/List.a...


This is actually (in the abstract) exactly how I’ve read and understood it in FP, with the necessary added note that when you TCO without recursion you lose history for eg stack traces and debuggers.

It’s also astonishing to me that people in a position to explain even moderately complex computing ideas continue to treat recursion as somehow hard to reason about. It’s calling a function. A function that calls itself is still just a function calling another function. Two mutually recursive functions calling each other are just two functions calling two other functions. That’s so much easier to follow than state changing in a single function call.


> It’s also astonishing to me that people in a position to explain even moderately complex computing ideas continue to treat recursion as somehow hard to reason about. It’s calling a function.

Calling a function can be easily, and incorrectly, modeled by representing the arguments as variables stored at a fixed position in memory. (As function calls used to work.) Once you introduce recursion, this model breaks down and you are forced to think in terms of activation records/stackframes.

Probably those people learnt function calls in the former way first, meaning that recursion represented a conceptual leap for them.
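
A small C illustration of that older model (the single shared slot is my own stand-in, not how any particular compiler works): when the "parameter" lives at one fixed location, a recursive call clobbers the caller's copy, which is exactly why recursion forces the per-call stack frame view.

    #include <stdio.h>

    /* One fixed slot shared by every call, roughly how variables worked in
       early BASICs and pre-stack calling conventions. */
    static unsigned n;

    static unsigned fact_shared(void) {
        if (n <= 1)
            return 1;
        n = n - 1;
        unsigned rest = fact_shared();
        /* The recursion has driven the shared n down to 1, so this no longer
           multiplies by the value this call started with. */
        return rest * (n + 1);
    }

    /* The usual model: each activation gets its own n in a fresh stack frame. */
    static unsigned fact(unsigned n) {
        return n <= 1 ? 1 : n * fact(n - 1);
    }

    int main(void) {
        n = 5;
        printf("shared slot: %u\n", fact_shared()); /* 16, not 120 */
        printf("own frame:   %u\n", fact(5));       /* 120 */
        return 0;
    }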


Maybe it’s more of me showing my naivety but I still think of a recursive function call as bindings and computation. My mental model is f(x) -> f(y(x)) = f(x) -> g(x), where g is a function wrapping the body of f. It’s surely not what the compiler does, but it makes thinking about recursion no more complicated than thinking about any old function. Then the only thing I need to think about beyond that is “does my environment support TCO, or am I certain my data fits in the call stack?”


Consider that some processors were too small to fit a stack. Some languages (notably BASIC) only had what you might call 'static' variables (to use the C name), so when you recursed into them, the variables would be shared across all calls to that function.


And that’s a good case for supporting recursion but eliminating branching! You can goto and rebind to your heart’s content as long as you don’t have to back out of a condition


Or expressions.


IMO, the two challenging parts of recursion are:

First, realizing that you can have separate instances of functions at the same time -- different stack frames. Making the mental leap from "the content of x in function f" to "the content of x in f(0), and also the content of x in f(1)" is a leap of abstraction, but not necessarily a tremendous one.

Second, mentally modeling recursion when it gets complicated. Simple recursive cases are easy to model in your head, because either there is little state to keep track of, or because you can extrapolate the pattern between function calls, mentally compressing the state of the whole call tree.

Tail-call optimization, for example, is really easy to mentally model because you only have to keep track of the state of one function's variables at a time -- unless you start passing continuations or thunks into your function calls, in which case you're carrying a stack of unlimited complexity into each function call. It's still just a function call, with the same number and types of arguments, but now one or more of those arguments contains turing-complete, recursive programs of unlimited complexity.

This is similar to the worst-case mentally-complicated recursion without TCO, where you can build a complex graph of function calls, scopes, shared data accesses, and mutated copies of what used to be the same data, that can all be present at once, and where you might need to keep track of all of them in order to track down and understand something that you need to understand.

You can, of course, build similarly tangled-up structures in a purely-iterative manner, but the upside is that the complexity will be confined to the data state -- there will be no call tree to mentally model, and no need to track variable scope.

---

Shorter version: if the efficiency and elegance of recursion (with appropriate compiler/runtime optimizations, like TCO) is like zip compression for your mental model of code complexity, then there exist recursive-program equivalents of zip bombs.

---

Even shorter version: recursion itself isn't hard to reason about, but recursive programs can be very hard to reason about


>It’s also astonishing to me that people in a position to explain even moderately complex computing ideas continue to treat recursion as somehow hard to reason about.

This is because you don't understand humans. Which is more clear to you?

    repeat special-task 5 times.

    (do special-task n times) = do special-task once, then (do special-task n - 1 times)
    (do special-task 0 times) = Don't do anything.
    do special-task 5 times.
Clearly when we communicate with others using regular language we use a repeat keyword rather than recursive grammar. No natural language on the face of the earth ever communicates the concept of repetition via recursion.* There's always some keyword or number for repeating something.

*(not 100% sure on this, HNers, prove me wrong if possible, I'm open to being wrong).


> This is because you don't understand humans.

I don’t think it’s about so large and broad a group, or about human language. No one speaks in dynamic property access or dependency inversion or binary compilation or a zillion other programming concepts, yet those aren’t misrepresented as “hard to understand”.

My point isn’t that it’s something that would be immediately obvious to a beginner, but that its difficulty is often overstated when introduced.


Fair point. My point is that looping should be immediately obvious to a beginner due to a 1 to 1 mapping with human communication. Thus recursion when compared to iteration is way harder.


I don't think anyone's opposed to having a function for "repeat this thing 5 times". But if you did see a function that called itself, why would that be any more confusing than seeing a function that calls another function? That's all it is.


I think it's because for beginners it's hard to see why it would be useful (or not just turn into infinite regression). Proof by induction is often similarly difficult for maths students to grasp. The idea of solving a problem by breaking it down into smaller problems until each part of it becomes trivial is not something which comes naturally to most people, it's a learned and ingrained habit in those who have been taught it and used it.


> I think it's because for beginners it's hard to see why it would be useful (or not just turn into infinite regression).

And that, in turn, I think is because most beginners are first introduced to mutation and statements and all kinds of “do” stuff instead of trivial algebra with trivial types.

If the core primitives at the start are:

1. Computation producing output

2. Computation of input producing output

3. Wrapping #2 in a function

I could see introducing recursion in an easily understandable way on day one.


Only in hindsight. In reality looping is much more easily grasped due to the one to one mapping with human language. Looping should be taught on day 1, not recursion.


In hindsight lots of unfamiliar concepts are easy. Teaching programming with imperative constructs first encourages imperative thinking. Teaching programming with computational concepts encourages computational thinking. Both require some reasoning about syntax and control flow before they sink in enough for beginners to use them effectively. But I really don’t think human language has any bearing here. People are also very likely to already understand expressions, as they’re taught from a very early age in arithmetic and it’s very common for people to at least have some algebra education. Expressions are fundamentally easier to reason about, and there’s no reason they would be more difficult to a beginner.

I’d even say that people usually have a good grasp on recursion even if they don’t know that’s what it’s called. Multiplication is just recursive addition. Exponents are just recursive multiplication. And so on.


>Teaching programming with imperative constructs first encourages imperative thinking. Teaching programming with computational concepts encourages computational thinking

Imperative thinking is computational. All forms of thinking are computational by definition.

> Expressions are fundamentally easier to reason about, and there’s no reason they would be more difficult to a beginner.

What do expressions have to do with anything? Recursion can be defined both as an expression and as an imperative jump instruction. The concept is orthogonal to expressions.

>I’d even say that people usually have a good grasp on recursion even if they don’t know that’s what it’s called. Multiplication is just recursive addition. Exponents are just recursive multiplication. And so on.

But people don't think of multiplication this way. The teacher doesn't teach multiplication to you in terms of recursion; she literally teaches it with the "times" keyword. What's 5 * 6? Add 5 to 5, 6 times.

It has bearing on language. Nobody outside of programming/mathematics communicates concepts recursively. Our communication is a reflection of how we think naturally. Recursion takes training. Looping is just learning syntax for a concept we already know about: repetition.


> Imperative thinking is computational. All forms of thinking are computational by definition.

I meant in the mathematical sense. Math doesn’t have imperative statements and side effects.

> What does expressions have to do with anything. Recursion can be both defined in an expression and as an imperative jump instruction. The concept is orthogonal to expressions.

In case it wasn’t clear, I’ve been arguing this whole time for teaching pure FP concepts. Starting with computing a value, then computing a value from input, then with a function returning a value computed from its input... then with a function computing a value recursively from its input. Recursion is the fundamental building block for repetition in FP. Even if you can express it with a loop-like expression it eventually desugars to recursion.

> But people don't think of multiplication this way. The teacher doesn't teach multiplication to you in terms of recursion she literally teaches you it with the "times" keyword. What's 5 * 6? Add 5 to 5, 6 times.

That... is recursion. It’s mind boggling to me that it’s not plain as day, here in this context. It’s literally:

    (+ 5 (+ 5 (+ 5 (+ 5 (+ 5 5)))))
And that’s precisely how I was taught multiplication. It was just written with infix:

    5 * 6 = 5 + 5 + 5 + 5 + 5 + 5
Which, the teacher demonstrated can be partially unwrapped:

    5 * 6 = 5 + 5 * 5
And so on. All you have to do to teach that as recursion is declare those operators as functions and call them. You can also use this to teach reduce, which is just a generalized recursive loop expression.

> It has bearing on language. Nobody outside of programming/mathematics communicates concepts recursively. Our communication is a reflection of how we think naturally. Recursion takes training. Looping is just learning syntax for a concept we already know about: repetition.

From a learning perspective, they’re literally just different syntax for the same thing. Recursion is just moving the repetition intent to the body of the loop and eliminating intermediary values. There’s no reason other than tautology that one syntax is easier to learn than the other.


In math, iteration and recursion are defined as sibling isomorphic concepts. You can desugar recursion into iteration and you can “desugar” iteration into recursion. See Turing completeness and lambda calculus. The topic at hand is which angle (iteration vs. recursion) provides a better view of this single concept and makes it easier to understand. The isomorphism itself (which is what your exposition gets into) is already well known.


Look at my example. if you literally communicated with someone that way would the other person be confused? The answer is yes.


You're not comparing like with like - you're comparing a repeat function (or keyword) with the implementation of a repeat function. The imperative equivalent would be something like "make an int called i, then repeatedly add one to i, checking whether i is less than five, and each time you do that do the special task", which is by no means clearer. And if anything writing a "repeat" function is easier in the functional world than the imperative one, since traditional imperative languages don't have first-class functions.


The human intuition starts out naturally oriented around cause and effect, linear timelines of events, and state change over time, because that's how we've evolved to conceptualize the physical world around us. When you're designing large systems this model doesn't scale, but it's absolutely the most intuitive on the small-scale. This is why languages like Python and Basic are so accessible for newcomers.


I’d say that you can model it conceptually the same way with a time-traveling debugger and get really comfortable. A recursing function is just an event loop where you know how the data is being mutated.


I've seen people hung up on the calling yourself while defining yourself concept. Talking through that often highlighted a mental model of functions where they were sort of almost cut'n'pasting the other function into their code. Basically their model was too low level and the understanding of defining and calling functions were unnecessarily conjoined.


I guess the way I’d try to help those people understand it is with the fiction I tell myself (which I alluded to upthread): I mentally model the “outer” function and the “inner” recursion as referencing different functions that have the same signature and the same body. The definition is concrete and self sufficient, it just dependency-injects a contract in its own image to delegate future work to. It’s even easy to literally implement that way in languages with first class functions, because you can just bind fn a to b and call b within a.
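
For what it's worth, that fiction can be made literal in C with a function pointer (the names here are made up for illustration):

    static int fact(int n);
    static int (*helper)(int) = fact;   /* bind the "inner" reference to the same code */

    static int fact(int n) {
        /* fact never textually calls itself; it delegates to whatever helper
           points at, which merely happens to have the same signature and body. */
        return n <= 1 ? 1 : n * helper(n - 1);
    }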


> you lose history for eg stack traces and debuggers

I remember javascriptcore solving this issue by using a "shadow stack"


A better mind bender is an async promise calling itself.


Wait, I’m confused. Like one promise/future/whatever monadish-eventually-fulfilled value referencing its own value? Or a polymorphic promise/etc referencing the same construct that created it to map its resolution into a new value?


how do modern compilers like clang/gcc handle the debug info while still doing TCO? Any description or links appreciated.


Clang will very aggressively optimize with -O2 (or whatever the max speed optimization is). I think gcc and MSVC do as well.

I was experimenting with a simple indirect-threaded interpreter, which I wanted to write in a "nice" style (1 function per opcode) but have compiled in an "optimized" way (state machine). Playing around in Godbolt, I found out that the compilers managed to detect & optimize even indirect tail calls, i.e. tail calls to function pointers. This was indeed on a toy example (i.e. like 5 functions, not a real interpreter with 100 different opcodes) but still truly amazing.

The reason for this was that I was trying to replicate LuaJIT2 - Mike Pall was complaining that the reason interpreters written in C are slow is that the compiler cannot optimize & register-allocate a large function containing a switch statement for 100 opcodes. Instead, he wrote the interpreter in asm, and made sure all important variables are kept in registers. My thesis was that we could do the same thing by having a bunch of functions tail-calling each other, with all important variables as parameters (which would be optimized into registers). Better, actually, because the compiler could improve register allocation within each "opcode"/function.
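
A toy sketch of that shape (my own names, and far smaller than a real interpreter): every opcode is a function that ends with an indirect tail call through the dispatch table. At -O2, clang and gcc will often turn these into indirect jumps, though nothing in the C standard requires it; clang also has a musttail attribute for return statements if you want it enforced, IIRC.

    #include <stdio.h>

    typedef struct vm { int acc; const int *pc; } vm;
    typedef void op_fn(vm *);

    static void dispatch(vm *v);

    /* One function per opcode; each ends by tail-calling the dispatcher. */
    static void op_inc(vm *v)  { v->acc++; dispatch(v); }
    static void op_dec(vm *v)  { v->acc--; dispatch(v); }
    static void op_halt(vm *v) { printf("acc = %d\n", v->acc); }

    static op_fn *const table[] = { op_inc, op_dec, op_halt };

    static void dispatch(vm *v) {
        table[*v->pc++](v);   /* indirect tail call through a function pointer */
    }

    int main(void) {
        const int program[] = { 0, 0, 1, 0, 2 };  /* inc, inc, dec, inc, halt */
        vm v = { 0, program };
        dispatch(&v);                              /* prints acc = 2 */
        return 0;
    }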


They don't. The whole point of TCO is to discard stack frames, which means local variables cannot be recovered. Optimize or debug, pick one.


My recollection (without references) is that they can do some amount of guesswork. They might see the stack says A->C, but can see that the code of A actually calls B, so logically it must have been A->B->C at some point.

Or you can use the rr debugger and just rewind to the exact sequence of calls, and reject the need to pick between optimization and debugging :D


Well, the point of TCO is to discard stack pressure. Discarding stack frames is just an unfortunate side effect.


clang will actually optimize non-tail-recursive functions as well. Check out the code in this question. It will not stack overflow.

https://stackoverflow.com/questions/65225761/trying-to-delib...


Pun intended?


People who say "no pun intended" are cowards. Intend your puns weaklings.


I usually say “punexpected” because I think what’s meant by “no pun intended” is usually “I only recognized the pun after I said it”. But I agree! Puntention is best.


Perhaps we can soon get a post on how to write a post about tail call optimization?

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Not complaining.. but curious that this one feature gets so much visibility. Is it just because it is quite easy to understand?


If you've ever looked at different calling conventions and how call stack frames are used with them, it's immediately obvious that this is recursion via a goto to the top, forming a loop that reuses the same stack frame's local variable memory.
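
Roughly the rewrite being described, sketched in C (the loop version is what the optimized code amounts to, not literal compiler output):

    /* Tail-recursive form: the accumulator carries all state forward. */
    static long fact_rec(long n, long acc) {
        if (n <= 1)
            return acc;
        return fact_rec(n - 1, acc * n);  /* nothing left to do after the call */
    }

    /* What TCO effectively makes of it: a goto back to the top, reusing the
       same stack frame, with n and acc reassigned in place. */
    static long fact_iter(long n, long acc) {
    top:
        if (n <= 1)
            return acc;
        acc = acc * n;
        n = n - 1;
        goto top;
    }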


One of the things that proceeds naturally from TCO is transforming general recursive functions into mutually recursive functions whose only calls are tail calls; i.e. transforming them into continuation-passing style (CPS), so that even general recursion does not encounter stack overflow.

Some FP compilers such as Guile compile to a CPS intermediate language, and the concept of CPS is well known among compiler writers and the FP community, but it seems quite unheard of in the general programming community.


> After executing the nop instruction the PC will be advanced by 1, because that's the size of nop and advancing it by 1 causes it to point to the next instruction

Nit: the IP will be advanced by 1 before the nop is executed. IP always points to the instruction following the one which is currently being executed. Which is important for IP-relative addressing and, as mentioned later, calls.


Can someone clarify something for me in that second assembly dump: It does:

  xor %eax,%eax (2 bytes)
  mov $0x0,%edi (5 bytes)
Why does it decide to use 2 different ways to zero a register, the second one less efficient than the first? Or will the second one be patched while linking, i.e. the 0 is a placeholder for some other address?


Exactly, the 0 is a placeholder for the address of the format string (first argument for printf).


Update: Looking further, there seems to be a lot of space wasted in the disassembled factorial example: it is 32+1=33 bytes long, or even 48 if the next function is aligned on a 16-byte boundary. There are two nops in there, consuming 6 and 4 bytes. So at least 10/33, about 30%, of the memory is traded for aligning jumps on 8 (or 16?) byte boundaries. This seems like a big waste of L1I cache.

Is this a normal ratio? Does -Os do much better?


Functions are usually longer than what you're seeing there.


True, but you would expect that it's the ratio of branches to other code that is relevant here.


I wish that Rust would follow the lead of some languages in allowing one to explicitly mark a function as tail recursive. (Not exactly the same as TCO, I understand, but related)

I understand the argument against doing TCO in a low level language like Rust (debugging stack traces, etc). But as an opt-in? Yes, please!

OCaml, which is like Rust's muse/parent, does it. Kotlin and Clojure do that as well, IIRC.


From my point of view TCO is exactly most useful when you are dealing with mutually recursive functions.

Like eg a state machine where one state = one function. That's also the example in the corresponding Lambda the Ultimate paper.

But yes, getting TCO for simple self-recursive function is better than nothing.


TCO is exactly as useful as async/await. Could you write an equivalent state machine manually? Yes. Do you want to? No.


What do you mean by 'exactly as useful'? I can see how they are both useful, and there are even some problems were either approach works. But I don't see an exact correspondence?


Since you understand the argument against TCO in Rust, would you mind explaining it to those who don't?


The function that performs the tail-call won't show up in stack traces. This is intentional (and needed) when it's done to eliminate an infinite number of recursive calls, but if it's just a utility function then it can be confusing.

As an ObjC example, there's an objc_msgSend call between every two methods in a stack trace, but you never see it.

In practice this has never been a problem for me.


I believe the article is describing sibling calls, which is a special case of tail calls.

https://llvm.org/docs/CodeGenerator.html#sibling-call-optimi...


I've seen the term "tail call elimination" applied to this general case and TCO applied to the recursion case. Is that still common? TCE is the more general one obviously .. and is what we almost always want.


> However, the compiler still generated code for factorial_accumulate in my object file.

This is because it is not declared static, and thus is visible outside that translation unit.
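
A small illustration of the difference (the _local name is made up; whether the static helper actually disappears from the object file depends on the compiler and optimization level):

    /* External linkage (the default): the object file must keep a standalone
       definition, because other translation units might call it. */
    long factorial_accumulate(long n, long acc) {
        return n <= 1 ? acc : factorial_accumulate(n - 1, acc * n);
    }

    /* Internal linkage: if every use is inlined or folded into callers, the
       compiler is free to emit no separate symbol for it at all. */
    static long factorial_accumulate_local(long n, long acc) {
        return n <= 1 ? acc : factorial_accumulate_local(n - 1, acc * n);
    }

    long factorial(long n) {
        return factorial_accumulate_local(n, 1);
    }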


is gcc clever enough to optimize

a() { b(); i++; }

or should we code for tco?



