Hacker News
Carp – A statically typed Lisp, without a GC, for real-time applications (github.com/carp-lang)
259 points by fuzzythinker on Oct 15, 2021 | 134 comments



For the history buffs:

    Pre-Scheme: A Scheme Dialect for Systems Programming
    Richard A. Kelsey, 1997
Abstract

Pre-Scheme is a statically typed dialect of Scheme that gives the programmer the eciency and low-level machine access of C while retaining many of the desirable features of Scheme. The PreScheme compiler makes use of type inference, partial evaluation and Scheme and Lisp compiler technology to compile the problematic features of Scheme, such as closures, into C code without signi cant run-time overhead. Use of such features in Pre-Scheme programs is restricted to those cases that can be compiled into ecient code. Type reconstruction is done using a modi ed Hindley/Milner algorithm that allows overloaded user-de ned functions. All top-level forms in Pre-Scheme programs are evaluated at compile time, which gives the user additional control over the compiler's partial evaluation of a program. Pre-Scheme has been implemented and used to write a byte-code interpeter and associated support code for a complete Scheme implementation.

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3....


The letter combination "fi" seems to have been deleted throughout your quote, there. Perhaps it was a ligature in the original.


I was thinking it lacks a foreign function interface (FFI)


I just thought it was so effficient, it left out the effs ;)


I believe it has one today.


Sorry about that!

I copied the Abstract from the pdf. I believe everything looked correct in the input form where I submitted the text.


Of course the obvious question is "What kind of memory management does it use?"

It uses a "linear" (actually affine) type system for memory management, with borrowed refs similar to Rust, but evidently without tracking mutability: https://github.com/carp-lang/Carp/blob/master/docs/Memory.md

The omission of the kind of mutability tracking that eases multithreading seems like a significant drawback for its target application domain.


If the target is "real-time systems with low enough latency bounds to make GC an issue", the question becomes "does it have predictable pause times?" (Rust, AFAIK, doesn't, due to the mechanisms backing some of the lower-level aspects of memory management and/or the use of Rc/Arc, though of course someone could improve things there)


Do you mind posting some reference on the unpredictability of pause times in rust? I'm interested in real-time systems and I'd love to know more.


This is more specifically about any reference counted system, though it also includes unique owner approaches and others.

Namely, when you free an object, whether that involves manually written code to recursively release anything it holds, or some form of reference counting, you do not know the size of the object graph you're freeing. If you add any custom release logic to the mix (destructors, any kind of shutdown logic, etc.), you also don't know how much extra work unrelated to simple memory management will be involved.

This means that naive manual memory management, or reference counting systems, usually have unbounded and unpredictable pause times (in addition to the various costs of the implementation details, whether that's malloc()/free() or refcounting). You do not know whether object X going out of scope is the sole owner of a huge object graph.
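
To make that concrete, here is a rough Rust sketch (illustrative only, not tied to any particular runtime): a single owner going out of scope triggers deallocation work proportional to the size of the data it owns.

    fn main() {
        // One owner of a large object graph: a vector of a million heap-allocated strings.
        let big: Vec<String> = (0..1_000_000).map(|i| i.to_string()).collect();

        // ... latency-sensitive work would happen here ...

        // When `big` is dropped, every one of the million strings is freed
        // right now, on this thread. The pause is proportional to the size of
        // the owned graph, not to anything visible at the drop site.
        drop(big);
    }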

This is one (though not the only) reason manual memory management and refcounting aren't considered good options in real-time compared to static allocation of everything (object pools, static stack analysis & limits) or RT garbage collector algorithms, though of course it depends on how deep into the rabbit hole you go.

Garbage collector systems tend to have more control over pauses, but the majority of algorithms used optimise for throughput, not latency or predictability. Thus you get the "huge GC pause", as the GC uses the fastest space/time algorithm and "pays down the debt" for making allocation stupidly fast (often a single compare-and-swap operation, sometimes not even that if each thread uses a separate nursery and can keep a local heap-end pointer).

However, you can also set up the environment for low latency and predictability. Probably the most famous example is IBM Metronome (available in J9 and recently open-sourced), which uses a static time quantum for GC, and the minor variability is the noise of multiple threads of execution synchronizing. This works by dividing available time between the mutator (your "work" code) and the collector (GC) statically, so that GC will, for example, always run for 30ms, then the mutator for 70ms, then GC for 30ms, and so on. Similar techniques exist for amortizing the pathological performance of reference counting by pushing actual deallocation into a separate collector thread or a similarly scheduled slice (depending on whether you're fine with possibly unpredictable contention between two threads or want stricter timing).
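
A rough sketch of that time-slicing idea in Rust (not the actual Metronome scheduler, just the shape of it): GC work and mutator work alternate in fixed quanta, so the worst-case interruption per period is known up front.

    use std::time::{Duration, Instant};

    // Run small, bounded units of work until the deadline passes.
    fn run_slice(deadline: Instant, mut step: impl FnMut()) {
        while Instant::now() < deadline {
            step();
        }
    }

    fn main() {
        let gc_quantum = Duration::from_millis(30);
        let mutator_quantum = Duration::from_millis(70);
        loop {
            run_slice(Instant::now() + gc_quantum, || { /* incremental GC step */ });
            run_slice(Instant::now() + mutator_quantum, || { /* application work */ });
        }
    }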

A lot of very hard real time is more about predictability in the end.


It's worth mentioning that (1) typically even ordinary reference counting avoids large pauses, because typically objects that go out of scope aren't the sole owners of a huge object graph, and this is often but not always a significant improvement over (other simple kinds of) garbage collection; (2) there's a readily available tradeoff with reference counting known as "deferred decrement" which can meet hard real-time deadlines --- but at the cost of sacrificing the relatively predictable and compact memory usage you usually get from reference counting.

Deferred decrement enqueues objects whose reference counts drop to zero onto a "destruction queue" instead of deallocating them immediately. Every time you allocate an object, you work through some fixed amount of the destruction queue, such as two objects, unless it's empty; this amounts to destroying each of the references in the destroyed objects, decrementing whatever objects they refer to. This may drop those objects' reference counts to zero, in which case you add them to the destruction queue. This is very simple, it strictly limits the amount of work needed per allocation or deallocation, thus completely avoiding any large pauses, and it's guaranteed to empty out the queue eventually.
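
Here's a minimal sketch of deferred decrement in Rust, with hypothetical names (a toy index-based heap, not a real allocator), just to show the bounded-work-per-allocation structure:

    use std::collections::VecDeque;

    struct Obj {
        refcount: usize,
        children: Vec<usize>, // indices of objects this one references
    }

    struct Heap {
        objects: Vec<Option<Obj>>,
        destroy_queue: VecDeque<usize>,
    }

    const BUDGET: usize = 2; // bounded destruction work per allocation

    impl Heap {
        fn dec_ref(&mut self, idx: usize) {
            if let Some(obj) = self.objects[idx].as_mut() {
                obj.refcount -= 1;
                if obj.refcount == 0 {
                    // Don't free recursively; just enqueue for later.
                    self.destroy_queue.push_back(idx);
                }
            }
        }

        fn alloc(&mut self, children: Vec<usize>) -> usize {
            // Pay down a bounded amount of deferred destruction work.
            for _ in 0..BUDGET {
                let Some(idx) = self.destroy_queue.pop_front() else { break };
                if let Some(dead) = self.objects[idx].take() {
                    for &child in &dead.children {
                        self.dec_ref(child); // may enqueue more; never a deep cascade
                    }
                }
            }
            // The new object holds a reference to each of its children.
            for &child in &children {
                if let Some(obj) = self.objects[child].as_mut() {
                    obj.refcount += 1;
                }
            }
            self.objects.push(Some(Obj { refcount: 1, children }));
            self.objects.len() - 1
        }
    }

    fn main() {
        let mut heap = Heap { objects: Vec::new(), destroy_queue: VecDeque::new() };
        let a = heap.alloc(vec![]);
        let b = heap.alloc(vec![a]); // b references a
        heap.dec_ref(b);             // b hits zero: queued, not freed immediately
        let _c = heap.alloc(vec![]); // this allocation retires b and decrements a
    }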

However, it's still reference counting, so it's still inefficient, and it still uses an unpredictable amount of memory, which means that any allocation can fail. Many real-time systems are not amenable to recovering from failures.


The main difference in favour of GC in the first case is that GC gets you a centralised place to control it.

Generally GC offers more explicit, centralised control over allocation/deallocation, and I've seen it favoured over other schemes if dynamic memory management is necessary in critical code.


That's interesting! Is the idea that, for example, your interrupt handler can freely allocate memory and update pointers and stuff, and you make sure to run the GC often enough that there's always enough memory available for the interrupt handler when it fires? Has this been written up anywhere?

I usually think that the main advantage of GC is not a performance thing; it's that your program can pass around arbitrarily large and complex data structures as easily as it can pass around integers, without you having to think about how much memory they use or when it gets freed. Of course, not thinking about that can easily result in programs that use too much memory and die, or run your machine out of RAM; but for very many programs, either "compute a correct answer" or "die" are acceptable outcomes. You wouldn't want to program your antilock braking system that way, but lots of times people are willing to put up with programs crashing occasionally.


For very tight code, like interrupt handler context, you usually avoid allocation altogether (possibly with some object caches to provide storage), but yes.
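
The "object cache" approach looks roughly like this (a hedged Rust sketch with made-up names, not any particular RTOS API): everything is carved out at startup, so the hot path never touches the allocator.

    // A fixed pool filled once at startup; the time-critical path only pops
    // and pushes, never calling malloc/free.
    struct Pool<T> {
        free: Vec<T>,
    }

    impl<T> Pool<T> {
        fn with(items: Vec<T>) -> Self {
            Self { free: items }
        }
        fn take(&mut self) -> Option<T> {
            self.free.pop() // O(1), no heap allocation
        }
        fn give_back(&mut self, item: T) {
            self.free.push(item) // never exceeds the startup capacity
        }
    }

    fn main() {
        // Pre-allocate 64 fixed-size buffers up front.
        let mut pool = Pool::with((0..64).map(|_| vec![0u8; 512]).collect());
        if let Some(buf) = pool.take() {
            // ... use buf in the time-critical path ...
            pool.give_back(buf);
        }
    }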

Metronome is all about partitioning run time into static slices of GC/mutator, ensuring that mutator code has predictable performance - it is slower than if the mutator were running full-tilt, but it means that unless you manage to horribly outpace the GC's collection rate, you're not going to see a pause.

That said, GCs are usually faster than manual management thanks to those pauses - if you look at total runtime (wall-clock time) and not latency. The most common algorithms trade latency for throughput and total wall-clock time.


Yeah, that's what I thought about interrupt handlers. (Also if you're allocating in your interrupt handler, your background thread can't allocate without either disabling interrupts or using some kind of lock-free synchronization.)

I still haven't read the Metronome paper.

I don't know if I'd go so far as to say GC is usually faster than manual memory management, even if we measure by throughput, but at least "often faster" is justifiable. I think usually you can find a way for manual memory management to be faster.


An overview of its deficiencies: https://github.com/carp-lang/Carp/issues/1317


Can somebody please explain or give me some links on how FP is supposed to work without a GC? For example, Rust has different closure types (Fn, FnMut, FnOnce) to guarantee the possibility of lifetime analysis (so it is arguable whether to consider it a functional programming language). On the other hand, the most common FP languages (OCaml, Haskell) all come with a GC.

Or am I wrong in assuming this is a functional language?


Lisp is not necessarily functional. There's a lot of mutation in Common Lisp even if the community favors a functional style.

I think the question is: how does Lisp function without garbage collecting cons cells?

For one, I'm not sure they even rely on cons cells like "real" Lisp. Clojure doesn't either. They cite ML as inspiration.

The Carp language guide states that object lifetimes are statically inferred [1]; my guess is they allocate on the stack (or malloc/free by scope) and detect use-after-free at compile time.

Another, more theoretical approach is using linear types which require all values to be "consumed" [2]

1: https://github.com/carp-lang/Carp/blob/master/docs/LanguageG...

2: https://www.cs.utexas.edu/users/hunt/research/hash-cons/hash...


Thanks for the links. There's also this Reddit discussion [0] from 2 years ago (it mentions Carp btw.) and an explanation about how Carp manages memory [1].

[0] https://www.reddit.com/r/haskell/comments/d5d13i/is_it_possi...

[1] https://github.com/carp-lang/Carp/blob/master/docs/Memory.md


In particular, the reddit discussion mentions ASAP, which is a set of static analysis algorithms that can work with mutability, generics (polymorphism) and linear types. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-908.pdf

As far as I'm aware, this really only applies to single threaded systems. However, if you implement threadsafe modules and keep the shared stuff internal (use the ASAP algos here), you can fit the majority of use-cases including hard-realtime constraints.

Composita is the module system I'm referring to. http://www.composita.net/Composita.html It allows concurrency with managed memory and no GC, through static analysis of the module (component) interface queues.


You're correct. Somewhat like Clojure's Vector, the default collection at runtime in Carp is an Array, which is a heap-allocated C array with an attached length. At compile time, however, the dynamic language uses more standard lispy lists.


Me knows:

"TN-tools in Nokia were automatically compiled into C-code to be run in VAX-computers. Compiled C-code did not have garbage collector, there was separate reclaim-command for discarding used data. If you managed to run your program without ever hearing the BEEP caused by garbage collector, your program was ready for VAXing."

https://timonoko.github.io/Nokolisp.htm


In general, most lisps are imperative, or support multiple paradigms including imperative.

First-class functions are a necessary feature of FP, but the mere presence of the feature does not make a language functional. Some counterexamples include Fortran and Smalltalk.

It's an easy misconception because most of the well-known newer entries to the lisp family - Scheme, Racket, and Clojure - are all mostly functional, and because most of the major non-functional dialects of lisp died out 30 or 40 years ago.


As long as you forbid or mark cycles (…or ignore the problem) lifetime analysis can be done statically. Which is better anyway. Automatic memory management doesn't need a GC.


It's really surprising that it's rare in functional languages. Immutability seems like it should guarantee no cycles (?), so reference counting could be used.


Why would immutability guarantee no cycles? Here is a line of valid haskell:

        star e = let (sp, a) = (Split a e, atom sp) in sp
EDIT: I guess I should probably explain it: star is a function that takes an expression "e" and returns the value "sp", which is "Split a e", where "a" is the result of calling the "atom" function on "sp". This is creating a representation of a regex star operator. Note that the tuple defined in the let definition only serves to name the two values of the tuple so that they can refer to each other.


I mean, generally tying the knot is a useful technique, but I think these scenarios all require/exploit non-strict evaluation, which is in itself not really immutable in the sense most people use it.

But yes, such code is often useful so that e.g. a parent xml node can refer to its children nodes while the children nodes can also refer to their parents.

Anyway, I'm not sure about this, but I think you can't have circular data structures in the context of strict evaluation (or can you? maybe by deferring execution via anonymous functions? I wonder....)


I'm fairly certain I've done similar things in Ocaml (in fact, I think it's where I learned this technique).


The big thorn is closures, which show up all over the place in FP. You either need to limit closures vs ordinary functions (as e.g. Rust does), have manual memory annotations (such as e.g. what Swift does) or you basically need a GC. The former two choices are annoying if you really take advantage of functions as first-class citizens.


Yes, I thought that might be a problem... I guess Rust is the choice for me.


Reference counting is usually very expensive, because even reading a variable updates the reference count, and ending a scope involves testing the reference count of every variable defined inside the scope and conditionally deallocating the referent.

Without reference counting, here's the end of a hairy function scope that deallocates 15 local variables and restores two callee-saved registers:

        11a6:       48 83 c4 78             add    $0x78,%rsp
        11aa:       5b                      pop    %rbx
        11ab:       41 5e                   pop    %r14
        11ad:       c3                      retq
Now imagine looping over 15 local variables to decrement each reference count, test it, and do a conditional jump based on the result; if that's open-coded, your epilogue is humongous, and if it's in a reference-count-decrementing subroutine, it's going to have mispredicted branches all over the place, costing you maybe 15 cycles each, a total of about 100 cycles for this function. We're talking about adding an order of magnitude of cost to subroutine call, or more. (I think this function doesn't really need 15 local variables; that's the compiler's fault.)

This gets worse with multithreading, because writing to the same reference count on different cores would even in the best case require one core stealing the cache line from another in order to modify it, which may stall the core; but often even an atomic reference count increment is more expensive even than that because it involves a memory barrier.

Reference counting can become reasonably cheap if it's done at large granularity (filesystem files or COM objects, not conses); if you can elide almost all of the reference-count updates, as in Rust; or if your language runtime is just so dog-slow at everything that the extra cost of reference counting isn't that important, like CPython.
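
For the Rust case, the elision mostly falls out of borrowing: you only pay for the count when you actually clone. A small illustrative sketch (not a benchmark):

    use std::sync::Arc;

    fn by_borrow(data: &Arc<Vec<u8>>) -> usize {
        data.len() // no refcount traffic at all, just a pointer dereference
    }

    fn by_clone(data: Arc<Vec<u8>>) -> usize {
        data.len() // the caller's Arc::clone did an atomic increment; the drop here decrements
    }

    fn main() {
        let data = Arc::new(vec![0u8; 1024]);
        let _ = by_borrow(&data);             // zero refcount operations
        let _ = by_clone(Arc::clone(&data));  // one increment + one decrement
    }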

30 years ago or more, before generational GC had gone mainstream, reference counting was a more reasonable choice, because GC was going to be very slow in any case, and ref counting at least used less memory—especially important on machines without cache or virtual memory.

(Purely immutable (applicative) languages like Haskell and Miranda are usually lazy, since that's the payoff for completely abjuring side effects. But lazy evaluation is implemented at the machine level by mutation.)


WRT the multithreading issue, https://lwn.net/SubscriberLink/872869/0e62bba2db51ec7a/ mentions that using atomic memory operations in CPython for Py_INCREF and Py_DECREF slows down the entire CPython interpreter by 60%.


> Reference counting is usually very expensive, because even reading a variable updates the reference count

There are many papers out there on how to elide most ref count operations on locals, and runtimes that use ref counting for automatic memory management typically defer ref count updates in various clever ways (like Nim IIRC). You give up some latency for significant throughput gains.


Aye.


> Reference counting is usually very expensive, because even reading a variable updates the reference count, and ending a scope involves testing the reference count of every variable defined inside the scope and conditionally deallocating the referent.

This is true, but if you can prove things about lifetimes then you can eliminate RC operations with that.

Also, GC is expensive for large processes because 1. peak memory is typically higher 2. you have to scan all of memory for pointers 3. that memory may have been swapped out. Of course this can be optimized too.


> This is true, but if you can prove things about lifetimes then you can eliminate RC operations with that.

Yes, this is true, and in fact with Rust's lifetime analysis you can get excellent performance with reference counting in many places. There is more information about this in https://news.ycombinator.com/item?id=28879401 and https://news.ycombinator.com/item?id=28884775.

> Also, GC is expensive for large processes because 1. peak memory is typically higher 2. you have to scan all of memory for pointers 3. that memory may have been swapped out. Of course this can be optimized too.

It's true that a tracing GC usually requires significantly more memory than manual allocation or reference counting, though occasionally it doesn't, since to implement compacting you basically have to implement a tracing GC, so a GC can sometimes give you lower fragmentation.

It's not true that you have to scan all of memory for pointers, for two reasons:

1. You may have a region of memory that contains no pointers, like Erlang's binary storage.

2. Generational GCs only have to look for pointers in the nursery, the stack, and things marked by the write barrier. If they're GCing a generation that isn't the nursery, at that point they have to scan that generation too, but that's so much less frequent that it doesn't make GC expensive. You say, "Of course this can be optimized too," but actually almost all modern GCs are generational, since the late 01990s. Except...

3. Things like Erlang partition the heap into separate per-process heaps for a large number of lightweight processes. Each of these heaps can be GCed independently because you can't have pointers from one heap to another. "What about unreferenced processes?", you might ask. Well, Erlang doesn't GC processes; you have to explicitly create and destroy them, like a larger-granularity malloc/free, typically using a Rust-like ownership hierarchy to avoid process leaks. This works better than it sounds like it should.


> Generational GCs only have to look for pointers in the nursery, the stack, and things marked by the write barrier. If they're GCing a generation that isn't the nursery, at that point they have to scan that generation too, but that's so much less frequent that it doesn't make GC expensive. You say, "Of course this can be optimized too," but actually almost all modern GCs are generational, since the late 01990s.

I was going to come back and edit this to say "…when you violate an assumption of generational GC" but you somehow managed to instantly reply to my 3 days late post!

Though I'm not too familiar with how GC runtimes are really implemented; I thought generations were more or less based on allocation age but don't know if it actually does analysis to separate allocations that can't reference each other.


I was just lucky! I didn't mean to play gotcha!

(I should write a HN user-agent to watch old threads, though.)

I should preface this by saying I've never actually written a GC, or maintained one someone else wrote, so I might be off-base here; this is just my understanding from reading books and papers:

The awesome thing about generational GC is that (normally) the generations are physically separate in memory, so the GC can avoid even looking at objects in generations older than the one it's emptying out. (Unless they might contain pointers to objects in that generation, that is, which is detected by the write barrier.)

This combines nicely with the potential killer advantage of copying collectors: they don't need to even look at garbage. So if your nursery (or Garden of Eden, if you prefer) is 8MiB and contains only 128 live objects averaging 64 bytes each, a non-concurrent nursery GC consists of walking the stack and any marked older objects, copying those 8192 bytes into the next generation while adjusting the relevant pointers, then resetting the allocation pointer to the beginning of the nursery. The other, say, 130,944 objects that were allocated in the nursery aren't even looked at by the GC, because nothing points to them. (Unless they have finalizers (destructors).) So the GC examines and copies on average 0.06 bytes per object allocated.

Initializing the objects in the first place, as you must do whether they're allocated by decrementing a stack pointer or anything else, is far more expensive.

Of course, most programs retain a larger fraction of their nursery objects than that, stack allocation makes better use of memory caches than a multi-megabyte nursery, and if your program keeps retaining nursery objects long enough, it will eventually need to GC one of the other generations. So GC still isn't free even in the throughput sense. But it's competitive with manual dynamic allocation.


There are two ways to get cycles in Haskell.

One is through “tying the knot”. E.g. to create an infinite list of 1,1,1,1,1,…

ones = 1 : ones

This will actually be compiled to a list cell with a pointer back to itself. You can construct more complicated self-referential data structures thanks to laziness.

The other way you could get a cycle is that it actually does have mutable data structures, although their use is restricted so they can’t have any observable mutable effect in pure code. But you have e.g. IORef which is basically a mutable pointer.

If you wanted no cycles you would need to eliminate some subset of laziness, knot-tying transformations, recursion, and any support for mutable data structures. But yes, I think it could be done.


From the Carp docs:

Memory management is handled by static analysis, a value is owned by the function where it was created. When a value is returned or passed to another function the initial function will give up ownership of it and any subsequent use will lead to a compiler error. To temporarily lend a value to another function (for example to print it) a reference must be created, using the ref special form (or the & reader macro).

and

achieved through a linear type system where memory is owned by the function or let-scope that allocated it. When the scope is exited the memory is deleted, unless it was returned to its outer scope or handed off to another function (passed as an argument).
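
If you know Rust, the same rule reads almost one-to-one; here's a Rust analogue (not Carp code) of passing vs. lending a value:

    fn consume(s: String) { println!("{s}"); } // takes ownership
    fn lend(s: &String) { println!("{s}"); }   // borrows, like Carp's ref / &

    fn main() {
        let owned = String::from("hello");
        lend(&owned);     // temporary loan; we still own it afterwards
        consume(owned);   // ownership handed off here
        // consume(owned); // compile error: use after giving up ownership
    }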


So how is memory fragmentation avoided?


That's a question for the underlying allocator, surely?

(Quite often the answer is "it isn't")


Not just for the allocator. I always thought a main point of a garbage collector was heap compactification (shuffling things around so there is more space), but maybe I am wrong.


Nah, only copying / generational collectors do heap compactification. A simple Mark&Sweep collector doesn't, for example. Nor does reference counting. Both of which are used by many Lisp or Lisp-like languages.

Nothing can substitute for a really good allocator.


So what happens to long running lisp programs? Running out of memory due to fragmentation eventually?


One would expect at least a provision for a stop-the-world phase, where all mutator threads wait for a massive defragmentation.


Not every GC has a compaction phase, though, but generational ones do by design.


And Rust is surely a highly imperative language that has absorbed functional idioms (where appropriate), rather than something that falls under the banner of "functional language"?


There's a reason why functional languages all use a GC. And that reason makes Rust not such a good language for functional idioms unless you stay within very strict lines.


Using value semantics for everything is one way.


Lisp is definitely not as functional as for instance Haskell is. Side-effect-freeness was never really a topic for Lispers.


You are right, Lisp is far more versatile. There is even a statically-typed language Coalton [1] embedded into Lisp.

[1] https://coalton-lang.github.io/20211010-introducing-coalton/


There is also an interesting CL implementation based on the LLVM framework - Clasp[1]. It can be natively compiled and is designed for performance.

[1] https://github.com/clasp-developers/clasp


CLASP is a really interesting project.

For one, it's created and written by a Chemistry PhD researcher (Christian Schafmeister), not someone with a traditional CS background.

For another, it's one of the few languages that has ever attempted to interface with C++ at the template level - you can instantiate C++ template classes from CL, and catch C++ exceptions etc.

For yet another, he does compacting garbage collection for the C++ objects used in the CL implementation, with pointer patching and everything else.

There's a nice Google Tech Talk about it [0], that goes into both why he did this, and how he implemented this.

[0] https://www.youtube.com/watch?v=8X69_42Mj-g


Dr Schafmeister is occasionally present on the LLVM Discord. The CLASP project is something else. That hour-long talk he gives on Youtube about it blew my mind.


I guess it helps that it is pretty much a Lisp for the research work that they are doing, instead of trying to cater to everyone.


Last I checked SBCL and CCL (both open source native compiled CL implementations) are quite a bit faster than Clasp.

However, if you are primarily e.g. calling computational chemistry libraries written in C++, Clasp is the way to go. It's the only non-C++ language implementation I know of with actually usable C++ support.


C++/CLI comes to mind, as a means to merge the .NET and C++ worlds together.


Oh gosh, I forgot about that. Is that Windows only though? I remember MSVC had managed/unmanaged C++ and allowed calling between the two.


Sadly yes.

And although they don't have much love for it, they still keep it relatively up to date.

It was one of the .NET Core 3.1 main milestones.

However I would say for the purpose of binding .NET and C++, probably it doesn't need to understand everything anyway.


The C++/CLI compiler is definitely Windows-only. But I'm not sure whether the generated code would be, if you compile with /clr:pure (which can still handle everything except for setjmp/longjmp).

/clr:pure has been deprecated for some time now, unfortunately. But older compilers are still around...


The creator was interviewed in the podcast "Functional Geekery": https://www.functionalgeekery.com/episode-96-erik-svedang/


This seems like fun. I could never get into Rust because of the syntax (there's so much of it), but this seems doable. I don't need it for anything, though; I don't do anything where SBCL isn't plenty fast enough.


This is interesting, but I believe a big point of LISP is actually having a live system that can be introspected, and GC is part of making that work. In the end isn’t this just <insert sys prog language> but with s-expressions?


S-expressions also enable proper macros, as long as it has them, it's already ahead of <insert sys prog language>.


I guess so, I think my point is: are s-expression enough to qualify as a LISP?


Depends who you ask, really. For me, s-expressions and macros are enough, though it would feel very weird without a REPL and image-based development.


So "cons", and the associated "car" and "cdr" functions are innately the "thing" that needs to be garbage collected in Lisp-like languages.

So I immediately went to the manual, and CTRL+F'd for cons. Carp has the following restriction: https://github.com/carp-lang/Carp/blob/master/docs/Manual.md

> If you're calling a dynamic function (something defined with defndynamic, or a built in command) it will be executed right away. This works very much like a classic, dynamically typed Lisp interpreter. The dynamic functions are not available in compiled code! Their main usage is in macros and to programatically control your build settings.

cons, car, and cdr are all "Dynamic" functions that can only run _DURING COMPILE TIME_. Which means the GC is not needed during runtime: all garbage-collection occurs in the compiler.


That's a Lisp preprocessor for a non-Lisp language.

If you program in C using the Common Lisp c-mera preprocessor, or any of the other similar systems, it's the same thing.

You're writing everything in S-exps, and the expansions use conses, but the output is C; so of course that cannot call cons at run time.

https://github.com/kiselgra/c-mera


Has anyone used or tried both Carp and Janet (which was on the front page the other day)?

What are some strong points on Carp vs. Janet?


Janet is dynamic with a garbage collector.

Carp is a GC-less static Lisp with deterministic memory allocation via a Rust-like borrow checker.

I only wish Carp had stayed nearer Clojure's syntax. A more familiar syntax would make it more accessible.


As someone who doesn’t know any lisp well, it’s odd to hear you highlighting significant differences in syntax between them. I thought the character of a lisp is that it doesn’t have much syntax. I’d be interested in an example.


Clojure adds [], {} and #{} as additional syntax for arrays, hashmaps and sets.


I've been reading through the examples and it's readable enough. What is it that you find odd?


I have not used Carp, and Janet only briefly. However, other than both using lisp syntax, there are very few similarities between the two.

Janet has a garbage collector, no static types, no ownership tracking - the performance is similar to Lua, so it is not ideal for the low-level/realtime systems that Carp is targeting.


> the performance is similar to Lua, so it is not ideal for the low-level/realtime systems

Lua runs on a bunch of low level systems, and not just for hobbyists (it's at the core of VxWorks, for example). Can you expand on why you think the language doesn't meet your performance criteria?


I think Lua is generally considered a scripting language - it is suited for gluing together performance-critical components built in faster languages like C, C++ or Rust, the kinds of GC-less languages you can write a VM, game engine or OS in. Carp is intended to join that list of fast GC-less languages used to build the fast foundation that others then call from scripts in languages like Lua.


Not familiar with VxWorks, but I have mostly seen lua being used as a plugin/extension language around a core implemented in native code.


I think this is LuaJIT, no? IIUC Janet is comparable to non-JITed Lua. But I actually have no clue.


LuaJIT is x86, ARM, PPC and MIPS only.

However, Lua itself can run on anything with the C stdlib and a C99 compiler. Which means for realtime systems you're not generally going to be talking about LuaJIT.

The speed of Lua will depend on the GC in use. While that is an incremental one by default, Lua is designed to allow you to disable it altogether, use the older reference-counting one, replace it entirely with your own allocation system, and so on, so that it can meet hard realtime requirements.


Aha, thanks for the info. Then I believe Janet could fill a similar role. What little profiling I've seen puts them in a similar category: https://github.com/MikeBeller/janet-benchmarksgame/blob/mast...

Though I imagine it's hard to say without actually trying it.


Thank you and @pgt. I thought Janet compiles to C or native bytecode due to its strong interop with C.


Last time I checked, Carp doesn't have a REPL that can interact with a running process. In Janet you can do things like change functions during runtime.


Importantly: has anyone tried embedding Janet in Carp to get the best of both worlds?


"Real-time" has a specific meaning in embedded and computing context. Does it mean that programs written in Carp have any guarantee on their order or duration of execution?


For "real-time", they would need a real-tune OS, datastructures with forward guarantees, non-blocking IO and so on. GC is like the least of the problems.


Not an expert at all here, just a few thoughts and experiences.

I think "forward guarantees" (meaning "lock-free"?) can be implemented anywhere. Non-blocking IO can still be slow in my experience, for example a write() syscall on Linux to a non-blocking TCP socket on Linux sometimes took dozens of milliseconds when I measured. I don't have enough experience to know if one can get better timing guarantees with tuning or newer kernels, but maybe it's not a huge issue in all cases.

After all, the OS is only at the outside layers of an application, so one might be able to process the OS I/O in separate threads and allocate sufficiently large buffers there.

GC however, in typical usage means that random internal codepaths could be interrupted by the GC doing OS interaction in places where it's really not affordable.


Lock-free does not guarantee that all participants get fair treatment; e.g. one thread can keep losing the CAS (CompareAndSwap / LoadLinked/StoreConditional / etc.) and effectively spin, or need to yield, releasing the CPU resources.

A non-blocking write should just copy the data to the socket buffer, definitely not taking milliseconds.

Allocate buffers in separate threads - ok, that's the crux of it - the memory is shared amongst the threads within a process, so the allocator has to be non-blocking as well, or a large allocation would prevent allocations in other threads. Next, releasing memory to the OS (or unmapping memory-mapped files): munmap requires a TLB flush for all cores assigned to the process. There is a lot that can 'block' sometimes and void the "real-time" properties.

> interrupted by the GC doing OS interaction

Normally GCs trigger only at 'safe points' and unless they need to allocate more memory (should never be the case for a real-time application), GCs should have no OS interaction. Copying and moving memory within a process is no OS functionality. Concurrent GC with read-barrier exist as well, so no need for stop-the-world pauses. (Edit: creative use of memory mapping hardware to avoid copying may need OS calls)

It seems quite common to see on hacker news - "No GC <language> for real-time". Writing a good/multi-threaded real-time application is a lot harder than just GC's dreaded stop-the-world.


> A non-blocking write should just copy the data to the socket buffer, definitely not taking milliseconds.

That's the theory, right? It could be I measured something wrong, but sometimes dozens of ms is what I got, and I concluded that non-blocking I/O avoids indeterminate blocking (such as reading from a TCP socket until the sender sent N bytes) but does it completely avoid taking in-kernel locks etc? Probably not.

I didn't find any other measurements on the internet, please point me to them if you find them.

Thinking back again, another explanation could be false sharing effects that I didn't know well at the time.

> Allocate buffers in separate threads - ok, that's the crux of it - the memory is shared amongst the threads within a process, so the allocator has to be non-blocking as well or the large allocation would prevent allocations in other threads.

I've written an SRSW lock-free ringbuffer, for example. The ringbuffer memory is allocated at startup, so the cost to allocate from it is very low. It's probably not super important, but the ringbuffer even caches the read or write pointer from the other thread, so it only needs a cache-line transfer when the cached value is not sufficient to accommodate the read or write.
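
For reference, a minimal single-producer/single-consumer ring of this general shape might look like the following Rust sketch (assumptions: u32 payloads, and no caching of the other side's index, which a tuned version would add):

    use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

    pub struct SpscRing {
        slots: Vec<AtomicU32>,
        head: AtomicUsize, // next slot to read  (written only by the consumer)
        tail: AtomicUsize, // next slot to write (written only by the producer)
    }

    impl SpscRing {
        pub fn new(capacity: usize) -> Self {
            Self {
                slots: (0..capacity).map(|_| AtomicU32::new(0)).collect(),
                head: AtomicUsize::new(0),
                tail: AtomicUsize::new(0),
            }
        }

        pub fn push(&self, value: u32) -> bool {
            let tail = self.tail.load(Ordering::Relaxed);
            let head = self.head.load(Ordering::Acquire);
            if tail.wrapping_sub(head) == self.slots.len() {
                return false; // full: drop the event rather than block
            }
            self.slots[tail % self.slots.len()].store(value, Ordering::Relaxed);
            self.tail.store(tail.wrapping_add(1), Ordering::Release);
            true
        }

        pub fn pop(&self) -> Option<u32> {
            let head = self.head.load(Ordering::Relaxed);
            let tail = self.tail.load(Ordering::Acquire);
            if head == tail {
                return None; // empty
            }
            let value = self.slots[head % self.slots.len()].load(Ordering::Relaxed);
            self.head.store(head.wrapping_add(1), Ordering::Release);
            Some(value)
        }
    }

    fn main() {
        let ring = SpscRing::new(4);
        assert!(ring.push(42));
        assert_eq!(ring.pop(), Some(42));
    }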

This is way outside my experience, but I think it should be possible to do some similar stuff with multiple readers and/or writers? I think you should still be able to allocate from a ringbuffer without live-locking (unless the buffer is full of course, in which case events are dropped anyway).

> Next, releasing memory to the OS (or unmapping memory-mapped files): munmap requires a TLB flush for all cores assigned to the process.

I think most programs do only "static" allocation at program startup. There's no OS interaction for memory management from then on.

> Normally GCs trigger only at 'safe points' and unless they need to allocate more memory (should never be the case for a real-time application), GCs should have no OS interaction.

Ok, if the GC is configured to never return memory to the OS, it makes sense that probably the GC in an event handling system won't need to get more memory from the OS, or only rarely, once it is "warm".

Still, GCs are generic systems that have to be less efficient than specialized systems. And what are 'safe points'? I'd rather decide that myself. I think a GC pause of <1ms, as some new GCs allegedly achieve (is this widely acknowledged? I would assume it depends a lot on the allocation patterns / granularity etc.), could be enough, but I'd rather control this myself (for the somewhat real-time app that I've been working on, I need << 10ms latency).


> It could be I measured something wrong, but sometimes dozens of ms is what I got, and I concluded that non-blocking I/O avoids indeterminate blocking (such as reading from a TCP socket until the sender sent N bytes) but does it completely avoid taking in-kernel locks etc?

Perhaps what you were seeing were occasions where the system decided, during your system call, that your process's time slice had expired, or some I/O that a higher-priority process was waiting for completed, and it gave the CPU to some other process?

When timing things on preemptive multitasking systems it is often best to make a lot of measurements and then look for clusters. The time of the smallest cluster should be the time the operation itself takes.


Preemption can happen even outside of system calls of course, but I guess my takeaway has been to just avoid calling into the system where a lot of things can happen that I don't really understand or control. That, plus setting my threads to high-priority and possibly pinning them to the right cores.


>That, plus setting my threads to high-priority and possibly pinning them to the right cores.

What language was that - [normally] Java does not respect priorities at all for instance. Running with higher priority may require root as well.


Language was C, running as root on some Linux 3.x on some ARM/FPGA development board. By setting threads to high-priority I mean something like pthread_setschedparams(... SCHED_FIFO ...).


>I didn't find any other measurements on the internet, please point me to them if you find them.

It has been years since I wrote servers handling thousands of sockets (used in forex), yet even a couple of milliseconds per write would have made the entire operation useless.

>another explanation could be false sharing effects that I didn't know well at the time.

False sharing sucks, of course, but milliseconds seems way, way too much of a penalty. You'd need all cores updating the same cache line, and even then I doubt it'd be that bad.

Could it be the virtualization layer? (again I have run 'that' on bare metal only)

>I think it should be possible to do some similar stuff with multiple readers and/or writers

Multiple writers should be avoided entirely; contention on writing is what prevents scaling - false sharing is pretty much that. A writer honoring the readers is pretty simple - each reader has its own pointer and the write should fail (or spin) when the buffer is full. I think that's quite a classic structure, and indeed it requires no extra memory to communicate/hand off.

About latency - this is Java's current GC, which still has a compaction phase: https://malloc.se/blog/zgc-jdk16


Correction: one of HotSpot's GC implementations; there are plenty of Java implementations to choose from.

Including ones with soft real time GC for performance critical deployments like PTC and Aicas.

https://www.ptc.com/en/products/developer-tools/perc

https://www.aicas.com/wp/products-services/jamaicavm

Anyone picking a traditional JVM for such workloads is doing it wrong.


What I meant is that low latency GC is sort of a commodity/mainstream now. Not advising to use java or a specific JIT+Collector.


Ah, fair enough, got it wrong.


Unless it has a scheduler for a concurrency or threading framework, what other characteristics beyond memory management would be relevant? I guess maybe certain types of built-in data structures--you'd probably want a map/dictionary based on red-black or AVL trees rather than a hash table.


Any shared data structure needs some kind of forward-progress guarantee; e.g. being lock-free might not be sufficient.


... or a hash table using RB/AVL trees rather than linked lists for hash buckets.


The doubling of sizes is an issue. Yet I'd consider all single-threaded data structures a non-issue. The hard part is multithreading.

Side note: Java's HashMap uses a red/black tree for nodes when there are too many collisions (and the keys are Comparable).


But Java's HashMap still resizes when breaching the load factor, so any similar implementation, even in the single-threaded case, can't achieve (soft) realtime. I am not aware of any linear-hashing or similar approach for in-memory hash tables that mitigates such latency spikes.


Of course it does; it's possible to pre-allocate the entire table, though (constructor with a size, rounded to the next power of 2, compensated by the load factor). Massive waste of memory, but no resizes.
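
In Rust terms, the same trick looks roughly like this (hypothetical sizing; the standard library rounds the requested capacity up internally):

    use std::collections::HashMap;

    fn main() {
        let max_entries = 1 << 20; // known worst-case population
        // Reserve capacity for at least this many entries up front, so inserts
        // up to that bound never trigger a rehash (at the cost of idle memory).
        let mut table: HashMap<u64, u64> = HashMap::with_capacity(max_entries);
        for key in 0..1000u64 {
            table.insert(key, key * 2);
        }
        assert!(table.capacity() >= max_entries);
    }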


I was thinking of building my own LISP-like language, but seeing so many pop up on here I don't think I'll bother.


Millions of CS students build their own language for compiler classes every year.

As a learning exercise in everything that goes into making a compiler/interpreter, it is a very worthwhile endeavour.

As a means to make the next big thing, maybe not.


> Millions of CS students...

Are there millions of CS students in the world?? I would guess a few hundred thousand, at most.


Probably. There's 20-30ish million software developers in the world [1]. People work for 30-50 years, suggesting at least a million freshman CS students a year. It's a rapidly growing field. Plus all the lifetime learners. It's not unreasonable to estimate at least 2M CS students worldwide write or work on a compiler each year. Maybe a tad high, but also it's an idiom.

https://www.daxx.com/blog/development-trends/number-software...


I have no actual opinion here. Just stopping by to point out that not (nearly) all software developers started as CS students.


Of course, I don't mean to imply that all software developers went to school for cs, or went to school at all. Just doing a Fermi estimation.


Might be, I don't have the numbers.

> A figure of speech is a deviation from the ordinary use of words in order to increase their effectiveness. Basically, it is a figurative language that may consist of a single word or phrase. It may be a simile, a metaphor or personification to convey the meaning other than the literal meaning.

https://en.wikipedia.org/wiki/Figure_of_speech


As stated by others, it's a very enjoyable exercise. Think of it like a marathon for programmers. Yes, lots of people run marathons. But people usually don't go "Maybe I should run a marathon - but so many others are doing it. Meh, maybe not."


Try building a Smalltalk-like language. It throws a different set of concerns into relief, and besides, it's a better design.


You probably should anyway - most of it is for learning and personal satisfaction. It's not like any of these projects are competing for actual marketshare in the programming language landscape.


Spent my lunch break investigating this a bit and it is quite nice, with the only caveat (for my use case) being that networking is handled by a third-party socket library that, alas, does not handle multicast sockets (don't ask...).

Otherwise, examples are interesting, and certainly seem to cover most gamedev scenarios (haven't found any audio or MIDI ones yet).


I wonder if this deterministic memory management, instead of garbage collection, would have helped Discord with their spiking latencies.

https://blog.discord.com/why-discord-is-switching-from-go-to...


> (the Int x)

I like the cute syntax for type annotations.


I had to double-take because CL also has THE for declaring types, but it's a bit different; THE declares the type of an expression, not a variable (there is a separate type declaration for variables).

so e.g. (the fixnum (+ x 1)) would assert that (+ x 1) produces a fixnum.


That is also the case in Carp, x being an expression here.


Ah, okay. So is there a way to declare that a variable will always be of a given type in Carp?


Carp is statically-typed with type inference so writing (the Int x) would be enough for the compiler to forbid any usage of x as another type. Writing (Int.+ x 1) would accomplish the same as Int.+ only accepts Int.

You can also annotate function with a type signature.

  (sig add (Fn [Int Int] Int))
  (defn add [x y] (+ x y)) 
Is the same as:

  (defn add [x y] (the Int (+ (the Int x) (the Int y))))


What's the difference between `let` and `the`? Skimmed the language guide and docs but it wasn't clear. `the` seems to be the typical type inference variable declaration.

I'm not really familiar with lisps so this might be a documentation oversight.

Looks really neat overall.


`let` is for declaring variables, while `the` is a no-op at runtime and only gives type information to the compiler.


Was it always the case that, on a weekly basis (if not more often), new languages would sprout? It seems to me like considerable amounts of energy are invested on re-inventing-the-wheel instead of solving real problems with existing tools.


What's the difference between it and Common Lisp? Isn't Common Lisp also AOT compiled?


First of all, Common Lisp is a language with several implementations, some of which offer AOT compilation. Secondly, garbage collection (a.k.a. automatic memory management) does not exclude AOT.


See also the Boehm GC, a garbage collector for C and C++.

https://en.wikipedia.org/wiki/Boehm_garbage_collector


AOT doesn't necessarily mean GC-less, right?


Maybe they actually mean VM-less, which would probably imply GC-less because if you have a GC, you pretty much have a VM doing the GC (unless it's simple ref-count). For embedded, you need VM-less, I think, as VMs tend to be too heavy.


Gambit has a VM and a GC, and is still AOT-compiled to C. (The Gambit VM is an intermediate representation of Scheme code which is then compiled down to C code.)


VMs imply a non-native bytecode execution. Garbage collection doesn't require any of that.

It's just normal library code usually called by the allocator, and well-specified memory object/pointer layouts.


The name is strange. Reminds me of a product-naming story I heard years ago. Papenmeier from Germany was once planning to create a small notetaker with built-in braille display. They originally planned to name it "Braille Assistant", but the US distributors objected on the ground that they can already see users shortening the product name to "Braille Ass". I have a similar feeling with Carp, isn't the obvious transposition an issue? :-)


Given the kind of niche, self-deprecating humor often found in our ecosystem, it was probably seen as a plus :)


Doesn't seem to happen to the Common Address Redundancy Protocol too much.


It'll be good company with git, GIMP, LISP, ploopy.co, etc.



