How does the stack copying technique (as used in lthread) compare to segmented stacks (as used in Go)? The swap technique is very simple, but I would assume that the stack copy shows up in profiles and is significantly slower than the segmented stack technique. Are there any benchmarks?
Segmented stacks will take much more space. One of the reasons I went with stack copy is that it lets me create a million lthreads if I want to, without worrying about memory.
Lthread is in C, an environment where you control and manage your own memory. Using slab allocators and a high-performance malloc like jemalloc, you can avoid allocating a lot of your variables on the stack, and the stack copy becomes minimal. In most of the production code I have running on lthread, the stack copy averages around 300 bytes.
Lthread is a cool experiment but coroutines in C are a dubious ideal. If you're writing a program in C/C++, presumably you want ultimate control and the fastest possible execution. You wouldn't introduce extra runtime complexity and a speed hit for easier to use APIs (blocking I/O functions). You'd just use epoll and non-blocking sockets (or libuv). That's not to say that coroutines/green threads/goroutines aren't useful in high-level languages, where you've already decided to take a speed hit for ease of use. Said another way: if your concern is about easy to read code, C is not the right choice; if your concern is speed: copying a stack on every i/o function is not the right choice.
When you said segmented stacks, I read it as one stack per lthread. Ignore my previous reply.
Segmented stacks aren't possible in C, because you don't have control over the stack's growth and the MMU when a function is running.
Coroutines simplify your code a lot compared to callbacks (via libuv or epoll), and with lthread you barely take any performance hit if you don't have sizable variables on the stack. I've written web servers and proxies in lthread that perform as fast as nginx (and sometimes faster). I think that's a pretty good deal.
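To make the coroutine side of that comparison concrete, here is a minimal sketch of how a blocking-looking read can be layered over a non-blocking descriptor. All names (`co_read`, `co_wait_fn`, `wait_with_poll`, `set_nonblocking`) are hypothetical and are not lthread's API. In a real coroutine library the wait hook would switch to another coroutine; here a plain poll(2) call stands in for the scheduler so the example is self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical hook: a real scheduler would park the coroutine here until
 * `fd` is readable and run other coroutines in the meantime. */
typedef void (*co_wait_fn)(int fd, void *ctx);

/* Put a descriptor into non-blocking mode. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Stand-in for a scheduler (an assumption for this sketch): counts how often
 * the caller had to wait, then sleeps in poll(2) until `fd` is readable. */
static void wait_with_poll(int fd, void *ctx)
{
    int *yields = ctx;
    if (yields != NULL)
        ++*yields;
    struct pollfd p = { .fd = fd, .events = POLLIN };
    (void)poll(&p, 1, -1);
}

/* Looks like a blocking read to the caller. `fd` must be non-blocking.
 * The wait hook is only invoked when the kernel reports no data available. */
static ssize_t co_read(int fd, void *buf, size_t len,
                       co_wait_fn wait_readable, void *ctx)
{
    assert(wait_readable != NULL);
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;                 /* data or EOF: no wait needed */
        if (errno == EINTR)
            continue;                 /* interrupted by a signal: retry */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                /* real error, errno is set */
        wait_readable(fd, ctx);       /* would block: hand control away */
    }
}
```

The caller writes straight-line code (`n = co_read(...)`) instead of registering a callback and carrying state across invocations, and when data is already available the call returns without ever touching the wait hook.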
> You wouldn't introduce extra runtime complexity and a speed hit for easier to use APIs (blocking I/O functions).
Your program is running on a kernel. You have already lost a lot of performance, and the complexity has already been introduced. The kernel does a gazillion things while running your program, so if you really want raw performance, run your code on a bare-metal CPU without kernel intervention. But that's not convenient, so you compromise, pay a penalty, and run your program on a kernel. Does that mean you might as well use a high-level language because you compromised on performance? No.
> if your concern is about easy to read code, C is not the right choice
I disagree. They aren't mutually exclusive.
> if your concern is speed: copying a stack on every i/o function is not the right choice.
lthread doesn't copy a stack on every I/O function call. The lthread stack is copied only when the socket would block or the lthread has used up its fair share. Sometimes this means an lthread can make 2 to 5 calls before it yields. When it does yield, the stack is copied from wherever it is at that point, so the size depends on how deep you are in your call chain. If you don't have big variables on your stack, the stack copy can be just a few bytes. Note that it's not the whole scheduler stack that's copied (4MB by default), but only what the lthread consumed (usually from a few bytes to 300 bytes; it varies with the code).
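The idea of copying only the consumed part of a shared stack can be sketched as follows. This is an illustration under assumptions, not lthread's actual code: the names (`coro_stack`, `coro_stack_save`, `coro_stack_restore`) are hypothetical, the stack is assumed to grow downward from `top`, and the real register and context switching is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-coroutine record of its saved stack bytes. */
typedef struct {
    unsigned char *saved;   /* heap copy of the used stack bytes */
    size_t saved_len;       /* number of bytes currently saved */
    size_t saved_cap;       /* capacity of the `saved` buffer */
} coro_stack;

/* On yield: save only [sp, top), the bytes the coroutine actually consumed
 * on the shared stack. Returns 0 on success, -1 if allocation fails. */
static int coro_stack_save(coro_stack *c, const unsigned char *top,
                           const unsigned char *sp)
{
    assert(sp <= top);
    size_t used = (size_t)(top - sp);
    if (used > c->saved_cap) {
        unsigned char *p = realloc(c->saved, used);
        if (p == NULL)
            return -1;
        c->saved = p;
        c->saved_cap = used;
    }
    if (used > 0)
        memcpy(c->saved, sp, used);
    c->saved_len = used;
    return 0;
}

/* On resume: copy the saved bytes back under `top` and return the stack
 * pointer the coroutine should continue with. */
static unsigned char *coro_stack_restore(const coro_stack *c,
                                         unsigned char *top)
{
    unsigned char *sp = top - c->saved_len;
    if (c->saved_len > 0)
        memcpy(sp, c->saved, c->saved_len);
    return sp;
}
```

With this layout the cost of a yield is proportional to `top - sp` at the moment of the yield, not to the size of the shared stack, which is why keeping large buffers off the stack keeps the copy small.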
You are oversimplifying your choices to this: if you want performance, use C and callbacks; otherwise, use a high-level language. There's a wide spectrum in between that this oversimplification misses.
It's still in the dev branch, but I'll be merging it soon:
http://webmon.com/blog/2013/02/19/lthread-now-supports-userl...
https://github.com/halayli/lthread/blob/dev/src/lthread_io.c