How useful is a lock-free (not wait-free) approach in a real-time context? The point is that any thread contending on lock-free data is liable to starve, so the worst-case behavior is still quite bad, no? The only benefit I see is freedom from priority inversion, and that's a fixable problem anyway. Are there other properties of the system that you somehow use to get real-time guarantees?

On the other hand, if you combine interrupt-disabling with a fair lock, then your lock wait time can be bounded in every thread. This is probably only an option inside the kernel, though.

In userspace, and with more threads than CPUs, the lock-free approach will probably save you a lot of context-switching time. It'll also be at least as fine-grained as the most fine-grained locking scheme you could come up with, likely with less overhead.

So I'd say that the lock-free approach is an optimistic strategy, and I don't see the benefit except for performance. I say this as someone who really likes lock-free programming.
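
To make the starvation point concrete, here's a toy C11 sketch of mine (not from any real system) of the canonical lock-free increment:

    #include <stdatomic.h>

    /* Lock-free: a failed CAS means some other thread's CAS succeeded,
     * so the system as a whole always progresses. But THIS thread can
     * lose the race every time under contention, so its worst-case
     * latency is unbounded. That's the starvation worry. */
    void lockfree_increment(atomic_int *ctr)
    {
        int old = atomic_load(ctr);
        /* on failure, the CAS reloads the current value into old */
        while (!atomic_compare_exchange_weak(ctr, &old, old + 1))
            ;
    }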


> The only benefit I see is freedom from priority inversion, and that's a fixable problem anyway. Are there other properties of the system that you somehow use to get real-time guarantees?

Priority inversion was what I was thinking of. You say it's a fixable problem, and the common way to fix it is to have a real-time scheduler deal with it. Lots of systems don't have real-time schedulers available to them, so lock-free algos make sense there, and you deal with the starvation issue instead of the inversion issue.

That said, I'm no expert on real-time systems, and my use of lock-free algos has fallen into the exact category you're talking about: wanting to avoid context switches in userland. But that's more a property of the OSes I use than of the algorithms themselves.


Both lock-free and wait-free algorithms/implementations offer the same advantage, as a corollary of their definition: latency guarantees. These are central to real-time systems.

PS. The difference is that lock-free algorithms improve throughput _at the expense of single-thread latency_, and thus are more useful for satisfying system-wide throughput requirements, e.g. for a DBMS. Wait-freedom additionally guarantees freedom from starvation.
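
To make the distinction concrete, a minimal sketch (mine, not the parent's): hardware fetch-and-add is wait-free, since every thread finishes in a bounded number of its own steps, with no retry loop.

    #include <stdatomic.h>

    /* Wait-free: completes in a bounded number of this thread's steps
     * regardless of contention, so per-thread latency is bounded.
     * Compare the CAS retry loop sketched a couple of comments up,
     * which is only lock-free. */
    void waitfree_increment(atomic_int *ctr)
    {
        atomic_fetch_add(ctr, 1);
    }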


Probably the fact that it eagerly coalesces newly freed blocks, whereas high-performance allocators like tcmalloc and jemalloc use slabs of fixed-size blocks which are never coalesced. Coalescing will load adjacent blocks into cache, and it probably implies doubly linked lists, where slab allocators can get away with cache-friendlier single links or bitmaps (because you probably have to remove a block from the middle of a free list if you've just coalesced it with a newly freed block). It turns out cache footprint is key for speed and scalability in allocators, so that's a big problem.

I'm embarrassed to link this twice in a row now, but at one point I played around with the perf of a couple of allocators, including glibc's ptmalloc. I wound up answering exactly this question, at least for some simple use cases: http://www.andrew.cmu.edu/user/apodolsk/418/finalreport.html

Comments in ptmalloc sources say they basically stick to the algorithm described here: http://gee.cs.oswego.edu/dl/html/malloc.html.
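
Roughly, the step that forces doubly linked free lists looks like this (a sketch with invented names, not ptmalloc's actual code):

    #include <stddef.h>

    /* Merging a newly freed block with an already-free right neighbor
     * means unlinking that neighbor from the MIDDLE of its free list,
     * which needs a prev pointer. Slab allocators never coalesce, so
     * they get away with single links or bitmaps. */
    struct chunk {
        size_t size;
        struct chunk *prev, *next;  /* doubly linked free-list links */
    };

    static void unlink_chunk(struct chunk *c)
    {
        c->prev->next = c->next;    /* touches two other chunks' lines */
        c->next->prev = c->prev;
    }

    static void coalesce_right(struct chunk *freed, struct chunk *right)
    {
        unlink_chunk(right);        /* pulls the neighbor into cache */
        freed->size += right->size; /* freed now spans both blocks */
    }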


You don't need to have a single frontier, and IIRC no one does these days. A stack of slabs per "arena" suffices, and the extra cost is only a read from a cacheline you'll soon need anyway. It looks like jemalloc uses a tree of "runs" for that.

I don't know exactly what you're imagining, but I don't see how you can pass this off to the kernel. You need a slab which has free memory; how do you know where to find it? You could write a nasty new syscall to ask the kernel for ZFOD pages, and then cry because your program is slow.

Also, unless your slabs are huge and your allocator is being used to fuel some kind of pre-caching, faulting pages is going to be suboptimal. You incur at least twice as many mode switches: one for the mmap call, one for each faulting page. I've measured this, and it gets pretty expensive. I'm kind of dubious about the performance of schemes which stray from the principle of "allocate what you actually need; free it when you don't."

The bottom line is that a slab allocator should be solving a different, mostly less general problem than the kernel. It needs finer granularity and it's optimizing for common block sizes that are small enough to fit into slabs of known sizes. So it should optimize for allocating page chunks in a small subset of possible sizes (namely, the slab size(s)), while the kernel should be worried about unbounded, oddly-sized page chunks.

I toyed with scalable allocators for a course a few years back. It's a little embarrassing, but if you want to see what the costs actually are: http://www.andrew.cmu.edu/user/apodolsk/418/finalreport.html

You can do lockfree slab allocators pretty easily, so the game in scalable malloc is really about avoiding inter-CPU cacheline sharing and fitting your data structures into per-CPU caches. Slabs happen to be really good for that, for non-obvious reasons.
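
A minimal sketch of the per-CPU idea (all names invented; slab refill and cross-CPU frees omitted):

    #include <stddef.h>

    #define NCPUS 64                /* assumed; query it at runtime IRL */

    struct block { struct block *next; };

    /* One cache of fixed-size blocks per CPU: the hot alloc/free path
     * only touches lines that stay resident in that CPU's cache and
     * shares nothing with other CPUs. */
    static struct { struct block *free; } percpu_cache[NCPUS];

    void *slab_alloc(int cpu)
    {
        struct block *b = percpu_cache[cpu].free;
        if (b)
            percpu_cache[cpu].free = b->next;  /* singly linked pop */
        return b;                   /* NULL: refill from a shared arena */
    }

    void slab_free(int cpu, void *p)
    {
        struct block *b = p;
        b->next = percpu_cache[cpu].free;      /* singly linked push */
        percpu_cache[cpu].free = b;
    }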


It certainly wouldn't be the first time that I've suggested something ridiculous. One easy explanation for the lack of papers I can find is that the idea is so obviously useless that no one has written about it!

That said, it just seems unlikely that the best solution for 32-bit systems with limited virtual address space and slow interrupt-generating system calls is also the best solution for 64-bit systems with practically limitless virtual address space and fast hardware-supported syscall/sysenter.

And I think what I'm suggesting is roughly analogous to what's being done in Azul's (successful?) C4 garbage collector: provide fast userspace manipulation of per-process page tables, and offload as much as possible to the hardware: http://www.azulsystems.com/sites/default/files/images/c4_pap...

But I'm definitely too fuzzy about some of the key concepts to suggest anything particularly coherent at this point. I'll check out your writeup, and try to learn more. Thanks!


I'm a very happy user. It's pretty well designed, and extensible in JavaScript. It's nice having EVERYTHING about the browser at your fingertips in a fully customizable way, not just the link-following mechanism. I use lots of vim hotkeys, just like I do in emacs.

Conkeror has sometimes struggled to keep up with xulrunner development, and there are occasional bugs.


I suppose it was bound to happen.


By storing the struct in a union with an integral type, or, if you need a value but not an address, by casting to such a union. You might also think to just abuse pointer casting, but IIRC that breaks C's strict-aliasing assumptions and causes GCC's optimizations to do bad things to your program.
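
Concretely, something like this (a sketch; the names are mine):

    #include <stdint.h>
    #include <assert.h>

    struct pair { uint32_t a, b; };

    /* Punning through a union member is the sanctioned route in C11
     * and GNU C; casting a struct pair * to uint64_t * instead would
     * break strict aliasing. */
    union punned {
        struct pair p;
        uint64_t    raw;
    };

    static_assert(sizeof(struct pair) == sizeof(uint64_t),
                  "sizes must match for the pun to make sense");

    uint64_t pair_as_int(struct pair p)
    {
        union punned u = { .p = p };
        return u.raw;               /* read back via the other member */
    }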


But, like you said, you still need protection of kernel data, which means that you need to execute kernel code on a stack not writeable by the user, at a greater permission level that allows ONLY kernel code to modify kernel pages. Single address space or not, I can't see these needs being met in a cheaper way than what we already have, presuming that kernel code/data live in never-invalidated global pages available in every address space.

Edit: Oh, I see from your other comment that you're talking about the benefits to people who don't feel this need.


Lockfree OS kernel. Give it a couple weeks.


How do you even start writing something like that?


Well, I'd done a small kernel for a class. Later, I wrote a lockfree malloc and realized that large lockfree programs are actually pretty manageable, and that was the start. It's generally just a matter of designing around simple data structures (which are the only ones you can really do lockfree) and a repetition of a few key tricks like refcounting, generation counting, and type-stable memory. Writing the data structures and primitives was tough, but otherwise you don't have to worry about locking, and things come out quite neatly as a result.
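
For a taste of the generation-counting trick, here's a toy sketch of mine (not code from the actual kernel): name nodes by index into a type-stable array, pack a generation next to the index in a single word, and bump the generation on every CAS so an ABA'd pop fails cleanly instead of corrupting the list.

    #include <stdatomic.h>
    #include <stdint.h>

    #define NNODES 1024
    #define NIL    UINT32_MAX

    struct node { uint32_t next; };      /* index links, not pointers */

    static struct node nodes[NNODES];    /* type-stable, never unmapped */
    static _Atomic uint64_t top = NIL;   /* [gen:32 | index:32], starts empty */

    static uint64_t pack(uint32_t gen, uint32_t idx)
    {
        return (uint64_t)gen << 32 | idx;
    }

    void stack_push(uint32_t idx)
    {
        uint64_t old = atomic_load(&top);
        do {
            nodes[idx].next = (uint32_t)old;
        } while (!atomic_compare_exchange_weak(&top, &old,
                     pack((uint32_t)(old >> 32) + 1, idx)));
    }

    uint32_t stack_pop(void)
    {
        uint64_t old = atomic_load(&top);
        uint32_t idx;
        do {
            idx = (uint32_t)old;
            if (idx == NIL)
                return NIL;
            /* nodes[idx].next may be stale by CAS time, but then the
             * generation has changed and the CAS below fails safely. */
        } while (!atomic_compare_exchange_weak(&top, &old,
                     pack((uint32_t)(old >> 32) + 1, nodes[idx].next)));
        return idx;
    }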


For that price, I can buy 10 pairs of jeans on ebay, in a style that I like. Or I can go to the thrift store and get 50. I'll have wet ankles when I bike in the rain in a poncho, but whatever.


It's probably a personal-preference type of thing. I don't like the feeling of having ten pairs of pants. I have this romantic notion that twenty years from now, I'll still be wearing the same $200 pair of pants (aged to perfection like a Barbour jacket), as opposed to a rotation of ten disposable pairs.

I also happen to have two T shirts, one of which I wear for weeks in a row. It's merino wool. In the summer I switch to the other one for exercise, and that just gets handwashed when I'm done.

I realize that I'm gushing so hard about these pants, but I've been in search of the perfect pair for the past five years, and these are almost perfect.

Is that OCD? Wanting that one perfect thing that you can keep forever? Not sure, but I can't have it any other way.


Obviously one can also make that same argument for electric cars like the Tesla, or for SSDs when they were first released. The Intel X18-M 80GB SSD was $700 when it was first released in 2008. New tech always starts out with low demand and a high price.


SSDs are an exception in that they make me more effective at building product, which pays for the investment. I can't say the same for a pair of pants.


For hackers, pants are not so incredibly useful. But I'm guessing computer professionals aren't this company's target market. If you're doing long-distance cycling, or you're a backpacking/hiking kind of person, I'm sure odor- and water-resistant pants would be highly useful.


CMU's got a fairly nice one. It guides students as they build a series of compilers for increasingly large subsets of C, effectively from scratch. Course materials are public, I believe.

http://www.cs.cmu.edu/afs/cs/project/fox/mosaic/people/fp/co...

