Parallelizing the Naughty Dog Engine Using Fibers [video] (gdcvault.com)
120 points by Splines on May 26, 2016 | 63 comments



Similar talk: CppCon 2014: Jeff Preshing "How Ubisoft Develops Games for Multicore - Before and After C++11"

https://www.youtube.com/watch?v=X1T3IQ4N-3g

Ubisoft's talk spends more time getting into the weeds with atomic ops; Naughty Dog's is more of an architecture discussion. If you can only watch one, I'd recommend Naughty Dog's.


Cool.

Me and a friend did this at university with the Ogre 3D engine and failed miserably, because I didn't know much about C++ or thread safety.

We tried the actor model: every game object became an actor and sent messages around. The actors would then be spread across the available CPUs, so performance should scale with the number of cores. In the end we got it running on multiple cores, but the message overhead killed the performance, haha.


> Me and a friend did this at university with the Ogre 3D engine and failed miserably, because I didn't know much about C++ or thread safety.

Fiber safety is much easier to ensure than thread safety. On Windows, using fibers is very simple (just call ConvertThreadToFiber on your current thread if you haven't already, then CreateFiber), as opposed to POSIX systems, where you have to roll your own implementation of fibers (or use an existing library that is not part of the POSIX standard).
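
A minimal sketch of the Windows API (error handling omitted; the stack size is an arbitrary choice):

    #include <windows.h>
    #include <cstdio>

    static LPVOID g_mainFiber;

    // A fiber entry point must have this signature and must never return;
    // switch back to another fiber instead.
    static VOID CALLBACK MyFiberProc(LPVOID param) {
        (void)param;
        std::printf("hello from the fiber\n");
        SwitchToFiber(g_mainFiber);
    }

    int main() {
        // The current thread must become a fiber before it can switch to one.
        g_mainFiber = ConvertThreadToFiber(nullptr);
        LPVOID fiber = CreateFiber(64 * 1024, MyFiberProc, nullptr);
        SwitchToFiber(fiber);   // runs MyFiberProc until it switches back
        DeleteFiber(fiber);
        return 0;
    }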


makecontext/swapcontext was the POSIX-approved way to do fibers. It was removed from the latest POSIX standard because, apparently, who uses coroutines or fibers today? Around the same time, every other mainstream language was adding support for coroutines/generators/fibers.


makecontext/swapcontext doesn't give you an automatically growing stack (as opposed to its WinAPI equivalent). Yes, you can implement that too, I know. But if you don't want to go into the gory details, you have to allocate a "large enough" buffer for your coroutines, probably wasting memory.
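
For reference, the fixed-stack dance looks like this (minimal sketch; glibc still ships these calls even though POSIX.1-2008 removed them):

    #include <ucontext.h>
    #include <cstdio>

    static ucontext_t main_ctx, co_ctx;
    static char co_stack[64 * 1024];   // guess "large enough" up front

    static void entry(void) {
        std::puts("in coroutine");
        // returning here resumes uc_link, i.e. main_ctx
    }

    int main() {
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = co_stack;
        co_ctx.uc_stack.ss_size = sizeof co_stack;
        co_ctx.uc_link = &main_ctx;
        makecontext(&co_ctx, entry, 0);
        swapcontext(&main_ctx, &co_ctx);   // run the coroutine
        std::puts("back in main");
        return 0;
    }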


Technically you only waste address space, which should be plentiful, at least for a while.


Naughty Dog must be doing something completely off the charts compared to other developers. I challenge anyone to look at Uncharted 4 and show me a better-looking game.


I think CD Projekt RED are definitely in the same league as Naughty Dog - The Witcher 3 looks amazing (disclaimer: I haven't seen it on the PlayStation)


It has poor performance on consoles.


It doesn't look that great compared to fresh PC titles. Do other PS4 games look worse?


Which "fresh PC titles" are you referring to?


I guess the last Tomb Raider is a fair comparison? The Witcher and the just-released Doom also seem impressive.


For fibers (a.k.a. coroutines) in Rust, see mioco: https://github.com/dpc/mioco


An older but similar approach from the Doom 3 BFG engine: http://fabiensanglard.net/doom3_bfg/threading.php

I wonder if there is a good open-source C++11 project for this pattern (a job/task queue)?

Also, how does this pattern compare with just using futures/promises with a parallel executor?

https://code.facebook.com/posts/1661982097368498/futures-for...

or Grand Central Dispatch from Objective-C?
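
i.e. something like this trivial std::async sketch, instead of a hand-rolled job system:

    #include <future>
    #include <cstdio>

    static int simulate(int frame) { return frame * 2; }   // stand-in work

    int main() {
        // Each task hands its result back through a future instead of
        // decrementing a job counter the way the fiber systems do.
        std::future<int> a = std::async(std::launch::async, simulate, 1);
        std::future<int> b = std::async(std::launch::async, simulate, 2);
        std::printf("%d %d\n", a.get(), b.get());   // get() blocks until ready
        return 0;
    }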


Our (Facebook's) folly library has lots of components for doing stuff using fibers [1]. That sits alongside folly::futures [2] and wangle [3].

That said, without proper language/compiler support it's difficult to make them "safe" (for some definition). Even within FB, given these libraries, we are pretty wary of using fibers with C++ unless absolutely necessary; you've got to really need it. This is where the coroutines proposals working their way through the C++ committee can make life better.

[1] https://github.com/facebook/folly/tree/master/folly/fibers

[2] https://github.com/facebook/folly/tree/master/folly/futures

[3] https://github.com/facebook/wangle


I still don't quite understand what he means by a "fiber". Is this basically 160 blocks of memory allocated on the heap?


A fiber is like a thread in that it executes code and has its own stack, but it does not have preemption. The OS won't switch between fibers for you; you must manually switch in and out of executing one. Running in a fiber doesn't mean you've stopped using threads. A fiber runs in the context of a thread, and that thread still has preemption. But inside the thread, you can call functions to switch stacks in order to manually pause and resume execution inside the fiber.


I've noticed that lots of recent advances in threading are just ways to avoid OS context switching of threads and instead handle those switches manually (e.g. "green threads" in Python).


OS-level threads are a crude, big-hammer way of doing parallelism - want to run a task? Just switch out the whole 8 KB of stack, then switch back to a stack that's 99% identical. Cooperative multitasking was always known to be much more efficient if you could ever get it to work safely - but one misbehaving task killed everything.

Ironically, with deployment we're moving in the opposite direction: we've decided we really do need the complete isolation of separate VMs/containers, and we're willing to pay the performance overhead because process separation isn't enough.


These aren't particularly recent. Java originally had green threads; I think the term itself is from there.


Wikipedia agrees with you on the source of the term. Both the M:N and N:1 threading models themselves are older than Java, but the term "green threads" seems to have come from there.


Oh, I didn't notice the Wikipedia claim was actually sourced. I think that pretty much settles it.

http://web.archive.org/web/20080530073139/http://java.sun.co...


Green threads were a thing long before Java.


The term, not the variety of threads. I can't find any mention of it pre '95 or so, at least.


'95, so about when significant amounts of information started being put on the internet?


I'm not so sure about the term 'green threads' but the internet was definitely around well before '95. There doesn't seem to be any mention of 'green threads' in available USENET archives before then, for instance. But if you know of one, anywhere, by all means, I'd love to hear about it.


What I was getting at was that around '95 the internet was really starting to come into general usage, and exponentially more information was published on the internet from that point on than before. So maybe an internet search wouldn't turn up results - not because the term didn't exist prior to '95, but because it wasn't published on the internet prior to '95.


I think MIPS RISC/os had 'green' threads in the '80s, or I could be mixing it up with Tandem terminology, where Fiber was a constant type, I seem to recall. Either way, the idea of a userspace-managed thread scheme is as old as the hills.

The question has always been: who deals out the work, the OS or the app? And as we can see, the question will continue to be asked, and go unanswered, probably ad infinitum.


Did it call them 'green threads'? If not, the terminology itself probably doesn't come from there.


I do remember Tandem and/or Wang talking about green (transportable) threads that were okay to suspend/resume across processor units. I wish I could find more info, but I really do recall the term being applicable way back when.


Erlang had green threads from the very beginning, in the 80s.


Erlang processes aren't threads and aren't called 'green' so I don't see how the term itself could have originated there.


It's not just about context-switching costs. One advantage of non-preemptible concurrency is that data races are much easier to reason about, because any operation that you know won't block becomes atomic by default (this doesn't apply to all models, e.g. ones that have M:N scheduling).

There are also other resource overheads. For example, I would think nothing of spinning up 10,000 greenlets/green threads in Python; they cost maybe 1 KB of memory each. But spinning up 10,000 OS threads? That's a little more dicey.


Oh, that actually makes a lot of sense.


"Fiber" [1] is another term for coroutine. Fibers are essentially an implementation of cooperative multithreading.

[1] https://en.wikipedia.org/wiki/Fiber_%28computer_science%29


Any simple C libraries that demonstrate this? I'm rather curious.


There's no official support for fibers in POSIX. There are lots of implementations anyway, but it's hard to get completely right. Long ago, the Win32 team got tired of people rolling their own subtly incorrect/incomplete fiber implementations for Windows and added official support to put an end to that support issue: https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...


Actually, there is: https://en.wikipedia.org/wiki/Setcontext

A library I personally recommend, which supports several platforms and back-ends: http://byuu.org/library/libco/
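
Typical usage, as a sketch against libco's tiny API (co_active/co_create/co_switch/co_delete):

    #include <libco.h>
    #include <cstdio>

    static cothread_t main_co;
    static cothread_t worker;

    static void worker_entry(void) {
        std::printf("step 1\n");
        co_switch(main_co);   // yield; a cothread must never return
        std::printf("step 2\n");
        co_switch(main_co);
    }

    int main() {
        main_co = co_active();                  // handle for the current context
        worker = co_create(64 * 1024, worker_entry);
        co_switch(worker);                      // prints "step 1"
        co_switch(worker);                      // prints "step 2"
        co_delete(worker);
        return 0;
    }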


A few years back I built an app which used coroutines extensively (<plug> greylisting SMTP proxy, now abandoned because I don't run my own mailserver any more, http://cowlark.com/spey/ </plug>). I originally used getcontext/setcontext.

Superficially it was really easy, but turned into a goddamned nightmare. The problem is that pthreads gets on really badly with user-allocated stacks; some pthreads implementations find the current thread ID by looking at a magic value on the current thread's stack, and of course with a user-allocated stack this wasn't there.

But why was I using pthreads alongside coroutines, you ask? Well, I wasn't. But I was linking to a library which referred to pthreads symbols, even though it never called them; that made the linker automatically pull in the thread-safe glibc, so malloc/free ended up wrapped in mutexes...

The net effect was bizarre and infrequent crashes that only showed up on machines which weren't mine and that I didn't have access to. I spent months debugging that.

Eventually I switched to simulating coroutines using pthreads with a big lock so that only one thread could run at a time. (Which actually turned out to be better, because now I could do some operations properly in parallel.)

Beware of setcontext! Bewaaaaaare!


setcontext isn't POSIX, so corysama is the best kind of correct. But setcontext exists on most flavors of Linux, FreeBSD, and OS X, so you can basically assume it's supported.


setcontext was POSIX until POSIX.1-2008.


Just a fair warning: setcontext is horrendously slow. It seems to be doing kernel transitions or something. It's about 600 times slower than longjmp with jmpbuf stack-pointer hacking, which is itself already about twice as slow as an ideal assembly version.
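
If you want numbers on your own machine, a round-trip microbenchmark is easy to build (a sketch; I believe the main cost is that swapcontext saves and restores the signal mask with a sigprocmask syscall on every switch):

    #include <ucontext.h>
    #include <time.h>
    #include <cstdio>

    static ucontext_t main_ctx, co_ctx;
    static char stack_buf[64 * 1024];
    static const long N = 100000;

    static void bouncer(void) {
        for (;;) swapcontext(&co_ctx, &main_ctx);   // bounce straight back
    }

    int main() {
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack_buf;
        co_ctx.uc_stack.ss_size = sizeof stack_buf;
        co_ctx.uc_link = &main_ctx;
        makecontext(&co_ctx, bouncer, 0);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            swapcontext(&main_ctx, &co_ctx);        // two switches per iteration
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        std::printf("%.0f ns per round trip\n", ns / N);
        return 0;
    }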


Our team just got hit by this. We are using fibers in a custom C++ game engine without issue on Windows, OSX, and some console platforms. Recently we began the work to bring the code base to iOS. Guess what? No fiber support. Doh!


It looks [1] like http://byuu.org/library/libco/ (mentioned below) might work on iOS. I would definitely be interested in fiber libs that are confirmed to work on iOS and/or Android if anyone knows any. libco looks like it might fit the bill, but it's not completely clear and I have not tried it.

[1] http://libretro.com/forums/archive/index.php?t-785.html


libco relies on being able to map WX pages for a rather silly reason[1], so it won't run on iOS without some work.

[1] The author didn't like requiring assemblers to build his projects, so instead the coroutine switch machine code is embedded in the binary as a byte array, which is then copied to a page with WX permissions when the library is initialized.

https://github.com/creationix/libco/blob/master/amd64.c


Why do you feel [1] is a silly reason? You can build my software right now in one step: "install TDM-GCC"; anything additional complicates things and reduces portability. There are also versions that don't require assembler code at all (the setjmp/longjmp version), and you can also make an ios.c version that uses inline assembler, since that target will only ever have a single compiler anyway.


Ironic in light of your comments on the OpenBSD forced W^X thing, but I happen to agree that strict W^X should be used everywhere possible. It may not matter for higan (though it would be funny to see someone exploit it), but it could be a deal breaker for more security-conscious projects that might want to use libco (say, for a cothreaded server).

As for assemblers, I just don't see how requiring gas or maybe NASM/YASM for x86 would complicate the build process so much. Heck, gas is a dependency of gcc. The only problem I see is for Visual Studio users, which could be solved with using NASM/YASM for Windows (which is not really a big deal when the co_swap_function has to be different due to the calling conventions anyway), or a separate MASM implementation for Visual Studio (giving you duplicate implementations with VS Windows and gcc Windows on x86 and x86_64 respectively).

I know you're consistent with your views on code clarity and that this is a hobby for you etc, but libco seems like a tiny, static enough library that it would make sense to accept a little ugliness in order to Do The Right Thing™.

(PS: Thank you for higan, I've learned so much from it over the years.)


> I happen to agree that strict W^X should be used everywhere possible

I agree. I just feel there are exceptions, one of which is dynamic recompiling emulators.

> it could be a deal breaker for more security-conscious projects that might want to use libco (say, for a cothreaded server).

I was worried that since mprotect works on entire pages, that by clearing the write flag, I might be removing writability to certain variables. But, theoretically, static char arrays should always end up in the .data section, so it should be fine. Still feel uneasy about it.

Unfortunately, there's no POSIX way to read the current protections, so I can't just apply |X; plus that's probably not safe.

I could allocate three _SC_PAGESIZE blocks of memory (for alignment), mark the middle block R|W, copy the function to it, and then mark the block R|X. As long as OpenBSD allows those two calls, that should work.
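
Roughly this (untested sketch; with mmap the pages come back aligned, so the three-block dance isn't strictly needed, and MAP_ANONYMOUS may be spelled MAP_ANON on some BSDs):

    #include <sys/mman.h>
    #include <string.h>

    // Copy `len` bytes of machine code into memory that is first R|W
    // (for the copy) and then R|X (for execution), never W|X at once.
    static void* make_executable(const void* code, size_t len) {
        void* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return NULL;
        memcpy(p, code, len);
        if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0) {
            munmap(p, len);
            return NULL;
        }
        return p;
    }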

> it would make sense to accept a little ugliness in order to Do The Right Thing™

I don't see how requiring Windows users to download yasm (64-bit) or nasm (32-bit) is doing the right thing. Every single additional step required to compile my software will result in more and more people giving up or not bothering.


We ended up integrating some Boost ARM assembly code that did the trick (both 32- and 64-bit). I don't personally have the details, because I didn't do the integration, but after a few iterations and some tweaks it worked for us.


It has an ARM implementation, but there are lots of subtle ARM variations. You'd probably have to tweak it a tad, but the core ARM cothread switching code is a mere three instructions, so it shouldn't be too hard to adapt:

    0xe8a16ff0,  /* stmia r1!, {r4-r11,sp,lr} */
    0xe8b0aff0,  /* ldmia r0!, {r4-r11,sp,pc} */
    0xe12fff1e,  /* bx lr                     */
If you do, send me a patch and I'll merge it upstream.



And mongrel2, which I hack on occasionally, uses libtask to good effect for this. Its architecture is much simpler than this presentation's, though, because mongrel2 is one thread per process, so inside the process there is only one fiber executing at any given time.


Libmill could be worth a look. https://github.com/sustrik/libmill


Basically, they are fine-grained tasks that get executed by a pool of threads.
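
In its simplest form, minus the fiber-based waiting and scheduling tricks from the talk, a minimal C++11 sketch of that looks like:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Minimal job queue: worker threads pop std::function jobs.
    class JobQueue {
    public:
        explicit JobQueue(unsigned n) {
            for (unsigned i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~JobQueue() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }
        void submit(std::function<void()> job) {
            { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                    if (done_ && jobs_.empty()) return;
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job();   // run outside the lock
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> jobs_;
        std::vector<std::thread> workers_;
        bool done_ = false;
    };

The difference in the talk's system is that a job can yield mid-execution (because it runs on a fiber) instead of blocking its worker thread.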


A note: input latency will be 17 ms longer with the pipelining versus running at 30 fps. That's still almost certainly a huge win, though, as the smoother rendering at the higher rate will make the game feel more responsive, particularly since input latency doesn't go much below 100 ms these days.


Why? As he explains in the Q&A session after the talk, two frames of latency at 30 Hz (2 x ~33 ms = ~66 ms) is still more than three frames of latency at 60 Hz (3 x ~16 ms = ~50 ms). You will always have at least two frames of latency if you split the game-logic and render stages like they did for TLoU on PS3, so their PS4 engine in fact has both a better frame rate and lower latency.


Before they implemented pipelining, they had all of the stages running in under 33 ms on the PS4. Yes, this is lower input latency than on the PS3, but that's beside the point.


Super interesting, but the audio level is too quiet on my laptop even with everything set to max. Can it be downloaded from anywhere?


FWIW, youtube-dl can easily download the video, e.g. `youtube-dl http://www.gdcvault.com/play/1022186/Parallelizing-the-Naugh...`. You can use it almost everywhere, despite the name.


I had the same issue on my Mac. Ended up buying Boom 2 to boost the volume. I wouldn't have just for one talk, but it's been happening a lot lately where I find content volume to be too low.


Is there a downloadable version?


Not directly. I think it is part of the PlayStation SDK (but I'm not a game developer). There are libraries based on the same concept (many execution contexts per thread):

  * https://github.com/RichieSams/FiberTaskingLib  
  * https://swtch.com/libtask/  
  * https://github.com/halayli/lthread  
  * https://github.com/stevedekorte/coroutine
RethinkDB uses something like this: http://rethinkdb.com/blog/improving-a-large-c-project-with-c... http://rethinkdb.com/blog/making-coroutines-fast/


Coroutine libraries are small and easy enough to write (if you know assembly) that most people roll their own. libco (used in the bsnes/higan emulators) is a good, portable one.



