Parallelizing the Naughty Dog Engine Using Fibers [video] (gdcvault.com)
120 points by Splines on May 26, 2016 | 63 comments



Similar talk: CppCon 2014: Jeff Preshing "How Ubisoft Develops Games for Multicore - Before and After C++11"

https://www.youtube.com/watch?v=X1T3IQ4N-3g

Ubisoft's talk spends more time getting into the weeds with atomic ops; Naughty Dog's is more of an architecture discussion. If you can only watch one, I'd recommend Naughty Dog's.


Cool.

Me and a friend did this at university with the Ogre 3D engine and failed miserably, because I didn't know much about C++ or thread safety.

We tried the actor model: every game object became an actor and sent messages around. The actors would then be spread across the available CPUs, so performance should scale with the number of cores. In the end we got it running on multiple cores, but the message overhead killed the performance, haha.


> Me and a friend did this at university with the Ogre 3D engine and failed miserably, because I didn't know much about C++ or thread safety.

Fiber safety is much easier to ensure than thread safety. On Windows, using fibers is very simple (just call ConvertThreadToFiber on your current thread if you haven't already, then CreateFiber), as opposed to POSIX systems, where you have to roll your own implementation of fibers (or use an existing library that is not part of the POSIX standard).
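
A minimal sketch of the Windows API (error handling omitted; the stack size is an arbitrary choice):

    #include <windows.h>
    #include <cstdio>

    static LPVOID g_mainFiber;

    // A fiber entry point must have this signature and must never return;
    // switch back to another fiber instead.
    static VOID CALLBACK MyFiberProc(LPVOID param) {
        (void)param;
        std::printf("hello from the fiber\n");
        SwitchToFiber(g_mainFiber);
    }

    int main() {
        // The current thread must become a fiber before it can switch to one.
        g_mainFiber = ConvertThreadToFiber(nullptr);
        LPVOID fiber = CreateFiber(64 * 1024, MyFiberProc, nullptr);
        SwitchToFiber(fiber);   // runs MyFiberProc until it switches back
        DeleteFiber(fiber);
        return 0;
    }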


makecontext/swapcontext was the POSIX-approved way to do fibers. It was removed from the latest POSIX standard because, apparently, who uses coroutines or fibers today? Around the same time, every other mainstream language was adding support for coroutines/generators/fibers.


makecontext/swapcontext doesn't give you an automatically growing stack (as opposed to its WinAPI equivalent). Yes, you can implement that too, I know. But if you don't want to go into the gory details, you have to allocate a "large enough" buffer for your coroutines, probably wasting memory.
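
For reference, the fixed-stack dance looks like this (minimal sketch; glibc still ships these calls even though POSIX.1-2008 removed them):

    #include <ucontext.h>
    #include <cstdio>

    static ucontext_t main_ctx, co_ctx;
    static char co_stack[64 * 1024];   // guess "large enough" up front

    static void entry(void) {
        std::puts("in coroutine");
        // returning here resumes uc_link, i.e. main_ctx
    }

    int main() {
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = co_stack;
        co_ctx.uc_stack.ss_size = sizeof co_stack;
        co_ctx.uc_link = &main_ctx;
        makecontext(&co_ctx, entry, 0);
        swapcontext(&main_ctx, &co_ctx);   // run the coroutine
        std::puts("back in main");
        return 0;
    }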


Technically you only waste address space, which should be plentiful, at least for a while.


Naughty Dog must be doing something completely off the charts compared to other developers. I challenge anyone to look at Uncharted 4 and show me a better-looking game.


I think CD Projekt RED are definitely in the same league as Naughty Dog - The Witcher 3 looks amazing (disclaimer: I haven't seen it on the PlayStation)


It has poor performance on consoles.


It doesn't look that great compared to fresh PC titles. Do other PS4 games look worse?


Which "fresh PC titles" are you referring to?


I guess the last Tomb Raider is a fair comparison? The Witcher and the just-released Doom also seem impressive.


For fibers (a.k.a. coroutines) in Rust, see mioco: https://github.com/dpc/mioco


An older but similar approach from the Doom 3 BFG engine: http://fabiensanglard.net/doom3_bfg/threading.php

I wonder if there is a good open-source C++11 project for this pattern (a job/task queue)?

Also, how does this pattern compare with just using futures/promises with a parallel executor?

https://code.facebook.com/posts/1661982097368498/futures-for...

or Grand Central Dispatch from Objective-C?
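
i.e. something like this trivial std::async sketch, instead of a hand-rolled job system:

    #include <future>
    #include <cstdio>

    static int simulate(int frame) { return frame * 2; }   // stand-in work

    int main() {
        // Each task hands its result back through a future instead of
        // decrementing a job counter the way the fiber systems do.
        std::future<int> a = std::async(std::launch::async, simulate, 1);
        std::future<int> b = std::async(std::launch::async, simulate, 2);
        std::printf("%d %d\n", a.get(), b.get());   // get() blocks until ready
        return 0;
    }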


Our (Facebook's) folly library has lots of components for doing stuff using fibers [1]. That sits alongside folly::futures [2] and wangle [3].

That said, without proper language/compiler support it's difficult to make them "safe" (for some definition). Even within FB, given these libraries, we are pretty wary of using fibers with C++ unless absolutely necessary; you've got to really need it. This is where the coroutines proposals working their way through the C++ committee can make life better.

[1] https://github.com/facebook/folly/tree/master/folly/fibers

[2] https://github.com/facebook/folly/tree/master/folly/futures

[3] https://github.com/facebook/wangle


I still don't quite understand what he means by a "fiber". Is this basically 160 blocks of memory allocated on the heap?


A fiber is like a thread in that it executes code and has its own stack, but it does not have preemption. The OS won't switch between fibers for you; you must manually switch in and out of executing one. Running in a fiber doesn't mean you've stopped using threads. A fiber runs in the context of a thread, and that thread still has preemption. But inside the thread, you can call functions to switch stacks in order to manually pause and resume execution inside the fiber.


I've noticed that lots of recent advances in threading are just ways to avoid OS context switching of threads and instead handle those switches manually (e.g. "green threads" in Python).


OS-level threads are a crude, big-hammer way of doing parallelism - want to run a task? Just switch out the whole 8 KB of stack, then switch back to a stack that's 99% identical. Cooperative multitasking was always known to be much more efficient if you could ever get it to work safely - but one misbehaving task killed everything.

Ironically, with deployment we're moving in the opposite direction: we've decided we really do need the complete isolation of separate VMs/containers, and we're willing to pay the performance overhead because process separation isn't enough.


These aren't particularly recent. Java originally had green threads; I think the term itself is from there.


Wikipedia agrees with you on the source of the term. Both the M:N and N:1 threading models themselves are older than Java, but the term "green threads" seems to have come from there.


Oh, I didn't notice the Wikipedia claim was actually sourced. I think that pretty much settles it.

http://web.archive.org/web/20080530073139/http://java.sun.co...


Green threads were a thing long before Java.


The term, not the variety of threads. I can't find any mention of it pre '95 or so, at least.


'95, so about when significant amounts of information started being put on the internet?


I'm not so sure about the term 'green threads' but the internet was definitely around well before '95. There doesn't seem to be any mention of 'green threads' in available USENET archives before then, for instance. But if you know of one, anywhere, by all means, I'd love to hear about it.


What I was getting at was that around '95 the internet was really starting to come into general usage, and exponentially more information was published on the internet from that point on than before. So maybe an internet search wouldn't turn up results - not because the term didn't exist prior to '95, but because it wasn't published on the internet prior to '95.


I think MIPS RISC/os had 'green' threads in the '80s, or I could be mixing it up with Tandem terminology, where Fiber was a constant type, I seem to recall. Either way, the idea of a userspace-managed thread scheme is as old as the hills.

The question has always been: who deals out the work, the OS or the app? And as we can see, the question will continue to be asked, and go unanswered, probably ad infinitum.


Did it call them 'green threads'? If not, the terminology itself probably doesn't come from there.


I do remember Tandem and/or Wang talking about green (transportable) threads that were okay to suspend/resume across processor units. I wish I could find more info, but I really do recall the term being applicable way back when.


Erlang had green threads from the very beginning, in the 80s.


Erlang processes aren't threads and aren't called 'green' so I don't see how the term itself could have originated there.


It's not just about context-switching costs. One advantage of non-preemptible concurrency is that data races are much easier to reason about, because any operation that you know won't block becomes atomic by default (this doesn't apply to all models, e.g. ones that have M:N scheduling).

There are also other resource overheads. For example, I would think nothing of spinning up 10,000 greenlets/green threads in Python; they cost maybe 1 KB of memory each. But spinning up 10,000 OS threads? That's a little more dicey.


Oh, that actually makes a lot of sense.


"Fiber" [1] is another term for coroutine. Fibers are essentially an implementation of cooperative multithreading.

[1] https://en.wikipedia.org/wiki/Fiber_%28computer_science%29


Any simple C libraries that demonstrate this? I'm rather curious.


There's no official support for fibers in POSIX. There are lots of implementations anyway, but it's hard to get completely right. Long ago, the Win32 team got tired of people rolling their own subtly incorrect/incomplete fiber implementations for Windows and added official support to put an end to that support issue: https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...


Actually, there is: https://en.wikipedia.org/wiki/Setcontext

A library I personally recommend, which supports several platforms and back-ends: http://byuu.org/library/libco/
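
Typical usage, as a sketch against libco's tiny API (co_active/co_create/co_switch/co_delete):

    #include <libco.h>
    #include <cstdio>

    static cothread_t main_co;
    static cothread_t worker;

    static void worker_entry(void) {
        std::printf("step 1\n");
        co_switch(main_co);   // yield; a cothread must never return
        std::printf("step 2\n");
        co_switch(main_co);
    }

    int main() {
        main_co = co_active();                  // handle for the current context
        worker = co_create(64 * 1024, worker_entry);
        co_switch(worker);                      // prints "step 1"
        co_switch(worker);                      // prints "step 2"
        co_delete(worker);
        return 0;
    }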


A few years back I built an app which used coroutines extensively (<plug> greylisting SMTP proxy, now abandoned because I don't run my own mailserver any more, http://cowlark.com/spey/ </plug>). I originally used getcontext/setcontext.

Superficially it was really easy, but turned into a goddamned nightmare. The problem is that pthreads gets on really badly with user-allocated stacks; some pthreads implementations find the current thread ID by looking at a magic value on the current thread's stack, and of course with a user-allocated stack this wasn't there.

But why was I using pthreads alongside coroutines, you ask? Well, I wasn't. But I was linking to a library which referred to pthreads symbols, even though it never called them; that made the linker automatically pull in the thread-safe glibc, so malloc/free ended up wrapped in mutexes...

The net effect was bizarre and infrequent crashes that only showed up on machines which weren't mine and that I didn't have access to. I spent months debugging that.

Eventually I switched to simulating coroutines using pthreads with a big lock so that only one thread could run at a time. (Which actually turned out to be better, because now I could do some operations properly in parallel.)

Beware of setcontext! Bewaaaaaare!


setcontext isn't POSIX, so corysama is the best kind of correct. But setcontext exists on most flavors of Linux, FreeBSD, and OS X, so you can basically assume it's supported.


setcontext was POSIX until POSIX.1-2008.


Just a fair warning: setcontext is horrendously slow. It seems to be doing kernel transitions or something. It's about 600 times slower than longjmp with jmpbuf stack-pointer hacking, which is itself already about twice as slow as an ideal assembly version.
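
If you want numbers on your own machine, a round-trip microbenchmark is easy to build (a sketch; I believe the main cost is that swapcontext saves and restores the signal mask with a sigprocmask syscall on every switch):

    #include <ucontext.h>
    #include <time.h>
    #include <cstdio>

    static ucontext_t main_ctx, co_ctx;
    static char stack_buf[64 * 1024];
    static const long N = 100000;

    static void bouncer(void) {
        for (;;) swapcontext(&co_ctx, &main_ctx);   // bounce straight back
    }

    int main() {
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack_buf;
        co_ctx.uc_stack.ss_size = sizeof stack_buf;
        co_ctx.uc_link = &main_ctx;
        makecontext(&co_ctx, bouncer, 0);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            swapcontext(&main_ctx, &co_ctx);        // two switches per iteration
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        std::printf("%.0f ns per round trip\n", ns / N);
        return 0;
    }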


Our team just got hit by this. We are using fibers in a custom C++ game engine without issue on Windows, OSX, and some console platforms. Recently we began the work to bring the code base to iOS. Guess what? No fiber support. Doh!


It looks [1] like http://byuu.org/library/libco/ (mentioned below) might work on iOS. I would definitely be interested in fiber libs that are confirmed to work on iOS and/or Android if anyone knows any. libco looks like it might fit the bill, but it's not completely clear and I have not tried it.

[1] http://libretro.com/forums/archive/index.php?t-785.html


libco relies on being able to map WX pages for a rather silly reason[1], so it won't run on iOS without some work.

[1] The author didn't like requiring assemblers to build his projects, so instead the coroutine switch machine code is embedded in the binary as a byte array, which is then copied to a page with WX permissions when the library is initialized.

https://github.com/creationix/libco/blob/master/amd64.c


Why do you feel [1] is a silly reason? You can build my software right now in one step: "install TDM-GCC"; anything additional complicates things and reduces portability. There are also versions that don't require assembler code at all (the setjmp/longjmp version), and you can also make an ios.c version that uses inline assembler, since that target will only ever have a single compiler anyway.


Ironic in light of your comments on the OpenBSD forced W^X thing, but I happen to agree that strict W^X should be used everywhere possible. It may not matter for higan (though it would be funny to see someone exploit it), but it could be a deal breaker for more security-conscious projects that might want to use libco (say, for a cothreaded server).

As for assemblers, I just don't see how requiring gas or maybe NASM/YASM for x86 would complicate the build process so much. Heck, gas is a dependency of gcc. The only problem I see is for Visual Studio users, which could be solved with using NASM/YASM for Windows (which is not really a big deal when the co_swap_function has to be different due to the calling conventions anyway), or a separate MASM implementation for Visual Studio (giving you duplicate implementations with VS Windows and gcc Windows on x86 and x86_64 respectively).

I know you're consistent with your views on code clarity and that this is a hobby for you etc, but libco seems like a tiny, static enough library that it would make sense to accept a little ugliness in order to Do The Right Thing™.

(PS: Thank you for higan, I've learned so much from it over the years.)


> I happen to agree that strict W^X should be used everywhere possible

I agree. I just feel there are exceptions, one of which is dynamic recompiling emulators.

> it could be a deal breaker for more security-conscious projects that might want to use libco (say, for a cothreaded server).

I was worried that since mprotect works on entire pages, that by clearing the write flag, I might be removing writability to certain variables. But, theoretically, static char arrays should always end up in the .data section, so it should be fine. Still feel uneasy about it.

Unfortunately, there's no POSIX way to read the current protections, so I can't just apply |X; plus that's probably not safe.

I could allocate three _SC_PAGESIZE blocks of memory (for alignment), mark the middle block R|W, copy the function to it, and then mark the block R|X. As long as OpenBSD allows those two calls, that should work.
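
Roughly this (untested sketch; with mmap the pages come back aligned, so the three-block dance isn't strictly needed, and MAP_ANONYMOUS may be spelled MAP_ANON on some BSDs):

    #include <sys/mman.h>
    #include <string.h>

    // Copy `len` bytes of machine code into memory that is first R|W
    // (for the copy) and then R|X (for execution), never W|X at once.
    static void* make_executable(const void* code, size_t len) {
        void* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return NULL;
        memcpy(p, code, len);
        if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0) {
            munmap(p, len);
            return NULL;
        }
        return p;
    }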

> it would make sense to accept a little ugliness in order to Do The Right Thing™

I don't see how requiring Windows users to download yasm (64-bit) or nasm (32-bit) is doing the right thing. Every single additional step required to compile my software will result in more and more people giving up or not bothering.


We ended up integrating some Boost ARM assembly code that did the trick (both 32- and 64-bit). I don't personally have the details, because I didn't do the integration, but after a few iterations and some tweaks it worked for us.


It has an ARM implementation, but there are lots of subtle ARM variations. You'd probably have to tweak it a tad, but the core ARM cothread switching code is a mere three instructions, so it shouldn't be too hard to adapt:

    0xe8a16ff0,  /* stmia r1!, {r4-r11,sp,lr} */
    0xe8b0aff0,  /* ldmia r0!, {r4-r11,sp,pc} */
    0xe12fff1e,  /* bx lr                     */
If you do, send me a patch and I'll merge it upstream.



And mongrel2, which I hack on occasionally, uses libtask to good effect for this. Its architecture is much simpler than this presentation's, though, because mongrel2 is one thread per process, so inside the process there is only one fiber executing at any given time.


Libmill could be worth a look. https://github.com/sustrik/libmill


Basically, they are fine-grained tasks that get executed by a pool of threads.
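
In its simplest form, minus the fiber-based waiting and scheduling tricks from the talk, a minimal C++11 sketch of that looks like:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Minimal job queue: worker threads pop std::function jobs.
    class JobQueue {
    public:
        explicit JobQueue(unsigned n) {
            for (unsigned i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~JobQueue() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }
        void submit(std::function<void()> job) {
            { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                    if (done_ && jobs_.empty()) return;
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job();   // run outside the lock
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> jobs_;
        std::vector<std::thread> workers_;
        bool done_ = false;
    };

The difference in the talk's system is that a job can yield mid-execution (because it runs on a fiber) instead of blocking its worker thread.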


A note: input latency will be 17 ms longer with the pipelining versus running at 30 fps. That's still almost certainly a huge win, though, as the smoother rendering at the higher rate will make the game feel more responsive, particularly since input latency doesn't go much below 100 ms these days.


Why? As he explains in the Q&A session after the talk, two frames of latency at 30 Hz (2 x ~33 ms = ~66 ms) is still more than three frames of latency at 60 Hz (3 x ~16 ms = ~50 ms). You will always have at least two frames of latency if you split the game-logic and render stages like they did for TLoU on PS3, so their PS4 engine in fact has both a better frame rate and lower latency.


Before they implemented pipelining, they had all of the stages running in under 33 ms on the PS4. Yes, this is lower input latency than on the PS3, but that's beside the point.


Super interesting, but the audio level is too quiet on my laptop even with everything set to max. Can it be downloaded from anywhere?


FWIW, youtube-dl can easily download the video, e.g. `youtube-dl http://www.gdcvault.com/play/1022186/Parallelizing-the-Naugh...`. You can use it almost everywhere, despite the name.


I had the same issue on my Mac. Ended up buying Boom 2 to boost the volume. I wouldn't have just for one talk, but it's been happening a lot lately where I find content volume to be too low.


Is there a downloadable version?


Not directly. I think it is part of the PlayStation SDK (but I'm not a game developer). There are libraries based on the same concept (many execution contexts per thread):

  * https://github.com/RichieSams/FiberTaskingLib  
  * https://swtch.com/libtask/  
  * https://github.com/halayli/lthread  
  * https://github.com/stevedekorte/coroutine
RethinkDB uses something like this: http://rethinkdb.com/blog/improving-a-large-c-project-with-c... http://rethinkdb.com/blog/making-coroutines-fast/


Coroutine libraries are small and easy enough to write (if you know assembly) that most people roll their own. libco (used in the bsnes/higan emulators) is a good, portable one.



