Game Programming Patterns: Double Buffer (2014) (gameprogrammingpatterns.com)
139 points by tosh on Aug 27, 2018 | 58 comments



I thought this would only apply to rendering graphics, but after reading the "When To Use It" section, I realized I've done double buffering on entire game states before (on a project ~2 years ago). At the beginning of my update loop, I'd (deep) copy the current game state into a new object and incrementally update the copy. Then I'd reassign it right before sleeping (Thread.Sleep or whatever the language's idiom is) until the next game loop "tick".

Wasn't too fond of my C# deep copy solution: var serialized = JsonSerializer.Serialize(this); return JsonSerializer.Deserialize<GameState>(serialized);
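
In C++ terms, that copy-update-swap loop might look something like this rough sketch (the GameState fields here are just placeholders):

  struct GameState {
      double playerX = 0.0;   // ... and whatever else the game tracks
  };

  GameState states[2];
  int current = 0;

  void tick() {
      GameState& next = states[current ^ 1];
      next = states[current];   // the "deep copy" (here just a value copy)
      next.playerX += 1.0;      // incrementally update the copy
      current ^= 1;             // the "reassign" right before sleeping
  }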

I took an interest in functional programming, pure functions, immutability, etc. soon after.


"Double buffered" (transactionally updated) game state is basically required if you want to avoid strange nondeterministic glitches as entities update based on partially updated world state. Persistent immutable data structures are great for this purpose as they typically attempt to minimize actual copying. UI logic in regular applications also greatly benefits from immutable state, which is indeed what things like React are based on.


Double buffering and determinism are not necessarily related. Determinism requires that everything that could affect the execution of logic is captured as part of the input, and (this is the annoying and error-prone part), that anything that is not captured as part of the input does NOT affect execution of logic.

If execution of logic is always performed in the same order, then the partially updated world state is not a problem for determinism (though it may be for correctness). Double buffering may let you reorder the execution of logic in different ways and still keep it deterministic. This would likely be needed for logic that executes in parallel threads, which is increasingly important for high-performance game engines.


Yes, true, "unpredictable" is probably better. Depending on how exactly entity lists are managed, from a player's or dev's point of view it can seem close to nondeterministic in practice, even if a theoretical 1:1 replay would give identical results.


Another option is to build a set of diffs (Unit::MoveTo, Score::Add, ...) based on the existing (immutably viewed) state, then apply them mutably (possibly checking for conflicts on MoveTo et al.):

    diff = differential(state)
    state.update(diff) # or state = update(state,diff)
    # or state += ∂_∂t(state) if you're feeling overly mathy
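
As a rough C++ sketch of that shape (the diff types and rules are made up purely for illustration):

  #include <variant>
  #include <vector>

  struct MoveTo   { int unit; float x, y; };
  struct AddScore { int amount; };
  using Diff = std::variant<MoveTo, AddScore>;

  struct State { int score = 0; /* units, etc. */ };

  // Read-only pass over the current state: decide what should change.
  std::vector<Diff> differential(const State& s) {
      std::vector<Diff> diffs;
      if (s.score < 10) diffs.push_back(AddScore{1});
      return diffs;
  }

  // Mutable pass: apply the diffs (conflict checks would go here).
  void update(State& s, const std::vector<Diff>& diffs) {
      for (const Diff& d : diffs)
          if (auto add = std::get_if<AddScore>(&d)) s.score += add->amount;
  }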


Is there a reason you specified partial derivative with respect to time? Why not just d/dt(state)?

I've been tempted lately to model application state as the integral over time of all events (deltas) that have occurred. For example, imagine a simple game state:

  type GameState = {
    x: float; // x coordinate
    t: float; // current time
    v: float; // velocity variable
  }
If initial game state =

  { x: 0,
    t: 0,
    v: 0, };
and you have some events (deltas):

  [{t: 1, v:1}, {t:2, v:-1 /*this is deltaV, so back to v: 0*/}, {t: 3, v:2}, {t:5, v:-2}]
You could "sum" them up (integrate over time) and get a game state of x = 5, v = 0 where t >= 5.

I've found this approach kinda sucks when you add things like collision detection. Your application would have to emit velocity deltas whenever objects collide. If you've got a point that bounces around in a box with perfectly elastic collisions, then over time you'd have an infinite number of these collision/velocity delta events. And as soon as a new external event arrives (say, your player logs out), all your precomputed collision events are invalidated. So the traditional update loop that simulates each tick as it occurs seems to work best.

Is there a more general way of thinking about this integration? Maybe with respect to another variable? Perhaps it would address this problem I'm having where I feel forced to quantize my game state onto ticks.


A related idea—representing state as pseudo-continuously varying values—is Conal Elliott's original formulation of functional reactive programming in the context of animation [1].

[1] Elliott, Conal; Hudak, Paul (1997), "Functional Reactive Animation" <http://conal.net/papers/icfp97/>


This is a very nice article on double buffering, and I especially love the extension to game state. I never thought of that as double buffering, but yes, I'm frequently keeping "previous frame" state for a lot of different systems including physics, controllers, AI state machines, etc. It's usually ad-hoc buffering; each system handles its own. This makes me wonder if it would make any sense to unify previous-frame memory, what it would look like, and whether there might be advantages.

> In their hearts, computers are sequential beasts. [...even with several cores, only a few operations are running concurrently.]

My one small nitpick is that the presentation at the beginning felt very single-core-CPU centric, even with the side note on threading. GPUs are themselves parallel processors, and they're used in parallel with the CPU; many games run things like CPU physics while the GPU is doing rendering. And to really drive the point home, many games use triple buffering and do some amount of overlapped rendering of two frames at the same time, not just one.


Interestingly enough, there were many games in the 8-bit era where you could watch the scene render from back to front (golf games were notorious for this), and going back even further, early video game systems had no frame buffer at all.


Though on the Apple II, I suspect that most games were essentially double buffered (although there is no video hardware to speak of). I wrote an arcade game when I was in high school (probably around 1985???) and the advice I got was to draw into a separate piece of memory and then copy that memory into the video memory -- otherwise you'd always have trouble with video artefacts (especially because the colour output depended on the bit patterns in memory).

It was actually important for another reason: the video memory was not contiguous. IIRC, it was divided into three. So the first row was the first scan line, the second row was 1/3 of the way down the page, and the third row was 2/3 of the way down the page. To compose graphics on the screen, you had to use this weird algorithm for drawing into memory. Compositing right in the video memory would have resulted in lots of timing problems.
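
For the curious, the "weird algorithm" falls straight out of that interleaved layout; the usual hi-res row-address calculation looks roughly like this (from memory, so treat the constants as approximate):

  #include <cstdint>

  // Address of the first byte of scanline y (0-191) on hi-res page 1 ($2000).
  uint16_t hires_row_address(int y) {
      return 0x2000
           + (y & 0x07) * 0x400         // scanline within the character row
           + ((y >> 3) & 0x07) * 0x80   // character row within the third
           + (y >> 6) * 0x28;           // which third of the screen (40-byte rows)
  }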


> draw into a separate piece of memory and then copy that memory into the video memory

While the Apple ][ had no graphics hardware to speak of, it did have two graphics buffers that you could switch between.

So you would draw into the non-displayed buffer and switch to that.

If you had installed the vertical-blank mod, you could even do that without tearing. (I think the //e came with that).


One extreme example I remember very well is The Hobbit on the ZX Spectrum. It was an illustrated text adventure, using graphics to add atmosphere and interest. I suspect they used a drawing programme with scripted commands to draw the images, minimising memory usage at the expense of time and sometimes fine detail.

There are a couple of good examples at 4:57 onwards. Link at https://www.youtube.com/watch?v=3f3PFfK-9Gk

Edited to add: there is a video here https://www.youtube.com/watch?v=D9qrZC7WUio of the 128k Spectrum version. Looking at the speed, I suspect they pre-rendered the images in the background, used the extra memory to store them, and then copied the completed image to the display when needed.


Hi, I’m the guy who did the 128K port.

The images are all hand-drawn art, stored as compressed data and simply decompressed when needed. We don’t render (or prerender) anything.


Ah OK, I hadn't appreciated that the video was a remake and had assumed that they were still looking to keep a limit on loading times.

Nice work, good to see these old systems still being played with and experimented on.


As ddingus implied, in the C64 era you'd actually have state-change logic executed as the beam moved across the screen. For instance, the VIC graphics chip only had support for eight hardware sprites at a time. But by using the hsync interrupt you could reuse those sprites and draw each at different coordinates as long as they were in different regions of the screen, vertically.

[1] https://en.wikipedia.org/wiki/Raster_interrupt
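
Very roughly, a multiplexer's raster handler just rewrites the sprite registers and schedules the next interrupt line; a hand-wavy C-style sketch (VIC-II register offsets from memory, interrupt installation omitted):

  #include <stdint.h>

  static volatile uint8_t* const VIC = (volatile uint8_t*)0xD000;

  // Called from an already-installed raster IRQ. Reuses hardware sprite 0
  // further down the screen, then arms the interrupt for the next slot.
  void raster_handler(void) {
      VIC[0x19] = 0x01;   // acknowledge the raster interrupt
      VIC[0x00] = 120;    // sprite 0 X position for the "second" object
      VIC[0x01] = 200;    // sprite 0 Y position for the "second" object
      VIC[0x12] = 240;    // raster line for the next interrupt
  }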


This also enabled more colours, as you were limited to something like 8 colours (2 per sprite and 1 background) in multicolour mode. At different vertical positions you could swap out the banks between scans, and this worked particularly well for horizontal scrollers ...


The golf games (Leaderboard Golf comes to mind: https://www.youtube.com/watch?v=4jA5eKk10Wo&t=803) had to do this because it took an incredibly long time to render a 3D map on the screen. No chance those machines could've rendered the whole scene within a few frames.

The games just made the drawing a visual gimmick for the player to watch, instead of rendering to a backbuffer, copying the image over, and showing a waiting screen in the meantime.

Frankly, I wouldn't be surprised if the rendering was even made a bit slower so that the player could watch the areas fill up at human speed. For example, a C64 with a 6502 can fill polygons faster than what you see on the screen there.


I don't know ... I remember using a paint program where something like a flood fill actually ran before your very eyes. At something like 4-5 MHz there isn't a lot of CPU to do your bidding, so game logic, music, and graphical rendering all take a slice, and you don't even have any kind of preemptive scheduler to make sure everyone stays in time. I'd say with this kind of game the developers were more focused on the game mechanics. The graphics were just "good enough" for what they were trying to do.


The C64 ran at 1 MHz. Other contemporary CPUs that were clocked faster tended to take multiple clock cycles per instruction, so it came down to about the same.


Was it the Spectrum that ran at 4 MHz? Thanks for the correction!


Not the poster, but yes, the Spectrum ran at about 3.5 MHz.


> I wouldn't be surprised if the rendering was even made a bit slower

Trust me, everything was flat out then. I posted a link in another reply about something similar. Have a look at this flood fill; it was zipping along at full speed.

> There are a couple of good examples at 4:57 onwards. Link at https://www.youtube.com/watch?v=3f3PFfK-9Gk


That looks like a traditional flood fill, which is a different case, but I might be wrong.

Leaderboard had vector graphics, so the program knew analytically where the polygon edges were and could simply write bytes into the bitmap, in most cases without reading back pixels.

Let's say the C64 had roughly, in practice, about 15k usable cycles per frame. A quick calculation shows that filling the entire 320x200 bitmap (8000 bytes) in a trivial loop is on the order of 30-40k cycles for the stores alone, plus any stalls when the graphics chip reads memory at the same time. Add to that the vector math and the fact that not the whole screen will be filled, and we should be able to count filling the scene in terms of frames, not several seconds.

Leaderboard might have sloppy routines to make it 'watchable', but I suspect the C64 can fill faster than that if pushed.

Filled vectors were notoriously hard on the C64, as both the bandwidth usage and the CPU cycles spent on math rack up. Yet games like Rescue on Fractalus had simpler, but still filled, vectors in real time.


Early 3D games are primarily limited by how they handle occlusion. If your scene is designed favorably (e.g. wireframe spaceships), you can ignore occlusion, but that doesn't describe most interesting scenes.

Leaderboard is using a kind of painter's algorithm to do everything, which is flexible but requires overdraw, and, by itself, will produce rendering errors in many scenes because depth is sorted per object or per triangle rather than per pixel. Filled-triangle rasterization is particularly expensive to compute and requires a fair amount of numeric precision, both of which compound the cost of overdraw.
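
The core of a painter's algorithm is just a depth sort plus back-to-front overdraw; a toy sketch (per-polygon sort, which is exactly where those rendering errors creep in):

  #include <algorithm>
  #include <vector>

  struct Polygon { float depth; /* vertices, colour, ... */ };

  void draw_filled(const Polygon&) { /* rasterize into the frame buffer */ }

  void paint(std::vector<Polygon> scene) {
      // Farthest first, so nearer polygons simply overdraw farther ones.
      std::sort(scene.begin(), scene.end(),
                [](const Polygon& a, const Polygon& b) { return a.depth > b.depth; });
      for (const Polygon& p : scene) draw_filled(p);
  }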

Fractalus does not use these techniques for terrain - although the exact implementation is idiosyncratic [1], it's ultimately filling an outline that defines the horizon, and then adding texture to the landscape with dot patterns. Bear in mind, it doesn't render to the full screen either -- it's mostly HUD -- while Leaderboard is a little more expansive.

[1] A similar technique is described for Captain Blood: http://bringerp.free.fr/RE/CaptainBlood/main.php5


A similar technique was used in Last Ninja 2, where the 3D isometric screens would be visibly redrawn, with the objects drawn one over another. It was actually a very cool effect. I don't know if the underlying technical reasons were similar though.


Another problem with double buffering is the need to copy the buffer. This is trivial now, but it wasn't in the early years. I grew up way behind the tech curve -- we had a few 1 MHz computers in our non-US university lab and a single 4 MHz machine (with a color monitor, wow) which seemed super fast.

As frames are often similar (only a small portion really changes), on those slow machines many folks preferred to update a small portion of the single buffer at the right time instead of copying the whole thing.


Often, the buffer was never copied. Instead, the video card was told to read from the old back buffer (which becomes the new front buffer), and the old front buffer (which the video card was scanning out last frame) becomes the new back buffer. This only requires a few instructions, so it can easily be done during the monitor's vertical blank.
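
A sketch of that kind of flip, assuming hypothetical wait_for_vblank / set_scanout_address hooks standing in for whatever the hardware actually provides:

  #include <cstdint>

  constexpr int kWidth = 320, kHeight = 200;
  uint8_t buffers[2][kWidth * kHeight];
  int backIndex = 1;   // the buffer the game is currently drawing into

  void wait_for_vblank() { /* poll or wait for the vblank interrupt */ }
  void set_scanout_address(const uint8_t* base) { /* program the display hardware */ }

  void present() {
      wait_for_vblank();
      set_scanout_address(buffers[backIndex]);  // old back buffer becomes visible
      backIndex ^= 1;                           // old front buffer is the new back buffer
  }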


That is entirely dependent on the hardware. Some machines could, some couldn't, and some could but mucked it up.

The Spectrum couldn't, so you either wrote to the live video display or used some method of copying.

The QL could in theory switch the video display between two buffers, but they put a section of I/O/system memory (I can't remember the details) in the middle of one of the buffers, so it was useless.

Some systems also had hardware sprites which weren't part of the main video RAM; they were overlaid by hardware. You could also do some fun stuff with those, such as rewriting them as the scanline progressed down the screen to fudge a lot more hardware sprites.

I'm out of the industry now but one of the major parts of games programming was pushing the hardware way beyond what it was designed to do. Now, as far as I understand, you write for an API.


Race that beam!

For simpler things, not 3D, sorting objects and a single buffer render still work just fine.


Well, except that to prevent tearing, you have to draw just behind the monitor's refresh line (well, at least when we had CRT monitors). And that required an awful lot of super accurate timing (on PC at least).


Or, draw just ahead of it. I think that is a matter of perspective.

The idea being to have the graphics there for the display device to render them. Once it has done that, they are no longer needed, until next frame.

Doing that actually needs only a small buffer, but it is timing-intensive, and demanding on compute and writes.

If one has a sprite capability, and/or a larger or full frame buffer, just write from the top down, starting at the beginning of the blanking period from the previous frame.

Drawing behind works simplest assuming a full frame buffer, so the new follows the old once the display has pushed those pixels to eyeballs.

Another scheme, used in DEFENDER and ROBOTRON, among many others, is to divide the screen into top and bottom regions. Or more, depending.

Render into the one the display is not using. This is more coarse, but can be less complex to arrange.

The timing depends. If one is working scanline by scanline, yeah. Need accurate timing.

When working with screen regions, or just a blanking period, the requirement is relaxed. Really, all you need is to understand the speed of the object draw engine and a quick sort to ensure the initial part of the frame is drawn in time for the display.

Screen regions also work well when the renderer cannot get a full frame drawn every display frame. There, a priority system will draw what is needed, delay that which is not.

I enjoy doing this kind of programming. A full double buffer is a luxury :D


Cool stuff! Perhaps one could have even written some routines to kinda abstract that?


An "engine" could take care of synchronizing rendering with reading, but the difficult part is application specific: ensuring that rendering fits in the specified number of CPU cycles.


For many cases, one ends up writing a little kernel that manages the display.

Break the display into whatever elements make sense, and ensure timing is all sorted. Run tests to get maximums.

Once it is running, the rest of the program feeds it with pointers to objects, and lists of the objects get generated dynamically at display time.

Various schemes apply to overtime requirements. In old Williams arcade games, one method was to just move baddies to a non-displayed region. Aggressive players would fire off a ton of shots to trigger this.

Another is to delay animation or movement.

The kernel can do that, but it gets more complex and slow. Dumber kernels are faster. Higher level code can manage what the kernel has to do.

One thing that is smart to do, though, is to have the display kernel fail gracefully. Use hardware timers, or loop counters, to prevent overdraw and the loss of display sync normally associated with that condition.


There's a bit of discussion about isometric tile rendering nonsense, although not specific to vsync, in David Brevik's Diablo postmortem: https://www.youtube.com/watch?v=VscdPA6sUkc

along with a whole lot of other fascinating stuff


That was a great video. Enjoyable talk.

At one point, they were drawing tiles, the walls associated with them, objects, etc...

Moving objects would end up between tiles and the sprite would be cut off by a different tile draw. Worse, there was no sort order that worked for all cases. They ended up creating what the speaker called "an AI" to move the object away from the draw conflict, get it all drawn, then move the player back, just ahead of the actual motion...

In hindsight, they could have just decoupled the walls from the tiles and it all could have been a lot simpler!


The 16-bit era meant that machines had a bit more memory to play with so double-buffering became a common technique on platforms like the Amiga.


True, and on the Amiga it was especially nice since the location of the frame buffer was under software control, so no need to copy a full screen's worth of pixels anywhere.

To swap, just write the new frame buffer base address to the appropriate custom chip registers, and you're done.

Since the swap should be synchronized to the beam, the actual write typically happened in a copper list anyway, so that's where you'd really do the write in order to perform the swap.
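
From memory (so take the offsets with a grain of salt), the copper entries for a one-bitplane flip are just two MOVEs into BPL1PTH/BPL1PTL; building them from C might look like:

  #include <stdint.h>

  // Patch a copper list so bitplane 1 points at `addr` next frame.
  // BPL1PTH/BPL1PTL are custom chip register offsets 0x0E0 / 0x0E2.
  void point_bitplane_at(uint16_t* cop, uint32_t addr) {
      cop[0] = 0x00E0; cop[1] = (uint16_t)(addr >> 16);      // MOVE high word
      cop[2] = 0x00E2; cop[3] = (uint16_t)(addr & 0xFFFF);   // MOVE low word
  }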

It all brings me back to being a teenager and discovering these things. :) Amiga <3.


If this article tickles your fancy, you should read the whole book. Bob did a great job on it! It's all available on the website for free, or you can pay for a nice copy if you'd like to support his efforts.

He's also writing a new book about creating interpreters: https://craftinginterpreters.com/


Thank you for the compliment! :)


Thank you for the book! The chapter on the prototype pattern was the first time I fully grokked the Factory pattern. And yours is the only resource I've found with thoughtful, constructive things to say about singletons.


On games this can be quite expensive in terms of frames per second and smoothness because of how it interacts with the monitor's refresh rate. If the next buffer isn't quite ready within 1/60th of a second for the start of the monitor's refresh, you basically have to sit there doing nothing until the monitor is ready for you again. Now the user sees the same image for two refreshes and then a jump, and your frames per second has fallen off a cliff. You can try to mitigate that with triple buffering, but that takes up a lot of memory you'd rather use for other things.
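
To put numbers on it: at 60 Hz the deadline is ~16.7 ms. If a frame takes even 17 ms with vsync on, it misses the vblank and the previous image stays up for a second refresh (~33.3 ms), so a game rendering at "almost 60 fps" can end up presenting at an effective 30 fps.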

AMD now has a standard for interacting with the monitor refresh called FreeSync. The game can communicate with the monitor to tell it when the buffer's ready, to get around the problems caused by a fixed 60 Hz refresh. A lot of 'gaming' branded screens support it now.


(a) this uneven frame-pacing is called "judder"

(b) you only get judder with vsync=on, vsync=off just flips to the new buffer as soon as it's available. The downside is that you tend to have a visible seam or "tear" in the image when the buffer flips.

(c) the memory for triple-buffering is trivial, that's not the problem. The real problem is that you need to be pushing a framerate at least as high as your monitor's refresh rate, and ideally 2x your refresh rate. That's undesirable in a world where you don't have infinite money to spend on hardware.

(d) NVIDIA actually pioneered this technology; AMD came in with a copycat implementation after the fact. AMD has the advantage of not needing special hardware in the monitor, but it usually comes with the disadvantage of only being able to sync over a narrow range, such as 40-60 Hz. Some monitors have as narrow as a 10 Hz sync range, and many of them tend to flicker once they get down to the lower end of their sync range. There are currently a total of three FreeSync monitors on the market that don't totally suck.

http://techreport.com/news/25867/amd-could-counter-nvidia-g-...

https://gpunerd.com/guides/freesync-monitor-list


> with the disadvantage of only being able to sync over a narrow range, such as 40-60 Hz.

1) This is not a problem with FreeSync itself but with the panels and their support for VRR.

2) I have 2 FreeSync monitors at home which both do 40-144 Hz (Acer XF270HUA and BenQ XL2730Z). In addition, more than 2 years ago AMD added LFC to help with the classic "FPS drops below the FreeSync range" issue. I don't notice any issues with judder when my framerate goes below the range for 1 or 2 frames.

Your info is out of date. Nowadays FreeSync is just as good as GSync if you get a monitor of good quality (and you still pay $150-300 less just because it isn't GSync).


LFC only works on models that already have a wide sync range, which is the reason that stuff like 40-60 Hz panels are a problem in the first place. The top end of the sync range needs to be at least 2.5x the rate of the bottom end of the sync range (it works by doubling frames up when you go off the bottom end of the sync range).
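
Worked example: on a 40-144 Hz panel, a game dropping to 35 fps can have each frame shown twice at 70 Hz, comfortably inside the range; on a 40-60 Hz panel, doubling 35 fps would need 70 Hz, which is outside the range, so LFC can't kick in.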

XF270HU/HUA (-A is a later revision of the same model) is one of the handful of good models (wide sync range). The Nixeus EDG has a wide sync range plus is the only FreeSync model known to have Adaptive Overdrive, which is important for preventing ghosting.

There are a very small handful of others, but only about 1/3 of FreeSync monitors even support LFC, and the EDG is the only (!) one that does Adaptive Overdrive, which is a basic feature on all GSync monitors. So no, this is not "out of date" at all, there really are only a handful of FreeSync monitors that are "just as good as GSync".


Are FreeSync and GSync interchangeable? Can a FreeSync monitor work with an NVIDIA card, or viceversa?


No. When NVidia created GSync, they enforced quality far beyond what the specification could manage and used proprietary technology and consultant engineers to accomplish it. AMD copied their efforts with Freesync by getting some of it added to the DisplayPort protocol. NVidia sees it as an inferior copy and won't support it. AMD and many others see GSync as an exclusionary market grab through tie-ins.

Despite many claims to the contrary from Freesync supporters, it's still the case that only a handful of Freesync displays will give near the same experience as nearly all GSync displays.


> If the next buffer isn’t quite ready in 1/60th of a second

As a UK developer I found this caused a real problem when writing for the US market. We'd wring out the most performance we could developing against the 50 Hz refresh rate we had here in the UK, and then suddenly lose roughly 20% of our CPU time for the US market. Not always a problem, but it was certainly a consideration in how screen refreshes were timed.


Having grown up with this, I think it's part of the reason why React feels so natural and "right" to me.

Whenever I've used frameworks based on data-binding, my mind tries to rely on an analogy like modular synthesis, where elements are connected with wires (https://modularsynthesis.com/DJB-synth.jpg). That works great for tiny demos but poorly for highly dynamic UIs. Maybe I just need a different analogy to properly grasp it.


Yes, I think unidirectional data flow is the main selling point of React.


Weird comparison. The native DOM already performs some kind of buffering: all changes made to the DOM tree within a single JS task are rendered at once. React just has a slightly simplified virtual version of the DOM so it can be smarter about reuse, possibly yielding speedups.


I just meant that the virtual DOM feels like a backbuffer that you can render "from scratch" with every frame. I am well aware that the technicalities are quite different.


Great read! I'm still missing something though...

Is there synchronization between the GPU and the FrameBuffer to tell it that the GPU has finished drawing the current frame? This may be a rare case but I'm getting stuck at the Sample Code.

I'm assuming we're reusing the same FrameBuffer object so it seems like we're using the same two pixel array references. If the GPU can't tell the game it's not done with the current buffer, couldn't the game easily start writing into the buffer the GPU is currently working with if it's just swapping in its own loop? This assumes the game has already drawn the next frame and has swapped back to the former frame.

In the reverse case I guess the game loop could fall behind and we keep writing the same frame and introducing lag.


This is also built into slower hardware, such as the TI-83 calculator. There is a command to copy the buffer to the actual screen memory, so it updates all at once :)


What language is the code sample in? Looks very clean.


It looks like C++ to me.


Yes, I used a subset of C++ for the book. The goal was to be low-enough level to convince die hard game programmers that the examples were valid but clean enough for non-C++ programmers to read.

One of the things I find really interesting about C++ is that idiomatic, modern C++ is really gnarly looking. Lots of `std::` everywhere, tons of templates, etc.

But if you choose to eschew convention and roll your own stuff, you can actually write C++ in a pretty simple, clear style. There are significant downsides to doing that, of course. The prevailing style is prevailing for a reason. But I think it says something interesting about the language itself that it's so accommodating to a wide range of styles and levels of readability.


Wow. Very finely crafted. Didn't even recognize it at first.

Lacks all of the clutter I tend to associate with C++.



