Overkilling the 8-queens Problem

colmmacc · on Feb 26, 2014

People often think that the N-queens problem is just an abstract exercise, good sample work that stretches the mind. But when we were building Route 53, we actually use the N-queens problem, it's part of how we assign Shuffle Shards;

     https://github.com/awslabs/route53-infima#shuffleshard

The solution we included in Infima is a recursive search with backtracking because of some other needs;

     http://git.io/332VLA

but it's the N-queens problem, in a real practical setting.

applecore · on Feb 26, 2014

Clickable:

https://github.com/awslabs/route53-infima#shuffleshard

http://git.io/332VLA

primitivesuave · on Feb 26, 2014

When I was a kid, my late uncle gave me and my cousins this big speech about what the N queens problem is and how it was applied in designing chips - I never really understood his verbal description. The link you just shared is so well written and gives such a great perspective on applications of N queens - thanks for sharing it.

mjn · on Feb 26, 2014

As an asm tutorial / case study, I found this neat. In terms of a conclusion, I found it potentially neat, but found myself wondering: cool, but how much of a win is all this special-cased effort, versus some kind of off-the-shelf constraint solver written in portable code? It'd be interesting to see a benchmark so we could have an idea if we're talking about a 50% win, orders of magnitude win, etc.

To get a rough baseline for what a generic solution could take, I took nqueens in Answer-Set Programming [1], set n=8, and ran it through clingo [2] on an Amazon micro instance, which gets me 15ms.

Another interesting question might be whether the two solutions have different scaling properties as n increases. The ASP solution scales like this (averaging a few runs and rounding):

   n   time
   8     15ms
  10     19
  15     50
  20    130
  25    290
  30    600
  35   1050ms [1.05s]

[1] http://www.hakank.org/answer_set_programming/nqueens.lp. This encoding can be optimized a bit as well, but is a decent starting point.

[2] http://potassco.sourceforge.net, a menagerie of logic-programming tools, written in C++. 'gringo' takes a high-level encoding and turns it into something more SAT-like, 'clasp' solves the SAT-like (variable-free) programs, and 'clingo' staples the two steps together into one binary. I used the version in Debian stable package 'gringo', which reports itself as "clingo 3.0.4 (clasp 1.3.10)"

mzl · on Feb 26, 2014

N-queens, while quite a cute problem, is typically not that interesting as a constraint programming benchmark. The way to make it fast is to have a bit-set representation of the domains, which is not the best representation for many other important cases.

I tested the standard Gecode (http://www.gecode.org) implementation from version 4.2.1 without any special options on a MacBook Pro with a 2.6GHz Intel Core i7. Note that Gecode does not have bit-set representations for domains, so it is not optimized for this special case. Tests were run for 500 iterations and repeated 5 times to get deviations below 2.5%. I get the following timings for your sizes plus 50 and 100:

    n   time
    8     0.0368 ms
   10     0.0243 ms
   15     0.0221 ms
   20     0.1263 ms 
   25     0.2677 ms
   30     0.1621 ms 
   35     0.2433 ms 
   50     2.3234 ms 
  100     0.5509 ms

The timings are of course all strongly dependent on the heuristic used for finding the solution which is shown by the varying times above.

Constraint programming is typically also very dependent on the the pruning power of the constraints. Adding the option "-icl dom" which makes the pruning for the all-different constraints domain consistent, gives the following timings.

    n   time
    8     0.1138 ms
   10     0.0603 ms
   15     0.0879 ms
   20     0.2801 ms 
   25     0.6612 ms
   30     0.4657 ms 
   35     0.7553 ms  
   50     2.7687 ms 
  100     8.7687 ms

As can be seen, domain consistency did not actually help us in this case time-wise. The number of search-nodes visited range between 17 and 1066 for the normal consistency level, while the number of nodes visited for the domain consistent version range between 14 and 286. The reduction in search tree size was however offset by the increased complexity of the propagator.

mzl · on Feb 26, 2014

As an update I saw that the original post was about finding all the solutions to the problem, compared to the single first solution searched for above. Running for finding all solutions, I got the following timings (same computer as above, same settings, default consistency).

    n   time           search tree  solutions
    8      0.8752 ms       767          92
   10     14.2759 ms     11431         724
   12    306.7616 ms    232163       14200
   14   8782.4620 ms   6391931      365596

Downloading the c-versions from https://github.com/davidad/8queens/tree/%2Bc_comparison gives a reference-time using the time built-in command of around 750 ms for 10k iterations. This means that the C solver takes around 0.075 ms per iteration which is a factor of ten faster than the Gecode version. That is not unreasonable given that the Gecode version is a high-level model in a general framework.

I did not manage to build the assembly-version unfortunately.

EDIT: Updated the timings and comments of the C-solver to represent that it actually solved the problem 10k times. Lesson, never do benchmarks quickly :-)

davidad_ · on Feb 26, 2014

See [1]; it is about a factor of 7 win vs. a reasonable C implementation.

[1] https://github.com/davidad/8queens/tree/%2Bc_comparison

ufo · on Feb 26, 2014

How do those numbers compare to the usual backtracking approach from the OP?I'm curious, specially with how the backtracking scales as N increases (this would be when all the heuristics from the sat solvers would really kick in)

_wc0m · on Feb 26, 2014

I am said Hacker Schooler who issued the challenge. I thought I'd done well with a Rust solution that ran in 42ms [0].

I was well and truly pwned, as they say (but at least I got some sleep last night).

[0] http://rosettacode.org/wiki/N-queens_problem#Rust

dbaupp · on Feb 26, 2014

Looks cool!

BTW, the convention would be for `static SIZE: i8 = 8;` (i.e. upper case), and I would strongly recommend you don't use `i8` for that, because SIZE * SIZE = 0 (but not in const-exprs, it seems). `uint` is the type of the smallest size that is guaranteed that to work for all array indexing.

Also, you can use `bool` instead of i8 for your array elements. (A Rust bool is a u8/i8 that is guaranteed to be either 0 or 1, i.e. exactly what your code has.)

Lastly, one conventionally matches on just `None` rather than `None()`. (This actually lead me to file https://github.com/mozilla/rust/issues/12560 to make it require that there be no `()`, so thank you for prompting that.)

JoeAltmaier · on Feb 26, 2014

Beautiful! I derived this solution back in 81 when Byte Magazine had a contest. It took 20 minutes to run in Basic on an HP 2000.

anfractuosity · on Feb 26, 2014

That's a really fascinating article! I'd love to see how a C implementation would match up to that.

As an aside there seems to be lots of really interesting concepts involved involved in implementing chess engines.

From representing the chessboard in a highly compact manor through the use of bitboards (http://www.frayn.net/beowulf/theory.html#bitboards) to searching for 'best' possible moves.

bluecalm · on Feb 26, 2014

My old code: http://codetidy.com/8255/

windows.h is for QueryPerformanceCounter used only in main(); compile with gcc -Ofast

It iterates/counts all solutions + prints first of them (as 01 board) in 0.23ms on my i7 (if you redirect output to file). You can also run it for N queens changing #define Qs to something bigger. The code as you see is silly (not needed conditional in main loop among other things) and not optimized, still it's pretty fast :) I am not sure what hardware they used but it takes 10x less time on my computer vs theirs.

davidad_ · on Feb 26, 2014

Hi bluecalm! I tweaked both your code and mine to uselessly solve the problem 10000 times, and not pretty print anything, so the difference in raw 8-queens-solving speed is readily apparent. Here are the results:

    bash-3.2$ make
    gcc -O3 8q_C_bluecalm.c -o 8q_C_bluecalm
    nasm 8q_x64_davidad.asm -DLOOPED=10000 -f macho64 -o 8q_x64_davidad.o
    ld -o 8q_x64_davidad 8q_x64_davidad.o

    time ./8q_C_bluecalm  ; echo $?

    real	0m0.802s
    user	0m0.801s
    sys 	0m0.001s
    92

    time ./8q_x64_davidad ; echo $?

    real	0m0.113s
    user	0m0.112s
    sys 	0m0.000s
    92

If you'd like, I'm happy to post this as a branch of my repository so others can replicate. (I only hesitate because it's a derivative work of your code.)

bluecalm · on Feb 26, 2014

Hi, I am happy to see that asm magic gains over naive code :) Feel free to host/use the code in any way you like!

It will be nice addition to see how hand tuned code beats naive C implementation. (I hope your code iterates over all solutions as well, the easiest way to change mine to stop after 1st is a goto from solve to main :))

EDIT: Yeah, I meant if it counts all of them and I've just noticed 92 exit code :) I didn't expect that big a speed-up, impressive!

EDIT2: to answer children post: That's great, please change -O3 to -Ofast though if you are using new GCC. It shouldn't matter much but it sometimes does; I believe -march=native won't help with generated assembly though I seems that your code from original article spent all its time in pretty printing. I am happy to see some real comparison now; I am for sure learning a lot here once I understand everything in the article.

davidad_ · on Feb 26, 2014

It counts up every solution, thus the "92" exit code. However, the non-pretty-printing version does take some shortcuts - it can't reconstruct the entire current board state in O(1) time from its actual register state. So I'm not sure exactly what you mean by "iterate over all solutions". (It certainly doesn't bail out when it finds the first solution, if that's your question.)

davidad_ · on Feb 26, 2014

Here [1] is the comparison repository, for anyone interested.

[1] https://github.com/davidad/8queens/tree/%2Bc_comparison

EDIT: I don't seem to have -Ofast, but I added a note to the readme to ask people to try it. I also added the recommended flags for Sandy Bridge, i.e. -march=core2 -msse4.1 -msse4.2.

anonymoushn · on Feb 26, 2014

You can go at least twice as fast if you use symmetries to prune the space. I messed with the C version:

  $ make
  gcc -O3 -march=core2 -msse4.1 -msse4.2 8q_C_bluecalm.c -o 8q_C_bluecalm
  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.376s
  user	0m0.373s
  sys	0m0.001s
  92
  time ./8q_x64_davidad ; echo $?
  
  real	0m0.129s
  user	0m0.128s
  sys	0m0.001s
  92

Of course this doesn't matter if the point is to compare the best asm dfs to the best C dfs of the same space, but if the point is to make the best asm 8queens solver, you can get down below 6μs like this :)

Edit: Using C++ templates to inline the DFS also helps

  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.260s
  user	0m0.258s
  sys	0m0.001s
  92

Unfortunately this sort of technique doesn't really apply to the asm >_>

Edit again: I tried to go a bit further by replacing the board struct with your method of passing 3 bytes between levels of the dfs, but at that point the compiler was able to optimize out the whole program.

Removing the templates causes the compiler to actually emit a dfs, which by now is faster than the asm (!). Removing the 2x speedup from just not looking at half of the tree slows us back down to ~1.1x the runtime of the asm:

  time ./8q_C_bluecalm  ; echo $?
  
  real	0m0.140s
  user	0m0.138s
  sys	0m0.001s
  92

Here's this last solution, the one that's ~1.1x slower than the asm on my machine: http://pastie.org/8784768

And with the free 2x speedup: http://pastie.org/8784770

One last time now: I tried getting rid of the x arg and using a global. This was much slower, so I tried getting rid of the SOLUTIONS global and using a return value. This caused the compiler to optimize out the whole program again :(

davidad_ · on Feb 26, 2014

Wow! I like your C solution. I'm glad that at least the speedup is still measurable, but it's a close shave.

I implemented your "free 2x speedup" in asm [1]. Sure enough, solution time improved to 4.6us on my machine [2].

[1] https://github.com/davidad/8queens/commit/61119a7f0019e4a85e... [2] https://github.com/davidad/8queens/tree/free-2x-speedup

bluecalm · on Feb 26, 2014

It seems to me that using a bitboard instead of the struct is the biggest gain. I like your version, very elegant and terse.

nkurz · on Feb 26, 2014

Thanks for putting this together!

My immediate question was whether another compiler might be more competitive with the assembly. On my Sandy Bridge processor, I found that although GCC was the slowest, the assembly was still the clear winner:

  gcc 4.8.0: 0.801596113 seconds
  icc 14.0.1: 0.739297534 seconds
  clang 3.2: 0.706446818 seconds
  assembly: 0.104038212 seconds

I was surprised by Clang here. It's an older version than the other two, yet fastest. A quick glance at 'perf stat' (which I used for the timing) says that although it's executing fewer instructions per cycle, it's managing to use fewer instructions than the other two.

Although it doesn't seem to make much of a difference here, those are odd flags for Sandy Bridge. Core2 is a previous generation, and it supports AVX which came out after SSE4.2. If you did to specify SB, you'd want the unwieldy "-march=corei7-avx". But probably better just to use '-mavx' or 'march=native', along with -Ofast or -O3.

a_e_k · on Feb 26, 2014

I couldn't resist the challenge, so I wrote a non-recursive solution [0] in C++. It takes about 0.73s on an E5-4650 to loop 10000 times. Since writing it I've tried testing it against bluecalm's solution and how they compare seems to be fairly sensitive to the particular compiler and hardware combination. But it's at least a slightly different take.

[0] http://pastebin.com/EciqQBg5

curiousDog · on Feb 26, 2014

Would it have been easier to code it in C, compile with -O3 and then further tweak the assembly?

davidad_ · on Feb 26, 2014

The semantics of C map poorly onto this kind of code. The types of decisions that need to be made here, like combining certain state values into single registers so they can be operated on in parallel, and the particular structure of branching, would be difficult to make post-hoc to a C-style program.

If you're excited about it, though, I'd encourage you to give it a shot and write a blog post about your experience! However hard it turns out to be, I bet we'd all learn something new. (:

TheLoneWolfling · on Feb 26, 2014

Perhaps. But code coming out of a C compiler (especially code compiled with -O3) can be really difficult to grok.

bluecalm · on Feb 26, 2014

I wonder how my old C code stack up to this as well. 15ms seems like eternity btw. My naive old code without any thought in it (no bit magic, no intrinsics) and only with -Ofast flag on i7 3770 Windows 7 iterates all the solutions in 10.1ms as well as prints first of them on the screen.

If redirected to file it takes 0.232ms, again to iterate all the solutions and print first of them.

It would be nice if authors provide some way to compare it to their code. Preferably on say 15 queens instead of 8 so pretty printing time could be ignored :)

userbinator · on Feb 26, 2014

This is the sort of thing that really shows why someone who knows Asm well can beat even the best compiler - because high-level languages, in hiding the details of the machine under layers of abstraction, are also hiding those details that make it possible to write efficient code. Examples include using the flags and the stack in "unusual" (to HLLs) ways, no requirement to conform to any calling conventions, etc.

That said, the size optimiser in me sees a noticeable absence of rbx, rsi, rdi, and ebp; those registers should be used before going into the r8~r15, since the "extended" registers require a prefix byte to the instruction every time they're used. In fact since only r8, r10, r13, r14, and r15 are used, this code could be rewritten to use only one extended register and you wouldn't need the "mov rdi, r8" at the end.

nkurz · on Feb 26, 2014

those registers should be used before going into the r8~r15, since the "extended" registers require a prefix byte to the instruction every time they're used.

It's hard to argue against "shorter", but since the modern processors this code targets all have a decoded instruction micro-op cache, there is no front-end bottleneck and the instruction length doesn't actually slow anything down.

Also, the four registers you mention are all callee-save on Windows (just RBX and RBP for Mac and Linux) , so avoiding them means you don't have to save and restore them. Given a choice, use a scratch register. That said, R12-R15 are callee-save too, so this may not apply here.

Personally, I dislike the messiness of the historical names when using them for non-historical purposes. I think if it was more acceptable to alias them to R1, R2, etc I'd prefer them more than I do.

userbinator · on Feb 26, 2014

but since the modern processors this code targets all have a decoded instruction micro-op cache, there is no front-end bottleneck and the instruction length doesn't actually slow anything down.

Right, this code is running entirely in the core and that's part of what makes it so fast.

Also, the four registers you mention are all callee-save on Windows (just RBX and RBP for Mac and Linux) , so avoiding them means you don't have to save and restore them.

Callee-save and caller-save are just a calling convention; and one of the advantages of using Asm is you don't have to care about calling conventions except if you're interfacing with some other language, which isn't the case here - it's pure Asm. There's not even a single function call in it.

Besides, saving and restoring those registers (if you really need to) only takes 4 bytes each (2 in 32-bit mode) - a push and a pop. This tradeoff pays off if you're going to use them in more than 4 instructions.

dfc · on Feb 26, 2014

I do not think I have ever seen so many stylistic/rare/historic ligatures used on a web page. Does anyone know if something changed in Iceweasel Aurora? Is this something that can be turned on in css?

UPDATE: If I allow javascript the web page is rendered using some google fonts which probably do not provide any ligatures. However if I do not allow JS and Iceweasel uses my default serif font (Minion Pro) the page is littered with all sorts of contextual/historic/stylistic ligatures.

TheLoneWolfling · on Feb 26, 2014

text-rendering: optimizelegibility

dfc · on Feb 26, 2014

Thanks. I think I might have to rethink my choice of default fonts. It seems that Iceweasel uses all available ligatures. I have to use fontspec to explicitly turn on the rare/obscure ligatures in latex. Hopefully Mozilla will give a little thought to enabling all ligatures by default.

Examples: https://imgur.com/a/7Fcdm

bzbarsky · on Feb 26, 2014

I think you misunderstand. By default, Gecko enables just the common ligatures. But this page has the following styles:

  div {
    text-rendering: optimizeLegibility;
    font-variant-ligatures: common-ligatures;
    -webkit-font-variant-ligatures: common-ligatures;
    -moz-font-feature-settings: "liga", "dlig";
    -ms-font-feature-settings: "liga", "dlig";
    -webkit-font-feature-settings: "liga", "dlig";
    -o-font-feature-settings: "liga", "dlig";
    font-feature-settings: "liga", "dlig";
  }

The text-rendering style enabled optional ligatures. The font-variant-ligatures style enables the use of OpenType "liga" and "clig" ligatures, but then the font-feature-settings style overrides that with "liga" and "dlig" (discretionary ligatures). What that means depends on how your font designer flagged their various ligatures, of course, but the page is explicitly opting into more than just common ligatures.

davidad_ · on Feb 26, 2014

I may have been overzealous here. The page also provides Garamond in various OpenType and webfont styles, and I wanted to be sure that the Garamond is displayed with ligatures, because Garamond's ligatures are really nice and not particularly noticeable (whereas I notice the lack of ligatures right away). I didn't really give thought to what happens if the page is being rendered in a different OpenType font for some reason. Perhaps I should replace "dlig" with "clig".

dfc · on Feb 26, 2014

I am not sure why you think you should just replace dlig with clig? Did you want the discretionary ligatures? Which of the contextual ligatures do you want? In my opinion it is the discretionary and the historic ligatures that are distracting and gaudy. I you can not think of the specific ligatures you want from the discretionary/contextual/historic why not just stick with the basics provided in liga?

This is going to sound crazy: Does anyone know the name of the js/css development website that has three or four panes and lets you mock up things on screen. There is probably more than one. I can not think of the name right now for the life of me. I always see it linked to when people are doing demonstrations here on HN.

I put up an image here of your defaults with your fonts and mine:

https://imgur.com/a/x44Wq

My default font is Minion Pro and you can see the extra discretionaries in words like "spurred, directed, testing, etc." The one goofy css test that I found did not let me choose any web fonts so I could not check your garamond but I did check my local Garamond Pro. and all of the same discretionary ligatures are present in my local garamond.

pacaro · on Feb 26, 2014

Nice,

I've been doing some genealogy research recently (caution: if you are obsessive this can become a real time suck), some of this has involved reading old German and Austrian newspapers, A typical obituary page might have Swiss, Roman and Gothic (black letter) faces, in particular the gothic tends to make extensive use of ligatures

vidarh · on Feb 26, 2014

> genealogy research recently (caution: if you are obsessive this can become a real time suck

I experienced this first hand as a child - more summer vacations than I'd have cared for involved trips arranged so we'd pass by various cemetery's or churches so my dad could check graves or church books against his latest leads...

lstamour · on Feb 26, 2014

I had a problem displaying the page in Chrome Beta on Android. Most of the page was blurry and I couldn't read past two or three screens of the article, it just turned into a scrolling black field. Haven't seen that before...

djvu9 · on Feb 26, 2014

I feel nowadays to meaningfully talk about performance of N-queens problem N should start with 18 instead of 8.. Here is a C++ solution with 40 lines of code and can solve 18-queens in less than a minute on a laptop:

https://gist.github.com/jdeng/7915749

IanCal · on Feb 26, 2014

Are you calculating the total number of solutions, or just finding a solution? The latter has a closed form solution: https://gist.github.com/IanCal/1858601

That'll do ~N=1e6 in a second (I'm sure there are lots of optimisations that could be done in it though), or solutions for N=4 to N=1000 in less than 10.

The more complex problem is finding out how many solutions there are in total.

djvu9 · on Feb 26, 2014

Counting all the solutions. http://en.wikipedia.org/wiki/Eight_queens_puzzle#Counting_so...

IanCal · on Feb 26, 2014

I can't edit this now, but I just realised it's 1e7, not 1e6 in 1s.

shoo · on Feb 26, 2014

N = 2 ^ 18 = 262144 queens in less than 6 seconds in a VM on a laptop

its in ~210 lines of C++:

https://bitbucket.org/rfc/dopt_nqueen/src/3e812164d97a8e91d4...

uses random local search approach, as described in russell and norvig (http://aima.cs.berkeley.edu/).

wrote this code during https://www.coursera.org/course/optimization

IanCal · on Feb 26, 2014

Apologies, I'm posting this few times, but are you calculating the total number of solutions, or just finding a solution? The latter has a closed form solution: https://gist.github.com/IanCal/1858601

That'll do ~N=1e6 in a second (I'm sure there are lots of optimisations that could be done in it though), or solutions for N=4 to N=1000 in less than 10.

The more complex problem is finding out how many solutions there are in total.

shoo · on Feb 26, 2014

You're correct, i'm merely finding a single feasible solution. Cheers for the link

IbJacked · on Feb 26, 2014

Hard to believe I wrote anything useful back in the day, having just an A, X, and Y.

wglb · on Feb 26, 2014

These new instruction sets look like a serious luxury, don't they?

davidad_ · on Feb 26, 2014

They truly, truly are.

zurn · on Feb 26, 2014

It would be interesting hear from a compiler guy why it's hard to produce this sort of code on x86 compiling from C or OpenCL. Bonus points for GPU version :)

_ooqq · on Feb 26, 2014

reminds me hof how I taught the c64 one of wolfram's 1d automata (as in http://www.mathpuzzle.com/MAA/42-From%20Keyfobs%20to%20Ringt... ) by heavily relying on lookup tables and bit shifting which got down the calculation times to about 30s for the whole 320x200 hires screen. Fun times.

Patient0 · on Feb 26, 2014

This looks like a great blog. The previous entry on virtual memory is also pretty informative.

plumeria · on Feb 26, 2014

Fellow wikipedians: Curiously, there is no English Wikipedia entry for MCPL.

muglug · on Feb 26, 2014

Yes – I don't think it ever served much purpose beyond academic papers. The referenced paper's author, Martin Richards, never mentioned it when supervised me. Another language of his, BCPL, is here: http://en.wikipedia.org/wiki/BCPL

Edit: Turns out he created it after I graduated. See here: http://www.cl.cam.ac.uk/~mr10/MCPL.html for more info