Stochastic Superoptimization (regehr.org)
108 points by qubitsam on April 7, 2013 | 43 comments



The ASPLOS paper makes a bold assertion regarding the first example:

> To the best of our knowledge, it is truly optimal: it is the fastest program for this function written in the 64-bit x86 (x86-64) instruction set.

This is an unwise claim to make, since x86_64 speed is a moving target, and what is fast today will not be 6 months from now. For example, this sequence should be slightly superior on modern Intel CPUs:

    mov edx, edx            ; zero-extends edx, clearing the top half of rdx
    shl rcx, 32
    ; xor rcx, rdx
    ; mov rax, rcx
    lea rax, [rdx + rcx]    ; replaces the xor/mov pair above with one instruction
    mul rsi
    add rdi, r8
    adc rdx, 0
    add rax, rdi
    adc rdx, 0
    mov r8, rdx
    mov rdi, rax


Wow, that's really embarrassing.

I checked against IACA (http://software.intel.com/en-us/articles/intel-architecture-...) in case I was missing something, and it agrees that your version is 2 cycles faster on Sandy Bridge. Worse, while both blocks are slower on Nehalem than Sandy Bridge, yours appears to be 2 cycles faster on the older architecture as well. Seeing as that came out in 2008, it's hard to be charitable to their boasting.

Would you mind if I use your sample code to ask the authors?


Sure, you can use that however you wish.


> This is an unwise claim to make, since x86_64 speed is a moving target, and what is fast today will not be 6 months from now.

It's pointed out in the article that the optimizations can be regenerated as the target changes.

Nature-inspired techniques are very good at updating solutions in dynamic search spaces.


I simply meant the specific example they give and discuss in Fig. 1. The technique itself looks sound (and pretty cool).


Right, it's a moving target. Basically you'd adjust the cost function each time a new instruction comes along and rerun the optimiser to see what changes.

Thinking about dealing with a changing search space makes me wonder what an ant colony optimiser JIT would look like. A graph of instructions, maybe? With escalating path lengths?

edit: but then I need to update the graph based on the next legal instruction and voila, it looks a lot like Markov chains with 0/1 probabilities :|


Ha, OAAS (optimisation as a service). I like it. That's the first time I've seen that idea.

Of course, if you standardise something based on say LLVM IR, that spins into the idea of a marketplace of competing optimisation services... -Ofastly versus -Ogoogle-optimizer-cloud ...

Sadly, I don't think it would be feasible to commercialise this. That doesn't mean there aren't amusing possibilities:

"Although -Omechanical-turk does produce very good code, most people cannot wait 3 weeks for that compiler pass to finish..."


A year ago, Paul Graham selected "optimisation as a service" as a startup idea. It's #6 in http://paulgraham.com/ambitious.html


Oh, so it is!

The reasons I wasn't convinced a web-service-based IL optimisation pass would be commercially viable were: loss of determinism in compilation, the law of diminishing returns (throwing 10x the effort at it might give only a few percentage points of gain), and the fact that parts of the compiler literally being a black box would be unpalatable to a lot of people. Also, IP and confidentiality concerns.

OTOH PG's idea was slightly different; it reads to me like he's talking about something that starts with source code.


Now run the superoptimizer on itself (or at least parts of it.) Yes I realize this would quickly hit a point of diminishing returns since it would only try to create code that produces the exact same output, not novel code that produces even better output, but it's still an interesting idea to think about.

And maybe you could judge the output code by its utility rather than how closely it matches the output of the original algorithm, but that's a much harder problem.


(Warning! This is a long one!) This is a really intriguing concept. I've often daydreamed about writing something like this, but didn't have the time or expertise to go about attempting an implementation. It's so cool to see others doing this.

What would be really cool with these optimizers is if the stored databases of "patterns" could be shared. In other words, if the optimizer stumbles across a particularly excellent implementation of a particular code goal, it would be uploaded to a central database and cached for reuse and further mutation. People all around the world would be working from, and contributing to, one giant database of randomly discovered optimization techniques.

As a field of study, I also feel like stochastic optimization has a huge potential for improvement over the coming years. Recently, I had the problem of trying to space equally-sized spheres as far apart from each other as I could -- in other words, if I have 361 points that I'm putting in a box, I want the minimum distance between two points to be maximized (this is the fitness function).

My first attempt was to give each point a "charge" and attempt a little electrostatic simulation. This didn't work that well, for various reasons. After stumbling around for a while, I decided to attempt a genetic mutation algorithm. I had never written anything like this before, but I thought it would be fun to give it a go.

Initially, my program randomly placed points in the box. Then, for each point, it found the nearest neighbor and computed the distance to it. I then took the average of all these distances, so I essentially had the mean nearest-neighbor distance. On the next iteration, all points whose nearest-neighbor distance was less than this mean were perturbed. The process was repeated until no random perturbations produced improvement.
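
In rough Python, the core loop looked something like this (a simplified sketch rather than my actual code; box bounds omitted):

    import math, random

    def nn_dist(p, skip, pts):
        # Distance from p to its nearest neighbor in pts, ignoring index skip.
        return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
                   for j, q in enumerate(pts) if j != skip)

    def spread(pts, step=1.0, rounds=10000):
        for _ in range(rounds):
            dists = [nn_dist(pts[i], i, pts) for i in range(len(pts))]
            mean = sum(dists) / len(dists)         # mean nearest-neighbor distance
            improved = False
            for i, d in enumerate(dists):
                if d < mean:                       # perturb only the crowded points
                    cand = [c + random.uniform(-step, step) for c in pts[i]]
                    if nn_dist(cand, i, pts) > d:  # keep the move only if it helps
                        pts[i], improved = cand, True
            if not improved:
                break
        return pts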

This worked alright, but it needed more. I kept modifying my program, but in the process of modifying it, I introduced a bug that somehow actually made the code converge to a larger minimum distance (which is the goal). Inspecting my code, I realized the source of the "error": in the process of computing the distance between a point and its nearest neighbor, I was setting an initial "minimum" distance, which was supposed to be infinity. As closer distances were found, the infinity was replaced, so eventually it ended up with the closest distance. Well, it turns out that instead of infinity, I had been setting it to something like 10 (for a 100^3 box). So if the nearest neighbor distance was larger than 10, it was still a value of 10 that was contributing to the mean NN distance!

I soon realized that I could also modify the distance by which I randomly perturbed the points in each step, and that some values worked better than others. Not only that, but larger perturbations worked better at the beginning of the convergence, and smaller perturbations worked better toward the end. So I created a heuristic that based the next "perturbation distance" on the most recently successful distances.

But this wasn't enough. Eventually as the perturbation distance became smaller, the convergence process would "stall". I found that by temporarily increasing this distance, the system would find new mutation paths that would get the program out of its local optimum and move it forward again. But how frequently do I modify this distance?

The more code I wrote, the more I realized could be changed. There's a ridiculous number of parameters that could be optimized in my optimizer, and I was just sitting there tweaking them by hand and testing what worked best. The thought occurred to me, why don't I write a program, that writes a program, that optimizes the sphere placing problem? And then, darn, I may have to write a program that even optimizes that one.

Well, I had to get this thing done -- since it was for my research -- so I couldn't go off on a tangent, and I already had satisfactory results. But the sheer number of techniques to optimize this thing left me highly intrigued, and I will certainly be revisiting my program when I can find time to do so.


A lot of people have thought really hard about this problem. Plenty of open problems remain:

http://en.wikipedia.org/wiki/Packing_problem


And for this particular problem:

http://en.wikipedia.org/wiki/Close-packing_of_equal_spheres

http://demonstrations.wolfram.com/SpherePacking/

http://www.maa.org/devlin/devlin_9_98.html (nb: that proof is for infinitely many spheres. For finitely many, the problem is, AFAIK, still widely open. 361 likely is close to infinitely many, though, if your box isn't weirdly formed)


Your approach sounds a bit like simulated annealing, in case you need a keyword to look up more about it.


You'd find some books on nature-inspired computation to be interesting, since you've reinvented some parts of it:

http://www.cleveralgorithms.com/

http://cs.gmu.edu/~sean/book/metaheuristics/


Searching within a grammar was recently discussed in Gelman's blog. http://andrewgelman.com/2013/02/26/an-ai-that-builds-a-stati...


Yes, I just recently came to the same conclusion in my research on genetic algorithms: that they themselves have parameters that can be optimized. Really wild!


Very cool. It reminds me a bit of the genetic programming work (aside: GP and genetic algos are different schools of research) on evolving programs where the genome is a string of assembler instructions.

Markov chain Monte Carlo would appear on its face to be better suited to this problem than a pure GP approach; at least to speed up convergence to a solution. It's actually quite different from GP -- no crossovers, more global state etc. Maybe a flock/swarm optimiser approach would be a better analogy in the nature-inspired computation world.

I'd be interested to know whether they experimented with other cost functions. For example, in some schemes, some candidates which fall outside the acceptable solution state (eg code that doesn't run, code that produces the wrong answer) are simply culled. In some you try to perform "repairs". In this case they've taken the other common approach of assigning a cost to incorrectness.


Wikipedia claims that MCMC is for sampling from probability distributions, and that's what I was taught, too. Is there a way to use it to sample from a general search space?


I don't think this is MCMC. The way I read it, Monte Carlo was mentioned with regard to the idea of searching widely and sparsely. This is counter to the more targeted idea of MCMC, which generates a bunch of correlated samples to more efficiently sample the distribution.

Another reason to suspect this is closer to genetic programming is that they talk about mutation and fitness. I got the impression that the algorithm searches a space of program trees via mutation, without keeping a population - so no crossover. If they did use MCMC, it would be as a very specialized form, such as simulated annealing. Also, the Monte Carlo technique used in computer Go is tree search, not MCMC.

I find it amazing that randomization works so effectively; we must live in a friendly universe. Many real world problems tend to have enough exploitable structure that the idea of saving effort on optimization (xor measurement) via randomization is a powerful one.


>This is counter to the more targeted idea of MCMC, which generates a bunch of correlated samples to more efficiently sample the distribution.

The idea of MCMC is not to more efficiently sample a distribution through the use of correlated samples. Sampling a Gaussian exactly is more efficient than constructing a Markov chain with some random walk proposal (for example) to sample it.

The idea of MCMC is to sample distributions that are considered intractable. Correlation in samples skews moment calculations, and this is bad. In fact, there is a whole field of research dedicated to designing proposals that allow chains to reach stationarity more quickly.


Yeah, I know it's to aid in the computation of intractable inference; I should have been more careful in my wording. I am not saying it works by generating correlated samples, I'm saying that because of how it works it will exhibit autocorrelation. Essentially, MCMC will generate samples that are near each other, which will tend to be positively correlated. This property is why one needs to tune the variance of the proposal, do burn-in, and test for convergence. However, due to the nature of Markov chains, large enough samples will have you sampling from P(x). The idea is to approximate the density of P(x) while trying to reject as little as possible. Hence less wasteful. But unlike simpler rejection methods, samples will not be i.i.d.

I've also read the actual paper now, which I should have done in the first place. I see how they make use of MCMC. And looking at it, I think my suspicion that what they have is very close to annealing (sans temperature) is accurate, since they emulate density with cost and proposal with local edits. The particulars of the more general (though less generally applicable) MCMC, as I understand it, involve actual distributions. I'm not sure why the blog emphasizes searching sparsely instead of sampling.
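
To make that correspondence concrete, the loop I have in mind is roughly the following (a minimal Metropolis-style sketch, not their implementation; random_edit and cost stand in for their local program edits and cost function):

    import math, random

    def metropolis_search(initial, random_edit, cost, steps=100000, beta=1.0):
        # Treat exp(-beta * cost(x)) as an unnormalized density over programs
        # and walk the space with local edits as the proposal.
        current, c_cur = initial, cost(initial)
        best, c_best = current, c_cur
        for _ in range(steps):
            candidate = random_edit(current)
            c_cand = cost(candidate)
            # Always accept improvements; accept regressions with
            # probability exp(-beta * (c_cand - c_cur)).
            if c_cand <= c_cur or random.random() < math.exp(-beta * (c_cand - c_cur)):
                current, c_cur = candidate, c_cand
                if c_cur < c_best:
                    best, c_best = current, c_cur
        return best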


>The idea is to approximate the density of P(x) while trying to reject as little as possible. Hence less wasteful.

The idea is to approximate P(x) period. Rejecting as little as possible means nothing other than one's choice of proposal was poor. There are published works stating the optimal acceptance probability in high dimensions is 23.4% (see the work of Roberts et al (1997)). I think if one wanted to reject as little as possible one would have, albeit naïvely, tuned to a much higher acceptance rate.


You misconstrued. Reject as little as possible (<> reject little) was the motivation with respect to more naive random walks which suffer at dimensionality. I was not talking about leniency of proposal, where you are of course correct.


>You misconstrued. Reject as little as possible (<> reject little) was the motivation with respect to more naive random walks which suffer at dimensionality.

Misunderstanding on my part; apologies.

Can you clarify the statement on dimensionality? Moment calculations converge at a sharp rate of 1/sqrt(n). Riemann sums as approximations to integrals (moments) suffer from the curse of dimensionality.


> I got the impression that the algorithm searches a space of program trees via mutation, without keeping a population - so no crossover.

My reading is that they're generating programs of gradually increasing length using Markov chains (because the original superoptimizer work enumerated programs of progressively greater length until one worked). So not really a classic GP scheme at all.

> Also, the Monte Carlo technique used in Go is with tree search, not MCMC.

Monte Carlo tree search, right. That connection was drawn by the blogger, not in the paper.

> I find it amazing that randomization works so effectively, we must live in a friendly universe.

http://en.wikipedia.org/wiki/Anthropic_principle


> My reading is that they're generating programs of gradually increasing length using Markov chains (because the original superoptimizer work enumerated programs of progressively greater length until one worked). So not really a classic GP scheme at all.

It seems my definition of Markov Chain Monte Carlo is more specific than yours (approximate a probability distribution using a simpler proportional one and conditional samples) while I am more generous about GP: any search technique using mutation operators on program trees.

What do you make of the fact that the blogger specifically stated that the key to the technique is searching sparsely, which suggests very large jumps and runs counter to the idea of chains? There is also no hint of the idea of probabilistic acceptance. So even my mention of annealing is likely to be mistaken.

> Anthropic principle

Yes. Still doesn't stop me from chuckling at the audacity of things like Random Kitchen Sinks though (google it).


Good question about whether it's sparse searching. To me the point of using Markov chains seemed to be to converge more quickly, not to increase coverage (where you'd actually want to introduce more variation).

Annealing is always worth mentioning because a lot of the time, it's a better performer. GAs and GPs carry a lot of systemic overhead.


Annealing is hit or miss. I often have to sample a distribution on a domain that approximates an infinite dimensional space and annealing doesn't cut it for me. There are just far too many modes. Personally, I'd like to see what GAs and GPs have to offer in this regard.


Well to pick between GA and GP, the question is: are you trying to evolve a list of parameters, or create a novel function/program?

Parameters into function: use GA.

Function/program: use GP.

(speaking very roughly, that's the historical distinction between them)
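
To illustrate the GA side of that distinction, a throwaway sketch (illustrative only: a population of fixed-length parameter vectors, one-point crossover, Gaussian mutation):

    import random

    def ga(fitness, dim, pop_size=50, gens=200, mut=0.1):
        # Evolve fixed-length parameter vectors: keep the best half,
        # then refill the population via crossover plus mutation.
        pop = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
        for _ in range(gens):
            pop.sort(key=fitness, reverse=True)
            pop = pop[:pop_size // 2]
            while len(pop) < pop_size:
                a, b = random.sample(pop[:pop_size // 2], 2)
                cut = random.randrange(1, dim)  # one-point crossover
                child = [x + random.gauss(0, mut) for x in a[:cut] + b[cut:]]
                pop.append(child)
        return max(pop, key=fitness)

You hand it a fitness function over parameter vectors; a GP system would instead evolve the function itself.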


From a mathematical perspective, it looks as though your statement could imply that GA may potentially be viewed as a finite dimensional analogue of GP. Interesting.


Following that paper, a friend and I implemented a simple MCMC-based synthesizer for the GreenArrays chip (a very low-power and rather obtuse architecture based on a variant of Forth).

Most of the code is up on GitHub in several different repos[1][2]. I'm actually rather proud of some of the code, especially part of the F18A interpreter (F18A is the name of the instruction set on GreenArrays chips). It's rather short and neat.

[1]: https://github.com/TikhonJelvis/array-forth

[2]: https://github.com/jacobt/mcmc-synthesis

My version actually supports loops and even self-modifying code. This no doubt has some performance implications--my current approach is to throttle each test case to take at most as many steps as the input program does. For various reasons, I found it easier to just allow loops than to try to stick to loop-free code.

I found that their correctness measure--looking at the number of bits that differ from the expected result--was basically useless for GreenArrays. I actually have some graphs lying around which show how it basically doesn't vary until it suddenly finds a correct program and shoots up to being 100% correct immediately. It does not exhibit the nice hill-climbing behavior that we want.
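
For concreteness, that measure amounts to a per-value Hamming distance, something like this in Python (the real version is accumulated over each test case's machine state):

    def bit_cost(expected, actual):
        # Number of bit positions where the produced value differs
        # from the expected one.
        return bin(expected ^ actual).count('1')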

For better or worse, I have not been able to get this style of approach to perform as well as a similar synthesizer using an SMT solver (namely Microsoft's Z3). The advantage with the MCMC system is that it's easily parallelizable. It can also deal with certain programs that can choke up the SMT-based version.

The SMT-based system has problems dealing with lots of bits, so we usually run it pretending that the architecture uses ~4 bits instead of 18 (yes, it uses 18 by default!). So for problems requiring more precision, the MCMC approach might be better. Similarly, the SMT system does not model memory terribly efficiently, so problems needing a lot of mutated memory might be better for the randomized system.

Also, I should add that I know literally nothing about machine learning and even less about SMT solvers, so it's no surprise that the things I have written don't perform terribly well. Basically, my only solace is that this project's code is mostly rather elegant. I learned quite a bit about programming at a high level of abstraction from this project--some of the things I ended up doing rely on significantly more abstract reasoning than anything I've done before.

Anyhow, it's cool to see people reading about something I've been playing around with for a while now. Even if I find the whole field a trifle annoying :P.

If you want to play around with this sort of synthesizer, I have uploaded both a generic library based around MCMC[3]--it's surprisingly simple--and the GreenArrays-specific code[4] to cabal, so they're both just a cabal install away.

[3]: http://hackage.haskell.org/package/mcmc-synthesis

[4]: http://hackage.haskell.org/package/array-forth


I've recently been trying to write some very fast x64 assembly, and have been thinking about this quite a bit. I think one could get substantial improvement even with the simpler approach of just reordering the code. With modern processors, small changes in number of instructions have much less impact than how well they interact with the superscalar nature of the CPU. And while out-of-order processing helps diminish this, the order of instructions still matters a lot.

Instead of trying lots of random orders and seeing if they produce the right result, my thought is to generate a data dependency graph of the instructions, and then enumerate all possible topological sorts of this DAG. There's a classic typewritten 1974 Knuth paper on generating all the sortings efficiently: http://www.cs.ncl.ac.uk/publications/trs/papers/61.pdf

Generating the dependency graph is difficult in the general case, but should be simple enough if we presume branch-agnostic and alias-free code. As long as the dependency graph is correct, all the orderings should work. Thus if one of them doesn't, I'll know immediately that the graph is wrong. Since I only need to generate the graph once, this step doesn't need to be particularly fast.
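
To sketch the plan in Python (illustrative only: the per-instruction read/write sets are assumed given, and this is the naive recursion rather than Knuth's algorithm):

    def build_dag(reads, writes):
        # reads[i]/writes[i]: sets of registers instruction i reads/writes.
        # deps[j] holds the indices that must precede instruction j.
        n = len(reads)
        deps = {i: set() for i in range(n)}
        for i in range(n):
            for j in range(i + 1, n):
                if (writes[i] & reads[j]        # read-after-write
                        or reads[i] & writes[j]    # write-after-read
                        or writes[i] & writes[j]): # write-after-write
                    deps[j].add(i)
        return deps

    def all_orders(deps, placed=()):
        # Enumerate every topological sort by extending the partial order
        # with any instruction whose dependencies are all already placed.
        if len(placed) == len(deps):
            yield placed
            return
        for i in deps:
            if i not in placed and deps[i] <= set(placed):
                yield from all_orders(deps, placed + (i,))

Each ordering that all_orders yields would then be rendered back to machine code and timed.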

Then I'll directly generate assembly for each of these orderings. For speed, and since I already have the machine code, I'm planning to write the bytes directly to memory. Then instead of analyzing the code and estimating the processor's front-end scheduling, I'll just call each function and time it using RDTSC or a comparable on-die performance counter. As the timer is cycle-accurate, I shouldn't need many, if any, iterations.

This should be straightforward in C, and apparently possible in the standard scripting languages using libffi: http://stackoverflow.com/questions/6143042/how-can-i-call-in.... Then I'll save and write out the fastest one.

To keep it simple, it would be nice to outsource the heavy lifting to existing libraries and programs. Their availability might dictate my choice of language. I'm most comfortable with Python, Perl or C, but something else might be fun if all the tools are available. Here's my current list of potential tools:

For disassembly, DiStorm from Python or C. Haven't found anything great for Perl. Maybe X86::Udis86? Maybe just parse 'objdump -d'? My needs are very simple since I'll already have machine code.

For calling assembly, wrappers for libffi if using Perl or Python; nothing needed for C. I guess this could be trickier if I want to consider different alignments, but I think I'll start without.

For timing, I don't think it's usually possible to call RDTSC from Python or Perl, but since I'm planning to be able to execute arbitrary assembly, this shouldn't be any harder to add. A much more travelled path in C, though.

For generating all the possible topological sortings, I'll just write something following Knuth's approach. It would be nice if I could generate and call the functions from the sorting algorithm so I can monitor progress and start getting results sooner.

I think my weakest part is generating the dependency graph. I'm willing to cut corners on this by making lots of assumptions about the code, but it needs to be right. I can parse the output from Intel's IACA if necessary, but would be nice to have something more easily available. Are there libraries that do this?

Suggestions and alternative approaches greatly appreciated. My thought is that once it works, it shouldn't be that hard to add some of the "substitute different code" approaches. I like the idea of having a database of approaches one could try out. Of course, what I really hope is that something exactly like this already exists, and I'm just searching by the wrong terms.


It would seem intuitive that the search process would be greatly sped up by using a genetic search of the solution space rather than a brute-force search; it's a bit odd that genetic algorithms aren't mentioned.


Two things.

1. You're thinking of genetic programming, not genetic algorithms.

2. They used the Markov chain Monte Carlo method to perform the search. It's a different method altogether. For example, note the absence of crossovers.


I think that it would be correct to call it a genetic algorithm. I thought that genetic programming usually uses a tree structure in a functional programming language.


Most GP representations are of instruction trees, but some GP experiments have been done with strings of assembler instructions. The commonality is that the population of agents are programs and the test of fitness is a function execution.


The problem with that is that it's really hard to crossover strings of instructions in ways that don't completely break them, especially when code is inserted or removed from one parent. And crossover is 90% of evolution (in fact, some claim that genetic programming works just as well or better with no mutations whatsoever, using only crossover).


It's hard, but it's been done. The nice side effect is that fitness functions evaluate quickly.


It isn't; genetic algorithms don't work any better than hill climbing.


And so an NKS prediction comes to pass.


Nice, that's pretty awesome.


This is pretty cool.




