Why Probabilistic Programming Matters (plus.google.com)
119 points by shankysingh on July 15, 2014 | 51 comments



I think a lot of statisticians and machine learners remain to be convinced that there's much payoff available from trying to do efficient statistical inference in such a general setting. As the article warns, it's inherently really hard in its full generality, and I don't think anyone expects a silver bullet. It seems likely that the most general probabilistic programming tools will be strongest on:

* Problems with few parameters and/or few data (I was going to say toy problems, but there are sometimes important and interesting problems of this nature)

* Problems where the generative model is so complicated that you have no hope of doing any better than this and turn to it as a last resort. A bit like in combinatorial optimisation where you just say "gah, let's throw it at a SAT solver!".

(Perhaps that's not a bad analogy actually. If they can get to the point where SAT solvers are now, that would actually not be a bad proposition.)

* In particular, problems where the generative model is complicated but the complicated part of it is largely deterministic -- perhaps some kind of non-linear inverse problem where there's some simple additive observation noise tacked onto the end, for example.

What I do fear is the suggestion that people can just start building fiendishly complicated hierarchical Bayesian models using these things and get valid, useful, robust, interpretable inferences from them without much in the way of statistical training. I suspect even a lot of statisticians would be a bit scared of this sort of thing. Make sure you really read up on things like model checking and sensitivity analysis, that you know something about the trade-offs of different model structures and priors, etc. And that's before you start to worry about the frequentist properties and failure cases of any approximate inference procedure which is magically derived for you.

Statisticians tend to favour simpler parsimonious models, not only for computational convenience but because it's easier to reason about them, understand and check their assumptions, understand their failure cases and so on.

I wish these guys lots of luck though, it is a really interesting area and the computer scientist in me really wants them to succeed!


I generally agree, but I feel like there is less skepticism than you portray among the ML/Stats community about probabilistic modeling languages (hence the DARPA call for white papers). Universal inference for well-posed models is ambitious, but a lot of people have worked hard, with significant success, to make it happen.

The Geman and Geman paper from 1984 was directed at general procedures for inference using sampling, as was the work of the Brown group (Grenander et al.) throughout the 80s and 90s. BUGS (http://www.mrc-bsu.cam.ac.uk/software/bugs/) was specifically directed at that goal, and has been highly successful in real applications since the mid-1990s. Other more recent software, like Stan (http://mc-stan.org) is also targeted at this approach.

Looking at this from a different angle, the published work of many ML researchers has been directed at unifying common models with the clear view of producing software. I'm thinking about Michael Jordan's "plate" notation (http://en.wikipedia.org/wiki/Plate_notation) and the various unifying reviews of multi-component models and inference algorithms for time series (e.g., http://dl.acm.org/citation.cfm?id=309396), and the corresponding reviews for the EM algorithm done by Meng, van Dyk, and others (e.g., http://www.stat.harvard.edu/Faculty_Content/meng/JCGS01.pdf).


OK, fair points. Perhaps that came across a bit more negatively than I intended it to.

I think there's a lot of value (and a lot of buy-in) for probabilistic modelling languages which impose some constraints in order to get efficient inference, help you meet those constraints, and don't provide a false promise of generality. And tonnes of value in research which looks for nice unifying formalisms to enable this sort of thing. Also in nice formalisms for inference in general, so perhaps that will be a useful side-effect of this work.

It's the idea of full-blown probabilistic programming, where you have an unconstrained, Turing-complete, non-deterministic programming language and inference just works, that I've seen a bit more healthy skepticism about. Skepticism of the form: "this will be nice if it ever arrives, but in the meantime I'll be getting on with doing statistics, which it doesn't obviate the need for".

Similar to how a computer scientist feels about the ultimate "sufficiently smart" compiler, I suppose.


I disagree. People who know how to build complex Bayesian models will be over the moon that they don't have to piece a complete system together using matrices. I see no reason for such a system not to scale; the underlying math is essentially elementary, but each statistical package at the moment has a kind of lock-in through lack of interoperability. Ideally you want to do MCMC only as a last resort for part of the reasoning, but once you are in WinBUGS you end up doing all the basic stuff with MCMC too, as it's too much of a pain to try to tie it to a more efficient system for the bits you can reason about efficiently. I see potential in models that are partially MCMC and partially analytical. I see even more potential in embedding those kinds of systems within larger smart systems.

Some probabilistic standardisation would be lapped up by the applied community.


I am enjoying going through "Probabilistic Models of Cognition" (https://probmods.org/generative-models.html). Even though I am a Python guy, the fact that it is written using a functional probabilistic programming language called Church really makes it easy to follow along.


Yeah, looks like nice externals, but yet again, all the inference is with MCMC :s

MCMC is the most general approach, but also the most inefficient, so it's got limited appeal for crunching live numbers coming out of a physical system. Some problems are intractable and it's the only way, but really you want to apply MCMC as little as possible. So if you use Church you now have an interoperability problem with binding (LISP) to some ugly but efficient FORTRAN system or some such.

Exactly why I think this initiative will be highly welcome!


So you're probably aware of this, but I think it might be interesting to try and articulate why I fear extending fully-Bayesian inference algorithms beyond MCMC could be a challenge, from experience with ML-focused Bayesian models and larger datasets:

Generally to make inference for these kinds of models fast, you have to make it approximate.

MCMC has the nice property that, while it's an approximate method, it becomes exact in the limit of infinite samples. Meaning that when applying it in a general setting, the compiler doesn't have to make hard and final decisions about where and how to introduce approximations -- it produces MCMC code and you decide how long you can be bothered to wait for it to converge and how accurate you want the answer to be, based on various diagnostics.
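
(A minimal sketch of that property in Python; the standard-normal target and proposal scale are made up for illustration, but note that the only decision left to the user is how many samples to draw:)

    import random, math

    def log_post(theta):
        return -0.5 * theta * theta   # toy unnormalised log-posterior

    def metropolis(n, step=1.0):
        theta, samples = 0.0, []
        for _ in range(n):
            prop = theta + random.gauss(0, step)          # propose a move
            # accept with probability min(1, post(prop) / post(theta))
            if math.log(random.random()) < log_post(prop) - log_post(theta):
                theta = prop
            samples.append(theta)
        return samples

    # the estimate keeps improving as n grows; no approximation is baked in
    print(sum(metropolis(100000)) / 100000)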

Most other non-MCMC-based approximate inference methods (MAP, EM, Variational Bayes, EP, various hybrids of these and MCMC...) don't converge to the exact answer, they converge to an approximate answer which remains inexact no matter how long you run the algorithm for.

Different approximations have different strengths and weaknesses, and the best choice may depend on the model, the data, what you actually want to do with the posterior (a mode, a mean, an expected loss, predictive means, extreme quantiles?), and what frequentist properties you want for the results of the inference, especially given that this is no longer pure Bayesian inference but some messy approximation to it. Often the only practical way to decide will be to try a bunch of different things and see which does best on the application (or best predicts held-out data), even armed with full human intuition. A compiler doesn't really have a chance here.

In short, it's not going to be easy to automate fully, because it's not something you can decide on formal grounds alone. You have to decide how and where you're willing to approximate, and you have to understand the approximations used and check the resulting approximated posterior against data in order to know whether to trust the results and how to interpret them.


I don't think I'm disagreeing with you really. I'd love to have a standard language for describing probabilistic models, together with some tools to automate deriving a range of different approximate inference algorithms for them. Perhaps also some heuristics on top of this to pick good defaults for the inference method, which I can override. Ideally also a way to implement my own approximations (more realistically, approximations from papers I want to implement) without having to start from scratch.

I'd prefer to decouple those things though, so I can benefit from the first and the second without waiting for a "sufficiently smart compiler" to arrive which can automate the whole shebang well enough to rely on in a fully general programming language setting. That was what I was expressing skepticism about, although I'd be very happy to be proven wrong.


I think the most exciting opportunity here is actually compiling models to run on specialized inference hardware! Lyric Semiconductor, a.k.a. Lyric Labs of Analog Devices, is working towards such goals (http://dimple.probprog.org/, http://arxiv.org/abs/1212.2991). I hope that they will get some hardware out at some point.

Also, doesn't it strike you as strange that we are building simulations of stochastic phenomena (MCMC) using deterministically behaving components? What if we used stochastically behaving components as building blocks to begin with?

Most electronic components start to behave more or less stochastically when you try to run them with as little power as possible and scale them down as small as possible. What if you could build an MCMC simulator for the problem of your choice directly from stochastically behaving components? Just think of all the transistors you use nowadays just to generate a random number simulating a 90% probability of something: they are all components manufactured to such tolerances, and run at such power levels, that their probability of error is on the order of 1e-24 (for one computation). Doesn't that strike you as humongous overkill, when all you needed was something like "1, most of the time"?

For more related cool stuff, google for imprecise hardware.


For some examples of one form of "probabilistic programming", have a look at the BUGS book [1]. Here, we'll estimate P(Y < 3), where Y ~ Binom(8, 0.5):

    model {
        Y ~ dbin(0.5, 8) # Y is Binom(8, 0.5) - 8 trials, pSuccess = 0.5
        P2 <- step(2.5 - Y) # 1 if Y <= 2 (i.e. Y = 0, 1 or 2), else 0
    }
When this is passed to {Open/Win}BUGS, the software constructs a graphical model from this code and uses MCMC techniques to sample efficiently on it. For example, you can examine the distributions of the nodes P2 and Y.

    node   mean     sd     MC error    2.5%   median  97.5%   start   sample
    P2    0.1448   0.3519   0.003317   0.0     0.0    1.0     1       10000
    Y     4.004    1.417    0.01291    1.0     4.0    7.0     1       10000
Thus, we infer that P(Y < 3) ≈ 0.1448, the mean of the indicator node P2.
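
(Sanity check: this particular probability has an exact answer, so we can verify the MCMC estimate. A quick sketch using scipy:)

    from scipy.stats import binom

    # P(Y < 3) = P(Y <= 2) for Y ~ Binomial(8, 0.5)
    print(binom.cdf(2, 8, 0.5))   # 0.14453125, i.e. 37/256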

[1]: http://www2.mrc-bsu.cam.ac.uk/bugs/thebugsbook/examples/


Reading this reminded me of a fun quote: "Google uses Bayesian filtering the way Microsoft uses the if statement" [1]. I imagine that having probabilistic values as first-class primitives is a step in this direction.

[1]: http://www.joelonsoftware.com/items/2005/10/17.html


I'll admit, I looked over https://probmods.org and I don't really get "probabilistic programming". It just looks like you call random() sometimes...? Is there something going on in the language runtime that I'm missing?


The power of probabilistic programming is not in running the programs forward to generate output. That pretty much does just come down to calling random() sometimes to alter the flow of the program. Having built-in syntax for that would be nice for modelling stochastic processes, but we can already do that.

The big deal is being able to run the program backwards. I.e., you already have some set of output, obtained from some other source, and you want to figure out how that output could have been produced (i.e., how the source works). So, you write a probabilistic program that encodes your various hypotheses about how the source process works, and then run it backwards to figure out what the forward execution path would have had to have been to produce that output, and thus which hypothesis is correct. That's what "inference" refers to.

Inference is a basic process in scientific investigation: come up with a bunch of hypotheses, do an experiment that generates data through physical processes independently of whatever you hypothesized, and then go backwards to figure out which of your hypotheses would have had to have been true in order to get the experimental output that you did. Which hypothesis is true corresponds to a particular execution path through a probabilistic program, so having efficient implementations of probabilistic languages means we could do much better at accurate analysis of data, which is Good For Science.


> so having efficient implementations of probabilistic languages means we could do much better at accurate analysis of data

What does a compiler-level implementation of probabilistic analysis have to do with accuracy?


General inference on a graph of dependent variables is NP-hard to do exactly, so you use a sampling-based method to converge towards the right solution and stop when it looks about right. However, for many smaller problems there is an exact analytic solution.

Real problems in the wild are a mix of both types: an intractable part and a part you can do exactly.

In an ideal world you would use the iterative methods ONLY for the parts that don't have an exact solution available. In practice, once you have developed the general iterative solution, you might as well just use it for the easy parts too, since writing two inference implementations just burdens you with more development work and more scope for bugs.

A language that allowed the inference method to be switched at no development cost would be amazing.
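
(For a flavour of what "partially analytical, partially MCMC" could mean, a toy Python sketch of my own, with made-up data and priors: the mean of a normal model has a conjugate prior, so it gets an exact Gibbs draw, while the standard deviation falls back to a Metropolis step:)

    import random, math

    data = [2.1, 1.9, 2.5, 2.2, 1.8]
    n, xbar = len(data), sum(data) / len(data)

    def loglik(mu, s):   # normal log-likelihood, constants dropped
        return sum(-math.log(s) - (x - mu) ** 2 / (2 * s ** 2) for x in data)

    mu, sigma = 0.0, 1.0
    for _ in range(5000):
        # mu | sigma, data: conjugate under a N(0, 10^2) prior -- draw exactly
        prec = 1 / 100 + n / sigma ** 2
        mu = random.gauss((n * xbar / sigma ** 2) / prec, prec ** -0.5)
        # sigma: no conjugacy here -- one Metropolis step (flat prior on s > 0,
        # reflection at zero keeps the proposal symmetric)
        prop = abs(sigma + random.gauss(0, 0.2))
        if math.log(random.random()) < loglik(mu, prop) - loglik(mu, sigma):
            sigma = prop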


Aside from what other commenters already said, inference is fiddly and easy to screw up when you have to do it yourself. Errors in statistical analysis are not uncommon, even in peer-reviewed papers. Automating it is good for two reasons: first, it reduces the error rate in the kinds of analyses already being done; second, it opens up the option of doing much more complicated analyses, either with a larger number of hypotheses or with hypotheses of greater complexity, which would otherwise be ignored as infeasible.

Both of those result in the potential for greater confidence in the accuracy of conclusions reached from experimental analysis.


It makes it cheaper to experiment with different models.


Still not really a statement about accuracy...


Experimenting with more models increases the likelihood you'll find a good model. A good model is, by definition, a more accurate representation of your domain than a bad model. It will also tend to generate more accurate predictions, if that's what you care about.

As a secondary point, re-implementing inference code for each new model makes it almost certain that there are bugs in said code. So even without changing the model, automatically generated inference code is likely to have fewer bugs and thus give more accurate inferences than hand-written code. (assuming it runs to convergence; naturally there are lots of scenarios in which naively generated code will be slower to converge than something hand-tuned).


I don't consider bugs in code a matter of accuracy of the model. And while you can compare the accuracy levels of various models, having a compiler do the inference versus doing the inference yourself doesn't change the accuracy. I also don't subscribe to the idea that finding the right model for a scientific or statistical phenomenon is a random event.


So this could be used to, for example, reverse hashes (find a collision)?


How is that different from testing a bunch of hypotheses forwards and determining which ones result in the end/starting data?


Because:

(1) Even very simple linear models in effect encode uncountably many hypotheses (i.e., y = a x + b where a and b are real numbers). You can't just pull (a,b) out of a hat until the agreement looks good, because the hat is too big.

One tool is, obviously, analytic solutions (where possible) like least squares. Beyond that, iterative methods like the EM algorithm. Beyond that, sampling-based methods like MCMC. You can view MCMC as your idea ("generate a bunch of hypotheses and test them"), but with the change that it does not ignore the old hypotheses when generating a new one to test: it uses an old hypothesis to seed a new one, via a probabilistic update rule.

In the presence of a model, you can basically compile (in the CS sense) an inference program that deduces model parameters from input data. The project to do this has been worked on, in fits and starts, since the late 1940s, by a series of geniuses/visionaries.

(2) Once you have a multi-level model, with many parameters, trading off what you gain by tweaking one parameter versus another parameter becomes hard. If the data is

  (x1,y1), ..., (xN,yN)
and you generate approximations Z1 and Z2 using two different models:

  M1: (x1,Z1_1), ..., (xN,Z1_N), i.e. Z1_i = M1(x_i), and
  M2: (x1,Z2_1), ..., (xN,Z2_N), i.e. Z2_i = M2(x_i)
what is the correct way to gauge agreement of Z1 versus Z2 to the target y? Only working through the model will tell you. For many reasons, least squares is not always suitable.

So even if you were content to keep generating hypotheses and checking agreement, you wouldn't know how to measure agreement.


That's one way to do it, it's just tremendously inefficient in general.

Really grossly impressively inefficient.


Here's my understanding based on the article alone; would love corrections if I'm wildly missing the mark here.

Let's say you want to determine if a coin is fair, but you don't know any statistics. What you can do is write a program (model) that represents the fair coin flip process. It might be as simple as:

    import random
    flips = [random.choice([True, False]) for _ in range(10)]
Then you flip your coin 10 times and feed the output into your program, and it gives you some indication of the probability that the coin is fair.
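
(If I've understood the article right, the inference step might look something like this crude sketch: a grid over candidate biases, each weighted by how well it explains the observed flips. All numbers here are made up.)

    from math import comb

    heads, flips = 8, 10                          # observed data
    grid = [i / 100 for i in range(1, 100)]       # candidate coin biases

    # likelihood of the data under each candidate bias
    like = [comb(flips, heads) * p**heads * (1 - p)**(flips - heads) for p in grid]
    post = [l / sum(like) for l in like]          # flat prior, so just normalise

    # posterior probability that the bias is within 0.05 of fair
    print(sum(pr for p, pr in zip(grid, post) if abs(p - 0.5) < 0.05))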


Seemingly the program is also run backwards. So maybe it will try coins with various biases until it determines the model that gives the observed flipping?

I am speaking from what I gathered reading the article and the other comments. Seems pretty interesting to me.


You create a probabilistic model. So you tell the compiler: there's a 50% chance this will be true, and the program will compute both sides and say, there's a 50% chance for this result and 50% for this other result.


How is this not trivially done with something like (random 0.5)? There must be something significant going on in, say, http://projects.csail.mit.edu/church/wiki/Church for it to promote this "probabilistic programming" term.

What I'm trying to say is that at the moment it looks like every programming language ever is trivially capable of expressing these sorts of programs using the random number generators that are wildly common. What is different in this programming paradigm?

In particular, this paragraph:

> Another way of thinking about this: unlike a traditional program, which only runs in the forward directions, a probabilistic program is run in both the forward and backward direction. It runs forward to compute the consequences of the assumptions it contains about the world (i.e., the model space it represents), but it also runs backward from the data to constrain the possible explanations.

Are these programs truly running backwards? Is there some backwards reduction relation in their semantics that I could look at?


I'm not an expert on probabilistic algorithms, but one thing you're missing is that it computes all the possibilities. (random 0.5) just picks one.


Most probabilistic programming languages give you samples from the distribution of execution traces that are consistent with the observed data. No probabilistic programming language actually runs things backward. They all use standard Bayesian computational methods, like MCMC over possible execution traces. The beauty is that the user can forget about this and just build generative models with primitives every language contains -- random numbers. The language then offers a way for users to query plausible values for the unobserved elements of the model.

Conceptually, the easiest way to understand things is via rejection sampling. You can "run things backward" by running things forward and just keeping the traces that match the observed output. That's Bayesian inference. Unfortunately, the probability of matching the observed output is zero for most problems, so you need to use more advanced methods. Most accounts gloss over these methods since it's a huge research area.

Running things backwards is also not a perfect analogy because it fails to account for the probability distribution you start with.
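
(A toy sketch of the rejection-sampling view in Python, with a made-up model of two hidden coin flips where we observe only that at least one came up heads:)

    import random

    def model():
        a = random.random() < 0.5    # latent coin 1
        b = random.random() < 0.5    # latent coin 2
        return a, b

    # keep only the traces consistent with the observation "a or b"
    kept = [a for a, b in (model() for _ in range(100000)) if a or b]

    # P(a | a or b) = (1/2) / (3/4) = 2/3
    print(sum(kept) / len(kept))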


Hand-wavy explanation: (random 0.5) could represent a fair coin toss, e.g. "generate a 0 or 1 with probability 0.5". But the coin isn't necessarily fair, so we can't assume the probability is 0.5. In fact, we'd like to learn what the probability is by observing a lot of coin tosses and counting the heads vs. tails. That's a very simple machine learning model. More complex models involve graphs of random variables where the structure of the graph represents conditional probabilities, and often the variables actually generate other probability distributions instead of scalar values. It doesn't take long before it becomes impossible to calculate the optimal values and parameters of the model. Much of statistical machine learning research involves looking for better ways to approximate solutions to these models, as well as finding graphs that are a good fit for real-world data.

Presumably, a programming language designed around this kind of task would make it easier to explore and develop larger, more intricate models.
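
(Going back to the coin: for that simple learning problem there is even a closed form; a sketch assuming a uniform Beta(1, 1) prior and made-up counts:)

    # Beta-Binomial conjugacy: uniform prior + observed counts -> Beta posterior
    heads, tails = 62, 38
    a, b = 1 + heads, 1 + tails    # posterior over the bias is Beta(a, b)
    print(a / (a + b))             # posterior mean, ~0.62: probably not a fair coin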


I can attempt to provide a more concrete example of real-world use. At the end of the day, it mostly comes down to preserving the idea that there is a probability associated with the occurrence of a result. One of the easiest versions of this I've seen is the two-factor model: mean value + deviation.

In one of my lives, I've calculated ballistics. When calculating ballistics, you can probably think of a lot of factors that might go into it.

- Release conditions (velocities, angles, etc.)

- Environmental / trajectory conditions (winds, gusts, different media, etc.)

- Impact conditions and behaviour (spalling, rebounding, fragmentation, yatta....)

For more reading, you can look at something like: http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/2007002... (only uses probabilistic methods for transport)

Anyhow, one way to attempt modelling of this is to run a Monte Carlo over the parameter space. Effectively, you run all the permutations of release, all the permutations of those permutations for transport, and then all the permutations^3 of the impact variation.

Now, one of the advances that has come about from efforts like that is that probabilistic models for some of these steps have been created to approximate the results in a time-effective manner.

For example, in the above, they note that instead of running a Monte Carlo of trajectories and releases, a single trajectory is instead run, and then a probabilistic flight cone is created which calculates parameters based on standard deviation from the mean.

A general and robust version of a language like this would allow you to specify all of your parameters with similar types of effects. I.e., you would have a mean, and then deviations or statistical curve envelopes you would provide around that mean, which would be propagated through calculations and spit out an overall convolved probability at the end of the pipeline.

In ballistics, for example, you might have a mean release condition, with statistical variation on release. That would then be fed into a trajectory calculation, which would again work from a mean, but with a transport cone that applies statistical variation to the mean flight. Finally, impact might be calculated around a mean value, but would then carry the statistical variation through from the release and flight. At the end, you would be able to pull a probability number, as well as the impact condition at that probability, from any point in the potential impact zone.

At least that's how I see the use of probabilistic languages.


Everything you say is true, and that only covers the forward problem: given a multi-level model with stochastic dependencies at each stage (like winds, fragmentation, etc., that you list), calculate the overall distribution of the output.

At its simplest, the forward problem is like a functional composition:

  y = F( G( H(x) ) )
where F, G, and H are all noisy, and you want to characterize y.
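
In code, characterizing y typically means pushing samples through the whole pipeline; a sketch with invented noise levels (not the NASA model):

    import random

    def H(x): return 1.5 * x + random.gauss(0, 0.1)   # e.g. release
    def G(h): return h - 0.3 + random.gauss(0, 0.5)   # e.g. transport
    def F(g): return 0.9 * g + random.gauss(0, 0.2)   # e.g. impact

    ys = [F(G(H(1.0))) for _ in range(100000)]
    mean = sum(ys) / len(ys)
    sd = (sum((y - mean) ** 2 for y in ys) / len(ys)) ** 0.5
    print(mean, sd)   # the output is a distribution, not a single number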

@gliese1337 nearby, in an excellent comment, addresses the more complex inverse problem: when you have outputs (y's above) from such a process, how do you determine the most likely model (F, G, H) that produced it?

So for example, if you have a bunch of crater ejecta, what can you tell me about the mass of the incoming objects?


Thank you for the clarification, and yes, most of what I gave as an example could be characterized as forward probabilistic methods.

- Start with a value + divergence and then propagate that state through your transforms to get a mean + divergent final state.

I should note that in the given example the base divergences need to be found through some method, which is usually running a Monte Carlo around some baseline state and then making the significant assumption that new calculations (with small deltas) follow roughly the same statistical behavior.

The backwards case can be significantly more challenging, as it basically comes down to performing approximate eigenvalue decomposition. However, there are a lot of SOTA tools, like pretty much anything from this list:

- http://en.wikipedia.org/wiki/Pattern_recognition#Algorithms

I.e., the crater example is fairly close to the setup for a hidden Markov model, where we get final data and assume that it is based on a set of hidden states. Or, for lighter reading, almost any of the major tools used in the Netflix recommendation contest, which is effectively an enormous version of this problem.


"...the backwards case ... basically comes down to performing approximate eigenvalue decomposition..."

Hmm, not in general. (I've been working on various problem-specific versions of the backwards case for most of my career.)


FYI, some folks (see: https://spectrallearning.github.io/icml2014/) are certainly working hard to apply eigenvalue decompositions (well, more like generalized SVDs) to any inference problems they can get their convexity-loving hands on.

Disclaimer: my lab mate has been organizing these workshops for the past couple of years and I'm currently procrastinating from preparing a presentation on this material for a reading group later this week, so I've presently got mixed feelings towards it.


It's true that this Church language seems to be little more than Scheme with a few clever tricks that allow it to walk through code paths and execute programs in clever probabilistic ways. I don't claim to understand it all; however, the tricks seem to be very clever, and things are put together so well, with emphasis on just the right concepts, that, to me at least, it makes certain aspects of probabilistic models, like the relationship between correlation, causation, induction and deduction, much clearer than anything I have seen before.

Years ago, when I was a grad student, I tried to tackle some NLP problems using similar approaches (i.e. unsupervised generative Bayesian models organized in hierarchical ways that try to mimic our intuitive concept of causality) but mostly failed at making much headway. I couldn't figure out some of the math, and my models usually failed to perform any better than other probabilistic algorithms. No one at my university did anything similar or knew much about Bayesian probabilities back then, which didn't help.

I always believed in the approach, however, and until I stumbled on this book I never knew there was significant progress being made. I wish it had been available to build upon when I was a student.

In the "Learning as Conditional Inference" chapter the section "Inferring an Arithmetic Expression" show the first example I have seen of a legitimate and flexible approach that could some day allow a computer to program itself. I mean the algorithm is still far from doing general programming tasks but I could imagine extending the approach so that given a programming language grammar and a unit tests as inputs, a computer could write a well chosen and simplified program that pass the tests (simplified with the automatic Occam's razor inherent in the bayesian approaches) . That to me is mind-blowing.

Now, not all is perfect in the book; the "Inference about Inference" chapter seems oversimplified to me. The last few chapters seem to get a bit mathematically ad hoc, with an alpha parameter introduced to tie categories and sub-categories together in hierarchical models without much mathematical justification. On the other hand, there is strong evidence for this kind of transfer of information in hierarchies, and it's not totally unreasonable to make up a parameter, assign it a Bayesian uninformative prior that reflects ignorance of how this parameter should be set, and try to induce a value for it from data. It's a perfect example of meta-probabilities, where probabilities are used to guess parameters that govern other probabilities. It's very Bayesian.

Anyways before reading this book I was convinced that general AI was not going to happen in my lifetime. Now I'm not so sure.


I wonder if such a system could be used for "programming by example". I.e., generate by hand a bunch of example behaviors, and then the system learns the program that can do that.


I was pondering that same thought here. Modeling a user's intent is one of the more difficult things in end-user development; it would be great to have a more-or-less standard way to infer what abstraction the user had in mind when introducing an example that ought to be generalized into a program.

Yesterday's Program Synthesis Demo [1] with MS TouchDevelop shows that the approach is viable. What I find lacking in such systems is a way to correct the inferred program in those inevitable cases when the learning system guesses wrongly; having a language to model the possible explanations for the example would allow the user to tweak the program or give hints on how to correct it.

[1] https://news.ycombinator.com/item?id=8028620


Something like PGM [1] (this is not a lightweight class) helps to understand the concepts. But it still seems like more of a niche domain right now than a general programming technique.

When one can apply it though, it really shines.

I understand the current implementation of matchmaking for Xbox Live is a big mess of imperative code; this is one area where knowledge of math can actually simplify the programming [2].

"Online gaming systems such as Microsoft’s Xbox Live rate relative skills of players playing online games so as to match players with comparable skills for game playing. The problem is to come up with an estimate of the skill of each player based on the outcome of the games each player has played so far. A Bayesian model for this has been proposed..." [3]

[1] https://www.coursera.org/course/pgm

[2] http://research.microsoft.com/pubs/208585/fose-icse2014.pdf

[3] http://research.microsoft.com/en-us/um/cambridge/projects/in...


This article doesn't do a great job of explaining what probabilistic programming actually is. It's about 1) making machine learning and probabilistic modelling accessible to a larger audience, and 2) enabling automated reasoning over probabilistic models for which analytic solutions are inconvenient. (Sorry for the wall of text)

The idea, in a nutshell: create a programming language where random functions are elementary primitives. The point of a program in such a language isn't to execute the code (although we can!), but to define a probability distribution over execution traces of the program. So you use a probabilistic program to model some probabilistic generative process. The runtime or compiler of the language knows something about the statistics behind the random variables in the program (keeping track of likelihoods behind the scenes).

This becomes interesting when we want to reason about the conditional distribution over execution traces after fixing some assignment of values to variables. The runtime of a probabilistic language would let us sample from the conditional distribution -- "what is a likely value of Y, given that X=4?". (in Church this is accomplished with query). A lot of models have really simple analytic solutions, but the inference engine in a probabilistic programming language would work for any probabilistic program. The semantics of this are defined by rejection sampling: run the program a bunch of times until you get an execution trace where your condition holds. This is really, really, grossly inefficient -- the actual implementation of inference in the language is much more clever.

An analogy to standard programming: it used to be the case that all programmers wrote assembly by hand. Then optimizing compilers came along, and now almost nobody writes assembly. The average programmer spends their time thinking about higher-order problems, and lets the compiler take care of generating machine code that can actually execute their ideas.

Probabilistic programming languages aim to be the compiler for probabilistic inference. Let the runtime take care of inference, and you can spend more time thinking about the model. The task of coming up with efficient inference algorithms gets outsourced to the compiler guys, and you just have to worry about coming up with a model to fit your data.

Because you don't have to think too hard about the math behind inference, probabilistic modelling suddenly becomes accessible to a much larger subset of the population. A ton of interesting software these days relies on machine learning theory that goes way over the heads of most programmers; that theory suddenly becomes accessible.

On the other hand, the people that do this work already are freed up to choose more expressive models and be more productive. The current paradigm is: come up with a probabilistic model, then do a bunch of math to figure out how to do efficient inference over the model given some data. Proceed to code it up in a few thousand lines of C++, and panic if the underlying model changes. The probabilistic programming approach: come up with a model, and write it in a few hundred lines of probabilistic code. Let the language runtime take care of inference. If the model changes, don't worry, because inference is automatic and doesn't depend on the specific model.

If you're interested in this, the Probabilistic Computing Group at MIT (probcomp.csail.mit.edu) has some interesting examples on their website.

A really simple example of Venture, their new probabilistic language: http://probcomp.csail.mit.edu/venture/release-0.1.1/console-...


Well written.

How does logic programming (Prolog, etc.) relate to probabilistic programming?


This reminds me somewhat of the Bloom programming language for "disorderly programming in distributed systems".

http://www.bloom-lang.net/


Every time that I see a discussion of probabilistic programming, I'm reminded that I want to try to find an excuse to use IBAL (http://www.gelberpfeffer.net/avi.htm). I used to have a binary for it lying around, but can't find it any more; does anyone know where to get it?


IBAL is quite old; you're probably better off learning something that's still being actively maintained. The natural choice would be Avi Pfeffer's new language, Figaro (https://www.cra.com/commercial-solutions/probabilistic-model...).


Thank you! I will have a look at it. Do you know if there's any analogue of the 99 Haskell / Lisp / Prolog problems list geared toward probabilistic programming?


Not as far as I know. The closest I can think of is Josh Tenenbaum and Noah Goodman's online book "Probabilistic Models of Cognition", https://probmods.org, which has lots of examples of simple models, written in Church; these could be a good set of exercises to try translating into Figaro. Though these systems are immature enough that there's no guarantee Figaro will actually work on the translated models (or that Church will even work on the originals :-).


Functional Pearls - Probabilistic Functional Programming in Haskell - http://web.engr.oregonstate.edu/~erwig/papers/PFP_JFP06.pdf


Re running your program "backwards": good luck ever figuring out if your sampling of an unknown but very complex distribution has converged :/


The buzzwording was strong in that article. I am not an expert, but I will be looking at this in the future.


Law of Probability Dispersal: Whatever it is that hits the fan will not be evenly distributed.



