Bayes' rule in Haskell (2007) (randomhacks.net)
176 points by tikhonj on May 2, 2013 | 77 comments



This is a pretty old post of mine (2007), and I was a bit surprised to see it on Hacker News this morning! For those of you who find all the Haskell impenetrable, let me summarize:

1. As raverbashing points out, the key takeaway here is "removing the impossible and normalizing is equivalent to Bayes rule." This can be represented in a programming language using a "die unless X is true" instruction and some magic behind the scenes. It's nothing more than a trivial generalization of McCarthy's "AMB" operator for LISP with some weights attached to each possibility. (A minimal sketch of such a monad appears at the end of this comment.)

2. The nicest thing about this approach is that it makes it very easy to reason about probability correctly. For example, the Monty Hall problem, which is known for being counter-intuitive, can be stated very clearly with all the assumptions visible.

3. The original audience of this post was programming language designers and people who understand Haskell and monads. For everybody else, I'd like to apologize for the notation. "Simple" is relative, and in this post, I take a lot of background knowledge for granted.

4. To implement this idea, you can choose between several approaches: explicit enumeration of all possibilities (which explodes exponentially), particle systems, or weighted particle systems. The latter two would be ideally compiled down to GPU code for big problems. Unfortunately, you can't use either Kalman filters or modern Bayesian network solvers, which somewhat limits the usefulness of this approach. Basically, it will scale to lots of data points, but not to complex Bayesian networks.

5. Dealing with different false-positive and false-negative rates would add a couple lines of code to this example.
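For instance, a hedged sketch of what that might look like, in the style of the flu example quoted below (the names and rates here are mine, not the post's):

  drugStatusGivenPositiveTest = do
    drugStatus <- percentUsers 1
    testResult <- if drugStatus == User
                    then percentPositive 99  -- sensitivity: 1% false negatives
                    else percentPositive 2   -- 2% false positives
    guard (testResult == Pos)
    return drugStatus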

If you'd like a more detailed writeup of this idea, see: http://www.randomhacks.net/darcs/probability-monads/probabil... I'll be back online in a couple of hours and I'll be happy to answer any further questions.
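To make point 1 concrete, here's a minimal toy probability monad (my own sketch, not the exact code from the post or the PDF): weighted choice plays the role of AMB, and guard is the "die unless X is true" instruction.

  import Control.Applicative (Alternative (..))
  import Control.Monad (MonadPlus, guard)

  -- A distribution is a list of possible worlds with weights.
  newtype Dist a = Dist { runDist :: [(a, Rational)] }

  instance Functor Dist where
    fmap f (Dist xs) = Dist [ (f x, p) | (x, p) <- xs ]

  instance Applicative Dist where
    pure x = Dist [(x, 1)]
    Dist fs <*> Dist xs = Dist [ (f x, p * q) | (f, p) <- fs, (x, q) <- xs ]

  instance Monad Dist where
    Dist xs >>= f = Dist [ (y, p * q) | (x, p) <- xs, (y, q) <- runDist (f x) ]

  -- These two instances are what make "guard" work:
  -- failing simply deletes the current world.
  instance Alternative Dist where
    empty = Dist []
    Dist xs <|> Dist ys = Dist (xs ++ ys)

  instance MonadPlus Dist

  -- Weighted AMB: pick a value, remembering how likely it was.
  weighted :: [(a, Rational)] -> Dist a
  weighted = Dist

  -- Renormalize so the surviving worlds sum to 1: the Bayes denominator.
  normalize :: Dist a -> [(a, Rational)]
  normalize (Dist xs) = [ (x, p / total) | (x, p) <- xs ]
    where total = sum (map snd xs)
With these instances, guard from Control.Monad works directly: a failed guard deletes the current world, and normalize supplies the denominator from Bayes' rule.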


Do you really mean that "removing impossibility and renormalizing" is equivalent to Bayes Rule?

For example, imagine that we know a priori that a coin has only two possible biases, both of which are neither 0 nor 1. When I use Bayes Rule to update my belief in the two biases in response to observed data, I don't see how I am removing any impossibilities. It seems much more like I am downweighting the unlikely bias and upweighting the likely bias.


Excellent question. Let me unpack a bit. Here's an example from the PDF I linked above:

  fluStatusGivenPositiveTest = do
    fluStatus  <- percentWithFlu 10
    testResult <- if fluStatus == Flu
                    then percentPositive 70
                    else percentPositive 10
    guard (testResult == Pos)
    return fluStatus
In this code, you can read the operator "<-" as "pick a possible value from a probability distribution."

Until we look at the test results, there are four possible "worlds" with the following probabilities:

   7% (Flu, Pos)
   3% (Flu, Neg)
   9% (Healthy, Pos)
  81% (Healthy, Neg)
But once we see that the test result is "Pos", all those worlds with "Neg" become impossible, giving us:

  7% (Flu, Pos)
  9% (Healthy, Pos)
But our probabilities don't add up to 100% any more. Fortunately, all we need to do is normalize them. That gives us:

  43.75% (Flu, Pos)
  56.25% (Healthy, Pos)
This normalization step is basically the denominator in Bayes' rule.

As noted above, the number of worlds grows exponentially with the number of probabilistic choices we make. So if you want to use this on a larger problem, you need to use sampling (run the program N times, tabulate the results) or a particle system (which essentially does the same thing in parallel).
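For concreteness, here's a minimal sketch of the sampling version of the flu example (my code, using System.Random):

  import Control.Monad (replicateM)
  import System.Random (randomRIO)

  -- One simulated world: sample a flu status, then a test result.
  sampleWorld :: IO (Bool, Bool)   -- (hasFlu, testedPositive)
  sampleWorld = do
    flu <- chance 10
    pos <- if flu then chance 70 else chance 10
    return (flu, pos)
    where chance pct = (< pct) <$> randomRIO (0, 99 :: Int)

  -- Run n worlds, keep only those consistent with the observation
  -- (a positive test), and tabulate the survivors.
  estimateFluGivenPos :: Int -> IO Double
  estimateFluGivenPos n = do
    worlds <- replicateM n sampleWorld
    let kept = [ flu | (flu, pos) <- worlds, pos ]
    return (fromIntegral (length (filter id kept))
              / fromIntegral (length kept))
With a large n, this converges on the 43.75% computed above.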


It's unimportant, but the recently reintroduced 'monad comprehensions' bring out some nice features of this approach:

      fluStatusGivenPositiveTest = [fluStatus | fluStatus <- percentWithFlu 10
                                              , testResult <- testRate fluStatus
                                              , testResult == Pos]
        where testRate Flu     = percentPositive 70
              testRate Healthy = percentPositive 10


Until we look at the coin flip results for a coin known a priori to have only two possible biases a and b, where neither a nor b equal 0 or 1, there are two possible "worlds" with the following probabilities:

  0.5 (a)
  0.5 (b)
But once we see that the first coin flip is tails, which we code as -1, then we update using Bayes rule, giving us:

  p(-1|a)p(a)/p(-1) (a, -1)
  p(-1|b)p(b)/p(-1) (b, -1)
I just don't see any removing of possible worlds here. Am I doing this wrong?


Let's take the flu test I used in the grandparent post, and walk through it step by step using the traditional notation.

  P(Flu) = 0.1
  P(not Flu) = 1 - P(Flu) = 0.9
  P(Pos|Flu) = 0.7
  P(Pos|not Flu) = 0.1

  P(Flu|Pos)
    = P(Pos|Flu)P(Flu)
      -------------------------------------------
      P(Pos|Flu)P(Flu) + P(Pos|not Flu)P(not Flu)
    = 0.07 / (0.07 + 0.09)
    = 0.4375

  P(not Flu|Pos) = 0.09 / (0.07 + 0.09) = 0.5625
As you can see, the same numbers show up in both versions of the problem. In the traditional notation, the "impossible worlds" are (Flu, Neg) and (not Flu, Neg), because we know we have Pos. So the terms P(Neg|Flu)P(Flu) and P(Neg|not Flu)P(not Flu) never appear in the equation. The division by (0.07 + 0.09) is the same as the normalization step in my example: once we narrow our world down to 0.07 and 0.09, we need to divide by 0.16 so that everything sums back up to 1.

You may need to work through the math in both the Haskell version and the traditional version before it becomes completely obvious that they're exactly the same thing.


The two a priori conditions can also be viewed as 4 possible worlds by branching on the outcome of the (unobserved because it hasn't happened yet or, if you prefer, after it happens but before you observe it) coin toss:

    P(a)P(heads|a)     (a, heads)
    P(a)P(tails|a)     (a, tails)
    P(b)P(heads|b)     (b, heads)
    P(b)P(tails|b)     (b, tails)
When the coin toss is observed, you are removing the possible worlds in which the coin toss did not have the observed outcome. The renormalization step at that point will yield the same posterior probabilities as Bayes' rule.

In your example, we are eliminating the possible worlds where the flip is heads, leaving a total remaining probability that is precisely p(-1).
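For concreteness, with made-up biases (mine): say P(heads|a) = 0.9 and P(heads|b) = 0.4. Then the four worlds are:

    0.5 * 0.9 = 0.45   (a, heads)   removed
    0.5 * 0.1 = 0.05   (a, tails)
    0.5 * 0.4 = 0.20   (b, heads)   removed
    0.5 * 0.6 = 0.30   (b, tails)

    P(a|tails) = 0.05 / (0.05 + 0.30) = 1/7
    P(b|tails) = 0.30 / (0.05 + 0.30) = 6/7
The surviving worlds total 0.05 + 0.30 = 0.35, which is exactly p(-1), and dividing by it is the renormalization step.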


Great explanation. Thank you!



Here's more of the quote:

Look at how Google does spell checking: it's not based on dictionaries; it's based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn't.

It's also why Google knows "Jews should be wiped out" and "Muslims should be exterminated" and "blacks are ruining America" and "whites are neanderthals", all suggestions based on the first two words of each phrase. Yes, I'm provoking Google - but surely people also encounter these by accident.

If Microsoft shipped a version of Office that suggested any of the above as "corrections," they would be lambasted for it, and rightly so. Why does Google get a pass? Is it because our standard of decency is so much lower on the Internet? Or is it because we know that Google is merely reflecting popular sentiment on the Internet, and so the true villain is ourselves?

(In fairness, Bing happily suggests equally monstrous ideas.)


Because unethical sentences aren't a criminal offence? I mean, if you gave me the first two words from those sentences and asked "what's the most probable next words", I would give you pretty much the same answer. Let me try some more and I'll put my answer next to google's

"lawyers are" -> "greedy sharks" (actual suggestion: scum)

"gays should" -> "be killed" (actual suggestion: be killed)

"marijuana should" -> "be legal" (actual suggestion: be legalized)

"drug users should" -> "get help" (actual suggestion: be shot)

"macs are" -> "better than windows" (actual suggestion: better than pcs)

As you can see, google's guess at the largest cluster of opinions on each subject aligns with what I expected in 4 out of 5 cases. Also note how you're getting both the suggestion that marijuana should be legal and that people who smoke it should be shot. These suggestions aren't google's opinion on the matter; they're what google expects you to think. Thankfully, they're wrong most of the time.

In the end, it's not impossible to write an ethical filter the same way google has a spam filter; the only reason they haven't done it is that there's no pressure to do so. If you don't want to see such suggestions on google, you have two options:

1) Make people not talk about killing gays or how all lawyers are scum

2) Pressure google to add a filter

Also, white people really are partly neanderthal https://en.wikipedia.org/w/index.php?title=Neanderthal&o...


It's not necessarily what google expects you to think, but rather what you are most likely to be searching for.

Sometimes people search for content that they might not agree with, because they want to see what is being said there out of curiosity. Not every search is someone submitting their opinion to google, I'd expect that most are not.


You're right, I didn't phrase that right. I should have written "what google expects you to think of", like you said. Still, isn't that the same as "if we divide people in groups based on what they think about topic A; what does the largest group think?" Which to me sounds the same as "what you're most likely to be thinking".

All the people who want the addicts dead, let's say they're 30% of all people who think seriously about drug addicts, will happily rally under "should be shot", while the ones who want them rehabilitated would form many smaller groups under specific kinds of rehabilitation programs, how those should be administered and really what is the best program for fixing these people. Though, when you look at it like that, you're really most likely to think "should be rehabilitated" or maybe "should... I don't really have an opinion one way or the other". But then, if google actually did high-level clustering; that is, extracting opinions that are all at the same level of specificity, would those suggestions be useful for a search engine?

I guess the really right way to put it is -- That's what the google crawler has seen written most frequently -- and assume it doesn't really mean what you or I think about things.


> I guess the really right way to put it is -- That's what the google crawler has seen written most frequently -- and assume it doesn't really mean what you or I think about things.

Not what the crawler has seen most, but what people typing the same thing as you have ended up searching for most frequently. (We may be thinking the same thing and just confusing the words.)

I don't believe it's supposed to be "what you're most likely to be thinking", it's just a commonly searched-for phrase. I don't think Google's trying to autocomplete with your opinion because people aren't just searching for their own opinion, they're searching for words that will hopefully return the information they want.


Exactly. The "drug users" one, for example, leads to an article explaining how a police officer said that. Anyone hearing secondhand about that story and wanting to learn more about the incident would probably Google that phrase.


There's a difference between suggesting corrections (like Office) and trying to guess your next word, though. Google does both, but unless you misspell "blacks are ruining America" it's not going to suggest that as a correction. Since apparently they expect it to be searched, they suggest it as you're typing, but I don't think Office does any sort of word prediction as you type? As you said, Bing does the same. The standard is higher for Office and other actual spell checkers because they shouldn't be changing something that isn't meant to say "black people are ruining America" into that. Prediction is entirely different.


Actually, Google Docs suggests spellings the same way Google Search does. It's actually quite cool: http://lifehacker.com/5895252/googles-new-spell-check-is-cra...


Very neat. I wonder if Office online does the same using Bing's predictive features. It's still not the same as filling in "are ruining America" if you type "black men" though, Google docs doesn't do prediction like that. It's 'just' a way more clever spell checker. My point is that comparing Google search to Office isn't a sensible comparison to make, and the fact that Google's version of Office behaves similarly to Office, and Microsoft's search works similarly to Google in prediction/spell checking is pretty much exactly what I was getting at.

It would be neat to see prediction as a feature in Office/Docs etc. though, there's a pretty huge corpus of essays available, it'd be interesting to see how accurately a new one can be predicted.


Both. Office has the enterprise as its target market, and people know this and set their expectations in that context; the same thing happens on the Internet, where the context is "everyone in the world" and it's assumed that most things are popularity-driven (likes, re-tweets, etc).


I entered expecting a cool implementation of Bayes through a simple if statement.

Instead, I saw Bayes' rule in Haskell, a form that's more idiosyncratic than probability expressions.

Simple is in the eye of the beholder. Perhaps making monads isn't actually making Bayes' rule simpler.


+1 it was actually a rather horrifying presentation of Bayes's theorem. I would just explain it in 2 lines:

P(Cause and Effect) = P(Cause) × P(Effect given Cause) = P(Effect) × P(Cause given Effect)

so P(Cause) = P(Effect) × P(Cause given Effect) ÷ P(Effect given Cause)


Naming it Cause and Effect is misleading; it makes it seem like Bayes' rule only applies when there is a causal relation somewhere. It also makes the symmetry of the first line seem off. I would write the first line as

P(A and B are both true) = P (A is true) x P (B is also true, given that we know A is already true) = P (B is true) x P (A is also true, given that we know B is already true)


I was actually going to write A and B at first, but then realized that wouldn't show the relationship of the formula to the disease example very obviously.

I thought it'd be better to make it easier to understand (and more practical) by showing how it helps you infer causes from effects.


Unfortunately, "Cause" and "Effect" are actively misleading choices of names here. ("Evidence" and "Hypothesis" on the other hand are frequently used.)

ADDED. Explanation: hearing squeaking noises in the night is evidence for the hypothesis that the cheese will have bite marks on it when we look in the morning (in the sense that the squeaking noises increase the probability of the bite-mark hypothesis) even though the squeaking noises do not cause the bite marks nor do the bite marks cause the squeaking noises. You and I know that the squeaking noises and the bite marks have the same underlying cause, but Bayes's rule is useful in situations where cause-and-effect remain unknown. E.g., it can be used by a space alien without knowledge of mice who cannot afford to ponder on the possible causes of the bite marks and the squeaking noises.


Two lines of equations aren't a simple-to-use programming tool. It's debatable whether TFA achieved that goal, but you certainly didn't.


It's very easy to turn those simply stated equations into a short program that looks quite similar, without lots of strange metaprogramming going on.

As a bonus, others will easily be able to understand and fix or extend the code.


I don't see any metaprogramming going on in the original post at all.


Please show us an example in one language.


Bayes Theorem can be implemented as an if statement:

  posterior = []
  repeat many times:
    [observables, unknowns] = simulate_from_model(priors)
    if observables == observed
      posterior.push(unknowns)
posterior now includes random draws from the posterior p(unknowns|observed).

This is my favorite explanation of Bayesian statistics, since it implements Bayes' Theorem without math. It is also the underlying intuition behind the probabilistic programming approach to Bayesian statistics. Rubin (1984) has a great explanation: https://twitter.com/tristanzajonc/status/325120025428119552


The end result is an expression executing Bayes' rule and it is used much like an if expression.


As someone who has tried on occasion to get used to Haskell (and failed - Haskell makes me want to claw my eyes out), and who has a minimal maths background, I found reading Bayes' original paper substantially easier than figuring out the Haskell in that post.


> Google uses Bayesian filtering the way Microsoft uses the if statement.

I am sure it depends on the employee, not the company. MS has done amazing things in machine learning. I love Vowpal Wabbit, which is maintained by an ex-Yahoo, currently-MS employee. I am amazed at the complexity of the Kinect software that translates images into body configurations. Just today I saw another amazing technology from MS: IllumiRoom. Check out the vid: http://www.youtube.com/watch?v=nGGMv9RnJIA Check out MAVIS too, a system for transcribing videos and making them searchable. It's a little dated, but very cool nonetheless.

So, while I still think Google is superior, MS is making some amazing stuff too, from time to time.


The quote was from most of a decade ago; probably truer then. (Though Clippy was Bayesian. I read somewhere a claim that its awfulness came from management messing with it.)


The quote wasn't about their technology. It was about their decision making.

Google is famous for data driven decision making.


I've totally seen a LISP dialect with probabilistic execution and Bayesian updating built in - it was one of the lecturers at SPARC 2012 - but damned if I can remember the name of the speaker or the LISP dialect. Roughly, using this dialect, you could build a program that would randomly branch at various points, then keep only the cases of the program that matched some property, and ask what the probabilities were for some other property over those cases. I think the keyword was rejection-query but the first page of Google doesn't turn up the desired result for that.


Church? I'm not very familiar with the language but I believe it sounds similar to what you're describing. http://projects.csail.mit.edu/church/wiki/Church



It's not a Lisp, but there aren't many probabilistic programming languages, so you may be interested in IBAL (http://www.gelberpfeffer.net/Papers/ibal-ch.pdf).


Well, the quoted statement mentions Bayesian filtering, which is an application of Bayes' rule, but it's not as simple as an if.

The other quote 'removing the impossible and normalizing is equivalent to Bayes rule' is a good thing to remember.

Just to mention: drug tests have different rates of false positives and false negatives; it's not usually a flat '1% failure' rate.


This actually explains a lot about Google's abysmal customer service, especially where they appear way too trigger-happy with account suspensions and PageRank penalties for appearing too spammy.


Google lacks customer service. They are not interested in you, but in volume. And you handle volume through statistics.

If you happen to fall on an edge case, they won't really care, because you fall within the margin of error.


People who try to exploit Google on the margin risk falling the wrong side of that margin. If that happens then Google are not interested in your feeble "but if you look at this way" justifications. This is not abysmal customer service, it is simply not giving abusive idiots the benefit of the doubt. If they don't like it then they can get free service elsewhere.


I still don't understand bayesian statistics and every article I've read to try and learn has had syntax I didn't understand. I just don't know lisp/haskell/etc syntax. My college math stopped at calc 3 and it's been many years since I took that.

Is there a good "bayesian probability for dummies" type resource out there? Any suggestions?


Bayes' Theorem is about updating estimated probabilities after receiving evidence.

For example: I have two bags of sweets. One has 10 sweets, of which 8 are chocolate and 2 are marshmallows. The other has 20 sweets, of which 10 are chocolate and 10 are marshmallows.

I offer to give you a sweet, which I will select at random. Assuming I empty both bags on to the table and choose one at random (uniformly), what is the chance it comes from the first bag? Assume that I do this where you can't see what I choose.

(Answer: 10/30, or 1 in 3 - as these are the ratios of total sweet numbers).

Now, assume that after I've chosen (you still can't see), I tell you that the sweet is a chocolate (This is the evidence). What is the chance now that it came from the first bag?

(Answer: 8/18 - the ratio of first-bag-chocolates out of any-bag-chocolates)

We have now 'updated' our probability estimates based on the evidence ('It's a chocolate'). Important to note here is that the first bag has a higher ratio of chocolates than the second bag, BUT it's still more likely to have come from the second bag due to the second bag having more sweets in total - what we call the Prior probability.

This relationship can be expressed mathematically, which is what Bayes did.

    P(B|A) = P(B & A)/P(A) = P(B)*P(A|B)/(P(B)*P(A|B) + P(!B)*P(A|!B))
(Where P(X) means the probability of X occurring, P(X&Y) means the probability of both X occurring and Y occurring, and P(X|Y) means the probability of X occurring, given that Y has already occurred (or that we assume it occurs))
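If it helps, here's the same computation as a few lines of Haskell (a sketch; the names are mine):

  -- Worlds: (bag, sweet) weighted by counts.
  worlds :: [((Int, String), Rational)]
  worlds = [ ((1, "choc"),  8), ((1, "marsh"),  2)
           , ((2, "choc"), 10), ((2, "marsh"), 10) ]

  -- Condition on "it's a chocolate" and renormalize.
  pBag1GivenChoc :: Rational
  pBag1GivenChoc =
      sum [ w | ((b, s), w) <- worlds, s == "choc", b == 1 ]
    / sum [ w | ((_, s), w) <- worlds, s == "choc" ]
  -- == 4 % 9, i.e. the 8/18 above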


You probably don't understand it because nobody has made the effort to explain it to you simply. Everything I have seen on the Internet (especially programming-related sources) takes a lot for granted and just spreads a little code here and there.

Also, a good idea is not "trying to turn your mind inside out", which is what seems to be needed from a lot of explanations I have read.

It is easy, it is simple, it is normal.

I am sorry I do not have a reference handy (because of the above), but the suggestions above have helped me overcome similar "difficulties" (otherwise I would not be writing them).

Edit: Cannot help saying it. The 'a priori' and 'a posteriori' and all those terms are just fancy words (which mean what they signal) which scare a lot of people away from something, I repeat, natural.


Try:

An Intuitive Explanation of Eliezer Yudkowsky’s Intuitive Explanation of Bayes’ Theorem

http://commonsenseatheism.com/?p=13156


Thanks for the link. I loved this quote.

  Seeing the world through the lens of Bayes’ Theorem is like seeing The Matrix. Nothing is the same after you have seen Bayes.



Jeremy Kun's "Math ∩ Programming" blog has two great primers on probability theory:

http://jeremykun.com/2013/01/04/probability-theory-a-primer/

http://jeremykun.com/2013/03/28/conditional-partitioned-prob...

His next post in the series is supposed to be on Bayes' theorem.


The Theory That Would Not Die [1] is a great read. It's a complete history of the fight to legitimize Bayes' theory from Bayes' original experiment through Turing and Nazi subs and beyond. Probably not the quickest way to learn the theory, but worth the read.

[1] http://www.amazon.com/dp/0300188226


A while ago I wrote a series of introductory articles that build from the ground up, using a single coin toss as a running example. Here's the first:

http://blog.moertel.com/posts/2010-12-07-on-the-evidence-of-...


If you understand procedural languages better, the following article discusses a basic naive bayesian classifier and implements it in PHP :

http://phpir.com/bayesian-opinion-mining


Nate Silver's book The Signal and the Noise explained it to me really simply, and was a great read. Not as immediately helpful perhaps as some of the links in response to your question, but I thought it was worth sharing.


For me the trick was realizing that it's not as complicated as everybody makes it out to be. You probably understood it before you'd ever heard of it.


Let me throw my hat in the ring for this. Tell me how I did.

First, Bayesian formalism shouldn't be hard. Just remember that

    P(A | B) = P(A & B) / P(B)
and it's easy to derive (left as an exercise) Bayes's Law. The hard part is applying it.

Let's say that I have a coin and I run a game where I always bet heads. If I flip tails, you get $1; on heads, I get $1. Let's also say that 1.00% of the people in the world are depraved enough to use unfair coins that always come up heads. You don't know me, so that's a good assumption for how likely I am to be a scumbag. So there's a prior probability of 0.01 that I'm a scumbag running an unfair game, and 0.99 that I'm using a fair coin. Now, we play, and I flip a head. I win. That's evidence (a signal) suggesting it's more likely that I'm a scumbag, but it's far from conclusive (fair games will turn up heads as well). How much has the probability changed? Well, let's look at the possibilities:

    I'm a scumbag, and flip heads: 1/100 * 1 = 1/100
    -- There's a 1% prior prob., and if I'm a scumbag I'll always flip heads.
    I'm a scumbag, and flip tails: 1/100 * 0 = 0
    -- If I'm a scumbag, I'll never flip tails. 
    I'm using a fair coin, and flip heads: 99/100 * 1/2 = 99/200. 
    I'm using a fair coin, and flip tails: 99/100 * 1/2 = 99/200.
It might help to draw a rectangle for the four possibilities, like so:

      Scumbag    Fair
      (1/100)  (99/100)
    +---------+---------+
    | Scumbag | Fair    |
    | Heads   | Heads   |
    |         |         |
    |  1/100  |  99/200 |
    +---------+---------+
    | Scumbag | Fair    |
    | Tails   | Tails   |
    |         |         |
    |     0   |  99/200 |
    +---------+---------+
Once we flip a head, we can throw out the "flip tails" sections, because we didn't. We flipped heads. That leaves us with a subspace that contains 99/200 + 1/100 = 101/200 of the total space-- we know we're in that space, so we can consider only it-- while the probability mass of the "I'm a scumbag and flip heads" remains 1/100. So the probability that I'm a scumbag posterior to flipping heads is:

    (1/100)/(101/200) = 2/101 ~ 2%
Probability-wise, I'm almost twice as likely to be a scumbag after a head comes up. That relationship holds very well for small priors, but what if there's a 55% chance that I'm a scumbag? Flipping heads doesn't put that probability at 110%-- first of all, that makes no mathematical sense, and second of all, there's still a chance that I'm fair but just flipped another head. (The actual posterior probability is about 71%.)

For probabilities over 0.5, "twice as likely" doesn't make sense, if you think in terms of probability. What about odds, however? Then, you can derive that "twice as likely" as 55% is 71%.

You can say "twice as likely" if you think in odds, not probability. Odds is probability transformed through p/(1-p), or the ratio between the probability of the event happening and it not happening; it has domain [0, +inf] which means that "twice as likely" doesn't cause that problem. If you think in odds terms, it turns out that odds(I'm a scumbag) exactly doubles each time I flip a head. After the first, that odds number goes from 1/99 to 2/99. If I flip 10 heads, then it's 1024/99; transformed back into a probability it's 1024/1123, or a 91% chance that I'm a scumbag.

In fact, what I think makes Bayesian inference hard is that it involves that subtle context switch between probability and odds.

    If B is observed and there is a known prior _odds_ of A, the posterior _odds_ of A 
    are k times higher, where k is the likelihood ratio: the _probability_ of 
    B given A divided by the _probability_ of B given ~A. 
What makes Bayesian inference neat is that you can run it as an online process (you can look at events one at a time). If observed events (signals) are independent, you don't need to process the corpus as a whole, and order doesn't matter. That becomes nice when you factor in concurrency and distributed computation. From these principles, you can also derive logistic regression (multiplication in the log-odds space is a vertical shift along an S-shaped logistic curve) and you start to see why the logistic curve comes up so often in machine learning.
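A tiny sketch of that online update (mine):

  -- Posterior odds = prior odds times the product of likelihood
  -- ratios, one per independent signal; order doesn't matter.
  updateOdds :: Double -> [Double] -> Double
  updateOdds = foldl (*)

  -- e.g. updateOdds (1/99) (replicate 10 2) == 1024/99,
  -- matching the ten-heads example above.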

Ok, so what about Bayesian statistics? Essentially, it comes from the idea that:

    (1) There is some prior probability distribution [0] over the possible states 
    of affairs S, which you cannot observe directly but whose relative 
    probabilities you can estimate based on observed events E. 

    (2) For each event E observed, compute unnormalized posteriors according to 
    posterior_unnormalized(S) = prior(S) * likelihood(E | S).

    (3) To compute normalized posteriors, divide each of the unnormalized ones 
    by the sum (or integral) of the unnormalized posteriors over all states S. 
    You now have a sum of 1, which makes it a legitimate probability distribution. 
[0] Regarding (1), picking priors is more of an art than a science, which is why some people distrust Bayesian methods. The good news is that if you have a lot of evidence and your priors are reasonable (few assumptions / Ockham's Razor) you will converge to something close to the right answer with enough evidence, regardless of priors.

Now, many actual Bayesian inference methods ignore (3). It's computationally expensive (when there aren't two possible states S, but millions to infinitely many) to normalize and often we don't need to do it. For example, if you're doing classification between two classes Q and R-- with "events" being features of what you're trying to classify-- then you generally only care about which is more likely, not whether there's specifically a 23.75% chance of Q. So the normalization step is generally either skipped (if relative probabilities are all that matters) or procrastinated to the end of the analysis. This "feels wrong" at first but it actually works, and is often better (numerical stability) for the accuracy of the conclusions.
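As a sketch of that last point (my names): picking the more likely class never needs step (3), because normalization divides every class's score by the same constant.

  import Data.List (maximumBy)
  import Data.Ord (comparing)

  -- Pick the class with the highest unnormalized posterior,
  -- prior(c) * likelihood(evidence | c); the winner is the same
  -- whether or not you normalize.
  classify :: [(c, Double)] -> (c -> Double) -> c
  classify priors likelihood = fst (maximumBy (comparing score) priors)
    where score (c, p) = p * likelihood c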


Typo, you have Scumbag heads twice in your grid.


Good catch. Thanks.


Quite the Freudian slip for Mr Church.


That's a great write-up that helps a lot. Thanks!

I do wish every explanation of such things explained the syntax better. I still have to keep checking what P(A | B) and P(A & B) mean. You say the hard part is applying the equation, but the first "hard part" is learning what all the symbols mean!


Shameless plug:

Mathematica knows about Bayes' rule and makes reasoning about it quite simple. If you look here [0] you'll find Bayes' rule in the section: "Properties and relations".

What makes it extremely powerful is that it's fully symbolic so you can do things like:

    Probability[x^2 > a \[Conditioned] x < b, x \[Distributed] NormalDistribution[m, s]]
and have it evaluate to a bunch of Erf…

[0] http://reference.wolfram.com/mathematica/ref/Probability.htm...


As much as I like Haskell, if your goal is to make Bayes' rule as simple as an if statement, I doubt you can do better than quantum computer emulators.

Say the probability of being a User is 1%. You can initialize your variable, X, to a superposition, with a 99% chance of being clean and a 1% chance of being a user.

Now, the function f(x) for the test is such that:

  f(user)  = (99% positive) + (1% negative)
  f(clean) = (1% positive) + (99% negative)

Take y = f(x). Now, look at the quantum state of your machine (this is why we are on an emulator). Each of the 4 potential states has a certain probability associated with it.

The (theoretical) slowdown of using a quantum emulator is the fact that as the size of your memory increases, you have exponentially more states to deal with. But because we would need to deal with the same number of states anyway, this is not actually relevant.

I should also mention that using a real quantum computer does not provide a (trivial) speedup to this process. This is because, while the quantum state contains all 4 possibilities with some probability, when we go to measure it, we will only see 1 of the 4 states. Not only will this not give us a probability, but we would need to re-run the computation to measure it again.
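To make that concrete, here's the joint state the emulator would be tracking, written out as plain weighted worlds (a sketch of mine, not a real quantum language):

  -- Four weighted worlds after y = f(x), for a 1% prior of "user".
  jointState :: [((String, String), Double)]
  jointState = [ ((x, y), px * pyGivenX)
               | (x, px) <- [("user", 0.01), ("clean", 0.99)]
               , (y, pyGivenX) <- test x ]
    where test "user" = [("positive", 0.99), ("negative", 0.01)]
          test _      = [("positive", 0.01), ("negative", 0.99)]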


I would imagine that in real life you track the reliability of a drug test as a false-positive % and a false-negative %, rather than a single '% reliable'?


Is it just me who doesn't understand what the body of the article had to do with Google or Microsoft?


It made some sense to me. The author uses the G/M thing to introduce an idea: that it's powerful to think at scale. He then seeks a way to make a thinking-at-scale structure (Bayes' rule) as accessible to the programmer as an if statement, by putting it in the language.

That was the only bit of the article that did make sense to me :) I get lost trying to read ML-style languages.


It's just a quotation that makes you think "what if...". I wouldn't fixate on the company names.


What people are referring to with respect to Google is that if higher-level statistical operations are as easy as an if statement, then this may explain why google takes (presumably amorally) such a statistical approach to its customers (because it is easy to).


But what did any of that have to do with Microsoft, or even Google's practices?

The entire article was a tutorial on Bayes's rule, with a completely irrelevant introductory paragraph thrown in at the top. Nothing about Google or Microsoft's practices was actually mentioned anywhere.


Yes, the article was about simplifying statistical operations. The comments were inferring that some observed practices may have followed as a consequence. I'd prefer to discus the meat of the article, but it's not unusual to see comments go off on a tangent like this.


What annoys me is when people write irrelevant provocative introductions just to attract attention to their otherwise not so interesting writings.


In this case, it was a quote of a relevant provocative sentence, where the provocation itself was irrelevant. Really, the title was far more intriguing than the quote, but unfortunately the article didn't quite live up to it.


> "What if Bayes' rule was as simple as an if statement?"

As the article clearly showed, this is wishful thinking.

And there's one simple reason why:

- Bayesian inference is generally computationally intractable.

- "If" statements are quite tractable.

Any language in which you could perform Bayesian inference as easily as you perform "if" statements would result in horrifyingly slow code and become useless.


You are thinking about languages that are targeting a single machine at a time. That's the way Microsoft thinks with their history in client-side software.

The implication of the quote is that Google attempts to implement the seemingly intractable algorithms by running them on huge clusters, and returning the results to a large audience that makes the computation worth it. If your language targets a data center then having a language support Bayesian inference might be useful.


As I pointed out elsewhere in this thread, quantum computer languages literally do this.

Also, if you want to perform Bayesian inference, then it is equally difficult computationally whether or not it can be expressed as simple 'if' statements. However, in terms of code simplicity, expressing it as simple if statements is easier.


> Quantum computer languages literally do this

Quantum computers also make the sun rise every morning.

I can claim anything about what they do, because they don't exist.


Even if quantum computers do not exist, a theoretical model of them does exist, so we can discuss them the same way we discuss a Turing machine. (Although we might need to clarify which model we are talking about.)

However, that is irrelevant, because quantum computers do exist. For example, in 2012, researchers factored the number 21. [1]

However, even the existence of quantum computers is irrelevant, because programming languages designed for them still exist. [2] Not only that, but you can still run programs written in such languages on a classical computer.

Of course, you could also do this type of Bayesian programming in a language designed for that, instead of one designed for quantum computation.

[1] http://arxiv.org/pdf/1111.4147v2.pdf [2] http://tph.tuwien.ac.at/~oemer/qcl.html


I think the original quotation simply used it as a rhetorical device to indicate difference in approach at a more abstract level than literally replacing conditionals with Bayes filters.


This is funny, because Microsoft Research actually works on that.

http://link.springer.com/chapter/10.1007%2F978-3-642-19718-5...



