Conformal Prediction: Machine Learning with Confidence Intervals (scottlocklin.wordpress.com)
92 points by scottlocklin on Feb 6, 2017 | 9 comments



Interesting post!

A quibble:

> If you’re a Bayesian, or use a model with confidence intervals baked in, you may be in pretty good shape. But let’s face it, Bayesian techniques assume your prior is correct, and that new points are drawn from your prior. If your prior is wrong, so are your confidence intervals, and you have no way of knowing this.

I don't agree. Bayesian models must be validated, like any model. While no validation process is exhaustive, a predictive validation process is designed to directly test the applicability of the prior to a set of results.

Furthermore, there are priors (Jeffreys, for example) that are entropy maximizing, from an information standpoint. These non-informative priors are designed to be used in place of a prior that might otherwise be misspecified. It is not uncommon for reviewers to ask for results reproduced with Jeffreys priors to ascertain whether this is indeed the case.
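(For reference, the Jeffreys prior is the one proportional to the square root of the determinant of the Fisher information, p(theta) ∝ sqrt(det I(theta)), which is what makes it invariant under reparameterization.)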


... and in fact, it's machine learning models whose performance is tested using cross-validation or a holdout sample that assume new points are drawn from exactly the same distribution that generated the training data, whereas Bayesian priors are often used for the exact opposite purpose: when you're afraid the data is too noisy or unrepresentative and won't generalize without incorporating outside information.

Also, there is no need to train any model whatsoever if your prior is correct. The prior is the model.


I said it in an awkward way. The important thing is, your CP gizmo will tell you when something has gone haywire while you're using it. Your Bayes doodad might not (other than noticing errors). In particular, some very important new point may have a bad prediction associated with it, with bad error bars: CP may be of big help here. There's an example of this on pages 102-106 of the original book, I think. Another good example is applying this idea to HMMs, which you can read about in "Hidden Markov Models with Confidence" by Giovanni Cherubin and Ilia Nouretdinov.

I'm not throwing poo at Bayesian models, which I think are sadly neglected these days, but with CP ideas you can get more useful results. While I think CP is useful for practitioners now, the most exciting applications are in stuff like active learning, and developing novel techniques associated with this basket of ideas.

Really, my blog kind of missed the mark. I need to do more with examples.


> Really, my blog kind of missed the mark.

It'd be sad if people only published stuff once they considered it perfect. I enjoyed the post, thanks for writing.


> I enjoyed the post, thanks for writing.

Same. My quibble was just part of the discussion, not intended to imply it was a poor article.


Agreed.


I'm glad to see conformal prediction getting a bit more exposure. The idea is quite interesting and reasonable (in terms of the assumptions you make), and in principle the transductive conformal approach can be used to turn any standard supervised learning algorithm into one that produces some kind of conformal confidence intervals. But, as Scott writes

> Practically speaking, this kind of transductive prediction is computationally prohibitive
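To make that concrete, here's a rough sketch of what the transductive loop looks like for classification. The 1-nearest-neighbour nonconformity score and the names are my own choices for illustration, not from the post or the book, but it shows where the cost comes from: every candidate label for every test point means recomputing all the nonconformity scores.

    import numpy as np

    def nonconformity(X, y, i):
        # distance to nearest same-label point over distance to nearest
        # other-label point (larger = stranger); assumes >= 2 points per class
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        return d[y == y[i]].min() / d[y != y[i]].min()

    def conformal_p_values(X_train, y_train, x_new, labels):
        # transductive step: tentatively give the new point each candidate
        # label and recompute every nonconformity score from scratch
        p = {}
        for lab in labels:
            X = np.vstack([X_train, x_new])
            y = np.append(y_train, lab)
            scores = np.array([nonconformity(X, y, i) for i in range(len(y))])
            # p-value: fraction of examples at least as strange as the new one
            p[lab] = np.mean(scores >= scores[-1])
        return p

    # prediction region at significance 0.05: the labels we can't reject
    # region = {lab for lab, pv in conformal_p_values(Xtr, ytr, x, np.unique(ytr)).items() if pv > 0.05}

Under exchangeability the set of labels with p-value above epsilon covers the true label with probability at least 1 - epsilon, which is the whole appeal.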

I tried to use this some years ago - with the simpler ridge regression conformal approach described in the book [1] - when fitting empirical models to experimental data (low number of samples, very high cost of obtaining more samples, high dimensional space) where it seemed desirable to produce some kind of reasonable estimate of the uncertainty of the model fit without making a bunch of assumptions about the underlying relationship.

> There are a number of ad hoc ways of generating confidence intervals using resampling methods and generating a distribution of predictions.

In practice I ended up doing something ad-hoc -- just bootstrap sample and fit a bunch of decision trees, then back some kind of crude confidence interval out of the distribution of resulting predictions. I think I ended up preferring this over a conformal regularised linear model approach because the trees seemed to be better able to model the actual relationship than whatever simple family of linear models we were using (probably just degree 2 polynomials in the raw input values; there wasn't really enough data for the number of dimensions to support doing much else).
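In case it's useful, a minimal sketch of that recipe (made-up names, assuming scikit-learn; the interval only reflects the spread of the refit trees, so it's as crude as I said):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bootstrap_tree_interval(X, y, X_new, n_boot=500, alpha=0.1, seed=0):
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
            tree = DecisionTreeRegressor().fit(X[idx], y[idx])
            preds.append(tree.predict(X_new))
        preds = np.array(preds)                          # shape (n_boot, n_new)
        lo = np.percentile(preds, 100 * alpha / 2, axis=0)
        hi = np.percentile(preds, 100 * (1 - alpha / 2), axis=0)
        return preds.mean(axis=0), lo, hi                # point estimate plus crude interval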

I've never read up on the non-transductive approaches to conformal prediction, so it'll be interesting to go through some of the references from this post.

[1] Algorithmic Learning in a Random World - Vovk, Gammerman, Shafer, http://www.alrw.net/


That's an interesting discussion! Having read the Vovk papers, this blog post definitely presents things much more clearly. The original papers often don't adhere to the standard definition/lemma/proof style of mathematical exposition, which makes them really hard to follow.

It's also an interesting coincidence that this story is on the front page today. I'm giving a talk tomorrow at AAAI on some work that extends this theory. We show how to do uncertainty estimation (e.g. calibrated probabilities for ML classifiers) under fully adversarial assumptions (input data can be chosen by an adversary). I'll do a shameless plug and post the paper here, in case people are interested in this general topic:

https://arxiv.org/abs/1607.03594


Some guys on Reddit pointed this out as a clearer presentation:

https://people.dsv.su.se/~henke/DSWS/johansson.pdf





