I am interested in a companion phenomenon to the recent interest in causal models in machine learning: namely, the fact that, at least in computer vision, the idea is not new at all and has been important for many decades.
One of the original sources that took this approach is "The Ecological Approach to Visual Perception" (1979) [0] by James Gibson, which discussed at length the idea of "affordances", similar in some respects to topics in reinforcement learning as well. Affordances represent the information about outcomes you gain by varying your degrees of observational freedom: you learn to generalize past occluded objects by moving your head a little to the left or right and seeing how the visual input varies. This lets you get food, or hide from a predator that's partially blocked by a tree, and so over time generalizing past occlusions gets better and better. That is much more interesting than a naive approach, like augmenting a labeled data set with synthetically occluded variations, the way data augmentation is often used to improve rotational invariance.
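For contrast, the naive augmentation route amounts to something like the following. This is a minimal sketch in Python; the function name and parameters are my own invention, not from any particular library:

    import numpy as np

    def occlude(image, max_frac=0.3, rng=None):
        # Paste a random rectangle of zeros over the image -- a toy
        # stand-in for generating "synthetically occluded variations"
        # of a labeled training sample.
        rng = rng if rng is not None else np.random.default_rng()
        h, w = image.shape[:2]
        oh = int(rng.integers(1, max(2, int(h * max_frac))))
        ow = int(rng.integers(1, max(2, int(w * max_frac))))
        y0 = int(rng.integers(0, h - oh + 1))
        x0 = int(rng.integers(0, w - ow + 1))
        out = image.copy()
        out[y0:y0 + oh, x0:x0 + ow] = 0
        return out

Note that nothing in that pipeline lets the learner act on its viewpoint and observe how the occlusion changes, which is exactly the information Gibson's affordances are about.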
Gibson's idea was then extended with considerable formality in the mid-to-late 2000s by Stefano Soatto in his papers on "Actionable Information" [1].
I wish more effort had been made by e.g. Pearl to look into this and unify his approach with what had already been thought of. It turns me off a lot when someone tries to create a "whole new paradigm" and it starts to feel like they want to generate sexy marketing hype about it, rather than saying, "hey, this is an extension of, a connection to, or an alternative to an older idea already present in machine learning." Instead it comes across as, "Us over here in causal-inference world already know so much more about what to do... so now let's apply it to your domain, where you never thought of this." Pearl has a history of doing this, like with his previous debates with Gelman about Bayesian models. It almost feels to me like he is shopping around for some sexy application area where his one-upsmanship approach will catch on, to give him a chance at the hype gravy train or something.
> It almost feels to me like he is shopping around for some sexy application area where his one-upsmanship approach will catch on, to give him a chance at the hype gravy train or something.
This doesn't seem like a very fitting description of Pearl. In his work, he is very careful to cite existing approaches (the structural equation modeling literature, various topics from graphical models). In his various discussions with Gelman, he comes off as freakishly polite and not looking to one-up anyone.
I'm sorry, but I simply don't agree about the politeness comment. As linked from a Quora post that goes into it, this was one of Pearl's original statements about the disagreement (the link to the original at the UCLA site appears to have been taken down) [0]:
> "I therefore invite my colleagues... to familiarize themselves with the miracles of do-calculus. Take any causal problem for which you know the answer in advance, submit it for analysis through the do-calculus and marvel with us at the power of the calculus to deliver the correct result in just 3–4 lines of derivation. Alternatively, if we cannot agree on the correct answer, let us simulate it on a computer, using a well specified data-generating model, then marvel at the way do-calculus, given only the graph, is able to predict the effects of (simulated) interventions. I am confident that after such experience all hesitations will turn into endorsements. BTW, I have offered this exercise repeatedly to colleagues from the potential outcome camp, and the response was uniform: “we do not work on toy problems, we work on real-life problems.” Perhaps this note would entice them to join us, mortals, and try a small problem once, just for sport."
This is absolutely the cheeky spirit of one-upsmanship I am talking about. The offers are always framed in terms of "look how causal inference supersedes everything," which is not a charitable take on approaches from others, especially in historical applied ML, that might have already developed some of the same underlying ideas.
I don't know. The issue he is addressing in your quote is that people often level criticisms of his approach that are just verbal statements. Pearl wants people to use data-generating models to make their concerns explicit.
The link you used explains the situation pretty well. If anything Pearl's regular acknowledgement of graphical models seems to be an indication that he is mindful of at least one very common approach in current ML.
In theory, yes. However, I think in practice addressing the concerns of critics is often out of Pearl's hands.
Until they supply a "ground truth" or data-generating model, he has a dilemma (a toy sketch of such a model follows below the list):
* if he doesn't create a data generating model, then arguments for / against his approach will be specious.
* if he creates a data generating model, they can claim it doesn't reflect reality.
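To make that concrete, here is roughly the kind of exercise Pearl's quote above is asking for, as a toy sketch (the model and its parameter values are entirely made up; any simple confounded model would do). It simulates a well-specified data-generating model in which Z confounds X and Y, and checks that the graph-based backdoor adjustment recovers the true interventional effect while the naive conditional does not:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Ground-truth data-generating model: Z confounds X and Y.
    z = rng.random(n) < 0.5
    x = rng.random(n) < np.where(z, 0.8, 0.2)
    y = rng.random(n) < 0.1 + 0.3 * x + 0.4 * z

    # Naive observational estimate P(Y=1 | X=1): biased by Z.
    naive = y[x].mean()

    # Backdoor adjustment, using only observational data plus the graph:
    # P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) * P(Z=z)
    adjusted = sum(y[x & (z == v)].mean() * (z == v).mean()
                   for v in (True, False))

    # Ground truth, by actually intervening in the simulator (force X=1).
    truth = (rng.random(n) < 0.1 + 0.3 + 0.4 * z).mean()

    print(naive, adjusted, truth)  # ~0.72 (biased), ~0.60, ~0.60

The catch, per the dilemma above, is that a critic can always respond that this generating model doesn't reflect reality.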
In the case of Judea Pearl and Andy Gelman, it seems like the point of contention is much broader than the do-calculus. Andy Gelman does not seem to be a fan of structural equation modeling / similar graphical models.
How is it out of Pearl’s hands? Also, Gelman & Rubin already did look into Pearl’s models, and even agreed that for some toy model examples, the technique works as intended, but that there are serious how-things-work-in-practice reasons why Pearl’s models are unlikely to be mathematically appropriate for some real world use cases.
It’s really a fair response from them to Pearl, especially when the whole time Pearl is presenting it like causal inference is a miracle cure-all.
All I am seeing in your comments is hand waving attempts to shift the burden of proof onto the group of practitioners who already looked into this stuff and weren’t convinced!
So why does the onus on Pearl, or on another causal-inference practitioner, to demonstrate the approach scaling up to a more complicated in-practice problem still get qualified with an "in theory" from you? Why isn't it resoundingly obvious by this point that the burden of proof lies with Pearl, and that people would be happy to hear if he can use these models for large-scale, practical use cases, but (rightfully) don't see a reason, even after looking into the models, to spend their own time doing it?
Many people have already and continue to try his experiment, and indeed work hard on scaling up the problems that his method is applied to.
That’s part of the problem. It was offered up as a “miracle” that supersedes and fully, logically subsumes the hard fought and real-world tested methods of others, who rightfully weren’t going to advocate the use of some other, unproven thing claimed to be a cure-all, yet they still did engage heavily with it and worked through the math and agreed with derivations for simple models under various collections of assumptions.
I’m still not seeing anything convincing. Pearl’s tone is not polite or even quirky, and strays badly from any measure of humility or collaborative spirit of inquiry. Really. I mean, sure, he’s not coming out dropping f-bombs, but that’s utterly not the point. Trying to justify it as if it’s a diplomatic and earnest request is silly. “Hey, your career’s worth of work is totally wrong. Here look, my work completely supersedes yours because I worked through some canonical, highly academic cases. But I’m being fair & balanced about it, I swear!”
And then when folks did look into it and still felt unconvinced, “I guess you’re just dogmatic & closed minded. I offered to look into it with you, but you ignored me.”
> I wish more effort had been made by e.g. Pearl to look into this and unify his approach with what had already been thought of. It turns me off a lot when someone tries to create a "whole new paradigm" and it starts to feel like they want to generate sexy marketing hype about it, rather than saying, "hey, this is an extension of, a connection to, or an alternative to an older idea already present in machine learning"...
I think you wind up with a situation where none of the less-than-mainstream conceptions of intelligence will have further parts added. Instead, each becomes associated with a single individual's career. It's something in the nature of academia, a situation that made sense when scientific models and approaches were "small" enough to be fully encompassed by an individual.
But then you have the problem that models aren't naturally modular. Whether model X extends model Y is something of a judgment call. What makes one model like or unlike another is a matter of both the structure of the model and the reasoning behind it.
Moreover, consider that ten programmers creating one computer program tend to be proportionately less productive than one programmer creating a program (i.e., as a rule they work much less than 10x as fast). Ten theorists putting together one single theory may face a similar or greater problem of diminishing returns and coordination.
The development of Quantum Field Theory is a good example where >10 people all collaborated to come up with a framework that integrated the viewpoints of multiple theorists with radically different approaches, rather than every new contributor forking a personalized version of the previous theory.
Consider, for example, the way Freeman Dyson combined the graphical approach of Feynman with Schwinger's more formal methods.
> The development of Quantum Field Theory is a good example where >10 people all collaborated to come up with a framework that integrated the viewpoints of multiple theorists with radically different approaches, rather than every new contributor forking a personalized version of the previous theory.
Sure, and I hope I was clear that I don't think ten theorists (or ten programmers) collaborating is impossible. I would simply say that collaboration has an extra cost to it, and in a competitive academic world, any cost needs some degree of payoff. This makes extending a mainstream theory advantageous, but not so much a less-known theory.
And Quantum Field Theory had the advantage that the experiments for demonstrating its truth or falsehood were relatively straightforward. With AI, the question of a theory's truth is more debatable.
You make good points, and particularly as an explanation of why there might not be much effort to unify approaches, this makes sense.
But it still doesn't explain Pearl's generally thorny disposition regarding other approaches. Most practitioners and researchers will err on the side of humility, assuming that broad swaths of comparable research are valuable and that many of their ideas have probably been thought of before, in one form or another, even if their approach deserves praise for its innovation or novelty.
David Mumford, in his 'dawning of the age of stochasticity' lecture, mentioned the idea of a 'hubris quotient' -- for him it was the idea of claiming to adequately summarize thousands of years of math progress to the point that someone could actually say something novel, in the span of a single career. If you've only been working on it for 30-40 years, and you're claiming to upend something that's been central for hundreds of years, that's a poor hubris quotient, and so maybe you should proceed with a lot of humility and caution.
It just never quite feels like Pearl accepts this for causal inference. Maybe he feels like it has not gotten the attention it deserves and needs to advocate in a more no-nonsense kind of way, but it just seems like somewhat of a bad hubris quotient to start speaking about how it is a novel take on something he feels ML has historically not adequately accounted for.
Well, I'm not qualified to judge Pearl's integrity.
I would note that Pearl is not necessarily the first or the only person to note that modern machine learning has problems associated with it, problems often summarized as "correlation is not the same as causation." We can see actual practical problems appear when machine learning systems are deployed in situations where they make definite judgments affecting people's lives based only on factors correlated with a condition. In the extreme, if factors X, Y, and Z are associated with someone acting criminally, are we allowed to arrest the person without a crime being committed? (Etc.)
So Pearl has some credibility stepping into this "breach" with his (perhaps self-branded, but) more mathematically grounded and statistically sound approach. Of course, the problem is that no statistics really gives a "sound" way to unambiguously predict a future datum from past data alone. The Bayesian approach does describe how to make sound predictions when you happen to know prior probabilities, a view that "kicks the can down the road," as others have mentioned.
The thing is, in contrast to math, AI has involved a group of models, theories, and ideas which have all broadly moved forward across the decades, their stars rising and falling but few being utterly discarded. This is because little to nothing can be proven, and moreover because, despite being presented as alternatives, they intersect like fat Venn diagrams if considered only formally (though as specific programs of research they may be exclusive). Moreover, publicity is one key to a given approach getting more concrete implementations, and ultimately getting funding, more researchers, and a chance to go on to the next generation. The relative speed of a neural net on a GPU might well be a key to this sort of model showing promising practical applications. Is this speed inherent, or are other models waiting for optimized implementations? If such an optimized implementation is possible, it would require a specialized programmer, and hence funding.
And this means? Well, I'm not sure what it means. Perhaps one could deduce a correct model of machine intelligence if one could determine and correct for the biases which currently drive the process.
This comment doesn't really make much sense to me, especially since none of Pearl's techniques have been convincingly demonstrated to work in real situations. It's one thing to take pot shots at practical engineering problems and point out flaws and places for improvement, but it's quite different to claim that a new framework would solve them when (a) elements of that framework have already existed for a while and practitioners knew about them, and (b) the framework hasn't been shown to give state-of-the-art performance or to actually fix cases where algorithmic decision making made improper judgments.
Do you have examples to dispute this... actual examples where a causal inference based model was used for large-scale deployed machine learning problems and demonstrably fixed some type of judgment error that had previously been leading to bad outcomes for people?
I mean, there are structural equation models that preceded Pearl's work, which Pearl cites. And before that, the Neyman-Rubin work; Neyman first wrote about it in 1923. I think Pearl's principal insight was to use graph theory to reason about either Bayesian things (see probabilistic graphical models) or causal things (see causality). This is a fairly fundamental insight.
Pearl's attention to the do-conditional -- i.e., P(Y|do(X)) versus P(Y|X) -- is interesting and important in a certain sense, but I'm not sure it has really resolved debates about causality in any practical sense.
I don't really mean that in a dismissive sense, just to point out that his notation just begs the question of what do(X) means, in terms of why it is actually important. To me it just kind of formalizes a certain notation and kicks the hard theoretical can down the road.
In the books and papers I've read of Pearl's, he makes reasonable logical arguments for certain types of causal inferences, but when, in discussion with colleagues, we've tried to think of how they would be implemented outside the context of an experiment, we've been sort of at a loss. I say this as someone who identifies with observational study professionally, but who recognizes the importance of experiments.
My broader point is that I think Pearl's do-calculus can be reexpressed in traditional graph theory/structural equations/statistics without introducing anything new. In that sense, although I think his writings have drawn attention to important issues, I don't think they have solved anything.
> just to point out that his notation just begs the question of what do(X) means,
It's very formally specified. The key object of study in Pearl-style causal inference is a structural causal model. A structural causal model is composed of equations like the following:
Y = f(X, Z, U)
Here, X and Z are observed inputs (other random variables in your system), and U is unobserved. In other words, "Y is computed by a deterministic function which takes an unknown random input."
Then, P(Y = 1 | do(X=1, Z=2)) is defined as P(f(1,2, U) = 1).
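In code, that definition is literally "hold the intervened inputs fixed and average over the noise." A minimal Monte Carlo sketch in Python, with a made-up f and noise distribution for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    # A made-up structural equation Y = f(X, Z, U); the functional
    # form is purely illustrative.
    def f(x, z, u):
        return ((x + z + u) > 2.5).astype(int)

    # P(Y=1 | do(X=1, Z=2)) = P(f(1, 2, U) = 1): fix X and Z at their
    # intervened values and integrate out the unobserved noise U.
    u = rng.normal(size=1_000_000)
    p_do = f(1, 2, u).mean()
    print(p_do)  # ~0.69, i.e. P(U > -0.5) for standard normal U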
My gripe isn't with the importance of what Pearl published -- of course it's important. I just mean that the concept of conditioning on how the target or observed outcome varies when you intentionally vary some conditioning variables -- that concept, for use in machine learning, is not new at all. Causal models would just be one more take on it, with interconnections and differences and pros and cons compared with what came before. But it's always disingenuously framed like, "ML practitioners never knew about doing this, but it's the only way to truly go further with our models."
[0]: https://en.wikipedia.org/wiki/James_J._Gibson#Major_works
[1]: http://www.vision.cs.ucla.edu/papers/soatto09.pdf