Who is more Bayesian, Rubin or Pearl?

Posted by Yuling Yao on May 23, 2021.       Tag: causal  


OK, the short answer is that they are both experts in Bayesian statistics, or at least in the use of Bayesian inference.

But we have argued that there are many levels of Bayesian practice. For example, a Gelman-school Bayesian statistician would be characterized by the use of Stan, and more intrinsically by a preference for generative models that can often be mapped into a chunk of Stan code.

The word “generative’’ can mean other things; here I mean that we have a probabilistic model for the outcome given the input, so that we can generate predictions and simulation draws for this outcome. If we have such a model, then causal inference is trivial: we impute all potential outcomes. In contrast, reduced-form models do not model individual outcomes. For example, in ANOVA we do not model $y$ at all; we run a hypothesis test. In a Latin square design, we do not model the row or column effects (even though we could). In a least-squares fit, we do not have a probabilistic model for $y\vert x$: whether the noise is Gaussian or student-$t$(8), the least-squares fit is equally fine.
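To make the impute-all-potential-outcomes idea concrete, here is a minimal sketch on made-up data. Plain least squares in NumPy stands in for a full Bayesian model (the Stan version would replace the point estimate with posterior draws); the point is only the workflow: fit an outcome model, then fill in both potential outcomes for every unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: binary treatment z, covariate x,
# outcome generated with a true treatment effect of 2.0.
n = 2000
x = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)
y = 1.0 + 2.0 * z + 0.5 * x + rng.normal(size=n)

# Fit a simple generative model for y | z, x by least squares
# (a stand-in for a full Bayesian model fit in Stan).
X = np.column_stack([np.ones(n), z, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Impute both potential outcomes for every unit...
y1 = np.column_stack([np.ones(n), np.ones(n), x]) @ beta
y0 = np.column_stack([np.ones(n), np.zeros(n), x]) @ beta

# ...and the average treatment effect is just their mean difference.
ate_hat = np.mean(y1 - y0)
```

Once the model imputes every unit's $y(1)$ and $y(0)$, any causal summary, individual or average, is a function of those imputations.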

Many of Rubin’s approaches fall in the reduced-form modeling culture. I suspect this preference was influenced by Cochran, who did a lot of work on experimental design, and Cochran’s experimental design is pretty much a hedging tool to eliminate all environmental factors so that the treatment effect can be derived from a sample average. A similar goal applies when we are doing matching or weighting: we would like the procedure to mimic a randomized experiment as closely as possible, so that we do not have to model all the other confounders parametrically. Neither matching nor weighting is generative. Perhaps matching can be translated into a $k$-nearest-neighbors type of model to become conceptually generative, but there is still an additional gap from $k$-nearest neighbors to a probabilistically generative model. A further development along this line is the rich literature on semiparametric theory, in which nuisance parameters are automatically dismissed in estimation, so the focus is not on individual-level prediction. For example, regressing $y$ on the propensity score appears to be efficient in the semiparametric sense, but it is often not a good individual-level prediction.
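For contrast with the generative sketch above, here is a hedged sketch of the weighting side of this culture: an inverse-propensity-weighting estimate of the ATE that never models individual outcomes at all. The data-generating process is invented for illustration, and the true propensity score is used in place of an estimated one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: treatment assignment depends on a confounder x.
n = 5000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))               # true propensity score
z = rng.binomial(1, p)
y = 2.0 * z + x + rng.normal(size=n)   # true ATE = 2.0

# Weighting (non-generative): reweight the sample to mimic a
# randomized experiment; no model for y is ever written down.
w1 = z / p
w0 = (1 - z) / (1 - p)
ate_ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
```

The estimator recovers the average effect, but by construction it produces no prediction for any individual's outcome, which is exactly the trade this culture makes.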

A DAG, on the other hand, is automatically Bayesian and generative. It can typically be mapped to a Stan program (maybe with the exception of discrete variables, but those can be marginalized out too), and vice versa. In this sense, inference from a DAG is at least more “generatively Bayesian” than weighting, matching, double robustness, etc.
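To illustrate this mapping on a toy example (mine, not Pearl's): a small DAG can be written directly as a generative program by sampling each variable given its parents in topological order. A Stan program would express the same factorization with sampling statements; here I use forward simulation in Python.

```python
import numpy as np

rng = np.random.default_rng(3)

# A toy DAG with edges x -> z, x -> y, z -> y, written as a
# generative program: each variable is sampled given its parents.
def simulate(n):
    x = rng.normal(size=n)                     # x has no parents
    z = rng.binomial(1, 1 / (1 + np.exp(-x)))  # z | x
    y = 2.0 * z + x + rng.normal(size=n)       # y | z, x
    return x, z, y

x, z, y = simulate(10000)
```

Reading the program top to bottom recovers the DAG's factorization $p(x)\,p(z\vert x)\,p(y\vert z,x)$; this is the sense in which a DAG and a generative program are two notations for the same object.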

But instead of viewing their graph as one of many plausible models, I find that DAG people are often obsessed with treating a graph as a description of (conditional) dependence. The historical reason is probably computational: in bygone days, we needed a lot of conditional independence to run Gibbs sampling or variational inference. Nowadays, in generic computation software such as Stan, conditional independence is not even a consideration. But this ease of computation has not eased DAG people’s obsession with (in)dependence. For example, one critical assumption is the faithfulness condition: a DAG is mapped to a set of dependence and independence relations, and nothing more is allowed in the population. In contrast, I don’t think a matching person would spend much time discussing conditional independence: matching is sold as being distribution-free (in the same way that a least-squares estimate is prima facie distribution-free).

Certainly there are some aspects of Pearl’s approach that connect to semiparametric efficiency, or the reduced-form-model-for-efficiency culture, such as his recommendation to regress only on back-door variables. A fully generative model would want to include all variables. I am not saying we should include all variables and then read off a regression coefficient as the causal effect; I am talking about using a good regression model for individual-level predictions, from which we can extract all potential outcomes.

In light of these comparisons, the difference between Rubin’s and Pearl’s approaches to causal inference is orthogonal to the battle between build-a-generative-model-as-much-as-you-want and reduced-form-for-robustness-and-efficiency. Perhaps another orthogonal dimension is use-a-model-to-smooth-the-outcome-then-do-something-else versus directly-manipulate-data-to-obtain-an-algorithm.

There is a third participant in this causal inference battle. I want to call it Gelman’s approach, but I guess he hates human names in methods. The idea is that we build a fully generative model (as big as you want) for individual-level prediction. Then all causal questions, individual or average, are answered by imputations/generated quantities. Still, there are uncheckable assumptions to be made, such as unconfoundedness. However, we would not assume our model is the true model and derive a sequence of conditional independencies from it. Rather, we would check and improve our model in terms of better individual-level predictions. Perhaps this view is indeed more “generative” than a DAG. A DAG can, but often prefers not to, specify the joint distribution of all variables, and that is why it needs a “faithfulness assumption”: a DAG can represent a family of generative models, in the way that a reduced-form least-squares fit can represent both the normal- and student-$t$-error models.

I think people in the foxhole have followed this fully-generative approach. Maybe it is time to develop more theory. For example, suppose I know the individual-level outcome $y$ depends on the treatment $z$, some related variables $x_1$, and another set of inputs $x_2$, and I also know that $z$ depends on $x_1$ but is independent of $x_2$. Then, at the risk of being attacked by both Rubin and Pearl, a foxhole fully-generative Bayesian would still want an outcome model for $y$ that includes $x_2$, $x_1$, and $z$. This model is likely less efficient for estimating the ATE, but it has the benefit of robustness against model misspecification. Is there any use of the propensity score in this fully-generative approach? How do we balance the goal of generic model fitting (robustness) with targeted estimation (efficiency) in the workflow?
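As a rough numerical sketch of this setup (all numbers and model choices hypothetical), the following simulation fits both the full outcome model with $x_2$ and the reduced one without it. Both recover the treatment coefficient, since $x_2$ is independent of $z$, but only the full model gives good individual-level predictions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data-generating process: z depends on x1 only,
# while the outcome depends on z, x1, and x2.
n = 4000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
z = rng.binomial(1, 1 / (1 + np.exp(-x1)))
y = 1.5 * z + x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

def fit_and_predict(features):
    """Least-squares fit of y on [1, features]; returns (beta, fitted y)."""
    X = np.column_stack([np.ones(n)] + features)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X @ beta

# Full generative model: include x2 even though z is independent of it.
beta_full, yhat_full = fit_and_predict([z, x1, x2])
# Reduced model: drop x2.
beta_red, yhat_red = fit_and_predict([z, x1])

rmse_full = np.sqrt(np.mean((y - yhat_full) ** 2))
rmse_red = np.sqrt(np.mean((y - yhat_red) ** 2))
```

In both fits the coefficient on $z$ (here `beta[1]`) is close to the true 1.5, but the reduced model's individual predictions carry all of $x_2$'s variation as unexplained noise, which is exactly the individual-level-prediction loss the foxhole Bayesian objects to.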