A Bayesian reflection of "Invariance, Causality and Robustness"

Posted by Yuling Yao on Feb 23, 2021.       Tag: causal  

I was reading Peter Bühlmann’s Statistical Science article “Invariance, Causality and Robustness”. To be fair, he gave a short course here at Columbia in 2020, but after reading this paper I realize I did not totally understand his lecture back then.

Going beyond the potential outcome framework

The motivation to depart from the potential outcome framework is that in open-ended observational data gathering we do not know which inputs are treatments and which are covariates. Or, to put it another way, rather than finding the effect of a cause, Bühlmann is investigating the cause of an effect. Denoting $X$ as all input variables and $Y$ as the outcome, one goal is to understand which variables in $X$ actually cause / causally impact $Y$.

Bühlmann defines causality as invariance under different environments, assuming an unconfounded situation, or equivalently that we have collected all possibly relevant input variables $X$. Under Bühlmann’s definition, we would like to collect data from various environments (perturbations of the marginal of $x$, different countries, different experimental designs, etc.), and the $(x, y)$ relation that remains unchanged in all environments is then the causal one.

For a concrete example, consider the input $x$ to be a gene sequence of length 1000, $y$ some protein expression, and suppose we collect data from 10 “experiments”. We denote the data by $(x_{ijd}, y_{ij})$, $i=1,\dots, n$, $j=1, \dots, 10$, $d=1,\dots,1000$, so $i, j, d$ are indexes for data, environment, and covariate dimension respectively.

Bühlmann folds this into a multiple testing problem: seek a subset $\mathcal{S}\subseteq \{1, \dots, 1000\}$ such that the conditional distribution of $y\mid X_\mathcal{S}$ is invariant, i.e.,

\[y_{i j}= \sum_{d\in \mathcal{S}}\beta_{d} x_{ijd} + \mathrm{iid~ noise}, \quad \forall j.\]

Given a subset $\mathcal{S}$ we can certainly test this hypothesis by testing whether the regression coefficients are invariant across environments. To find the subset $\mathcal{S}$ (the true cause), we can run this test over all $2^{1000}$ subsets, which needs to be done through the lens of multiple testing. This is the basis of Bühlmann’s method.
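To make this concrete, here is a minimal simulation sketch in Python/NumPy (the data generating process, the candidate subsets, and the crude “coefficient spread” measure below are my own illustration, not Bühlmann’s actual test statistic): for a candidate subset, fit a separate regression in each environment and look at how much the coefficients move across environments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, D = 200, 10, 5  # observations per environment, environments, covariates

def invariance_gap(X, y, S):
    """Crude invariance measure: across-environment sd of per-environment OLS coefficients."""
    betas = []
    for j in range(J):
        beta_j, *_ = np.linalg.lstsq(X[j][:, S], y[j], rcond=None)
        betas.append(beta_j)
    return np.stack(betas).std(axis=0)

# Simulate: x_0 is the only cause of y; x_1 is a descendant of y whose
# dependence on y changes with the environment, so it is not invariant.
X, y = [], []
for j in range(J):
    xj = rng.normal(size=(n, D))
    yj = 2.0 * xj[:, 0] + rng.normal(scale=0.5, size=n)
    xj[:, 1] = (0.5 + 0.2 * j) * yj + rng.normal(scale=0.5, size=n)
    X.append(xj)
    y.append(yj)

print(invariance_gap(X, y, [0]))     # small: conditioning on the true parent is invariant
print(invariance_gap(X, y, [0, 1]))  # larger: adding the descendant makes the coefficients drift
```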

Bayesian adaptation

From a Bayesian perspective, if I collect data from 10 “environments”, a more natural approach is to fit a hierarchical regression with all inputs included,

\[y_{i j}= \sum_{d=1}^{1000} \beta_{jd} x_{ijd} + \mathrm{iid~ noise}, ~~ \beta_{j d}\sim\mathrm{normal}( \mu_{d}, \sigma_d), ~~\mathrm{other~priors}.\]

Now if the true DG is really the reduced form (i.e., there exists a sparse subset $X_{\mathcal{S}}$ whose elements are $y$’s parent nodes), then with a large sample size we should expect

\[\sigma_d \to 0, ~~\forall d\in \mathcal{S}.\]

To put it another way, we can interpret $\sigma_d$ as the extent to which the finding for input $d$ can be transported to other environments (it stays away from zero for inputs whose association with $y$ varies across environments), and therefore $\sigma_d = 0$ means causality.
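For concreteness, here is one way this hierarchical regression could be written down in PyMC and fit to simulated data; the hyperpriors, the half-normal on $\sigma_d$, and all variable names are my own choices and only a sketch, not anything prescribed by the paper.

```python
import numpy as np
import pymc as pm

# Simulated data: 10 environments, 5 covariates. x_0 is the only cause of y;
# x_1 is a descendant of y whose link to y changes with the environment, so the
# partial regression coefficients drift across environments.
rng = np.random.default_rng(0)
n, J, D = 200, 10, 5
env = np.repeat(np.arange(J), n)                       # environment label of each observation
X = rng.normal(size=(n * J, D))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n * J)
X[:, 1] = (0.5 + 0.2 * env) * y + rng.normal(scale=0.5, size=n * J)

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0, shape=D)            # shared mean of beta_{jd} across environments
    sigma = pm.HalfNormal("sigma", 1.0, shape=D)       # between-environment sd; sigma_d near 0 is read as invariance
    beta = pm.Normal("beta", mu, sigma, shape=(J, D))  # environment-specific coefficients beta_{jd}
    tau = pm.HalfNormal("tau", 1.0)                     # residual noise scale
    yhat = (beta[env] * X).sum(axis=-1)                 # each observation uses its own environment's coefficients
    pm.Normal("y_obs", yhat, tau, observed=y)
    idata = pm.sample()                                 # inspect the posterior of sigma afterwards
```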

The nice things about this Bayesian framework are:

  1. We avoid multiple testing.
  2. Generalization to non-linear models is straightforward.
  3. In practice we will never observe $p (\sigma_d\mid y)= \delta(0)$ exactly. Instead, we can place a horseshoe prior on $\sigma_1, \dots,\sigma_{1000}$ to encourage sparsity (one possible parameterization is sketched right after this list). I almost want to call this model automated cause finding.
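For instance, a (half-)horseshoe-style parameterization of the positive scales $\sigma_d$ could look like the following; this adaptation of the horseshoe to scale parameters is my own sketch, not a recommendation from the paper:

\[\sigma_d \mid \lambda_d, \tau \sim \mathrm{normal}^{+}(0, \lambda_d \tau), ~~ \lambda_d \sim \mathrm{Cauchy}^{+}(0, 1), ~~ \tau \sim \mathrm{Cauchy}^{+}(0, 1),\]

so that most $\sigma_d$ are shrunk toward zero while a few can escape the shrinkage and stay large.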

An alternative Bayesian approach is predictive projection (post-inference variable selection): after obtaining the full posterior $p(\sigma \mid y)$, we project it onto the subspace where $\sigma_{d}=0$ for some set of $d$.
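Roughly, in the spirit of projection predictive variable selection (the notation here is mine), the projected parameter is the member of the restricted subspace whose predictive distribution stays closest, in KL divergence, to that of the full model:

\[\theta^{\perp} = \arg\min_{\theta:~\sigma_d = 0~\forall d \notin \mathcal{S}} \mathrm{KL}\Big(p(\tilde y \mid y, \mathcal{M}_{\mathrm{full}}) ~\Big\|~ p(\tilde y \mid \theta)\Big).\]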

Other thoughts

  1. There is generally no contradiction between prediction and causality. A good prediction requires knowing the true DG, and the true DG = causality. Conversely, knowing causality = knowing the true model = robust predictive extrapolation.
  2. But certainly I can come up with a counter-example when I use the word generally. Think about an example in which we know the true DG is $x_1 \to y$: $x_1$ is a car’s speed and $y$ is the distance it moves in 1 minute. But suppose we do not know physics, so we also collect all other inputs, such as the color, the weight, the number of wheels… from many observed cars. Now if we do the automated cause finding, we will get $y=\beta_1 x_1$, the correct physical finding, and all other variables are irrelevant. But in principle we cannot measure the speed exactly, so we always have some measurement error. Let’s say $x_1 = \mathrm{speed} + \mathrm{normal}(0, 0.01)$ and $x_2 = \mathrm{speed} + \mathrm{normal}(0, 0.02)$. To improve the prediction, we would like to include both $x_1$ and $x_2$ in the linear model, at the cost of polluting the causal finding (see the sketch after this list).
  3. The framework of invariance = causality = true DG is a good justification for why we often advocate fake data simulation in Bayesian modeling, even though we seldom make reference to causality. Fake data simulation amounts to generating (fake) data from various environments, and the fitted model behaving as desired amounts to invariance under environments.
  4. One conceptual challenge is how to make the division between inputs and environments (analogous to the division between covariates and treatment in the potential outcome approach). If I collect surveys from 10 countries, should I view country as an input (a possible treatment/cause) or as an environment? So I guess, after all, there is some human decision involved here.
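Here is a tiny simulation of point 2 in Python/NumPy (the speed range, noise scales, and sample sizes are arbitrary choices of mine): regressing on both noisy measurements $x_1$ and $x_2$ predicts $y$ better out of sample than regressing on $x_1$ alone, even though only speed is the cause.

```python
import numpy as np

# The cause is "speed"; x_1 and x_2 are two noisy measurements of it
# (sd 0.01 and 0.02, reading the text's normal(0, 0.01) as an sd);
# y is the distance traveled in 1 minute.
rng = np.random.default_rng(1)

def simulate(n):
    speed = rng.uniform(0.5, 2.0, size=n)          # true cause, in km per minute
    x1 = speed + rng.normal(scale=0.01, size=n)    # measurement 1
    x2 = speed + rng.normal(scale=0.02, size=n)    # noisier measurement 2
    y = speed * 1.0                                # distance traveled in 1 minute
    return np.column_stack([x1, x2]), y

X_train, y_train = simulate(10_000)
X_test, y_test = simulate(10_000)

def test_mse(cols):
    beta, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
    return np.mean((y_test - X_test[:, cols] @ beta) ** 2)

print("x1 only:  ", test_mse([0]))
print("x1 and x2:", test_mse([0, 1]))   # smaller: pooling both noisy proxies of speed helps prediction
```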