Automated Predictive Evaluation of Causal Inference Competitions

Posted by Yuling Yao on May 30, 2019.       Tag: causal  
We can evaluate causal inference methods through simulations, based on the predictive performance of the ATE, but is that enough?

One day Andrew showed me the paper Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition by Dorie et al., along with several discussions.

The paper provides a comprehensive comparison of 30 methods submitted to the 2016 Atlantic Causal Inference Conference competition, where all methods are evaluated on large-scale simulated datasets.

As expected, no single method outperforms all others; otherwise causal inference would be a finished area. Notably, BART seems to have the best average performance, followed by stacking (h2o).

Causal inference, given unconfoundedness assumptions, is after all a prediction problem: predict $E(y \mid z=1)-E(y \mid z=0)$. If I can reliably predict $y \mid z, x$, then I automatically have a reliable estimate of $E(y \mid z)$ after averaging over $x$.
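
To make this "prediction view" concrete, here is a minimal sketch of the plug-in (g-computation-style) estimate: fit any regression of $y$ on $(z, x)$, predict under $z=1$ and $z=0$, and average over the observed $x$. The data-generating process and the choice of scikit-learn's gradient boosting are my own illustration, not anything from the paper or the competition.

```python
# Sketch: ATE via "predict y | z, x, then plug in z=1 and z=0".
# The data-generating process below is made up purely for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 3))
z = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))                     # treatment depends on x (confounding)
y = x @ np.array([1.0, 0.5, -0.5]) + 2.0 * z + rng.normal(size=n)   # true ATE = 2

# Any regression of y on (z, x) could be used here; the "causal" step is only the plug-in.
model = GradientBoostingRegressor().fit(np.column_stack([z, x]), y)
y1_hat = model.predict(np.column_stack([np.ones(n), x]))   # predicted y given z = 1, x
y0_hat = model.predict(np.column_stack([np.zeros(n), x]))  # predicted y given z = 0, x
print(np.mean(y1_hat - y0_hat))                            # close to 2 if the prediction is reliable
```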

But to what extent is causal inference different from prediction?

Here is my only concern. Suppose hypothetically there were a 2016 Atlantic Computer Vision Conference competition that asked participants to submit black-box methods for facial recognition. The organizer would use the same workflow: generate input and output data from a pre-fixed data-generating mechanism, release the training data, and test all submitted methods based on MSE. So far so good.

But for causal inference, are these automated predictive approaches the most calibrated evaluations?

Unconfoundedness and the likelihood principle

We can say causal inference is unconfoundedness plus prediction. The unconfoundedness assumption is vital, but it is also irrelevant for comparing methods: after all, every method is trying to estimate $E(y \mid x, z)$, whether parametric or non-parametric, DIY or black box. In other words, all methods are equally sensitive to unconfoundedness. Suppose there is some unobserved/unselected confounder $x_u$; then all methods incur the same hidden bias $E_{x_u, x}(y \mid z) - E_{x_u \sim \mathrm{sample}}\left[E_{x}(y \mid z)\right]$.

By equally sensitive, I mean this hidden error is the same across all methods, as long as the same $x$ is given/collected. This is essentially the likelihood principle: predictive evaluation never takes any factor into account other than the likelihood/generative distribution.
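
A toy check of this "equally sensitive" claim (again a made-up simulation, not from the paper): when the confounder $x_u$ is unobserved, regression methods of very different flavors, fitted on the same observed $x$, drift toward the same biased target.

```python
# Sketch: with an unobserved confounder x_u, the plug-in ATE from different
# regression methods converges toward the same biased quantity; the bias
# comes from the missing x_u, not from the choice of model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 20000
x = rng.normal(size=(n, 1))
x_u = rng.normal(size=n)                                   # unobserved confounder
z = rng.binomial(1, 1 / (1 + np.exp(-(x[:, 0] + x_u))))
y = x[:, 0] + x_u + 1.0 * z + rng.normal(size=n)           # true ATE = 1

def plug_in_ate(model):
    fit = model.fit(np.column_stack([z, x]), y)            # x_u is not available to any method
    return np.mean(fit.predict(np.column_stack([np.ones(n), x]))
                   - fit.predict(np.column_stack([np.zeros(n), x])))

print(plug_in_ate(LinearRegression()))                     # biased upward ...
print(plug_in_ate(RandomForestRegressor(min_samples_leaf=50)))  # ... by roughly the same amount
```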

But different methods can have different variable selections.

Conceptually there are two kinds of variable selection:

  1. Variable selection for prediction, as a larger number of variables may induce higher variance.
  2. Variable selection for unconfoundedness or the causal assumption, as the variables used in the model have to block all back-door paths, which of course is only feasible if we know the true model.

Suppose, in an extreme situation, the researcher is unable to distinguish post-treatment from pre-treatment variables. Variable selection for prediction says she should include all variables, provided the right amount of regularization. But she had better not, if she needs to maintain the causal assumptions.
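
Here is a toy illustration of that inclusion bias (another made-up Python sketch, not from the paper): conditioning on a post-treatment mediator $m$ helps predict $y$, but it blocks the path through which $z$ acts, so the plug-in estimate of the effect collapses.

```python
# Sketch: adjusting for a post-treatment variable m improves prediction of y
# but biases the plug-in ATE toward zero. Made-up data-generating process.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 50000
x = rng.normal(size=n)                  # pre-treatment covariate
z = rng.binomial(1, 1 / (1 + np.exp(-x)))
m = z + rng.normal(size=n)              # post-treatment variable: a mediator of z's effect
y = x + m + rng.normal(size=n)          # true total effect of z on y is 1 (entirely through m)

def plug_in_ate(covariates):
    """Plug-in ATE from a regression of y on (z, covariates)."""
    fit = LinearRegression().fit(np.column_stack([z] + covariates), y)
    X1 = np.column_stack([np.ones(n)] + covariates)
    X0 = np.column_stack([np.zeros(n)] + covariates)
    return np.mean(fit.predict(X1) - fit.predict(X0))

print(plug_in_ate([x]))       # adjusting for x only: recovers an ATE close to 1
print(plug_in_ate([x, m]))    # also adjusting for m: better fit of y, but the ATE collapses toward 0
```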

Put another way, there can be both exclusion bias and inclusion bias for the causal interpretation, beyond the level of prediction or sampling.

To some extent, the argument here is also irrelevant, as we never see a black-box variable-selection method whose only purpose is to exclude post-treatment variables or block back-door paths, because that is essentially not learnable from data (i.e., we cannot tell post-treatment and pre-treatment variables apart from the data if these labels are missing). A method can select variables only based on the data it sees; in other words, it can only make a predictive variable selection.

I do not have a clear answer. I wrote this blog post after reading that paper, thinking I had something to say about the gap between predictive evaluation and causal-inference evaluation, but now I feel I have convinced myself that the two are indeed equal, up to an additive constant that is independent of the method.


Comments are disabled for this post. Feedback via email is more than welcome.