A Decoupling Perspective of Projective Inference in High-dimensional Problems
Posted by Yuling Yao on Oct 10, 2018.
I have recently been reading a textbook on decoupling. Decoupling literally means “from dependence to independence”. I initially thought the book could help me understand some properties of self-normalized importance sampling, but it turned out to be too mathematical to have anything to do with my work. Then again, I sometimes read the NY Times too, which is probably even less relevant.
Anyway, I read the great paper “Projective Inference in High-dimensional Problems: Prediction and Feature Selection” (by Juho Piironen, Markus Paasiniemi, and Aki Vehtari) this afternoon. The paper is wonderful, though it is hard to find any paper from Aki that is not wonderful. I realize some of their results might also be demonstrated in the language of decoupling.
Warning: No, this blog is not going to give a more natural or more straightforward introduction. I am writing this blog to justify why I read that textbook at all.
#Lower variance from Rao-Blackwellization
The motivation behind projective inference can be found in Figure 1 of their paper. When $\mu=\beta x$ and $y=\mu+\epsilon$, $\mu$ behaves roughly like a sufficient statistic (at least for $\beta$), so it makes perfect sense to estimate $\beta$ (variable selection) by conditioning on $\mu$. Of course I am abusing the concept, as $\mu$ is not even observable. But by sampling y tilde from the reference model, we assume either that we know $\mu$ (in the draw-by-draw case) or that we know its distribution, more precisely the conditional distribution of $\mu$ given $x$ and $\beta$ (in the clustered case).
So I can also describe this procedure as Rao-Blackwellization, since $\mu=E[y]$, and Rao-Blackwellization is known to give a lower-variance estimate of the parameters.
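The variance claim is just the law of total variance applied to an estimator. As a sketch, write $\hat\beta(y)$ for a generic estimator computed from $y$ (a placeholder of mine, not notation from the paper) and pretend that $\mu$ really plays the role of the sufficient statistic:

$$
\mathrm{var}\big(\hat\beta(y)\big)
= \mathrm{var}\big(E[\hat\beta(y) \mid \mu]\big) + E\big[\mathrm{var}(\hat\beta(y) \mid \mu)\big]
\ \ge\ \mathrm{var}\big(E[\hat\beta(y) \mid \mu]\big).
$$

The second term is nonnegative, so the conditioned estimator $E[\hat\beta(y) \mid \mu]$ never has a larger variance, and the two coincide only when $\hat\beta(y)$ is already a function of $\mu$.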
#But that is not the only advantage
Consider a graphical representation. It is easy to see from the graph, or maybe you don’t need the graph, that the simulated predictions from the reference model, $y'$ (I denote them by $y'$ because it is easier to type than a tilde, though I now have to spend an extra sentence justifying the notation), are conditionally independent of each other given $\mu$, no matter how dependent $y$ or $\mu$ is. This conditional independence is a clear advantage. Even for complicated data structures such as time series, spatial data, or multilevel data, where a naive implementation of LOO is prohibited, we can still use LOO to evaluate the selection model without further modification, as long as the reference model is reasonably accurate.
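A toy simulation can make the conditional independence concrete. This is only a sketch under assumptions of my own: $\mu$ follows an AR(1) process with coefficient `rho`, the noise is Gaussian with scale `sigma`, and $\mu$ is treated as known exactly. None of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma = 5000, 0.9, 1.0  # hypothetical settings for illustration

# Latent mean mu follows an AR(1) process, so neighbouring values are dependent.
mu = np.zeros(n)
for t in range(1, n):
    mu[t] = rho * mu[t - 1] + rng.normal(scale=np.sqrt(1 - rho**2))

# Simulated predictions y' = mu + eps', with eps' drawn independently for each t.
y_sim = mu + rng.normal(scale=sigma, size=n)

def lag1_corr(z):
    """Sample correlation between z[t] and z[t+1]."""
    return np.corrcoef(z[:-1], z[1:])[0, 1]

# Marginally, y' inherits the serial dependence of mu.
corr_marginal = lag1_corr(y_sim)
# Given mu, what is left of y' is the independent noise.
corr_given_mu = lag1_corr(y_sim - mu)

print(f"lag-1 correlation of y':      {corr_marginal:.3f}")
print(f"lag-1 correlation of y' - mu: {corr_given_mu:.3f}")
```

The first number should be clearly positive and the second close to zero: the serial dependence in $y'$ comes entirely through $\mu$, which is the sense in which the simulated predictions decouple once $\mu$ is given.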
I have to acknowledge that the conditional independence of $y'$ given $\mu$ is not as useful as it looks, since $y'$ is also independent of $\beta$ given $\mu$. Nevertheless, when we uncondition on $\mu$ (that is, marginalize it out), $p(y')=p(y)$, so we can still do variable selection using $y'$.
Blocking $\mu$ will not affect the noise term $\epsilon$, so we have $p(\epsilon' \mid y') = p(\epsilon \mid y)$, and $y'$ now factorizes given $\mu$. I would imagine an attractive application in spatial data or time series with heterogeneous noise, $\epsilon \mid x \sim g(x)$: the second-stage regression (model selection) can learn that heterogeneity even better than the reference model, which is only required to learn $\mu$.
By the way, it is a similar conditional independence that makes LOO applicable to one-step-ahead prediction in time series.
#What makes post-selection inference trivial
When $y$ per se is conditionally exchangeable given $x$, as in all the regression examples in that paper, the extra conditional independence of $y'$ is probably useless. However, we have another conditional independence here: $\beta$ is independent of $\epsilon$ given $\mu$.
Recall the challenge of post-selection inference in LASSO. $\mathrm{var}(\epsilon \mid \text{full model})$ is different from $\mathrm{var}(\epsilon \mid \text{relaxed LASSO})$, which leads to different confidence intervals for $\beta$ and makes post-selection inference nontrivial. But now $\epsilon$ is conditionally independent of $\beta$ given $\mu$. So I can select whatever $\beta$ I want, without worrying about the inferential uncertainty from $\epsilon$.