Measuring extrapolation

Posted by Yuling Yao on Feb 19, 2021.       Tag: modeling   causal  

Cramér–Rao lower bound

I will not call myself a theoretical statistician, but I still find mathematical statistics amusing, especially when it has practical implications. To start this blog post, I will begin with the Cramér–Rao lower bound: given a statistical model with likelihood $p(y|\theta)$, and any regular unbiased estimate $\hat \theta_n:= \hat \theta_n (y_1, \dots, y_n)$ from $n$ iid data, the variance of $\hat \theta_n$ is guaranteed to be lower bounded by the inverse Fisher information:

\[Var(\hat \theta_n) \geq \frac{1}{I(\theta)} = \frac{1}{n Var_{\theta}(S(\theta, y))}.\]

where $S(\theta, y_i)= \frac{\partial}{\partial \theta} \log p(y_i|\theta)$ is the score function—the gradient of the log likelihood, or pointwise grad_log_prob in Stan.

One clear limitation is that this bound only applies when $\hat \theta_n$ has finite variance and $I(\theta)$ is positive definite. But in practice many estimates may still be useful even if they do not have finite variance. Put another way, we want an estimate to have a smaller asymptotic variance for a quicker convergence rate, but an estimate might still converge in a practical amount of time even if the CLT does not hold.
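To make the bound concrete, here is a tiny simulation (a toy of my own, not from any particular paper) for the normal-mean model $y_i \sim \mathrm{normal}(\theta, 1)$: the MLE is the sample mean, the per-observation Fisher information is 1, so the CR bound says $Var(\hat \theta_n) \geq 1/n$.

```python
# Toy Monte Carlo check of the Cramer-Rao bound for y_i ~ normal(theta, 1):
# the MLE (sample mean) should have variance close to the bound 1/n.
import numpy as np

rng = np.random.default_rng(1)
n, n_rep, theta0 = 100, 5000, 0.3
estimates = np.array([rng.normal(theta0, 1.0, size=n).mean()
                      for _ in range(n_rep)])
print(np.var(estimates))  # ~ 0.01: empirical variance of the MLE
print(1 / n)              # CR lower bound: 1/(n * I(theta)) = 1/n here
```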

Semiparametric efficiency

A more profound version of the Cramér–Rao lower bound comes from semiparametric efficiency theory. For convenience we now view the parameter $\theta$ as a $d$-dimensional vector, although we could extend it to an infinite dimensional space too. We also assume there is a true $\theta_0$ that generates all the data. Now we consider a regular asymptotically linear estimate, such that

\[\sqrt {n}(\hat \theta_n - \theta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n \phi (y_i) + o_p(1).\]

$\phi (y_i)$ is the pointwise influence function. Again, the scaling $\frac{1}{\sqrt n}$ implies we want to apply some CLT.

The main efficiency result is that the influence function must satisfy

\[E[ \phi(y) S^T_{\theta} (y, \theta_0)]= I_{d\times d}.\]

$S_{\theta} (y, \theta_0)$ is the score function. It has mean zero. We can divide the space into the linear span of $S_{\theta} (y, \theta_0)$ and its orthogonal complement $\Gamma$: the equation above still holds if we add to $\phi(y)$ any element of $\Gamma$, but doing so can only inflate the variance. It follows that the most efficient estimate (in terms of asymptotic variance) shall live in the tangent space:

\[\{\phi(\cdot):\phi(\cdot)= B \times S_{\theta} (\cdot, \theta_0)\}.\]

For a regular model, this bound is achieved by the MLE, in which case the coefficient is $B= I^{-1}(\theta_0)$.
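As a sanity check, here is a one-dimensional Monte Carlo toy (my own illustration): for $y \sim \mathrm{normal}(\theta_0, \sigma^2)$ with known $\sigma$, the sample mean has influence function $\phi(y) = y - \theta_0$ and score $S(y, \theta_0) = (y - \theta_0)/\sigma^2$, so indeed $\phi = B S$ with $B = \sigma^2 = I^{-1}(\theta_0)$ and $E[\phi S] = 1$.

```python
# Monte Carlo check of E[phi(y) S(y, theta_0)] = 1 (the 1x1 identity)
# for the sample mean in a normal model with known sigma.
import numpy as np

rng = np.random.default_rng(2)
theta0, sigma = 0.5, 2.0
y = rng.normal(theta0, sigma, size=200_000)
phi = y - theta0                 # influence function of the sample mean
score = (y - theta0) / sigma**2  # pointwise score function
print(np.mean(phi * score))      # ~ 1
print(np.var(phi), sigma**2)     # asymptotic variance attains 1/I = sigma^2
```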

We love tight bounds as much as we hate extrapolation

Among everything else he has written, one of my favorite quotes from Andrew is:

Use the methods that you think will best solve your problem, and stay focused on the three key leaps of statistics: (1) Extrapolating from sample to population; (2) Extrapolating from control to treatment conditions; (3) Extrapolating from observed data to underlying constructs of interest. Whatever methods you use, consider directly how they address these issues.

Among his three categories of extrapolation, (1) and (2) are kinda the same thing, while (3) is a different concept that focuses more on interpretation. For this blog post, I will only talk about the first two.

The running example I have is how extrapolation affects learning. It is also convenient for my blog post that Andrew explicitly uses the phrase “whatever methods”—we can design better and better models, better computation, better features, better neural network infrastructure, but hey, there is some intrinsic learning bound that no method can overcome, for the same reason that the CR bound holds for any estimate. Maybe we shall distinguish between aleatoric and epistemic learning bounds: the latter refers to what we could have learned better if we had a better model or a better algorithm, while the former depends on the configuration of the data: some tasks are just more difficult than others no matter what method we use. If we observe 100 points with input $x\sim \mathrm{normal}(0,1)$, then making predictions at $x=0.5$ is clearly easier than at $x=1000$.
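To put numbers on that last claim, here is a quick sketch (my own toy example, with made-up constants): even for the ideal linear model, the variance of the fitted mean at a new point $x^\ast$ scales with $(x^\ast - \bar x)^2$, no matter how the coefficients are estimated.

```python
# Variance of the OLS fitted mean at a new point x_star, in units of the
# noise variance: interpolation at x = 0.5 vs. extrapolation at x = 1000.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, size=100)        # training inputs, x ~ normal(0, 1)
sxx = np.sum((x - x.mean())**2)

def fit_var(x_star, sigma2=1.0):
    # Var of the fitted mean at x_star: sigma2 * (1/n + (x*-xbar)^2/Sxx)
    return sigma2 * (1 / len(x) + (x_star - x.mean())**2 / sxx)

print(fit_var(0.5))     # ~ 0.01: easy
print(fit_var(1000.0))  # ~ 10^4: hopeless, whatever method we use
```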

Out-of-distribution detection

“Out-of-distribution” in the recent ML literature means that we have some training data (e.g., all existing road data from Tesla users in CA), and when we face a new application (driving on the moon), we should be able to tell that it is beyond the training domain and be cautious about prediction over-confidence.

I do not like the term “out-of-distribution”, for in many situations the new test data is a given point, not a distribution—a related but different task from covariate adaptation. I think a better name would be “out-of-support”. But that is also misleading, as we can always say 10000 is literally in the support of normal(0,1), although too rare to be seen in the training data.

The real question is how much extrapolation is harmful enough that we should stop, given the amount of training data we have. We tend to have method-specific, or epistemic, tools to detect this level of extrapolation: when we do matching, we have various cute statistics to compute the covariate imbalance (one minimal example is sketched below); when we run a regression, we can use the model-based posterior variance of the estimate as a proxy (think about a Gaussian process, in which we have larger uncertainty outside the data domain).
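For instance, one of those cute matching statistics is the standardized mean difference; here is a minimal version (a common textbook definition, not any particular package's implementation):

```python
# Standardized mean difference of one covariate between treated and
# control groups: a common covariate-imbalance diagnostic in matching.
import numpy as np

def standardized_mean_diff(x_treated, x_control):
    pooled_sd = np.sqrt((np.var(x_treated, ddof=1) +
                         np.var(x_control, ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(4)
print(standardized_mean_diff(rng.normal(0.0, 1, 500),
                             rng.normal(0.0, 1, 500)))  # ~ 0: balanced
print(standardized_mean_diff(rng.normal(1.0, 1, 500),
                             rng.normal(0.0, 1, 500)))  # ~ 1: imbalanced
```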

Extrapolation as a description of data

But extrapolation is a description of the data configuration, and it will jeopardize “whatever method”. It is like the sample mean of $y$: we do not report two different sample means for matching and for regression.

Returning to the CR bound: when we apply it to causal estimates, it is straightforward to obtain the asymptotic lower bound for any regular estimate $\hat \mu$ of the local causal effect:

\[n Var (\hat \mu) \geq E\left[ \frac{\sigma_{c}^2 (X)}{1-e(X)} + \frac{\sigma_{t}^2 (X)}{e(X)} \right].\]

$e(x)$ is the propensity score at $x$, and $\sigma_{t}^2 (X)$ and $\sigma_{c}^2 (X)$ are the conditional outcome variances in the treated and control groups. It is clear how the variance is amplified if some $e(x)$ is close to 1 or 0. It is tempting to measure the intrinsic covariate extrapolation by this expectation.
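To see the amplification numerically, here is a quick Monte Carlo sketch (my own toy, assuming a logistic propensity model and $\sigma_t^2 = \sigma_c^2 = 1$, both made-up choices) of the right-hand side under weak and strong confounding:

```python
# Monte Carlo evaluation of E[ 1/(1-e(X)) + 1/e(X) ] when e(x) follows a
# logistic model: a steeper slope pushes e(x) toward 0 or 1 and the
# bound explodes. (The strong-slope estimate is itself quite noisy.)
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(5)
x = rng.normal(0, 1, size=1_000_000)

for slope in (0.5, 3.0):   # weak vs. strong confounding
    e = expit(slope * x)   # assumed logistic propensity score
    bound = np.mean(1.0 / (1 - e) + 1.0 / e)
    print(slope, bound)    # ~ 4.3 for slope 0.5, ~ 180 for slope 3
```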

A tighter bound via k hat

A nice property is that the right hand side is always finite—the importance weight has expectation 1. But that is misleading: applying the CR inequality already implicitly assumes a finite second moment of the score function, which is not necessarily true in this importance sampling application.

Loosely speaking, the order matters more than the constant. Zuckerberg probably (I guess, I do not know for sure) does not envy Bezos even though his net wealth is a fraction of the latter's, as they share a similar growth rate; but the Rothschild family probably (again, I guess) feels salty when they see how easily the growth rate of new money beats their big-O constant. The point is that when the CLT holds, we have the square-root convergence rate; the extrapolation is really more harmful when we do not even have a CLT for any method.

A tighter bound is to measure the $\hat k$ of the ratios $\frac{1}{e(X)}$ and $\frac{1}{1-e(X)}$ (detailed reasoning will come in a new paper). If $\hat k$ is smaller than 0.5, then we do have finite variance, so we can compare that variance term. If $\hat k>0.5$, this lower bound is useless; and in particular if $\hat k>0.7$, we should not expect any reasonable estimate from any finite sample and any algorithm.
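For illustration, here is a rough sketch of such a $\hat k$ diagnostic—a simplified stand-in for the Pareto smoothed importance sampling procedure of Vehtari et al., with my own arbitrary choice of tail fraction—that fits a generalized Pareto distribution to the largest weights and reads off the shape parameter:

```python
# Crude k-hat: fit a generalized Pareto distribution to the exceedances
# of the largest 20% of the weights; the fitted shape parameter plays
# the role of k hat. Heavier-tailed weights give larger k hat.
import numpy as np
from scipy.stats import genpareto
from scipy.special import expit

def khat(weights, tail_frac=0.2):
    w = np.sort(weights)
    cut = w[int((1 - tail_frac) * len(w))]
    exceedances = w[w > cut] - cut
    k, _, _ = genpareto.fit(exceedances, floc=0)  # shape parameter
    return k

rng = np.random.default_rng(6)
x = rng.normal(0, 1, size=20_000)
for slope in (0.5, 3.0):               # weak vs. strong confounding
    e = expit(slope * x)               # assumed logistic propensity
    print(slope, khat(1 / e), khat(1 / (1 - e)))
```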