What is the optimal design of regression covariates?

Posted by Yuling Yao on Feb 23, 2021. Tag: modeling

Depends who you ask.

To be specific, assume the observations are (x, y) pairs with $x\in [-1,1]$, and the model is

\[y= \beta x + \mbox{normal}(0,\sigma).\]

Suppose we also know that in the population of interest, the input is $x\sim \mbox{Uniform}[-1,1]$.

Depending on who you ask, there are three goals to optimize:

Experiment design or an M-closed view

The objective is to minimize the variance of $\hat \beta$, which is achieved by placing all the mass of x at the two endpoints

\[x\sim .5 \delta(-1) + .5 \delta(1).\]
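To see why the endpoints win here, recall that for the no-intercept model the OLS slope has variance $\sigma^2/\sum_i x_i^2$ given the inputs, so we want $\sum_i x_i^2$ as large as possible. Below is a minimal sketch comparing the endpoint design with inputs drawn from the population distribution; the function name, sample size, and seed are illustrative choices and not part of the original argument.

```python
import random

def slope_variance(xs, sigma=1.0):
    """Variance of the no-intercept OLS slope, sigma^2 / sum(x_i^2), for fixed inputs xs."""
    return sigma ** 2 / sum(x * x for x in xs)

n = 1000  # illustrative sample size
rng = random.Random(0)

# Design A: half of the points at -1, half at +1.
endpoint_design = [-1.0] * (n // 2) + [1.0] * (n // 2)
# Design B: inputs drawn from the population distribution Uniform[-1, 1].
uniform_design = [rng.uniform(-1.0, 1.0) for _ in range(n)]

var_endpoint = slope_variance(endpoint_design)  # exactly sigma^2 / n
var_uniform = slope_variance(uniform_design)    # roughly 3 * sigma^2 / n, since E[x^2] = 1/3

print(var_endpoint, var_uniform)
```

Since $|x| \le 1$, no design on $[-1,1]$ can make $\sum_i x_i^2$ larger than $n$, which is what the endpoint design attains.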

Active learning or an M-complete view

Now what if the model is not correct? Isn’t it reckless that we do not even try to look around x=0? Adopting a covariate shift perspective, we want to reweight the model to reflect the difference between training and testing x, $w_i=\frac{p_{test}(x_i)}{p_{train}(x_i)}$, and we would like to minimize the variance of the weighted OLS estimate (in Stan notation, target += w[i] * log_lik[i]). Clearly placing x only on -1 and 1 is bad: the importance weight $w$ and thereby the weighted estimate $\hat \beta$ would have infinite variance.

We can work out the optimal design

\[p_{train}(x) \propto \vert x\vert.\]
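Here is a sketch of where this comes from, under assumptions that are ours: the linear model holds with homoscedastic noise, the weights are not self-normalized, and $n$ is large. The weighted estimate is $\hat\beta = \sum_i w_i x_i y_i / \sum_i w_i x_i^2$, so given the inputs

\[\mathrm{Var}(\hat\beta) = \sigma^2 \frac{\sum_i w_i^2 x_i^2}{\left(\sum_i w_i x_i^2\right)^2} \approx \frac{\sigma^2}{n} \frac{\int p_{test}(x)^2 x^2 / p_{train}(x) \, dx}{\left(\int p_{test}(x) x^2 \, dx\right)^2}.\]

The denominator does not depend on $p_{train}$. For the numerator, the Cauchy-Schwarz inequality gives

\[\left(\int p_{test}(x) \vert x\vert \, dx\right)^2 \le \int \frac{p_{test}(x)^2 x^2}{p_{train}(x)} \, dx \cdot \int p_{train}(x) \, dx,\]

with equality when $p_{train}(x) \propto p_{test}(x) \vert x\vert$, which is $\vert x\vert$ for a uniform $p_{test}$. Plugging in, this design gives an approximate variance of $\frac{9}{4}\sigma^2/n$, compared with $3\sigma^2/n$ when training x is drawn from Uniform[-1,1].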

A further complication is the self-normalization of $w_i$.
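As a rough illustration of the weighted estimate under this design, the sketch below draws x from $p_{train}(x)=\vert x\vert$, computes the importance weights against the uniform test distribution, and fits the no-intercept weighted least squares slope with both raw and self-normalized weights. The helper names, the true slope, the noise scale, and the sample size are all made up for the example.

```python
import random

def p_test(x):
    """Density of the population of interest, Uniform[-1, 1]."""
    return 0.5 if -1.0 <= x <= 1.0 else 0.0

def p_train(x):
    """Density of the training design p_train(x) = |x| on [-1, 1]."""
    return abs(x) if -1.0 <= x <= 1.0 else 0.0

def sample_train(rng):
    """Draw x from p_train by inverse CDF: |x| = sqrt(u), with a random sign."""
    magnitude = (1.0 - rng.random()) ** 0.5  # lies in (0, 1], so x is never 0
    return magnitude if rng.random() < 0.5 else -magnitude

def weighted_slope(xs, ys, ws):
    """Weighted no-intercept least squares: sum(w x y) / sum(w x^2)."""
    num = sum(w * x * y for x, y, w in zip(xs, ys, ws))
    den = sum(w * x * x for x, w in zip(xs, ws))
    return num / den

rng = random.Random(1)
beta_true, sigma, n = 2.0, 0.5, 5000  # illustrative values, not from the post
xs = [sample_train(rng) for _ in range(n)]
ys = [beta_true * x + rng.gauss(0.0, sigma) for x in xs]

ws = [p_test(x) / p_train(x) for x in xs]  # raw importance weights
w_mean = sum(ws) / n
ws_norm = [w / w_mean for w in ws]         # self-normalized to have mean 1

beta_raw = weighted_slope(xs, ys, ws)
beta_norm = weighted_slope(xs, ys, ws_norm)
print(beta_raw, beta_norm)
```

In this particular estimator a common rescaling of the weights cancels between numerator and denominator, so self-normalizing does not change the point estimate. The sketch says nothing about how self-normalization affects the variance calculation behind the optimal design.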

Causal inference or an M-open view

In the derivation above, we still need some belief model (i.e., we do not trust the linear model, so we use weighted OLS, but to derive the optimal design we have to approximate the true data-generating process by the linear model anyway). In a model-free or minimal-assumption setting, we would like to hedge against some minimax risk: what if the true model is $y= x + \mbox{normal}(0,0.1)$, except that inside the interval $x \in [-0.01,0.01]$ it is $y=100 x$? You never know.

The most conservative design minimizes the variance of the importance weights; the minimum is 0 and is achieved at

\[p_{train} =p_{test} = \mbox{Uniform} [-1,1].\]
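A quick way to see the zero: when the training design equals the test distribution, every importance weight is exactly 1. The sketch below compares the sample variance of the weights under the uniform design and under the $\vert x\vert$ design; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import random
import statistics

def weight_variance(xs, p_train, p_test=lambda x: 0.5):
    """Sample variance of the importance weights p_test(x) / p_train(x) over training draws xs."""
    return statistics.pvariance([p_test(x) / p_train(x) for x in xs])

rng = random.Random(2)
n = 10000  # illustrative sample size

# Training design equal to the test distribution, Uniform[-1, 1]: every weight is 1.
uniform_xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
var_uniform = weight_variance(uniform_xs, p_train=lambda x: 0.5)

# Training design p_train(x) = |x|, sampled by inverse CDF with a random sign.
# 1 - rng.random() lies in (0, 1], so x is never exactly 0.
abs_xs = [(1.0 - rng.random()) ** 0.5 * rng.choice((-1.0, 1.0)) for _ in range(n)]
var_abs = weight_variance(abs_xs, p_train=abs)

print(var_uniform, var_abs)
```

Under the $\vert x\vert$ design the weights are $1/(2\vert x\vert)$, which are unbounded near x=0, so their sample variance is strictly positive and can be large.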

In reality we have more complex models than a linear regression. Sadly, the literature in these areas has developed somewhat separately. Sequential data gathering can be even more complicated, since we may adaptively change the design based on how the model fits the existing data.