Immunity to Domain Adaptation, and Invariance Under Reweighting

Posted by Yuling Yao on Jun 14, 2019. Tag: modeling decision

Some loss functions are invariant under domain adaptation, which suggests that we can indeed learn the optimal model for the population from a non-representative sample without paying a price for the extrapolation.

Earlier this week I talked to Prof. Samory Kpotufe about how the invariance of the odds ratio in a case-control study is related to domain adaptation, essentially what I wrote in my previous post Should I reweight a case-control study. Samory pointed me to a fairly recent arXiv paper, A Generalized Neyman-Pearson Criterion for Optimal Domain Adaptation by Clayton Scott. Yes, it is quite involved.

Types of domain adaptation

Denote by $p(x,y)$ the distribution of the ultimate population on which we want to make predictions. Sometimes we also have access to its marginal $p(x)$, directly or indirectly through unlabeled data. A researcher, however, can only sample training data from a proposal ¹ $q(x,y)$. Typically $p(x,y)$ and $q(x,y)$ have to be coherent in some part: we can imagine that zip code classification data helps a fair bit with facial recognition, for example through feature extraction, but there is probably less to gain for predicting the price of NASDAQ:BYND from the historical price data of 15th century English agriculture. We cannot do statistics without stationarity. Some common assumptions on this connection include:

covariate shift: The conditional model is invariant, $q(y\mid x) = p(y\mid x)$, and only the marginal of $x$ changes. In survey sampling this is the most common situation when there is sampling bias.
posterior drift: In contrast, the marginal of $x$ is invariant, $p(x)= q(x)$, and $q(y\mid x) = \phi (p(y\mid x))$ for a monotone transformation $\phi$. For example, feature-independent label corruption can be constructed by first sampling $(x, y_{latent})$ from $p(x,y)$ and then corrupting the label through a random map $y_{obs} \sim f(y_{latent})$ that depends only on the outcome. It also applies in survey sampling when individual responses are deliberately randomized to protect privacy²: suppose we are on the street in Pyongyang, interviewing random people about whether they approve of the DPRK authority. We could ask each participant to flip a (biased) coin and to report their true binary response (1 = approve, 0 = against) if the coin lands heads, and the opposite response if it lands tails. Presumably it is enough corruption for individual privacy, maybe not enough for ВЧК though.
covariate shift with posterior drift: This is what Scott introduces in his paper. The marginal of $x$ is allowed to shift, $p(x)\neq q(x)$, and the conditional model is preserved only up to a monotone transformation, $q(y\mid x) = \phi (p(y\mid x))$. I am not saying this is the most general framework, but it does address part of the problem I raised in my last post, and I will draw the connection in this post.
outcome shift: That is, $p(y)\neq q(y)$ but $q(x\mid y)=p(x\mid y)$. This is exactly what I discussed in my last post. Examples include case-control studies in epidemiology.
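The coin-flip survey in the posterior drift item can be made concrete with a small sketch. Everything named here (`theta` for the probability of reporting the truth, the function names, the numbers) is an assumption for illustration and not from Scott's paper. With a truth probability above one half, the induced map $\phi(t) = \theta t + (1-\theta)(1-t)$ is increasing, which is the monotone transformation the definition asks for.

```python
import random

def randomized_response_map(p_true, theta):
    """phi(p): probability of observing y_obs = 1 when P(y_latent = 1 | x) = p_true
    and each respondent reports the truth with probability theta."""
    return theta * p_true + (1.0 - theta) * (1.0 - p_true)

def invert_randomized_response(q_obs, theta):
    """Recover p from q = phi(p); only possible when theta != 0.5."""
    return (q_obs - (1.0 - theta)) / (2.0 * theta - 1.0)

def simulate_reports(p_true, theta, n, seed=0):
    """Fraction of 'approve' reports among n simulated respondents."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(n):
        y_latent = 1 if rng.random() < p_true else 0
        tell_truth = rng.random() < theta      # heads: report the true response
        ones += y_latent if tell_truth else 1 - y_latent
    return ones / n

theta = 0.7                                    # hypothetical coin bias
grid = [0.1, 0.3, 0.5, 0.9]                    # hypothetical values of p(y=1 | x)
mapped = [randomized_response_map(p, theta) for p in grid]
```

The observed rates in `mapped` keep the ordering of `grid`, so a ranking of $x$ by $q(y=1\mid x)$ is also a ranking by $p(y=1\mid x)$.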

Let me first recap my previous discussion. In retrospective sampling (a case-control study), if I fix the marginal $q(y)$ for ease of implementation and then sample from $q(x\mid y) =p(x\mid y)$, I can still estimate the population odds ratio under $p$ as if the sampling were prospective. At the end of my previous post, I emphasized that this is not compatible with weighting methods, which reweight the empirical loss (i.e., cross entropy) of each data point by its importance ratio $p(y_i)/q(y_i)$. And I expected there to be some other loss function that would induce such weighting invariance.
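As a quick sanity check of the recap, here is a toy 2x2 example with a binary exposure $x$. The numbers and helper names are made up for illustration. The population and the case-control design share $p(x\mid y)$ but have very different outcome marginals, and the odds ratio comes out the same while $P(y=1\mid x)$ does not.

```python
def joint_from(p_y1, p_x1_given_y):
    """Joint distribution over binary (x, y) built from P(y=1) and P(x=1 | y)."""
    joint = {}
    for y in (0, 1):
        p_y = p_y1 if y == 1 else 1.0 - p_y1
        for x in (0, 1):
            p_x = p_x1_given_y[y] if x == 1 else 1.0 - p_x1_given_y[y]
            joint[(x, y)] = p_y * p_x
    return joint

def odds_ratio(joint):
    """Cross-product ratio of the 2x2 table."""
    return (joint[(1, 1)] * joint[(0, 0)]) / (joint[(1, 0)] * joint[(0, 1)])

def prob_y1_given_x(joint, x):
    return joint[(x, 1)] / (joint[(x, 0)] + joint[(x, 1)])

p_x1_given_y = {0: 0.2, 1: 0.6}                 # shared by p and q (outcome shift)
population = joint_from(0.05, p_x1_given_y)     # rare outcome under p
case_control = joint_from(0.50, p_x1_given_y)   # balanced design under q
```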

Scott’s paper essentially answers this question by considering a generalized Neyman-Pearson criterion. We denote by $\hat y =g(x)$ a classifier, and the utility to optimize is:

\[\begin{split} \max E_{x\sim p(x| y=1)} g(x) \\ s.t., E_{x\sim p(x)} g(x) \leq \alpha. \end{split}\]

That is, to maximize the detection rate while controlling the discovery rate.

First, it does not imply a loss function associated with individual predictions.

Second, we know that for a Neyman-Pearson type of criterion the optimal decision is the likelihood ratio test, which should look like $g(x)= 1 $ if $p(y=1\mid x)>\eta$ and $g(x)= 0 $ if $p(y=1\mid x)<\eta$.

What Scott proves is that, under covariate shift with posterior drift, the optimal decision under the generalized Neyman-Pearson criterion can be learned from data from $q$ without reweighting. The high-level insight is that the optimal decision rule should be $g(x)= 1 $ if $p(y=1\mid x)>\eta$, and the monotone transformation between $p$ and $q$ says $q(y=1\mid x)= \phi (p(y=1\mid x))$; therefore the optimal decision rule is $g(x)= 1 $ if $q(y=1\mid x) > \eta^* $. Here $q(y=1\mid x)$ can be learned from $q(x,y)$ and the threshold $\eta^*$ can be learned from $p(x)$.
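A minimal sketch of that insight, assuming an increasing $\phi$ and an unlabeled sample from $p(x)$. This is not Scott's algorithm, only an illustration: the drift `delta`, the score functions, and the helper names are hypothetical. The cutoff is set as an empirical quantile of the scores on the target sample, and the resulting decisions are the same whether we score with $p(y=1\mid x)$ or with $\phi(p(y=1\mid x))$.

```python
import math
import random

def invlogit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(t):
    return math.log(t / (1.0 - t))

def np_threshold(target_scores, alpha):
    """Return a cutoff such that at most a fraction alpha of the target
    scores lie strictly above it (assumes 0 < alpha < 1)."""
    ordered = sorted(target_scores)
    n = len(ordered)
    allowed = int(alpha * n)               # number of points we may flag
    return ordered[n - allowed - 1]

def decide(scores, eta):
    return [1 if s > eta else 0 for s in scores]

rng = random.Random(1)
target_x = [rng.uniform(-2.0, 2.0) for _ in range(1000)]   # unlabeled draws from p(x)

delta = 3.0                                    # hypothetical drift on the logit scale
p_scores = [invlogit(x) for x in target_x]     # stands in for p(y=1 | x)
q_scores = [invlogit(logit(s) + delta) for s in p_scores]  # phi(p(y=1 | x))

alpha = 0.1
eta = np_threshold(p_scores, alpha)            # cutoff on the p scale
eta_star = np_threshold(q_scores, alpha)       # cutoff on the q scale

rule_from_p = decide(p_scores, eta)
rule_from_q = decide(q_scores, eta_star)
```

Only the ordering of the scores matters for the rule, which is why the model fitted on $q$ is enough once the cutoff is calibrated on $p(x)$.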

Retrospective sampling is indeed a covariate shift with posterior drift. So what Scott’s paper proves is that if an epidemiology researcher wants to make predictions for his future patients, he could fit a logistic regression $P(y=1\mid x)=\mathrm{invlogit} (\alpha+\beta x)$ and use the estimated $\beta$ as the slope in his final optimal decision boundary. $\alpha$ is irrelevant because the threshold has to be learned using $p(x)$ anyway.
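To see why retrospective sampling fits this framework, here is a short derivation sketch. The symbols $\delta$ and $\alpha_p$ are notation introduced here for illustration, not taken from Scott's paper. Since $q(x\mid y)=p(x\mid y)$, Bayes' rule gives

\[\frac{q(y=1\mid x)}{q(y=0\mid x)} = \frac{p(x\mid y=1)\, q(y=1)}{p(x\mid y=0)\, q(y=0)} = \frac{p(y=1\mid x)}{p(y=0\mid x)} \cdot \frac{q(y=1)\, p(y=0)}{q(y=0)\, p(y=1)}.\]

Taking logs, $\mathrm{logit}\, q(y=1\mid x) = \mathrm{logit}\, p(y=1\mid x) + \delta$ with $\delta = \log \frac{q(y=1)\, p(y=0)}{q(y=0)\, p(y=1)}$, so

\[q(y=1\mid x) = \phi(p(y=1\mid x)), \qquad \phi(t) = \mathrm{invlogit}(\mathrm{logit}(t) + \delta),\]

and $\phi$ is strictly increasing. If the population model is $p(y=1\mid x)=\mathrm{invlogit}(\alpha_p+\beta x)$, then $q(y=1\mid x)=\mathrm{invlogit}(\alpha_p+\delta+\beta x)$: only the intercept moves, and the slope is shared.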

Cool.

But it is not the whole answer. First, it changes the loss function. What I described last time is that we can use the MLE of $\beta$ in retrospective sampling as if it were prospective sampling, which corresponds to cross-entropy loss. Second, under this NP-criterion the odds ratio is no longer a key quantity. I can fit a probit regression, or an LDA, or a neural network to $q(y=1\mid x)$, while a hard-core statistician will teach you that you can only learn the odds ratio from retrospective sampling, not $\beta$ in a probit regression, or a neural network, or a Gaussian process.

It is not a contradiction, as they correspond to different loss functions.

From NP-criterion to individual loss function

I am not a huge fan of hypothesis testing, in part because we almost never use such a loss function in reality.

The Neyman-Pearson type of criterion trades off type 1 error against type 2 error. A more natural (to me) trade-off can be constructed through a weighted loss function in Bayesian decision theory. To be compatible with NP, I use a weighted 0-1 loss:

\[l(y, g(x)=y)=0\]

and

\[l(y=1, g(x)=0)=a_1,\quad l(y=0, g(x)=1)=a_0.\]

Then we can represent the expected loss using the quantities in NP:

\[\begin{split} E_{p(x,y)} l(y, g(x)) &= p(y=1) E_{p(x\mid y=1)} (1-g(x))a_1 + p(y=0) E_{p(x\mid y=0)} g(x) a_0 \\ &= E_{p(x)} g(x) a_0 + p(y=1) E_{p(x\mid y=1)} (a_1- (a_1+a_0)g(x)) \end{split}\]

The second line uses $p(y=0) E_{p(x\mid y=0)} g(x) = E_{p(x)} g(x) - p(y=1) E_{p(x\mid y=1)} g(x)$. Minimizing the expected loss is therefore equivalent to

\[\max E_{p(x\mid y=1)} g(x) - c E_{p(x)} g(x)\]

where the constant $c=\frac{a_0}{p(y=1) (a_1+a_0) }$. Recall that NP requires $\max E_{x\sim p(x| y=1)} g(x), \quad s.t., E_{x\sim p(x)} g(x) \leq \alpha,$ which is a totally different optimization objective. In particular, we do not have a closed-form solution for the individual prediction loss function.
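The decomposition can be checked numerically on a tiny discrete example. This is a sketch with made-up numbers and hypothetical helper names. It keeps both costs $a_1$ and $a_0$ general, in which case the coefficient on $g(x)$ works out to $(a_1+a_0)$, which reduces to $(1+a_0)$ when $a_1=1$. It then confirms by brute force over all rules that minimizing the expected loss and maximizing $E_{p(x\mid y=1)} g(x) - c E_{p(x)} g(x)$ pick the same $g$.

```python
from itertools import product

p_x = [0.4, 0.3, 0.2, 0.1]                # hypothetical p(x) on four points
p_y1_given_x = [0.05, 0.2, 0.6, 0.9]      # hypothetical p(y=1 | x)
a1, a0 = 3.0, 1.0                         # costs of a miss and of a false alarm

p_y1 = sum(px * py for px, py in zip(p_x, p_y1_given_x))
p_x_given_y1 = [px * py / p_y1 for px, py in zip(p_x, p_y1_given_x)]

def expected_loss(rule):
    """E_{p(x,y)} l(y, g(x)) computed directly from the loss definition."""
    return sum(px * (a1 * py * (1 - g) + a0 * (1.0 - py) * g)
               for g, px, py in zip(rule, p_x, p_y1_given_x))

def decomposed_loss(rule):
    """The same quantity written with E_{p(x)} g and E_{p(x|y=1)} g."""
    flag_rate = sum(px * g for g, px in zip(rule, p_x))
    inner = sum(pxy * (a1 - (a1 + a0) * g) for g, pxy in zip(rule, p_x_given_y1))
    return a0 * flag_rate + p_y1 * inner

def lagrangian_objective(rule):
    """E_{p(x|y=1)} g - c * E_{p(x)} g with c = a0 / (p(y=1) (a1 + a0))."""
    c = a0 / (p_y1 * (a1 + a0))
    detection = sum(pxy * g for g, pxy in zip(rule, p_x_given_y1))
    flag_rate = sum(px * g for g, px in zip(rule, p_x))
    return detection - c * flag_rate

rules = list(product((0, 1), repeat=len(p_x)))   # every deterministic g
best_by_loss = min(rules, key=expected_loss)
best_by_objective = max(rules, key=lagrangian_objective)
```

Note that $c$ involves $p(y=1)$, the outcome marginal of the target population.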

The bottom line

The NP-criterion is immune to outcome shift; indeed, it is immune to any covariate shift with posterior drift.
In particular, that means that in a case-control study one can estimate the $\beta$ in a logit, a probit, or even a neural network as if the study were prospective. It also implies that we do not sacrifice anything from the extrapolation (the divergence from $p$ to $q$) when using data from $q$ to predict in $p$.
However, when using the individual loss function, such immunity does not hold. Nevertheless, the odds ratio is invariant, so we can still estimate $\beta$ in a logit regression.

These observations seem to point toward a bridge between the NP-criterion and the individual loss function. Certain methods using an individual loss function are invariant under reweighting, too: $\beta$ in logistic regression with cross-entropy loss is an example.

Another example in this direction is SVM. SVM itself models the weights on data points, so any extra data weight is masked. Imagine a linearly separable problem (i.e., there exists a linear boundary that perfectly separates the binary outcomes $y=1$ and $y=0$): a training point is either on the support (so its SVM weight is 1) or not (weight 0). In that case, I would not change my inference result even if I knew my input $X$ was non-representative of the prediction problem. This can be concerning: if my SVM kernel is so flexible that it can essentially separate any training data, then it will always be weighting-invariant, no matter how strong a regularization I put in. Deep neural networks seem to be such a case.
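A toy illustration of the separable case, under strong simplifying assumptions: one-dimensional inputs, a hard-margin linear separator, and positive integer weights emulated by repeating points. This is not a general SVM solver, and the function names are hypothetical. In this setting the max-margin boundary is the midpoint between the closest points of the two classes, so replicating points does not move it.

```python
def max_margin_threshold_1d(xs, ys):
    """Hard-margin boundary in 1D for separable data with positives to the right:
    the midpoint between the largest negative and the smallest positive."""
    largest_negative = max(x for x, y in zip(xs, ys) if y == 0)
    smallest_positive = min(x for x, y in zip(xs, ys) if y == 1)
    if largest_negative >= smallest_positive:
        raise ValueError("data are not separable with positives to the right")
    return 0.5 * (largest_negative + smallest_positive)

def replicate(xs, ys, counts):
    """Emulate positive integer weights by repeating each point counts[i] times."""
    new_xs, new_ys = [], []
    for x, y, k in zip(xs, ys, counts):
        new_xs.extend([x] * k)
        new_ys.extend([y] * k)
    return new_xs, new_ys

xs = [-3.0, -1.5, -0.5, 1.0, 2.0, 4.0]
ys = [0, 0, 0, 1, 1, 1]
counts = [10, 1, 1, 1, 1, 25]      # heavy weights far from the boundary, all >= 1

boundary = max_margin_threshold_1d(xs, ys)
boundary_reweighted = max_margin_threshold_1d(*replicate(xs, ys, counts))
```

Only the two points nearest the boundary matter here, so any reweighting that keeps them in the sample leaves the fit unchanged.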

¹ It annoys me that I have to call a proposal $q$.
² Andrew said he hated this idea as respondents still have motivation to lie.