Should I reweight a casecontrol study?
Posted by Yuling Yao on Jun 01, 2019.Dating back to Prentice and Pyke 1979, the invariance of Logistics Regression under casecontrol studies and cohort studies have been thoroughly studied.
In a cohort study, researchers collect data randomly from p(x, y). However, for rare disease (only a small fraction of the population has the disease, i.e., p(y=1)«p(y=0)), such sampling scheme may result in too few case samples. Throughout the post, I will use $x$ to denote covariates and $y$ to be a categorical outcome.
In a casecontrol study, researchers collect data from cases (y=1) and controls (y=0) separately until a prefixed number of both cases and controls are achieved. The only random variable is $x$.
Prentice and Pyke 1979 established the theory that the odds ratio from the cohort study can be estimated from the casecontrol study, even in a finite sample.
Moreover, a logistic regression essentially models the odds ratio:
In a cohort study, modeling $p(y\mid x)$ as softmax $(\alpha + \beta x)$ is equvalent to model the oddes ratio by $\beta (xx_0)$.
In a case control study, the same odds ratio implies where $\gamma(x)= \log( p(x\mid z=0)/ p(x=0\mid z=0))$.
Logistic regression can be viewed as semiparametric that leaves $\gamma(x)$ unmodeled.
The MLE of $\beta$, turns out to be the same under casecontrol and cohort study, which implies we can treat a casecontrol sampling as if it is cohort.
Let me sketch the proof. To ease the notation, I will consider the general situation with categorical outcome $y=0,\dots, K$ and softmax likelihood $p(y=k\mid x)=\frac{\exp(\alpha_k + \beta_k x)}{\sum_k \exp(\alpha_k + \beta_k x)}$.
In a cohort study, the log likelihood is
In a casecontrol study, the log likelihood (of $x$ given $y$) is
Rewright the marginal of $x$: Then the log likihood above is
Consider the profile likeihood, where $q(x)$ is replaced by its emperical distribution $\hat q(x)=1/n\sum_{i=1}^n1(x=x_i)$, which is its nonparametric MLE. Then solving $\frac{\partial}{\beta} L_{cc}$ gives the same answer of $\beta_{MLE}$ as in $\frac{\partial}{\beta} L_{coh}$.
By the way, I think most ML literature compares logistic regression with LDA and claims logistic regression does not model $p(x)$. It does.
A more Bayesian version of this analysis is recently developed by Byrne and Dawid 2018, which says if the model is strong hyper Markov, then the odds ratio, or any parameter that only depends on it, has the same posterior distribution under retrospective and prospective model.
wait, why can’t I reweight a casecontrol study?
I mean it is fine to stick to the clear distinction among retrospective and prospective, but after all, the only difference is the sampling distribution. In a retrospective sampling, the joint distribution is
In a prospective sampling,
If I know nothing about odds ratio or epidemiology, I will simply use importance sampling to account for this sampling shift. That is we want to optimize the emperical risk, such that it will have optimal predictive performance under the prediction task in population $p$:
A logistic regression defines the probability prediction $(f(x)_j , j=1,\dots, K)=( \frac{\exp(\alpha_j+\beta_j X)} {\sum_j \exp(\alpha_j+\beta_j X)} , j=1,\dots, K)$, and the loss function $l(x, y=k)=\log( f(x)_k )$ to be cross entroy loss.
So if a blackbox machine learning researcher wants to do retrospective epidemiology study, he would use the weighted loss function or the equivalent log predictive density
where $w_k = p(y=k)/q(y=k)$ is importance ratios that only depend on $y_i$. If $q$ is balanced/uniform, $w_k \propto p(y=k)$, i.e., the population rate.
Surprisingly (to me), it is a totally different expression as in the retrospective likelihood. And we will get a different answer by solving $\partial/\partial \beta_k=0$
To be honest, at the risk of exposing my ignorance, I was totally shocked by this difference. So I did some simple R simulation:
library(arm)
x=rnorm(10000,0,1)
p_sim=invlogit(x3)
y=rbinom(n=length(x),size=1, p=p_sim)
n1=sum(y==1)
n0=sum(y==0)
reto_1=sample(which(y==1),size=500 )
reto_0=sample(which(y==0),size=500 )
ip_weights_data=c(n0, n1)[retro_data$y+1]/sum(c(n0, n1))
retro_data=data.frame(y=y[c(reto_1, reto_0)], x=x[c(reto_1, reto_0)])
L1=glm(y~x,data=retro_data, family=binomial(link = "logit"), weights = ip_weights_data)
L2=glm(y~x,data=retro_data, family=binomial(link = "logit"))
Through 1000 repeated simulations, we summarize the performance:
Method  RMSE of $\beta$  Bias of $\beta$ ^{1} 

unweighted  0.071  0.031 
IP weighted  0.085  0.005 
Yes, the unweighted and weighted logistic regression are different. A weighted casecontrol study is equivalent to rescale the loss function– assigning less loss to false negative.
A first comparison: My understanding is that, under the true data generating mechanism (odds ratio model), the unweighted logistic regression is valid in the sense that it gives the same estimation of $\beta$ that is essentially the MLE under retrospective sampling. A retrospective sampling scheme does not see the population $p$ so it will not reweight the sample points.
The weighted model is also valid for giving the asymptotically optimal prediction.
In a Bayesian context, we replace the point estimation by its posterior distribution and the result remains unchanged.
Further, under model misspecification (e.g., what if the true prospective model is a probit link), the importanceweighted estimate is then double roubust. But it may come with the cost of slower, if not halved, convergence rate.
For example, in the previous simulation, the MSE of unweighted logistic regression is indeed smaller. Of course, I would anticipate it has a better predictive performance on the target population. In fact, the unweighted logistic regression estimates a wrong $\alpha$, which is a nuisance parameter to it.
but after all what is the loss function associated with the odds ratio?
It really bothers me that there are two methods there, such that I can either reweigh or not reweigh the sample in a casecontrol study, and they give a different answer, and they are both valid and totally theoretically justified.
The ‘‘contridiction’’ is from the fact that the derivation in retrospective logistic regression does not explicitly define loss function for predicting individual $y_i$, all that is of interest is the odds ratio $p(y=1\mid x) / p(y=0\mid x) (p(y=1\mid x_0) / p(y=0\mid x_0) )^{1} = \exp(\beta (xx_0))$.
In general, the choice of the loss function depends on the application. When a doctor conducts a retrospective study using logistic regression, say he is curious about whether cigarettes increases the risk of lung cancer. Is his goal to predict whether his future unseen patient will or will not get lung cancer, or simply to figure out the causal^{2} risk associated with cigarettes?
I claim that, even under the true model, the best estimation of $\beta$ is not equivalent to the best prediction of $y$.
In fact, the negative crossentropy loss in the previous section is the log predictive density (lpd). We denote the scoring rule to be $S$:
where $\alpha$ is the nuisance parameter (anything other than $\beta$).
The log score is strictly proper, so we know if $y \mid x \sim p(y \mid \beta, \alpha, x)$, then any other predictive distribuion $Q$ is not optimal:
Therefore under the true model, the optimal prediction of $y$ does require the best estimation of $\beta$.
But the other direction does not hold. I can have a totally wrong estimation of $\alpha$, while remaining $\beta$ perfectly estimated.
Bridging estimation and prediction
One possible solution is to redefine the weighted likelihood, I did some simple search and find several papers (e.g., 1, 2) talking about this idea. It is less sensitive to me as I am satisfied with the prediction framework whenever I need to work through weighting methods. I simply do not need to worry about the likelihood.
Another more promising solution is to rephrase the targeted learning into the prediction framework.
The reason why the existing prediction framework (scoring rule, elpd, etc) dose to adapt to this setting is that it restricts the prediction on observed outcome $y$. For the retrospective study, it is both incorrect (the sampling outcome is $x$ even if I fit an “as if” logistic regression $y\mid x$) and redundant (the researchers only care about the odds ratio, not even the conditional model $y\mid x$).
As an attempt to take these into account, I will define a latent variable to be the odds ratio $z ( x, y)$. It is a length$K$ vector and $z ( x, y)_k$ is its $k$th element.
In particular, this precision task does not require any knowledge of $\alpha$.
$z ( x, y)$ a transformation of $x, y$, but not bijective. Imagine we have a special prediction task: it is too demanding to predict a future $\tilde y$ at $\tilde x$, rather I am interested to predict a nonbijective transformation of it:
Note that any joint distribuion $p(x,y)$ determines the distibuion of $z$. To emphasize this mapping, I will denote the joint law of $(x,y)$ as $\nu$ and the induced law $p( z \mid x,y\sim \nu)=p_{\nu}(z)$
The key observation is that, $p_{\nu}$ is independent of the marginal law $\nu_y$, i.e.
Similarly, it is independent of $\nu_x$.
I tried to consider the prediction framework that focuses on z only, but it did not work out.
In the ideal case we wish we could define a scoring rule $S$, such that we could obtain the emperical score: $S(\beta, z)=\sum_{k=1}^K \sum_{i=1}^{n_k} \log\frac{\exp(\alpha_k + \beta_k x_{i,k})} {\sum_j\exp(\alpha_j + \beta_j x_{i,k})}$ and then we could use the fact that the even though $\nu_{casecontrol}=q(y)p(x\mid y)$ and $\nu_{cohort}=p(y)p(x\mid y)$ are different, they induce the same distribution on $p_{\nu}$, therefore it suffices to minizes the emperical risk using casecontrol data without further weighting.
Open questions
 Case control amounts to a special distribution shift: modifying the marginal distribution of $y$. We know under this shift, the odds ratio is invariant. For a general distribution shift, $p(x, y) \to q(x,y)$, what quantity is invariant? Clearly, it implies we could divide the parameter into two parts, the invariant parameter that can be learned without any reweighting using the nonrepresentative proposal, and the remaining nuisance parameters. It implies a theoryjustified framework for multiagent learning.
 How should we evaluate a model, when the target is a nonbijectivetransformation of individual outcome $y$? For instance, in causal inference, predicting each $y_i$ is much stronger than knowing ATE. Do we have anything more valid other than weighted lpd?
 Fome the discussion in the last section, for the purpose of estimating $\beta$, it is not only valid using a uniform weight or an importance weight, but we could also use arbitrary weights as long as it depends on $y$ only. So we could optimize over that arbitrary weight to achieve better predictive performance. In particular, I conjecture if we set inverse probability weights $w=1/q(y)$ we could improve fairness in imbalanced design?

The debiasing of the weights can be reasoned as follows. MLE in logistic regression is unbiased for probability $p(y=k\mid x)$=softmax$(\alpha+\beta x)$. The bias of $\beta$ is, therefore, a result of the nonlinear transformation softmax, which is more linear in its tail. A weight closer to 0 or 1 will enforce the estimated $\vert \alpha\vert$ to be large and adds to linearity. ↩

This post is not focused on causal interpretation. But all regression coefficient will become causal by assuming unconfoundedness and the odds ratio is associated with a multiplicative effect of $p(y\vert x)$. No wonder targeted learning is actually a causal concept. ↩