Should I reweight a case-control study?

Posted by Yuling Yao on Jun 01, 2019.       Tag: modeling   causal
The odds ratio from a case-control study is exactly the same as in a cohort study, so I can fit a retrospective logistic regression as if it were prospective and report its MLE or Bayesian posterior distribution. But given the shift in the sampling distribution, should I reweight it regardless?

Dating back to Prentice and Pyke 1979, the invariance of logistic regression between case-control and cohort studies has been thoroughly studied.

In a cohort study, researchers collect data randomly from $p(x, y)$. However, for a rare disease (only a small fraction of the population has the disease, i.e., $p(y=1) \ll p(y=0)$), such a sampling scheme may yield too few cases. Throughout the post, I will use $x$ to denote covariates and $y$ to denote a categorical outcome.

In a case-control study, researchers collect data from cases ($y=1$) and controls ($y=0$) separately until pre-fixed numbers of cases and controls are reached. The only random variable is $x$.

Prentice and Pyke 1979 established the theory that the odds ratio from the cohort study can be estimated from the case-control study, even in a finite sample.

Moreover, a logistic regression essentially models the odds ratio: $\frac{p(y=1\mid x) / p(y=0\mid x) }{p(y=1\mid x_0) / p(y=0\mid x_0)}$

In a cohort study, modeling $p(y\mid x)$ as softmax $(\alpha + \beta x)$ is equivalent to modeling the log odds ratio by $\beta (x-x_0)$.
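For the binary case this is a one-line calculation. A sketch, assuming the softmax is written with class $0$ as the reference (its intercept and slope fixed at zero, a parameterization chosen here for illustration):

$$\frac{p(y=1\mid x)}{p(y=0\mid x)} = \exp(\alpha+\beta x) \quad\Longrightarrow\quad \log\frac{p(y=1\mid x)/p(y=0\mid x)}{p(y=1\mid x_0)/p(y=0\mid x_0)} = \beta(x-x_0).$$

The intercept $\alpha$ cancels, so the odds ratio carries information about $\beta$ only.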

In a case-control study, the same odds ratio implies $p(x\mid y=k) \propto c_k\exp(\gamma(x) + \beta_k x)$, where $\gamma(x)= \log( p(x\mid y=0)/ p(x=0\mid y=0))$.

Logistic regression can be viewed as a semiparametric model that leaves $\gamma(x)$ unmodeled.

The MLE of $\beta$ turns out to be the same under case-control and cohort sampling, which implies we can analyze a case-control sample as if it came from a cohort.

Let me sketch the proof. To ease the notation, I will consider the general situation with categorical outcome $y=0,\dots, K$ and softmax likelihood $p(y=k\mid x)=\frac{\exp(\alpha_k + \beta_k x)}{\sum_{j=0}^K \exp(\alpha_j + \beta_j x)}$.

In a cohort study, the log likelihood is
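Conditioning on the covariates and writing the softmax likelihood out term by term gives the following sketch. The label $L_{coh}$ matches the one used in the profile-likelihood step further on; dropping the $\beta$-free term $\sum_i \log p(x_i)$ is an assumption made here.

$$L_{coh} = \sum_{i=1}^n \log p(y_i\mid x_i) = \sum_{i=1}^n \Big[\alpha_{y_i}+\beta_{y_i}x_i - \log\sum_{j=0}^K \exp(\alpha_j+\beta_j x_i)\Big].$$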

In a case-control study, the log likelihood (of $x$ given $y$) is
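Taking the form $p(x\mid y=k) = c_k\exp(\gamma(x)+\beta_k x)$ with equality, i.e. reading $c_k$ as the normalizing constant (an assumption of this sketch), and labeling the result $L_{cc}$ as in the profile-likelihood step further on:

$$L_{cc} = \sum_{i=1}^n \log p(x_i\mid y_i) = \sum_{i=1}^n \Big[\log c_{y_i} + \gamma(x_i) + \beta_{y_i}x_i\Big].$$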

Rewrite the marginal of $x$: $q(x)= \exp(\gamma(x)) \sum_k c_k n_k n^{-1} \exp(\beta_k x)$. Then the log likelihood above is
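Substituting $\exp\gamma(x_i) = q(x_i) / \sum_k c_k n_k n^{-1}\exp(\beta_k x_i)$ into $\log p(x_i\mid y_i)$, and assuming $n_k$ counts the sampled units with $y=k$ and $n=\sum_k n_k$, one way to write it is:

$$L_{cc} = \sum_{i=1}^n \log q(x_i) + \sum_{i=1}^n \Big[\log c_{y_i} + \beta_{y_i}x_i - \log \sum_{k} c_k n_k n^{-1}\exp(\beta_k x_i)\Big].$$

With $\alpha_k = \log(c_k n_k / n)$, the second sum equals $\sum_i \big[\alpha_{y_i}+\beta_{y_i}x_i - \log\sum_k\exp(\alpha_k+\beta_k x_i)\big]$ up to the constant $-\sum_i \log(n_{y_i}/n)$, so it has the form of a prospective softmax log likelihood with shifted intercepts.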

Consider the profile likelihood, where $q(x)$ is replaced by its empirical distribution $\hat q(x)=\frac{1}{n}\sum_{i=1}^n 1(x=x_i)$, which is its nonparametric MLE. Then solving $\frac{\partial}{\partial \beta} L_{cc}=0$ gives the same $\beta_{MLE}$ as solving $\frac{\partial}{\partial \beta} L_{coh}=0$.

By the way, I think most ML literature compares logistic regression with LDA and claims logistic regression does not model $p(x)$. It does.

A more Bayesian version of this analysis was recently developed by Byrne and Dawid 2018, which says that if the model is strong hyper Markov, then the odds ratio, or any parameter that depends only on it, has the same posterior distribution under the retrospective and prospective models.

Wait, why can’t I reweight a case-control study?

I mean, it is fine to stick to the clear distinction between retrospective and prospective, but after all, the only difference is the sampling distribution. In retrospective sampling, the joint distribution is
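Assuming $q$ denotes the case-control sampling distribution, with class fractions fixed by design at $q(y=k)=n_k/n$, a sketch of the factorization is:

$$q(x, y=k) = q(y=k)\, p(x\mid y=k).$$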

In prospective sampling,
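the population distribution $p$ factorizes with the same conditional $p(x\mid y)$ but with the population class rates (written here as a sketch in the post's notation):

$$p(x, y=k) = p(y=k)\, p(x\mid y=k) = p(x)\, p(y=k\mid x).$$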

If I knew nothing about odds ratios or epidemiology, I would simply use importance sampling to account for this sampling shift. That is, we want to optimize the empirical risk so that the model has optimal predictive performance in the population $p$:
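Since the case-control design changes only the class frequencies and leaves $p(x\mid y)$ untouched, the density ratio reduces to a ratio of class rates. A sketch of the resulting identity, assuming $q$ is the case-control sampling distribution and $l$ is the loss defined next:

$$E_{p}\big[l(x,y)\big] = E_{q}\Big[\frac{p(x,y)}{q(x,y)}\, l(x,y)\Big] = E_{q}\Big[\frac{p(y)}{q(y)}\, l(x,y)\Big] \approx \frac{1}{n}\sum_{i=1}^n \frac{p(y_i)}{q(y_i)}\, l(x_i,y_i).$$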

A logistic regression defines the probability prediction $f(x)_j = \frac{\exp(\alpha_j+\beta_j x)}{\sum_{m=0}^K \exp(\alpha_m+\beta_m x)}$ for $j=0,\dots, K$, and the loss function $l(x, y=k)=-\log( f(x)_k )$ is the cross-entropy loss.

So if a black-box machine learning researcher wants to do a retrospective epidemiology study, they would use the weighted loss function or the equivalent log predictive density
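With the cross-entropy loss above, a sketch of that weighted objective is the following. The label $L_w$ is introduced here only for reference, and the weighted loss is its negative.

$$L_w(\alpha,\beta) = \sum_{i=1}^n w_{y_i}\log f(x_i)_{y_i} = \sum_{i=1}^n w_{y_i}\Big[\alpha_{y_i}+\beta_{y_i}x_i - \log\sum_{j} \exp(\alpha_j+\beta_j x_i)\Big],$$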

where $w_k = p(y=k)/q(y=k)$ are importance ratios that depend only on $y_i$. If $q$ is balanced/uniform, $w_k \propto p(y=k)$, i.e., the population rate.

Surprisingly (to me), it is a totally different expression from the retrospective likelihood, and we will get a different answer by solving $\partial/\partial \beta_k=0$.

To be honest, at the risk of exposing my ignorance, I was totally shocked by this difference. So I ran a simple R simulation:

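library(arm)  # provides invlogit()
# Simulate a cohort of 10,000 with a rare outcome: logit p(y=1 | x) = x - 3
x=rnorm(10000,0,1)
p_sim=invlogit(x-3)
y=rbinom(n=length(x),size=1, prob=p_sim)
n1=sum(y==1)  # number of cases in the cohort
n0=sum(y==0)  # number of controls in the cohort
# Case-control (retrospective) sample: 500 cases and 500 controls, without replacement
reto_1=sample(which(y==1),size=500 )
reto_0=sample(which(y==0),size=500 )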