The likelihood principle in model check and model evaluation

Posted by Yuling Yao on Dec 16, 2020.       Tag: modeling  

The likelihood principle is often phrased as an axiom in Bayesian statistics. My interpretation of the likelihood principle reads:

We are (only) interested in estimating an unknown parameter $\theta$, and there are two data generating experiments both involving $\theta$, with observable outcomes $y_1$ and $y_2$ and likelihoods $p_1(y_1 \vert \theta)$ and $p_2(y_2 \vert \theta)$. If the outcome-experiment pairs satisfy $p_1(y_1 \vert \theta) \propto p_2(y_2 \vert \theta)$ (viewed as functions of $\theta$), then these two experiments and two observations provide the same amount of information about $\theta$.

Consider a classic example. Someone runs an A/B test and is only interested in the treatment effect, and he tells his manager that among all $n=10$ respondents, $y=9$ saw an improvement (assuming the metric is binary). It is natural to estimate the improvement probability $\theta$ with an independent Bernoulli trial likelihood: $y\sim \text{binomial}(\theta\vert n=10)$. An informative prior could also be used, but that is not relevant to our discussion here.

What is relevant is that the manager later found that the experiment was not done appropriately. Instead of independent data collection, the design was to keep recruiting respondents sequentially until $y=9$ of them were positive. The actual random outcome is $n$, while $y$ is fixed. So the correct model is $n\sim \text{negative binomial}(\theta\vert y=9)$, with the observed $n=10$.

Luckily, the likelihood principle kicks in, thanks to the fact that binomial_lpmf$(y\vert n, \theta) =$ neg_binomial_lpmf$(n-y\vert y, \theta)$ + constant. Hence, no matter how the experiment was done, the two models yield the same inference.
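
To make the constant explicit, write out the two probability mass functions; the $\theta$-dependent parts are identical and only the combinatorial coefficients differ:

\[
p_{\text{binomial}}(y\vert n,\theta)=\binom{n}{y}\theta^{y}(1-\theta)^{n-y}, \qquad
p_{\text{neg-binomial}}(n-y\vert y,\theta)=\binom{n-1}{y-1}\theta^{y}(1-\theta)^{n-y},
\]

so their ratio is $\binom{n-1}{y-1}\big/\binom{n}{y}=y/n$, a constant in $\theta$.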

At the abstract level, the likelihood principle says the information about $\theta$ can only be extracted via the likelihood, not from experiments that could have been done but were not.

For example, in hypothesis testing, the type-1 error is all about hypothetical experiments (e.g., under the null $\theta=0$). A classic example: one has two scales which return $y\sim$ N$(\theta, 1)$ or N$(\theta, 10000)$ respectively, and which scale is used is determined by a coin flip. But even if in one trial we know the precise scale was used, the (unconditional) hypothesis test still uses the inflated p-value $p= \Pr(\vert X_{mix}\vert >\vert y \vert)$, where $X_{mix}$ comes from the mixture density $X_{mix} \sim 0.5\, N(0,1)+0.5\, N(0,10000)$.
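
Here is a small R sketch of that inflation (illustrative only: I read the second argument of the normal as its variance, so the noisy scale has sd $=100$, and I plug in a hypothetical reading $y=3$ taken on the precise scale):

y_scale <- 3    # hypothetical reading, taken on the precise scale
# conditional two-sided p-value, using only the scale actually used (sd = 1)
2 * (1 - pnorm(abs(y_scale), mean = 0, sd = 1))                                   # about 0.0027
# unconditional two-sided p-value, mixing over the coin flip (sd = 1 or sd = 100)
2 * (1 - (0.5 * pnorm(abs(y_scale), 0, 1) + 0.5 * pnorm(abs(y_scale), 0, 100)))   # about 0.49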

What can go wrong in model check

The likelihood is dual-purposed in Bayesian inference. For inference, it is just one component of the unnormalized posterior density. But for model check and model evaluation, the likelihood function is what enables the generative model to produce posterior predictions of $y$.

In the binomial/negative binomial example, it is OK to stop at the inference of $\theta$. But as soon as we want to check the model, we do need to distinguish between the two possible sampling models, that is, which variable ($n$ or $y$) is random.
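
For instance, the replicated data we would simulate in a predictive check are different objects under the two models; a minimal sketch at a fixed $\theta=0.9$:

theta <- 0.9; y <- 9; n <- 10
# binomial sampling model: n is fixed by design, y_rep is random
y_rep <- rbinom(1000, size = n, prob = theta)
# negative binomial sampling model: y is fixed by design, n_rep is random
n_rep <- y + rnbinom(1000, size = y, prob = theta)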

Say we observe $y=9$ positive cases among $n=10$ trials and take the point estimate $\theta=0.9$; the likelihoods of the binomial and negative binomial models are

> y=9
> n=10
> dnbinom(n-y, y, 0.9)   # negative binomial: n-y failures before the y-th success
  0.3486784
> dbinom(y, n, 0.9)      # binomial: y successes in n trials
  0.3874205

They are not identical. But the likelihood principle does not require them to be identical: what is needed is a constant density ratio (in $\theta$), and that is easy to verify:

> prob_list=seq(0.5,0.95,length.out = 100)
> dnbinom(n-y,y, prob=prob_list)/dbinom(y,n, prob=prob_list)

The result is a constant ratio of $0.9$, which is exactly $y/n$.

However, the posterior predictive check (PPC) will have different p-values:

> 1-pnbinom(n-y,y, 0.9)
 	0.2639011
> 1-pbinom(y,n, 0.9)
	0.3486784

The difference between the PPC p-values can be even more dramatic at other values of $\theta$:

> 1-pnbinom(n-y,y, 0.99)
 	0.0042662
> 1-pbinom(y,n, 0.99)
	0.9043821

Just very different!

Clearly, using the full Bayesian posterior of $\theta$ does not fix the issue. The problem is that the likelihood principle ensures a constant ratio as a function of $\theta$, not as a function of $y_1$ or $y_2$.
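
To see this, here is a minimal sketch that averages over posterior draws of $\theta$ (assuming a uniform prior, so the posterior is Beta$(y+1, n-y+1)=$ Beta$(10,2)$ under either model):

set.seed(1)
y <- 9; n <- 10
theta_draws <- rbeta(4000, y + 1, n - y + 1)    # Beta(10, 2): same posterior under both models
# posterior predictive Pr(y_rep > y) under the binomial model (n fixed)
mean(rbinom(4000, size = n, prob = theta_draws) > y)
# posterior predictive Pr(n_rep > n) under the negative binomial model (y fixed)
mean(y + rnbinom(4000, size = y, prob = theta_draws) > n)

The two tail probabilities still differ, because the replicated data come from two different sampling distributions even though the posterior draws of $\theta$ are identical.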

Model selection?

Unlike the likelihood in the likelihood principle, which only needs to be specified up to a constant, the marginal likelihood used in model evaluation is required to be normalized.

In the previous A/B testing example, given data $(y,n)$, if we know that one and only one of the binomial and negative binomial experiments was run, we may want to perform model selection based on the marginal likelihood. For simplicity, consider the point estimate $\hat \theta=0.9$. We then obtain a likelihood ratio of $0.9$, slightly favoring the binomial model. In fact, this marginal likelihood ratio is the constant $y/n$, independent of the posterior distribution of $\theta$. If $y/n=0.001$, we would get a Bayes factor of 1000 favoring the binomial model.
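
A quick numerical check that this ratio does not depend on the prior (the two Beta priors below are chosen arbitrarily, just for illustration):

y <- 9; n <- 10
# marginal likelihoods under a Beta(a, b) prior on theta, by numerical integration
marg_ratio <- function(a, b) {
  m_nb  <- integrate(function(t) dnbinom(n - y, y, t) * dbeta(t, a, b), 0, 1)$value
  m_bin <- integrate(function(t) dbinom(y, n, t)      * dbeta(t, a, b), 0, 1)$value
  m_nb / m_bin
}
c(marg_ratio(1, 1), marg_ratio(5, 2))   # both 0.9 = y/n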

Except it is wrong. It is not sensible to compare a likelihood on $y$ and a likelihood on $n$.

What can go wrong in cross-validation

CV requires some loss function, and the same likelihood does not imply the same loss function (L2 loss, interval loss, etc.). For concreteness, we adopt the log predictive density for now.

CV also needs some part of the data to be exchangeable, which depends on the sampling distribution.

On the other hand, the computed LOO-CV estimate of the log predictive density seems to depend on the data only through the likelihood. Using the two-model notation, with $M_1: y_1\sim p_1(y_1\vert \theta)$ and $M_2: y_2\sim p_2(y_2\vert \theta)$, we have

\[\text{LOOCV}_1= \sum_i \log \int_\theta \frac{ p_\text{post} (\theta\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta) } \left( \int_{\theta} \frac{ p_\text{post} (\theta\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta) }\, d\theta\right)^{-1} p_1 (y_{1i}\vert\theta)\, d\theta,\]

and $\text{LOOCV}_2$ is obtained by replacing every subscript $1$ with $2$.

The likelihood principle does say that $p_\text{post} (\theta\vert M_1, y_1)=p_\text{post} (\theta\vert M_2, y_2)$, and if there is some generalized likelihood principle ensuring that $p_1 (y_{1i}\vert\theta)\propto p_2 (y_{2i} \vert\theta)$ for each $i$, then $\text{LOOCV}_1= \text{constant} + \text{LOOCV}_2$.
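
The constant can be made explicit: if $p_1(y_{1i}\vert\theta) = c_i\, p_2(y_{2i}\vert\theta)$ with $c_i$ free of $\theta$, then in the display above $c_i$ cancels between the importance weight and its normalizing integral and only multiplies the last factor, so each term shifts by $\log c_i$ and

\[
\text{LOOCV}_1 = \text{LOOCV}_2 + \sum_i \log c_i .
\]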

Sure, but that is an extra assumption. And arguably this point-wise likelihood principle is such a strong assumption that it would hardly hold beyond toy examples.

The basic form of the likelihood principle does not even have the notion of an individual data point $y_i$. It is also possible that $y_1$ and $y_2$ have different sample sizes: consider a meta-polling setting with many polls, where each poll is modeled as $y_i\sim \text{binomial}(n_i, \theta)$. If I have 100 polls, I have 100 data points. Alternatively, I can view the data as $\sum_i n_i$ Bernoulli trials, and the sample size becomes $\sum_{i=1}^{100} n_i$.
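
Here is a hedged sketch of how much the choice of exchangeable unit matters, with made-up poll counts and a conjugate Beta$(1,1)$ prior so that the exact leave-one-out predictive densities are available in closed form:

# Hypothetical meta-polling data (made up for illustration)
n_poll <- c(50, 80, 40, 60, 70)
y_poll <- c(30, 50, 22, 35, 41)
# exact LOO log score when each *poll* is the exchangeable unit (5 terms)
loo_poll <- sum(sapply(seq_along(y_poll), function(i) {
  a <- 1 + sum(y_poll[-i])                   # posterior Beta(a, b) from the other polls
  b <- 1 + sum(n_poll[-i] - y_poll[-i])
  # log beta-binomial predictive density of the held-out poll
  lchoose(n_poll[i], y_poll[i]) + lbeta(a + y_poll[i], b + n_poll[i] - y_poll[i]) - lbeta(a, b)
}))
# exact LOO log score when each *Bernoulli trial* is the exchangeable unit (sum(n_poll) terms)
x <- rep(c(1, 0), times = c(sum(y_poll), sum(n_poll - y_poll)))
loo_trial <- sum(sapply(seq_along(x), function(j) {
  a <- 1 + sum(x[-j])                        # posterior Beta(a, b) from the other trials
  b <- 1 + sum(1 - x[-j])
  dbinom(x[j], 1, a / (a + b), log = TRUE)   # predictive Pr(x_j = 1) = a / (a + b)
}))
c(loo_poll, loo_trial)   # two log scores, summed over different numbers of terms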

Finally, just as in the marginal likelihood case, even if all of the conditions above hold and the identity is satisfied, it is conceptually wrong to compare $\text{LOOCV}_1$ with $\text{LOOCV}_2$. They are scoring rules on two different spaces (probability measures on $y_1$ and $y_2$, respectively) and should not be compared directly.

PPC again

Although it is a bad practice, we sometimes compare PPC p-values for the purpose of model comparison. In the $y=9$, $n=10$, $\hat \theta=0.99$ case, we can compute the two-sided p-value $\min (\Pr(y_{sim} \le y \vert n), \Pr(y_{sim} > y \vert n))$ for the binomial model and $\min (\Pr(n_{sim} \le n \vert y), \Pr(n_{sim} > n \vert y))$ for the NB model, respectively.

> min(pnbinom(n-y,y, 0.99),  1-pnbinom(n-y,y, 0.99) )
  0.0042662
> min( pbinom(y,n, 0.99),   1-pbinom(y,n, 0.99))
  0.09561792

In the marginal likelihood and log score cases, we knew we could not directly compare two likelihoods or two log scores defined on two different sampling spaces. Here, the p-value is naturally normalized. Does that mean the NB model is rejected while the binomial model passes the PPC?

Still, we cannot: we should not be comparing p-values at all.

The likelihood principle and the sampling distribution

To avoid the unfair comparison of marginal likelihoods and log scores across two sampling spaces, one remedy is to consider the product space: both $y$ and $n$ are now viewed as random variables.

The binomial/negative binomial narrative specifies two models: $p(n,y\vert \theta)= 1(n=n_{obs})\, p(y\vert n, \theta)$ and $p(n,y\vert \theta)= 1(y=y_{obs})\, p(n\vert y, \theta)$.

The ratio of these two densities admits only three values: $0$, $\infty$, or the constant $y/n$.
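
Spelling this ratio out on the product space (taking negative binomial over binomial):

\[
\frac{1(y=y_{obs})\, p(n\vert y,\theta)}{1(n=n_{obs})\, p(y\vert n,\theta)}
=\begin{cases}
y_{obs}/n_{obs}, & n=n_{obs},\ y=y_{obs},\\
0, & n=n_{obs},\ y\neq y_{obs},\\
\infty, & n\neq n_{obs},\ y=y_{obs},
\end{cases}
\]

and elsewhere both densities vanish.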

If we observed several pairs of $(n, y)$, we could easily decide which margin is fixed. The harder problem is when we only observe one pair $(n,y)$. Based on the comparison of marginal likelihoods and log scores in the previous sections, it seems both metrics would still prefer the binomial model (now viewed as a sampling distribution on the product space).

Well, that is almost correct, except that 1) the sample log score is not meaningful when there is only one observation, and 2) we need some prior on the models to go from marginal likelihoods to a Bayes factor. After all, under both sampling models, the event admitting a nontrivial ratio, $1(y=y_{obs})\, 1(n=n_{obs})$, has zero measure. We could do whatever we want at this point without affecting any asymptotic property in the almost-sure sense.