Marginal likelihood and the Lindley paradox

Posted by Yuling Yao on Nov 22, 2021.       Tag: modeling  

I read an arXiv preprint, “History and Nature of the Jeffreys-Lindley Paradox” by Eric-Jan Wagenmakers and Alexander Ly. It is a comprehensive journey through the development of the “Jeffreys-Lindley Paradox”, or what is more commonly called the Lindley Paradox: we can reject a point null at p = 0.0001 while the Bayes factor (BF) may favor this point null at BF = 1000.

Wagenmakers and Ly point out two approaches to escape the Lindley Paradox: either avoid using a point hypothesis in the Bayes test, or avoid a vague prior. Notably, we may still face the Lindley Paradox when the null is a spiky continuous distribution rather than a point mass.

To make the discussion concrete, consider a Bernoulli experiment $y\sim \mathrm{Bin}(n,\theta)$ in which we observe $y = 5001$ successes out of $n=10000$ trials. We specify a point null $\theta=.5$ and the alternative $\theta\neq .5$, or, for the BF, $\theta \sim \mathrm{Uniform}(0,1)$. The p-value for the null is approximately $\Pr(z> 1/(0.5 \sqrt{n}))= \Pr(z>0.02) \approx 0.49$, so the data give no evidence against the null, while the BF in favor of the null is some very big number (about 80 here), as $\theta=.5$ predicts the outcome much better than the vague prior $\theta \sim \mathrm{Uniform}(0,1)$.
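A quick numeric check of these numbers, in Python with scipy (the normal approximation to the binomial is assumed for the z statistic):

```python
import numpy as np
from scipy import stats

n, y = 10_000, 5_001

# Normal-approximation z statistic for the point null theta = 0.5
z = (y - 0.5 * n) / (0.5 * np.sqrt(n))   # = 1 / (0.5 * sqrt(n)) = 0.02
p_value = stats.norm.sf(z)               # one-sided, about 0.49

# Marginal likelihood under the point null: just the binomial pmf at y
m0 = stats.binom.pmf(y, n, 0.5)          # about 0.008

# Marginal likelihood under theta ~ Uniform(0,1):
# integrating Bin(y; n, theta) over theta gives 1 / (n + 1)
m1 = 1.0 / (n + 1)                       # about 1e-4

bf_01 = m0 / m1                          # about 80, favoring the point null
print(z, p_value, bf_01)
```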

We have made this point in our hierarchical stacking article: a model being true or false is not directly related to it being good or bad at fitting the data. Indeed, a wronger model may make better predictions, depending on the chosen metric. In the Lindley paradox, at least as I see it, a Bayesian shall not conclude that $\theta=.5$ in light of a very big BF, because we know a priori that Pr(point null) = 0.

The marginal likelihood is only weakly related to how well the model fits the data. It reflects the average leave-$q$-out log predictive density as $q$ varies from 0 to $n$, among which the $q=n$ term accounts for a disproportionate share because the prior typically has poor predictive power.
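To see the connection concretely, note the exact decomposition $\log p(y_{1:n}) = \sum_{i=1}^n \log p(y_i \mid y_{1:i-1})$ for any ordering of the data; averaging over orderings gives the leave-$q$-out interpretation above, and the first term, prediction from the prior alone, is the one that drags the sum down. A minimal sketch with a conjugate Beta-Bernoulli model (the data and the Beta(1,1) prior are made up for illustration):

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=50)   # simulated Bernoulli data
a0, b0 = 1.0, 1.0                   # Beta(1,1) = Uniform(0,1) prior

# Closed-form log marginal likelihood of the whole sequence
s, n = y.sum(), len(y)
log_ml = betaln(a0 + s, b0 + n - s) - betaln(a0, b0)

# The same quantity as a sum of one-step-ahead log predictive densities,
# with Pr(y_i = 1 | y_{1:i-1}) = a / (a + b) under conjugate updating
log_seq, a, b = 0.0, a0, b0
for yi in y:
    p1 = a / (a + b)                # the first term uses the prior alone
    log_seq += np.log(p1 if yi == 1 else 1.0 - p1)
    a, b = a + yi, b + (1 - yi)

print(log_ml, log_seq)              # agree up to floating-point error
```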

To me, this irrelevance to the prediction task is the larger problem with BF: BF aims to test which model is more correct, rather than which model fits the data better. Worse, a model consists of two parts: the structure and the magnitude (the specific values in the prior). To do well by BF, you need to do well on both parts. At some point, it becomes a test of the prior rather than a test of the model. In contrast, in hypothesis testing, LOO model comparison, and posterior predictive checks, the prior is less relevant or irrelevant, because these approaches examine the predictive ability of the inferred model rather than of the prior.

BF/marginal likelihood does have its merit: we can easily game the empirical loss with an overfitting model, in which the empirical loss approaches zero while the marginal likelihood will typically be very small because of the large/complex parameter space the prior must cover (see the sketch below). In that sense, BF never overfits; BF always underfits.
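A minimal sketch of this behavior, in a made-up Gaussian setting rather than the binomial example above: a saturated model attains zero training error at its maximum likelihood fit, yet its marginal likelihood falls far below that of a simple model:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
n = 50
y = rng.normal(0.0, 1.0, size=n)

# Model A: y_i ~ N(mu, 1) with a shared prior mu ~ N(0, 1).
# Marginally, y ~ N(0, I + 11'), so the marginal likelihood is closed form.
log_ml_A = multivariate_normal.logpdf(
    y, mean=np.zeros(n), cov=np.eye(n) + np.ones((n, n))
)

# Model B (saturated): y_i ~ N(mu_i, 1), independent vague mu_i ~ N(0, 100).
# Marginally, each y_i ~ N(0, 101).
log_ml_B = norm.logpdf(y, loc=0.0, scale=np.sqrt(101.0)).sum()

# At its MLE (mu_i = y_i), model B fits the training data perfectly:
train_mse_B = np.mean((y - y) ** 2)      # exactly 0: zero empirical loss

print(log_ml_A, log_ml_B, train_mse_B)   # roughly -72 vs -162 vs 0
```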

Can we make BF less sensitive to priors? Yes: use intrinsic BMA, or its $n=1$ limit, pseudo-BMA (LOO-elpd weighting).
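Pseudo-BMA weights are proportional to the exponentiated LOO elpd of each model; a sketch with hypothetical elpd_loo values (in practice these would come from, e.g., PSIS-LOO):

```python
import numpy as np

# Hypothetical LOO expected log predictive densities, one per model
elpd_loo = np.array([-520.3, -518.1, -530.9])

# w_k proportional to exp(elpd_loo_k), shifted by the max for stability
w = np.exp(elpd_loo - elpd_loo.max())
w /= w.sum()
print(w)   # the second model gets most of the weight
```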

Can we monitor the empirical loss to test whether the model is true or false (rather than good or bad)? Yes, stay tuned.