Bayes is guaranteed to overfit, for any model, any prior, and every data point

Posted by Yuling Yao on May 26, 2023. Tag: modeling

The myth

A few years ago I saw a StackOverflow question (which I cannot find now): allegedly Andrew Gelman had blogged that Bayesian models would not overfit and would only underfit. (I cannot find Andrew’s blog post now either, but I think what he meant by “Bayesian models only underfit” is that you can never fully model the true data-generating process in the population.)

It reminded me that I have received several emails asking why we need cross-validation for Bayesian models: since Bayesian inference is not set up as empirical risk minimization or an M-estimate, there is no immediate reason why the empirical training risk/loss should underestimate the test risk. Such an argument seems popular in the Bayesian world; for example, a quick Google search of “Bayes does not overfit” turns up these lecture notes from Edinburgh:

Fully Bayesian procedures can’t suffer from “overfitting”, because parameters aren’t fitted: Bayesian statistics only involves integrating or summing over uncertain parameters, not maximizing. The predictions can depend heavily on the model and choice of prior however.

The truth

I have a different view: Bayesian models do overfit.

Moreover, Bayes is guaranteed to overfit, regardless of the model (correct or wrong) or the prior (“strong” or uninformative).

Moreover, Bayes is guaranteed to overfit on every realization of training data, not just in expectation.

Moreover, Bayes is guaranteed to overfit on every single point of the training data, not just in the summation.

The guarantee

To see this¹, let’s work in a general model setting with exchangeable observations $y_1, \dots, y_n$: the data model is any $p(y \vert \theta)$ and you could have any prior $p(\theta)$. Whether the model is correct or wrong is not relevant to our discussion here. The posterior predictive distribution of any future unseen data is $p(\cdot \vert y) = \int p(\cdot \vert \theta ) p(\theta \vert y) d\theta$. We will evaluate the prediction by its expected log score. The in-sample log score is

\[\sum_{i=1}^n \log p(y_i \vert y), ~~~ p(y_i \vert y)= \int p(y_i \vert \theta) p(\theta \vert y)d\theta.\]

It is a sum of $n$ terms. Each term, $p(y_i \vert y)$, is a (weighted) arithmetic mean of $p(y_i \vert \theta)$. You could imagine that if you have access to $S$ Monte Carlo draws $\theta_1, \dots, \theta_S$ from $p(\theta \vert y)$, then this in-sample individual predictive density is the arithmetic mean of $p(y_i \vert \theta_j)$, $j=1, \dots, S$.
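To make this concrete, here is a minimal numerical sketch, assuming a toy conjugate model ($y_i \sim \mathrm{Normal}(\mu, 1)$ with prior $\mu \sim \mathrm{Normal}(0, 1)$) so that exact posterior draws are available; the model and all variable names are only illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data: y_i ~ Normal(mu, 1), prior mu ~ Normal(0, 1) (illustrative choices).
n = 20
y = rng.normal(loc=0.7, scale=1.0, size=n)

# Conjugate posterior for mu is Normal(post_mean, post_sd^2).
post_prec = 1.0 + n                      # prior precision + n * data precision
post_mean = y.sum() / post_prec
post_sd = np.sqrt(1.0 / post_prec)

# S exact Monte Carlo draws from p(theta | y).
S = 4000
mu_draws = rng.normal(post_mean, post_sd, size=S)

# p(y_i | theta_j) for every data point i and draw j: an (n, S) matrix.
lik = stats.norm.pdf(y[:, None], loc=mu_draws[None, :], scale=1.0)

# In-sample pointwise predictive density: the arithmetic mean over the draws.
p_in = lik.mean(axis=1)                  # estimates p(y_i | y)
print(np.log(p_in).sum())                # in-sample log score
```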

To evaluate the out-of-sample prediction, we can leave the $i$-th data point out. The leave-one-out log predictive density is $\sum_{i=1}^n \log p(y_i \vert y_{-i})$, where the $i$-th predictive density is

\[p(y_i \vert y_{-i})= \int p(y_i \vert \theta)\, p(\theta \vert y)\, \frac{p(y_i \vert \theta)^{-1}} {\int p(y_i \vert \theta^\prime)^{-1} p(\theta^{\prime} \vert y)\, d\theta^\prime}\, d\theta = \frac{1} {\int p(\theta \vert y)\, p(y_i \vert \theta)^{-1}\, d\theta},\]

that is, the weighted harmonic mean of $p(y_i \vert \theta)$, with weights given by the posterior $p(\theta \vert y)$. Again, when you have $S$ Monte Carlo draws from $p(\theta \vert y)$, the out-of-sample individual predictive density is precisely the harmonic mean of $p(y_i \vert \theta_j)$, $j=1, \dots, S$.

Except in degenerate cases (a point-mass posterior), the harmonic mean inequality guarantees a strict inequality $p(y_i \vert y_{-i}) < p(y_i \vert y)$, for any point $i$ and any model.
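Continuing the same toy sketch from above, the leave-one-out predictive density is the harmonic mean of exactly the same likelihood values, so the strict inequality can be checked point by point. (This is only an illustration; the raw harmonic-mean estimator can be noisy, and in practice one would stabilize it with Pareto-smoothed importance sampling, as in the loo package.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Same illustrative setup as above: y_i ~ Normal(mu, 1), prior mu ~ Normal(0, 1).
n, S = 20, 4000
y = rng.normal(0.7, 1.0, size=n)
post_prec = 1.0 + n
mu_draws = rng.normal(y.sum() / post_prec, np.sqrt(1.0 / post_prec), size=S)
lik = stats.norm.pdf(y[:, None], mu_draws[None, :], 1.0)   # p(y_i | theta_j), shape (n, S)

# In-sample: arithmetic mean of the likelihood over the posterior draws.
p_in = lik.mean(axis=1)

# Leave-one-out: harmonic mean of the same values
# (self-normalized importance weights proportional to 1 / p(y_i | theta_j)).
p_loo = 1.0 / (1.0 / lik).mean(axis=1)

print(np.all(p_loo < p_in))                       # True: the gap is positive at every point
print(np.log(p_in).sum() - np.log(p_loo).sum())   # total (positive) generalization gap
```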

What is the most overfitting inference?

Bayes always overfits; how about other inferences?

From the harmonic mean inequality, you can think about an abstract math problem: the pointwise in-sample score is the posterior mean of $p(y_i \vert \theta)$, while the leave-one-out version is the inner product, under the posterior, of $p(y_i \vert \theta)$ and a $\theta$-density proportional to $\frac{1}{p(y_i \vert \theta)}$.

Bayesian inference can be characterized by its sequential update: what you do when you see a new data point $y_i$. Bayes multiplies the posterior by a factor ${p(y_i \vert \theta)}$ and then renormalizes, because the density needs to integrate to 1. But there are other updates. For example, if you choose to update the posterior by ${p(y_i \vert \theta)}^{\alpha}$, so that your posterior is proportional to $p(y\vert \theta)^\alpha p(\theta)$, then the out-of-sample predictive density will be the inner product of $p(y_i \vert \theta)$ and a density proportional to $\frac{1}{p(y_i \vert \theta)^\alpha}$.

From Jensen’s inequality, it is easy to see that the generalization gap is strictly monotone in $\alpha$. When $\alpha=0$, the inference always returns the prior and there is no overfitting. As $\alpha \to \infty$, the posterior concentrates at the MLE, and you get the maximum amount of overfitting.

The math we need is elementary: suppose you have a sequence of (fixed) positive numbers $a_1, \dots, a_J$, and another sequence of variable positive numbers $b_1, \dots, b_J$ with $\sum_{j=1}^J b_j=1$. We compute their inner product $\sum_{j=1}^J a_j b_j$. When $b_j=1/J$, you get the arithmetic mean, corresponding to the in-sample predictive density; when $b_j \propto 1/a_j$, you get the harmonic mean, corresponding to the out-of-sample Bayes predictive density; when $b_j = \mathbf{1} (a_j= \min(a_1, \dots, a_J))$, you get $\min_j a_j$, the minimum this summation can attain, corresponding to the out-of-sample MLE.
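Here is a small numerical illustration of this abstract problem, with weights $b_j \propto a_j^{-\alpha}$ interpolating between the three cases; the numbers $a_j$ are arbitrary stand-ins for $p(y_i \vert \theta_j)$, and the code is only a sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# a_j plays the role of p(y_i | theta_j) across posterior draws (arbitrary positive numbers).
a = rng.uniform(0.05, 1.0, size=2000)

def tilted_mean(a, alpha):
    """Inner product sum_j a_j * b_j with weights b_j proportional to a_j**(-alpha)."""
    b = a ** (-alpha)
    b /= b.sum()
    return float(np.dot(a, b))

# The value decreases monotonically in alpha:
for alpha in [0.0, 0.5, 1.0, 2.0, 10.0, 100.0]:
    print(alpha, tilted_mean(a, alpha))

# alpha = 0        -> arithmetic mean (in-sample Bayes),
# alpha = 1        -> harmonic mean   (leave-one-out Bayes),
# alpha -> infinity -> min(a_j)       (leave-one-out MLE limit).
print(a.mean(), len(a) / (1.0 / a).sum(), a.min())
```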

The bottom line:

  1. Bayes is guaranteed to overfit any training data.
  2. MLE is guaranteed to overfit more than Bayes.
  3. Both claims hold for any model and any prior. They hold point-wise, not just on average.
  4. These strict inequalities come from the convexity of $f(x)=1/x, ~x>0$; see the one-line Jensen step below.
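Explicitly, take the expectation under the posterior $p(\theta \vert y)$ and let $X = p(y_i \vert \theta)$, which is non-degenerate outside the point-mass case above. Strict convexity of $1/x$ on $x>0$ gives

\[\mathrm{E}\left[\frac{1}{X}\right] > \frac{1}{\mathrm{E}[X]}, \qquad \text{hence} \qquad p(y_i \vert y_{-i}) = \frac{1}{\mathrm{E}\left[1/p(y_i \vert \theta)\right]} < \mathrm{E}\left[p(y_i \vert \theta)\right] = p(y_i \vert y).\]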

P.S.

This post has been discussed a lot on Twitter. I thank all the readers.

To clarify, I think the word overfit could mean two things:

  1. The test error is larger than the training error, and/or
  2. A complicated model gives worse predictions than a simpler model in testing.

In this blog post, I only refer to (1), not (2). Perhaps a more accurate statement is that the Bayesian posterior prediction always has a positive generalization gap at every single point. In particular, I do not discuss whether a Bayesian prediction necessarily has a better or worse test prediction error than the MLE.

P.P.S. When you do have the correct model, the Bayesian prediction is optimal in the sense of prior replication; see our review paper on Bayes prediction. But even then, the in-sample evaluation will underestimate the prediction error when you have a finite sample size. Here, the underestimation is an issue of model evaluation, not of model design.

P.P.P.S. Don’t get me wrong: I do use, like, and advocate Bayesian approaches. But when there is a distinction between treating-Bayes-rule-as-an-always-correct-black-box and an open-minded view of using-Bayes-and-Bayesian-decision-theory-as-building-blocks-toward-model-improvement, I prefer the latter view.

  1. I am not aware whether this justification has appeared before. If you would like to cite this “proof”, you can cite my recent review paper on Bayesian prediction, where I add this simple math as a side note.