MEBA—Make Empirical-Bayes Bayes again

Posted by Yuling Yao on Aug 19, 2021.       Tag: modeling  

Assume there are some hyperparameters $\beta$ in a model of data $y$. We have four ways to obtain inference for $\beta$.

MAP is bad

First, we have MAP, or empirical loss minimization. That is, for each $\beta$, we could train the model and obtain some in-sample loss $l(y_i \mid \beta)$. Then we minimize the total loss: $\hat \beta_{MAP}= \arg\min_\beta \sum_{i} l(y_i \mid \beta ).$

We could also add some prior regularization $p(\beta)$, which modifies this to

\[\hat \beta_{MAP}= \arg\min_\beta \Big( \sum_{i} l(y_i \mid \beta) - \log p(\beta) \Big).\]
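To make this concrete, here is a minimal numerical sketch, not from the post: I take $\beta$ to be a ridge penalty, let $l(y_i \mid \beta)$ be the in-sample squared error after fitting at that penalty, and assume an Exponential(1) prior on $\beta$, so $-\log p(\beta) = \beta$ up to a constant. All names and numbers are illustrative.

```python
# A minimal sketch: beta is a ridge penalty, l(y_i | beta) is the in-sample
# squared error after fitting at that penalty, and p(beta) is an assumed
# Exponential(1) prior, so -log p(beta) = beta up to a constant.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def fit_ridge(X, y, beta):
    # penalized least squares at penalty beta
    return np.linalg.solve(X.T @ X + beta * np.eye(X.shape[1]), X.T @ y)

def in_sample_loss(beta):
    theta_hat = fit_ridge(X, y, beta)
    return np.sum((y - X @ theta_hat) ** 2)

betas = np.linspace(1e-3, 20, 400)
objective = [in_sample_loss(b) + b for b in betas]   # "+ b" is -log p(beta)
beta_map = betas[np.argmin(objective)]
print("MAP estimate of beta:", beta_map)   # sits at the smallest grid value
```

Because the in-sample loss only gets better as the penalty shrinks, the estimate collapses to the smallest $\beta$ on the grid, which is exactly the overfitting discussed next.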

We can go LOO, or we can go Bayes

The above procedure is attacked in two ways. One argument is that empirical loss optimization overfits because of the misuse of in-sample error. We can adjust for this error by using cross-validation. For example, combining leave-one-out CV with empirical loss optimization, we have

\[\hat \beta_{LOO}= \arg\min_\beta \Big( \sum_{i} l( y_i \mid \beta, y_{-i}) - \log p(\beta) \Big).\]

This LOO step is related to empirical Bayes, if we use the LOO metric as the empirical-Bayes objective.
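Continuing the ridge sketch above (same illustrative `X`, `y`, `fit_ridge`, `betas`, and assumed Exponential(1) prior), the LOO version refits without observation $i$ and scores the held-out point:

```python
# Continuing the ridge sketch above: refit without observation i,
# score the held-out point, and minimize the LOO loss plus -log p(beta).
def loo_loss(beta):
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        theta_hat = fit_ridge(X[keep], y[keep], beta)
        total += (y[i] - X[i] @ theta_hat) ** 2
    return total

loo_objective = [loo_loss(b) + b for b in betas]   # "+ b" is -log p(beta)
beta_loo = betas[np.argmin(loo_objective)]
print("LOO-MAP estimate of beta:", beta_loo)   # typically pulled well away from zero
```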

Yet another attack on MAP is that it is a point estimate: “You overfit cuz you ignore the uncertainty.” As an attempt to fix this, we have a generalized Bayesian step:

\[\log p (\beta \mid y) = - \sum_{i} l(y_i \mid \beta ) + \log p(\beta).\]
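In the same illustrative ridge sketch, this generalized posterior is just the exponential of the negative in-sample loss plus the log prior; on a one-dimensional grid we can normalize it directly:

```python
# Continuing the ridge sketch: treat -sum_i l(y_i | beta) + log p(beta)
# as an unnormalized log posterior and normalize it over the grid.
log_post = np.array([-in_sample_loss(b) - b for b in betas])  # "- b" is log p(beta) up to a constant
log_post -= log_post.max()                                    # stabilize before exponentiating
post = np.exp(log_post)
post /= np.trapz(post, betas)                                 # normalize on the grid
print("posterior mean of beta:", np.trapz(betas * post, betas))
```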

Can we go both ways?

It is natural to ask: which one is better, Bayes or LOO-MAP? It depends. For example, in the context of regression, the LASSO (where the hyperparameter is tuned by LOO) is much better than the Bayesian lasso (in which the hyperparameter is treated as a parameter and fit by Bayes' rule).
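As a concrete, merely illustrative version of the LOO-tuned side of that comparison, here is a scikit-learn sketch with made-up data; the Bayesian-lasso side would need a full posterior over the penalty (e.g., via MCMC) and is not shown.

```python
# LOO-tuned lasso with scikit-learn; data and dimensions are made up.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 15))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=40)   # sparse truth

fit = LassoCV(cv=LeaveOneOut()).fit(X, y)
print("LOO-selected penalty:", fit.alpha_)
```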

But the bigger picture is a 2-by-2 table:

|                | in-sample loss | LOO loss                       |
| -------------- | -------------- | ------------------------------ |
| point estimate | MAP 😱         | LOO-MAP (Empirical Bayes) 😐   |
| full posterior | Bayes 😐       | 😊                             |

We have two directions in which to improve MAP: using LOO or using Bayes. But can we combine them? Can we reach that 😊 cell?

Bayesianize the Empirical-Bayes

The idea is to define a posterior density via the leave-one-out likelihood:

\[\log p (\beta \mid y)= - \sum_{i} l( y_i \mid \beta, y_{-i}) + \log p(\beta).\]
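In the illustrative ridge sketch from earlier (same `loo_loss`, `betas`, and assumed Exponential(1) prior), this simply swaps the LOO losses in for the in-sample losses before normalizing:

```python
# Continuing the ridge sketch: the proposed posterior replaces the in-sample
# losses with the LOO losses, then normalizes over the grid as before.
log_post_loo = np.array([-loo_loss(b) - b for b in betas])  # "- b" is log p(beta) up to a constant
log_post_loo -= log_post_loo.max()
post_loo = np.exp(log_post_loo)
post_loo /= np.trapz(post_loo, betas)
print("LOO-Bayes posterior mean of beta:", np.trapz(betas * post_loo, betas))
```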

Is it justified as full Bayes? Yes: it can be viewed as data augmentation. Suppose there were a hold-out dataset $y^\prime$. We could use the first dataset to obtain the conditional parameter inference $p( \theta \mid y, \beta)$, and then obtain exact Bayesian inference on the hyperparameter $\beta$, $p( \beta \mid y, y^\prime)$, from the hold-out data $y^\prime$ by integrating out $\theta$. Now, instead of having this hold-out dataset, we integrate it out; that is where the LOO likelihood comes from.
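In symbols (my paraphrase of this argument, with $y^\prime$ a hypothetical hold-out set that is the only data updating $\beta$ directly, and $p(\theta \mid y, \beta)$ acting as the second-stage prior):

\[p(\beta \mid y, y^\prime) \propto p(\beta) \int p(y^\prime \mid \theta, \beta)\, p(\theta \mid y, \beta)\, d\theta.\]

Replacing the unavailable hold-out set by the left-out points, with each $y_i$ scored against the fit to $y_{-i}$, recovers the LOO likelihood above.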

Is there an example in which this idea yields success? Yes: we showed in our hierarchical stacking paper that this LOO-likelihood sampling (hierarchical stacking) yields better predictions than LOO optimization (no-pooling stacking).

Are there computational advantages over LOO-MAP? Yes: LOO-MAP is often done by grid search, but we can now use gradient information (with respect to $\beta$) when sampling from this density.

Can we extend this to a general inference paradigm that sits parallel to, if not above, MAP, Bayes, and empirical Bayes? Highly promising. I am looking forward to that.