Why Bayesian models could make better predictions.

Posted by Yuling Yao on Sep 28, 2020.       Tag: modeling  

In a predictive paradigm, no one really cares about how I obtain the estimate or the prediction. It can come from MLE, MAP, risk minimization, or some Bayes procedure. Also, in a predictive paradigm, a big selling point of the Bayesian procedure, that it offers uncertainty intervals for free, is largely undermined. A hedge fund manager does not really care about the p value of strategy A outperforming strategy B. Even if the evidence is weak, say the estimated winning effect is N(0.1, 1), decision theory says he should adopt the empirical winner.
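
To spell out the decision calculus in this toy example (assuming a linear utility in the effect size):

$$
\theta \sim \mathrm{N}(0.1,\, 1), \qquad \Pr(\theta > 0) = \Phi(0.1) \approx 0.54, \qquad \mathrm{E}[\theta] = 0.1 > 0,
$$

so the expected gain from adopting strategy A is positive, even though the evidence that A beats B is barely better than a coin flip.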

However, even with this black-box prediction attitude, a Bayesian model analysis is still useful, for the following reasons.

  1. Prior = regularization = robustness. I agree this argument is lame, as everyone is using regularization anyway.

  2. The posterior distribution is simply more expressive than a MAP point estimate. Sure, from the probabilistic-prediction point of view, an even more general procedure is to optimize over all probability distributions subject to the decision problem. But the space of all probability distributions is too big, and Bayesian inference gives a coherent approximation to this infeasible goal. A warning is that the Bernstein-von Mises theorem says “however, in the long run, we are all dead”, because the Bayesian posterior has to converge to a point estimate anyway. But that warning assumes the model is given and fixed, which in practice it is not.

  3. It is easier to work with Bayesian models on hierarchical data. Designing a good hierarchical model to reflect the data structure is pretty much like building a customized NN architecture, except you do not publish every new model in NIPS.

  4. A Bayesian procedure is computationally harder, but once we obtain the final posterior simulation, it becomes easier to run other post-processing and adversarial training. For example, BMA gives a coherent model ensemble (which in general is not feasible with MAP estimates); see the first sketch after this list. Simulated tempering is a more sophisticated version of NN distillation.

  5. The ability to simulate “generated quantities” for free. When we model the stock price by a linear regression with a normal model, we would otherwise have to rely on a normal approximation for the option. But with MCMC draws, the option striking probability can be calculated directly; see the second sketch after this list. The same blessing applies to intervals and quantiles.
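
For the ensemble claim in point 4, here is a minimal sketch of mixing posterior predictive draws with ensemble weights. All names and numbers are made up for illustration; in practice the draws come from two fitted models and the weights from BMA or stacking.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for posterior predictive draws of the same quantity from two fitted models;
# in practice these come out of each model's MCMC run.
draws_m1 = rng.normal(0.0, 1.0, size=4000)
draws_m2 = rng.normal(0.5, 2.0, size=4000)

# Stand-in ensemble weights (e.g., posterior model probabilities for BMA, or stacking weights).
w = np.array([0.3, 0.7])

# The ensemble predictive distribution is a mixture:
# each draw is taken from model k with probability w[k].
pick = rng.choice(2, size=4000, p=w)
ensemble_draws = np.where(pick == 0, draws_m1, draws_m2)

print(ensemble_draws.mean(), np.quantile(ensemble_draws, [0.05, 0.95]))
```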
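
And for point 5, a minimal sketch of computing a “generated quantity” (the option striking probability and a predictive interval) directly from posterior draws. The draws below are faked just to keep the snippet self-contained; in a real analysis they would come from the fitted regression.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fake posterior draws of the regression intercept, slope, and residual scale
# (in practice, read these from the MCMC output of the fitted model).
a_draws = rng.normal(100.0, 2.0, size=4000)
b_draws = rng.normal(0.5, 0.1, size=4000)
sigma_draws = np.abs(rng.normal(3.0, 0.5, size=4000))

t_future = 30.0   # covariate value (e.g., time) at which we predict the price
strike = 120.0    # hypothetical strike price

# One posterior predictive draw of the future price per posterior draw of the parameters.
price_draws = rng.normal(a_draws + b_draws * t_future, sigma_draws)

# Generated quantities computed on the draws, with no normal approximation needed.
prob_strike = np.mean(price_draws > strike)          # P(price exceeds the strike)
interval = np.quantile(price_draws, [0.05, 0.95])    # 90% predictive interval

print(f"P(price > strike) ~ {prob_strike:.2f}, 90% interval ~ {interval.round(1)}")
```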

To be sure, there is NOT a guarantee that a Bayesian posterior must produce a better prediction; we do not even expect it to be so for an arbitrary model. The model-has-been-fit-so-we-are-done era is gone.

Unfortunately, the previous points 3-5 are pretty much unaddressed in “Bayesian learning”. Most theories are based on exchangeable data; minimal post-processing is done; customized prediction is not really a thing. Much room for development.