Bayesian as modeling vs Bayesian as computation

Posted by Yuling Yao on Sep 23, 2019.       Tag: casual, modeling  

I came up with this random thought while sitting in Andrew's class this afternoon (as a TA). To emphasize why Bayesian inference is different, Andrew listed a few alternatives for data analysis, such as MLE, machine learning, and hypothesis testing.

But this comparison is vague. There is no reason why a Bayesian procedure cannot be used in machine learning, and no reason why a MAP estimator cannot be viewed as an approximation to the Bayesian posterior. The point is that Bayesian statistics consists of two parts: modeling, which is really probabilistic modeling; and computation, which is evaluating the posterior distribution.
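To make this split concrete, here is a minimal sketch (my own toy example, not from Andrew's class): one probabilistic model, y ~ Normal(theta, 1) with prior theta ~ Normal(0, 1), handled by two different computations, optimization for a MAP point estimate versus sampling from the posterior.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=20)  # simulated data (illustrative)

def neg_log_posterior(theta):
    # -log p(theta | y) up to a constant: Normal likelihood + Normal(0, 1) prior
    return 0.5 * np.sum((y - theta) ** 2) + 0.5 * theta ** 2

# Computation 1: optimization gives the MAP point estimate.
map_est = minimize_scalar(neg_log_posterior).x

# Computation 2: sampling from the exact conjugate posterior,
# theta | y ~ Normal(n * ybar / (n + 1), 1 / (n + 1)).
n = len(y)
draws = rng.normal(n * y.mean() / (n + 1), np.sqrt(1 / (n + 1)), size=4000)

print(map_est, draws.mean())  # same model, two computations; here they agree
```

In this conjugate example the MAP estimate and the posterior mean coincide, which is exactly why it is natural to read MAP as an approximation to the posterior rather than as a rival framework.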

Among the alternative procedures, some are intrinsically incompatible with probabilistic modeling. For example, in the original context of Fisher's exact test, it makes no sense to talk about a generative model: the data have to be as they are. Another example is in econometrics, where even a linear regression may be assumed only through a finite second-moment condition, so there is no generative model at all.

But apart from these special cases, most statistical models are in fact probabilistic, and therefore (conceptually) Bayesian. In this view, the only remaining difference between an MLE and a posterior is the divergence between optimization and sampling. Under the usual regularity conditions with a finite-dimensional parameter space, the two are also asymptotically the same.
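A toy illustration of this asymptotic agreement (again my own example, under the standard regularity conditions): Bernoulli data with a flat Beta(1, 1) prior, where the conjugate posterior concentrates at the MLE as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_p = 0.3
for n in [10, 100, 10_000]:
    y = rng.binomial(1, true_p, size=n)
    mle = y.mean()                                        # optimization answer
    posterior = stats.beta(1 + y.sum(), 1 + n - y.sum())  # conjugate posterior
    # posterior mean approaches the MLE and the posterior sd shrinks to 0
    print(n, mle, posterior.mean(), posterior.std())
```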

The final question is: if I have a finite sample size and an infinite computation budget, should I replace optimization with sampling in ALL models? The answer really depends. A point estimate is itself a form of regularization and should be taken into account in the modeling. For example, both point-estimation LASSO and posterior-sampling horseshoe work reasonably well, but posterior-sampling LASSO (full Bayes with a Laplace prior) is undesirable (see the sketch below). A vaguer but more profound example is the asymptotic theoretical equivalence, and the comparable practical popularity, of GPs and NNs.
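To see the LASSO point numerically, here is a sketch with illustrative numbers of my own: with a single observation y ~ Normal(theta, 1) and a Laplace prior, the posterior mode (the LASSO estimate) is exactly zero whenever |y| falls below the soft-thresholding cutoff, while the posterior mean under the same prior is never exactly zero. The sparsity comes from the point-estimation computation, not from the prior alone, so switching the computation to full-Bayes sampling throws it away.

```python
import numpy as np

lam = 1.0                            # Laplace prior rate (illustrative value)
theta = np.linspace(-10, 10, 20001)  # grid for a simple numerical posterior

for y in [0.5, 1.5, 3.0]:
    # log posterior up to a constant: Normal(y | theta, 1) times Laplace prior
    log_post = -0.5 * (y - theta) ** 2 - lam * np.abs(theta)
    w = np.exp(log_post - log_post.max())
    post_mean = (theta * w).sum() / w.sum()           # full-Bayes summary
    lasso_mode = np.sign(y) * max(abs(y) - lam, 0.0)  # posterior mode = soft thresholding
    print(f"y={y}: LASSO mode = {lasso_mode:.3f}, posterior mean = {post_mean:.3f}")
```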

Does there exist a bijective mapping on the model-computation pair, so that we can always transform a model fitted by a point-estimation procedure into a Bayesian model (with a different regularization) that is aimed at full-Bayes sampling?