Best point mass approximation
Posted by Yuling Yao on Feb 20, 2021.

It often comes up that we summarize a continuous distribution (typically the posterior distribution of a parameter or of predictions) by a point mass (or a sharp spike) for (1) lower computation or memory cost, (2) easier communication, (3) use as part of a larger approximate computation, such as EM or dropout, or a combination of all of these.
How do we justify the choice of point summary? Here are a few justifications:
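- The mean of the posterior density minimizes the $L^2$ risk, that is, the posterior expected squared error $\mathrm{E}[(\theta - \theta_0)^2]$ of the point summary $\theta_0$.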
- The mode of the posterior density minimizes the KL divergence to it. Well, this needs more explanation. Assume $p(\theta)$ is some continuous density on the real line with respect to the Lebesgue measure $\mu(\theta)$. A point mass $\delta(\theta_0)$, however, is defined with respect to the counting measure $\mu_c(\theta)$, and defining the KL divergence requires both distributions to live on the same measurable space. To overcome this ambiguity we can either (a) equip the real line with the summed measure $\mu_c(\theta)+ \mu(\theta)$, in which case the KL divergence reads $C - \log p(\theta_0)$, which is minimized at the mode, or (b) view $\delta(\theta_0) \approx \mathrm{normal}(\theta_0, \tau)$ for a very small but fixed $\tau$, in which case the KL divergence is approximately $C(\tau) - \log p(\theta_0)$, where $C(\tau)$ is the negative entropy of the normal and does not depend on $\theta_0$; this is again minimized at the mode. Put another way, the MAP is always the spiky variational inference approximation to the exact posterior density.
- The Wasserstein metric is different. It is legitimate to consider the Wasserstein metric between a point mass and a continuous density. The posterior median minimizes the Wasserstein metric of order 1 and the posterior mean minimizes the Wasserstein metric of order 2. Likewise, we can argue that the posterior median/mean is the variational inference approximation to the exact posterior density under the Wasserstein metric.
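
The three claims above can be checked numerically. The only coupling between a point mass at $\theta_0$ and $p$ sends all the mass to $\theta_0$, so the order-$k$ Wasserstein distance satisfies $W_k^k = \mathrm{E}_p\lvert\theta - \theta_0\rvert^k$. Below is a minimal sketch, not a definitive implementation. It makes several assumptions that are not part of the argument itself: the "posterior" is a Gamma(3, 1) density (picked only because its mode, median and mean all differ), integrals are plain Riemann sums on a grid, $\tau = 0.05$ is read as a standard deviation, and the helper names `w1`, `w2_squared`, `kl_spike` and `argmin` are made up for illustration.

```python
import numpy as np

# Illustrative "posterior" (an assumption, not from the post): Gamma(shape=3, scale=1).
# Its mode is 2, its median is about 2.67, and its mean is 3.
grid = np.linspace(1e-6, 40.0, 40_001)
dx = grid[1] - grid[0]
log_p = 2.0 * np.log(grid) - grid - np.log(2.0)  # log density; normalizer is Gamma(3) = 2
p = np.exp(log_p)


def w1(theta0):
    """Order-1 Wasserstein distance to a point mass at theta0: E|theta - theta0|."""
    return np.sum(np.abs(grid - theta0) * p) * dx


def w2_squared(theta0):
    """Squared order-2 Wasserstein distance to a point mass: E(theta - theta0)^2."""
    return np.sum((grid - theta0) ** 2 * p) * dx


def kl_spike(theta0, tau=0.05):
    """KL(normal(theta0, tau) || p), with tau treated as a standard deviation."""
    log_q = -0.5 * ((grid - theta0) / tau) ** 2 - np.log(tau * np.sqrt(2.0 * np.pi))
    q = np.exp(log_q)
    # Where q underflows to 0 the integrand is 0 * (finite number) = 0.
    return np.sum(q * (log_q - log_p)) * dx


# Candidate locations for the point mass, spaced 0.01 apart.
candidates = np.linspace(0.5, 6.0, 551)


def argmin(loss):
    """Brute-force minimizer of `loss` over the candidate grid."""
    return candidates[np.argmin([loss(t) for t in candidates])]


best_kl = argmin(kl_spike)    # expected to land near the mode
best_w1 = argmin(w1)          # expected to land near the median
best_w2 = argmin(w2_squared)  # expected to land near the mean

print(f"KL spike minimizer:      {best_kl:.2f}")
print(f"Wasserstein-1 minimizer: {best_w1:.2f}")
print(f"Wasserstein-2 minimizer: {best_w2:.2f}")
```

A grid search is used here instead of an optimizer so that the three losses are treated identically and nothing depends on starting values. With a larger $\tau$ the KL minimizer drifts away from the mode, which is why the spike has to be narrow for the MAP interpretation to hold.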