ABC, Model Misspecification, and Bimodality
Posted by Yuling Yao on Jun 18, 2019.
Recently I took a short class by Prof Simon Tavaré on approximate Bayesian computation (ABC). I gave a short presentation on multimodality in ABC.
First, multimodality is challenging in both MCMC and ABC, but for different reasons. In MCMC, and in particular HMC, multimodality rules out log-concavity of the density, or equivalently the Hessian is unbounded, and therefore the mixing time between two modes (metastable regions) is unbounded.
For ABC, the sampler is not required to move between modes. But we do, after all, sample from the prior, and the divergence between prior and posterior is $\mathrm{KL} ( p(\theta ), p(\theta \mid y) ) = C - \int p(\theta ) \log p(y\mid\theta) d \theta$, where $C = \log p(y)$ is the log prior predictive density. When this divergence is large, ABC cannot be efficient.
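For reference, the divergence can be expanded with Bayes' rule, $p(\theta \mid y) = p(\theta) p(y \mid \theta) / p(y)$. A short derivation, writing $C = \log p(y)$ for the term that does not depend on $\theta$:

\[
\begin{aligned}
\mathrm{KL} ( p(\theta ), p(\theta \mid y) )
&= \int p(\theta) \log \frac{p(\theta)}{p(\theta \mid y)} d\theta \\
&= \int p(\theta) \log \frac{p(y)}{p(y \mid \theta)} d\theta \\
&= \log p(y) - \int p(\theta) \log p(y \mid \theta) d\theta .
\end{aligned}
\]

So the divergence is large exactly when draws from the prior have, on average, a much lower log-likelihood than the log marginal likelihood.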
In the extreme case where the prior $p(\theta)$ is multimodal, say a mixture of two distinct Gaussians, and the likelihood is nearly constant, ABC is completely fine, as the effective sample size is almost 100%, whereas MCMC will suffer from slow mixing. On the other hand, this also suggests how sensitive ABC is to the choice of prior.
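A minimal sketch of this extreme case, viewed as importance sampling from the prior. The specific prior (an equal mixture of $N(-5, 1)$ and $N(5, 1)$), the nearly flat Gaussian likelihood with scale 1000, and the observed value are all hypothetical choices for illustration, not taken from the example that follows.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_draws = 10_000

# Hypothetical bimodal prior: equal mixture of N(-5, 1) and N(5, 1).
component = rng.integers(0, 2, size=n_draws)
theta = rng.normal(loc=np.where(component == 0, -5.0, 5.0), scale=1.0)

# Hypothetical nearly constant likelihood: y_obs under N(theta, sigma^2)
# with a very large sigma, so it barely changes across the prior mass.
y_obs, sigma = 0.0, 1000.0
log_w = -0.5 * ((y_obs - theta) / sigma) ** 2

# Importance weights (prior as proposal) and Kish effective sample size,
# reported as a fraction of the number of prior draws.
w = np.exp(log_w - log_w.max())
ess_fraction = w.sum() ** 2 / (n_draws * (w ** 2).sum())
print(f"ESS fraction: {ess_fraction:.4f}")
```

Both prior modes are visited by construction, and the weights are almost uniform, so nothing needs to travel between modes.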
Let me give a less toy example. Indeed it is my favorite example on bimodality.
Suppose we generate 40% of the data, $y_1,\dots, y_{20}$, from $\mathrm{Cauchy} (-10, 1)$, and the remaining 60%, $y_{21}, \dots, y_{50}$, from $\mathrm{Cauchy} (10, 1)$. Now consider a simple model:
\[y \mid \mu \sim \mathrm{Cauchy} (\mu, 1).\]
We can compute the unnormalized true posterior density directly. See the right panel below. It is bimodal, with modes centered at $-10$ and $10$. Notably, the right mode is much higher than the left mode. We could compute the likelihood ratio $p(\mu=10 \mid y)/ p(\mu=-10 \mid y) \approx \exp(62)$.
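The posterior comparison can be sketched numerically. This is an illustration under stated assumptions, not the code behind the figure: it assumes a flat prior on $\mu$, and it replaces the random draws with evenly spaced Cauchy quantiles (the hypothetical helper `cauchy_quantiles`) so the result is reproducible. The log ratio it prints therefore will not match the $\exp(62)$ above, which depends on the particular simulated data set.

```python
import numpy as np

def cauchy_quantiles(n):
    """Evenly spaced quantiles of the standard Cauchy (a stand-in for random draws)."""
    u = (np.arange(n) + 0.5) / n
    return np.tan(np.pi * (u - 0.5))

# 20 points located at -10 and 30 points located at 10, both with scale 1.
y = np.concatenate([-10.0 + cauchy_quantiles(20), 10.0 + cauchy_quantiles(30)])

def log_lik(mu, y):
    """Log-likelihood of the Cauchy(mu, 1) model, up to an additive constant."""
    mu = np.atleast_1d(mu)
    return -np.log1p((y[None, :] - mu[:, None]) ** 2).sum(axis=1)

# Unnormalized log posterior on a grid (flat prior assumed).
grid = np.linspace(-20.0, 20.0, 4001)
log_post = log_lik(grid, y)

# Log of p(mu = 10 | y) / p(mu = -10 | y).
log_ratio = log_lik(10.0, y)[0] - log_lik(-10.0, y)[0]
print(f"highest mode near mu = {grid[np.argmax(log_post)]:.2f}")
print(f"log ratio of right to left mode: {log_ratio:.1f}")
```

Working on the log scale matters here: the ratio between the two modes is far too large to compare reliably after exponentiating and normalizing.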
This makes sense as a consequence of the Bernstein–von Mises theorem under model misspecification. But it is also not the optimal projection we can get in the model space. Indeed, the true data generating mechanism corresponds to a mixture of two delta functions: $\mu= Z \delta(-10) + (1-Z)\delta(10)$ where $Z\sim \mathrm{Ber}(0.4)$. The left and right modes contain 40% and 60% of the mass: $p(\mu=-10 \mid y)/ p(\mu=10 \mid y) = 2/3$.
As for the sampling scheme, if I run a generic MCMC for an infinite amount of time, and I do mean infinite, then the sampler will converge^{1} to the true posterior density, which is bimodal and highly skewed. But if we run stacking, we will end up with the true data generating mechanism.
Now it is time for ABC. In standard ABC, we generate $\mu_s$ from the prior distribution, generate $y^{sim}$ from $\mathrm{Cauchy} (\mu_s, 1)$, and accept when \(\vert\vert y^{sim} - y \vert\vert_2 \leq \epsilon.\)
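A minimal sketch of this rejection scheme. The function name `abc_rejection`, the uniform prior on $[-20, 20]$, and the placeholder data in the usage lines are hypothetical; the post does not specify them.

```python
import numpy as np

def abc_rejection(y, n_draws, eps, rng, prior_low=-20.0, prior_high=20.0):
    """Rejection ABC for the Cauchy(mu, 1) model.

    Returns the accepted prior draws and the distances of all draws.
    The uniform prior on [prior_low, prior_high] is an assumption.
    """
    y = np.asarray(y, dtype=float)
    # Step 1: draw mu_s from the prior.
    mu = rng.uniform(prior_low, prior_high, size=n_draws)
    # Step 2: simulate a full data set from Cauchy(mu_s, 1) for each draw.
    y_sim = mu[:, None] + rng.standard_cauchy((n_draws, y.size))
    # Step 3: accept when the Euclidean distance to the data is within eps.
    dist = np.linalg.norm(y_sim - y[None, :], axis=1)
    return mu[dist <= eps], dist

# Usage with placeholder data (not the simulated data set from the post).
rng = np.random.default_rng(1)
y_obs = np.concatenate([np.full(20, -10.0), np.full(30, 10.0)])
accepted, dist = abc_rejection(y_obs, n_draws=1000, eps=np.inf, rng=rng)
print(f"smallest distance over 1000 draws: {dist.min():.1f}")
```

Returning all distances alongside the accepted draws makes it easy to see how small a tolerance is actually reachable before choosing $\epsilon$.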
The smallest achievable $\epsilon$ corresponds to $\theta=10$. It is in fact infinite in expectation, because the Cauchy distribution does not have a finite mean. Nevertheless, importance-weighted ABC is still valid.
Since the smallest $\epsilon$ is much larger than 0, the consistency result of ABC will not hold, no matter how many samples I draw.
Now it gets even worse if I reduce the data to a summary statistic, here the sample mean $\bar y =\sum_{i=1}^n y_i/n$, and accept a sample when \(\vert \bar{y}^{sim} - \bar y \vert \leq \epsilon.\)
Intuitively, $\bar{y}^{sim}$ is centered at $\theta$, so this criterion forces the approximate posterior distribution of $\theta$ to be close to $\bar y \approx 2$ and unimodal. The reasoning is a little sloppy, as the Cauchy distribution has no finite mean. But it is confirmed by $10^5$ simulations: I report the average value of the smallest achievable $\epsilon$ as a function of $\mu$, that is, $\vert \bar{y}^{sim} - \bar y\vert$ with $y^{sim} \sim \mathrm{Cauchy} (\mu, 1)$.
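A rough sketch of such a simulation, with two deliberate departures that should be read as assumptions: it reports the median of $\vert \bar{y}^{sim} - \bar y \vert$ rather than the average, because the average of a Cauchy-distributed quantity is unstable across runs, and it uses 5,000 simulations per grid point instead of $10^5$ to keep it quick. The value $\bar y = 2$ is the approximate sample mean quoted above, and the grid of $\mu$ values is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n, n_sim = 50, 5_000
y_bar = 2.0  # approximate observed sample mean quoted in the text

# Grid of candidate mu values: -15, -14, ..., 15.
mu_grid = np.linspace(-15.0, 15.0, 31)
median_dist = np.empty_like(mu_grid)

for j, mu in enumerate(mu_grid):
    # Sample means of n_sim simulated data sets from Cauchy(mu, 1).
    y_sim_bar = (mu + rng.standard_cauchy((n_sim, n))).mean(axis=1)
    # Typical (median) distance between simulated and observed sample means.
    median_dist[j] = np.median(np.abs(y_sim_bar - y_bar))

best_mu = mu_grid[np.argmin(median_dist)]
print(f"mu with the smallest typical distance: {best_mu:.0f}")
print(f"typical distance at mu = -10: {median_dist[mu_grid == -10.0][0]:.1f}")
```

The sample mean of $n$ independent $\mathrm{Cauchy}(\mu, 1)$ draws is itself $\mathrm{Cauchy}(\mu, 1)$, so averaging 50 points does not concentrate $\bar{y}^{sim}$ at all. That is why the typical distance stays well away from zero even at the best $\mu$, and why it favors $\mu$ near $\bar y$ rather than either mode.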
It is quite bad. First, $\epsilon$ near 0 is still never achievable, so the consistency result is never relevant. The left mode, centered at $-10$, will be missed entirely unless $\epsilon$ is extremely loose, which defeats the purpose of ABC in the first place. And even the right mode is wrong: it is left-biased even with the smallest tolerance.
So ABC converges, whatever that means, to neither the true posterior density nor the true data generating mechanism.
Ummm… I do not know. Both multimodality and misspecification seem to be large concerns for ABC, and there should be more discussion. I searched the literature and saw that Frazier, Robert and Rousseau (2018) presented an ABC framework that addresses misspecification and utilizes the computation diagnostics for model evaluation, which is not always desired.
In conclusion, once the model is written down, the inference is just math, and this is why computation/inference can be separated from model evaluation in MCMC. For ABC, this is not the case, and I suspect we have no idea what it will converge to when the model is misspecified, which should be alarming given that ABC is nowadays treated as a black-box computation.

^{1} Admittedly, I use the word "converge" quite loosely. I think I should only refer to convergence of expectations whenever I use the word.