Optimal Transport and Importance Sampling

Posted by Yuling Yao on May 22, 2019. Tags: ad-hoc, computation

Most statistician-oriented introductions to optimal transport theory start with the motivation: we can always sample from a one-dimensional distribution $f(\theta)$ through the inverse-CDF transformation

\[\theta= F^{-1} (u), u\sim U(0,1).\]
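As a toy illustration (the normal target and the variable names below are my own choices, purely for demonstration), a few lines of Python reproduce this inverse-CDF construction:

```python
# A minimal sketch of inverse-CDF sampling: transform U(0,1) draws into draws
# from a one-dimensional target (here a standard normal, an arbitrary toy choice).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
u = rng.uniform(size=10_000)          # u ~ U(0, 1)
theta = stats.norm.ppf(u)             # theta = F^{-1}(u), F the normal CDF

print(theta.mean(), theta.std())      # close to 0 and 1
```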

More generally, for any two multidimensional densities $g(u)$ and $f(\theta)$, a transport map

\[u = T(\theta)\]

can be obtained from the change-of-variables formula

\[g(u)= f\bigl( T^{-1}(u) \bigr)\, \Bigl\vert \det \frac{\partial T^{-1}(u)}{\partial u} \Bigr\vert.\]

After solving this differential equation, we obtain a $T$ that transports the distribution of $\theta$ to that of $u$. Typically $T$ is not unique, so optimal transport theory imposes an additional optimality condition (minimizing an expected transport cost, which gives rise to the Wasserstein distance). I can then sample $\theta$ from $f$, set $u=T(\theta)$, and use Monte Carlo integration to compute any expectation under $g$:

\[\int a(u) d g(u)= \int a( T (\theta) ) d f(\theta).\]
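Here is a minimal sketch with two toy Gaussians of my own choosing, $f=N(0,1)$ and $g=N(2, 0.5^2)$, for which the monotone (hence optimal in one dimension) transport map is available in closed form, and a toy test function $a(u)=u^2$:

```python
# Transport-based Monte Carlo with toy Gaussians: f = N(0, 1), g = N(2, 0.5^2).
# In one dimension the monotone map T(theta) = 2 + 0.5 * theta pushes f forward to g.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, size=100_000)   # theta ~ f
u = 2.0 + 0.5 * theta                        # u = T(theta) ~ g

a = lambda t: t**2                           # toy test function
print(a(u).mean())                           # estimates E_g[a(u)]; exact value is 4.25
```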

But importance sampling is also natural

If $f$ and $g$ have the same support, any expectation under $g$ can also be computed from samples from $f$ using the importance weight $g/f$. It is just a fundamentally different approach:

\[\int a(u) dg(u)= \int a(u) \frac{g}{f} (u) d f(u).\]
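The same toy expectation as above, now computed by weighting draws from $f$ by $g/f$ (again, the Gaussian choices are only for illustration):

```python
# Importance sampling: draw from f = N(0, 1), weight by g/f with g = N(2, 0.5^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=100_000)                           # x ~ f
w = stats.norm.pdf(x, 2.0, 0.5) / stats.norm.pdf(x, 0.0, 1.0)    # weights g/f

a = lambda t: t**2
print(np.sum(w * a(x)) / np.sum(w))    # self-normalized estimate of E_g[a]; exact 4.25
```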

Importance sampling, or rejection sampling, also defines a random transport $T$: $T(\theta)= \theta$ with probability $\propto g/f(\theta)$, and the draw is discarded otherwise. To the best of my knowledge, this analogous relation is not studied in optimal transport theory, where $T$ is not assumed to be random, or at least not to lose mass.
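A minimal sketch of this random-transport reading, using the same toy pair of Gaussians; the envelope constant on $g/f$ is estimated empirically here, so the accepted draws are only approximately from $g$:

```python
# Rejection sampling read as a "random transport": keep theta with probability
# proportional to g/f, discard otherwise. Toy pair f = N(0, 1), g = N(2, 0.5^2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
theta = rng.normal(0.0, 1.0, size=200_000)                               # theta ~ f
ratio = stats.norm.pdf(theta, 2.0, 0.5) / stats.norm.pdf(theta, 0.0, 1.0)
M = ratio.max()                              # empirical stand-in for sup g/f
accept = rng.uniform(size=theta.size) < ratio / M
u = theta[accept]                            # accepted draws are (approximately) from g

print(u.mean(), u.std())                     # should be close to 2 and 0.5
```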

Manipulation of the density or the sample

I could imagine statistics would be a happier field if the following problem did not exist.

  • There is no one-to-one map between transformations of the sample and transformations of the density.

That is, we have a map $S$ between the densities, $f=S(g)$, and a map $T$ between the samples, $u=T(\theta)$. Under some special conditions there is a global correspondence between $T$ and $S$: for example, when $X$ are iid Gaussians with density $f(X)$, the density operation $S(f)=f^k$ (up to normalization) corresponds to rescaling the samples, $X \Sigma$, where $\Sigma$ is a rescaling matrix. But in general there is no such correspondence.
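Here is a small numerical check of that Gaussian special case, with arbitrary toy values for the mean, scale, and power $k$:

```python
# The Gaussian special case: the density operation S(f) = f^k (up to normalization)
# corresponds to rescaling the samples around their mean. mu, sigma, k are toy values.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, k = 1.0, 2.0, 4
x = rng.normal(mu, sigma, size=100_000)      # x ~ f = N(mu, sigma^2)
x_k = mu + (x - mu) / np.sqrt(k)             # rescaled samples

# f^k, normalized, is N(mu, sigma^2 / k); the rescaled samples match its scale.
print(x_k.mean(), x_k.std(), sigma / np.sqrt(k))
```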

For example, it is extremely awkward to write down the density of the sum of two random variables, $T(X, Y)=X+Y$, as a convolution, even when they are independent. On the other hand, there is no distribution-free function $T$ such that $T(u, \theta)$ is distributed according to the product density $\propto f g$.
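A sketch of this asymmetry, with toy normal and gamma densities of my own choosing: the sample of $X+Y$ takes one line while its density needs a numerical convolution, whereas the product density $\propto fg$ is trivial on a grid but has no generic sample-level counterpart:

```python
# Sum versus product, samples versus densities (toy normal and gamma choices).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=100_000)       # X ~ N(0, 1)
y = rng.gamma(2.0, 1.0, size=100_000)        # Y ~ Gamma(2, 1)
s = x + y                                    # sample of X + Y: one line

grid = np.linspace(-6.0, 12.0, 1801)
dx = grid[1] - grid[0]
f = stats.norm.pdf(grid, 0.0, 1.0)
g = stats.gamma.pdf(grid, 2.0)
conv = np.convolve(f, g) * dx                # density of X + Y: a full numerical convolution
                                             # (it lives on the grid of pairwise sums)
prod = f * g / np.sum(f * g * dx)            # product density ~ f*g: trivial on the grid,
                                             # with no generic sample operation producing it
```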

Depending on the standpoint and the application, some researchers might find the manipulation of samples easier, so they will do optimal transport and solve that differential equation. Others might find it more natural to manipulate densities, in which case importance sampling is more direct.

Evolution of methods

When we can recover such a correspondence under some conditions, we get a specialized algorithm. For example, since $S(f)=f^k$ corresponds to a rescaling $X \Sigma^{-1}$ under a Gaussian approximation, we can reweight to adjust for subsampling,

\[f(\theta \mid y_1, y_2 ) \approx f(\theta \mid y_1 ) \times f( \theta \mid y_2) ,\]

with fractional priors, which is essentially the Consensus Monte Carlo algorithm: rescale and recombine $\theta$ with some weight matrices $\Sigma$.
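A one-dimensional sketch of this consensus-style recombination, assuming (only for illustration) that the two subposteriors are exactly Gaussian with made-up means and scales; in that case the precision-weighted average of the subposterior draws recovers the product density:

```python
# Consensus-style recombination in 1-D with toy Gaussian subposteriors.
# The precision weights play the role of the Sigma rescaling.
import numpy as np

rng = np.random.default_rng(6)
m1, s1 = 0.5, 0.30                           # subposterior f(theta | y1) ~ N(m1, s1^2)
m2, s2 = 1.0, 0.40                           # subposterior f(theta | y2) ~ N(m2, s2^2)
t1 = rng.normal(m1, s1, size=50_000)
t2 = rng.normal(m2, s2, size=50_000)

w1, w2 = 1 / s1**2, 1 / s2**2                # precision weights
theta = (w1 * t1 + w2 * t2) / (w1 + w2)      # combined draws

# The Gaussian product density has exactly this mean and standard deviation:
print(theta.mean(), (w1 * m1 + w2 * m2) / (w1 + w2))
print(theta.std(), np.sqrt(1 / (w1 + w2)))
```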

Alternatively, we can manipulate density functions, for example using normalizing flows and importance sampling, which leads to embarrassingly parallel MCMC using deep invertible transformations.

Another distribution-free correspondence: the mixture density $pf+(1-p)g$ corresponds to the sample $Z\theta+(1-Z)u$, where $Z \sim \mathrm{Bernoulli}(p)$ is the mixture assignment; this is basically what Bayesian stacking utilizes.
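A minimal sketch of this mixture correspondence, with toy choices $f=N(0,1)$, $g=N(2,0.5^2)$, and $p=0.3$:

```python
# The mixture correspondence: density p*f + (1-p)*g pairs with the sample
# Z*theta + (1-Z)*u, Z ~ Bernoulli(p). Toy choices throughout.
import numpy as np

rng = np.random.default_rng(7)
p, n = 0.3, 100_000
theta = rng.normal(0.0, 1.0, size=n)         # theta ~ f
u = rng.normal(2.0, 0.5, size=n)             # u ~ g
z = rng.uniform(size=n) < p                  # Z ~ Bernoulli(p)
mix = np.where(z, theta, u)                  # Z*theta + (1-Z)*u

print(mix.mean(), p * 0.0 + (1 - p) * 2.0)   # mixture mean matches p*E_f + (1-p)*E_g
```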

RKHS (reproducing kernel Hilbert space) embeddings are another special case where these two maps are connected one-to-one, so maybe there could be a kernelized stacking that extends the linear mixture.