Does the soft constraint converge to the rigid constraint?

Posted by Yuling Yao on Dec 01, 2018.       Tag: modeling   computation  

tl;dr: No.

I have often seen soft constraints recommended. The main reason is that a Bayesian will rarely prefer a point mass. Apart from that, I am now curious whether the current implementations of these two approaches in Stan are essentially equivalent in the limiting case.

Let me frame the problem more rigorously. Consider the parameter space $\theta \in \Theta = R^d$, and a transformed parameter \(z=\xi(\theta),\quad z \in R^m,\) where $\xi : R^d \to R^m$, $d>m$, is a smooth function. We also need some regularity condition on $\xi$ to make it full rank. Denote by $\nabla \xi$ the $d\times m$ gradient matrix and by $G$ the Gram matrix $G=\nabla \xi ^T\nabla \xi$; the condition is \(\mathrm{rank} (\nabla \xi)=m.\) When there is no constraint at all, the posterior distribution $\nu(\theta)$ is proportional to $\exp(-V(\theta))$.
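For concreteness, here is a toy example (my own, reused below): take $d=2$, $m=1$, and $\xi(\theta)=\theta_1+\theta_2$. Then
\[\nabla \xi = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad G = \nabla \xi^T \nabla \xi = 2, \qquad \mathrm{rank}(\nabla \xi)=1=m.\]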

So far so good, until I start writing

transformed parameters {
  vector[m] z = xi(theta);
}
model {
  z ~ distribution(...);
}

Rigid constraint

Suppose in some situation we need a constraint on $\xi(\theta)$. A non-Bayesian wants $\xi(\theta) \approx z_0$, $z_0 \in R^m$. $z$ is sometimes termed a reaction coordinate in physics.

First, with the rigid constraint, we need to sample from the submanifold \(\Sigma(z_0)= \{\theta\in\Theta \mid \xi(\theta) =z_0 \}\) under the conditional distribution \(\pi^{\xi}(d\theta \mid z_0 ) \propto \exp(-V(\theta))\,\delta_{\xi(\theta)-z_0}\, d\theta.\) That is the probability measure of $\nu$ conditioned on a fixed value $z_0$ of the reaction coordinate.

To clarify, we now also have the surface measure on $\Sigma(z_0)$: $d\sigma_\Sigma$. It is the Lebesgue measure on $\Sigma(z_0)$ induced by the Euclidean space. The co-area formula gives

\[\delta_{\xi(\theta)-z_0}\, d\theta= (\det G)^{-1/2}\, d\sigma_\Sigma.\]
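For the toy example $\xi(\theta)=\theta_1+\theta_2$, this is just the familiar $\sqrt{2}$ factor: for any test function $f$,
\[\int f(\theta)\,\delta_{\xi(\theta)-z_0}\, d\theta = \int f(\theta_1, z_0-\theta_1)\, d\theta_1 = \frac{1}{\sqrt{2}} \int_{\Sigma(z_0)} f \, d\sigma_\Sigma = (\det G)^{-1/2} \int_{\Sigma(z_0)} f\, d\sigma_\Sigma.\]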

Later we will see that the co-area formula is intrinsic, and it is not equivalent to the change-of-variables formula. The latter reads $dy = \left|\det \frac{\partial y}{\partial x}\right| dx$.

The first question is: which measure should we sample from? If you think it is a trivial question, think about the transformed parameters that do not have an explicit prior in the model block: they enter the log density as if they had a uniform prior, but why do we never add a Jacobian for them?

Short answer: here we are NOT doing a variable transformation at all; we always stay in the $\theta$ space, but we do have the freedom to choose whether to stand in the Euclidean space or on the surface. Of course, the typical choice is the conditional distribution $\pi^{\xi}(d\theta \mid z_0 )$.

Sampling from a constrained dynamics is not supported in Stan, unless it is a simple case that we can easily reparametrize (e.g., simplex, sum-to-zero), which is equivalent to explicitly solving the constraint $\xi(\theta)= z_0$ for $\theta$. In general this can be done by adding a Lagrange multiplier $\lambda$ to the Hamiltonian system (e.g., Hartmann and Schütte, 2005). The main idea is to modify the potential as $V^{constraint}(\theta) = V(\theta)+ \lambda^T (\xi(\theta)- z_0)$. Due to the very first reason I mentioned in the beginning, we probably do not have strong motivation to implement it.
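Schematically (my paraphrase of the standard construction, see Stoltz and Rousset 2010), the constrained dynamics keeps the usual Hamiltonian but adds a multiplier term and enforces the constraint along the trajectory:
\[H(\theta, p) = V(\theta) + \tfrac{1}{2} p^T M^{-1} p, \qquad
\dot\theta = M^{-1} p,\quad \dot p = -\nabla V(\theta) - \nabla\xi(\theta)\,\lambda,\quad \xi(\theta)=z_0,\]
where $\lambda \in R^m$ is determined at each time step so that the trajectory stays on $\Sigma(z_0)$ (this is what SHAKE/RATTLE-type integrators do).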

Soft constraint

Another strategy is to put a soft constraint. For example, we put a strong prior on $\xi(\theta)$: $\xi(\theta) \sim N (z_0, \tau ^2 \mathrm{Id}_{m\times m} )$, that is, target += $-\frac{1}{2\tau^2} \|\xi(\theta)-z_0\|^2$.
In the limiting case $\tau\to 0$, the marginal density of $\xi(\theta)$ (the image of the measure $\nu$ under $\xi$) is guaranteed to converge to a Dirac measure at $z_0$.

OK, but having only target += $-\frac{1}{2\tau^2} \|\xi(\theta)-z_0\|^2$ is wrong, because we have not taken the variable transformation into account.

Intuitively, no matter how small $\tau$ is, the dynamics still sees the variation around the surface $\Sigma$. Indeed, in the limiting case $\tau \to 0$, it converges to \(\exp(- V(\theta) )\, d\sigma_\Sigma,\) which can also be rewritten as \(\exp\bigl(- V(\theta) + \tfrac{1}{2} \log \det G \bigr)\, \delta_{\xi(\theta)-z_0}\, d\theta,\) where $G$ is the Gram matrix defined earlier (for a proof see, e.g., Ciccotti et al. 2008).

The term $\frac{1}{2} \log \det G$ is known as the Fixman correction (Fixman 1974).
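To make the adjustment concrete, here is a minimal Stan sketch (my own, not an official recipe), with a nonlinear $\xi(\theta)=\theta_1^2+\theta_2^2$ so that $\det G$ actually depends on $\theta$; for a general $\xi$ you would have to code $\nabla\xi$ yourself. The adjustment enters here as $-\tfrac{1}{2}\log\det G$, which cancels the extra $\tfrac{1}{2}\log\det G$ that the $\tau\to 0$ limit above would otherwise pick up, so the limit is the conditional distribution $\pi^{\xi}(d\theta \mid z_0)$.

data {
  real z_0;
  real<lower=0> tau;
}
parameters {
  vector[2] theta;
}
transformed parameters {
  real z = dot_self(theta);   // xi(theta) = theta_1^2 + theta_2^2
}
model {
  theta ~ normal(0, 1);       // the base density exp(-V(theta))
  z ~ normal(z_0, tau);       // soft constraint, tau small
  // grad xi = 2 * theta, so G = 4 * (theta_1^2 + theta_2^2) = 4 * z.
  // Cancel the extra 1/2 log det G that appears in the tau -> 0 limit.
  target += -0.5 * log(4 * z);
}

If instead you want the surface measure $\exp(-V)\, d\sigma_\Sigma$ in the limit, no adjustment is needed; this is exactly the freedom to stand in the Euclidean space or on the surface mentioned above.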

Adding a Jacobian?

From my understanding, Stan recommends adding the log absolute determinant of the Jacobian to the log density whenever we put a model on the transformed parameters $z$.

It is somewhat vague what that means. To most users, Jacobian means the derivative. Sure, when $\xi$ is a bijection and $\nabla \xi(\theta)$ is a square matrix, \(\tfrac{1}{2} \log \det G = \log |\det \nabla \xi |\) is an exact identity, since $\det G = \det(\nabla \xi^T \nabla \xi) = (\det \nabla \xi)^2$. However, such a bijection is impossible in reality, because it would mean we put a soft constraint on all parameters, and in the limiting case the joint posterior degenerates to a Dirac measure purely due to the prior.

Other than that, $\xi$ is a map from $R^d$ to $R^m$ with $d>m$, if not much larger. From the old post 1 and old post 2:

  1. If you have a one-to-many transformation then you do the above but sum over all of the possible values of the transform inverse. This behaves well except in cases where there are infinite number of values, for example when you try to map a circle into the real line by transforming (0, 2pi) to (-\infty, \infty), in which case the transform becomes ill-posed.
  2. If you have a many-to-one transformation then you have to be very careful to ensure that you can self-consistently define a probability distribution on the transformed space. There is no “implicit distribution theorem” – the only way to do this is to “buffer” the output with additional variables so that you can define a one-to-one transformation, solve for the appropriate density on the buffered space, and then marginalize out the buffer variables. This is, for example, how one goes from a 2D Gaussian to a Rayleigh distribution or a 3D to a Maxwell, or the N simplex to the N-1 dimensional unconstrained real space we use in Stan (as @Bob_Carpenter notes below) . Typically this approach is limited to cases where the marginalization can be done analytically so it’s hard to apply to complex problems like this.

That is, augment the transformed parameters to $z^{*}=(z, z^{aug})$, so as to make the map $\theta \to z^{*}$ one-to-one. We then calculate the Jacobian and marginalize out the auxiliary variables.

The augmented Jacobian reads
\[J^{aug}=\begin{bmatrix} \nabla \xi(\theta)_{1:m} & 0 \\ \nabla \xi(\theta)_{m+1:d} & I_{(d-m)\times (d-m)} \end{bmatrix}.\]
You should be convinced that $\log |\det J^{aug}| \neq \tfrac{1}{2} \log \det G$ in general, no matter how similar they look.
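A quick check with the toy example: for $\xi(\theta)=\theta_1+\theta_2$, augmenting with $z^{aug}=\theta_2$ gives
\[J^{aug}=\begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad \log |\det J^{aug}| = 0, \qquad \text{while} \quad \tfrac{1}{2}\log \det G = \tfrac{1}{2}\log 2.\]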

If we are talking about variable transformation here, I totally agree. It is like the log-normal or inverse-gamma: we put a prior on the transformed parameter, as if we generate that transformed parameter (e.g., a normal), and then transform back to the original space (a log-normal).

Generative vs Embedding, or change-of-variable vs conditioning

They are different.

The Stan manual writes:

A transformation samples a parameter, then transforms it, whereas a change of variables transforms a parameter, then samples it. Only the latter requires a Jacobian adjustment.

Agreed: we need a Jacobian when we know the generative distribution in the transformed-parameter space and we want to transform back to the original space. In principle, we do not even need any extra prior on $\theta$.
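For example, a minimal Stan sketch of this change-of-variables case (a log-normal prior written on the log scale; mu and sigma are hypothetical hyperparameters):

data {
  real mu;
  real<lower=0> sigma;
}
parameters {
  real<lower=0> theta;
}
model {
  // generative model on the transformed scale: z = log(theta) ~ normal(mu, sigma)
  log(theta) ~ normal(mu, sigma);
  // Jacobian adjustment: log |d log(theta) / d theta| = -log(theta)
  target += -log(theta);
}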

But there is another situation, which I will call embedding. We know the distribution $\nu$ in the $\theta$ space, but we want to find its embedded/conditional distribution on the submanifold $\Sigma(z_0)$ induced by the transformed parameters. The section above shows that in order to keep invariance, we need an adjustment of $\tfrac{1}{2} \log \det G$, not a Jacobian. Notice that we do not perform a change of variables.

Prior-Prior conflict?

I am not sure I have made this clarification clear enough. Though conceptually easy, the confusion between the Gram matrix and the Jacobian seems common, even among molecular-simulation experts (see Otter 2013 and Wong and York 2012 for a very similar argument).

I have not discussed the one-to-many transformation, i.e., $m>d$. In the limiting case it is over-identified. But there is nothing preventing me from defining a regularization on all of $\theta$, $\theta^2$, ..., $\theta^m \sim N(0,\alpha^m)$. In that case I know the distribution of $\theta$, and I am loosely regularizing all the moments so they are not too wild. I do not even ask for consistency, as $\tau$ is fixed and not close to 0. I think it is fine to leave all the constraints as they are, without further adjustment. This echoes what I pointed out earlier: we do have the freedom to choose to stand in the Euclidean space or on the surface.
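A minimal sketch of what I mean (alpha is a hypothetical scale; the missing adjustments are intentional, since these are only loose regularizations):

data {
  real<lower=0> alpha;
}
parameters {
  real theta;
}
model {
  // loose soft constraints on the first few powers of theta, left unadjusted on purpose
  theta ~ normal(0, alpha);
  square(theta) ~ normal(0, square(alpha));
  pow(theta, 3) ~ normal(0, pow(alpha, 3));
}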

We may also wonder: is it a prior-prior conflict if we define a prior on the $\theta$ space and define another prior/embedding on the transformed parameters? Typically it is hard to detect, as the sum of multiple convex functions is still convex. But if $m>d$, the submanifold can be empty in the limit.

Conclusion

  1. For a variable transformation (e.g., constrained parameters, simplex, ...) and a generative prior defined on the transformed parameters (e.g., log-normal, inverse-Cauchy), the Jacobian is always correct, and it is the ONLY correct option. There is a natural bijection, so the Jacobian is well-defined.
  2. The purpose of embedding is to put a soft constraint on some reaction coordinate, and it does not change variables. In that case, in order to keep asymptotic consistency, the log density should be adjusted by $\tfrac{1}{2} \log \det G$, not a Jacobian.
  3. A hard constraint is another less-recommended but possible option for general transformations.
  4. When the soft constraint itself is loose (e.g., z ~ normal(0, 2)), it only represents some approximate regularization, and lacking the $\tfrac{1}{2} \log \det G$ adjustment should also be reasonable.

Some questions I am still not clear about: the embedding interpretation and importance sampling; and when we should include either adjustment in optimization for the MAP estimate.

Main reference: Stoltz, G., & Rousset, M. (2010). Free energy computations: A mathematical perspective. World Scientific.

Edit: I thank Bob Carpenter for catching typos. I rely on automatic spell-checking for typo corrections, which clearly does not adapt to a multimodal context.


Comments are disabled for this post. Feedback via email is more than welcome.