Terrace and gradient

Posted by Yuling Yao on Oct 05, 2021. Tag: computation

I came across the paper “The Adaptive Biasing Force Method: Everything You Always Wanted To Know but Were Afraid To Ask” by Jeffrey Comer et al. Comparing the adaptive biasing force method (a gradient-based method) with importance-sampling-based methods (zero-order methods), the authors conclude:

From a mathematical viewpoint, the adaptive biasing force method, just like adaptive biasing potential methods, is an adaptive importance-sampling procedure. There is, however, a salient difference between these two techniques. In the latter, the potential of mean force or, equivalently, the corresponding probability distribution along the transition coordinate is being adapted. In contrast, the former relies on biasing the force, i.e., the gradient of the potential. This difference is more important than it might appear at first sight, as potentials and probability distributions are global properties whereas gradients are defined locally. In terms of probability distributions, it means that the count of samples in the neighborhood of a given value of the transition coordinate is insufficient to estimate probability. Knowledge of the underlying probability distribution over a much broader range of $\xi$ is required. This may considerably impede efficient adaptation. In contrast, all that is needed to estimate the gradient is the knowledge of local behavior of the potential of mean force. Other regions along the transition coordinate do not have to be visited. Thus, in many instances, adaptation proceeds markedly faster. Using a common metaphor, the difference between the adaptive biasing potential and adaptive biasing force methods can be compared to inundating the valleys of the free-energy landscape as opposed to plowing over its barriers to yield an approximately flat terrain, conducive to unhampered diffusion.
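In symbols, the contrast in this passage can be sketched roughly as follows. The notation here is my own shorthand, not necessarily the paper's: $\rho(\xi)$ is the marginal density along the transition coordinate, $A(\xi)$ the potential of mean force, $F_\xi$ the instantaneous force along $\xi$, and $C$ an arbitrary constant.

$$
A(\xi) = -k_B T \,\ln \rho(\xi) + C,
\qquad
\frac{dA}{d\xi}(\xi) = -\,\big\langle F_\xi \big\rangle_{\xi}.
$$

Estimating $\rho(\xi)$ requires normalizing the sample counts over the whole range of $\xi$, so it is a global quantity, whereas the right-hand side of the second equation is a conditional average taken only over samples at (or near) the given value of $\xi$.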

I like the plowing metaphor. I found a photo of Rice Terraces in Yunnan:

which is in contrast to:

Beyond free energy computation, the same reasoning behind this metaphor suggests that a gradient-based method is often an alternative, dual approach to a zero-order method:

In survival analysis, the Nelson–Aalen estimator is sort of the gradient version of the Kaplan–Meier estimator (product limit).
In optimization, finding the minimum of a differentiable convex function is equivalent to finding the minimum of the abs(gradient) function, which is zero there.
In cross-validation, the jackknife is the gradient-alternative to importance sampling.
In optimization convergence tests, we can either monitor whether the objective is stable, or whether the gradient becomes zero.
In MCMC convergence tests, we can either monitor whether the sample draws have mixed, or whether the gradient of the log density has mean zero.
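To make the survival-analysis item concrete, here is a minimal sketch of the two estimators side by side. The function name `km_and_na` and the toy data are made up for illustration. Both estimators are built from the same local ingredient, the hazard increment d_i / n_i (events over number at risk) at each event time: Nelson–Aalen sums these increments into a cumulative hazard, while Kaplan–Meier multiplies (1 - d_i / n_i) into a survival curve.

```python
import numpy as np

def km_and_na(times, events):
    """Return distinct event times, the Kaplan-Meier survival estimate,
    and the Nelson-Aalen cumulative hazard estimate at those times.

    times  : observed times (event or censoring)
    events : 1 if the event was observed, 0 if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    # n_i: number of subjects still at risk just before each event time
    at_risk = np.array([(times >= t).sum() for t in event_times])
    # d_i: number of observed events at each event time
    deaths = np.array([((times == t) & events).sum() for t in event_times])
    hazard_increments = deaths / at_risk
    # Kaplan-Meier: product limit of (1 - d_i / n_i)
    survival_km = np.cumprod(1.0 - hazard_increments)
    # Nelson-Aalen: running sum of d_i / n_i
    cumhaz_na = np.cumsum(hazard_increments)
    return event_times, survival_km, cumhaz_na

# Hypothetical toy data, for illustration only.
times = [2, 3, 3, 5, 6, 8, 9, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
t, S, H = km_and_na(times, events)

for ti, si, hi in zip(t, S, H):
    print(f"t={ti:5.1f}  KM S(t)={si:.4f}  exp(-NA H(t))={np.exp(-hi):.4f}")
```

Since 1 - x <= exp(-x), the survival curve implied by Nelson–Aalen, exp(-H(t)), is never below the Kaplan–Meier curve, and the two are close whenever each increment d_i / n_i is small.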

Should we compute more gradients?