Score matching, Bayesian predictions, tempering, and invariance

Posted by Yuling Yao on Aug 20, 2022.       Tag: prediction  

Score matching

Suppose that we observe a sequence of data $y=\{y_i \in R^m \mid 1\leq i \leq n\}$ coming independently from an unknown distribution $p_{true}$, and we would like to evaluate a forecast given by a probability density function $p(y)$. For example, we may use the logarithmic score $\sum_{i=1}^n \log p(y_i)$ to assess the forecast.
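As a concrete toy example (the data and the Gaussian forecast below are made up for illustration), the log score is just a sum of log predictive densities evaluated at the observations:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.5, scale=1.2, size=100)     # pretend these come from p_true

# Log score of the forecast p = N(0, 1): sum of log predictive densities at the data.
log_score = norm.logpdf(y, loc=0.0, scale=1.0).sum()
print(log_score)
```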

But what if the predictive pdf is only known up to a multiplicative constant? That is, we are only able to evaluate the unnormalized density $q(y) = p(y)/ c$. In typical tasks of parameter inference, model selection, and model averaging, we are given a set of unnormalized forecasts indexed by $\theta$: $\{q_\theta(\cdot) \mid \theta \in \Theta\}$, where each element $q_\theta(\cdot)$ is a non-negative function on $R^m$ whose normalizing constant $c(\theta) = \int_{R^m} q_\theta(y) d y$ is unknown.

Since the seminal work by Hyvarinen (2005), score matching has been a powerful tool for evaluating unnormalized predictions. The main idea is that the normalizing constant disappears when we look at the gradient of the log unnormalized density. The "gradient of log" of a pdf is known to statisticians as the score function. We measure the difference between the score functions of the true data-generating process $p_{true}$ and of the forecast $q_\theta$,

\[D(p_{true}, q_\theta) = \int_{R^m} \Vert \nabla_y \log p_{true}(y) - \nabla_y \log q_{\theta} (y) \Vert^2 p_{true}(y)\, dy,\]

hence the name score matching.

In practice, we do not know $p_{true}$; we only observe its samples $y_{1:n}$. A sample estimate of the divergence above is

\[H(y_{1:n}, q_{\theta}) =\frac{1}{n}\sum_{i=1}^n \left( 2\, \Delta_y \log q_{\theta} (y_i) + \Vert \nabla_y \log q_{\theta} (y_i) \Vert^2 \right).\]

In a larger universe of scoring rules, this $H(y_{1:n}, q_{\theta})$ is known as the Hyvarinen score. As the sample size $n \to \infty$, this sample estimate converges in the sense that $H(y_{1:n}, q_{\theta}) \to D(p_{true}, q_\theta) + \mathrm{constant}$, where the constant depends on $p_{true}$ but not on $q_{\theta}$.
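For intuition, here is a minimal numpy sketch, using a made-up one-dimensional Gaussian example in which the gradient and Laplacian of the log unnormalized forecast are available analytically; note that no normalizing constant is needed anywhere:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=0.5, scale=1.2, size=5000)   # samples from p_true

def hyvarinen_score(y, mu, sigma):
    """Sample Hyvarinen score of the unnormalized forecast
    q(y) = exp(-(y - mu)^2 / (2 sigma^2)), using the analytic
    gradient d/dy log q = -(y - mu) / sigma^2 and Laplacian -1 / sigma^2."""
    grad = -(y - mu) / sigma**2
    lap = -1.0 / sigma**2
    return np.mean(2.0 * lap + grad**2)

print(hyvarinen_score(y, mu=0.5, sigma=1.2))   # forecast close to the truth
print(hyvarinen_score(y, mu=0.0, sigma=1.0))   # a worse forecast
```

The forecast that matches $p_{true}$ attains the smaller (better) score, reflecting the smaller divergence $D$.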

Unnormalized models in Bayesian statistics

There are three levels of unnormalized models in Bayesian statistics.

Level 1: A harmless normalization constant comes from the Bayes rule.

In classical parameter inference, the posterior density of a parameter is typically given in an unnormalized form: $p(\theta\vert y) \propto p(y\vert \theta) p(\theta)$, where the normalizing constant is the marginal likelihood $p(y) = \int p(y\vert \theta) p(\theta) d \theta$. For the purpose of Bayesian computation, this normalizing constant is irrelevant: MCMC, variational inference, and importance sampling all work with the unnormalized density. Notably, with posterior draws $\theta_1, \dots, \theta_S$, the posterior predictive distribution is tractable and appropriately normalized,

\[p(\tilde y \vert y) = \int p(\tilde y \vert \theta)p(\theta \vert y) d \theta \approx \frac{1}{S} \sum_{s=1}^S p(\tilde y \vert \theta_s).\]
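A minimal sketch with a conjugate normal-normal model (the prior, data, and number of draws are arbitrary choices for illustration): the Monte Carlo mixture of conditional densities is itself a properly normalized density in $\tilde y$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=1.0, size=50)      # observed data; sigma = 1 known

# Conjugate posterior for the mean theta under the prior N(0, 10^2):
post_var = 1.0 / (1.0 / 10.0**2 + len(y) / 1.0**2)
post_mean = post_var * y.sum() / 1.0**2
theta = rng.normal(post_mean, np.sqrt(post_var), size=4000)   # S = 4000 posterior draws

# Posterior predictive density on a grid, as a Monte Carlo average over draws:
y_tilde = np.linspace(-4.0, 6.0, 400)
p_pred = norm.pdf(y_tilde[:, None], loc=theta, scale=1.0).mean(axis=1)

# It is already normalized: the grid integral is close to 1.
print((p_pred * (y_tilde[1] - y_tilde[0])).sum())
```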

Level 2: Intractable posterior predictive distribution.

Sometimes we only know the posterior predictive density up to a constant. For example, in the modern literature on calibration, we may address the potential overconfidence of a prediction via tempering, such that

\[p(\tilde y\vert y, \lambda)= \frac{1}{z(\lambda)} p(\tilde y \vert y)^\lambda, ~ z(\lambda)= \int p(\tilde y \vert y)^\lambda d \tilde y.\]

Intuitively, a smaller $\lambda \in (0,1)$ flattens the prediction, resulting in a less confident forecast. The Hyvarinen score still applies: the score function of the tempered prediction is just $\lambda \nabla_{\tilde y} \log p(\tilde y \vert y)$, which does not involve $z(\lambda)$.
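Continuing the toy Gaussian example, and treating the posterior predictive as a single Gaussian purely for illustration, the Hyvarinen score of the tempered forecast can be computed directly from the unnormalized tempered density, without ever touching $z(\lambda)$:

```python
import numpy as np

rng = np.random.default_rng(3)
y_new = rng.normal(loc=0.5, scale=1.2, size=5000)   # held-out data from p_true

def hyvarinen_score_tempered(y, mu, sigma, lam):
    """Hyvarinen score of the tempered forecast q(y) = N(y | mu, sigma^2)^lam,
    evaluated without the normalizing constant z(lam):
    grad log q = -lam (y - mu) / sigma^2, Laplacian log q = -lam / sigma^2."""
    grad = -lam * (y - mu) / sigma**2
    lap = -lam / sigma**2
    return np.mean(2.0 * lap + grad**2)

for lam in (0.5, 1.0, 2.0):
    print(lam, hyvarinen_score_tempered(y_new, mu=0.5, sigma=1.2, lam=lam))
```

Because this base forecast already matches $p_{true}$, the untempered choice $\lambda = 1$ attains the smallest score here.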

Level 3: Intractable likelihood.

The likelihood itself may also be intractable, meaning we are only able to evaluate $q(y\mid \theta) \propto p(y\mid \theta)$, while the pointwise normalizing constant $z(\theta)= \int q(y\mid \theta) dy$ is unknown. These types of models are often called doubly intractable. For example, in the alpha-likelihood, the likelihood function is

\[p(y\vert \theta, \lambda) \propto p(y\vert \theta)^ \lambda, ~ z(\lambda, \theta)= \int p(y\vert \theta)^ \lambda d y.\]
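As a toy numerical illustration (a hypothetical one-dimensional model with base likelihood $N(0, \theta^2)$, chosen only so that the integral can still be done by quadrature), the normalizing constant changes with every $(\theta, \lambda)$ pair:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def q_alpha(y, theta, lam):
    """Unnormalized alpha-likelihood p(y | theta)^lam with base model N(0, theta^2)."""
    return norm.pdf(y, loc=0.0, scale=theta) ** lam

def z(theta, lam):
    """Normalizing constant z(lam, theta) = int p(y | theta)^lam dy.
    Here it is a one-dimensional quadrature; in genuinely doubly intractable
    models even this integral is unavailable."""
    return quad(lambda y: q_alpha(y, theta, lam), -np.inf, np.inf)[0]

# The constant changes with both theta and lam, so every posterior draw
# (theta_s, lam_s) would contribute a differently-scaled mixture component.
print(z(1.0, 0.5), z(2.0, 0.5), z(1.0, 2.0))
```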

Setting aside the question of how to sample from a doubly intractable model: even if we do obtain posterior draws $(\theta_1, \lambda_1), \dots, (\theta_S, \lambda_S)$, this time the posterior predictive distribution is **a mixture of unnormalized** densities:

\[p(\tilde y\vert y)= \int p(\tilde y\vert \theta, \lambda)\, p( \theta, \lambda \vert y)\, d\theta\, d\lambda \approx \frac{1}{S} \sum_{s=1}^S \frac{1}{z(\lambda_s, \theta_s)} p(\tilde y\vert \theta_s)^{\lambda_s}.\]

The Hyvarinen score does not apply to a mixture (that is, a summation) of unnormalized densities, because the score function is not invariant to rescaling the individual components: in general,

\[\nabla \log \left(\sum_{s=1}^S c_s f_s(y) \right) \neq \nabla \log \left(\sum_{s=1}^S f_s(y)\right).\]
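A quick numerical check of this non-invariance, using two arbitrary unnormalized components and a finite-difference gradient:

```python
import numpy as np

def f1(y):
    return np.exp(-0.5 * (y - 1.0) ** 2)        # unnormalized component 1

def f2(y):
    return np.exp(-0.5 * (y + 1.0) ** 2 / 4.0)  # unnormalized component 2

def grad_log_mix(y, c1, c2, eps=1e-5):
    """Central-difference gradient of log(c1 * f1 + c2 * f2) at y."""
    def g(t):
        return np.log(c1 * f1(t) + c2 * f2(t))
    return (g(y + eps) - g(y - eps)) / (2.0 * eps)

y0 = 0.3
print(grad_log_mix(y0, 1.0, 1.0))   # equal weights
print(grad_log_mix(y0, 1.0, 5.0))   # rescaling one component changes the score function
```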

Matching for doubly intractable Bayesian predictions, or a mixture of unnormalized densities?

“Gradient of log” is a great operator because it throws away normalizing constants. That is, for any positive constant $c$ and any differentiable density function $p(y)$,

\[{\frac{d}{dy} \log} ( {c} p(y) ) = {\frac{d}{dy} \log} ( p(y) ).\]

But what if we now want to evaluate a sum of unnormalized functions, $\sum_{s=1}^S c_s p_s(y)$, without knowing the constants $c_s$?

Does there exist a non-trivial operator $\#$, a mapping from $R^R$ to $R^R$ (that is, from functions to functions), such that for all positive constants $c_i$,

\[{\color{red} \#} ( \sum_{i=1}^S {\color{orange}c_i} p_i(y) ) = {\color{red} \#} ( \sum_{i=1}^S p_i(y) ).\]

The answer is negative for any $S\geq 2$.

A heuristic proof: any (sufficiently nice) function can be written as a Taylor series, i.e., a sum of many components. If an operator satisfied the property above, it would be invariant to rescaling each component of that sum, and would therefore assign the same value to essentially any two functions; any two predictions would then be evaluated as identical, which is not useful.
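To spell the heuristic out a bit more (under the extra assumptions, not stated above, that both unnormalized forecasts are analytic with strictly positive Taylor coefficients and that the invariance extends to infinite sums): write

\[f(y) = \sum_{k=0}^{\infty} a_k y^k, \quad g(y) = \sum_{k=0}^{\infty} b_k y^k, \quad a_k, b_k > 0.\]

Applying the invariance with components $p_k(y) = a_k y^k$ and constants $c_k = b_k / a_k$ gives

\[\#(g) = \#\left(\sum_{k=0}^{\infty} \frac{b_k}{a_k}\, a_k y^k\right) = \#\left(\sum_{k=0}^{\infty} a_k y^k\right) = \#(f),\]

so $\#$ would assign the same value to $f$ and $g$, and could not distinguish any two such forecasts.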

The bottom line:

Score matching is a useful tool for evaluating unnormalized models. The Hyvarinen score applies to a tempered mixture, but it does not apply to a mixture of tempered densities, or more generally to any doubly intractable Bayesian prediction.

Furthermore, we can prove mathematically that there is no operator that could be used to match a mixture of unnormalized densities.