<p>Yuling Yao’s Blog: Bayesian Statistics, Machine Learning</p>
<h1>Bayes is guaranteed to overfit, for any model, any prior, and every data point (2023-05-26)</h1>
<h2 id="the-myth">The myth</h2>
<p>A few years ago I saw a StackOverflow question (which I cannot find now): allegedly Andrew Gelman blogged that Bayesian models would not overfit and would only underfit (I cannot find Andrew’s post now either, but I think what he meant by “Bayesian models only underfit” is that you can never fully model the true data-generating process in the population).</p>
<p>It reminded me that I have received several emails asking why we need cross-validation for Bayesian models: since Bayesian inference is not set up as <em>empirical risk minimization</em> or an M-estimate, there is no immediate reason why the empirical training risk/loss needs to underestimate the test risk. This argument seems popular in the Bayesian world; for example, a quick google search of “Bayes does not overfit” turns up these <a href="https://www.inf.ed.ac.uk/teaching/courses/mlpr/2016/notes/w7a_bayesian_complexity_control.html#:~:text=Fully%20Bayesian%20procedures%20can%27t,and%20choice%20of%20prior%20however.">lecture notes</a> from Edinburgh:</p>
<blockquote>
<p>Fully Bayesian procedures can’t suffer from “overfitting”, because parameters aren’t fitted: Bayesian statistics only involves integrating or summing over uncertain parameters, not maximizing. The predictions can depend heavily on the model and choice of prior however.</p>
</blockquote>
<h2 id="the-truth">The truth</h2>
<p>I have a different view. <em>Bayesian models do overfit</em>.</p>
<p>Moreover, Bayes is <em>guaranteed</em> to overfit, regardless of the model (correct or wrong) or the prior (“strong” or uninformative).</p>
<p>Moreover, Bayes is guaranteed to overfit on <em>every realization</em> of training data, not just in expectation.</p>
<p>Moreover, Bayes is guaranteed to overfit on <em>every single point</em> of the training data, not just in the summation.</p>
<h2 id="the-guarantee">The guarantee</h2>
<p>To see this<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, let’s work in a general setting with exchangeable observations $y_1, \dots, y_n$: the data model is any $p(y \vert \theta)$ and the prior is any $p(\theta)$. Whether the model is correct or wrong is not relevant to our discussion here. The posterior predictive distribution of any future unseen data is $p(\cdot \vert y) = \int p(\cdot \vert \theta ) p(\theta \vert y) d\theta$. We will evaluate the prediction by its expected log score. The in-sample log score is</p>
\[\sum_{i=1}^n \log p(y_i \vert y), ~~~ p(y_i \vert y)= \int p(y_i \vert \theta) p(\theta \vert y)d\theta.\]
<p>It is a sum of $n$ terms. Each term, $p(y_i \vert y)$, is a (weighted) <strong>arithmetic mean</strong> of $p(y_i \vert \theta)$. You could imagine that if you have access to $S$ Monte Carlo draws from $p(\theta \vert y)$, then this in-sample individual predictive density is the arithmetic mean of $p(y_i \vert \theta_j)$, $j=1, \dots, S$.</p>
<p>To evaluate the out-of-sample prediction, we can leave the $i$-th data point out. The leave-one-out log predictive density is $\sum_{i=1}^n \log p(y_i \vert y_{-i})$, where the $i$-th predictive density is</p>
\[p(y_i \vert y_{-i})= \int p(y_i \vert \theta)\, p(\theta \vert y) \frac{p(y_i \vert \theta)^{-1}} {\int p(y_i \vert \theta^\prime)^{-1} p(\theta^{\prime} \vert y) d\theta^\prime} d\theta = \frac{1} {\int p(\theta \vert y)\, p(y_i \vert \theta)^{-1} d\theta},\]
<p>that is the weighted <strong>harmonic mean</strong> of $p(y_i \vert \theta)$. Again, when you have $S$ Monte Carlo draws from $p(\theta \vert y)$, then the out-of-sample individual predictive density is precisely the harmonic mean of $p(y_i \vert \theta_j)$, $j=1, … S$.</p>
<p>Except in degenerate cases (a point-mass posterior), the harmonic mean inequality guarantees the strict inequality $p(y_i \vert y_{-i}) &lt; p(y_i \vert y)$, for every point $i$ and every model.</p>
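<p>Here is a quick numerical check of the pointwise inequality (this sketch is mine, not from any paper, and assumes numpy). A conjugate normal-normal model keeps the posterior exact, so we can directly compare the arithmetic and harmonic means of the pointwise likelihoods:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model: theta ~ N(0, 1), y_i | theta ~ N(theta, 1).
n = 20
y = rng.normal(0.5, 1.0, size=n)

# Exact posterior: N(post_mean, post_var).
post_var = 1.0 / (1.0 + n)
post_mean = post_var * y.sum()
S = 50_000
theta = rng.normal(post_mean, np.sqrt(post_var), size=S)

# Pointwise likelihoods p(y_i | theta_s): an n-by-S matrix.
lik = np.exp(-0.5 * (y[:, None] - theta[None, :]) ** 2) / np.sqrt(2 * np.pi)

in_sample = lik.mean(axis=1)          # arithmetic mean, approximates p(y_i | y)
loo = 1.0 / (1.0 / lik).mean(axis=1)  # harmonic mean, approximates p(y_i | y_{-i})

# Overfitting at every single data point, as the inequality promises.
assert np.all(loo < in_sample)
```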
<h2 id="what-is-the-most-overfitting-inference">What is the most overfitting inference?</h2>
<p>Bayes always overfits; how about other inferences?</p>
<p>From the harmonic mean inequality, you can think about an abstract math problem: the pointwise in-sample score is the (arithmetic) mean of $p(y_i \vert \theta)$, while the leave-one-out version is the inner product of $p(y_i \vert \theta)$ and $\frac{1/p(y_i \vert \theta)}{ \int 1/p(y_i \vert \theta) d\theta}$, where the second term is a $\theta$-density proportional to $\frac{1}{p(y_i \vert \theta)}$.</p>
<p>The Bayesian update can be characterized sequentially: what you do when you see a data point $y_i$. Bayes multiplies the posterior by a factor ${p(y_i \vert \theta)}$ and then normalizes, because the density needs to integrate to 1. But there are other updates. For example, if you choose to update the posterior by
${p(y_i \vert \theta)}^{\alpha}$, such that your posterior is proportional to $p(y\vert \theta)^\alpha p(\theta)$, then the out-of-sample predictive density will be the inner product of $p(y_i \vert \theta)$ and a density proportional to $\frac{1}{p(y_i \vert \theta)^\alpha}$.</p>
<p>From Jensen’s inequality, it is easy to see that the generalization gap is strictly monotonic in $\alpha$. When $\alpha=0$, the inference always returns the prior and there is no overfitting. When $\alpha=\infty$, the posterior collapses to the MLE, and you get the maximum amount of overfitting.</p>
<p>The math we need is elementary: suppose you have a sequence of (fixed) positive numbers: $a_1, \dots, a_J.$ There is another sequence of variable positive numbers: $b_1, \dots, b_J,$ and $\sum_{j=1}^J b_j=1$.
We compute their inner product $\sum_{j=1}^J a_j b_j$. When $b_j=1/J$, you get the arithmetic mean, corresponding to the in-sample predictive density; when $b_j \propto 1/a_j$, you get the harmonic mean, corresponding to the out-of-sample Bayes predictive density; when $b_j = \mathbf{1} (a_j= \min(a_1, \dots, a_J))$, you get $\min_j(a_j)$, the minimum this summation can attain, corresponding to the out-of-sample MLE.</p>
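<p>The whole chain is elementary to verify numerically. Below is a small sketch of mine (hypothetical numbers, assuming numpy) of the tilted inner product $\sum_j a_j b_j$ with $b_j \propto a_j^{-\alpha}$: $\alpha=0$ gives the arithmetic mean, $\alpha=1$ the harmonic mean, and large $\alpha$ approaches $\min_j a_j$:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.1, 2.0, size=100)  # the fixed positive numbers a_j

def tilted_mean(a, alpha):
    """Inner product of a_j with weights b_j proportional to a_j^(-alpha)."""
    b = a ** (-alpha)
    return (a * b / b.sum()).sum()

alphas = [0.0, 0.5, 1.0, 2.0, 10.0]
scores = [tilted_mean(a, al) for al in alphas]

assert np.isclose(scores[0], a.mean())                  # alpha=0: arithmetic mean
assert np.isclose(scores[2], len(a) / (1.0 / a).sum())  # alpha=1: harmonic mean
# The gap grows monotonically with alpha, down toward min(a_j) (the MLE case).
assert all(s1 > s2 for s1, s2 in zip(scores, scores[1:]))
assert scores[-1] >= a.min()
```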
<h2 id="the-bottom-line">The bottom line:</h2>
<ol>
<li>Bayes is guaranteed to overfit any training data.</li>
<li>MLE is guaranteed to overfit more than Bayes.</li>
<li>Both of these two claims hold for any model and any prior. The claims hold point-wise, not just on average.</li>
<li>These strict inequalities come from the convexity of $f(x)=1/x, ~x>0$.</li>
</ol>
<h2 id="ps">P.S.</h2>
<p>This post is discussed a lot on Twitter. I thank all the readers.</p>
<p>To clarify, I think the word <em>overfit</em> could mean two things:</p>
<ol>
<li>Test error is always larger than training error, and/or,</li>
<li>A complicated model gives worse predictions than a simpler model in testing.</li>
</ol>
<p>In this blog post, I only refer to (1), not (2). Perhaps a more accurate statement would be that Bayesian posterior prediction always has a positive generalization gap at every single point. In particular, I do <em>not</em> discuss whether a Bayesian prediction necessarily has a better or worse test error than the MLE.</p>
<p><strong>PPS.</strong> When you do have the correct model, the Bayesian prediction is optimal in the sense of prior replication; see our <a href="https://arxiv.org/abs/2304.12218">review paper on Bayes prediction</a>. But even then, the in-sample prediction error will be underestimated at any finite sample size. Here, the underestimation is an issue of model evaluation, not of model design.</p>
<p><strong>PPPS.</strong> Don’t get me wrong: I do use, like, and advocate Bayesian approaches. But when there is a distinction between treat-Bayes-rule-as-an-always-correct-blackbox and an open-minded view of use-bayes-and-bayesian-decision-theory-as-building-blocks-toward-model-improvement, I would prefer the latter view.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>I am not aware of whether this justification has appeared before. If you would like to cite this “proof”, you can cite my recent <a href="https://arxiv.org/abs/2304.12218">review paper on Bayesian prediction</a> where I add this simple math as a side note. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

<h1>How much saving do you need in the end? zero? (2022-12-30)</h1>
<p>I came across this book called “<a href="https://www.nytimes.com/2020/07/10/business/awkward-timing-book-financial-.html">Die with zero</a>”. Along with many other yolo ideas, the book promotes the attitude that one must maximize net fulfillment over net worth, to the extent of “DIE WITH ZERO”. Indeed, as the book points out on the cover,</p>
<blockquote>
<p>In 1957, Nobel Prize Winner, Franco Modigliani, developed the Life-cycle Hypothesis showing the most optimal way of utilizing your wealth is to end with zero.</p>
</blockquote>
<p>I don’t know. I would be moderately shocked if there were indeed an academic effort to show this circular-argument-type result. It seems rather trivial.</p>
<p>At the risk of a pedantic tone, I think one common flaw is that people ignore uncertainty. I will show that even in this seemingly trivial problem, uncertainty leads to a surprise.</p>
<p>We need some math here. Suppose one has a fixed total income throughout their life, say $C$ dollars. For simplicity, we assume this person has no investment or loan, only spending: they spend $0 \leq Y \leq C$ dollars in their lifetime, leaving $C-Y$ dollars in the savings account. In addition, the total amount of necessary costs (food, medical, etc.) in the late stage of life is denoted by $X$ dollars. Naturally, we would wish that the savings could cover these necessary costs, so that the final balance $C-Y-X$ is not negative. If $C-Y-X=0$, then a die-with-zero situation occurs.</p>
<p>It seems reasonable to assume that the utility comes from the following two parts:</p>
<ul>
<li>whether the necessity is satisfied, that is, $1(C-Y-X \geq 0)$. If $C-Y&lt;X$, then the end-of-life bill cannot be paid, which is bad.</li>
<li>the spending on leisure: it is a consumerist world, so it is certainly an axiom that the $Y$ dollars of spending directly contribute a utility increment equal to $Y$.</li>
</ul>
<p>The combined utility is then $1(C-Y-X \geq 0) + Y$, subject to $0 \leq Y \leq C$.</p>
<p>The uncertainty comes from $X$: we do not know the necessary cost at death. Indeed, if $X$ is known, then clearly</p>
\[\mathrm{argmax}_{0 \leq Y \leq C} 1(C-Y-X \geq 0) + Y = C-X,\]
<p>at which one dies with zero. So yes, we shall all die with zero if we have already calculated our life flawlessly.</p>
<p>In practice, $X$ is a random variable. The decision problem maximizes the expected utility</p>
\[\mathrm{max}_{0 \leq Y \leq C} \mathrm{E}_{X} [ 1(C-Y-X \geq 0) + Y ].\]
<p>You can check your intuition here: with this uncertainty in $X$, should we expect more savings? That seems to be what grandma would say, no? But isn’t the utility a linear function, and why would uncertainty matter at all?</p>
<p>We assume $X$ is a normal $(0, \sigma)$ random variable, with standard deviation $\sigma$. The expected utility can be written as $\Pr(X \leq C-Y) + Y$. The derivative of this function with respect to $Y$ is $- \frac{1}{\sigma \sqrt {2\pi}} \exp (-\frac{(C-Y)^2}{2\sigma^2}) + 1$.</p>
<ol>
<li>If $\sigma=0$, then yes, there is no uncertainty, such that dying with zero is optimal.</li>
<li>If $\sigma$ is not too big, $\sigma \leq 1/ \sqrt {2\pi}$, then the optimum is $\hat Y= C-\sqrt{ - 2\sigma^2 \log (\sigma \sqrt {2\pi}) }$. The saving at death, $\sqrt{ - 2\sigma^2 \log (\sigma \sqrt {2\pi}) } &gt;0$, is not zero. This is the price one pays for the uncertainty.</li>
<li>If $\sigma$ is too big, $\sigma &gt; 1/ \sqrt {2\pi}$, then the derivative of the objective function is always positive, hence the optimum is $\hat Y=C$. That is, since the future is just too chaotic, one simply adopts a yolo lifestyle and forgets about the necessity altogether. The saving at death is either $-X$ or 0, depending on the healthcare system.</li>
</ol>
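<p>A quick numerical check of the three cases above (my own sketch, assuming numpy and scipy; $C=10$ is an arbitrary choice):</p>

```python
import numpy as np
from scipy.stats import norm

C = 10.0  # total lifetime income (arbitrary)

def expected_utility(Y, sigma):
    # E[ 1(C - Y - X >= 0) + Y ]  with  X ~ N(0, sigma^2)
    return norm.cdf((C - Y) / sigma) + Y

Y_grid = np.linspace(0.0, C, 200_001)

# Case 2: moderate uncertainty, sigma <= 1/sqrt(2*pi): save a bit.
for sigma in [0.1, 0.3]:
    Y_hat = Y_grid[np.argmax(expected_utility(Y_grid, sigma))]
    saving = C - Y_hat
    closed_form = np.sqrt(-2 * sigma**2 * np.log(sigma * np.sqrt(2 * np.pi)))
    assert abs(saving - closed_form) < 1e-3 and saving > 0

# Case 3: sigma > 1/sqrt(2*pi): the yolo solution, spend everything.
Y_hat = Y_grid[np.argmax(expected_utility(Y_grid, 1.0))]
assert Y_hat == C
```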
<p>The bottom line: die-with-zero is an oversimplification. When there is moderate uncertainty in the necessary cost, the optimum prefers extra non-zero saving at death. When the uncertainty is too big, just yolo.</p>

<h1>control variate other than the score (2022-12-08)</h1>
<p>In Bayesian computation, we use control variates to reduce Monte Carlo (MC) variance. The idea: if we want to compute $E_{p} h(x)$ from MC draws $x_{1, \dots, S}$, instead of computing the sample mean of $h(x_i)$, we seek a mean-zero function $m(x)$, $E_{p} m(x)= 0$, such that $h (x)- m(x)$ has lower variance.</p>
<p>As far as I know, most control variates take the form of the <strong>score</strong> function: either the gradient of the log density, $\nabla_{x} \log p(x)$, or the Stein gradient $\nabla_x \log p(x) \cdot g(x) + \nabla_{x} \cdot g(x)$.</p>
<p>Are there other zero-mean functions? Well, at least one more: $\nabla_{x} p(x)$, because $E_p [\nabla_{x} p(x)]=0$ for all $p$. I might be ignorant, but I only noticed this identity today.</p>
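<p>To illustrate the Stein control variate numerically (a toy of my own, assuming numpy): for $p=N(0,1)$ the score is $-x$, and taking $g(x)=x$ in the Stein formula gives the mean-zero function $x^2-1$, which happens to be a perfect control variate for estimating $E[x^2]$:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10_000)  # draws from p = N(0,1), whose score is -x

h = x**2  # target: E[h] = 1

# Stein-type mean-zero function with g(x) = x:
#   (d/dx log p(x)) * g(x) + g'(x) = 1 - x^2;  take m(x) = x^2 - 1, E[m] = 0.
m = x**2 - 1.0

plain = h.mean()
controlled = (h - m).mean()  # h - m = 1 pointwise: a zero-variance estimator

assert np.isclose(controlled, 1.0)
assert (h - m).var() < h.var()
```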
<p><strong>Edit:</strong>
Actually, $\nabla_{x} p(x)$ is still generated by the score function: just take $g(x)=p(x)$ in the Stein formula in the second paragraph, and we recover $\nabla_{x} p(x)$ (up to a factor of 2).</p>

<h1>What is wrong with this marginalize-out trick (2022-08-23)</h1>
<p>Consider a normal-normal model with vector data $y$ and scalar parameters $\mu$ and $\sigma$, written in the following Stan code<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data {
int<lower=0> N;
vector[N] y;
real<lower=0> tau;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
mu ~ normal(0, tau);
y ~ normal(mu, sigma);
}
</code></pre></div></div>
<p>Tau is a fixed hyperparameter, say 10. We can make inferences on $\mu$ and $\sigma$. That is easy.</p>
<p>But now I decide that I want to apply the marginalize-out trick to get rid of $\mu$. That is easy because it is a normal-normal model, so the marginalized model is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data {
int<lower=0> N;
vector[N] y;
real<lower=0> tau;
}
parameters {
real<lower=0> sigma;
}
model {
y ~ normal(0, hypot(sigma, tau));
}
</code></pre></div></div>
<p>The problem is that these two models are not the same. Mathematically, the full joint model reads</p>
\[y\vert \mu , \sigma \sim N(\mu, \sigma^2),~~
\mu\sim N(0, \tau^2),\]
<p>It looks so tempting to marginalize out $\mu$ and write</p>
\[y\vert \sigma \sim N(0, \sigma^2+ \tau^2).\]
<p>But they just cannot be the same: for example, the MAP estimate of $\sigma^2$ is roughly the sample variance of $y$ in model 1, but $\sum_{i=1}^n y_i^2 / n - \tau^2$ in model 2.</p>
<p>The problem is that <strong>y</strong> is a vector. The $y_i$ are conditionally independent given $\mu$ and $\sigma$, but not when conditioning only on $\sigma$. It is true that the marginal of each $y_i$ is</p>
\[y_i\sim N(0, \sigma^2+ \tau^2).\]
<p>However, the joint marginal is no longer factorizable. Indeed, $\mathrm{Cov}(y_i, y_j)= \tau^2$ for $i \neq j$. So the correct marginalized-out model for $y \vert \sigma$
should be a multivariate normal with mean 0 and a covariance matrix whose diagonal entries are $\sigma^2+ \tau^2$ and off-diagonal entries $\tau^2$.</p>
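<p>A quick simulation (mine, assuming numpy) confirms this covariance structure:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
tau, sigma, sims = 10.0, 1.0, 200_000

# Simulate the joint model with N = 2: mu ~ N(0, tau^2), y_i | mu ~ N(mu, sigma^2).
mu = rng.normal(0.0, tau, size=sims)
y1 = rng.normal(mu, sigma)
y2 = rng.normal(mu, sigma)

# Marginally, each y_i ~ N(0, sigma^2 + tau^2) ...
assert abs(np.var(y1) - (sigma**2 + tau**2)) < 2.0
# ... but y1 and y2 are not independent given sigma alone: Cov(y1, y2) = tau^2.
cov = np.cov(y1, y2)[0, 1]
assert abs(cov - tau**2) < 2.0
```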
<p><strong>The bottomline:</strong> <a href="https://mc-stan.org/docs/2_20/stan-users-guide/rao-blackwell-section.html">Marginalization</a> is a great trick to boost computing efficiency. But it is your obligation to validate the conditional independence after the marginalization.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Bob Carpenter wrote the code. Bob, Charles, and I wasted one hour discussing this toy example. Please do not let our employer know what we are doing. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

<h1>Alternatives to two-stage modeling (2022-08-22)</h1>
<p>Sometimes a model can be decomposed into modules and we may run inference separately. This task comes up a lot in cut-feedback, SMC, causal inference (two-stage regression), multiple imputation, and PK-PD modeling.</p>
<p>For the easiest example, consider a Stan model with data <code class="language-plaintext highlighter-rouge">y</code> and parameters <code class="language-plaintext highlighter-rouge">mu</code>, <code class="language-plaintext highlighter-rouge">sigma</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>y ~ normal (mu, sigma);
</code></pre></div></div>
<p>For some reason, we have already fitted <code class="language-plaintext highlighter-rouge">mu</code> from a different module or from a different dataset. We have obtained $\mu_1, \dots, \mu_S$. The goal is to make inference on $p(\sigma \vert y, \mu_1, \dots, \mu_S)$.</p>
<p>To be clear, by now we have already lost full Bayesianity, since we do not fit a joint model. But hey, we are inclusive of non-Bayesian methods.</p>
<p>There are three seemingly reasonable approaches for the second-stage model:</p>
<ol>
<li>
<p><strong>Multiple imputation.</strong> We run the model <code class="language-plaintext highlighter-rouge">y ~ normal (mu[i], sigma);</code> separately for each $i$ and collect draws from $p(\sigma \vert y, \mu_i)$; we then mix these draws together. This is what is done in multiple imputation and Cut.</p>
</li>
<li>
<p><strong>Plugin estimate.</strong> When we do a two-stage least-squares fit, we simply plug in the first-stage point estimate, say the posterior mean. This amounts to a new model $y \sim normal (\bar {\mu}, \sigma)$, where $\bar {\mu}= \frac{1}{S} \sum_{i=1}^S \mu_i$.</p>
</li>
<li>
<p><strong>Mixed log likelihood.</strong> At least seemingly doable, we may also mix the log densities from these draws, which in Stan reads</p>
</li>
</ol>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for (i in 1:S)
  target += 1.0 / S * normal_lpdf (y | mu[i], sigma);
</code></pre></div></div>
<p>In this model example, the mixed-log-likelihood approach is identical to the plugin estimate, although in general all three methods will differ. Using the conditional variance formula, we can see that multiple imputation delivers the largest estimate of $\sigma$.</p>
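<p>The conditional variance formula can be checked directly for this toy example (a sketch of mine, assuming numpy; for simplicity I replace each conditional posterior of $\sigma$ with its per-draw MLE):</p>

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.0, 2.0, size=500)   # second-stage data
mu = rng.normal(1.0, 0.5, size=200)  # first-stage draws mu_1, ..., mu_S

# Approach 1 (multiple imputation): estimate sigma^2 per draw, then mix.
sigma2_mi = np.mean([np.mean((y - m) ** 2) for m in mu])

# Approach 2 (plugin): condition on the first-stage posterior mean.
sigma2_plugin = np.mean((y - mu.mean()) ** 2)

# Conditional variance formula, here an exact identity:
#   sigma2_mi = sigma2_plugin + Var(mu draws)  >  sigma2_plugin
assert np.isclose(sigma2_mi, sigma2_plugin + mu.var())
assert sigma2_mi > sigma2_plugin
```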
<p>OK, I know that in most cases approach 1 is the only acceptable answer. The justification comes straight from the Bayes rule:</p>
\[p(\sigma \vert y) = \int p(\sigma \vert y, \mu) p(\mu \vert y ) d\mu.\]
<p>My controversial opinion is that the Bayes rule is only relevant if we are running a joint model and inferring $\mu$ and $\sigma$ together. That is SMC. But in a situation like Cut, we place doubt on the model in the first place, and keeping the obsession with this Bayes rule seems a little bit stubborn to me.</p>
<p>Approach 1 and approach 3 differ in how they mix the conditional sampling model $p(y \vert \sigma, \mu)$. Approach 1 is using a mixture (coherent with the joint model)</p>
\[p(y \vert \sigma) := \int (p(y \vert \sigma, \mu) p(\mu \vert \sigma) ) d\mu,\]
<p>while approach 3 is using log-linear-pooling (this line does not correspond to any joint model):</p>
\[\log p(y \vert \sigma) := \int \log p(y \vert \sigma, \mu) p(\mu \vert y) d\mu + Constant.\]
<p>I wonder if this approach 3 has any actual application. I do not know.</p>

<h1>Score matching, Bayesian predictions, tempering, and invariance (2022-08-20)</h1>
<h2 id="score-matching">Score matching</h2>
<p>Suppose that we observe a sequence of data $y=\{y_i \in R_m \mid 1\leq i \leq n\}$ coming independently from an unknown distribution $p_{true}$; we would like to evaluate a forecast given by a probability density function $p(y)$. For example, we may use the logarithm score $\sum_{i=1}^n \log p(y_i)$ to assess the forecast.</p>
<p>But what if the predictive pdf is only known up to a multiplicative constant? That is, we are only able to evaluate the unnormalized density $q(y) = p(y)/ c$. In a typical task of parameter inference, model selection, or model averaging, we are given a set of unnormalized forecasts indexed by $\theta$: $\{q_\theta(\cdot) \mid \theta \in \Theta\}$, where each element $q_\theta(\cdot)$ is a non-negative function on $R_m$, whose normalizing constant $c(\theta) = \int_{R_m} q_\theta(y) d y$ is unknown.</p>
<p>Since the seminal work of Hyvarinen (2005), <em>score matching</em> has been a powerful tool for evaluating unnormalized predictions. The main idea is that the normalizing constant <em>disappears</em> when we look at the gradient of the log unnormalized density. The “gradient of log” of the pdf is often known as the <em>score</em> function to statisticians. We measure the difference between the score functions of the true data-generating process $p_{true}$ and of the forecast $q_\theta$,</p>
\[D(p_{true}, q_\theta) = \int_{R_m} \Vert \nabla \log p_{true}(y) - \nabla \log q_{\theta} (y) \Vert^2 p_{true}(y) dy,\]
<p>hence the name <em>score matching</em>.</p>
<p>In practice, we do not know $p_{true}$; we only observe its samples $y_{1:n}$. A sample estimate of the divergence above is</p>
\[H(y_{1:n}, q_{\theta})
=\frac{1}{n}\sum_{i=1}^n \left(\frac{1}{2} \Vert \nabla_y \log q_{\theta} (y_i) \Vert^2 + \Delta_y \log q_{\theta} (y_i) \right).\]
<p>In a larger universe of scoring rules, this $H(y, q_{\theta})$ is known as the Hyvarinen score. In the limiting case as sample size $n \to \infty$, this sample estimate converges in the sense that $H(y_{1:n}, q_{\theta}) \to D(p_{true}, q_\theta)$+ Constant, where the constant does not depend on $q_{\theta}$.</p>
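<p>The invariance to the normalizing constant is easy to see numerically (my own sketch, assuming numpy; derivatives by finite differences):</p>

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=200)

def hyvarinen_score(log_q, y, eps=1e-3):
    """mean of 0.5*(d/dy log q)^2 + d^2/dy^2 log q, by central differences."""
    grad = (log_q(y + eps) - log_q(y - eps)) / (2 * eps)
    lap = (log_q(y + eps) - 2 * log_q(y) + log_q(y - eps)) / eps**2
    return np.mean(0.5 * grad**2 + lap)

# The same Gaussian shape with two different (unknown) normalizing constants.
log_q1 = lambda t: -0.5 * t**2                 # q1 = exp(-t^2/2)
log_q2 = lambda t: -0.5 * t**2 + np.log(7.3)   # q2 = 7.3 * q1

# The constant never enters: both give mean(0.5*y^2) - 1.
assert np.isclose(hyvarinen_score(log_q1, y), hyvarinen_score(log_q2, y))
assert np.isclose(hyvarinen_score(log_q1, y), np.mean(0.5 * y**2) - 1.0)
```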
<h2 id="unnormalized-models-in-bayesian-statistics">Unnormalized models in Bayesian statistics</h2>
<p>There are three levels of unnormalized models in Bayesian statistics.</p>
<h3 id="level-1-a-harmless-normalization-constant-comes-from-the-bayes-rule">Level 1: A harmless normalization constant comes from the Bayes rule.</h3>
<p>In classical parameter inference, the posterior density of a parameter is typically given in an unnormalized form: $p(\theta\vert y) \propto p(y\vert \theta) p(\theta)$, where the normalizing constant is the marginal likelihood $\int p(y\vert \theta) p(\theta) d \theta = p(y)$. For the purposes of Bayesian computation, this normalizing constant is irrelevant in MCMC, variational inference, or importance sampling. Notably, with posterior draws $\theta_1, \dots, \theta_S$, the posterior predictive distribution is tractable and appropriately normalized,</p>
\[p(\tilde y \vert y) = \int p(\tilde y \vert \theta)p(\theta \vert y) d \theta \approx \frac{1}{S} \sum_{i=1}^S p(\tilde y \vert \theta_i).\]
<h3 id="level-2-intractable-posterior-predictive-distribution">Level 2: Intractable posterior predictive distribution.</h3>
<p>Sometimes we only know the posterior predictive density up to a constant. For example, in modern literature on calibration, we may address the potential overconfidence of a prediction via tempering, such that</p>
\[p(\tilde y\vert y, \lambda)= \frac{1}{z(\lambda)} p(\tilde y \vert y)^\lambda, ~
z(\lambda)= \int p(\tilde y \vert y)^\lambda d \tilde y.\]
<p>Intuitively, a smaller $\lambda \in (0,1)$ flattens the prediction, resulting in less confidence. The Hyvarinen score still applies.</p>
<h3 id="level-3-intractable-likelihood">Level 3: Intractable likelihood.</h3>
<p>The likelihood itself may also be intractable, meaning we are only able to evaluate
$q(y\mid \theta) \propto p(y\mid \theta),$ while the pointwise normalizing constant
$z(\theta)= \int q(y\mid \theta) dy$ is unknown. This type of model is often called <strong>doubly intractable</strong>. For example, with the <em>alpha-likelihood</em>, the likelihood function is</p>
\[p(y\vert \theta, \lambda) \propto p(y\vert \theta)^ \lambda, ~
z(\lambda, \theta)= \int p(y\vert \theta)^ \lambda d y.\]
<p>Aside from how to sample from a doubly intractable model, even if we do obtain posterior draws $(\theta_1, \lambda_1), \dots, (\theta_S, \lambda_S)$, this time the posterior predictive distribution is <strong>a mixture of unnormalized</strong> densities:</p>
\[p(\tilde y\vert y)= \int p(\tilde y\vert \theta, \lambda) p( \theta, \lambda \vert y)\, d\theta\, d\lambda \approx \frac{1}{S} \sum_{s=1}^S \frac{1}{z(\lambda_s, \theta_s)} p^{\lambda_s}(\tilde y\vert \theta_s).\]
<p>The Hyvarinen score does not apply to a mixture/summation of unnormalized densities. It is clear that the score function is not invariant under this procedure:</p>
\[\nabla \log \left(\sum_{i=1}^S c_if_i(y) \right) \neq \nabla \log \left(\sum_{i=1}^S f_i(y)\right).\]
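<p>A one-line numerical illustration (mine, assuming numpy) that rescaling the components changes the score of the sum:</p>

```python
import numpy as np

def score_of_sum(y, c1, c2, eps=1e-6):
    """d/dy log( c1*f1(y) + c2*f2(y) ) for two Gaussian bumps, by central differences."""
    f = lambda t: c1 * np.exp(-0.5 * t**2) + c2 * np.exp(-0.5 * (t - 3.0) ** 2)
    return (np.log(f(y + eps)) - np.log(f(y - eps))) / (2 * eps)

# For a single component, rescaling would not change the score;
# for a sum of components, it does.
g1 = score_of_sum(1.0, 1.0, 1.0)
g2 = score_of_sum(1.0, 5.0, 1.0)
assert not np.isclose(g1, g2)
```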
<h2 id="matching-for-doubly-intractable-bayesian-predictions-or-a-mixture-of-unnormalized-densities">Matching for doubly intractable Bayesian predictions, or a mixture of unnormalized densities?</h2>
<p>“Gradient of log” is a great operator because it throws away normalizing constants. That is, for any positive constant $c$ and any continuous density function $p(y)$,</p>
\[{\frac{d}{dy} \log} ( {c} p(y) ) = {\frac{d}{dy} \log} ( p(y) ).\]
<p>But what if we now want to evaluate a sum of unnormalized functions, $\sum_{i=1}^n p_i(y)$?</p>
<p>Does there exist a non-trivial operator $\#$, a mapping from functions to functions, such that</p>
\[{\color{red} \#} ( \sum_{i=1}^S {\color{orange}c_i} p_i(y) ) = {\color{red} \#} ( \sum_{i=1}^S p_i(y) ).\]
<p>The answer is negative for any $S\geq 2$.</p>
<p>A heuristic proof: we can write any function as a Taylor series expansion. If an operator satisfied the property above, it would make any two functions invariant, so that any two predictions would be evaluated as the same. That is not useful.</p>
<h3 id="the-bottomline">The bottomline:</h3>
<p>Score matching is a useful tool for evaluating unnormalized models. The Hyvarinen score applies to a tempered mixture, but does not apply to a mixture of tempered densities, nor to doubly-intractable Bayesian predictions.</p>
<p>Furthermore, we can mathematically prove that there is no operator that we can use to match a mixture of unnormalized densities.</p>

<h1>How to generate unbiased estimate of 1/E[x] using one random draw? (2022-04-19)</h1>
<p><strong>Quiz: you are given ONE random draw $x$ that was drawn from a density $p(x)$. Could you produce an unbiased estimate of $1/E_p[X]$?</strong></p>
<p>You might want to think about this quiz before reading my solution.</p>
<p>Apart from the mathematical fun, this type of problem comes up in stochastic approximation, where we need an unbiased estimate from a very small number of Monte Carlo draws. The unbiasedness here means that this sampling step will be repeated many times, but each time you are only shown one sample point $x$, and we wish the estimate to be unbiased under repeated sampling.</p>
<p>Here, the obvious wrong answer is to use $1/x$. You can try</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n=10000
x=rbeta(n,2,2)
mean(1/x)
</code></pre></div></div>
<p>It is clear that $E[1/x]= 3$ while our desired quantity is $1/E[x]= 2$. Indeed, by Jensen’s inequality, $E[1/x] &gt; 1/E[x]$ for any non-degenerate positive $x$.</p>
<p>How about a Taylor series expansion? Something like $1/E[X] = 1- (E(x)-1) + O((E(x)-1)^2)$? It is legitimate, but then you get the crude approximation $2-x$, provided that I believe $E(x) \approx 1$, which we typically do not know in the first place.</p>
<p>I find one solution from rejection sampling. The idea is that self-normalized importance sampling is only unbiased asymptotically, while rejection sampling is unbiased even with MC size 1.</p>
<p>Here is the method. To make it work, I need to know an upper bound on $x$; it has to be a bounded (and positive) variable. Say the upper bound is $c$. Each time I see a realization $x$, I independently generate a random number $u$ from uniform$(0,1)$. If $u$ is smaller than $x/c$, accept and report $1/x$. If $u$ is larger than $x/c$, do not report any estimate.</p>
<p>Then whenever I report the accepted $1/x$, it is an unbiased estimate of $1/E[x]$: conditional on acceptance, the expected report is $E[(1/x)\cdot(x/c)] / E[x/c] = (1/c) / (E[x]/c) = 1/E[x]$.</p>
<p>Here is a demo code for $p(x)=$ Beta$(2,2)$:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n=1000000
x=rbeta(n,2,2)
u=runif(n,0,1)
unbiased_estimate=rep(NA, n)
unbiased_estimate[u&lt;x]= (1/x)[u&lt;x]
#check the answer:
mean(unbiased_estimate, na.rm = T)- 1/mean(x)
</code></pre></div></div>

<h1>Statistics intuitions for integrals (2022-03-29)</h1>
<p>I have not done any math for a long while. Today I happened to need to compute an integral</p>
\[S(k, \sigma) =\int_{0}^\infty x\log (x)\, \frac{1}{\sigma} \left(1+\frac{kx}{\sigma}\right)^{-1/k-1} dx.\]
<p>It is the expectation of $x\log x$ under generalized Pareto distribution. Surely it will be finite as long as $k <1$.</p>
<p>I tried for a while and then was quite sure I could not solve it by hand. So I opened a symbolic integration tool, and the result turned out to be easy:</p>
\[S(k, \sigma)= \frac{\sigma \left( 1-\mathrm{HarmonicNumber}[-2+\frac{1}{k}] - \log(\frac{k}{\sigma}) \right) }{1-k}.\]
<p>Except that I do not understand what <code class="language-plaintext highlighter-rouge">HarmonicNumber</code> is. I think it is some special function, so I looked it up. <a href="https://en.wikipedia.org/wiki/Harmonic_number">Wikipedia</a> told me that</p>
<blockquote>
<p>In mathematics, the n-th harmonic number is the sum of the reciprocals of the first n natural numbers:</p>
</blockquote>
\[H_{n}=1+{\frac {1}{2}}+{\frac {1}{3}}+\cdots +{\frac {1}{n}}=\sum _{k=1}^{n}{\frac {1}{k}}.\]
<p>Except it is not helpful to me cuz apparently I have a non-integer $n= -2+\frac{1}{k}$ here. I studied complex analysis in college but I have never used it since. But that is OK, I trust my symbolic integration tool.</p>
<p>Indeed I only want to evaluate this integral near 1. Because the mean and variance of the generalized Pareto distribution are of the order $O((1-k)^{-1})$ and $O((1-k)^{-2}(1-2k)^{-1})$ respectively, my best conjecture is that this $S$ should be $O((1-k)^{-m}), 1\leq m \leq 2$ as $k$ is close to 1.</p>
<p>So I searched one more minute and found that $H_{x}= \frac{\Gamma^\prime(x+1)}{\Gamma(x+1)}+\gamma$, in which $\gamma$ is the Euler constant and $\Gamma$ is the Gamma function.</p>
<p>The appearance of Euler constant and Gamma function in applied statistics is like a six pleat shirring on a shirt: fancy to the wearer but seldom useful to the audience.</p>
<p>It appeared that I needed the derivative of the Gamma function near 0. But I found that $\frac{\Gamma^\prime(x)}{\Gamma(x)}$ is itself called the <a href="https://en.wikipedia.org/wiki/Digamma_function">digamma function</a> $\psi(x)$. OK, I am not proud of being ignorant here, but it is still fun to learn. So I looked up Wikipedia again and found $\psi(x)\approx \log x - \frac{1}{2x}$. I plugged this approximation into my expression, and it did not match my conjecture. Ohh, of course, the $\psi(x)\approx \log x - \frac{1}{2x}$ approximation is only applicable when $x$ is large. For small $x\approx 0$, I found that $\psi(x) \approx -1/x - \gamma$. I plugged this in and then $\gamma$ cancelled out. So the final answer is that $S(k, \sigma) = \sigma k / (1-k)^2$ + smaller-order terms as $k$ goes to 1. Done.</p>
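<p>A quick numerical sanity check of both the symbolic result and the asymptotic order, sketched in Python with scipy (here $\sigma$ is set to 1; the tolerances are illustrative):</p>

```python
import numpy as np
from scipy import integrate, special

def S_exact(k, sigma=1.0):
    # symbolic result: S = sigma*(1 - H_{-2+1/k} - log(k/sigma))/(1-k),
    # with H_x = digamma(x+1) + Euler's gamma
    H = special.digamma(1.0 / k - 1.0) + np.euler_gamma
    return sigma * (1.0 - H - np.log(k / sigma)) / (1.0 - k)

# compare against direct numerical integration at k = 0.5, sigma = 1
k, sigma = 0.5, 1.0
integrand = lambda x: x * np.log(x) / sigma * (1 + k * x / sigma) ** (-1.0 / k - 1.0)
S_num, _ = integrate.quad(integrand, 0, np.inf)
print(S_num, S_exact(0.5))   # both should equal 2*(1 + log 2), about 3.386

# check the conjectured order sigma*k/(1-k)^2 as k -> 1
print(S_exact(0.999) * (1 - 0.999) ** 2 / 0.999)  # should be close to 1
```

<p>At $k=0.5$ the harmonic-number term is $H_0=0$, so the symbolic formula reduces to $2(1+\log 2)$, which the quadrature reproduces; near $k=1$ the ratio to $\sigma k/(1-k)^2$ approaches 1.</p>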
<p>But this is not why I wrote this post. The point is that sometimes statistical intuition can help with tedious math. To be clear, this math problem is only tedious to me cuz I am ignorant of the digamma and gamma functions; I am sure it was trivial to Euler. That said, I had already used statistical intuition once: I knew the order must be between $-1$ and $-2$ because $x\log x$ is bounded between the first and second moments.</p>
<p>Indeed, a more statistically intuitive solution here is that I can simply replace the generalized Pareto distribution by a Pareto distribution. This time I can do it by hand:</p>
\[S(k,1)\approx
\int_{1}^{\infty} r \log r \frac{1}{k} r^{-1/k-1} dr= k(1-k)^{-2}.\]
<p>This expression is different but has the same order as the one I obtained using the digamma function when $k$ is close to 1, and it can be used for many crude approximations. Again, it would be nice if I had known more about the gamma function, but solving a tedious math problem by a simple statistical approximation is equally fun.</p>Yuling YaoI have not done any math for a long while. Today I happen to need to compute an integralMarginal likelihood and the Lindley paradox2021-11-22T00:00:00+00:002021-11-22T00:00:00+00:00https://www.yulingyao.com/blog/2021/BF<p>I read an arxiv preprint “History and Nature of the Jeffreys-Lindley Paradox” by Eric-Jan Wagenmakers and Alexander Ly. It is a comprehensive journey that reviews the development of the “Jeffreys-Lindley Paradox”, or what is typically called the Lindley Paradox: we can reject a point null at p =0.0001 while the Bayes factor (BF) may favor this point null at BF = 1000.</p>
<p>Wagenmakers and Ly pointed out two approaches to escape the Lindley Paradox: either to avoid using a point hypothesis in the Bayes test, or to avoid a vague prior. Notably, we may still have the Lindley Paradox when the null is a spiky continuous distribution rather than the point mass.</p>
<p>To make the discussion concrete, consider a binomial experiment $y\sim \mathrm{Bin} (n,\theta)$, where we observe $y = 5001$ successes out of $n=10000$ trials. We specify a point null $\theta=.5$ and the alternative $\theta\neq .5$ with $\theta \sim$ Uniform(0,1). The z-statistic for the null is $(5001-5000)/(0.5\sqrt{n})=0.02$, so the data barely distinguish the two hypotheses, while BF is some very big number as $\theta=.5$ predicts the outcome much better than the vague prior $\theta \sim$ Uniform(0,1).</p>
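<p>To put a number on “very big”: under the uniform prior, the marginal likelihood of any count $y$ is exactly $1/(n+1)$, so the BF for the point null is just the null binomial pmf times $(n+1)$. A quick sketch in Python:</p>

```python
from math import lgamma, log, exp

n, y = 10000, 5001

# log P(y | theta = 0.5): the binomial pmf at the point null
log_pmf_null = (lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
                + n * log(0.5))

# log P(y | theta ~ Uniform(0,1)): the beta-binomial marginal is 1/(n+1)
log_marg_alt = -log(n + 1)

bf_01 = exp(log_pmf_null - log_marg_alt)
print(bf_01)  # roughly 80 in favor of the point null
```

<p>So one extra success in ten thousand trials already yields a BF of about 80 in favor of $\theta = .5$.</p>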
<p>We have made this point in our hierarchical stacking article: a model being true or false is not directly related to it being good or bad in terms of data fitting. Indeed, a wronger model may make a better prediction depending on your chosen metric. In the Lindley paradox, at least as I see it, a Bayesian should not conclude that $\theta=.5$ in light of a very big BF, because we know a priori that Pr(point null) $= 0$.</p>
<p>The marginal likelihood is only weakly related to how the model fits the data. It reflects the average leave-$q$-out log predictive density as $q$ varies from 0 to $n$, among which $q=n$ accounts for a disproportionate share because the prior typically has poor predictive power.</p>
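<p>This decomposition is easy to see in a conjugate Bernoulli example: the log marginal likelihood equals the sum of sequential one-step-ahead log predictive densities, and the early terms, which predict from near the prior, drag the sum down. A minimal Python sketch (the uniform prior and the simulated Bernoulli sequence are illustrative assumptions):</p>

```python
import numpy as np
from math import lgamma, log

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=20)  # a Bernoulli sequence
n, s = len(y), int(y.sum())

# closed-form log marginal under theta ~ Uniform(0,1):
# p(y_1..n) = s! (n-s)! / (n+1)!
log_marginal = lgamma(s + 1) + lgamma(n - s + 1) - lgamma(n + 2)

# the same quantity as a sum of sequential predictives,
# log p(y) = sum_i log p(y_i | y_1..i-1)
log_seq = 0.0
s_past = 0
for i, yi in enumerate(y):
    p1 = (s_past + 1) / (i + 2)  # Laplace's rule of succession
    log_seq += log(p1 if yi == 1 else 1 - p1)
    s_past += yi

print(log_marginal, log_seq)  # the two agree
```

<p>The telescoping product of posterior predictive probabilities recovers the marginal likelihood exactly, making explicit how much of it is a prediction made by (nearly) the prior alone.</p>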
<p>To me, this irrelevance to the prediction task is the larger problem of BF: BF aims to test which model is more correct, rather than which model fits the data better. Worse, a model consists of two parts: the structure and the magnitude (the specific values in the prior). To appeal to BF, you need to do well on both parts. At some point, it is a test of the prior rather than a test of the model. In contrast, in hypothesis testing / LOO model comparison / posterior predictive checks, the prior is less relevant or irrelevant, because these approaches examine the predictive ability of the inferred model rather than of the prior.</p>
<p>BF/marginal likelihood does have its merit: we can easily trick the empirical loss by using an overfitting model, in which the empirical loss approaches zero, while BF will typically be very small because of the large/complex parameter space in the prior. In that sense, BF <em>never</em> overfits; BF <em>always</em> underfits.</p>
<p>Can we make BF less sensitive to priors?
Yes: use intrinsic BMA, or its $n=1$ limit, the pseudo-BMA (LOO-elpd weighting).</p>
<p>Can we monitor the empirical loss to test whether the model is <em>true</em> or <em>false</em> (rather than <em>good</em> or <em>bad</em>)?
Yes, stay tuned.</p>Yuling YaoI read an arxiv preprint “History and Nature of the Jeffreys-Lindley Paradox” by Eric-Jan Wagenmakers and Alexander Ly. It is a comprehensive journey that reviews the development of the “Jeffreys-Lindley Paradox”, or what is typically called the Lindley Paradox: we can reject a point null at p =0.0001 while the Bayes factor (BF) may favor this point null at BF = 1000.Terrace and gradient2021-10-05T00:00:00+00:002021-10-05T00:00:00+00:00https://www.yulingyao.com/blog/2021/gradient<p>I come across a paper <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4306294/">“The Adaptive Biasing Force Method: Everything You Always Wanted To Know but Were Afraid To Ask”</a> by Jeffrey Comer et al. When comparing the adaptive biasing force method (gradient based method) and importance sampling based methods (zero-order method), the authors concluded that</p>
<blockquote>
<p>From a mathematical viewpoint, the adaptive biasing force method, just like adaptive biasing potential methods, is an adaptive importance-sampling procedure. There is, however, a salient difference between these two techniques. In the latter, the potential of mean force or, equivalently, the corresponding probability distribution along the transition coordinate is being adapted. In contrast, the former relies on biasing the force, i.e., the gradient of the potential. This difference is more important than it might appear at first sight, as potentials and probability distributions are global properties whereas gradients are defined locally. In terms of probability distributions, it means that the count of samples in the neighborhood of a given value of the transition coordinate is insufficient to estimate probability. Knowledge of the underlying probability distribution over a much broader range of $\xi$ is required. This may considerably impede efficient adaptation. In contrast, all that is needed to estimate the gradient is the knowledge of local behavior of the potential of mean force. Other regions along the transition coordinate do not have to be visited. Thus, in many instances, adaptation proceeds markedly faster. Using a common metaphor, the difference between the adaptive biasing potential and adaptive biasing force methods can be compared to inundating the valleys of the free-energy landscape as opposed to plowing over its barriers to yield an approximately flat terrain, conducive to unhampered diffusion.</p>
</blockquote>
<p>I like the plowing metaphor. I found a photo of Rice Terraces in Yunnan:</p>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/2007_1206_Cleared_Hani_rice_terraces.jpg/640px-2007_1206_Cleared_Hani_rice_terraces.jpg" />
<p>which is in contrast to:</p>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/2017_Aerial_view_Hoover_Dam_4774.jpg/600px-2017_Aerial_view_Hoover_Dam_4774.jpg" />
<p>Aside from the context of free-energy computation, the same reasoning behind the metaphor suggests that a gradient-based method is often a dual alternative to a zero-order method:</p>
<ol>
<li>In survival analysis, the Nelson–Aalen estimator is sort of the gradient version of the Kaplan–Meier estimator (product limit).</li>
<li>In optimization, finding the mode of a convex function is equivalent to finding the minimum of the absolute-gradient function.</li>
<li>In cross-validation, the jackknife is the gradient-alternative to importance sampling.</li>
<li>In optimization convergence test, we can either monitor if the objective is stable, or if the gradient becomes zero.</li>
<li>In MCMC convergence test, we can either monitor if the sample draws have mixed, or if the gradient of the log density has mean zero.</li>
</ol>
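<p>The last item can be illustrated with a toy check (a sketch, assuming a standard normal target): draws that truly come from $p$ have a score $\frac{d}{dx}\log p(x)$ with mean zero, while a chain stuck away from the target does not.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# gradient-based convergence check: for draws from p = N(0,1),
# the score d/dx log p(x) = -x should have mean zero
draws = rng.normal(0.0, 1.0, size=100000)
score = -draws
print(score.mean())  # near 0 for well-mixed draws

# a chain centered at the wrong place fails the same check
bad_draws = rng.normal(2.0, 1.0, size=100000)
print((-bad_draws).mean())  # near -2, flagging non-convergence
```

<p>The zero-order analogue would compare sample moments of the draws with those of the target; the score test only needs the local gradient of the log density.</p>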
<p>Should we compute more gradients?</p>Yuling YaoI come across a paper “The Adaptive Biasing Force Method: Everything You Always Wanted To Know but Were Afraid To Ask” by Jeffrey Comer et al. When comparing the adaptive biasing force method (gradient based method) and importance sampling based methods (zero-order method), the authors concluded that