<h1>Yuling Yao’s Blog</h1>
<p>Bayesian Statistics, Machine Learning. Yuling Yao. Jekyll feed https://www.yulingyao.com/blog/feed.xml, updated 2022-08-29T04:18:12+00:00.</p>
<h1>What is wrong with this marginalize-out trick</h1>
<p>2022-08-23, https://www.yulingyao.com/blog/2022/marginal</p>
<p>Consider a normal-normal model with vector data $y$ and scalar parameters $\mu$ and $\sigma$, written in the following Stan code<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data {
int<lower=0> N;
vector[N] y;
real<lower=0> tau;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
mu ~ normal(0, tau);
y ~ normal(mu, sigma);
}
</code></pre></div></div>
<p>Tau is a fixed hyper-parameter, say 10. We can make inference on $\mu$ and $\sigma$. That is easy.</p>
<p>But now I decide to apply the marginalization trick to get rid of $\mu$. That is easy because it is a normal-normal model, so the marginalized model is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data {
int<lower=0> N;
vector[N] y;
real<lower=0> tau;
}
parameters {
real<lower=0> sigma;
}
model {
y ~ normal(0, hypot(sigma, tau));
}
</code></pre></div></div>
<p>The problem is that these two models are not the same. Mathematically, the full joint model reads</p>
\[y\vert \mu , \sigma \sim N(\mu, \sigma^2),~~
\mu\sim N(0, \tau^2),\]
<p>It looks so tempting to marginalize out $\mu$ and write</p>
\[y\vert \sigma \sim N(0, \sigma^2+ \tau^2).\]
<p>But they just cannot be the same: the MAP estimate of model 1 is $\sigma^2 \approx Var(y)$, while for model 2 it is $\sigma^2= \sum_{i=1}^n y_i^2 / n - \tau^2$.</p>
<p>The problem is <strong>y</strong> is a vector. $y_i$ are conditionally independent given $\mu$ and $\sigma$, but not so when only conditioning on $\sigma$. It is true that the marginal-marginal of $y_i$ is</p>
\[y_i\sim N(0, \sigma^2+ \tau^2).\]
<p>However, the joint-marginal is no longer factorizable. Indeed, Cov$(y_i, y_j)= \tau^2$. So the correct marginalized-out model $y \vert \sigma$
should be a MVN with mean 0 and a covariance matrix whose diagonals are $\sigma^2+ \tau^2$ and off-diagonals $\tau^2$.</p>
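<p>The covariance claim is easy to check by simulation. Below is a minimal sketch, assuming illustrative values $\tau=2$ and $\sigma=1$ (not from the post): each replication draws one shared $\mu$ and then two observations, so the pair $(y_1, y_2)$ comes from the joint marginal.</p>

```python
import random

def simulate_pairs(tau, sigma, n_rep, seed=0):
    """Draw (y_1, y_2) pairs that share a single mu within each replication."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_rep):
        mu = rng.gauss(0.0, tau)                      # mu ~ N(0, tau^2)
        pairs.append((rng.gauss(mu, sigma), rng.gauss(mu, sigma)))
    return pairs

def sample_cov(pairs):
    """Unbiased sample covariance of the two coordinates."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    return sum((a - m1) * (b - m2) for a, b in pairs) / (n - 1)

tau, sigma = 2.0, 1.0                                 # illustrative values
pairs = simulate_pairs(tau, sigma, n_rep=200_000)
cov_12 = sample_cov(pairs)                            # should be near tau^2
var_1 = sample_cov([(a, a) for a, _ in pairs])        # should be near sigma^2 + tau^2
print(f"Cov(y_1, y_2) ~ {cov_12:.3f}  (tau^2 = {tau**2})")
print(f"Var(y_1)      ~ {var_1:.3f}  (sigma^2 + tau^2 = {sigma**2 + tau**2})")
```

<p>A factorized model such as <code class="language-plaintext highlighter-rouge">y ~ normal(0, hypot(sigma, tau))</code> gets the variance right but implicitly sets this covariance to zero.</p>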
<p><strong>The bottom line:</strong> <a href="https://mc-stan.org/docs/2_20/stan-users-guide/rao-blackwell-section.html">Marginalization</a> is a great trick to boost computing efficiency. But it is your obligation to validate the conditional independence after the marginalization.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Bob Carpenter wrote the code. Bob, Charles and I wasted one hour discussing this toy example. Please do not let our employers know what we are doing. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1>Alternatives to two stage modeling</h1>
<p>2022-08-22, https://www.yulingyao.com/blog/2022/two-stage</p>
<p>Sometimes a model can be decomposed into modules and we may run inference separately. This task comes up a lot in cut-feedback, SMC, causal inference (two stage regression), multiple imputation, and PK-PD modeling.</p>
<p>For the simplest example, consider a Stan model with data <code class="language-plaintext highlighter-rouge">y</code> and parameters <code class="language-plaintext highlighter-rouge">mu</code>, <code class="language-plaintext highlighter-rouge">sigma</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>y ~ normal (mu, sigma);
</code></pre></div></div>
<p>For some reason, we have already fitted <code class="language-plaintext highlighter-rouge">mu</code> from a different module or from a different dataset, and we have obtained draws $\mu_1, \dots, \mu_S$. The goal is to make inference on $p(\sigma \vert y, \mu_1, \dots, \mu_S)$.</p>
<p>To be clear, by now we have already lost full Bayesianity since we do not fit a joint model. But hey, we are inclusive of non-Bayesian methods.</p>
<p>There are three seemingly reasonable approaches to the second stage model:</p>
<ol>
<li>
<p><strong>Multiple imputation.</strong> We run the model <code class="language-plaintext highlighter-rouge">y ~ normal (mu[i], sigma);</code> separately for each $i$ and collect draws from $p(\sigma \vert y, \mu_i)$; we then mix these draws altogether. This is the method used in MI and cut.</p>
</li>
<li>
<p><strong>Plugin estimate.</strong> When we do a two stage least-squares fit, we simply plug in the first stage point estimate, say the posterior mean. This amounts to a new model $y \sim \mathrm{normal} (\bar {\mu}, \sigma)$, where $\bar {\mu}= 1/S \sum_{i=1}^S \mu_i$.</p>
</li>
<li>
<p><strong>Mixed log likelihood.</strong> At least seemingly doable, we may also mix the log density from these draws, which in Stan reads</p>
</li>
</ol>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for (i in 1:S)
// 1.0 / S forces real division; 1/S is integer division and equals 0 for S > 1
target += 1.0 / S * normal_lpdf(y | mu[i], sigma);
</code></pre></div></div>
<p>In this model example, the mixed-log-likelihood approach is identical to the plugin estimate, although generally all three methods will differ. Using the conditional variance formula, we can see that multiple imputation delivers the largest estimate of $\sigma$.</p>
<p>OK, I know that in most cases approach 1 is the only acceptable answer. The justification is straight from the Bayes rule:</p>
\[p(\sigma \vert y) = \int p(\sigma \vert y, \mu) p(\mu \vert y ) d\mu.\]
<p>My controversial objection is that the Bayes rule is only relevant if we are running a joint model and inferring $\mu$ and $\sigma$ together. That is SMC. But in a situation like cut, we are placing doubt on the model in the first place, and keeping the obsession with the Bayes rule seems a little bit stubborn to me.</p>
<p>Approach 1 and approach 3 differ in how they mix the conditional sampling model $p(y \vert \sigma, \mu)$. Approach 1 is using a mixture (coherent with the joint model)</p>
\[p(y \vert \sigma) := \int (p(y \vert \sigma, \mu) p(\mu \vert \sigma) ) d\mu,\]
<p>while approach 3 is using log-linear-pooling (this line does not correspond to any joint model):</p>
\[\log p(y \vert \sigma) := \int \log p(y \vert \sigma, \mu) p(\mu \vert y) d\mu + Constant.\]
<p>I wonder if this approach 3 has any actual application. I do not know.</p>
<h1>Score matching, Bayesian predictions, tempering, and invariance</h1>
<p>2022-08-20, https://www.yulingyao.com/blog/2022/score</p>
<h2 id="score-matching">Score matching</h2>
<p>Suppose that we observe a sequence of data $y=\{y_i \in R_m \mid 1\leq i \leq n\}$ coming independently from an unknown distribution $p_{true}$; we would like to evaluate a forecast given by a probability density function $p(y)$. For example, we may use the logarithm score $\sum_{i=1}^n \log p(y_i)$ to assess the forecast.</p>
<p>But what if the predictive pdf is only known up to a multiplicative constant? That is, we are only able to evaluate the unnormalized density $q(y) = c\, p(y)$ for some unknown $c$. In a typical task of parameter inference, model selection, and model averaging, we are given a set of unnormalized forecasts indexed by $\theta$: $\{q_\theta(\cdot) \mid \theta \in \Theta\}$, where each element $q_\theta(\cdot)$ is a non-negative function on $R_m$, whose normalizing constant $c(\theta) = \int_{R_m} q_\theta(y) d y$ is unknown.</p>
<p>Since the seminal work by Hyvarinen (2005), <em>score matching</em> has been a powerful tool for evaluating unnormalized predictions. The main idea is that the normalizing constant <em>disappears</em> when we look at the gradient of the log unnormalized density. The “gradient of log” of the pdf is often known as the <em>score</em> function to statisticians. We measure the difference between the score functions of the true data generating process $p_{true}$ and of the forecast $q_\theta$,</p>
\[D(p_{true}, q_\theta) = \int_{R_m} \Vert \nabla \log p_{true}(y) - \nabla \log q_{\theta} (y) \Vert^2 p_{true}(y) dy,\]
<p>hence the name <em>score matching</em>.</p>
<p>In practice, we do not know $p_{true}$; we only observe its samples $y_{1:n}$. Integrating by parts removes the unknown $\nabla \log p_{true}$, and a sample estimate of the divergence above (up to a constant) is</p>
\[H(y_{1:n}, q_{\theta})
=\frac{1}{n}\sum_{i=1}^n \left(2 \Delta_y \log q_{\theta} (y_i) + \Vert \nabla_y \log q_{\theta} (y_i) \Vert^2 \right).\]
<p>In a larger universe of scoring rules, this $H(y, q_{\theta})$ is known as the Hyvarinen score. In the limiting case as sample size $n \to \infty$, this sample estimate converges in the sense that $H(y_{1:n}, q_{\theta}) \to D(p_{true}, q_\theta)$+ Constant, where the constant does not depend on $q_{\theta}$.</p>
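<p>As a minimal sketch of how such a score behaves, take a one-dimensional unnormalized Gaussian $q(y) = C \exp(-(y-m)^2/(2s^2))$ and the sample score $\frac{1}{n}\sum_i \left(2 \Delta_y \log q(y_i) + \Vert \nabla_y \log q(y_i)\Vert^2\right)$. The data-generating values (mean 1, scale 2) and the function names are assumptions for illustration only.</p>

```python
import random

def hyvarinen_score(ys, grad_log_q, lap_log_q):
    """Sample score: mean of 2 * Laplacian(log q) + (gradient of log q)^2."""
    return sum(2.0 * lap_log_q(y) + grad_log_q(y) ** 2 for y in ys) / len(ys)

def gaussian_score_fns(m, s):
    """Gradient and Laplacian in y of log q for q(y) = C * exp(-(y - m)^2 / (2 s^2)).

    The constant C never appears: it vanishes once we differentiate log q.
    """
    grad = lambda y: -(y - m) / s**2
    lap = lambda y: -1.0 / s**2
    return grad, lap

rng = random.Random(1)
ys = [rng.gauss(1.0, 2.0) for _ in range(50_000)]     # "true" process, assumed for the demo

h_true = hyvarinen_score(ys, *gaussian_score_fns(1.0, 2.0))
h_wrong_mean = hyvarinen_score(ys, *gaussian_score_fns(0.0, 2.0))
h_wrong_scale = hyvarinen_score(ys, *gaussian_score_fns(1.0, 1.0))
print(h_true, h_wrong_mean, h_wrong_scale)            # lower is better
```

<p>The forecast with the correct mean and scale should receive the lowest score, even though none of the three forecasts was ever normalized.</p>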
<h2 id="unnormalized-models-in-bayesian-statistics">Unnormalized models in Bayesian statistics</h2>
<p>There are three levels of unnormalized models in Bayesian statistics.</p>
<h3 id="level-1-a-harmless-normalization-constant-comes-from-the-bayes-rule">Level 1: A harmless normalization constant comes from the Bayes rule.</h3>
<p>In classical parameter inference, the posterior density of a parameter is typically given in an unnormalized form: $p(\theta\vert y) \propto p(y\vert \theta) p(\theta)$, where the normalizing constant is the marginal likelihood $\int p(y\vert \theta) p(\theta) d \theta = p(y)$. For the purpose of Bayesian computation, this normalizing constant is irrelevant in MCMC, variational inference, or importance sampling. Notably, with posterior draws $\theta_1, \dots, \theta_S$, the posterior predictive distribution is tractable and appropriately normalized,</p>
\[p(\tilde y \vert y) = \int p(\tilde y \vert \theta)p(\theta \vert y) d \theta \approx \frac{1}{S} \sum_{i=1}^S p(\tilde y \vert \theta_i).\]
<h3 id="level-2-intractable-posterior-predictive-distribution">Level 2: Intractable posterior predictive distribution.</h3>
<p>Sometimes we only know the posterior predictive density up to a constant. For example, in modern literature on calibration, we may address the potential overconfidence of a prediction via tempering, such that</p>
\[p(\tilde y\vert y, \lambda)= \frac{1}{z(\lambda)} p(\tilde y \vert y)^\lambda, ~
z(\lambda)= \int p(\tilde y \vert y)^\lambda d \tilde y.\]
<p>Intuitively, a smaller $\lambda \in (0,1)$ flattens the prediction, resulting in less confidence. The Hyvarinen score still applies.</p>
<h3 id="level-3-intractable-likelihood">Level 3: Intractable likelihood.</h3>
<p>The likelihood may also be intractable, meaning we are only able to evaluate
$q(y\mid \theta) \propto p(y\mid \theta),$ while the pointwise normalizing constant
$z(\theta)= \int q(y\mid \theta) dy$ is unknown. These types of models are often called <strong>doubly intractable</strong>. For example, in <em>alpha-likelihood</em>, the likelihood function is</p>
\[p(y\vert \theta, \lambda) \propto p(y\vert \theta)^ \lambda, ~
z(\lambda, \theta)= \int p(y\vert \theta)^ \lambda d y.\]
<p>Aside from how to sample from a doubly intractable model, even if we do obtain posterior draws $(\theta_1, \lambda_1), \dots, (\theta_S, \lambda_S)$, this time the posterior predictive distribution is <strong>a mixture of unnormalized</strong> densities:</p>
\[p(\tilde y\vert y)= \int p(\tilde y\vert \theta, \lambda) p( \theta, \lambda \vert y) d\theta d\lambda \approx \frac{1}{S} \sum_{s=1}^S \frac{1}{z(\lambda_s, \theta_s)} p(\tilde y\vert \theta_s)^{\lambda_s}.\]
<p>The Hyvarinen score does not apply to a mixture/summation of unnormalized densities. It is clear that the score function is not invariant under this procedure:</p>
\[\nabla \log \left(\sum_{i=1}^S c_if_i(y) \right) \neq \nabla \log \left(\sum_{i=1}^S f_i(y)\right).\]
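<p>A quick numerical check of this non-invariance, sketched with two unnormalized Gaussian bumps and arbitrary constants chosen for illustration: rescaling a single density leaves the score unchanged, while reweighting the components of a sum changes it.</p>

```python
import math

def num_grad_log(f, y, h=1e-5):
    """Central finite-difference approximation of d/dy log f(y)."""
    return (math.log(f(y + h)) - math.log(f(y - h))) / (2.0 * h)

f1 = lambda y: math.exp(-0.5 * y**2)             # unnormalized N(0, 1)
f2 = lambda y: math.exp(-0.5 * (y - 3.0)**2)     # unnormalized N(3, 1)

y0 = 1.0
single = num_grad_log(f1, y0)
single_scaled = num_grad_log(lambda y: 7.0 * f1(y), y0)      # same score as f1
mix = num_grad_log(lambda y: f1(y) + f2(y), y0)
mix_reweighted = num_grad_log(lambda y: 0.1 * f1(y) + 5.0 * f2(y), y0)

print(f"single: {single:.4f}  vs scaled: {single_scaled:.4f}")
print(f"mixture: {mix:.4f}  vs reweighted mixture: {mix_reweighted:.4f}")
```

<p>The first pair agrees up to finite-difference error; the second pair does not, which is why unknown per-component constants cannot be ignored in a mixture.</p>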
<h2 id="matching-for-doubly-intractable-bayesian-predictions-or-a-mixture-of-unnormalized-densities">Matching for doubly intractable Bayesian predictions, or a mixture of unnormalized densities?</h2>
<p>“Gradient of log” is a great operator because it throws away normalizing constants. That is, for any positive constant $c$ and any continuous density function $p(y)$,</p>
\[{\frac{d}{dy} \log} ( {c} p(y) ) = {\frac{d}{dy} \log} ( p(y) ).\]
<p>But what if we now want to evaluate a sum of unnormalized functions, $\sum_{i=1}^S p_i(y)$?</p>
<p>Does there exist a non-trivial operator, #, a mapping from $R^R$ to $R^R$, such that</p>
\[{\color{red} \#} ( \sum_{i=1}^S {\color{orange}c_i} p_i(y) ) = {\color{red} \#} ( \sum_{i=1}^S p_i(y) ).\]
<p>The answer is negative for any $S\geq 2$.</p>
<p>A heuristic proof is that we can write any function as a Taylor series expansion. If an operator satisfied the property above, it would map any two functions to the same output, so that any two predictions are evaluated to be the same. That is not useful.</p>
<h3 id="the-bottomline">The bottomline:</h3>
<p>Score matching is a useful tool for evaluating unnormalized models. The Hyvarinen score applies to a tempered mixture, but does not apply to a mixture of tempered densities, or any doubly intractable Bayesian predictions.</p>
<p>Furthermore, we can mathematically prove that there is no operator that we can use to match a mixture of unnormalized densities.</p>
<h1>How to generate an unbiased estimate of 1/E[x] using one random draw?</h1>
<p>2022-04-19, https://www.yulingyao.com/blog/2022/nonlinearMC</p>
<p><strong>Quiz: you are given ONE random draw $x$ that was drawn from a density $p(x)$. Could you produce an unbiased estimate of $1/E_p[X]$?</strong></p>
<p>You might want to think about this quiz before reading my solution.</p>
<p>Apart from mathematical fun, this type of problem comes up in stochastic approximation, in which we need an unbiased estimate using a very small number of Monte Carlo draws. The unbiasedness here means that this sampling step will be repeated many times, but each time you are only shown one sample point $x$, and we wish the estimate to be unbiased under repeated sampling.</p>
<p>Here, the obvious wrong answer is to use $1/x$. You can try</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n=10000
x=rbeta(n,2,2)
mean(1/x)
</code></pre></div></div>
<p>It is clear that E$[1/x]= 3$ while our desired quantity 1/E$[x]= 2$. Indeed, by Jensen's inequality, E$[1/x] >$ 1/E$[x]$ for any non-degenerate positive $x$.</p>
<p>How about some Taylor series expansion? Something like $1/E[x] = 1- (E[x]-1) + O((E[x]-1)^2)$? It is legitimate, but then you get the crude approximation $2-x$, which requires believing that $E[x] \approx 1$, something we typically do not know in the first place.</p>
<p>I find one solution from rejection sampling. The idea is that self-normalized
importance sampling is only unbiased asymptotically, while rejection sampling is always unbiased even if you have MC size 1.</p>
<p>Here is the method. To make it work, I need to know an upper bound of $x$; it has to be a positive, bounded variable. Say the upper bound is $c$. Each time I see a realization $x$, I independently generate a random number $u$ from Uniform(0,1). If $u$ is smaller than $x/c$, accept, and report $1/x$. If $u$ is larger than $x/c$, do not report any estimate.</p>
<p>Then whenever I report the accepted $1/x$, it is an unbiased estimate of 1/E$[x]$.</p>
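<p>A short derivation, sketched under the assumption that $0 < x \leq c$ almost surely, shows why this works. Given $x$, the acceptance probability is $x/c$, so the accepted draws have density</p>
\[p_{acc}(x) = \frac{(x/c)\, p(x)}{\int (x^\prime/c)\, p(x^\prime)\, dx^\prime} = \frac{x\, p(x)}{E_p[X]},\]
<p>and therefore</p>
\[E\left[\frac{1}{x} \,\Big\vert\, \mathrm{accept}\right] = \int \frac{1}{x} \cdot \frac{x\, p(x)}{E_p[X]}\, dx = \frac{1}{E_p[X]}.\]
<p>The bound $c$ cancels, so it only affects how often an estimate is reported, not its expectation.</p>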
<p>Here is a demo code for $p(x)=$ Beta$(2,2)$:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n=1000000
x=rbeta(n,2,2)
u=runif(n,0,1)
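unbiased_estimate=rep(NA, n)
# accept when u < x/c; for Beta(2,2) the upper bound is c = 1, so the test is u < x
unbiased_estimate[u<x]= (1/x)[u<x]
#check the answer: the difference should be close to 0
mean(unbiased_estimate, na.rm = T)- 1/mean(x)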
</code></pre></div></div>
<h1>Statistics intuitions for integrals</h1>
<p>2022-03-29, https://www.yulingyao.com/blog/2022/gamma</p>
<p>I have not done any math for a long while. Today I happened to need to compute an integral</p>
\[S(k, \sigma) =\int_{0}^\infty x\log (x) \frac{1}{\sigma} \left(1+\frac{kx}{\sigma}\right)^{-1/k-1} dx.\]
<p>It is the expectation of $x\log x$ under the generalized Pareto distribution. Surely it will be finite as long as $k <1$.</p>
<p>I tried for a while, and then I was very sure I could not solve it. So I opened some symbolic integral tool and the result turned out easy</p>
\[S(k, \sigma)= \frac{\sigma \left( 1-\mathrm{HarmonicNumber}[-2+\frac{1}{k}] - \log(\frac{k}{\sigma}) \right) }{1-k}.\]
<p>Except that I do not understand what <code class="language-plaintext highlighter-rouge">HarmonicNumber</code> is. I think it is some special function, so I looked it up. <a href="https://en.wikipedia.org/wiki/Harmonic_number">Wikipedia</a> told me that</p>
<blockquote>
<p>In mathematics, the n-th harmonic number is the sum of the reciprocals of the first n natural numbers:</p>
</blockquote>
\[H_{n}=1+{\frac {1}{2}}+{\frac {1}{3}}+\cdots +{\frac {1}{n}}=\sum _{k=1}^{n}{\frac {1}{k}}.\]
<p>Except it is not helpful to me cuz apparently I have a non-integer $n= -2+\frac{1}{k}$ here. I studied complex analysis in college but I have never used it since. But that is ok, I trust my symbolic integral tool.</p>
<p>Indeed I only want to evaluate this integral for $k$ near 1. Because the mean and variance of the generalized Pareto distribution are of the order $O((1-k)^{-1})$ and $O((1-k)^{-2}(1-2k)^{-1})$ respectively, my best conjecture is that this $S$ should be $O((1-k)^{-m}), 1\leq m \leq 2$ as $k$ gets close to 1.</p>
<p>So I searched one more minute I found that $H_{x}= \frac{\Gamma^\prime(x+1)}{\Gamma(x+1)}+\gamma$, in which $\gamma$ is the Euler constant and $\Gamma$ is the Gamma function.</p>
<p>The appearance of Euler constant and Gamma function in applied statistics is like a six pleat shirring on a shirt: fancy to the wearer but seldom useful to the audience.</p>
<p>It appeared that I needed the derivative of the Gamma function near 0. But I found that $\frac{\Gamma^\prime(x)}{\Gamma(x)}$ is itself called the <a href="https://en.wikipedia.org/wiki/Digamma_function">digamma function</a> $\psi(x)$. OK, I am not proud of being ignorant here, but it is still fun to learn. So I looked up Wikipedia again and I found $\psi(x)\approx \log x - 1/(2x)$. I plugged this approximation into my expression and it is not the same as my conjecture. Ohh, of course, the $\psi(x)\approx \log x - 1/(2x)$ approximation is only applicable if $x$ is large. For small $x\approx 0$, I found that $\psi(x) \approx -1/x - \gamma$. I plugged this in and then $\gamma$ cancelled out. So the final answer is that $S(k, \sigma) = \sigma k / (1-k)^2 $ + lower-order terms as $k$ goes to 1. Done.</p>
<p>But this is not why I wrote this post. The point is that sometimes statistics intuitions can help with tedious math. To be clear, this math problem is only tedious to me cuz I am ignorant of the digamma function and the gamma function. I am sure the previous problem is trivial to Euler. That said, I have already used stats intuition once: I know the order must be between -1 and -2 because $x\log x$ is bounded between the first and second moments.</p>
<p>Indeed, a more statistically intuitive solution here is that I can simply replace the generalized Pareto distribution by a Pareto distribution. This time I can do it by hand:</p>
\[S(k,1)\approx
\int_{1}^{\infty} r \log r \frac{1}{k} r^{-1/k-1} dr= k(1-k)^{-2}.\]
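<p>The closed form $k(1-k)^{-2}$ can also be checked numerically. The sketch below is an assumption-laden illustration rather than part of the original calculation: it substitutes $t=\log r$, which turns the integral into $\int_0^\infty a\, t\, e^{-(a-1)t} dt$ with $a=1/k$, and applies a plain midpoint rule on a truncated range.</p>

```python
import math

def pareto_xlogx_mean(k, n_steps=200_000):
    """Midpoint-rule estimate of E[r log r] under the Pareto density (1/k) r^(-1/k-1), r >= 1.

    With t = log r the integrand becomes a * t * exp(-(a - 1) * t), where a = 1/k.
    """
    a = 1.0 / k
    upper = 60.0 / (a - 1.0)          # the integrand is negligible beyond this point
    dt = upper / n_steps
    total = 0.0
    for i in range(n_steps):
        t = (i + 0.5) * dt
        total += a * t * math.exp(-(a - 1.0) * t)
    return total * dt

for k in (0.5, 0.9):
    numeric = pareto_xlogx_mean(k)
    closed_form = k / (1.0 - k) ** 2
    print(f"k={k}: numeric={numeric:.4f}, closed form={closed_form:.4f}")
```

<p>Both values of $k$ are arbitrary test points; the agreement only confirms the hand calculation for the Pareto surrogate, not the generalized Pareto integral itself.</p>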
<p>This Pareto expression is different, but when $k$ is close to 1 it has the same order as the one I obtained using the digamma function, which is enough for many crude approximations. Again, it would be nice if I had known more about the gamma function, but solving a tedious math problem by some simple statistics approximation is equally fun.</p>
<h1>Marginal likelihood and the Lindley paradox</h1>
<p>2021-11-22, https://www.yulingyao.com/blog/2021/BF</p>
<p>I read an arXiv preprint “History and Nature of the Jeffreys-Lindley Paradox” by Eric-Jan Wagenmakers and Alexander Ly. It is a comprehensive journey that reviews the development of the “Jeffreys-Lindley Paradox”, or what is typically called the Lindley Paradox: we can reject a point null at p =0.0001 while the Bayes factor (BF) may favor this point null at BF = 1000.</p>
<p>Wagenmakers and Ly pointed out two approaches to escape the Lindley Paradox: either to avoid using a point hypothesis in the Bayes test, or to avoid a vague prior. Notably, we may still have the Lindley Paradox when the null is a spiky continuous distribution rather than the point mass.</p>
<p>To make the discussion concrete, consider a binomial experiment $y\sim \mathrm{Bin} (n,\theta)$ in which we observe $y = 5001$ and $n=10000$. We specify a point null $\theta=.5$ and the alternative $\theta\neq .5$, or $\theta \sim $ Uniform (0,1). The p-value for the null is approximately $\Pr(z> 1/ (0.5 / \sqrt{n}))= \Pr(z>200)$=0, while BF is some very big number as $\theta=.5$ predicts the outcome much better than the vague prior $\theta \sim $ Uniform (0,1).</p>
<p>We have made this point in our hierarchical stacking article: a model being true or false is not directly related to it being good or bad in terms of data fitting. Indeed a wronger model may make a better prediction depending on your chosen metric. In the Lindley paradox, at least I think, a Bayesian shall not judge that $\theta=.5$ in light of a very big BF, because we know a priori that Pr(point null)=0.</p>
<p>The marginal likelihood is only weakly related to how the model fits the data. It reflects the average leave-q-out log predictive density when q varies from 0 to $n$, among which $q=n$ accounts for a disproportionate share because the prior typically has bad predictive power.</p>
<p>To me, this irrelevance to the prediction task is the larger problem of BF: BF aims to test which model is more correct, rather than which model fits the data better. Worse, a model consists of two parts: the structure and the magnitude (the specific value in the prior). To appeal to BF, you need to do well on both parts. At some point, it is a test of the prior rather than a test of the model. In contrast, in hypothesis testing/LOO-model comparison/posterior predictive check, the prior is not or less relevant because these approaches examine the prediction ability of the inferred model rather than the prior.</p>
<p>BF/marginal likelihood does have its merit: we can easily trick empirical loss by using an overfitting model, in which the empirical loss approaches zero while BF will typically be very small because of the large/complex parameter space in the prior. In that sense, BF <em>never</em> overfits; BF <em>always</em> underfits.</p>
<p>Can we make BF less sensitive to priors?
Yes, use intrinsic BMA, or its $n=1$ limit, the pseudo-BMA (LOO-elpd weighting).</p>
<p>Can we monitor empirical loss to test whether the model is <em>true</em> or <em>false</em> (rather than <em>good</em> or <em>bad</em>)?
Yes, stay tuned.</p>
<h1>Terrace and gradient</h1>
<p>2021-10-05, https://www.yulingyao.com/blog/2021/gradient</p>
<p>I came across a paper <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4306294/">“The Adaptive Biasing Force Method: Everything You Always Wanted To Know but Were Afraid To Ask”</a> by Jeffrey Comer et al. When comparing the adaptive biasing force method (gradient based method) and importance sampling based methods (zero-order method), the authors concluded that</p>
<blockquote>
<p>From a mathematical viewpoint, the adaptive biasing force method, just like adaptive biasing potential methods, is an adaptive importance-sampling procedure. There is, however, a salient difference between these two techniques. In the latter, the potential of mean force or, equivalently, the corresponding probability distribution along the transition coordinate is being adapted. In contrast, the former relies on biasing the force, i.e., the gradient of the potential. This difference is more important than it might appear at first sight, as potentials and probability distributions are global properties whereas gradients are defined locally. In terms of probability distributions, it means that the count of samples in the neighborhood of a given value of the transition coordinate is insufficient to estimate probability. Knowledge of the underlying probability distribution over a much broader range of $\xi$ is required. This may considerably impede efficient adaptation. In contrast, all that is needed to estimate the gradient is the knowledge of local behavior of the potential of mean force. Other regions along the transition coordinate do not have to be visited. Thus, in many instances, adaptation proceeds markedly faster. Using a common metaphor, the difference between the adaptive biasing potential and adaptive biasing force methods can be compared to inundating the valleys of the free-energy landscape as opposed to plowing over its barriers to yield an approximately flat terrain, conducive to unhampered diffusion.</p>
</blockquote>
<p>I like the plowing metaphor. I found a photo of Rice Terraces in Yunnan:</p>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/2007_1206_Cleared_Hani_rice_terraces.jpg/640px-2007_1206_Cleared_Hani_rice_terraces.jpg" alt="Hani rice terraces in Yunnan" />
<p>which is in contrast to:</p>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/2017_Aerial_view_Hoover_Dam_4774.jpg/600px-2017_Aerial_view_Hoover_Dam_4774.jpg" alt="Aerial view of Hoover Dam" />
<p>Aside from the context of free energy computation, the exact same reason implied by the previous metaphor suggests that the gradient-based method is often an alternative, dual approach to the zero-order method:</p>
<ol>
<li>In survival analysis, the Nelson–Aalen estimator is sort of the gradient version of the Kaplan–Meier estimator (product limit).</li>
<li>In optimization, finding the mode of a convex function is equivalent to finding the minimum of the abs(gradient) function.</li>
<li>In cross-validation, the jackknife is the gradient-alternative to importance sampling.</li>
<li>In optimization convergence test, we can either monitor if the objective is stable, or if the gradient becomes zero.</li>
<li>In MCMC convergence test, we can either monitor if the sample draws have mixed, or if the gradient of the log density has mean zero.</li>
</ol>
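<p>To make the first item concrete, here is a minimal sketch on made-up data with no tied times. Kaplan–Meier multiplies factors $(1 - d_i/n_i)$, while Nelson–Aalen sums the local hazard increments $d_i/n_i$; exponentiating the negative sum gives a survival curve that tracks the product-limit one.</p>

```python
import math

def km_and_na(times, events):
    """Kaplan-Meier survival and exp(-Nelson-Aalen hazard) at each event time.

    Assumes no tied times; events[i] is 1 for an observed event, 0 for censoring.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, cum_hazard = 1.0, 0.0
    rows = []
    for i in order:
        if events[i]:
            hazard = 1.0 / at_risk          # local increment: the "gradient" piece
            surv *= 1.0 - hazard            # Kaplan-Meier: product of (1 - hazard)
            cum_hazard += hazard            # Nelson-Aalen: running sum of hazards
            rows.append((times[i], surv, math.exp(-cum_hazard)))
        at_risk -= 1                        # events and censored units both leave the risk set
    return rows

times = [2.0, 3.5, 1.2, 7.1, 4.4, 5.0, 0.8, 6.3]   # made-up data
events = [1, 1, 0, 1, 1, 0, 1, 1]
rows = km_and_na(times, events)
for t, s_km, s_na in rows:
    print(f"t={t:4.1f}  KM={s_km:.3f}  exp(-NA)={s_na:.3f}")
```

<p>Because $1-h \leq e^{-h}$, the Nelson–Aalen based curve always sits at or above the Kaplan–Meier curve, and the two are close whenever the risk set is large.</p>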
<p>Should we compute more gradients?</p>
<h1>How do we compare two numbers</h1>
<p>2021-09-15, https://www.yulingyao.com/blog/2021/number</p>
<p>I was reading an article on how a politician’s height can have a causal effect on electability. But then I realized we often have a different scale for comparing numbers when we know these numbers represent some physical objects.</p>
<p>Here are two examples:</p>
<ol>
<li>As per Google, Pete Buttigieg’s height is 5’8 and Gavin Newsom’s is 6’3, who are on the relatively short and tall end of the modern day politician’s height spectrum respectively. With these two numbers in mind, certainly 6’3 is much bigger than 5’8, right?</li>
<li>In the 2020 U.S. Presidential election, the Democratic share in TX was 47% and the Republican share was 52%. Hey, it was 47 and 52: what a tossup!</li>
</ol>
<p>The point is that 6’3 / 5’8 = 190.5 cm / 172.7 cm = 1.10, and 52 / 47 = 1.11. These two sets of comparisons have the same multiplicative difference, but why do we automatically read that 6’3 $\gg$ 5’8, while 52 $\approx$ 47?</p>
<p>One explanation is some sort of anchor effect. We encounter this arbitrary anchor choice in data visualization too: when comparing two coefficients, what $y$-axis scale are we using? Here by looking at the multiplicative difference, we have implicitly included zero as the lower end of the $y$-axis. But an adult male politician’s height cannot be zero, so maybe implicitly we have a different lower end point, or anchor, say 5’6. Then the actual multiplicative difference we are reading in mind is (6’3 - 5’6) / (5’8 - 5’6) = 9 in / 2 in = 4.5.</p>
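<p>The arithmetic is small enough to spell out. In this sketch the 5’6 anchor is the hypothetical baseline from the paragraph above, and the helper name <code class="language-plaintext highlighter-rouge">inches</code> is made up for illustration.</p>

```python
def inches(feet, inch):
    """Convert a feet-and-inches height to inches."""
    return 12 * feet + inch

tall, short = inches(6, 3), inches(5, 8)     # 75 and 68 inches
anchor = inches(5, 6)                        # hypothetical implicit baseline, 66 inches

ratio_from_zero = tall / short                              # y-axis starts at zero
ratio_from_anchor = (tall - anchor) / (short - anchor)      # y-axis starts at the anchor
vote_ratio = 52 / 47

print(f"height ratio, zero baseline:   {ratio_from_zero:.2f}")
print(f"height ratio, anchor baseline: {ratio_from_anchor:.2f}")
print(f"vote share ratio:              {vote_ratio:.2f}")
```

<p>The same two heights give a ratio near 1.1 or 4.5 depending only on where the axis starts.</p>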
<p>Another explanation is that we have mapped the parameters into some decision theory. When a computer reads 6’3, it is just some 32-bit integer. But we are not computers after all. We automatically generate a decision theory, in which the integer 6’3 is mapped to a masculine man wearing a Brooks Brothers suit and oxford shoes, while the number 52% is mapped to some annoying recounting and the reflection of 2000. None of this additional information is coded by the numbers as they are presented.</p>
<h1>MEBA: Make Empirical-Bayes Bayes again</h1>
<p>2021-08-19, https://www.yulingyao.com/blog/2021/meba</p>
<p>Assume there are some hyperparameters $\beta$ in the model involving data $y$. We have four ways to get some inference of $\beta$.</p>
<h2 id="map-is-bad">MAP is bad</h2>
<p>First, we have MAP, or empirical loss optimization. That is, for each $\beta$, we could train the model and obtain some in-sample loss $l(y_i \mid \beta )$. Then we minimize this loss:
$
\hat \beta_{MAP}= \arg\min_{\beta} \sum_{i} l(y_i \mid \beta ).
$</p>
<p>We could add some prior regularization $p(\beta)$ too, which modifies it to</p>
\[\hat \beta_{MAP}= \arg\min_{\beta} \sum_{i} l(y_i \mid \beta) - \log p(\beta).\]
<h2 id="we-can-go-loo-or-we-can-go-bayes">We can go LOO, or we can go Bayes</h2>
<p>The above procedure is attacked in two ways. One argument is that empirical loss optimization overfits because of the misuse of in-sample error. We can adjust for this error by using cross-validation. For example, combining leave-one-out CV with empirical loss optimization, we have</p>
\[\hat \beta_{LOO}= \arg\min_{\beta} \sum_{i} l( y_i \mid \beta, y_{-i}) - \log p(\beta).\]
<p>This LOO step is related to empirical Bayes if we are using LOO metrics in empirical Bayes.</p>
<p>Yet another attack on MAP is that it is a point estimate. “You overfit cuz you ignore the uncertainty”. As an attempt to fix it, we have some generalized Bayesian step:</p>
\[\log p (\beta \mid y) =- \sum_{i} l(y_i \mid \beta ) + \log p(\beta).\]
<h2 id="can-we-go-both">Can we go both?</h2>
<p>It is natural to ask: which one is better, Bayes or LOO-MAP? It depends. For example, in the context of regression, LASSO (where the hyperparameter is tuned by LOO) is much better than the Bayesian lasso (in which the hyperparameter is treated as a parameter to fit using the Bayes rule).</p>
<p>But an even larger picture is a 2 by 2 table</p>
<table>
<thead>
<tr>
<th style="text-align: center">MAP 😱</th>
<th style="text-align: center">LOO-MAP (Empirical Bayes) 😐</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><strong>Bayes</strong> 😐</td>
<td style="text-align: center">😊</td>
</tr>
</tbody>
</table>
<p>We have two directions to improve MAP: either using LOO or using Bayes. But can we combine them? Can we reach that 😊 block?</p>
<h2 id="bayesianize-the-empirical-bayes">Bayesianize the Empirical-Bayes</h2>
<p>The idea is to define a posterior density via the leave one out likelihood:</p>
\[\log p (\beta \mid y)= - \sum_{i} l( y_i \mid \beta, y_{-i}) + \log p(\beta).\]
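<p>To see what this density looks like in the simplest possible case, here is a hedged sketch on a toy problem. Everything specific in it is an assumption for illustration: the shrunken-mean predictor $\sum_{j \neq i} y_j / (n - 1 + \beta)$, the loss $l$ taken as half the squared error (a unit-variance normal log density up to a constant), the Exponential(1) prior on $\beta$, and the made-up data. It evaluates the LOO log posterior on a grid and normalizes it, so that we keep a whole distribution over $\beta$ instead of a single minimizer.</p>

```python
import math

y = [1.3, 0.5, 2.2, 1.1, 0.6, 1.8]   # made-up data
n, total = len(y), sum(y)


def loo_log_lik(beta):
    """-sum_i l(y_i | beta, y_{-i}), with l = squared error / 2."""
    return -sum(0.5 * (yi - (total - yi) / (n - 1 + beta)) ** 2 for yi in y)


def log_prior(beta):
    """Log density of an Exponential(1) prior on beta (an assumed choice)."""
    return -beta


grid = [0.05 * k for k in range(1, 401)]   # beta in (0, 20]
log_post = [loo_log_lik(b) + log_prior(b) for b in grid]

# Normalize on the grid; subtracting the max avoids overflow in exp.
m = max(log_post)
weights = [math.exp(lp - m) for lp in log_post]
z = sum(weights)
weights = [w / z for w in weights]

beta_mode = grid[log_post.index(m)]
beta_mean = sum(b * w for b, w in zip(grid, weights))

print("grid posterior mode of beta:", beta_mode)
print("grid posterior mean of beta:", beta_mean)
```

<p>A grid is only practical here because $\beta$ is a scalar; it is meant to show the shape of the target, not to suggest how to compute it in a real model.</p>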
<p>Is it justified as fully Bayesian? Yes. It can be viewed as data augmentation. Assume there is a hold-out dataset $y^\prime$. We could first use the original dataset $y$ to obtain the conditional parameter inference $p( \theta \mid y, \beta)$, and then obtain exact Bayesian inference on the hyperparameter $\beta$, namely $p( \beta \mid y, y^\prime)$, using the hold-out data $y^\prime$ and integrating out $\theta$. Now, instead of actually having this hold-out dataset, we integrate it out. That is the LOO likelihood part.</p>
<p>Is there an example in which this idea yields success? Yes: we have shown in our hierarchical stacking paper that this LOO-likelihood sampling (hierarchical stacking) yields better predictions than LOO-optimization (no-pooling stacking).</p>
<p>Are there computational advantages over LOO-MAP? Yes. LOO-MAP is often done by grid search, but we can now use gradient information (with respect to $\beta$) when sampling this density.</p>
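<p>As one illustration of using the gradient, the sketch below runs a Metropolis-adjusted Langevin sampler on a toy LOO log posterior. The sampler is my own choice for the example, not the method of the hierarchical stacking paper, and the toy target is assumed throughout: a shrunken-mean predictor, half squared-error loss, an Exponential(1) prior on $\beta$, made-up data, and sampling on $\phi = \log \beta$ so that $\beta$ stays positive.</p>

```python
import math
import random

y = [1.3, 0.5, 2.2, 1.1, 0.6, 1.8]      # made-up data
n, total = len(y), sum(y)
a = [total - yi for yi in y]            # leave-one-out sums


def log_density(phi):
    """LOO log posterior of phi = log(beta), Exponential(1) prior on beta."""
    beta = math.exp(phi)
    s = 1.0 / (n - 1 + beta)
    loo = -sum(0.5 * (yi - ai * s) ** 2 for yi, ai in zip(y, a))
    return loo - beta + phi             # "+ phi" is the log-Jacobian of beta = exp(phi)


def grad(phi):
    """Analytic derivative of log_density with respect to phi."""
    beta = math.exp(phi)
    s = 1.0 / (n - 1 + beta)
    d_beta = -s * s * sum(ai * (yi - ai * s) for yi, ai in zip(y, a))
    return beta * d_beta - beta + 1.0


def mala(n_iter, step, phi=0.0, seed=1):
    """Metropolis-adjusted Langevin sampler; returns draws of beta."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_iter):
        mean_fwd = phi + 0.5 * step ** 2 * grad(phi)
        prop = mean_fwd + step * rng.gauss(0.0, 1.0)
        mean_bwd = prop + 0.5 * step ** 2 * grad(prop)
        log_q_fwd = -((prop - mean_fwd) ** 2) / (2 * step ** 2)
        log_q_bwd = -((phi - mean_bwd) ** 2) / (2 * step ** 2)
        log_ratio = log_density(prop) - log_density(phi) + log_q_bwd - log_q_fwd
        if rng.random() < math.exp(min(0.0, log_ratio)):
            phi = prop                  # accept the gradient-informed proposal
        draws.append(math.exp(phi))
    return draws


draws = mala(n_iter=2000, step=0.8)
print("posterior mean of beta:", sum(draws) / len(draws))
```

<p>The gradient enters through the drift term of the proposal, so no grid over $\beta$ is needed; in a model with many hyperparameters, that is where a grid search stops being feasible.</p>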
<p>Can we extend this to a general inference paradigm, which will sit parallel to, if not above, MAP, Bayes, and empirical Bayes? Highly promising. I am looking forward to that.</p>Yuling YaoAssume there are some hyperparameters $\beta$ in a model involving data $y$. We have four ways to make inference on $\beta$.Decision theory is hard2021-06-04T00:00:00+00:002021-06-04T00:00:00+00:00https://www.yulingyao.com/blog/2021/decision<p>One mental challenge is decision-making with more than two options. To simplify the dilemma that your humble author is encountering, assume that you are on a long-haul flight and a friendly cabin crew member asks which of the following dishes you would prefer:</p>
<ol>
<li>chicken tikka masala,</li>
<li>chicken madras,</li>
<li>apple pie.</li>
</ol>
<p>There is a limited supply and you are asked to order your preference, which is not necessarily honored. To be fair, I don’t think these dishes are on any actual menu, but the point of this example is that options (1) and (2) are nearly identical (from your humble author’s point of view).</p>
<h2 id="selection-or--mixing">selection or mixing</h2>
<p>One psychological confusion is the difficulty of distinguishing between “selection” and “mixing”. “Ordering your preference” means first generating a list of latent preferences and then ordering them. Assume I have a sophisticated mind that automatically normalizes the latent preferences into $x_1, x_2, x_3$ such that $x_i\geq 0$ and $x_1+ x_2+ x_3=1$.</p>
<p>But it is not clear how reliable our preference generation ability is. There are two orthogonal approaches to generating these preferences. The first is one by one: we figure out how much utility we would get from having only chicken tikka masala, and so on. Maybe I prefer chicken overall much more than apple pie, so I will have $x_1=0.46, x_2=0.44, x_3=0.1$.</p>
<p>Second, we can embed this discrete problem into a larger continuous problem: we imagine there is a tasting menu that mixes these three dishes, and we consider the optimal mixing proportion. This time, because humans generally have a concave utility function (diminishing returns from more of the same thing), Jensen’s inequality vividly suggests I should not order two chicken curry dishes simultaneously. The optimal weight would then inflate the preference for the third item, such as $x_1=0.3, x_2=0.3, x_3=0.4$. That is an order flip.</p>
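<p>To make the mixing argument concrete, here is a small worked example under assumptions that are mine and only illustrative: the two curries are perfect substitutes, so only the total curry share $c = x_1 + x_2$ matters; the pie share is $x_3 = 1 - c$; and utility has diminishing returns of square-root form, with hypothetical taste weights $a$ for curry and $b$ for pie.</p>
\[U(c) = a\sqrt{c} + b\sqrt{1-c}, \qquad U'(c) = \frac{a}{2\sqrt{c}} - \frac{b}{2\sqrt{1-c}} = 0 \;\Longrightarrow\; \frac{1-c}{c} = \frac{b^2}{a^2}, \qquad c^\star = \frac{a^2}{a^2+b^2}.\]
<p>Take $a^2/b^2 = 1.5$, so curry is still liked more than pie when each is eaten alone ($a > b$). Then $c^\star = 0.6$, and splitting the curry share evenly gives $x_1 = x_2 = 0.3$ and $x_3 = 0.4$. The pie receives the largest single weight in the mix, even though it would rank last if I could only select one dish.</p>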
<h2 id="sequential-decision-making">sequential decision making</h2>
<p>Sequential decision-making is even harder. Instead of ordering the whole list, we are now asked to bid for one dish at a time. Assume also that my actual preference is $x_1=0.5, x_2=0.1, x_3=0.4$. Because (1) and (2) are alike, my mental process might first compare (1) and (2): (1) is an easy win. It is like matching: if most coordinates match perfectly, ordering is easy.
Then I compare the curry dishes with the apple pie, where I might struggle: they are two very different items, and I can typically make up reasons for either of them. But anyway, after some inner struggle, I find curry better than apple pie. So I tell the flight crew I will order (1).</p>
<p>But then the flight crew checks the headcount in the kitchen, and dish (1) is sold out. So I am asked to choose between (2) and (3) again.</p>
<p>A good mental process should be consistent in some way: the behavior of</p>
<ul>
<li>choosing between (2) and (3) conditioning on (1) being not available</li>
<li>choosing between (2) and (3) if (1) had not been brought out at all</li>
</ul>
<p>should be the same.
It is like how a multinomial classification with $K$ categories is equivalent to $K-1$ binomial classifications. If that is the case, I should pick item (3), since $x_3=0.4 > x_2=0.1$.</p>
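<p>One common way to formalize this consistency requirement, which I am assuming here and did not spell out above, is to renormalize the latent preferences over whatever remains (this is Luce’s choice axiom, or independence of irrelevant alternatives). A minimal sketch, with a hypothetical helper <code>renormalize</code>:</p>

```python
def renormalize(prefs, unavailable):
    """Condition a preference vector on some options being unavailable.

    prefs maps option label -> latent preference (non-negative, sums to 1).
    Returns the preferences over the remaining options, rescaled to sum to 1.
    """
    remaining = {k: v for k, v in prefs.items() if k not in unavailable}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no remaining option has positive preference")
    return {k: v / total for k, v in remaining.items()}


prefs = {1: 0.5, 2: 0.1, 3: 0.4}           # the preferences assumed in the text
after_sellout = renormalize(prefs, unavailable={1})
choice = max(after_sellout, key=after_sellout.get)

print(after_sellout, "-> pick", choice)
```

<p>Under this rule it makes no difference whether dish (1) sold out or was never on the menu: the ratio between (2) and (3) is unchanged, so the apple pie wins either way.</p>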
<p>Except no, my mental process is often not a martingale. It is natural to be sad when learning that (1) is not available, and that will influence my next-stage decision: I might tend to pick (2), just because of its similarity to (1), and this similarity compensates for my disappointment and regret. Is it necessarily irrational? Maybe, but the disappointment is a real feeling, and maximizing the utility of the whole process, including the decision-making phase, is also a sensible goal.</p>Yuling YaoOne mental challenge is decision-making with more than two options. To simplify the dilemma that your humble author is encountering, assume that you are on a long-haul flight and a friendly cabin crew member asks which of the following dishes you would prefer