Jekyll2023-01-07T02:22:50+00:00https://www.yulingyao.com/blog/feed.xmlYuling Yao’s BlogBayesian Statistics, Machine LearningYuling YaoHow much saving do you need in the end? zero?2022-12-30T00:00:00+00:002022-12-30T00:00:00+00:00https://www.yulingyao.com/blog/2022/diewithzero<p>I came across this book called “<a href="https://www.nytimes.com/2020/07/10/business/awkward-timing-book-financial-.html">Die with zero</a>”. Along with many other yolo ideas, the book promotes the attitude that one must maximize net fulfillment over net worth, to the extent of “DIE WITH ZERO”. Indeed, as the book points out on the cover</p>
<blockquote>
<p>In 1957, Nobel Prize Winner, Franco Modigliani, developed the Life-cycle Hypothesis showing the most optimal way of utilizing your wealth is to end with zero.</p>
</blockquote>
<p>I don’t know. I would be moderately shocked if there were indeed an academic effort to show this circular-argument-type result. It seems very trivial.</p>
<p>At the risk of a pedantic tone, I think one common flaw is that people ignore uncertainty. I will show that even in this seemingly trivial problem, uncertainty leads to a surprise.</p>
<p>We need some math here. Suppose one has a fixed total income throughout their life, say $C$ dollars. For simplicity, assume this person has no investments or loans, only spending: they spend $0 \leq Y \leq C$ dollars in their lifetime, such that they have $C-Y$ dollars left in the savings account. In addition, the total amount of necessary costs (food, medical, etc.) in the late stage of life is denoted by $X$ dollars. Naturally, we would wish that the savings could cover these necessary costs, so that the final balance sheet $C-Y-X$ is not negative. If $C-Y-X=0$, then a die-with-zero situation occurs.</p>
<p>It seems reasonable to assume that the utility comes from the following two parts:</p>
<ul>
<li>whether the necessities are satisfied, that is, $1(C-Y-X \geq 0)$. If $C-Y<X$, then the end-of-life bill cannot be paid, which is bad.</li>
<li>the spending on leisure: it is a consumerist world, so it is certainly an axiom that the $Y$ dollars of spending directly contribute a utility increment equal to $Y$.</li>
</ul>
<p>The combined utility is then $1(C-Y-X \geq 0) + Y$, subject to $0 \leq Y \leq C$.</p>
<p>The uncertainty comes from $X$: we do not know the necessary cost at death. Indeed, if $X$ is known, then clearly</p>
\[\mathrm{argmax}_{0 \leq Y \leq C} 1(C-Y-X \geq 0) + Y = C-X,\]
<p>at which one dies with zero. So yes, we shall all die with zero if we have already calculated our life flawlessly.</p>
<p>In practice, $X$ is a random variable. The decision problem maximizes the expected utility</p>
\[\mathrm{max}_{0 \leq Y \leq C} \mathrm{E}_{X} [ 1(C-Y-X \geq 0) + Y ].\]
<p>You can check your intuition here: with this uncertainty in $X$, should we expect more savings? That seems to be what grandma would say, no? But isn’t the utility a linear function, and why would uncertainty matter at all?</p>
<p>We assume $X$ is a normal $(0, \sigma)$ random variable. The expected utility function can be written as $\Pr(X \leq C-Y) + Y$. The derivative of this function with respect to $Y$ is $- \frac{1}{\sigma \sqrt {2\pi}} \exp (-\frac{(C-Y)^2}{2\sigma^2}) + 1$.</p>
<ol>
<li>If $\sigma=0$, then yes, there is no uncertainty, such that dying with zero is optimal.</li>
<li>If $\sigma$ is not too big, $\sigma \leq 1/ \sqrt {2\pi}$, then the optimum is $\hat Y= C-\sqrt{ - 2\sigma^2 \log (\sigma \sqrt {2\pi}) }$. The expected saving at death is $\sqrt{ - 2\sigma^2 \log (\sigma \sqrt {2\pi}) } >0$, not zero. This is the price one pays for the uncertainty.</li>
<li>If $\sigma$ is too big, $\sigma > 1/ \sqrt {2\pi}$, then the derivative of the objective function is always positive, hence the optimum is $\hat Y=C$. That is, since the future is just too chaotic, one simply adopts a yolo lifestyle and forgets about the necessities altogether. The saving at death is either $-X$ or 0, depending on the healthcare system.</li>
</ol>
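<p>These three cases are easy to verify numerically. Here is a small Python sketch (my own check, not from the original post; the values $C=10$ and $\sigma=0.2$ are arbitrary, chosen to fall in the moderate-uncertainty case), comparing the closed-form optimum with a brute-force grid search over the expected utility $\Pr(X \leq C-Y) + Y$:</p>

```python
import math

def expected_utility(Y, C, sigma):
    # E[1(C - Y - X >= 0) + Y] with X ~ Normal(0, sigma):
    # Pr(X <= C - Y) + Y, via the normal CDF written with erf
    return 0.5 * (1 + math.erf((C - Y) / (sigma * math.sqrt(2)))) + Y

C, sigma = 10.0, 0.2   # sigma < 1/sqrt(2*pi) ~ 0.399: the "not too big" case

# closed-form optimum from case 2
Y_hat = C - math.sqrt(-2 * sigma**2 * math.log(sigma * math.sqrt(2 * math.pi)))

# brute-force grid search over [0, C]
grid = [i * C / 100_000 for i in range(100_001)]
Y_grid = max(grid, key=lambda Y: expected_utility(Y, C, sigma))

print(Y_hat, Y_grid)   # both leave about 0.235 dollars unspent in expectation
```

<p>The grid search agrees with the closed form, and both leave a strictly positive expected saving at death.</p>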
<p>The bottom line: die-with-zero is an oversimplification. When there is moderate uncertainty in the necessary cost, the optimum prefers extra non-zero saving at death. When the uncertainty is too big, just yolo.</p>Yuling Yaocontrol variate other than the score2022-12-08T00:00:00+00:002022-12-08T00:00:00+00:00https://www.yulingyao.com/blog/2022/gamma<p>In Bayesian computation, we use control variates to reduce Monte Carlo (MC) variance. The idea is that if we want to compute $E_{p} h(x)$ from MC draws $x_{1, \dots, S}$, then instead of computing the sample mean of $h(x_i)$, we seek a mean-zero function $m(x)$ with $E_{p} m(x)= 0$, such that $h(x)- m(x)$ has lower variance.</p>
<p>As far as I know, most control variates take the form of the <strong>score</strong> function: either the gradient of the log density, $\nabla_{x} \log p(x)$, or the Stein gradient $\nabla_x \log p(x) \cdot g(x) + \nabla_{x} \cdot g(x)$.</p>
<p>Are there other zero-mean functions? Well, at least another one: $\nabla_{x} p(x)$, because $E_p \nabla_{x} p(x)=0$ for all $p$. I might be ignorant, but I just noticed this identity today.</p>
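<p>As a toy numerical check (my own sketch, not from the post; standard normal target, $h(x)=x$): using $m(x) = \nabla_x p(x) = -x\,p(x)$ as a control variate, with a regression coefficient estimated from the same draws, reduces the Monte Carlo variance:</p>

```python
import math, random

random.seed(1)

def phi(x):          # standard normal pdf p(x)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def m(x):            # m(x) = grad_x p(x) = -x p(x), with E_p[m(X)] = 0
    return -x * phi(x)

n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]
hs = xs                       # h(x) = x, so we are estimating E[X] = 0
ms = [m(x) for x in xs]

def mean(v):
    return sum(v) / len(v)

def var(v):
    mu = mean(v)
    return sum((t - mu) ** 2 for t in v) / len(v)

# regression coefficient beta = Cov(h, m) / Var(m), estimated from the draws
mh, mm = mean(hs), mean(ms)
cov = sum((a - mh) * (b - mm) for a, b in zip(hs, ms)) / n
beta = cov / var(ms)

adjusted = [a - beta * b for a, b in zip(hs, ms)]
print(var(adjusted) / var(hs))   # well below 1: the control variate helps
```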
<p><strong>Edit:</strong>
Actually, $\nabla_{x} p(x)$ is still generated by the score function: just take $g(x)=p(x)$ in the Stein gradient formula in the second paragraph, which yields $2\nabla_{x} p(x)$.</p>Yuling YaoWhat is wrong with this marginalize-out trick2022-08-23T00:00:00+00:002022-08-23T00:00:00+00:00https://www.yulingyao.com/blog/2022/marginal<p>Consider a normal-normal model with vector data $y$ and scalar parameters $\mu$ and $\sigma$, written in the following Stan code<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data {
int<lower=0> N;
vector[N] y;
real<lower=0> tau;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
mu ~ normal(0, tau);
y ~ normal(mu, sigma);
}
</code></pre></div></div>
<p>Here $\tau$ is a fixed hyper-parameter, say 10. We can make inference on $\mu$ and $\sigma$. That is easy.</p>
<p>But now I decide that I want to apply the marginalize-out trick to get rid of $\mu$. That is easy because it is a normal-normal model, so the marginalized-out model is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data {
int<lower=0> N;
vector[N] y;
real<lower=0> tau;
}
parameters {
real<lower=0> sigma;
}
model {
y ~ normal(0, hypot(sigma, tau));
}
</code></pre></div></div>
<p>The problem is that these two models are not the same. Mathematically, the full joint model reads</p>
\[y\vert \mu , \sigma \sim N(\mu, \sigma^2),~~
\mu\sim N(0, \tau^2),\]
<p>It looks so tempting to marginalize out $\mu$ and write</p>
\[y\vert \sigma \sim N(0, \sigma^2+ \tau^2).\]
<p>But they just cannot be the same: the MAP estimate of $\sigma^2$ is $\widehat \sigma^2= \mathrm{Var}(y)$ in model 1 and $\widehat \sigma^2= \sum_{i=1}^n y_i^2 / n - \tau^2$ in model 2.</p>
<p>The problem is that <strong>y</strong> is a vector. The $y_i$ are conditionally independent given $\mu$ and $\sigma$, but not when conditioning only on $\sigma$. It is true that the marginal-marginal of $y_i$ is</p>
\[y_i\sim N(0, \sigma^2+ \tau^2).\]
<p>However, the joint-marginal is no longer factorizable. Indeed, Cov$(y_i, y_j)= \tau^2$. So the correct marginalized-out model $y \vert \sigma$
should be a MVN with mean 0 and a covariance matrix whose diagonals are $\sigma^2+ \tau^2$ and off-diagonals $\tau^2$.</p>
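<p>To see this numerically, here is a small self-contained Python check (my own sketch; the values $N=3$, $\sigma=1$, $\tau=10$ are arbitrary). The correct MVN marginal, computed with the Sherman-Morrison inverse and the matrix determinant lemma for the compound-symmetric covariance, agrees with brute-force integration over $\mu$, while the factorized $N(0, \sigma^2+\tau^2)$ model does not:</p>

```python
import math

def norm_logpdf(x, mu, sd):
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mu) / sd) ** 2

y = [1.0, -0.5, 2.0]
sigma, tau = 1.0, 10.0
n = len(y)

# (a) correct marginal: MVN(0, sigma^2 I + tau^2 11'), via Sherman-Morrison
s2, t2 = sigma**2, tau**2
quad = sum(v * v for v in y) / s2 - t2 * sum(y) ** 2 / (s2 * (s2 + n * t2))
logdet = (n - 1) * math.log(s2) + math.log(s2 + n * t2)
mvn = -0.5 * (n * math.log(2 * math.pi) + logdet + quad)

# (b) wrong "marginal": treat the y_i as independent N(0, sigma^2 + tau^2)
indep = sum(norm_logpdf(v, 0, math.hypot(sigma, tau)) for v in y)

# (c) brute force: integrate mu out on a grid
grid, width = 4001, 8 * tau
mus = [-width / 2 + width * i / (grid - 1) for i in range(grid)]
vals = [math.exp(sum(norm_logpdf(v, mu, sigma) for v in y) + norm_logpdf(mu, 0, tau))
        for mu in mus]
brute = math.log(sum(vals) * width / (grid - 1))

print(mvn, brute, indep)   # (a) and (c) agree; (b) does not
```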
<p><strong>The bottom line:</strong> <a href="https://mc-stan.org/docs/2_20/stan-users-guide/rao-blackwell-section.html">Marginalization</a> is a great trick to boost computing efficiency. But it is your obligation to validate the conditional independence after the marginalization.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Bob Carpenter wrote the code. Bob, Charles and I wasted one hour discussing this toy example. Please do not let our employer know what we are doing. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Yuling YaoAlternatives to two stage modeling2022-08-22T00:00:00+00:002022-08-22T00:00:00+00:00https://www.yulingyao.com/blog/2022/two-stage<p>Sometimes a model can be decomposed into modules and we may run inference on each module separately. This task comes up a lot in cut-feedback, SMC, causal inference (two-stage regression), multiple imputation, and PK-PD modeling.</p>
<p>As the easiest example, consider a Stan model with data <code class="language-plaintext highlighter-rouge">y</code> and parameters <code class="language-plaintext highlighter-rouge">mu</code>, <code class="language-plaintext highlighter-rouge">sigma</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>y ~ normal (mu, sigma);
</code></pre></div></div>
<p>For some reason, we have already fitted <code class="language-plaintext highlighter-rouge">mu</code> from a different module or from a different dataset. We have obtained $\mu_1, \dots, \mu_S$. The goal is to make inference on $p(\sigma \vert y, \mu_1, \dots, \mu_S)$.</p>
<p>To be clear, by now we have already lost full Bayesianity since we do not fit a joint model. But hey, we are inclusive of non-Bayesian methods.</p>
<p>There are three seemingly reasonable approaches for the second-stage model:</p>
<ol>
<li>
<p><strong>Multiple imputation.</strong> We run the model <code class="language-plaintext highlighter-rouge">y ~ normal (mu[i], sigma);</code> separately for each $i$ and collect draws from $p(\sigma \vert y, \mu_i)$; we then mix these draws together. This is the approach taken in multiple imputation and cut.</p>
</li>
<li>
<p><strong>Plugin estimate.</strong> When we do a two-stage least-squares fit, we simply plug in the first-stage point estimate, say the posterior mean. This amounts to a new model $y \sim \mathrm{normal} (\bar {\mu}, \sigma)$, where $\bar {\mu}= \frac{1}{S} \sum_{i=1}^S \mu_i$.</p>
</li>
<li>
<p><strong>Mixed log likelihood.</strong> At least seemingly doable, we may also mix the log density from these draws, which in Stan reads</p>
</li>
</ol>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for (i in 1:S)
target += 1/S * normal_lpdf (y | mu[i], sigma);
</code></pre></div></div>
<p>In this model example, the mixed-log-likelihood approach is identical to the plugin estimate, although in general all three methods will differ. Using the conditional variance formula, we can see that multiple imputation delivers the largest estimate of $\sigma$.</p>
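<p>As a toy check of that last claim (my own Python sketch, with synthetic data and synthetic first-stage draws of $\mu$ standing in for a real stage-one fit): the multiple-imputation average of the per-draw estimates of $\sigma^2$ exceeds the plugin estimate by exactly the first-stage variance of $\mu$:</p>

```python
import random

random.seed(1)

# synthetic data and synthetic first-stage draws of mu (stand-ins for a real fit)
y = [random.gauss(1.0, 2.0) for _ in range(50)]
mu_draws = [random.gauss(1.0, 0.5) for _ in range(1000)]
n, S = len(y), len(mu_draws)

mu_bar = sum(mu_draws) / S

# plugin: condition on the point estimate mu_bar
plugin_ss = sum((v - mu_bar) ** 2 for v in y) / n

# multiple imputation: average the per-draw estimates of sigma^2
mi_ss = sum(sum((v - m) ** 2 for v in y) / n for m in mu_draws) / S

# the gap equals the first-stage variance of mu (conditional variance formula)
var_mu = sum((m - mu_bar) ** 2 for m in mu_draws) / S
print(plugin_ss, mi_ss, mi_ss - plugin_ss, var_mu)
```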
<p>OK, I know that in most cases approach 1 is the only acceptable answer. The justification is straight from the Bayes rule:</p>
\[p(\sigma \vert y) = \int p(\sigma \vert y, \mu) p(\mu \vert y ) d\mu.\]
<p>My controversial objection is that the Bayes rule is only relevant if we are running a joint model and inferring $\mu$ and $\sigma$ together. That is SMC. But in a situation like cut, we are placing doubt on the model in the first place, and keeping the obsession with this Bayes rule seems a little bit stubborn to me.</p>
<p>Approach 1 and approach 3 differ in how they mix the conditional sampling model $p(y \vert \sigma, \mu)$. Approach 1 is using a mixture (coherent with the joint model)</p>
\[p(y \vert \sigma) := \int (p(y \vert \sigma, \mu) p(\mu \vert \sigma) ) d\mu,\]
<p>while approach 3 is using log-linear-pooling (this line does not correspond to any joint model):</p>
\[\log p(y \vert \sigma) := \int \log p(y \vert \sigma, \mu) p(\mu \vert y) d\mu + Constant.\]
<p>I wonder if this approach 3 has any actual application. I do not know.</p>Yuling YaoScore matching, Bayesian predictions, tempering, and invariance2022-08-20T00:00:00+00:002022-08-20T00:00:00+00:00https://www.yulingyao.com/blog/2022/score<h2 id="score-matching">Score matching</h2>
<p>Suppose that we observe a sequence of data $y=\{y_i \in R_m \mid 1\leq i \leq n\}$ coming independently from an unknown distribution $p_{true}$; we would like to evaluate a forecast given by a probability density function $p(y)$. For example, we may use the logarithm score $\sum_{i=1}^n \log p(y_i)$ to assess the forecast.</p>
<p>But what if the predictive pdf is only known up to multiplicative constants? That is, we are only able to evaluate the unnormalized density $q(y) = p(y)/ c$. In a typical task of parameter inference, model selection, and model averaging, we are given a set of unnormalized forecasts indexed by $\theta$: ${q_\theta(\cdot) \mid \theta \in \Theta}$, where each element $q_\theta(\cdot)$ is a non-negative function on $R_m$, whose normalizing constant $c(\theta) = \int_{R_m} q_\theta(y) d y$ is unknown.</p>
<p>Since the seminal work of Hyvarinen (2005), <em>score matching</em> has been a powerful tool for evaluating unnormalized predictions. The main idea is that the normalizing constant <em>disappears</em> when we look at the gradient of the log unnormalized density. The “gradient of log” of the pdf is often known to statisticians as the <em>score</em> function. We measure the difference between the score functions of the true data generating process $p_{true}$ and of the forecast $q_\theta$,</p>
\[D(p_{true}, q_\theta) = \int_{R_m} \Vert \nabla \log p_{true}(y) - \nabla \log q_{\theta} (y) \Vert^2 p_{true}(y) dy,\]
<p>hence the name <em>score matching</em>.</p>
<p>In practice, we do not know $p_{true}$; we only observes its samples $y_{1:n}$. A sample estimate of the divergence above is</p>
\[H(y_{1:n}, q_{\theta})
=\frac{1}{n}\sum_{i=1}^n \left( \Vert \nabla_y \log q_{\theta} (y_i) \Vert^2 + 2 \Delta_y \log q_{\theta} (y_i) \right).\]
<p>In a larger universe of scoring rules, this $H(y, q_{\theta})$ is known as the Hyvarinen score. In the limiting case as sample size $n \to \infty$, this sample estimate converges in the sense that $H(y_{1:n}, q_{\theta}) \to D(p_{true}, q_\theta)\,+$ Constant, where the constant does not depend on $q_{\theta}$.</p>
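<p>As a quick numerical illustration of why this score tolerates unknown constants (my own sketch; the density $q(y) \propto \exp(-y^4/4)$ and the constant 7.3 are arbitrary), the per-point quantity $\Vert \nabla_y \log q (y)\Vert^2 + 2 \Delta_y \log q (y)$ is unchanged when $q$ is rescaled:</p>

```python
import math

def hyvarinen(logq, y, h=1e-4):
    # one-dimensional Hyvarinen score via finite differences of log q
    d1 = (logq(y + h) - logq(y - h)) / (2 * h)
    d2 = (logq(y + h) - 2 * logq(y) + logq(y - h)) / (h * h)
    return d1 * d1 + 2 * d2

logq = lambda y: -y**4 / 4                        # q(y) = exp(-y^4/4), unnormalized
logq_scaled = lambda y: math.log(7.3) - y**4 / 4  # the same q rescaled by 7.3

for y in [-1.2, 0.3, 2.0]:
    print(hyvarinen(logq, y), hyvarinen(logq_scaled, y))  # each pair agrees
```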
<h2 id="unnormalized-models-in-bayesian-statistics">Unnormalized models in Bayesian statistics</h2>
<p>There are three levels of unnormalized models in Bayesian statistics.</p>
<h3 id="level-1-a-harmless-normalization-constant-comes-from-the-bayes-rule">Level 1: A harmless normalization constant comes from the Bayes rule.</h3>
<p>In classical parameter inference, the posterior density of a parameter is typically given in an unnormalized form: $p(\theta\vert y) \propto p(y\vert \theta) p(\theta)$, where the normalizing constant is the marginal likelihood $\int p(y\vert \theta) p(\theta) d \theta = p(y)$. For the purpose of Bayesian computation, this normalizing constant is irrelevant in MCMC, variational inference, or importance sampling. Notably, with posterior draws $\theta_1, \dots, \theta_S$, the posterior predictive distribution is tractable and appropriately normalized,</p>
\[p(\tilde y \vert y) = \int p(\tilde y \vert \theta)p(\theta \vert y) d \theta \approx \frac{1}{S} \sum_{s=1}^S p(\tilde y \vert \theta_s).\]
<h3 id="level-2-intractable-posterior-predictive-distribution">Level 2: Intractable posterior predictive distribution.</h3>
<p>Sometimes we only know the posterior predictive density up to a constant. For example, in modern literature on calibration, we may address the potential overconfidence of a prediction via tempering, such that</p>
\[p(\tilde y\vert y, \lambda)= \frac{1}{z(\lambda)} p(\tilde y \vert y)^\lambda, ~
z(\lambda)= \int p(\tilde y \vert y)^\lambda d \tilde y.\]
<p>Intuitively, a smaller $\lambda \in (0,1)$ flattens the prediction, resulting in less confidence. The Hyvarinen score still applies.</p>
<h3 id="level-3-intractable-likelihood">Level 3: Intractable likelihood.</h3>
<p>The likelihood itself may also be intractable: we are only able to evaluate
$q(y\mid \theta) \propto p(y\mid \theta),$ while the pointwise normalizing constant
$z(\theta)= \int q(y\mid \theta) dy$ is unknown. This type of model is often called <strong>doubly intractable</strong>. For example, in the <em>alpha-likelihood</em>, the likelihood function is</p>
\[p(y\vert \theta, \lambda) \propto p(y\vert \theta)^ \lambda, ~
z(\lambda, \theta)= \int p(y\vert \theta)^ \lambda d y.\]
<p>Aside from how to sample from a doubly intractable model, even if we do obtain posterior draws $\theta_1, \dots, \theta_S$, this time the posterior predictive distribution is <strong>a mixture of unnormalized</strong> densities:</p>
\[p(\tilde y\vert y)= \int p(\tilde y\vert \theta, \lambda) p( \theta, \lambda \vert y) d\theta d\lambda \approx \frac{1}{S} \sum_{s=1}^S \frac{1}{z(\lambda_s, \theta_s)} p^{\lambda_s}(\tilde y\vert \theta_s).\]
<p>The Hyvarinen score does not apply to a mixture/summation of unnormalized densities. It is clear that the score function is not invariant under this procedure:</p>
\[\nabla \log \left(\sum_{i=1}^S c_if_i(y) \right) \neq \nabla \log \left(\sum_{i=1}^S f_i(y)\right).\]
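<p>The non-invariance is easy to see numerically. In this toy Python sketch (my own; two unnormalized Gaussian bumps), rescaling one mixture component changes the score of the mixture:</p>

```python
import math

f1 = lambda y: math.exp(-y**2 / 2)         # unnormalized component, ~ N(0, 1)
f2 = lambda y: math.exp(-(y - 3)**2 / 2)   # unnormalized component, ~ N(3, 1)

def grad_log_mix(c1, c2, y, h=1e-6):
    # score of the mixture c1*f1 + c2*f2, by central difference
    logmix = lambda t: math.log(c1 * f1(t) + c2 * f2(t))
    return (logmix(y + h) - logmix(y - h)) / (2 * h)

# rescaling one component changes the score of the mixture
print(grad_log_mix(1, 1, 1.0), grad_log_mix(1, 10, 1.0))
```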
<h2 id="matching-for-doubly-intractable-bayesian-predictions-or-a-mixture-of-unnormalized-densities">Matching for doubly intractable Bayesian predictions, or a mixture of unnormalized densities?</h2>
<p>“Gradient of log” is a great operator because it throws away normalizing constants. That is, for any positive constant $c$ and any continuous density function $p(y)$,</p>
\[{\frac{d}{dy} \log} ( {c} p(y) ) = {\frac{d}{dy} \log} ( p(y) ).\]
<p>But what if we now want to evaluate a sum of unnormalized functions, $\sum_{i=1}^n p_i(y)$?</p>
<p>Does there exist a non-trivial operator, #, a mapping from $R^R$ to $R^R$, such that</p>
\[{\color{red} \#} ( \sum_{i=1}^S {\color{orange}c_i} p_i(y) ) = {\color{red} \#} ( \sum_{i=1}^S p_i(y) ).\]
<p>The answer is negative for any $S\geq 2$.</p>
<p>A heuristic proof: we can write any function as a Taylor series expansion. If an operator satisfied the property above, it would make any two functions equivalent, such that any two predictions are evaluated to be the same. That is not useful.</p>
<h3 id="the-bottomline">The bottomline:</h3>
<p>Score matching is a useful tool for evaluating unnormalized models. The Hyvarinen score applies to a tempered mixture, but does not apply to a mixture of tempered densities, or any doubly-intractable Bayesian prediction.</p>
<p>Furthermore, we can mathematically prove that there is no operator that can match a mixture of unnormalized densities.</p>Yuling YaoHow to generate unbiased estimate of 1/E[x] using one random draw?2022-04-19T00:00:00+00:002022-04-19T00:00:00+00:00https://www.yulingyao.com/blog/2022/nonlinearMC<p><strong>Quiz: you are given ONE random draw $x$ that was drawn from a density $p(x)$. Could you produce an unbiased estimate of $1/E_p[X]$?</strong></p>
<p>You might want to think about this quiz before reading my solution.</p>
<p>Apart from mathematical fun, this type of problem comes up in stochastic approximation, in which we need an unbiased estimate using a very small number of Monte Carlo draws. The unbiasedness here means that this sampling step will be repeated many times, but each time you are only shown one sample point $x$, and we wish the estimate to be unbiased under repeated sampling.</p>
<p>Here, the obvious wrong answer is to use $1/x$. You can try</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n=10000
x=rbeta(n,2,2)
mean(1/x)
</code></pre></div></div>
<p>It is clear that $E[1/x]= 3$ while our desired quantity is $1/E[x]= 2$. Indeed, by Jensen’s inequality, $E[1/x] > 1/E[x]$ for any positive random variable $x$.</p>
<p>How about a Taylor series expansion? Something like $1/E[x] = 1- (E[x]-1) + O((E[x]-1)^2)$? It is legitimate, but then you get the crude approximation $2-x$, and it requires believing that $E[x] \approx 1$, which we typically do not know in the first place.</p>
<p>I find one solution from rejection sampling. The idea is that self-normalized
importance sampling is only unbiased asymptotically, while rejection sampling is always unbiased even if you have MC size 1.</p>
<p>Here is the method. To make it work, I need to know an upper bound of $x$; it has to be a bounded variable. Say the upper bound is $c$. Each time I see a realization $x$, I independently generate a random number $u$ from Uniform(0,1). If $u$ is smaller than $x/c$, accept, and report $1/x$. If $u$ is larger than $x/c$, do not report any estimate.</p>
<p>Then whenever I report the accepted $1/x$, it is an unbiased estimate of $1/E[x]$: conditional on acceptance, the expected report is $E[(1/x)(x/c)] / E[x/c] = (1/c) / (E[x]/c) = 1/E[x]$.</p>
<p>Here is a demo code for $p(x)=$ Beta$(2,2)$:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n=1000000
x=rbeta(n,2,2)
u=runif(n,0,1)
Unbiased_estimate = rep(NA, n)
Unbiased_estimate[u < x] = (1 / x)[u < x]
# check the answer:
mean(Unbiased_estimate, na.rm = TRUE) - 1 / mean(x)
</code></pre></div></div>Yuling YaoStatistics intuitions for integrals2022-03-29T00:00:00+00:002022-03-29T00:00:00+00:00https://www.yulingyao.com/blog/2022/gamma<p>I have not done any math for a long while. Today I happened to need to compute an integral</p>
\[S(k, \sigma) =\int_{0}^\infty x\log (x)/\sigma (1+kx/\sigma)^{(-1/k-1)} dx.\]
<p>It is the expectation of $x\log x$ under generalized Pareto distribution. Surely it will be finite as long as $k <1$.</p>
<p>I tried for a while then I was very sure I cannot solve it. So I opened some symbolic integral tool and the result turned out easy</p>
\[S(k, \sigma)= \frac{\sigma \left( 1-\mathrm{HarmonicNumber}[-2+\frac{1}{k}] - \log(\frac{k}{\sigma}) \right) }{1-k}.\]
<p>Except that I do not understand what <code class="language-plaintext highlighter-rouge">HarmonicNumber</code> is. I think it is some special function, so I looked it up. <a href="https://en.wikipedia.org/wiki/Harmonic_number">Wikipedia</a> told me that</p>
<blockquote>
<p>In mathematics, the n-th harmonic number is the sum of the reciprocals of the first n natural numbers:</p>
</blockquote>
\[H_{n}=1+{\frac {1}{2}}+{\frac {1}{3}}+\cdots +{\frac {1}{n}}=\sum _{k=1}^{n}{\frac {1}{k}}.\]
<p>Except it is not helpful to me because apparently I have a non-integer $n= -2+\frac{1}{k}$ here. I studied complex analysis in college but I have never used it since. But that is ok, I trust my symbolic integral tool.</p>
<p>Indeed I only want to evaluate this integral near $k=1$. Because the mean and variance of the generalized Pareto distribution are of the order $O((1-k)^{-1})$ and $O((1-k)^{-2}(1-2k)^{-1})$ respectively, my best conjecture is that this $S$ should be $O((1-k)^{-m}), 1\leq m \leq 2$, as $k$ is close to 1.</p>
<p>So I searched one more minute I found that $H_{x}= \frac{\Gamma^\prime(x+1)}{\Gamma(x+1)}+\gamma$, in which $\gamma$ is the Euler constant and $\Gamma$ is the Gamma function.</p>
<p>The appearance of Euler constant and Gamma function in applied statistics is like a six pleat shirring on a shirt: fancy to the wearer but seldom useful to the audience.</p>
<p>It appeared that I needed the derivative of the Gamma function near 0. But I found that $\frac{\Gamma^\prime(x)}{\Gamma(x)}$ is itself called the <a href="https://en.wikipedia.org/wiki/Digamma_function">digamma function</a> $\psi(x)$. OK, I am not proud of being ignorant here, but it is still fun to learn. So I looked up Wikipedia again and I found $\psi(x)\approx \log x - \frac{1}{2x}$. I plugged this approximation into my expression, and it is not the same as my conjecture. Ohh, of course, the $\psi(x)\approx \log x - \frac{1}{2x}$ approximation is only applicable when $x$ is large. For small $x\approx 0$, I found that $\psi(x) \approx -1/x - \gamma$. I plugged this in and then $\gamma$ cancelled out. So the final answer is that $S(k, \sigma) = \sigma k / (1-k)^2$ + small order terms as $k$ goes to 1. Done.</p>
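<p>The two digamma approximations used above are easy to check numerically. Here is a small Python sketch (my own; the finite-difference digamma is a convenience built from <code class="language-plaintext highlighter-rouge">math.lgamma</code>, not a library call):</p>

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma(x, h=1e-6):
    # psi(x) = d/dx log Gamma(x), via a central difference of math.lgamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

# large-x expansion used first: psi(x) ~ log(x) - 1/(2x)
print(digamma(10.0), math.log(10.0) - 1 / 20)

# small-x expansion used second: psi(x) ~ -1/x - gamma
print(digamma(0.01), -1 / 0.01 - EULER_GAMMA)
```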
<p>But this is not why I wrote this post. The point is that sometimes statistical intuition can help with tedious math. To be clear, this math problem is only tedious to me because I am ignorant of the digamma and gamma functions; I am sure it would be trivial to Euler. That said, I have already used statistical intuition once: I knew the order must be between $-1$ and $-2$ because $x\log x$ is bounded between the first and second moments.</p>
<p>Indeed, a more statistically intuitive solution here is that I can simply replace the generalized Pareto distribution by a Pareto distribution. This time I can do it by hand:</p>
\[S(k,1)\approx
\int_{1}^{\infty} r \log r \frac{1}{k} r^{-1/k-1} dr= k(1-k)^{-2}.\]
<p>This expression is different but has the same order as the one I obtained using the digamma function when $k$ is close to 1, which can be used for many crude approximations. Again, it would be nice if I had known more about the gamma function, but solving tedious math with a simple statistical approximation is equally fun.</p>Yuling YaoMarginal likelihood and the Lindley paradox2021-11-22T00:00:00+00:002021-11-22T00:00:00+00:00https://www.yulingyao.com/blog/2021/BF<p>I read an arxiv preprint “History and Nature of the Jeffreys-Lindley Paradox” by Eric-Jan Wagenmakers and Alexander Ly. It is a comprehensive journey that reviews the development of the “Jeffreys-Lindley Paradox”, or what is typically called the Lindley Paradox: we can reject a point null at p = 0.0001 while the Bayes factor (BF) may favor this point null at BF = 1000.</p>
<p>Wagenmakers and Ly pointed out two approaches to escape the Lindley Paradox: either to avoid using a point hypothesis in the Bayes test, or to avoid a vague prior. Notably, we may still have the Lindley Paradox when the null is a spiky continuous distribution rather than the point mass.</p>
<p>To make the discussion concrete, consider a binomial experiment $y\sim \mathrm{Bin} (n,\theta)$, and suppose we observe $y = 501{,}500$ successes with $n=10^6$. We specify a point null $\theta=.5$ against the alternative $\theta \sim$ Uniform$(0,1)$. The z-statistic is $(0.5015-0.5)/(0.5/\sqrt{n}) = 3$, so the p-value is about $0.003$ and the null is rejected, while the BF favors the null (by a factor of roughly 9 here, and by an arbitrarily large factor as $n$ grows), as $\theta=.5$ predicts the outcome much better than the vague prior $\theta \sim$ Uniform$(0,1)$.</p>
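<p>Here is a quick Python check of this tension (my own sketch; the numbers $n=10^6$ and $y=501{,}500$ are chosen so the conflict is visible). The uniform prior's marginal likelihood for a binomial count is exactly $1/(n+1)$, so the BF of the point null needs no numerical integration:</p>

```python
import math

n, y = 1_000_000, 501_500
theta0 = 0.5

# frequentist side: normal-approximation z statistic and two-sided p-value
z = (y / n - theta0) / math.sqrt(theta0 * (1 - theta0) / n)
p_value = math.erfc(z / math.sqrt(2))          # = 2 * Pr(Z > z)

# Bayesian side: BF of the point null against theta ~ Uniform(0, 1);
# the uniform-prior marginal likelihood of y is exactly 1 / (n + 1)
log_pmf_null = (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
                + n * math.log(theta0))
bf_null = math.exp(log_pmf_null + math.log(n + 1.0))

print(z, p_value, bf_null)   # the p-value rejects the null; the BF favors it
```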
<p>We have made this point in our hierarchical stacking article: a model being true or false is not directly related to it being good or bad in terms of data fitting. Indeed, a wronger model may make a better prediction depending on your chosen metric. In the Lindley paradox, at least I think, a Bayesian shall not conclude that $\theta=.5$ in light of a very big BF, because we know a priori that Pr(point null) $=0$.</p>
<p>The marginal likelihood is only weakly related to how well the model fits the data. It reflects the average leave-$q$-out log predictive density as $q$ varies from 0 to $n$, among which $q=n$ accounts for a disproportionate share because the prior typically has poor predictive power.</p>
<p>To me, this irrelevance to the prediction task is the larger problem with BF: BF aims to test which model is more correct, rather than which model fits the data better. Worse, a model consists of two parts: the structure and the magnitude (the specific values in the prior). To appeal to BF, you need to do well on both parts. At some point, it becomes a test of the prior rather than a test of the model. In contrast, in hypothesis testing/LOO model comparison/posterior predictive checks, the prior is not or less relevant, because these approaches examine the predictive ability of the inferred model rather than the prior.</p>
<p>BF/marginal likelihood does have its merit: we can easily trick empirical loss by using an overfitting model, in which the empirical loss approaches zero, while the BF will typically be very small because of the large/complex parameter space in the prior. In that sense, BF <em>never</em> overfits; BF <em>always</em> underfits.</p>
<p>Can we make BF less sensitive to priors?
Yes, use intrinsic BMA, or its $n=1$ limit, the pseudo-BMA (LOO-elpd weighting).</p>
<p>Can we monitor empirical loss to test the model being <em>true</em> or <em>false</em> (other than <em>good</em> or <em>bad</em>)?
Yes, stay tuned.</p>Yuling YaoTerrace and gradient2021-10-05T00:00:00+00:002021-10-05T00:00:00+00:00https://www.yulingyao.com/blog/2021/gradient<p>I came across a paper <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4306294/">“The Adaptive Biasing Force Method: Everything You Always Wanted To Know but Were Afraid To Ask”</a> by Jeffrey Comer et al. When comparing the adaptive biasing force method (a gradient-based method) with importance-sampling-based methods (zero-order methods), the authors concluded that</p>
<blockquote>
<p>From a mathematical viewpoint, the adaptive biasing force method, just like adaptive biasing potential methods, is an adaptive importance-sampling procedure. There is, however, a salient difference between these two techniques. In the latter, the potential of mean force or, equivalently, the corresponding probability distribution along the transition coordinate is being adapted. In contrast, the former relies on biasing the force, i.e., the gradient of the potential. This difference is more important than it might appear at first sight, as potentials and probability distributions are global properties whereas gradients are defined locally. In terms of probability distributions, it means that the count of samples in the neighborhood of a given value of the transition coordinate is insufficient to estimate probability. Knowledge of the underlying probability distribution over a much broader range of $\xi$ is required. This may considerably impede efficient adaptation. In contrast, all that is needed to estimate the gradient is the knowledge of local behavior of the potential of mean force. Other regions along the transition coordinate do not have to be visited. Thus, in many instances, adaptation proceeds markedly faster. Using a common metaphor, the difference between the adaptive biasing potential and adaptive biasing force methods can be compared to inundating the valleys of the free-energy landscape as opposed to plowing over its barriers to yield an approximately flat terrain, conducive to unhampered diffusion.</p>
</blockquote>
<p>I like the plowing metaphor. I found a photo of Rice Terraces in Yunnan:</p>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/2007_1206_Cleared_Hani_rice_terraces.jpg/640px-2007_1206_Cleared_Hani_rice_terraces.jpg" />
<p>which is in contrast to:</p>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/2017_Aerial_view_Hoover_Dam_4774.jpg/600px-2017_Aerial_view_Hoover_Dam_4774.jpg" />
<p>Aside from the context of free energy computation, the exact same reasoning behind the previous metaphor suggests that the gradient-based method is often a dual alternative to the zero-order method:</p>
<ol>
<li>In survival analysis, the Nelson–Aalen estimator is sort of the gradient version of the Kaplan–Meier (product-limit) estimator.</li>
<li>In optimization, finding the mode of a convex function is equivalent to finding the minimum of the absolute value of its gradient.</li>
<li>In cross-validation, the jackknife is the gradient-alternative to importance sampling.</li>
<li>In optimization convergence tests, we can monitor either whether the objective has stabilized or whether the gradient has become zero.</li>
<li>In MCMC convergence tests, we can monitor either whether the sample draws have mixed or whether the gradient of the log density has mean zero.</li>
</ol>
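<p>As a toy illustration of the last item: the sketch below runs a hand-rolled random-walk Metropolis sampler on a standard normal target (both the target and the tuning are arbitrary assumptions for illustration) and checks convergence both ways, via the sample mean of the draws and via the score identity $E[\nabla \log p(x)] = 0$ under $p$.</p>

```python
import math
import random

random.seed(1)

def log_p(x):    # standard normal log density, up to a constant
    return -0.5 * x * x

def score(x):    # gradient of the log density
    return -x

# Minimal random-walk Metropolis (illustrative, not tuned).
x, draws = 0.0, []
for _ in range(50_000):
    prop = x + random.gauss(0.0, 1.0)
    if math.log(random.random()) < log_p(prop) - log_p(x):
        x = prop
    draws.append(x)

# Zero-order diagnostic: have the draws mixed around the true mean (0)?
mean_draw = sum(draws) / len(draws)
# First-order diagnostic: the average score should also be near zero.
mean_score = sum(score(d) for d in draws) / len(draws)
print(mean_draw, mean_score)
```

<p>Both diagnostics should hover near zero, but the gradient version only requires local evaluations of the log density at the visited points.</p>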
<p>Should we compute more gradients?</p>Yuling YaoI come across a paper “The Adaptive Biasing Force Method: Everything You Always Wanted To Know but Were Afraid To Ask” by Jeffrey Comer et al. When comparing the adaptive biasing force method (gradient based method) and importance sampling based methods (zero-order method), the authors concluded thatHow do we compare two numbers2021-09-15T00:00:00+00:002021-09-15T00:00:00+00:00https://www.yulingyao.com/blog/2021/number<p>I was reading an article on how the politician’s height can have a causal effect on electability. But then I realize we often have a different scale for comparing numbers when we know these numbers represent some physical objects.</p>
<p>Here are two examples:</p>
<ol>
<li>As per Google, Pete Buttigieg’s height is 5’8 and Gavin Newsom’s is 6’3, placing them on the relatively short and tall ends of the modern-day politician’s height spectrum, respectively. With these two numbers in mind, certainly 6’3 is much bigger than 5’8, right?</li>
<li>In the 2020 U.S. Presidential election, the Democratic share in TX was 47% and the Republican share was 52%. Hey, it was 47 and 52: what a tossup!</li>
</ol>
<p>The point is that 6’3 / 5’8 = 190.5 cm / 172.7 cm = 1.10, and 52 / 47 = 1.11. The two comparisons have the same multiplicative difference, so why do we automatically read 6’3 $\gg$ 5’8, while 52 $\approx$ 47?</p>
<p>One explanation is some sort of anchoring effect. We encounter this arbitrary anchor choice in data visualization too: when comparing two coefficients, which $y$-axis scale are we using? By looking at the multiplicative difference, we have implicitly set zero as the lower end of the $y$-axis. But an adult male politician’s height cannot be zero, so perhaps we implicitly use a different lower endpoint, or anchor, say 5’6; then the actual multiplicative difference we read in our minds is (6’3 - 5’6) / (5’8 - 5’6) = 4.5.</p>
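<p>The anchor arithmetic is easy to check; here is a quick sketch (heights converted to inches, with the 5’6 anchor being an arbitrary assumption):</p>

```python
# Heights in inches: 6'3" = 75, 5'8" = 68; the anchor 5'6" = 66 is arbitrary.
tall, short, anchor = 75, 68, 66

# Implicitly anchoring at zero gives a "close" ratio:
raw_ratio = tall / short                               # about 1.10

# Shifting the anchor to 5'6" makes the same gap look large:
anchored_ratio = (tall - anchor) / (short - anchor)    # 9 / 2 = 4.5
print(raw_ratio, anchored_ratio)
```

<p>The same two heights, and yet the implied comparison swings from "roughly equal" to "more than four times," entirely depending on the anchor.</p>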
<p>Another explanation is that we have mapped the numbers into some decision theory. When a computer reads 6’3, it is just some 32-bit integer. But we are not computers, after all. We automatically build a decision theory in which the integer 6’3 is mapped to a masculine man wearing a Brooks Brothers suit and oxford shoes, while the number 52% is mapped to some annoying recounts and memories of 2000. None of this additional information is encoded in the numbers as they are presented.</p>Yuling YaoI was reading an article on how the politician’s height can have a causal effect on electability. But then I realize we often have a different scale for comparing numbers when we know these numbers represent some physical objects.