<p><em>Yuling Yao’s Blog: Bayesian Statistics, Machine Learning (http://www.yulingyao.com/blog/feed.xml)</em></p>
<h1 id="likelihood-principle">The likelihood principle in model check and model evaluation</h1>
<p><em>2020-12-16, http://www.yulingyao.com/blog/2020/likelihood</em></p>
<p>The likelihood principle is often phrased as an axiom in Bayesian statistics. My interpretation of the likelihood principle reads:</p>
<p>We are (only) interested in estimating an unknown parameter $\theta$, and there are two data generating experiments both involving $\theta$, with observable outcomes $y_1$ and $y_2$ and likelihoods $p_1(y_1 \vert \theta)$ and $p_2(y_2 \vert \theta)$. If the outcome-experiment pairs satisfy $p_1(y_1 \vert \theta) \propto p_2(y_2 \vert \theta)$ viewed as a function of $\theta$, then these two experiments and two observations provide the same amount of information about $\theta$.</p>
<p>Consider a classic example. Someone is running an A/B test and is only interested in the treatment effect, and he tells his manager that among all $n=10$ respondents, $y=9$ saw an improvement (assuming the metric is binary). It is natural to estimate the improvement probability $\theta$ by an independent Bernoulli trial likelihood: $y\sim \mathrm{binomial} (\theta\vert n=10)$. Informative priors can exist but are not relevant to our discussion here.</p>
<p>What is relevant is that the manager later found that the experiment was not done appropriately. Instead of independent data collection, the experiment was designed to sequentially keep recruiting respondents until $y=9$ of them were positive. The actual random outcome is $n$, while $y$ is fixed. So the correct model is $10=n\sim$ negative binomial $(\theta\vert y=9)$.</p>
<p>Luckily, the likelihood principle kicks in, thanks to the fact that
binomial_lpmf $(y\vert n, \theta) =$ neg_binomial_lpmf $(n-y\vert y, \theta)$ + constant. Hence no matter how the experiment was done, the two models yield the same inference.</p>
<p>At the abstract level, the likelihood principle says the information about $\theta$ can only be extracted via the likelihood, not from outcomes of experiments that could have been run but were not.</p>
<p>For example, in hypothesis testing, the type-1 error is defined with respect to hypothetical replications of the experiment (e.g., under the null $\theta=0$). A classic example: one has two scales which return $y\sim$ N$(\theta, 1)$ or N$(\theta, 10000)$ respectively, and which scale is to be used is determined by a coin flip. Even if in one trial we know the precise scale was used, the hypothesis test still uses the inflated p-value $p= \Pr(X_{mix} >\vert y \vert)$, where $X_{mix}$ comes from the mixture density $X_{mix} \sim .5 N(0,1)+.5 N(0,10000)$.</p>
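<p>A quick numerical sketch of this inflation (all numbers are made up for illustration: I read N$(\theta, 10000)$ as variance 10000, i.e. sd 100, and pick an arbitrary observed reading $y=2.5$):</p>

```python
# Hypothetical two-scale example: the precise scale has sd 1, the noisy one
# sd 100 (reading N(theta, 10000) in the text as variance 10000).
import statistics

Phi = statistics.NormalDist().cdf
y = 2.5  # an arbitrary observed reading, taken on the precise scale

# p-value conditioning on the scale we actually used (sd = 1)
p_conditional = 2 * (1 - Phi(abs(y)))

# unconditional p-value demanded by the coin-flip sampling scheme
p_mixture = 0.5 * 2 * (1 - Phi(abs(y))) + 0.5 * 2 * (1 - Phi(abs(y) / 100))

print(p_conditional)  # about 0.012: "significant"
print(p_mixture)      # about 0.50: hopelessly inflated
```

<p>Half of the mixture mass sits on the sd-100 component, under which $\vert y\vert = 2.5$ is unremarkable, so the unconditional p-value is dominated by that component.</p>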
<h2 id="what-can-go-wrong-in-model-check">What can go wrong in model check</h2>
<p>The likelihood is dual-purposed in Bayesian inference. For inference, it is just one component of the unnormalized density. But for model check and model evaluation, the likelihood function enables the generative model to produce posterior predictions of $y$.</p>
<p>In the binomial/negative binomial example, it is OK to stop at the inference of $\theta$. But as soon as we want to check the model, we do need to distinguish between the two possible sampling models and decide which variable ($n$ or $y$) is random.</p>
<p>Suppose we observe $y=9$ positive cases among $n=10$ trials and take the point estimate $\theta=0.9$; the likelihoods of the binomial and negative binomial models are</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> y=9
> n=10
> dnbinom(n-y,y,0.9)
0.3486784
> dbinom(y,n, 0.9)
0.3874205
</code></pre></div></div>
<p>Not really identical. But the likelihood principle does not require them to be identical. What is needed is a constant density ratio, and that is easy to verify:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> prob_list=seq(0.5,0.95,length.out = 100)
> dnbinom(n-y,y, prob=prob_list)/dbinom(y,n, prob=prob_list)
</code></pre></div></div>
<p>The result is a constant ratio, $0.9$.</p>
<p>However, the posterior predictive check (PPC) will have different p-values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> 1-pnbinom(n-y,y, 0.9)
0.2639011
> 1-pbinom(y,n, 0.9)
0.3486784
</code></pre></div></div>
<p>The difference of the PPC-p-value can be even more dramatic with other $\theta$:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> 1-pnbinom(n-y,y, 0.99)
0.0042662
> 1-pbinom(y,n, 0.99)
0.9043821
</code></pre></div></div>
<p>Just very different!</p>
<p>Clearly, using the full Bayesian posterior of $\theta$ does not fix the issue. The problem is that the likelihood principle only ensures a constant ratio as a function of $\theta$, not as a function of $y_1$ or $y_2$.</p>
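<p>To see that the full posterior does not rescue the PPC, here is a sketch that integrates $\theta$ out exactly under a flat Beta(1,1) prior (an assumption for illustration; both likelihoods are proportional in $\theta$, so both give the same Beta(10,2) posterior, yet the posterior predictive tail probabilities still differ):</p>

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

y, n = 9, 10
a, b = y + 1, n - y + 1  # Beta(10, 2): the same posterior under both likelihoods

def beta_binomial_pmf(k):
    # p(y_rep = k | data) under the binomial model, theta integrated out
    return math.comb(n, k) * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))

def beta_neg_binomial_pmf(f):
    # p(f failures before the y-th success | data), theta integrated out
    return math.comb(f + y - 1, f) * math.exp(log_beta(a + y, b + f) - log_beta(a, b))

p_bin = beta_binomial_pmf(9) + beta_binomial_pmf(10)  # P(y_rep >= y), about 0.54
p_nb = 1 - beta_neg_binomial_pmf(0)                   # P(n_rep - y >= 1), about 0.71
```

<p>Same posterior for $\theta$, different predictive p-values: the check lives on the sample space, not on the parameter space.</p>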
<h2 id="model-selection">Model selection?</h2>
<p>Unlike the unnormalized likelihood in the likelihood principle, the marginal likelihood in model selection is required to be normalized.</p>
<p>In the previous A/B testing example, given data $(y,n)$, if we know that one and only one of the binomial or the negative binomial experiments was run, we may want to select a model based on the marginal likelihood. For simplicity, consider a point estimate $\hat \theta=0.9$. Then we obtain a likelihood ratio test with ratio $0.9$, slightly favoring the binomial model. In fact this marginal likelihood ratio is the constant $y/n$, independent of the posterior distribution of $\theta$. If $y/n=0.001$, we get a Bayes factor of 1000 favoring the binomial model.</p>
<p>Except it is wrong: it is not sensible to compare a density on $y$ with a density on $n$.</p>
<h2 id="what-can-go-wrong-in-cross-validation">What can go wrong in cross-validation</h2>
<p>CV requires some loss function, and the same likelihood does not imply the same loss function (L2 loss, interval loss, etc.). For concreteness, we adopt log predictive densities for now.</p>
<p>CV also needs some part of the data to be exchangeable, which depends on the sampling distribution.</p>
<p>On the other hand, the computed LOO-CV log predictive density seems to depend on the data only through the likelihood. Using the two-model notation $M_1: p_1(y_1\vert \theta)$ and $M_2: p_2(y_2\vert \theta)$,</p>
\[\text{LOOCV}_1= \sum_i \log \int_\theta {\frac{ p_\text{post} (\theta\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta) }} \left({ \int_{\theta'} \frac{ p_\text{post} (\theta'\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta') }d\theta'}\right)^{-1} p_1 (y_{1i}\vert\theta)\, d\theta,\]
<p>and $\text{LOOCV}_2$ is the same expression with every subscript 1 replaced by 2.</p>
<p>The likelihood principle does say that $p_\text{post} (\theta\vert M_1, y_1)=p_\text{post} (\theta\vert M_2, y_2) $,
and if some generalized likelihood principle further ensures that $p_1 (y_{1i}\vert\theta)\propto p_2 (y_{2i} \vert\theta)$ pointwise, then $\text{LOOCV}_1= \text{constant} + \text{LOOCV}_2$.</p>
<p>Sure, but that is an extra assumption. And arguably this pointwise likelihood principle is such a strong assumption that it is hardly useful beyond toy examples.</p>
<p>The basic form of the likelihood principle does not have the notion of $y_i$. It is possible that $y_2$ and $y_1$ have different sample sizes: consider a meta-analysis of many polls, where each poll is a binomial model with $y_i\sim \mathrm{binomial}(n_i, \theta)$. If I have 100 polls, I have 100 data points. Alternatively I can view the data as $\sum {n_i}$ Bernoulli trials, and the sample size becomes $\sum_{i=1}^{100} {n_i}$.</p>
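<p>A minimal sketch of this bookkeeping, with made-up poll counts: the binomial and the flattened-Bernoulli representations have log likelihoods that differ only by a $\theta$-free constant (so the posterior is identical), while the number of pointwise terms changes from the number of polls to the number of trials.</p>

```python
import math

# hypothetical meta-polling data (illustrative numbers): (y_i, n_i) per poll
polls = [(12, 20), (45, 80), (7, 15), (30, 50), (18, 40)]

def binom_loglik(theta):
    # one binomial term per poll: 5 pointwise units
    return sum(math.log(math.comb(n, y)) + y * math.log(theta)
               + (n - y) * math.log(1 - theta) for y, n in polls)

def bernoulli_loglik(theta):
    # the same data flattened into sum(n_i) Bernoulli trials: 205 pointwise units
    Y = sum(y for y, n in polls)
    N = sum(n for y, n in polls)
    return Y * math.log(theta) + (N - Y) * math.log(1 - theta)

const = math.fsum(math.log(math.comb(n, y)) for y, n in polls)
diffs = [binom_loglik(t) - bernoulli_loglik(t) for t in (0.2, 0.5, 0.8)]
# diffs is the same constant (sum of log binomial coefficients) at every theta
```

<p>Identical posteriors, but any pointwise construction such as LOO-CV sees 5 data points in one representation and 205 in the other.</p>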
<p>Finally, just as in the marginal likelihood case, even if all the conditions above hold, and regardless of this identity, it is conceptually wrong to compare $\text{LOOCV}_1$ with $\text{LOOCV}_2$. They are scoring rules on two different spaces (probability measures on $y_1$ and $y_2$ respectively) and should not be compared directly.</p>
<h2 id="ppc-again">PPC again</h2>
<p>Although it is a bad practice, we sometimes compare PPC p-values for the purpose of model comparison. In the $y=9$, $n=10$, $\hat \theta=0.99$ case, we can compute the two-sided p-value
$\min (\Pr(y_{sim} > y \vert n), \Pr(y_{sim} < y \vert n))$ for the binomial model and $\min (\Pr(n_{sim} > n \vert y), \Pr(n_{sim} < n \vert y))$ for the NB model respectively.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> min(pnbinom(n-y,y, 0.99), 1-pnbinom(n-y,y, 0.99) )
0.0042662
> min( pbinom(y,n, 0.99), 1-pbinom(y,n, 0.99))
0.09561792
</code></pre></div></div>
<p>In the marginal likelihood and log score cases, we know we cannot directly compare two likelihoods or two log scores defined on two different sampling spaces. Here, the p-value is naturally normalized. Does this mean the NB model is rejected while the binomial model passes the PPC?</p>
<p>Still we cannot: we should not compare p-values at all.</p>
<h2 id="the-likelihood-principle-and-the-sampling-distribution">The likelihood principle and the sampling distribution</h2>
<p>To avoid unfair comparisons of marginal likelihoods and log scores across two sampling spaces, a remedy is to consider a
product space: both $y$ and $n$ are now viewed as random variables.</p>
<p>The binomial/negative binomial narrative specifies two models: $p(n,y\vert \theta)= 1(n=n_{obs})\, p(y\vert n, \theta)$ and $p(n,y\vert \theta)= 1(y=y_{obs})\, p(n\vert y, \theta)$.</p>
<p>The ratio of these two densities admits only three values:
0, infinity, or the constant $y/n$.</p>
<p>If we observe several pairs of $(n, y)$, we can easily decide which margin is fixed. The harder problem is when we only observe one $(n,y)$. Based on the comparison of marginal likelihoods and log scores in the previous sections, it seems both metrics would still prefer the binomial model (now viewed as a sampling distribution on the product space).</p>
<p>Well, it is almost correct, except that 1) the sample log score is not meaningful if there is only one observation, and 2) we need some prior on models to go from the marginal likelihood to the Bayes factor. After all, under both sampling models, the event admitting a nontrivial ratio, $1(y=y_{obs}) 1(n=n_{obs})$, has zero measure. We could do whatever we want at this point without affecting any asymptotic property in the almost-sure sense.</p>
<h1 id="mc-quantile">Monte Carlo estimate of quantile</h1>
<p><em>2020-11-24, http://www.yulingyao.com/blog/2020/tail</em></p>
<p>This comes up a lot in Monte Carlo computation: we are only given finite draws but we want to compute extreme quantiles.</p>
<p>Our go-to estimate is the sample quantile. Given posterior draws $\theta_1, \dots,\theta_S$, the $[S\alpha]$-th draw (ordered increasingly), $\theta_{([S\alpha])}$, is an asymptotically unbiased estimate of the $\alpha$ quantile of the distribution, which we shall call $\theta^*_\alpha$.
Using some basic theory of order statistics, the variance of this estimate is (asymptotically)</p>

<p>\(\frac{\alpha (1-\alpha)}{S f^2(\theta^*_\alpha)}+ o(S^{-1}).\)
This can be derived from the CLT for the empirical CDF.
Inside this expression, the numerator is bounded from above, while the denominator can get very ugly if the density at the quantile, $f(\theta^*_\alpha)$, is small. The approximation is more complicated if $\theta_i$ is drawn from a Markov chain, but this is the basic idea of why tail quantile estimation is hard.</p>
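<p>A quick simulation check that the asymptotic variance $\alpha(1-\alpha)/(S f^2(\theta^*_\alpha))$ matches what we see in practice (a sketch with arbitrary settings: standard normal draws and $\alpha = 0.9$):</p>

```python
import random
import statistics

random.seed(1)
S, alpha, R = 1000, 0.9, 1000
nd = statistics.NormalDist()
q = nd.inv_cdf(alpha)  # true 0.9 quantile of N(0, 1)
f_q = nd.pdf(q)        # density at that quantile

estimates = []
for _ in range(R):
    draws = sorted(random.gauss(0.0, 1.0) for _ in range(S))
    estimates.append(draws[int(S * alpha)])  # the sample 0.9 quantile

empirical_var = statistics.variance(estimates)
asymptotic_var = alpha * (1 - alpha) / (S * f_q ** 2)
ratio = empirical_var / asymptotic_var  # should hover around 1
```

<p>Pushing $\alpha$ further into the tail shrinks $f(\theta^*_\alpha)$ and the variance blows up accordingly.</p>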
<p>Can we improve sample quantiles?</p>
<h2 id="the-quantile-estimate-is-as-good-as-cdf-estimate">The quantile estimate is as good as CDF estimate</h2>
<p>To link the estimation error in CDF and in quantile, a useful approximation is <em>Bahadur representation</em>:</p>
\[\theta_{[S\alpha]}= \theta^*_\alpha + \frac{\alpha- \hat F_S(\theta^*_\alpha)}{f(\theta^*_\alpha)} + O(S^{-3/4}\log\log S)\]
<p>The remainder term has a faster convergence rate than the usual root-$S$. Hence, to have a good quantile estimate is almost the same as to have a good CDF estimate.
The empirical CDF is a Monte Carlo sum $\hat F(u)= 1/S \sum_{i=1}^S c(\theta_i)$, for the indicator function $c(\theta_i)=I(\theta_{i}\leq u)$. The usual variance reduction techniques apply here, and we may add a mean-zero control variate to the Monte Carlo sum. For example, we can use a parametric model to approximate the distribution of $\theta_i$ and subtract the deterministic part from the summand; in particular, if $\theta$ is discrete, we can first sum exactly over the categories with high probability and only run Monte Carlo on the remaining categories. This step can also be viewed as Rao-Blackwellization.</p>
<h2 id="control-variate">control variate</h2>
<p>Typically $\theta$ is continuous, yet we may still choose to use numerical methods to compute the bulk probability $\Pr(a<\theta<b)$, and run Monte Carlo on the tail intervals:</p>
\[\hat F(u)= \begin{cases} 1/S\sum_{i=1}^S I(\theta_{i}\leq u), ~if ~ u<a; \\
\hat F (a)+ \Pr(a<\theta<u) , ~if ~ a<u<b;\\
\hat F (a)+ \Pr(a<\theta<b)+ 1/S\sum_{i=1}^S I(b< \theta_{i}\leq u), ~if ~ u>b \\
\end{cases}\]
<p>something like this. The point is that we can reduce some variance by computing the bulk probability $\Pr(a<\theta<b)$ deterministically.</p>
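<p>A minimal simulation sketch of this idea (the target, boundary, and sizes are made up; note I deliberately choose the bulk boundary $b$ close to $u$, so that the Monte Carlo part covers only a thin slice, which is when the tail variance actually drops below that of the plain empirical CDF):</p>

```python
import random
import statistics

random.seed(2)
nd = statistics.NormalDist()
u = nd.inv_cdf(0.99)   # we want to estimate F(u) = 0.99
b = nd.inv_cdf(0.985)  # bulk boundary, deliberately chosen close to u
F_b = nd.cdf(b)        # "numerical" bulk probability, no Monte Carlo error

S, R = 1000, 1000
plain, stratified = [], []
for _ in range(R):
    draws = [random.gauss(0.0, 1.0) for _ in range(S)]
    plain.append(sum(x <= u for x in draws) / S)
    stratified.append(F_b + sum(b < x <= u for x in draws) / S)

var_plain = statistics.variance(plain)
var_strat = statistics.variance(stratified)  # smaller: only the thin slice is random
```

<p>Both versions are unbiased for $F(u)$; the second replaces the indicator count below $b$ with its exact expectation, which is precisely a control variate with known mean.</p>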
<h2 id="regularize-the-tail">regularize the tail</h2>
<p>We can do even better in the tail. For example, we can estimate a polynomial tail and use its expected value instead.</p>
<p>A parametric model, however, might not necessarily extrapolate well in the extreme tail. Consider even the lucky case where we can evaluate the pointwise marginal density $p(\theta_i)$ exactly, and we fit a parametric model $p(\theta_i)\approx f(\theta_i)$. No matter how flexible $f$ is in fitting the observed samples, it relies on certain tail shape assumptions. A cubic spline assumes a cubic tail. A Gaussian process regression on $(x=\theta_i, y=p(\theta_i))$ does not model the tail directly, but the mean function at $x^\star$ is $k(x, x^\star) C$, where $C$ is a vector not depending on $x^\star$; hence the whole expression models a Gaussian tail ($\log \mathrm{pdf}(x)= O(x^2)$) if $k(\cdot,\cdot)$ is squared-exponential. It is not trivial: why should the random variable have a Gaussian tail just because I used a GP to fit its pdf?</p>
<h2 id="l-estimate">L estimate</h2>
<p>We can use all order statistics instead of only one. That is, $1/S \sum_{i=1}^S \theta_{(i)}w_i$, where $w_i$ is optimized to minimize the variance of this summand, while the constraint $1/S \sum_{i=1}^S w_i \hat F(\theta_{(i)})= \alpha$ ensures that the final estimate is asymptotically unbiased.</p>
<p>This estimate can be viewed as a special case of the previous section with multiple control variates (all order statistics).</p>
<p>It seems we have many cute techniques, yet we seldom apply any of them in practice. A natural question is: what is the most general approach? Maybe the starting point is to think about the UMVUE of the population quantile given both the samples and the marginal density.</p>
<h1 id="discrete-book">Book review “Discrete Distribution”</h1>
<p><em>2020-11-21, http://www.yulingyao.com/blog/2020/discretebook</em></p>
<p>Today I was reading the book “Discrete Distribution” by Johnson and Kotz. I did not realize it has a newer version until I started this blog post; the <a href="https://books.google.com/books?id=1w7vAAAAMAAJ&source=gbs_book_other_versions">edition</a> I read was published in 1969 by Wiley.</p>
<p>Nevertheless, it is almost exotic to notice how much the focus of study has shifted in half a century. As a pre-computer-era publication, this book spends a lot of space on various approximation tricks. For example, for a binomial distribution $X\sim Bin( \theta, N)$, the transformation</p>
\[\mathrm{arcsin}\left(\sqrt{\frac{X+3/8}{N+3/4}}\right)\]
<p>leads to an approximate normal$(\mathrm{arcsin}(\sqrt \theta),\, 1/(2\sqrt{N}))$ distribution. But that is not the end of the story, cuz the author spends the next page showing that the transformation</p>
\[\frac{1}{2}\left(\mathrm{arcsin \sqrt{\frac{X}{N+1}}}+\mathrm{arcsin}\sqrt{\frac{X+1}{N+1}}\right)\]
<p>gives an even better normal approximation, with a few more decimal places of accuracy.</p>
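<p>These approximations are easy to check numerically today. A sketch that computes the exact mean and standard deviation of the first $(3/8, 3/4)$ transformation by summing over the binomial pmf ($N$ and $\theta$ are arbitrary illustrative choices):</p>

```python
import math

N, theta = 50, 0.3  # arbitrary illustrative values

# exact moments of the transformed variable, summing over the binomial pmf
pmf = [math.comb(N, x) * theta**x * (1 - theta) ** (N - x) for x in range(N + 1)]
t = [math.asin(math.sqrt((x + 3 / 8) / (N + 3 / 4))) for x in range(N + 1)]

mean = sum(p * v for p, v in zip(pmf, t))
var = sum(p * (v - mean) ** 2 for p, v in zip(pmf, t))

print(mean, math.asin(math.sqrt(theta)))       # close to each other
print(math.sqrt(var), 1 / (2 * math.sqrt(N)))  # close to each other
```

<p>The variance-stabilizing property (sd roughly $1/(2\sqrt N)$ regardless of $\theta$) is what made these transformations so attractive before cheap simulation.</p>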
<p>There are many tricks like this in the book, toward which I have conflicted feelings. On one hand they all look cute and stand for an extensive intellectual effort. On the other hand, they are rather useless nowadays as estimation techniques: we no longer need these normal approximations when we can easily generate any posterior simulation for free.</p>
<p>Another exotic impression this book left on me is its dedication to a comprehensive review of compound distributions, and again they all have cute names: Polya-Eggenberger (binomial + beta), Neyman Type-A (Poisson + Poisson), etc. Again, these cute names are not really that useful in modern modeling. For example, when we worry about the heterogeneity of the model, we can still use a restricted likelihood (e.g., $y_i\sim$ Poisson$(\theta_i)$) and hierarchically model the subject-varying parameter $\theta_i\sim$ foo() without having to worry about what the compound likelihood is. Of course, in some cases the compound likelihood has a closed-form density (e.g., negative binomial), and using this closed form eliminates the local parameter $\theta_i$ and hence improves computational efficiency.</p>
<h1 id="map-overfit">Does MAP estimate always overfit?</h1>
<p><em>2020-10-21, http://www.yulingyao.com/blog/2020/point</em></p>
<p>This is wrong. Indeed it can be the opposite.</p>
<p>Well, just to be clear, here I am talking about a specific situation: We have a model and a dataset, such that we can write down the joint posterior density. To estimate parameters, we can use</p>
<ol>
<li>posterior simulation draws from the joint posterior densities, for which we run MCMC.</li>
<li>MAP estimate: the joint mode of the density, for which we run optimization such as SGD.</li>
</ol>
<p>Consider any hierarchical model of the form
\(y_{ik}\sim \mathrm{N}(\mu_k, \sigma); ~~
\mu_k \sim \mathrm{N}(\mu_0, \tau); ~~ p(\tau) \propto 1.\)</p>
<p>The joint mode is attained on the complete-pooling subspace: $\tau=0, \mu_k = \mu_0$. And this complete-pooling model is always simpler than the hierarchical model in terms of the training-testing error gap. If anything, the MAP underfits.</p>
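<p>A small numerical sketch of why the joint mode degenerates (the toy data are made up): along the complete-pooling ray $\mu_k = \mu_0$, the log joint density increases without bound as $\tau \to 0$, so optimization is pulled into the funnel.</p>

```python
import math

# made-up toy data: K = 3 groups
data = [[1.2, 0.8, 1.5], [0.3, 0.9], [1.1, 0.7, 1.3, 0.5]]
sigma = 1.0
mu0 = sum(sum(g) for g in data) / sum(len(g) for g in data)  # grand mean

def log_normal_pdf(x, m, s):
    return -0.5 * math.log(2 * math.pi * s * s) - (x - m) ** 2 / (2 * s * s)

def log_joint(tau):
    # evaluated at the complete-pooling point mu_k = mu0, with a flat prior on tau
    lp = sum(log_normal_pdf(y, mu0, sigma) for g in data for y in g)
    lp += sum(log_normal_pdf(mu0, mu0, tau) for _ in data)  # N(mu0 | mu0, tau) terms
    return lp

vals = [log_joint(t) for t in (1.0, 0.1, 1e-3, 1e-6)]
# vals is strictly increasing: the joint density blows up as tau -> 0
```

<p>Each group-level prior term contributes $-\log\tau$, so the "mode" at $\tau=0$ is really a density spike, and the optimizer lands on complete pooling.</p>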
<p>Of course I believe there are other examples in which the MAP does overfit. The relation is just not definite. Is there any general characterization? I don’t know.</p>
<h1 id="bayes-predict">Why Bayesian models could have better predictions</h1>
<p><em>2020-09-28, http://www.yulingyao.com/blog/2020/predict</em></p>
<p>In a predictive paradigm, no one really cares about how I obtain the estimation or the prediction. It can come from an MLE, a MAP, risk minimization, or some Bayes procedure. Also, when we talk about a predictive paradigm, a big selling point of the Bayesian procedure, offering confidence intervals for free, is largely undermined. A hedge fund manager does not really care about the p-value of strategy A outperforming strategy B. Even if there is only weak evidence that the winning effect is N(0.1,1), decision theory says he should adopt the empirical winner.</p>
<p>However, even with this black box prediction attitude, a Bayesian model analysis is still useful for the following reasons.</p>
<ol>
<li>
<p>Prior = regularization = robustness. I agree this argument is lame, as everyone is using regularization anyway.</p>
</li>
<li>
<p>The posterior distribution is simply more expressive than a MAP point estimate. Sure, I suppose from the probabilistic prediction point of view, an even more general procedure is to optimize over all probability distributions subject to the decision problem. But the space of all probability distributions is too big, and Bayesian inference gives a coherent approximation to this infeasible goal. A caveat is that the Bernstein-von Mises theorem says “in the long run, we are all dead”, cuz Bayes has to converge to a point estimate anyway. But that actually means the model is not given or fixed.</p>
</li>
<li>
<p>It is easier to work with Bayesian models on hierarchical data. To design a good hierarchical model to reflect the data structure is pretty much to have a customized NN infrastructure, except you do not publish every new model in NIPS.</p>
</li>
<li>
<p>The Bayesian procedure is computationally harder, but once we obtain the final posterior simulation, it becomes easier to run other post-processing and adversarial training. For example, BMA gives a coherent model ensemble (which in general is not feasible with MAPs). Simulated tempering is a more sophisticated version of NN distillation.</p>
</li>
<li>
<p>The ability to simulate “generated quantities” for free. When we model the stock price by a linear regression with a normal model, we have to use a normal approximation for the option. But with MCMC draws, the option striking probability can be calculated directly. The same blessing applies to intervals and quantiles.</p>
</li>
</ol>
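<p>A sketch of point 5 (all numbers are made up: a normal posterior on the log price stands in for MCMC draws, so the price itself is lognormal): the strike probability read directly off the draws matches the exact value, while a normal approximation to the price is noticeably off.</p>

```python
import math
import random
import statistics

random.seed(3)
S = 100_000
# stand-in "posterior draws" of the next-period log price: N(0, 0.5)
price = [math.exp(random.gauss(0.0, 0.5)) for _ in range(S)]
strike = 1.5

# generated quantity: strike probability read directly off the draws
p_mc = sum(p > strike for p in price) / S

# exact lognormal answer, and a naive normal approximation to the price itself
p_exact = 1 - statistics.NormalDist(0.0, 0.5).cdf(math.log(strike))
m, s = statistics.mean(price), statistics.stdev(price)
p_normal = 1 - statistics.NormalDist(m, s).cdf(strike)
# p_mc is close to p_exact; p_normal is visibly off (the price is skewed)
```

<p>Any functional of the draws, intervals, quantiles, exceedance probabilities, comes at the same near-zero cost.</p>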
<p>To be sure, there is NOT a guarantee that a Bayesian posterior must produce a better prediction; we do not even expect it to for an arbitrary model. The model-has-been-fit-so-we-are-done era has gone.</p>
<p>Unfortunately, the previous points 3-5 are pretty much unaddressed in “Bayesian learning”. Most theories are based on exchangeable data; minimal post-processing is done; customized prediction is not really a thing. Much room for development.</p>
<h1 id="latent-splines">Some difficulty of fitting splines on latent variables</h1>
<p><em>2020-09-16, http://www.yulingyao.com/blog/2020/spline</em></p>
<p>By spline I mean B-spline, and by B-spline I mean using piecewise polynomial bases in regressions. It might also be called kernel regression.</p>
<p>I think there are three distinct usages of splines:</p>
<ol>
<li>
<p>Interpolation: we force the spline to pass through given $(x, y)$ points, and make predictions for new $x$.</p>
</li>
<li>
<p>Smoothing: instead of exact interpolation, we allow some fitting error, but we also penalize the roughness of the fitted curve (the integral of the squared second-order derivative in 1-D, or the Laplacian in 2-D).</p>
</li>
<li>
<p>Filtering. It functions like 2, in that we regress the observations onto the basis function matrix, but instead of viewing the penalization term as a bias-variance tradeoff, we shift the goal to finding a latent variable surface. The bias in 2 is now interpreted as observation noise, and the latent surface is now the true value.</p>
</li>
</ol>
<p>Now, in certain models, we model some latent variables by splines. For example, the observations $y_{1i}, y_{2i}, i=1, \dots, n$ are modeled by two Gaussian processes
\(y_{1i}\sim \mathrm{gp} ({f_{1i}}, \sigma), \quad y_{2i}\sim \mathrm{gp} (f_{2i}, \sigma),\)
but the actual goal is to model the autoregression dynamic \(f_{2i} \sim \mathrm{~some~function~of~} f_{1i}.\)</p>
<p>It is actually straightforward to do that in Stan. We just model the autoregression using splines.</p>
<p>However, there are still some challenges here:</p>
<ol>
<li>For observed values, the spline basis matrix $B_j(x_i)$ can be pre-computed and stored. The basis function becomes a feature extraction, and the fit is as simple as a regression. But for latent modeling, the basis matrix $B_j(f_i)$ has to be evaluated for every new value of the parameter $f_i$. That is a lot of evaluations if either the number of basis functions or the sample size is big.</li>
<li>The usual B-spline regression can be GREATLY improved in efficiency by <em>sparse matrices</em>. In 1-D, suppose the $K$ knots are the data quantiles with probabilities $1/K, … (K-1)/K$; then $B_j(x_i)$ will only have $O(1/K)$ non-zero elements. For 2-D with a tensor product of bases, that is $O(1/K^2)$. The B matrix is almost always sparse for latent variable splines too. But then how can we exploit sparse matrices when the matrix changes every iteration? I mean, we can still convert B to a sparse matrix, but we have to convert it in every iteration.</li>
<li>The most horrible part is that the spline is zero outside its boundary knots, but we do not really know the boundary of the latent variables, cuz they are latent. That is bad. If the sampler or the optimizer enters a region where the latent variables {$f_i$} are inferred to be wider than the knots, which are often fixed in advance, the gradient with respect to the coefficients of the B matrix becomes zero. The fit either gets stuck in the optimizer, or becomes a random walk in HMC.</li>
<li>A careful analysis shows that the region where the above gradients are non-zero is indeed a very thin slice of the joint space. One solution is to trim the space to match it: put a soft constraint on the span of the latent variables. I think B-splines are in general sensitive to the boundary knots; the unknown support amplifies such sensitivity.</li>
</ol>
<h1 id="gp-opinions">Gaussian process regressions having opinions or speculation</h1>
<p><em>2020-05-19, http://www.yulingyao.com/blog/2020/gp</em></p>
<p>I occasionally read Howard Marks’s memos, and in my recent infrequent visits, I have repeatedly encountered him citing Marc Lipsitch, Professor of Epidemiology at Harvard, that (in Lipsitch’s covid research and in Marks’s money making) there are:</p>
<ol>
<li>facts,</li>
<li>informed extrapolations from analogies to other viruses and</li>
<li>opinion or speculation.</li>
</ol>
<p>That is right. Statisticians need some stationarity and smoothness assumptions so as to learn from data, and thereby always place themselves at risk of over-extrapolation.</p>
<p>In machine learning, novelty detection and out-of-distribution uncertainty used to be a hotspot, especially given its connection to AI safety, and I have followed papers in this area for a while. (I think it is still a hotspot, but I don’t know for sure; indeed, if someone tells you he is completely sure about the present or the future, he is completely extrapolating. But anyway, it is fair to assume the heat of an area does not shrink overnight, so maybe it is still at least a warm spot.)</p>
<p>In part, many deep models ignore parameter uncertainty and are overconfident. But I feel there is a dangerous tendency for people to treat some non-parametric Bayesian models as always-right-but-hard-to-fit, as if we would never need to worry about novelty detection and out-of-sample uncertainty if we knew how to fit a Gaussian process with 10^10 points.</p>
<p>But a GP is not immune to extrapolation. Here I generate two-dimensional data $(x,y)$ with $x$ supported only near 1 and 3 ($.5 N(1, 0.01) + .5 N(3,0.01)$). I can still fit a GP, and it does return results that fit the data within their support.</p>
<p><img src="/blog/images/2020/gp_demo.png" alt="gp" title="gp" /></p>
<p>But wait, why is it so sure about what happens in between, where there is zero data in the middle? How could you know that $f(x)$ at 2 is identically 0, instead of -20004, or 343583? The model is completely extrapolating.</p>
<p>You could probably guess that the fitted length scale is very big, indeed longer than the $x$ span, so the GP effectively becomes a linear regression. That is not wrong per se; a linear regression can be useful too.</p>
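<p>To see concretely how the length scale alone drives the between-cluster confidence, here is a minimal pure-Python GP sketch (the data locations, length scales, and jitter are all made up for illustration): it computes the posterior predictive sd at $x=2$ for a zero-mean GP with a squared-exponential kernel.</p>

```python
import math

def rbf(x1, x2, ls):
    # squared-exponential kernel with unit prior variance
    return math.exp(-0.5 * (x1 - x2) ** 2 / ls ** 2)

def solve(A, rhs):
    # Gaussian elimination with partial pivoting (small systems only)
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior_sd(xs, x_star, ls, jitter=1e-4):
    K = [[rbf(xi, xj, ls) + (jitter if i == j else 0.0)
          for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    k_star = [rbf(xi, x_star, ls) for xi in xs]
    v = solve(K, k_star)
    var = rbf(x_star, x_star, ls) - sum(ks * vi for ks, vi in zip(k_star, v))
    return math.sqrt(max(var, 0.0))  # clamp tiny negative values from roundoff

xs = [0.9, 1.0, 1.1, 2.9, 3.0, 3.1]          # x supported only near 1 and 3
sd_long = gp_posterior_sd(xs, 2.0, ls=5.0)   # long length scale: confident at x = 2
sd_short = gp_posterior_sd(xs, 2.0, ls=0.1)  # short length scale: sd near the prior's 1
```

<p>The "uncertainty" in the no-data zone is whatever the kernel says it is: with a long length scale the posterior sd at $x=2$ is tiny, and with a short one it reverts to the prior sd. Nothing in the data distinguishes the two.</p>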
<p>Even worse, such over-confidence is self-confirming. Wikipedia says “the length scale tells how safe it is to extrapolate outside the data span”. That is wrong. It does not tell us how safe it is. The inference in the no-data zone comes from the prior, which is a mean-zero GP, and it is very dangerous, arrogant and reckless to treat that as always the true model and the right uncertainty we are looking for.</p>
<p>Ideally we want a model/inference that goes on strike and yells at the user when it perceives that it is making non-data-ful extrapolation. Of course anything beyond the data comes from the prior, and the prior is just more data, so technically it is as kosher to estimate the posterior outside the data domain by a crazy GP prior as it is to estimate the empirical density by a delta function: these are two extremes on the spectrum of how we weigh the relative reliance on prior and data. If you do not yell at the empirical process, why should the GP yell at you?</p>
<p>It is not utterly sane to stop this post with the previous question, to which I do not know the answer. But the main message is clear: a GP is not always right, a GP can be as over-extrapolating as a linear regression, and in many cases we do not know whether the GP we are running is over-extrapolating or not.</p>
<h1 id="ldp-intro">A very short introduction on the large deviation principle</h1>
<p><em>2020-05-06, http://www.yulingyao.com/blog/2020/introLDP</em></p>
<p>I took a seminar class on the Large Deviation Principle (LDP) by Sumit. I summarize below some results that I personally think are most relevant (to what I am doing now). Most results are from the book Large Deviations Techniques and Applications (Dembo and Zeitouni, 2009).</p>
<h2 id="from-law-of-large-numbers-to-the-large-deviation-principle">From Law of Large Numbers To The Large Deviation Principle</h2>
<p>Given a sequence of probability measures ${\mu_{\epsilon}}$ on a space $(\mathcal{X}, \mathcal{B})$, instead of the limiting measure itself (for example, $\mu_{\epsilon} (\Gamma) \to 0$), we may also be interested in how quickly such convergence happens. The large deviation principle describes the limiting rate of such a sequence, where the rate is characterized by a lower-semicontinuous mapping $I$ from $\mathcal{X}$ to $[0, \infty]$, which we call a <em>rate function</em>.</p>
<p><strong>Definition</strong>: $\mu_{\epsilon}$ satisfies the large deviation principle with a rate function I, if for all set $\Gamma\in \mathcal{B}$,</p>
\[-\inf_{x\in \Gamma^0} I(x) \leq \liminf_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (\Gamma) \leq \limsup_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (\Gamma) \leq -\inf_{x\in \bar \Gamma} I(x)\]
<p>Consider a concrete example: if $S_n$ is the sample average of iid standard Gaussian random variables $X_1, \dots, X_n$, then $\sqrt{n}\, S_n \sim N(0, 1)$ exactly. Hence $P(\vert S_n\vert \geq \delta) = P(\vert N(0,1)\vert \geq \delta\sqrt n )$, which tends to 0 for any $\delta>0$. Moreover, for this toy case the normal distribution is exact rather than a CLT approximation, and a direct calculation leads to</p>
\[1/n \log P(\vert S_n\vert \geq \delta) \to -\delta^2/2.\]
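<p>This limit is easy to verify numerically, since the Gaussian tail probability is available in closed form (a sketch with the arbitrary choice $\delta = 1$; <code>math.erfc</code> stays accurate far into the tail, where <code>1 - cdf</code> would underflow):</p>

```python
import math

delta = 1.0

def rate(n):
    # P(|S_n| >= delta) for the average of n iid N(0, 1): S_n ~ N(0, 1/n);
    # the two-sided tail is erfc(delta * sqrt(n / 2))
    tail = math.erfc(delta * math.sqrt(n / 2.0))
    return math.log(tail) / n

print(rate(10), rate(100), rate(1000))  # approaching -delta**2 / 2 = -0.5
```

<p>The $O(\log n / n)$ gap to $-\delta^2/2$ comes from the polynomial prefactor of the Gaussian tail, which the large deviation scaling washes out.</p>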
<p>In general, this precise rate is way beyond what a CLT can describe. A motivating example I have in mind is <em>importance sampling</em>: we draw $x_i$ from a proposal distribution $q$, and we can estimate $E_p h(x)$ by $S_n=1/n \sum_{i=1}^n h(x_i)r(x_i)$ with $r=p/q$, followed by self-normalization. We do know $S_n \to E_p h(x)$, but how fast? How can we characterize the probability that a large estimation error occurs, $P(\vert S_n- E_p h(x) \vert \geq \delta)$? Indeed, even if $r$ has a finite second moment and the CLT holds, this large deviation probability still depends on the distributions of both $r$ and $h$.</p>
<p>Another practical situation I have recently considered is sequential design/active learning. For example, in a clinical trial we may adaptively sample until an interim decision boundary is reached (say some “p-value” becomes “significant”). Aside from designing the hypothesis test, we can use $P(\vert S_n\vert \geq \delta)$ to compute the expected stopping time.</p>
<p>For the purpose of many proofs, we present an equivalent definition (equivalent when $\mathcal{B}$ contains the Borel sigma field of $\mathcal{X}$):</p>
<p>$\mu_{\epsilon}$ satisfies the large deviation principle with a rate function $I()$, if</p>
<ol>
<li>
<p>For every closed set $F \subset \mathcal{X}$,
\(\limsup_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (F) \leq -\inf_{x\in F} I(x).\)</p>
</li>
<li>
<p>For every open set $G \subset \mathcal{X}$,</p>
</li>
</ol>
\[-\inf_{x\in G} I(x) \leq \liminf_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (G) .\]
<h2 id="empirical-average-of-iid-samples--cramérs-theorem">Empirical average of IID samples: Cramér’s Theorem</h2>
<p>If we draw $X_1, \dots, X_n$ iid from a $d$-dimensional real-valued distribution $\mu$ and compute the empirical average $S_n=1/n \sum_{i=1}^n X_i$, of course we know $S_n\to E[X]$. The question is, how quickly.</p>
<p><strong>Cramér’s Theorem</strong> states that the law of $S_n$, denoted by $\mu_n$, satisfies the LDP with a convex rate function $\Lambda^*(\cdot)$.</p>
<p>To define $\Lambda^*$, we first define the log moment generating function</p>
\[\Lambda (\lambda)=\log \operatorname {E} [\exp(\langle\lambda, X\rangle)].\]
<p>where $\langle, \rangle$ is the inner product.</p>
<p>The desired rate function $\Lambda^*$ is its Fenchel–Legendre transform:</p>
\[\Lambda^* (x)= \sup_{\lambda \in R^d} ( \langle \lambda, x\rangle - \Lambda (\lambda) )\]
<p>In particular in 1-D, we have</p>
\[\lim_{n \to \infty} 1/n \log P(S_n \geq C) = -\inf_{x\geq C} \Lambda^*(x)\]
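<p>As a sanity check, the Fenchel–Legendre transform can be approximated by a maximum over a grid of $\lambda$ values. For a standard normal, $\Lambda(\lambda)=\lambda^2/2$ and hence $\Lambda^*(x)=x^2/2$; the helper below is a rough numerical sketch of the transform, not an efficient implementation:</p>

```python
def legendre_transform(log_mgf, x, lam_grid):
    # Fenchel-Legendre transform: sup over a grid of lambda of (lambda*x - Lambda(lambda))
    return max(lam * x - log_mgf(lam) for lam in lam_grid)

log_mgf_normal = lambda lam: lam * lam / 2.0       # Lambda(lambda) for N(0,1)
lam_grid = [i / 100.0 for i in range(-1000, 1001)]
val = legendre_transform(log_mgf_normal, 1.5, lam_grid)
print(val)  # 1.125, i.e. Lambda*(1.5) = 1.5**2/2
```

The supremum is attained at $\lambda = x$, which the grid happens to contain exactly here.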
<p>Cramér’s Theorem can also be extended to weakly dependent data such as Markov chains, as well as to martingales.</p>
<p>For example, for real valued iid $X_1, \dots, X_n$ and a function $Z_n= g_n (X_1, \dots, X_n)$ satisfying the bounded-difference condition
$\vert g_n(x_1, \dots, x_{k}, \dots, x_n) - g_n(x_1, \dots, x_{k}', \dots, x_n)\vert \leq 1$ in each coordinate, an Azuma–Hoeffding type <strong>concentration inequality</strong> gives</p>
\[1/n \log P(1/n (Z_n - E Z_n) \geq C ) \leq - H\Big(\frac{C+1}{2} \,\Big\vert\, \frac{1}{2}\Big)\]
<p>where $H(q \vert p)$ denotes the KL divergence between Bernoulli($q$) and Bernoulli($p$).</p>
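<p>A numerical sanity check of this bound, with the assumed example $Z_n = \#\{i: X_i > 0\}$ for symmetric $X_i$ (changing one coordinate changes $Z_n$ by at most 1), where the exact binomial tail can be computed directly:</p>

```python
import math

def kl_bern(q, p):
    # H(q | p): KL divergence between Bernoulli(q) and Bernoulli(p)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def log_prob_dev(n, C):
    # exact log P((Z_n - E Z_n)/n >= C) for Z_n = #{i: X_i > 0}, symmetric X_i,
    # i.e. log P(Bin(n, 1/2) >= n(1/2 + C)), by log-sum-exp over the tail
    k0 = math.ceil(n * (0.5 + C))
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            - n * math.log(2) for k in range(k0, n + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

n, C = 500, 0.2
lhs = log_prob_dev(n, C) / n
bound = -kl_bern((C + 1) / 2, 0.5)
print(lhs, bound)  # lhs sits below the bound
```

The true exponent for this example is the steeper Sanov rate $H(1/2 + C \vert 1/2)$, so the bound holds with room to spare.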
<p>Finally we can extend the result beyond $R^d$:</p>
<h3 id="craméres-theorem-for-abstract-empirical-measure">Cramér’s Theorem for abstract empirical measures</h3>
<p>We assume $\mu_n$ is the law of $S_n= \frac{1}{n} \sum_{i=1}^n X_i$ on a locally convex, Hausdorff, topological real vector space $\mathcal{X}$, such that there exists a Polish space $\Xi \subset \mathcal{X}$ with $\mu(\Xi)=1$. Then $\mu_n$ satisfies the LDP in both $\Xi$ and $\mathcal{X}$ with rate function $\Lambda^*$.</p>
<h2 id="transformation-of-ldps">Transformation of LDPs</h2>
<h3 id="contraction-inequality-of-a-mapping">Contraction principle of a mapping</h3>
<p>Let $\mathcal{X}$ and $\mathcal{Y}$ be two Hausdorff topological spaces and $f: \mathcal{X} \to \mathcal{Y}$ a continuous map. If ${\mu_\epsilon}$ satisfy the LDP with rate function $I$, then ${\mu_{\epsilon} f^{-1}}$ satisfies the LDP with rate function</p>
\[I'(y) = \inf\{I(x): y=f(x)\}.\]
<h3 id="ldp-from-exponential-approximation">LDP from exponential approximation</h3>
<p>Assume two random variables ${Z_\epsilon}$ and ${Z_\epsilon'}$ with joint law $P_{\epsilon}$ have marginal probability measures ${\mu_\epsilon}$ and ${\mu_\epsilon'}$ on a metric space $(\mathcal{Y}, d)$. These two families of probability measures are exponentially equivalent if</p>
\[\limsup_{\epsilon \to 0} \epsilon \log P_{\epsilon}( \Gamma_{\delta})= -\infty\]
<p>for every $\delta>0$, where $\Gamma_{\delta}= \{ (y, \tilde y ): d(y, \tilde y )> \delta \}$.</p>
<p>Then the same LDP holds for ${\mu_\epsilon}$ and ${\mu_\epsilon’}$.</p>
<p>In practice we often approximate a distribution by a sequence of simplified distributions.</p>
<h3 id="laplace-approximation-varadhans-integral">Laplace approximation: Varadhan’s Integral</h3>
<p>In the normal case of Cramér’s Theorem, $I(x)=x^2 / 2\sigma^2$. Is $I(x)$ more relevant than the inverse variance in some generalized Laplace approximation, especially when the variance is not even defined?</p>
<p>First, suppose $\mu_{\epsilon}$ is on $R$, and assume the LDP: $\epsilon \log \mu_{\epsilon} (X<x_0 )= - I(x_0)$; taking the derivative with respect to $x_0$, we heuristically have</p>
\[\frac {d\mu_{\epsilon}} {dx} \approx \exp(-\frac{I(x)}{\epsilon}).\]
<p>For any $\phi(x)$, we Taylor-expand at $\bar x = \arg\max_x \{\phi(x)- I(x)\}$, where the first derivative vanishes,
and we have</p>
\[\phi(x)- I(x) = \phi(\bar x)- I(\bar x) + \frac{(x-\bar x)^2}{2} \frac{d^2}{dx^2} (\phi( x)- I( x))\Big\vert_{x=\xi}\]
<p>Hence we compute the integral by</p>
\[\epsilon \log \int_R \exp (\phi(x)/\epsilon) d \mu_{\epsilon} \approx (\phi(\bar x)- I(\bar x))\]
<p>Now, in a general space, suppose ${\mu_{\epsilon}}$ satisfies the LDP with rate function $I()$ on a space $\mathcal{X}$, and assume $\phi: \mathcal{X} \to R$ is a continuous function. If, further, either the tail condition</p>
\[\lim_{M\to \infty} \limsup_{\epsilon\to 0}\epsilon \log E [\exp(\frac{\phi(Z_\epsilon)}{\epsilon}) 1_{\{\phi(Z_\epsilon)\geq M \} } ] = -\infty\]
<p>or for some $\gamma>1$ holds</p>
\[\limsup_{\epsilon\to 0}\epsilon \log E [\exp(\frac{\gamma\phi(Z_\epsilon)}{\epsilon})] < \infty,\]
<p>then
\(\lim_{\epsilon\to 0}\epsilon \log E [\exp(\frac{\phi(Z_\epsilon)}{\epsilon})] = \sup_{x \in \mathcal{X}} (\phi(x)- I(x))\)</p>
<p>Varadhan’s Integral can often be used to approximate the normalization constant.</p>
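<p>A quick numerical illustration (my own sketch): take $\mu_\epsilon = N(0, \epsilon)$, whose rate function is $I(x)=x^2/2$, and $\phi(x)=x-x^2$, so that $\sup_x (\phi(x)-I(x)) = 1/6$, attained at $x=1/3$. A log-sum-exp quadrature reproduces this limit:</p>

```python
import math

def varadhan_lhs(eps, phi, lo=-5.0, hi=5.0, m=20001):
    # eps * log \int exp(phi(x)/eps) dmu_eps for mu_eps = N(0, eps),
    # computed by log-sum-exp quadrature on a uniform grid
    h = (hi - lo) / (m - 1)
    logs = [phi(lo + i * h) / eps - (lo + i * h) ** 2 / (2 * eps)
            - 0.5 * math.log(2 * math.pi * eps) for i in range(m)]
    mx = max(logs)
    return eps * (mx + math.log(sum(math.exp(l - mx) for l in logs) * h))

phi = lambda x: x - x * x
# rate function of N(0, eps) is I(x) = x^2/2, and sup_x {phi(x) - I(x)} = 1/6
for eps in [0.1, 0.01, 0.001]:
    print(eps, varadhan_lhs(eps, phi))  # tends to 1/6 as eps -> 0
```
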
<p>Varadhan’s Integral generalizes the MGF to general nonlinear functions. We now consider the inverse problem.</p>
<p>Define $\Gamma_{f}= \lim_{\epsilon \to 0} \epsilon \log \int_x \exp(f(x)/\epsilon) d\mu_{\epsilon} $.</p>
<p><strong>Bryc inverse lemma:</strong> Suppose $\mu_{\epsilon}$ are exponentially tight and $\Gamma_{f}$ exists for every continuous and bounded $f \in C_b(\mathcal{X})$. Then $\mu_{\epsilon}$ has the good rate function</p>
\[I(x)= \sup_{f \in C_b(\mathcal{X})} (f(x)-\Gamma_{f} )\]
<p>and dually</p>
\[\Gamma_{f} = \sup_{x \in \mathcal{X}} (f(x)- I(x) ).\]
<p>We may restrict $ C_b(\mathcal{X})$ to only linear functionals if $\mathcal{X}$ is a topological vector space.</p>
<h2 id="sanovs-theorem-for-empirical-measures">Sanov’s Theorem for empirical measures</h2>
<p>It is the LLN for the empirical mean of iid samples that motivates Cramér’s Theorem. Likewise, the empirical measure usually converges to the actual distribution, and Sanov’s Theorem answers how quickly.</p>
<p>Consider $\Sigma$-valued iid random variables $Y_1, \dots, Y_n$, where $\Sigma$ is a Polish space. Each $Y_i$ has probability measure $\mu \in M_1(\Sigma)$, where $M_1(\Sigma)$ is the space of all probability measures on $\Sigma$. We may estimate $\mu$ empirically by</p>
\[L_n = 1/n \sum_{i=1}^n\delta(y=Y_i)\]
<p>$L_n$ is also viewed as an element of $M_1(\Sigma)$.</p>
<p>We equip $M_1(\Sigma)$ with the weak topology (generated by the open balls $\{\nu: \vert\int \phi d\nu - x \vert < \delta\}$ for all bounded continuous $\phi$); $M_1(\Sigma)$ is then a Polish space under the Lévy–Prokhorov metric.</p>
<p>By abstract Cramér’s Theorem in Polish space (where we replace $X_i \in R$ by $\delta(Y_i) \in M_1(\Sigma)$), we know $L_n$ has LDP in $M_1(\Sigma)$ with convex rate function</p>
\[\Gamma^*(\nu)= \sup_{\phi\in C_b(\Sigma)}\{ \langle \phi, \nu \rangle - \Gamma(\phi) \}\]
<p>where $\Gamma(\phi)= \log E[\exp(\langle \phi, \delta(Y) \rangle )] = \log \int_\Sigma \exp(\phi) d\mu$</p>
<p>Such rate function is difficult to compute, but Sanov’s Theorem says</p>
\[\Gamma^*(\nu) = KL( \nu, \mu )= \int_\Sigma \log \frac{d \nu}{d \mu} d\nu\]
<p>Loosely speaking, for a closed set $\Gamma\subset M_1(\Sigma)$,
\(\lim_{n \to \infty} 1/n \log P( L_n \in \Gamma) \approx -\inf_{\nu\in \Gamma} KL (\nu, \mu).\)</p>
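<p>Sanov’s rate can be checked exactly in the simplest case (my own toy example): for a fair coin, $P(L_n(\mathrm{heads}) \geq 0.7)$ decays at rate $KL(\mathrm{Bern}(0.7), \mathrm{Bern}(0.5))$, and the exact binomial tail makes this visible:</p>

```python
import math

def kl_bernoulli(q, p):
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def log_tail(n, p, c):
    # exact log P(Bin(n, p)/n >= c), computed by log-sum-exp over the upper tail
    k0 = math.ceil(n * c)
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p) for k in range(k0, n + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

n = 2000
rate = -log_tail(n, 0.5, 0.7) / n
print(rate, kl_bernoulli(0.7, 0.5))  # rate approaches KL(0.7, 0.5) ~ 0.0823
```

The Chernoff bound $P \leq \exp(-n\, KL)$ holds exactly, so the empirical rate approaches the KL limit from above.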
<h2 id="sanovs-theorem-for-stationary-gaussian-processes">Sanov’s Theorem for Stationary Gaussian Processes</h2>
<p>Now the data is a sequence of stationary Gaussian process: ${X_k}$ ($-\infty < k < \infty$). We define the probability space $\Omega = \prod_{j = -\infty}^{\infty} \mathbb R_j$. $\omega = {x_j} \in \Omega$ with $\omega(j) = x_j$ and $P$ is that stationary Gaussian process probability measure on $\Omega$ induced by ${X_k}$. It has mean $\mathbb E [X_k] = 0$ and covariance: $\mathbb E[X_0 X_j] = \rho_j$.</p>
<p>How quickly does the empirical measure converge? Indeed, we can still find an LDP for it. The main result is from <a href="https://projecteuclid.org/euclid.cmp/1103941986">Donsker and Varadhan, 1986</a>.</p>
<p>Bochner’s Theorem says we can decompose the covariance by frequency: $Cov(X_0, X_j)=\rho_j = \frac{1}{2\pi} \int_0^{2\pi} e^{i j \theta} f(\theta) d\theta$, where we call $f(\theta)$ the <em>spectral density</em>. It is continuous on $[0, 2\pi]$ with $f(0) = f(2\pi)$.</p>
<p>Let $T$ be the shift operator on $\Omega$, i.e., $T(\omega) (j) = x_{j+1}$. We construct $\omega^{(n)}$ from $\omega$ by periodization:</p>
\[\dots, x_1, \dots, x_n, x_1,\dots,x_n, x_1, \dots, x_n, \dots\]
<p>This defines a map $\pi_n$ from $\Omega$ to $M_{s}$:</p>
\[\pi_n(\omega) := \frac{1}{n}(\delta_{\omega^{(n)}} + \delta_{T\omega^{(n)}} + \ldots + \delta_{T^{n-1}\omega^{(n)}}).\]
<p>$M_{s}$ is the space of all stationary measures on $\Omega$, and
$Q_n = P \circ \pi_n^{-1}$ is the probability measure on $M_{s}$ induced by $\pi_n$:
$Q_n (A)= P(\omega: \pi_n(\omega)\in A).$</p>
<p>Then $Q_n$ satisfies the LDP with a good rate function $H_f( R )$, which is effectively the relative entropy of the stationary process $R$ with respect to the stationary Gaussian process $(X_{k})_{k=-\infty}^{\infty}$.</p>
<p>For $R \in M_{s}$ and $A \subset \mathbb R$, we let $R(A\vert \omega) = R(X_{0} \in A \vert X_{-1}, X_{-2}, \dots)$ be the regular conditional probability distribution of $X_{0}$ given the entire past, and denote by $r(y\vert \omega)$ the corresponding density. This gives the explicit form of the rate:
\(H_{f}( R ) = \mathbb E^R \left\{ \int_{-\infty}^{\infty} r(y\vert \omega) \log r(y\vert \omega) dy\right\} + \frac{1}{2} \log 2\pi
+ \frac{1}{4\pi} \int_0^{2\pi} \frac{dG(\theta)}{f(\theta)}
+ \frac{1}{4\pi} \int_0^{2\pi} \log f(\theta) d\theta.\)</p>
<h3 id="sketch-the-proof">Sketch of the proof</h3>
<p>By Fourier expansion, we write
$\sqrt{f(\theta)} = \sum_{n = -\infty}^{\infty} a_n e^{in\theta}.$
Let $(\xi_k)$ be a sequence of independent Gaussian random variables with mean 0 and variance 1. Then, by Parseval’s theorem, $(X_k)_{k=-\infty}^{\infty}$ defined by</p>
\[X_k = \sum_{n = -\infty}^{\infty} a_{n-k} \xi_n = \sum_{n = -\infty}^{\infty} a_n \xi_{n+k}\]
<p>is a stationary Gaussian process with mean 0 and covariance $ E[X_{0} X_{j}] = \rho_{j} = \frac{1}{2\pi} \int_{0}^{2\pi} e^{ij \theta} f(\theta) d\theta .$</p>
<p>Let $b_j = a_j \left(1 - \frac{\vert j\vert}{N} \right)$ for $\vert j\vert < N$. For each positive integer $N$, define a new process $(X_k^N)_{k=-\infty}^{\infty}$ by
\(X_k^N = \sum_{\vert j\vert<N} b_{j} \xi_{j+k} = \sum_{\vert j\vert <N} a_j \left(1 - \frac{\vert j\vert }{N} \right) \xi_{j+k},\)</p>
<p>where $X_k^{N}$ is the Cesaro mean of the partial sums of $X_{k}$, i.e.,</p>
\[X_{k}^{N} = \frac{1}{N} \sum_{i = 0}^{N-1} \sum_{j=-i}^{i} a_{j} \xi_{j+k} = \sum_{\vert j\vert <N} a_j \left(1 - \frac{\vert j\vert }{N} \right) \xi_{j+k}.\]
<p>We define $F: \Omega \to \Omega$ by</p>
\[(F(\omega))(j) = \sum_{k=-\infty}^{\infty} a_{k} x_{j+k}.\]
<p>$F$ maps $(\xi_{k}) $ to $(X_{k})$.</p>
<p>We define $F_N: \Omega \to \Omega$ such that</p>
\[(F_N(\omega)) (j) = \sum_{\vert k\vert < N} b_k x_{j+k}.\]
<p>The mapping $F_{N}$ induces a corresponding map $\tilde F_N: M_s \to M_s$.</p>
<p>Let $\mu$ be the measure on $\Omega$ induced by $(\xi_k)$. Define $Q_n$ on $M_{s}$ by $Q_n(A) := \mu\{\omega: \pi_n \cdot F(\omega) \in A\}$.
Define $Q_n^N$ on $M_{s}$ by $Q_n^N(A) := \mu\{\omega: \pi_n \cdot F_N(\omega) \in A\}$.
Define $\tilde Q_n^N$ on $M_{s}$ by</p>
\[\tilde Q_n^N(A) := \mu\{\omega: \tilde F_N \cdot \pi_n (\omega) \in A\}.\]
<p>Recall that $\pi_n(\omega) = \frac{1}{n}(\delta_{\omega^{(n)}} + \delta_{T\omega^{(n)}} + \ldots + \delta_{T^{n-1}\omega^{(n)}})$.
Let \(\tilde F_{N} \cdot \pi_n(\omega):= \frac{1}{n}[\delta_{F_N(\omega^{(n)})} + \delta_{F_N(T\omega^{(n)})} + \ldots + \delta_{F_N(T^{n-1}\omega^{(n)})}]\)
and \(\pi_n \cdot F_{N} (\omega):= \frac{1}{n}[\delta_{(F_N(\omega))^{(n)}} + \delta_{T(F_N(\omega))^{(n)}} + \ldots + \delta_{T^{n-1}(F_N(\omega))^{(n)}}].\)</p>
<p>We apply Donsker’s theorem and obtain the LDP for $\tilde F_{N} \cdot \pi_n$.</p>
<p>The total variation gap $\vert \vert \tilde F_{N} \cdot \pi_n - \pi_n \cdot F_{N} \vert \vert_{TV}$ is $o(1)$, which further bounds the Lévy metric between them:</p>
\[d(\tilde F_{N} \cdot \pi_n , \pi_n F_{N} ) = o(1).\]
<p>So they are <em>exponentially equivalent</em>.
This leads to the LDP for $Q_n^N$ via a triangle inequality.</p>
<p>Likewise, we claim that $Q_n^N$ is an <em>exponential approximation</em> of $Q_n$, using a triangle inequality and the contraction theorem, and therefore the LDP for $Q_n$ follows.</p>
<p>P.S. Gonzalo Mena reminds me the connection of Sanov’s Theorem and exp-family, which I just learned from these <a href="https://www.tau.ac.il/~tsirel/Courses/LargeModerateDev/lect2.pdf">lecture notes</a> by Tsirelson.</p>
<h2 id="any-large-deviation-is-done-in-the-least-unlikely-of-all-the-unlikely-ways">“Any large deviation is done in the least unlikely of all the unlikely ways!”</h2>
<p>For any measure $\mu$ on $\mathcal{X}$ and a function $u$: $\mathcal{X} \to R$ we can define a tilted measure</p>
\[\mu_u (x) \propto \mu (x) \exp ( u(x) )\]
<p>We can prove</p>
\[\mu_u = \arg\min_{\nu :\int u d\nu \geq \int u d \mu_u} KL(\nu, \mu)\]
<p>Further, if $\mathcal{X}=\{1, \dots, d\}$, we endow $\mathcal{X}^n$ with the probability measure $\mu_n$ and count the frequency $\eta_n(j) = 1/n \sum_{k=1}^n 1_{\{x_k=j\}}$
for each realization of $X$.</p>
<p>Now, conditioning on the event $E_n=\{\int u d\eta_n \geq c\}$, the random measure $\eta_n$ converges in probability to the tilted measure $\mu_{tu}$, where $t > 0$ is such that $\int u d\mu_{tu} = c$.</p>
<p>This is because</p>
\[\mu_{tu} = \arg\min_{\nu :\int u d\nu \geq c} KL(\nu, \mu)\]
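<p>This variational characterization can be checked numerically on a four-point space (the base measure, $u$, and the threshold below are made-up illustrative values): among measures satisfying the constraint, the exponentially tilted one attains the smallest KL divergence to $\mu$.</p>

```python
import math

def tilt(mu, u, t):
    # exponentially tilted measure: mu_t(x) proportional to mu(x) * exp(t * u(x))
    w = [m * math.exp(t * ui) for m, ui in zip(mu, u)]
    z = sum(w)
    return [wi / z for wi in w]

def kl(nu, mu):
    return sum(a * math.log(a / b) for a, b in zip(nu, mu) if a > 0)

mu = [0.25, 0.25, 0.25, 0.25]   # base measure on four points
u  = [0.0, 1.0, 2.0, 3.0]
c  = 2.0

# find t > 0 with E_{tilted}[u] = c by bisection (the tilted mean increases in t)
lo, hi = 0.0, 10.0
for _ in range(200):
    t = (lo + hi) / 2
    mean = sum(p * ui for p, ui in zip(tilt(mu, u, t), u))
    lo, hi = (t, hi) if mean < c else (lo, t)
nu_t = tilt(mu, u, t)

alt = [0.1, 0.1, 0.5, 0.3]      # another measure with the same constraint value 2.0
print(kl(nu_t, mu), kl(alt, mu))  # the tilted measure has the smaller KL
```
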
<p>Generally, consider the set</p>
\[E_c=\{\int u d\nu \geq c\} \subset M_1(\mathcal{X}),\]
<p>for which we know</p>
\[1/n \log P(L_n \in E_c ) \approx -KL(\mu_{tu}, \mu)\]
<p>Therefore, for $B_\delta$ a small neighborhood of $\mu_{tu}$ in $M_1(\mathcal{X})$, \(Q_n(B_\delta \vert E_c ) = 1- Q_n(E_c \setminus B_\delta)/ Q_n(E_c),\)</p>
<p>and as long as $\mu_{tu}$ is the unique minimizer of $\arg\min_{\nu :\int u d\nu \geq c} KL(\nu, \mu)$, we can conclude \(Q_n (B_\delta \vert E_c ) \to 1,\)
where $Q_n (A) = P(L_n \in A)$ is the induced measure on $M_1(\mathcal{X})$.</p>
<p>Most results above are from the book <em>Large Deviations Techniques and Applications</em> (Dembo and Zeitouni, 2009), summarized from a seminar class on the Large Deviation Principle by Sumit.</p>
<h1 id="back-to-january-18-when-there-were-60-cases-globally">Back to January 18 when there were 60 cases globally</h1>
<p>On January 18 2020, the pre-pandemic era, when the stock market in both US and China was still busy celebrating their phase-one trade deal, I saw this news in the China section of <a href="https://www.bbc.com/news/health-51148303">BBC</a> which covered a story of a brand new yet underemphasized virus in China:</p>
<blockquote>
<p>There have been more than 60 confirmed cases of the new coronavirus (globally).</p>
</blockquote>
<p>BBC reported an estimation by <a href="http://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/news--wuhan-coronavirus/">ICL</a> that the number of cases were likely to be understated, in fact,</p>
<blockquote>
<p>experts estimate a figure nearer 1,700.</p>
</blockquote>
<p>Indeed, I remember I was quite horrified when I read their prediction:</p>
<blockquote>
<p>The virus ‘will have infected hundreds’.</p>
</blockquote>
<p>So I immediately checked their model, which was effectively a capture-recapture model based on 3 positive cases confirmed outside China. You could estimate the total number in Wuhan by dividing 3 by the probability that a patient would leave Wuhan for international travel, which can in turn be estimated by the total outbound traffic count divided by the city population. There were many simplifications, but it looked fine to me.</p>
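<p>The back-of-the-envelope calculation can be sketched in a few lines; the traffic and catchment numbers below are my assumptions of roughly the right magnitude, not ICL’s actual inputs:</p>

```python
# capture-recapture style estimate with hypothetical illustrative inputs
cases_detected_abroad = 3
daily_outbound_travelers = 3300      # assumed international passengers per day
catchment_population = 19_000_000    # assumed airport catchment population
detection_window_days = 10           # assumed days an infected case could travel

# probability that any one case travels abroad while detectable
p_travel = daily_outbound_travelers * detection_window_days / catchment_population
estimated_total = cases_detected_abroad / p_travel
print(round(estimated_total))  # on the order of 1,700 with these inputs
```
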
<p>I knew Jonathan Auerbach had used a clever negative binomial model for counting the number of rats in NYC, mathematically equivalent to this covid model. So I emailed Jonathan:</p>
<blockquote>
<p>The inference based on 3 positive cases seems not convincing, yet should I buy more Lysol now?</p>
</blockquote>
<p>As a statistician, this could have been a reasonable question to ask, as the whole estimation was driven by three data points. Given that we deal with data at the scale of millions in our daily research, gently making fun of it in a harmless email felt like the limit of a proportionate response to three data points.</p>
<p>Jon replied:</p>
<blockquote>
<p>You should blog about it! You can write a decision theory paper about how much lysol you should buy based on the data.</p>
</blockquote>
<p>Except I didn’t.</p>
<p>Except four months later, the whole world and 7 billion human beings were fundamentally changed, not only by three batches of viruses, but also by our ignorance and indifference in the early stage, to which I probably contributed too.</p>
<p>Except I did (unusually) purchase put for LQD on a monthly rolling basis starting from December.</p>
<p>Except that was for a completely different reason that I heard someone talking about the credit market risks with sky high public spending.</p>
<p>Except in retrospect, I don’t know what should we really learn from this tragedy. Would I be more alert next time when I heard the word coronavirus?</p>
<p>Sure, except such a response is overfitting.</p>
<p>Would I be alert next time by some analysis using three data points?</p>
<p>I doubt.</p>
<p>In a Bayesian updating regime, the posterior (for the future) would hardly change with overwhelming evidence drawn from everywhere else, even if those are the data we have drawn deliberately to make life sound fully promising and hopeful.</p>
<p>As statisticians, we are trained to make inference on any given dataset. But having collected all genes of all creatures in this universe, or monitored all satellite images of all shopping malls, does not eliminate the room for agnosticism.</p>
<h1 id="sample-sd-of-indirect-effects-in-a-multilevel-mediation-model">Sample sd of indirect effects in a multilevel mediation model</h1>
<p>M asked me a question which essentially looks like this: in a mediation model, $a$ and $b$ are the regression coefficients along the mediation path, and the final quantity of interest is therefore the product $ab$.
In a multilevel model, for each group $j$, we let both $a_j$ and $b_j$ vary by group, which we could model in Stan as a multivariate normal:</p>
\[(a_j,b_j)^T \sim MVN ((a_0, b_0)^T, \Sigma),\]
<p>According to literature xxx, we could estimate the expectation of $ab$ in a typical group by (the sample mean of the posterior draws of)</p>
\[a_0 b_0 + \sigma_{12},\]
<p>for $\sigma_{12}$ the off-diagonal element of $\Sigma$. Does it make sense to summarize the uncertainty by the sample sd of the draws above?</p>
<p>The answer is no. The formula really comes from a point-estimation context:</p>
\[E[ab]= E[a]E[b] + Cov(a,b) = a_0 b_0 + \sigma_{12}.\]
<p>The law of total variance says</p>
\[Var[ab]= E[ Var [a_jb_j \vert a_0, b_0, \Sigma ]] + Var[ E[a_jb_j \vert a_0, b_0, \Sigma ]].\]
<p>The sample deviation of the $a_0 b_0 + \sigma_{12}$ draws only amounts to the second term, and I don’t think it means anything by itself. The easiest way to solve the problem is to obtain posterior draws of the group-level indirect effects $a_jb_j$ directly.</p>
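<p>A small Monte Carlo check (my own sketch with made-up parameter values, writing $\sigma_{12}=Cov(a_j,b_j)$ for the off-diagonal element of $\Sigma$): drawing $(a_j, b_j)$ from the multivariate normal confirms $E[a_jb_j]=a_0b_0+\sigma_{12}$, and shows that the within-group spread of $a_jb_j$, the first term of the total-variance decomposition, is substantial:</p>

```python
import math
import random

random.seed(1)
a0, b0, s1, s2, rho = 0.5, 0.3, 1.0, 1.0, 0.6
sigma12 = rho * s1 * s2                 # off-diagonal element of Sigma

N = 200_000
prods = []
for _ in range(N):
    # draw (a_j, b_j) from MVN((a0, b0), Sigma) via a Cholesky-style construction
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    a = a0 + s1 * z1
    b = b0 + s2 * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
    prods.append(a * b)

mean_ab = sum(prods) / N
var_ab = sum((p - mean_ab) ** 2 for p in prods) / N
print(mean_ab, a0 * b0 + sigma12)  # both close to 0.75
print(var_ab)                      # the within-group spread of a_j * b_j
```
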