Jekyll 2020-10-22T02:11:34+00:00 http://www.yulingyao.com/blog/feed.xml Yuling Yao’s Blog: Bayesian Statistics, Machine Learning. Yuling Yao
Does MAP estimate always overfit? 2020-10-21T00:00:00+00:00 http://www.yulingyao.com/blog/2020/point <p>This is wrong. Indeed, it can be the opposite.</p> <p>Just to be clear, I am talking about a specific situation: we have a model and a dataset, such that we can write down the joint posterior density. To estimate the parameters, we can use</p> <ol> <li>posterior simulation draws from the joint posterior density, for which we run MCMC;</li> <li>the MAP estimate, i.e. the joint mode of the density, for which we run an optimizer such as SGD.</li> </ol> <p>Consider any hierarchical model of the form $$y_{ik}\sim \mathrm{N}(\mu_k, \sigma); ~~ \mu_k \sim \mathrm{N}(\mu_0, \tau); ~~ \tau \sim 1.$$</p> <p>The joint mode is achieved on the complete-pooling subspace $\tau=0, \mu_k = \mu_0$, because the density of $\mu_k$ given $\mu_0$ and $\tau$ grows without bound as $\tau \to 0$ with $\mu_k = \mu_0$. This complete-pooling model is always simpler than the hierarchical model in terms of the gap between training and test error. If anything, the MAP underfits.</p> <p>Of course, I believe there are other examples in which the MAP does overfit. The relation is just not definite. Is there any general characterization? I don’t know.</p> Yuling Yao
Why Bayesian models could have better predictions. 2020-09-28T00:00:00+00:00 http://www.yulingyao.com/blog/2020/predict <p>In a predictive paradigm, no one really cares how I obtain the estimate or the prediction. It can come from an MLE, a MAP estimate, risk minimization, or some Bayes procedure. Also, when we talk about a predictive paradigm, a big selling point of Bayesian procedures, that they offer confidence intervals for free, is largely undermined. A hedge fund manager does not really care about the p-value of strategy A outperforming strategy B. 
Even if the evidence is weak, say the winning effect is N(0.1, 1), decision theory says he should adopt the empirical winner.</p> <p>However, even with this black-box prediction attitude, a Bayesian model analysis is still useful, for the following reasons.</p> <ol> <li> <p>Prior = regularization = robustness. I agree this argument is lame, as everyone is using regularization anyway.</p> </li> <li> <p>The posterior distribution is simply more expressive than a MAP point estimate. Sure, from the probabilistic prediction point of view, an even more general procedure is to optimize over all probability distributions subject to the decision problem. But the space of all probability distributions is too big, and Bayesian inference gives a coherent approximation to this infeasible goal. One warning: the Bernstein-von Mises theorem says that “in the long run, we are all dead”, because the Bayes posterior has to converge to a point estimate anyway. But that actually means the model is not given or fixed.</p> </li> <li> <p>It is easier to work with Bayesian models on hierarchical data. Designing a good hierarchical model that reflects the data structure is pretty much building a customized NN architecture, except you do not publish every new model in NIPS.</p> </li> <li> <p>Bayesian procedures are computationally harder, but once we obtain the final posterior simulation draws, it becomes easier to run other post-processing and adversarial training. For example, BMA gives a coherent model ensemble (which in general is not feasible with MAP estimates). Simulated tempering is a more sophisticated version of NN distillation.</p> </li> <li> <p>The ability to simulate “generated quantities” for free. When we model the stock price by a linear regression with a normal model, we have to use a normal approximation for the option. But with MCMC draws, the probability of hitting the option’s strike price can be calculated directly. 
The same blessing applies to intervals and quantiles.</p> </li> </ol> <p>To be sure, there is NO guarantee that a Bayesian posterior must produce a better prediction; we do not even expect it to be so for an arbitrary model. The model-has-been-fit-so-we-are-done era is gone.</p> <p>Unfortunately, points 3-5 above are pretty much unaddressed in “Bayesian learning”. Most theories are based on exchangeable data; minimal post-processing is done; customized prediction is not really a thing. There is much room for development.</p> Yuling Yao
Some difficulty of fitting splines on latent variables 2020-09-16T00:00:00+00:00 http://www.yulingyao.com/blog/2020/spline <p>By spline I mean B-spline, and by B-spline I mean using a piecewise polynomial basis in regression. It might also be called kernel regression.</p> <p>I think there are three distinct uses of splines:</p> <ol> <li> <p>Interpolation: we force the spline to pass through the given (x, y) points, and make predictions for new x.</p> </li> <li> <p>Smoothing: instead of exact interpolation, we allow some fitting error, but we also penalize the roughness of the fitted curve (the integral of the squared 2nd-order derivative in 1D, or of the squared Laplacian in 2D).</p> </li> <li> <p>Filtering. 
It functions like 2, in that we regress the observations on the basis function matrix, but instead of viewing the penalization term as a bias-variance tradeoff, we shift the goal to finding a latent variable surface. The bias in 2 is now interpreted as observation noise, and the latent surface is now the true value.</p> </li> </ol> <p>Now, in certain models, we model some latent variables by splines. For example, the observations $y_{1i}, y_{2i}, i=1, \dots, n$ are modeled by two Gaussian processes $$y_{1i}\sim \mathrm{gp} ({f_{1i}}, \sigma), \quad y_{2i}\sim \mathrm{gp} (f_{2i}, \sigma)$$ But the actual goal is to model the autoregressive dynamic $$f_{2i} \sim \mathrm{~some~function~of~} f_{1i}$$</p> <p>It is actually straightforward to do that in Stan. We just model the autoregression using splines.</p> <p>However, there are still some challenges here:</p> <ol> <li>For observed values, the spline basis matrix $B_j(x_i)$ can be pre-computed and stored. The basis function becomes a feature extraction step, and the fit is as simple as a regression. But for latent modeling, the basis matrix $B_j(f_i)$ has to be re-evaluated for every new value of the parameter $f_i$. That is a lot of evaluation if either the number of basis functions or the sample size is big.</li> <li>The usual B-spline regression can be GREATLY improved in efficiency by using a <em>sparse matrix</em>. In 1D, suppose the knots are the data quantiles at probabilities $1/K, \dots, (K-1)/K$; then only a fraction $O(1/K)$ of the entries of $B_j(x_i)$ are non-zero. In 2D, with a tensor product basis, that fraction is $O(1/K^2)$. The B matrix is almost always sparse for latent variable splines too. But then how can we exploit sparsity when the matrix changes at every iteration? I mean, we can still convert B to a sparse matrix, but we have to convert it in every iteration.</li> <li>The most horrible part is that the spline is zero outside its boundary knots, but we do not really know the boundary of the latent variables because they are latent. That is bad. 
If the sampler or the optimizer enters a region where the latent variables $\{f_i\}$ are inferred to spread wider than the knots, which are often fixed in advance, the gradient with respect to the coefficients multiplying the B matrix becomes zero. The fit either gets stuck in an optimizer, or becomes a random walk in HMC.</li> <li>A careful analysis shows that the region where the above gradients are non-zero is indeed a very thin slice of the joint space. One solution is to trim the space to match it: put a soft constraint on the span of the latent variables. I think in general B-splines are sensitive to the boundary knots. The unknown support amplifies such sensitivity.</li> </ol> Yuling Yao
Gaussian process regressions having opinions or speculation. 2020-05-19T00:00:00+00:00 http://www.yulingyao.com/blog/2020/gp <p>I occasionally read Howard Marks’s memos, and in my recent infrequent visits I have repeatedly seen him cite Marc Lipsitch, Professor of Epidemiology at Harvard, to the effect that (in Lipsitch’s covid research and in Marks’s money making) there are:</p> <ol> <li>facts,</li> <li>informed extrapolations from analogies to other viruses and</li> <li>opinion or speculation.</li> </ol> <p>That is right. Statisticians need some stationarity and smoothness assumptions in order to learn from data, and thereby always put ourselves at risk of over-extrapolation.</p> <p>In machine learning, novelty detection and out-of-distribution uncertainty used to be a hotspot, especially given the connection to AI safety, and I have followed papers in this area for a while. 
(I think it is still a hotspot, but I don’t know for sure. Indeed, if someone tells you that he is completely sure about the present or the future, he is completely extrapolating. But anyway, it is fair to assume that the heat of an area does not vanish overnight, so maybe it is still at least a warm spot.)</p> <p>In part, many deep models ignore parameter uncertainty and are overconfident. But I feel there is a dangerous tendency for people to treat some non-parametric Bayesian model as an always-right-but-hard-to-fit model, as if we would never worry about novelty detection and out-of-sample uncertainty if only we knew how to fit a Gaussian process with 10^10 points.</p> <p>But a gp is not immune to extrapolation. Here I generate two-dimensional data (x, y) with x supported only near 1 and 3 (drawn from the mixture 0.5 N(1, 0.01) + 0.5 N(3, 0.01)). I can still fit a gp, and it does return results that fit the data on their support.</p> <p><img src="/blog/images/2020/gp_demo.png" alt="gp" title="gp" /></p> <p>But wait, why is it so sure about what happens in between? There is zero data in the middle! How could you know that f(x) at 2 is identically 0, instead of -20004, or 343583? The model is completely extrapolating.</p> <p>You could probably guess that the fitted length scale is very big, indeed longer than the span of x, so the gp effectively becomes a linear regression. It is not wrong; a linear regression can be useful too.</p> <p>Even worse, such over-confidence is self-confirming. Wikipedia says “the length scale tells how safe it is extrapolate outside the data span”. That is wrong. It does not tell us how safe it is. The inference in the no-data zone comes from the prior, which is a mean-zero gp, and it is very dangerous, arrogant and reckless to treat that as always the true model and the right uncertainty we are looking for.</p> <p>Ideally we want a model or inference procedure that goes on strike and yells at the user when it perceives that it is making an extrapolation not backed by data. 
Of course anything beyond the data comes from the prior, and the prior is just more data, so technically it is as kosher to estimate the posterior outside the data domain by a crazy gp prior as to estimate the empirical density by a delta function. These are two extremes on the spectrum of how we weigh the relative reliance on prior and data. If you do not yell at the empirical process, why should the gp yell at you?</p> <p>It is not entirely sane to end this post on a question to which I do not know the answer. But the main message is clear: a gp is not always right. A gp can over-extrapolate as much as a linear regression. And in many cases we do not know whether the gp we are running is over-extrapolating or not.</p> Yuling Yao
A very short introduction on the large deviation principle 2020-05-06T00:00:00+00:00 http://www.yulingyao.com/blog/2020/introLDP <p>I took this seminar class on the Large Deviation Principle (LDP) by Sumit. Below I summarize some results that I personally find most relevant (to what I am doing now). Most results are from the book Large Deviations Techniques and Applications (Dembo and Zeitouni, 2009).</p> <h2 id="from-law-of-large-numbers-to-the-large-deviation-principle">From Law of Large Numbers To The Large Deviation Principle</h2> <p>Given a family of probability measures $\{\mu_{\epsilon}\}$ on a space $(\mathcal{X}, \mathcal{B})$, beyond the limiting measure (for example $\mu_{\epsilon} (\Gamma) \to 0$), we may also be interested in how quickly such convergence happens. 
The large deviation principle describes the limiting rate of such a sequence, where the rate is characterized by a lower-semicontinuous mapping $I$ from $\mathcal{X}$ to $[0, \infty]$, which we call a <em>rate function</em>.</p> <p><strong>Definition</strong>: $\mu_{\epsilon}$ satisfies the large deviation principle with a rate function $I$ if, for every set $\Gamma\in \mathcal{B}$ with interior $\Gamma^0$ and closure $\bar \Gamma$,</p> $-\inf_{x\in \Gamma^0} I(x) \leq \liminf_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (\Gamma) \leq \limsup_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (\Gamma) \leq -\inf_{x\in \bar \Gamma} I(x)$ <p>Consider a concrete example: if $S_n$ is the sample average of iid standard Gaussian random variables $X_1, \dots, X_n$, we know $\sqrt{n} S_n \sim N(0, 1)$. Indeed, as long as the CLT holds, we know $P(\vert S_n\vert \geq \delta) \approx P(\vert N(0,1)\vert \geq \delta\sqrt n )$, which goes to 0 for any $\delta&gt;0$. However, for this toy case, we can replace the approximation by an identity, and it leads to</p> $1/n \log P(\vert S_n\vert \geq \delta) \to -\delta^2/2.$ <p>In general this precise rate is way beyond what a CLT can describe. A motivating example I have in mind is <em>importance sampling</em>: we draw $x_i$ from a proposal distribution $q$, and we estimate $E_p h(x)$ by $S_n=1/n \sum_{i=1}^n h(x_i)r(x_i)$ with $r=p/q$, followed by self-normalization. We do know $S_n \to E_p h(x)$, but how fast? How can we characterize the probability that a large estimation error happens, $P(\vert S_n- E_p h(x) \vert \geq \delta)$? Indeed, even if $r$ has a finite second moment and the CLT holds, such a large deviation probability still depends on the distributions of both $r$ and $h$.</p> <p>Another practical situation that I recently considered is sequential design/active learning. For example, in a clinical trial we may adaptively sample until an interim decision boundary is reached (say some “p value” is “significant”). 
Aside from designing the hypothesis test, we can use $P(\vert S_n\vert \geq \delta)$ to compute the expected stopping time.</p> <p>For the purpose of many proofs, we present an equivalent definition (equivalent when $\mathcal{B}$ contains the Borel sigma-field of $\mathcal{X}$):</p> <p>$\mu_{\epsilon}$ satisfies the large deviation principle with a rate function $I(\cdot)$, if</p> <ol> <li> <p>For every closed set $F \subset \mathcal{X}$, $$\limsup_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (F) \leq -\inf_{x\in F} I(x).$$</p> </li> <li> <p>For every open set $G \subset \mathcal{X}$, $$-\inf_{x\in G} I(x) \leq \liminf_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (G) .$$</p> </li> </ol> <h2 id="empirical-average-of-iid-samples--cramérs-theorem">Empirical average of IID samples: Cramér’s Theorem</h2> <p>If we draw $X_1, \dots, X_n$ iid from a $d$-dimensional real-valued distribution $\mu$ and compute the empirical average $S_n=1/n \sum_{i=1}^n X_i$, of course we know $S_n\to E[X]$. The question is how quickly.</p> <p><strong>Cramér’s Theorem</strong> states that the law of $S_n$, denoted by $\mu_n$, satisfies the LDP with a convex rate function $\Lambda^*(\cdot)$.</p> <p>To define $\Lambda^*$, we first define the log moment generating function</p> $\Lambda (\lambda)=\log \operatorname {E} [\exp(\langle\lambda, X\rangle)].$ <p>where $\langle \cdot, \cdot \rangle$ is the inner product.</p> <p>The desired rate function $\Lambda^*$ is its Fenchel-Legendre transform (the maximal difference between a sum and a log-sum-exp):</p> $\Lambda^* (x)= \sup_{\lambda \in R^d} ( \langle \lambda, x\rangle - \Lambda (\lambda) )$ <p>In particular, in 1D we have</p> $\lim_{n \to \infty} 1/n \log P(S_n \geq C) = -\inf_{x\geq C} \Lambda^*(x)$ <p>Cramér’s Theorem can also be extended to weak dependence such as Markov chains, as well as to martingales.</p> <p>For example, for real-valued iid $X_1, \dots, X_n$ and a function $Z_n= g_n (X_1, \dots, X_n)$ that satisfies $\vert g_n(X_1, \dots, X_{k}, X_n) - g_n(X_1, \dots, 
X_{k'}, X_n)\vert &lt;1$, a <strong>concentration inequality</strong> gives</p> $1/n \log P(1/n (Z_n - E Z_n) \geq C ) \leq - H(\frac{C+1}{2} \vert \frac{1}{2})$ <p>for $H$ the KL divergence between two Bernoullis.</p> <p>Finally, we can extend the result beyond $R^d$:</p> <h3 id="craméres-theorem-for-abstract-empirical-measure">Cramér’s Theorem for abstract empirical measure</h3> <p>We assume $\mu_n$ is the law of $S_n= \frac{1}{n} \sum_{i=1}^n X_i$ on a locally convex, Hausdorff, topological real vector space $\mathcal{X}$, and that there exists a Polish space $\Xi \subset \mathcal{X}$ such that $\mu(\Xi)=1$. Then $\mu_n$ satisfies the LDP in both $\Xi$ and $\mathcal{X}$ with rate function $\Lambda^*$.</p> <h2 id="transformation-of-ldps">Transformation of LDPs</h2> <h3 id="contraction-inequality-of-a-mapping">Contraction principle of a mapping</h3> <p>Let $\mathcal{X}$ and $\mathcal{Y}$ be two Hausdorff topological spaces and $f: \mathcal{X} \to \mathcal{Y}$ a continuous map. If $\{\mu_\epsilon\}$ satisfies the LDP with rate function $I$, then $\{\mu_{\epsilon} f^{-1}\}$ satisfies the LDP with rate function</p> $I'(y) = \inf\{I(x): y=f(x)\}.$ <h3 id="ldp-from-exponential-approximation">LDP from exponential approximation</h3> <p>Assume two families of random variables $\{Z_\epsilon\}$ and $\{Z_\epsilon'\}$ with joint law $P_{\epsilon}$ have marginal probability measures $\{\mu_\epsilon\}$ and $\{\mu_\epsilon'\}$ on a metric space $(\mathcal{Y}, d)$. These two families of probability measures are exponentially equivalent if, for every $\delta &gt; 0$,</p> $\limsup_{\epsilon \to 0} \epsilon \log P_{\epsilon}( \Gamma_{\delta})= -\infty$ <p>where the set $\Gamma_{\delta}= \{ (y, \tilde y ): d(y, \tilde y )&gt; \delta \}$.</p> <p>Then the same LDP holds for $\{\mu_\epsilon\}$ and $\{\mu_\epsilon'\}$.</p> <p>In practice we often approximate a distribution by a sequence of simplified distributions.</p> <h3 id="laplace-approximation-varadhans-integral">Laplace approximation: Varadhan’s Integral</h3> <p>In the normal case of Cramér’s Theorem, $I(x)=x^2 / 2\sigma^2$. 
Is $I(x)$ more relevant than the inverse variance in some generalized Laplace approximation, especially when the variance is not even defined?</p> <p>First, suppose $\mu_{\epsilon}$ is on $R$, and we assume the LDP in the form $\epsilon \log \mu_{\epsilon} (X&lt;x_0 ) \approx - I(x_0)$. Taking the derivative in $x_0$, we have</p> $\frac {d\mu_{\epsilon}} {dx} \approx \exp(-\frac{I(x)}{\epsilon}).$ <p>For any $\phi(x)$, we run a Taylor expansion at $\bar x = \arg\max (\phi(x)- I(x))$, where the first-order term vanishes, and we have</p> $\phi(x)- I(x) = \phi(\bar x)- I(\bar x) + \frac{(x-\bar x)^2}{2} \frac{d^2}{dx^2} (\phi( x)- I( x))\vert_{x=\xi}$ <p>Hence we compute the integral by</p> $\epsilon \log \int_R \exp (\phi(x)/\epsilon) d \mu_{\epsilon} \approx (\phi(\bar x)- I(\bar x))$ <p>Now, in a general space, suppose $\{\mu_{\epsilon}\}$ satisfies the LDP with rate $I(\cdot)$ on the space $\mathcal{X}$, and assume $\phi: \mathcal{X} \to R$ is any continuous function. If, further, either the tail condition</p> $\lim_{M\to \infty} \limsup_{\epsilon\to 0}\epsilon \log E [\exp(\frac{\phi(Z_\epsilon)}{\epsilon}) 1_{\{\phi(Z_\epsilon)\geq M \} } ] = -\infty$ <p>holds, or for some $\gamma&gt;1$</p> $\limsup_{\epsilon\to 0}\epsilon \log E [\exp(\frac{\gamma\phi(Z_\epsilon)}{\epsilon})] &lt; \infty,$ <p>then $$\lim_{\epsilon\to 0}\epsilon \log E [\exp(\frac{\phi(Z_\epsilon)}{\epsilon})] = \sup_{x \in \mathcal{X}} (\phi(x)- I(x))$$</p> <p>Varadhan’s Integral can often be used to approximate the normalization constant.</p> <p>Varadhan’s Integral generalizes the MGF to any non-linear function. We now consider the inverse problem:</p> <p>Define $\Gamma_{f}= \lim_{\epsilon \to 0} \epsilon \log \int_x \exp(f(x)/\epsilon) d\mu_{\epsilon}$</p> <p><strong>Bryc inverse lemma:</strong> Suppose $\mu_{\epsilon}$ are exponentially tight and $\Gamma_{f}$ exists for all continuous and bounded $f \in C_b(\mathcal{X})$. 
Then $\mu_{\epsilon}$ satisfies the LDP with the good rate function (the largest difference between a sum and a log-sum-exp)</p> $I(x)= \sup_{f \in C_b(\mathcal{X})} (f(x)-\Gamma_{f} )$ <p>and dually</p> $\Gamma_{f} = \sup_{x \in \mathcal{X}} (f(x)- I(x) ).$ <p>We may restrict $C_b(\mathcal{X})$ to only linear functionals if $\mathcal{X}$ is a topological vector space.</p> <h2 id="sanovs-theorem-for-empirical-measures">Sanov’s Theorem for empirical measures</h2> <p>The LLN for the empirical mean of IID samples motivates Cramér’s Theorem. Likewise, we know that the empirical process usually converges to the actual distribution, and Sanov’s Theorem answers how quickly.</p> <p>Consider iid random variables $Y_1, \dots, Y_n$ that are $\Sigma$-valued, where $\Sigma$ is a Polish space. $Y_i$ has probability measure $\mu \in M_1(\Sigma)$, where $M_1(\Sigma)$ is the space of all probability measures on $\Sigma$. We may estimate $\mu$ empirically by</p> $L_n = 1/n \sum_{i=1}^n\delta(y=Y_i)$ <p>$L_n$ is also viewed as an element of $M_1(\Sigma)$.</p> <p>We equip $M_1(\Sigma)$ with the weak topology (consider the open sets generated by the open balls $\{\nu: \vert\int \phi d\nu - x \vert &lt; \delta\}$ for all bounded continuous $\phi$. 
) $M_1(\Sigma)$ is a Polish space when equipped with the Lévy metric.</p> <p>By the abstract Cramér’s Theorem in a Polish space (where we replace $X_i \in R$ by $\delta(Y_i) \in M_1(\Sigma)$), we know $L_n$ satisfies the LDP in $M_1(\Sigma)$ with convex rate function</p> $\Gamma^*(\nu)= \sup_{\phi\in C_b(\Sigma)}\{ \langle \phi, \nu \rangle - \Gamma(\phi) \}$ <p>where $\Gamma(\phi)= \log E[\exp(\langle \phi, \delta(Y) \rangle )] = \log \int_\Sigma \exp(\phi) d\mu$</p> <p>Such a rate function is difficult to compute, but Sanov’s Theorem says</p> $\Gamma^*(\nu) = KL( \nu, \mu )= \int_\Sigma \log \frac{d \nu}{d \mu} d\nu$ <p>Loosely speaking, for a closed set $\Gamma\subset M_1(\Sigma)$ and large $n$, $$\frac{1}{n} \log P( L_n \in \Gamma) \approx -\inf_{\nu\in \Gamma} KL (\nu, \mu).$$</p> <h2 id="sanovs-theorem-for-stationary-gaussian-processes">Sanov’s Theorem for Stationary Gaussian Processes</h2> <p>Now the data is a stationary Gaussian process $\{X_k\}$ ($-\infty &lt; k &lt; \infty$). We define the probability space $\Omega = \prod_{j = -\infty}^{\infty} \mathbb R_j$, with elements $\omega = \{x_j\} \in \Omega$ such that $\omega(j) = x_j$, and $P$ is the stationary Gaussian process probability measure on $\Omega$ induced by $\{X_k\}$. It has mean $\mathbb E [X_k] = 0$ and covariance $\mathbb E[X_0 X_j] = \rho_j$.</p> <p>How quickly does the empirical measure converge? Indeed, we can still find an LDP for it. The main result is from <a href="https://projecteuclid.org/euclid.cmp/1103941986">Donsker and Varadhan, 1986</a>.</p> <p>Bochner’s Theorem says we can decompose the covariance by frequency, $Cov(X_0, X_j)=\rho_j = \frac{1}{2\pi} \int_0^{2\pi} e^{i j \theta} f(\theta) d\theta$, where we call $f(\theta)$ the <em>spectral density</em>. It is assumed to be continuous on $[0, 2\pi]$ with $f(0) = f(2\pi)$.</p> <p>Let $T$ be the shift operator on $\Omega$, i.e., $T(\omega) (j) = x_{j+1}$. 
From $\omega$ we construct the periodic sequence $\omega^{(n)}$, defined as</p> $\dots, x_1, \dots, x_n, x_1,\dots,x_n, x_1, \dots, x_n, \dots$ <p>This defines a map $\pi_n$ from $\Omega$ to $M_{s}$:</p> $\pi_n(\omega) := \frac{1}{n}(\delta_{\omega^{(n)}} + \delta_{T\omega^{(n)}} + \ldots + \delta_{T^{n-1}\omega^{(n)}}).$ <p>$M_{s}$ is the space of all stationary measures on $\Omega$. $Q_n = P \pi_n^{-1}$ is the probability measure on $M_{s}$ induced by $\pi_n$: $Q_n (A)= P(\omega: \pi_n(\omega)\in A).$</p> <p>Then $Q_n$ satisfies the LDP with good rate function $H_f( R )$. $H_f( R )$ is effectively the entropy of the stationary process $R$ with respect to the stationary Gaussian process $(X_{k})_{k=-\infty}^{\infty}$.</p> <p>For $R \in M_{s}$ and $A \subset \mathbb R$, we let $R(A\vert \omega) = R(X_{0} \in A \vert X_{-1}, X_{-2}, \dots)$ be the regular conditional probability distribution of $X_{0}$ given the entire past. Denote by $r(y\vert \omega)$ the corresponding density. This gives the explicit form of the rate: $$H_{f}( R ) = \mathbb E^R \left\{ \int_{-\infty}^{\infty} r(y\vert \omega) \log r(y\vert \omega) dy\right\} + \frac{1}{2} \log 2\pi + \frac{1}{4\pi} \int_0^{2\pi} \frac{dG(\theta)}{f(\theta)} + \frac{1}{4\pi} \int_0^{2\pi} \log f(\theta) d\theta.$$</p> <h3 id="sketch-the-proof">Sketch the proof:</h3> <p>By Fourier expansion, we write $\sqrt{f(\theta)} = \sum_{n = -\infty}^{\infty} a_n e^{in\theta}.$ Let $(\xi_k)$ be a sequence of independent Gaussian random variables with mean 0 and variance 1. Then, by Parseval’s theorem, $(X_k)_{k=-\infty}^{\infty}$ defined by</p> $X_k = \sum_{n = -\infty}^{\infty} a_{n-k} \xi_n = \sum_{n = -\infty}^{\infty} a_n \xi_{n+k}$ <p>is a stationary Gaussian process with mean 0 and covariance $E[X_{0} X_{j}] = \rho_{j} = \frac{1}{2\pi} \int_{0}^{2\pi} e^{ij \theta} f(\theta) d\theta .$</p> <p>Let $b_j = a_j \left(1 - \frac{\vert j\vert}{N} \right)$ for $\vert j\vert &lt; N$. 
For each positive integer $N$, define a new process $(X_k^N)_{k=-\infty}^{\infty}$ by $$X_k^N = \sum_{\vert j\vert&lt;N} b_{j} \xi_{j+k} = \sum_{\vert j\vert &lt;N} a_j \left(1 - \frac{\vert j\vert }{N} \right) \xi_{j+k},$$</p> <p>where $X_k^{N}$ is the Cesaro mean of the partial sums of $X_{k}$, i.e.,</p> $X_{k}^{N} = \frac{1}{N} \sum_{i = 0}^{N-1} \sum_{j=-i}^{i} a_{j} \xi_{j+k} = \sum_{\vert j\vert &lt;N} a_j \left(1 - \frac{\vert j\vert }{N} \right) \xi_{j+k}.$ <p>We define $F: \Omega \to \Omega$ by</p> $(F(\omega))(j) = \sum_{k=-\infty}^{\infty} a_{k} x_{j+k}.$ <p>$F$ maps $(\xi_{k})$ to $(X_{k})$.</p> <p>We define $F_N: \Omega \to \Omega$ such that</p> $(F_N(\omega)) (j) = \sum_{\vert k\vert &lt; N} b_k x_{j+k}.$ <p>The mapping $F_{N}$ induces a corresponding map $\tilde F_N: M_s \to M_s$.</p> <p>Let $\mu$ be the measure on $\Omega$ induced by $(\xi_k)$. Define $Q_n$ on $M_{s}$ such that $Q_n(A) := \mu\{\omega: \pi_n \cdot F(\omega) \in A\}$. Define $Q_n^N$ on $M_{s}$ such that $Q_n^N(A) := \mu\{\omega: \pi_n \cdot F_N(\omega) \in A\}$. Define $\tilde Q_n^N$ on $M_{s}$ such that</p> $\tilde Q_n^N(A) := \mu\{\omega: \tilde F_N \cdot \pi_n (\omega) \in A\}.$ <p>Recall that $\pi_n(\omega) = \frac{1}{n}(\delta_{\omega^{(n)}} + \delta_{T\omega^{(n)}} + \ldots + \delta_{T^{n-1}\omega^{(n)}})$. Let $$\tilde F_{N} \cdot \pi_n(\omega):= \frac{1}{n}[\delta_{F_N(\omega^{(n)})} + \delta_{F_N(T\omega^{(n)})} + \ldots + \delta_{F_N(T^{n-1}\omega^{(n)})}]$$ and $$\pi_n \cdot F_{N} (\omega):= \frac{1}{n}[\delta_{(F_N(\omega))^{(n)}} + \delta_{T(F_N(\omega))^{(n)}} + \ldots + \delta_{T^{n-1}(F_N(\omega))^{(n)}}].$$</p> <p>We apply Donsker Theorem and obtain the LDP for $\tilde F_{N} \cdot \pi_n$.</p> <p>The total variation gap $\Vert \tilde F_{N} \cdot \pi_n - \pi_n \cdot F_{N} \Vert_{TV}$ is $o(1)$, which further bounds the Lévy metric between them:</p> $d(\tilde F_{N} \cdot \pi_n , \pi_n \cdot F_{N} ) = o(1).$ <p>So they are <em>exponentially equivalent</em>. 
This leads to the LDP for $Q_n^N$ using a triangle inequality.</p> <p>Likewise, we claim that $Q_n^N$ is an <em>exponential approximation</em> of $Q_n$, using a triangle inequality and the contraction theorem, and therefore the LDP for $Q_n$ follows.</p> <p>P.S. Gonzalo Mena reminded me of the connection between Sanov’s Theorem and exponential families, which I just learned from these <a href="https://www.tau.ac.il/~tsirel/Courses/LargeModerateDev/lect2.pdf">lecture notes</a> by Tsirelson.</p> <h2 id="any-large-deviation-is-done-in-the-least-unlikely-of-all-the-unlikely-ways">“Any large deviation is done in the least unlikely of all the unlikely ways!”</h2> <p>For any measure $\mu$ on $\mathcal{X}$ and a function $u$: $\mathcal{X} \to R$ we can define a tilted measure</p> $\mu_u (x) \propto \mu (x) \exp ( u(x) )$ <p>We can prove</p> $\mu_u = \arg\min_{\nu :\int u d\nu \geq \int u d \mu_u} KL(\nu, \mu)$ <p>Further, if $\mathcal{X}=\{1, \dots, d\}$, we endow $\mathcal{X}^n$ with the product measure $\mu^n$, and count the frequencies $\eta_n(j) = 1/n \sum_{k=1}^n1_{\{x_k=j\}}$ for each realization of $X$.</p> <p>Now, conditioning on the event $E_n=\{\int u d\eta_n \geq c\}$, the random measure $\eta_n$ converges in probability to the tilted measure $\mu_{tu}$, where $t &gt; 0$ is such that $\int u d\mu_{tu} = c$.</p> <p>This is because</p> $\mu_{tu} = \arg\min_{\nu :\int u d\nu \geq c} KL(\nu, \mu)$ <p>Generally, consider the set</p> $E_c=\{\int u d\nu \geq c\} \subset M_1(\mathcal{X}),$ <p>for which we know</p> $1/n \log P(L_n \in E_c ) \approx -KL(\mu_{tu}, \mu)$ <p>Therefore, writing $\delta(\mu_{tu})$ for a small neighborhood of $\mu_{tu}$ and $Q_n (A) = P(L_n \in A)$ for the induced measure on $M_1(\mathcal{X})$, $$Q_n(\delta(\mu_{tu}) \vert E_c ) = 1- Q_n(E_c \setminus \delta(\mu_{tu}))/ Q_n(E_c).$$</p> <p>As long as $\mu_{tu}$ is the unique minimizer of $\arg\min_{\nu :\int u d\nu \geq c} KL(\nu, \mu)$, the ratio on the right decays exponentially fast, and we can conclude $$Q_n (\delta(\mu_{tu}) \vert E_c ) \to 1.$$</p> Yuling Yao
Back to January 18 when there were 60 cases globally 2020-05-03T00:00:00+00:00 http://www.yulingyao.com/blog/2020/back <p>On January 18 2020, in the pre-pandemic era, when the stock markets in both the US and China were still busy celebrating the phase-one trade deal, I saw this news in the China section of <a href="https://www.bbc.com/news/health-51148303">BBC</a>, which covered the story of a brand new yet underemphasized virus in China:</p> <blockquote> <p>There have been more than 60 confirmed cases of the new coronavirus (globally).</p> </blockquote> <p>BBC reported an estimate by <a href="http://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/news--wuhan-coronavirus/">ICL</a> that the number of cases was likely to be understated; in fact,</p> <blockquote> <p>experts estimate a figure nearer 1,700.</p> </blockquote> <p>Indeed, I remember I was quite horrified when I read their prediction:</p> <blockquote> <p>The virus ‘will have infected hundreds’.</p> </blockquote> <p>So I immediately checked their model, which was effectively a capture-recapture model based on 3 positive cases confirmed outside China. You could estimate the total number of cases in Wuhan by dividing 3 by the probability that a patient leaves Wuhan for international travel, which can in turn be estimated by the total outbound traffic count divided by the city population. There were many simplifications, but it was fine to me.</p> <p>I knew Jonathan Auerbach had used a clever negative binomial model for counting the number of rats in NYC, which is mathematically equivalent to this covid model. 
So I emailed Jonathan:</p> <blockquote> <p>The inference based on 3 positive cases seems not convincing, yet should I buy more Lysol now?</p> </blockquote> <p>For a statistician, this could have been a reasonable question to ask, as the whole estimate was driven by three data points. Given that in our daily research we deal with data on the scale of millions, kind of making fun of three data points in a harmless email was about the limit of a proportionate response.</p> <p>Jon replied:</p> <blockquote> <p>You should blog about it! You can write a decision theory paper about how much lysol you should buy based on the data.</p> </blockquote> <p>Except I didn’t.</p> <p>Except four months later, the whole world and 7 billion human beings are fundamentally changed, not only by three batches of viruses, but also by our ignorance and indifference in the early stage, to which I probably contributed too.</p> <p>Except I did (unusually) purchase puts on LQD on a monthly rolling basis starting from December.</p> <p>Except that was for a completely different reason: I had heard someone talking about credit market risks amid sky-high public spending.</p> <p>Except in retrospect, I don’t know what we should really learn from this tragedy. Would I be more alert next time I hear the word coronavirus?</p> <p>Sure, except such a response is overfitting.</p> <p>Would I be alerted next time by some analysis using three data points?</p> <p>I doubt it.</p> <p>In a Bayesian updating regime, the posterior (for the future) would hardly change, given the overwhelming evidence drawn from everywhere else, even if those are data we have drawn deliberately to make life sound fully promising and hopeful.</p> <p>As statisticians, we are trained to make inferences from any given dataset. 
But having collected all genes of all creatures in this universe, or monitored all satellite images of all shopping malls, does not eliminate the room for agnosticism.</p>Yuling YaoOn January 18 2020, in the pre-pandemic era, when the stock markets in both the US and China were still busy celebrating the phase-one trade deal, I saw this news in the China section of BBC, which covered the story of a brand new yet underemphasized virus in China: There have been more than 60 confirmed cases of the new coronavirus (globally).Sample sd of indirect effects in a multilevel mediation model2020-05-03T00:00:00+00:002020-05-03T00:00:00+00:00http://www.yulingyao.com/blog/2020/mediation<p>M asked me a question which essentially looks like this: In a mediation model, $a$ and $b$ are the regression coefficients along the mediation path, and the final quantity of interest is therefore the product $ab$. In a multilevel model, for each group $j$, we let both a[j] and b[j] vary by group, which we could model in Stan as a multivariate normal</p> $(a_j,b_j)^T \sim MVN ((a_0, b_0)^T, \Sigma),$ <p>According to literature xxx, we could estimate the expectation of $ab$ in a typical group by (the sample mean of)</p> $a_0 b_0 + \sigma_{12},$ <p>for $\sigma_{12}$ the off-diagonal element in $\Sigma$. Does it make sense to summarize the uncertainty by the sample sd of the draws above?</p> <p>The answer is No. The formula really comes from a point-estimation context, with</p> $E[ab]= E[a]E[b] + Cov(a,b) = a_0 b_0 + \sigma_{12}$ <p>The law of total variance says</p> $Var[ab]= E [Var [ab \mid a_0, b_0, \Sigma ]] + Var [E[ab \mid a_0, b_0, \Sigma ]]$ <p>Since $E[ab \mid a_0, b_0, \Sigma] = a_0 b_0 + \sigma_{12}$, the sample sd of the $a_0 b_0 + \sigma_{12}$ draws only accounts for the second term, so I don’t think it means anything on its own. 
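</p> <p>A minimal simulation sketch of this point, in Python with the standard library only. The hyperparameter values below are made up for illustration and are held fixed, which is an assumption: in a real fit they would be posterior draws. With them fixed, the closed-form quantity $E[a]E[b] + Cov(a,b)$ has no spread at all, while the group-level product $a_jb_j$ still varies; that remaining spread is the first term in the law of total variance.</p>

```python
import math
import random

random.seed(0)

# Hypothetical population-level values (assumptions, not from any dataset).
A0, B0 = 0.5, 0.8                # mean paths a_0 and b_0
SD_A, SD_B, RHO = 0.3, 0.4, 0.6  # group-level sds and correlation
COV_AB = RHO * SD_A * SD_B       # off-diagonal element of Sigma


def draw_group():
    """Draw one (a_j, b_j) pair from the bivariate normal."""
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    a_j = A0 + SD_A * z1
    b_j = B0 + SD_B * (RHO * z1 + math.sqrt(1 - RHO**2) * z2)
    return a_j, b_j


# Group-level indirect effects a_j * b_j.
products = [a * b for a, b in (draw_group() for _ in range(100_000))]

mc_mean = sum(products) / len(products)
mc_sd = math.sqrt(sum((p - mc_mean) ** 2 for p in products) / len(products))
closed_form = A0 * B0 + COV_AB   # E[a]E[b] + Cov(a, b)

print(f"Monte Carlo mean of a_j b_j: {mc_mean:.4f}")
print(f"E[a]E[b] + Cov(a, b):        {closed_form:.4f}")
print(f"sd of a_j b_j across groups: {mc_sd:.4f}")
```

<p>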
The easiest way to solve the problem is to obtain posterior draws of the group-level indirect effects $a_jb_j$ directly.</p>Yuling YaoM asked me a question which essentially looks like this: In a mediation model, $a$ and $b$ are the regression coefficients along the mediation path, and the final quantity of interest is therefore the product $ab$. In a multilevel model, for each group $j$, we let both a[j] and b[j] vary by group, which we could model in Stan as a multivariate normalIf you encounter a recession in your 20s, you might be less likely to win the White House2020-04-17T00:00:00+00:002020-04-17T00:00:00+00:00http://www.yulingyao.com/blog/2020/potus<p>I was talking to BenB about how unlucky we (Ben and me) would be amid a looming economic crisis and financial shortage when we were about to enter the job market.</p> <p>To support my straw-man argument that “we are quite screwed up”, I found that Eisenhower was 39 in 1929, while JFK was 12; if COVID-19 has an economic effect as serious as the Great Depression, then my generation (currently late 20s) would be the one sandwiched between JFK and Eisenhower.</p> <p>OK, life is still fine without the goal of being POTUS, but I am kinda curious: what is the cohort effect induced by a recession in early life? And, more pedantically, if we have to encounter some recession from time to time, which age would be the worst to face it?</p> <p>I ask this question because I will graduate in 2021, so basically I know I am likely to be at the worst age. But again, you occasionally need a Stan model to prove your straw-man argument.</p> <p>I fit an <em>age-period-cohort</em> model very similar to the one in <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/cohort_voting_20191017.pdf">Yair Ghitza et al</a>. They used it to analyze the generation effect induced by presidential approval rate. 
Many years ago Yu-Sung and I also utilized a similar model to fit <a href="https://asiapolmeth.princeton.edu/sites/default/files/polmeth/files/su_chinese_happiness_generation.pdf">the generation effect of the Chinese people’s happiness</a>; we even copied the title.</p> <p>For this toy dataset, I collected all presidents’ birth years, and <a href="https://datasets.socialhistory.org/dataset.xhtml?persistentId=hdl:10622/8FCYOX">US inflation-adjusted GDP</a>. The early historical data are imputed from an estimate by Clio-Infra.</p> <p><img src="/blog/images/2020/hist.png" alt="histogram" title="histogram" /></p> <p>It is tempting to claim the presidents’ birth years are not distributed evenly, until I realize it is actually hard to construct a precise test for such a claim. At first I thought it should be quite rare to see three presidents born in the same year. But suppose I am sampling 45 draws with replacement from year 1732 (birth year of Washington) to 1961 (Obama), and each year is equally likely to be sampled: the chance of sampling a prespecified year exactly three times is about $\binom{45}{3}/230^3 \approx 0.001$, and after a multiple-testing adjustment over the 230 candidate years, the p value is quite big. How about a one-sample Kolmogorov-Smirnov test:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dgof::ks.test(president_birth_year, ecdf(1732:1961)) # D = 0.10918, p-value = 0.8401 </code></pre></div></div> <p>So I guess it is actually distributed quite evenly: no deep state, no conspiracy theory, folks! (That said, the KS test has rather low power, and I remember Persi Diaconis has a recent math paper on hypothesis testing on discrete uniform variables. In this dataset you could probably find some better test statistic to reject the null, but that is not the purpose of this post anyway.)</p> <p>Return to the cohort effect. 
Let’s assume the GDP growth rate, denoted by $x_j$ in year $j$, is accumulated in one’s early life via an age-varying accumulation rate $\gamma_a$ at age $1\leq a \leq A$; I truncate at $A=35$ ad hoc. The generation that was born in year $k$ will receive a total effect</p> $\alpha_k= \sum_{i=1}^A \gamma_i x_{k+i-1}.$ <p>We assume a Poisson observational model with mean $\exp ( \alpha_k + \alpha_0 )$:</p> $y_k \sim \mathrm{Poisson} (\exp ( \alpha_k + \alpha_0 )).$ <p>Here $y_k$ is the count of presidents born in year $k$. We smooth the estimate with an AR prior $\gamma_i \sim \mathrm{normal}(\gamma_{i-1}, \tau), \tau \sim \mathrm{normal}(0,1)$.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data {
  int&lt;lower=0&gt; A; // number of ages to accumulate
  int&lt;lower=0&gt; S; // number of years
  int y[S]; // counts of presidents in that year
  matrix[S,A] growth_at_age; // gdp growth rate
}
parameters {
  vector&lt;lower=0&gt;[A] gamma; // age-varying accumulate rate
  real c; // intercept (alpha_0 in the text)
  real&lt;lower=0&gt; tau; // scale of the AR prior
}
transformed parameters{
  vector[S] agg_effect; // aggregated cohort effect
  agg_effect=growth_at_age*gamma+rep_vector(c, S);
}
model {
  for(s in 1:S)
    y[s] ~ poisson(exp(agg_effect[s]));
  gamma~normal(0,0.5);
  gamma[A]~normal(0,0.1);
  for(a in 2:A)
    gamma[a-1]~normal(gamma[a],tau);
  tau~normal(0,1);
}
</code></pre></div></div> <p><img src="/blog/images/2020/recession.png" alt="stan_fit" title="stan fit" /></p> <p>Hmmm, I don’t want to overstate the result, as it is quite a weak effect anyway, plus this is inference based on 45 data points. But basically from the graph, there is a positive relation between how likely people born in a certain year are to become president and the GDP growth rate in their early life. 
If I am allowed to exaggerate the finding, I will point out there are two local maxima at ages 16 and 23 in the posterior mean of $\gamma$, around the middle school and college graduation ages, which is probably equivalent to my age now given the inflation. Not good for me, folks. For older ages (&gt;30), the effect to accumulate is smaller.</p> <p>The right-hand panel shows the accumulated cohort effect. In particular, people born around 1900 (in their late 20s in 1929) are less fortunate, and this is in line with a gap in observed POTUS birth years around 1900 in the previous histogram.</p> <p>In the end I will offer a few counterarguments to reassure people of my generation who might be worried about the current situation and its long-term effect:</p> <ol> <li>45 is a super small sample size.</li> <li>Nothing causal is analyzed here.</li> <li>COVID is real, and more uncertainty is yet to come. History is never stationary.</li> <li>Many model assumptions are horrible. Voting for president is not multinomial sampling, and there are many other factors that should be included.</li> <li>Instead of a global comparison, the local comparison is more important. To have a blessed career, you are not competing with Lincoln and Washington; you only need a relative advantage over your neighboring generations. Maybe I should fit a spline with a basis function that has one peak and two bottoms.</li> </ol> <p>In short, as a generation, facing covid-19 in early career is hardly a blessing. 
But it is like a “slightly increased risk” result from 23andme gene screening: setting aside the unchangeable generation effect (I mean, this is the time we are praying for a <em>do</em> operator), there is much else we can do.</p>Yuling YaoI was talking to BenB about how unlucky we (Ben and me) would be amid a looming economic crisis and financial shortage when we were about to enter the job market.Do men cite themselves more than women?2020-04-12T00:00:00+00:002020-04-12T00:00:00+00:00http://www.yulingyao.com/blog/2020/citation<p>Today I was looking into my citation profile. And when I checked self-citation, I noticed a <a href="https://www.nature.com/news/men-cite-themselves-more-than-women-do-1.20176">Nature article</a> that boldly claimed “Men cite themselves more than women do…The apparent trend has been on the rise over the past two decades.” They not only come close to interpreting it as a causal relation, but also go further and suggest “it is something that hiring and tenure committees should take into account when assessing the impact of researchers and their work”.</p> <p>Well, the method of this research is simply to count the citations in each group and compute the self-citation rate. But the claim is hardly anything causal. Suppose a person cites all of his previous publications in each new paper. When he has $N$ publications, he will have $1+2+ \dots + (N-1) = N(N-1)/2 = O(N^2)$ self-citations, making his self-citation rate per paper $O(N)$. In reality this cannot happen: when $N=100$, one cannot really cite 100 previous publications, so the self-citation rate probably grows as $O(N^d)$ for some $0&lt;d&lt;1$. 
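</p> <p>A toy sketch of this extreme citing rule, in Python. The function names are made up for illustration, and the rule itself (every paper cites all earlier ones) is the thought experiment above, not a model of real behavior. Two authors with exactly the same citing habit end up with very different self-citation rates simply because one has published more.</p>

```python
def self_citations(n_papers):
    """Total self-citations if every paper cites all earlier ones."""
    return sum(range(n_papers))  # 0 + 1 + ... + (n_papers - 1)


def self_citation_rate(n_papers):
    """Average number of self-citations per paper."""
    return self_citations(n_papers) / n_papers


for n in (5, 20, 80):
    print(n, self_citations(n), self_citation_rate(n))
```

<p>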
In short, the total number of publications is such a huge confounder that no conclusion is likely to be sensible without taking it into account.</p> <p>The booming literature on research citations reminds me of a story that the 20th-century Chinese novelist <em>Luxun</em> told in his biography: he used to be a student in a mining school where the students ran an experimental colliery. But the production of that colliery was so low that it was merely sufficient to feed the power generator of the pump used for mining.</p>Yuling YaoToday I was looking into my citation profile. And when I checked self-citation, I noticed a Nature article that boldly claimed “Men cite themselves more than women do…The apparent trend has been on the rise over the past two decades.” They not only come close to interpreting it as a causal relation, but also go further and suggest “it is something that hiring and tenure committees should take into account when assessing the impact of researchers and their work”.Should I do laundry and get grocery delivery at the same day or separate days?2020-03-29T00:00:00+00:002020-03-29T00:00:00+00:00http://www.yulingyao.com/blog/2020/laundry<p>So I have stayed in my room for more than three weeks, and by staying in the room I mean strictly staying in the room without even touching the knob of the outer door once – at the risk of sounding creepy.</p> <p>It seems I only have two exposures to outside risks: I need grocery delivery from Whole Foods Market, and I need to do laundry every week. Arguably both events have extremely low risk, but at the risk of sounding even more creepy, if I do want to minimize the risk, should I do laundry and get grocery delivery on the same day or on separate days in a week?</p> <p>There are obvious reasons for both arguments:</p> <p>The infection chance, as a function of virus amount in the aerosol or indeed on any other surface, has to be (nearly) strictly convex. 
This is due to two facts: (a) the probability lives in [0,1], so an inverse-logit type transformation carries linear functions to convex ones in the left part (unless I anticipate a more than 0.5 chance of infection from one laundry trip!), and (b) biologically, the virus amount only becomes meaningful after reaching a certain threshold. As a consequence, if the potential virus amounts coming from laundry and grocery delivery are independent (and therefore additive), separating them into two days strictly decreases the expected chance of actual infection by Jensen’s inequality:</p> $Pr(\mathrm{infection} | \mathrm{virus~amount~} A + \mathrm{virus~amount~} B ) &gt; Pr(\mathrm{infection} | \mathrm{virus~amount~} A) + Pr(\mathrm{infection} |\mathrm{virus~amount~} B )$ <p>On the other hand, a concentrated exposure can be sensible too. For one thing, I can save certain PPE (masks and gloves) by running both errands at the same time. Further, there are some concavities here: for example, elevator usage is concave, as I can arrange to carry the delivery package right after I finish the laundry in the basement. In general, the marginal increase in risk as a function of the number of exposures is likely decreasing. This is like airplane safety: if the fixed cost (airplane crash rate during take-off/landing; elevator use in laundry) dominates the variable cost, another Jensen inequality kicks in, and you do want to choose a direct flight rather than two connecting ones! $$Pr(\mathrm{virus~exposure} | \mathrm{event~}A + \mathrm{event~}B ) &lt; Pr(\mathrm{virus~exposure} | A) + Pr(\mathrm{virus~exposure} | B )$$</p> <p>I think in the airplane example, the risk is so low that the first factor is negligible (the Jensen inequality can also be expressed by a second-order Taylor expansion, but a logit type function will be nearly linear at the far left end). In the Covid-19 case, however, both factors weigh in. 
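</p> <p>A rough numerical sketch of the first (convexity) argument, in Python. Everything numeric here is an assumption chosen for illustration: the inverse-logit dose-response curve, its intercept and slope, and the two virus amounts. It compares the excess infection risk above the zero-exposure baseline when both doses arrive on the same day against the sum of the excess risks on two separate days, which for small probabilities approximates the chance of infection on either day.</p>

```python
import math


def infection_prob(virus_amount, intercept=-8.0, slope=1.0):
    """Inverse-logit dose-response curve; intercept and slope are made up."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * virus_amount)))


def excess_risk(virus_amount):
    """Infection probability above the zero-exposure baseline."""
    return infection_prob(virus_amount) - infection_prob(0.0)


dose_laundry, dose_grocery = 1.0, 1.5   # hypothetical virus amounts

# Same day: the two doses add up before passing through the convex curve.
same_day = excess_risk(dose_laundry + dose_grocery)
# Separate days: each dose passes through the curve on its own.
separate_days = excess_risk(dose_laundry) + excess_risk(dose_grocery)

print(f"same day:      {same_day:.6f}")
print(f"separate days: {separate_days:.6f}")
```

<p>Under these made-up numbers the same-day excess risk comes out larger, as the convexity argument predicts; the fixed-cost argument for combining the errands is not modeled here.</p> <p>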
So there is probably an optimal solution with a randomized decision each week.</p>Yuling YaoSo I have stayed in my room for more than three weeks, and by staying in the room I mean strictly staying in the room without even touching the knob of the outer door once – at the risk of sounding creepy.