<h1>Gaussian process regressions having opinions or speculation.</h1>
<p>Yuling Yao’s Blog (Bayesian Statistics, Machine Learning). Yuling Yao, 2020-05-19. Original post: http://www.yulingyao.com/blog/2020/gp</p>
<p>I occasionally read Howard Marks’s memos, and in my recent infrequent visits I have repeatedly encountered him citing Marc Lipsitch, Professor of Epidemiology at Harvard, who says that (in Lipsitch’s covid research and in Marks’s money making) there are:</p>
<ol>
<li>facts,</li>
<li>informed extrapolations from analogies to other viruses and</li>
<li>opinion or speculation.</li>
</ol>
<p>That is right. Statisticians need some stationarity and smoothness assumptions in order to learn from data, and we thereby always put ourselves at risk of over-extrapolation.</p>
<p>In machine learning, novelty detection and out-of-distribution uncertainty used to be a hot spot, especially given their connection to AI safety, and I have followed papers in this area for a while. (I think it is still a hot spot, but I don’t know for sure. Indeed, if someone tells you that he is completely sure about the present or the future, he is purely extrapolating. Anyway, it is fair to assume the heat of an area does not dissipate overnight, so maybe it is still at least a warm spot.)</p>
<p>This is partly because many deep models ignore parameter uncertainty and are overconfident. But I feel there is a dangerous tendency for people to treat some non-parametric Bayesian models as always-right-but-hard-to-fit models, as if we would never need to worry about novelty detection and out-of-sample uncertainty if only we knew how to fit a Gaussian process with 10^10 points.</p>
<p>But a GP is not immune to over-extrapolation. Here I generate two-dimensional data (x, y) with x supported only near 1 and 3, drawn from the mixture .5 N(1, 0.01) + .5 N(3, 0.01). I can still fit a GP, and it does return results that fit the data on their support.</p>
<p><img src="/blog/images/2020/gp_demo.png" alt="gp" title="gp" /></p>
<p>But wait, why is it so sure about what happens in between? There is zero data in the middle! How could you know that f(x) at 2 is identically 0, instead of -20004, or 343583? The model is purely extrapolating.</p>
<p>You could probably guess that the fitted length scale is very big, indeed longer than the span of x, so the GP effectively becomes a linear regression. That is not wrong; a linear regression can be useful too.</p>
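<p>The effect of the length scale on the uncertainty in the no-data zone can be illustrated with a minimal NumPy sketch. It is not the code behind the figure: the squared-exponential kernel, the noise level, the choice of pure-noise y, and the two fixed length scales (10 and 0.1) are all assumptions made for illustration, and the hyperparameters are held fixed rather than fitted.</p>

```python
import numpy as np

def rbf_kernel(a, b, length_scale, signal_sd=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return signal_sd**2 * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, length_scale, noise_sd=0.1):
    """Posterior mean and sd of a zero-mean GP at x_test (hyperparameters held fixed)."""
    k_train = rbf_kernel(x_train, x_train, length_scale) + noise_sd**2 * np.eye(len(x_train))
    k_cross = rbf_kernel(x_test, x_train, length_scale)
    k_test = rbf_kernel(x_test, x_test, length_scale)
    mean = k_cross @ np.linalg.solve(k_train, y_train)
    cov = k_test - k_cross @ np.linalg.solve(k_train, k_cross.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

rng = np.random.default_rng(0)
n = 40
# x from .5 N(1, 0.01) + .5 N(3, 0.01); exactly half of the points in each cluster
x = np.repeat([1.0, 3.0], n // 2) + 0.1 * rng.standard_normal(n)
# ASSUMPTION: y is pure noise around 0 (the post does not say how y was generated)
y = 0.1 * rng.standard_normal(n)

x_mid = np.array([2.0])  # the no-data zone
mean_long, sd_long = gp_posterior(x, y, x_mid, length_scale=10.0)
mean_short, sd_short = gp_posterior(x, y, x_mid, length_scale=0.1)
print(f"long length scale:  posterior sd at x=2 is {sd_long[0]:.3f}")
print(f"short length scale: posterior sd at x=2 is {sd_short[0]:.3f}")
```

<p>With the long length scale the posterior sd at x = 2 is tiny even though no data are nearby, while with the short length scale it falls back to roughly the prior sd. Both answers come from the prior, not from data in the middle.</p>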
<p>Even worse, such over-confidence is self-confirming. Wikipedia says “the length scale tells how safe it is to extrapolate outside the data span”. That is wrong. It does not tell us how safe it is. The inference in the no-data zone comes from the prior, which is a mean-zero GP, and it is dangerous, arrogant and reckless to treat that as always the true model and the right uncertainty we are looking for.</p>
<p>Ideally we want a model/inference that goes on strike and yells at the user when it perceives that it is making a non-data-ful extrapolation. Of course anything beyond data comes from the prior, and the prior is just more data, so technically it is as kosher to estimate the posterior outside the data domain by a crazy GP prior as to estimate the empirical density by a delta function. These are two extremes on the spectrum of how we weigh the relative reliance on prior and data. If you do not yell at the empirical process, why should the GP yell at you?</p>
<p>It is not entirely sane to end this post with the previous question, to which I do not know the answer. But the main message is clear: a GP is not always right. A GP can be as over-extrapolating as a linear regression. And in many cases we do not know whether the GP we are running is over-extrapolating or not.</p>
<h1>Back to January 18 when there were 60 cases globally</h1>
<p>Yuling Yao, 2020-05-09. Original post: http://www.yulingyao.com/blog/2020/back</p>
<p>On January 18, 2020, in the pre-pandemic era, when the stock markets in both the US and China were still busy celebrating the phase-one trade deal, I saw this news in the China section of the <a href="https://www.bbc.com/news/health-51148303">BBC</a>, which covered the story of a brand-new yet underemphasized virus in China:</p>
<blockquote>
<p>There have been more than 60 confirmed cases of the new coronavirus (globally).</p>
</blockquote>
<p>The BBC reported an estimate by <a href="http://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/news--wuhan-coronavirus/">ICL</a> that the number of cases was likely to be understated; in fact,</p>
<blockquote>
<p>experts estimate a figure nearer 1,700.</p>
</blockquote>
<p>Indeed, I remember I was quite horrified when I read their prediction:</p>
<blockquote>
<p>The virus ‘will have infected hundreds’.</p>
</blockquote>
<p>So I immediately checked their model, which was effectively a capture-recapture model based on 3 positive cases confirmed outside China. You could estimate the total number of cases in Wuhan by dividing 3 by the probability that a patient leaves Wuhan on international travel, which can in turn be estimated by the total outbound traffic count divided by the city population. There were many simplifications, but it was fine to me.</p>
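<p>The arithmetic described above can be sketched in a few lines of Python. The three inputs other than the 3 exported cases are hypothetical placeholders, chosen only so that the answer lands near the 1,700 quoted above; they are not numbers from this post. The detection window is also an assumption: it is needed to turn a daily traffic count into a per-patient travel probability.</p>

```python
def estimate_total_cases(exported_cases, daily_outbound_travellers,
                         catchment_population, detection_window_days):
    """Scale up exported cases by the probability that a case travels abroad."""
    # probability that one person in the catchment area travels abroad within the window
    p_travel = daily_outbound_travellers * detection_window_days / catchment_population
    return exported_cases / p_travel

estimate = estimate_total_cases(
    exported_cases=3,                  # from the post
    daily_outbound_travellers=3_300,   # HYPOTHETICAL
    catchment_population=19_000_000,   # HYPOTHETICAL
    detection_window_days=10,          # HYPOTHETICAL
)
print(round(estimate))
```

<p>The estimate is linear in the number of exported cases, so with only 3 of them a single extra detected case would move the answer by a third. That is exactly why the inference felt fragile.</p>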
<p>I knew Jonathan Auerbach had cleverly used a negative binomial model to count the number of rats in NYC, which is mathematically equivalent to this covid model. So I emailed Jonathan:</p>
<blockquote>
<p>The inference based on 3 positive cases seems not convincing, yet should I buy more Lysol now?</p>
</blockquote>
<p>For a statistician, this could have been a reasonable question to ask, as the whole estimate was driven by three data points. Given that we deal with data on the scale of millions in our daily research, kind of making fun of it in a harmless email was the limit of a proportionate response to three data points.</p>
<p>Jon replied:</p>
<blockquote>
<p>You should blog about it! You can write a decision theory paper about how much lysol you should buy based on the data.</p>
</blockquote>
<p>Except I didn’t.</p>
<p>Except that four months later, the whole world and its 7 billion human beings have been fundamentally changed, not only by three batches of viruses, but also by our ignorance and indifference in the early stage, to which I probably contributed too.</p>
<p>Except I did (unusually) purchase puts on LQD on a monthly rolling basis starting from December.</p>
<p>Except that was for a completely different reason: I had heard someone talking about credit market risks under sky-high public spending.</p>
<p>Except in retrospect, I don’t know what we should really learn from this tragedy. Would I be more alert next time I hear the word coronavirus?</p>
<p>Sure, except such a response is overfitting.</p>
<p>Would I be alerted next time by some analysis using three data points?</p>
<p>I doubt it.</p>
<p>Under Bayesian updating, the posterior (for the future) would hardly change against overwhelming evidence drawn from everywhere else, even if those are data we have drawn deliberately to make life sound fully promising and hopeful.</p>
<p>As statisticians, we are trained to make inferences on any given dataset. But having collected all genes of all creatures in this universe, or monitored all satellite images of all shopping malls, does not eliminate the room for agnosticism.</p>
<h1>A very short introduction on the large deviation principle</h1>
<p>Yuling Yao, 2020-05-06. Original post: http://www.yulingyao.com/blog/2020/introLDP</p>
<p>I took a seminar class on the Large Deviation Principle (LDP) taught by Sumit. Here I summarize some results that I personally find most relevant (to what I am doing now). Most results are from the book Large Deviations Techniques and Applications (Dembo and Zeitouni, 2009).</p>
<h2 id="from-law-of-large-numbers-to-the-large-deviation-principle">From Law of Large Numbers To The Large Deviation Principle</h2>
<p>Given a family of probability measures $\{\mu_{\epsilon}\}$ on a space $(\mathcal{X}, \mathcal{B})$, in addition to a limiting statement (for example $\mu_{\epsilon}(\Gamma) \to 0$), we may also be interested in how quickly such convergence happens. The large deviation principle describes the limiting rate of such a sequence, where the rate is characterized by a lower-semicontinuous mapping $I$ from $\mathcal{X}$ to $[0, \infty]$, which we call a <em>rate function</em>.</p>
<p><strong>Definition</strong>: $\{\mu_{\epsilon}\}$ satisfies the large deviation principle with a rate function $I$ if, for all sets $\Gamma\in \mathcal{B}$ with interior $\Gamma^0$ and closure $\bar \Gamma$,</p>
<script type="math/tex; mode=display">-\inf_{x\in \Gamma^0} I(x) \leq \liminf_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (\Gamma) \leq \limsup_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (\Gamma) \leq -\inf_{x\in \bar \Gamma} I(x)</script>
<p>Consider a concrete example. If $S_n$ is the sample average of iid standard Gaussian random variables $X_1, \dots, X_n$, we know $\sqrt{n} S_n \sim N(0, 1)$. Indeed, as long as the CLT holds, we know $P(\vert S_n\vert \geq \delta) \approx P(\vert N(0,1)\vert \geq \delta\sqrt n )$, which goes to 0 for any $\delta>0$. For this toy case the approximation is an identity, and it leads to</p>
<script type="math/tex; mode=display">1/n \log P(\vert S_n\vert \geq \delta) \to -\delta^2/2.</script>
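<p>The Gaussian rate above can be checked numerically, since the tail probability is available in closed form through the complementary error function. The sketch below uses only the Python standard library; the choice $\delta = 0.5$ and the grid of $n$ values are arbitrary.</p>

```python
import math

def log_tail_rate(n, delta):
    """(1/n) * log P(|S_n| >= delta) for the mean of n iid N(0, 1) draws."""
    # sqrt(n) * S_n ~ N(0, 1), and P(|Z| >= t) = erfc(t / sqrt(2))
    tail = math.erfc(delta * math.sqrt(n) / math.sqrt(2.0))
    return math.log(tail) / n

delta = 0.5
for n in (10, 100, 1000, 4000):
    print(n, log_tail_rate(n, delta))
print("limit:", -delta**2 / 2)
```

<p>The normalized log probability approaches $-\delta^2/2$ from below. For much larger $n$ the tail probability underflows in double precision, so one would need to work on the log scale directly.</p>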
<p>In general this precise rate is way beyond what a CLT can describe. A motivating example I have in mind is <em>importance sampling</em>: we draw $x_i$ from a proposal distribution $q$, and we estimate $E_p h(x)$ by $S_n=1/n \sum_{i=1}^n h(x_i)r(x_i)$ with $r=p/q$, followed by self-normalization. We do know $S_n \to E_p h(x)$, but how fast? How can we characterize the probability that some large estimation error happens, $P(\vert S_n- E_p h(x) \vert \geq \delta)$? Indeed, even if $r$ has a finite second moment and the CLT holds, such a large deviation probability still depends on the distributions of both $r$ and $h$.</p>
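<p>To make the importance sampling question concrete, here is a small simulation sketch. The target $p = N(0,1)$, the proposal $q = N(0, 2^2)$, the function $h(x) = x^2$ (so that $E_p h = 1$), and the threshold $\delta = 0.05$ are all hypothetical choices for illustration. The code only estimates the deviation frequency by brute force; it does not compute a rate function.</p>

```python
import numpy as np

def snis_estimate(rng, n):
    """Self-normalized importance sampling estimate of E_p[x^2], p = N(0,1), q = N(0, 2^2)."""
    x = 2.0 * rng.standard_normal(n)            # draws from the proposal q
    log_r = -0.5 * x**2 + 0.5 * (x / 2.0) ** 2  # log p(x) - log q(x), up to a constant
    w = np.exp(log_r - log_r.max())             # constants cancel after self-normalization
    return np.sum(w * x**2) / np.sum(w)

def deviation_frequency(rng, n, delta, n_rep=200):
    """Monte Carlo frequency of |S_n - E_p h| >= delta, and the average estimate."""
    estimates = np.array([snis_estimate(rng, n) for _ in range(n_rep)])
    return np.mean(np.abs(estimates - 1.0) >= delta), estimates.mean()

rng = np.random.default_rng(1)
freq_small, _ = deviation_frequency(rng, n=200, delta=0.05)
freq_large, mean_large = deviation_frequency(rng, n=2000, delta=0.05)
print(freq_small, freq_large, mean_large)
```

<p>Brute-force simulation like this quickly becomes useless for the really rare deviations, which is where an explicit rate function would help.</p>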
<p>Another practical situation that I have recently considered is sequential design/active learning. For example, in a clinical trial we may adaptively sample until an interim decision boundary is reached (say some “p value” is “significant”). Aside from designing hypothesis tests, we can use $P(\vert S_n\vert \geq \delta)$ to compute the expected stopping time.</p>
<p>For the purpose of many proofs, we present an equivalent definition (equivalent when $\mathcal{B}$ contains the Borel sigma-field of $\mathcal{X}$):</p>
<p>$\{\mu_{\epsilon}\}$ satisfies the large deviation principle with a rate function $I(\cdot)$ if</p>
<ol>
<li>
<p>For all closed sets $F \subset \mathcal{X}$,
<script type="math/tex">\limsup_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (F) \leq -\inf_{x\in F} I(x).</script></p>
</li>
<li>
<p>For all open sets $G \subset \mathcal{X}$,</p>
</li>
</ol>
<script type="math/tex; mode=display">-\inf_{x\in G} I(x) \leq \liminf_{\epsilon\to 0} \epsilon \log \mu_{\epsilon} (G) .</script>
<h2 id="empirical-average-of-iid-samples--cramérs-theorem">Empirical average of IID samples: Cramér’s Theorem</h2>
<p>If we draw $X_1, \dots, X_n$ iid from a $d$-dimensional real-valued distribution $\mu$ and compute the empirical average $S_n=1/n \sum_{i=1}^n X_i$, of course we know $S_n\to E[X]$. The question is how quickly.</p>
<p><strong>Cramér’s Theorem</strong> states that the law of $S_n$, denoted by $\mu_n$, satisfies the LDP with a convex rate function $\Lambda^*(\cdot)$.</p>
<p>To define $\Lambda^*$, we first define the log moment generating function</p>
<script type="math/tex; mode=display">\Lambda (\lambda)=\log \operatorname {E} [\exp(\langle\lambda, X\rangle)].</script>
<p>where $\langle, \rangle$ is the inner product.</p>
<p>The desired rate function $\Lambda^*$ is its Fenchel-Legendre transform (the largest difference between the sum and the log-sum-exp):</p>
<script type="math/tex; mode=display">\Lambda^* (x)= \sup_{\lambda \in R^d} ( \langle \lambda, x\rangle - \Lambda (\lambda) )</script>
<p>In particular in 1-D, we have</p>
<script type="math/tex; mode=display">\lim_{n\ \to \infty} 1/n \log P(S_n \geq C) = -\inf_{x\geq C} \Lambda^*(x)</script>
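<p>As a sanity check of the Fenchel-Legendre transform, the sketch below computes $\Lambda^*$ numerically for a fair coin, $X \sim$ Bernoulli(1/2), an example chosen here for illustration. In that case the rate function is known in closed form: it is the KL divergence between Bernoulli($x$) and Bernoulli(1/2). The supremum is approximated by a plain grid search over $\lambda$, and the grid range and resolution are arbitrary.</p>

```python
import numpy as np

def log_mgf_bernoulli_half(lam):
    """Lambda(lam) = log E[exp(lam * X)] for X ~ Bernoulli(1/2)."""
    return np.logaddexp(0.0, lam) - np.log(2.0)

def legendre_transform(x, lam_grid):
    """Grid approximation of Lambda^*(x) = sup_lam (lam * x - Lambda(lam))."""
    return np.max(lam_grid * x - log_mgf_bernoulli_half(lam_grid))

def kl_to_half(x):
    """KL divergence between Bernoulli(x) and Bernoulli(1/2), for 0 < x < 1."""
    return x * np.log(2.0 * x) + (1.0 - x) * np.log(2.0 * (1.0 - x))

lam_grid = np.linspace(-20.0, 20.0, 400_001)
for x in (0.5, 0.6, 0.8):
    print(x, legendre_transform(x, lam_grid), kl_to_half(x))
```

<p>The two columns agree to many digits, and the rate is zero at $x = 1/2 = E[X]$, as it should be.</p>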
<p>Cramér’s Theorem can also be extended to weakly dependent sequences such as Markov chains, as well as to martingales.</p>
<p>For example, for real-valued iid $X_1, \dots, X_n$ and a function $Z_n= g_n (X_1, \dots, X_n)$ that satisfies
$\vert g_n(X_1, \dots, X_{k}, \dots, X_n) - g_n(X_1, \dots, X_{k}', \dots, X_n)\vert <1$ whenever a single coordinate $X_k$ is replaced by $X_k'$, a <strong>concentration inequality</strong> gives</p>
<script type="math/tex; mode=display">1/n \log P(1/n (Z_n - E Z_n) \geq C ) \leq - H(\frac{C+1}{2} \vert \frac{1}{2})</script>
<p>where $H(p \vert q)$ is the KL divergence between Bernoulli($p$) and Bernoulli($q$).</p>
<p>Finally we can extend the result beyond $R^d$:</p>
<h3 id="craméres-theorem-for-abstract-empirical-measure">Cramére’s Theorem for abstract empirical measure</h3>
<p>We assume $\mu_n$ is the law of $S_n= \frac{1}{n} \sum_{i=1}^n X_i$ on a locally convex, Hausdorff, topological real vector space $\mathcal{X}$, and that there exists a Polish space $\Xi \subset \mathcal{X}$ with $\mu(\Xi)=1$. Then $\mu_n$ satisfies the LDP in both $\Xi$ and $\mathcal{X}$ with rate function $\Lambda^*$.</p>
<h2 id="transformation-of-ldps">Transformation of LDPs</h2>
<h3 id="contraction-inequality-of-a-mapping">contraction inequality of a mapping</h3>
<p>Let $\mathcal{X}$ and $\mathcal{Y}$ be two Hausdorff topological spaces and $f: \mathcal{X} \to \mathcal{Y}$ a continuous map. If $\{\mu_\epsilon\}$ satisfies the LDP with rate function $I$, then $\{\mu_{\epsilon} \circ f^{-1}\}$ satisfies the LDP with rate function</p>
<script type="math/tex; mode=display">I'(y) = \inf\{I(x): y=f(x)\}.</script>
<h3 id="ldp-from-exponential-approximation">LDP from exponential approximation</h3>
<p>Assume two families of random variables $\{Z_\epsilon\}$ and $\{Z_\epsilon'\}$ with joint laws $P_{\epsilon}$ have marginal probability measures $\{\mu_\epsilon\}$ and $\{\mu_\epsilon'\}$ on a metric space $(\mathcal{Y}, d)$. These two families of probability measures are exponentially equivalent if, for every $\delta>0$,</p>
<script type="math/tex; mode=display">\limsup_{\epsilon \to 0} \epsilon \log P_{\epsilon}( \Gamma_{\delta})= -\infty</script>
<p>where the set $\Gamma_{\delta}= \{ (y, \tilde y ): d(y, \tilde y )> \delta \}$.</p>
<p>Then the same LDP holds for $\{\mu_\epsilon\}$ and $\{\mu_\epsilon'\}$.</p>
<p>In practice we often approximate a distribution by a sequence of simplified distributions.</p>
<h3 id="laplace-approximation-varadhans-integral">Laplace approximation: Varadhan’s Integral</h3>
<p>In the normal case of Cramér’s Theorem, $I(x)=x^2 / (2\sigma^2)$. Is $I(x)$ more relevant than the inverse variance in some generalized Laplace approximation, especially when the variance is not even defined?</p>
<p>First, suppose $\mu_{\epsilon}$ is a measure on $R$ and heuristically assume the LDP in the form $\epsilon \log \mu_{\epsilon} (X<x_0 ) \approx - I(x_0)$. Taking the derivative with respect to $x_0$, we have</p>
<script type="math/tex; mode=display">\frac {d\mu_{\epsilon}} {dx} \approx \exp(-\frac{I(x)}{\epsilon}).</script>
<p>For any $\phi(x)$, we run a Taylor expansion at $\bar x = \arg\max_x (\phi(x)- I(x))$, where the first derivative vanishes,
and we have, for some $\xi$ between $x$ and $\bar x$,</p>
<script type="math/tex; mode=display">\phi(x)- I(x) = \phi(\bar x)- I(\bar x) + \frac{(x-\bar x)^2}{2} \frac{d^2}{dx^2} (\phi( x)- I( x))\vert_{x=\xi}</script>
<p>Hence we compute the integral by</p>
<script type="math/tex; mode=display">\epsilon \log \int_R \exp (\phi(x)/\epsilon) d \mu_{\epsilon} \approx (\phi(\bar x)- I(\bar x))</script>
<p>Now, in a general space, suppose $\{\mu_{\epsilon}\}$ satisfies the LDP with rate function $I(\cdot)$ on the space $\mathcal{X}$, $Z_\epsilon$ has law $\mu_\epsilon$, and $\phi: \mathcal{X} \to R$ is any continuous function. Assume further either the tail condition</p>
<script type="math/tex; mode=display">\lim_{M\to \infty} \limsup_{\epsilon\to 0}\epsilon \log E [\exp(\frac{\phi(Z_\epsilon)}{\epsilon}) 1_{\{\phi(Z_\epsilon)\geq M \} } ] = -\infty</script>
<p>or that the following moment condition holds for some $\gamma>1$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\limsup_{\epsilon\to 0}\epsilon \log E [\exp(\frac{\gamma\phi(Z_\epsilon)}{\epsilon})] < \infty, %]]></script>
<p>then
<script type="math/tex">\lim_{\epsilon\to 0}\epsilon \log E [\exp(\frac{\phi(Z_\epsilon)}{\epsilon})] = \sup_{x \in \mathcal{X}} (\phi(x)- I(x))</script></p>
<p>Varadhan’s Integral can often be used to approximate the normalization constant.</p>
<p>Varadhan’s Integral generalizes the MGF to any non-linear function. We now consider the inverse problem:</p>
<p>Define $\Gamma_{f}= \lim_{\epsilon \to 0} \epsilon \log \int_{\mathcal{X}} \exp(f(x)/\epsilon) d\mu_{\epsilon} $</p>
<p><strong>Bryc inverse lemma:</strong> Suppose $\{\mu_{\epsilon}\}$ is exponentially tight and $\Gamma_{f}$ exists for all continuous and bounded $f \in C_b(\mathcal{X})$. Then $\{\mu_{\epsilon}\}$ satisfies the LDP with the good rate function (the largest difference between the sum and the log-sum-exp)</p>
<script type="math/tex; mode=display">I(x)= \sup_{f \in C_b(\mathcal{X})} (f(x)-\Gamma_{f} )</script>
<p>and dually</p>
<script type="math/tex; mode=display">\Gamma_{f} = \sup_{x \in \mathcal{X}} (f(x)- I(x) ).</script>
<p>We may restrict $ C_b(\mathcal{X})$ to only linear functionals if $\mathcal{X}$ is a topological vector space.</p>
<h2 id="sanovs-theorem-for-empirical-measures">Sanov’s Theorem for empirical measures</h2>
<p>The LLN for the empirical mean of IID samples motivates Cramér’s Theorem. Likewise, we know that the empirical measure usually converges to the actual distribution, and Sanov’s Theorem answers how quickly.</p>
<p>Consider iid random variables $Y_1, \dots, Y_n$ to be $\Sigma$-valued, where $\Sigma$ is a Polish space. $Y_i$ has probability measure $\mu \in M_1(\Sigma)$, where $M_1(\Sigma)$ is the space of all probability measures on $\Sigma$. We may estimate $\mu$ empirically by</p>
<script type="math/tex; mode=display">L_n = 1/n \sum_{i=1}^n\delta(y=Y_i)</script>
<p>$L_n$ is also viewed as an element of $M_1(\Sigma)$.</p>
<p>We equip $M_1(\Sigma)$ with the weak topology (consider the open sets generated by the open balls $\{\nu: \vert\int \phi d\nu - x \vert < \delta\}$ for all bounded continuous $\phi$). $M_1(\Sigma)$ is a Polish space when equipped with the Lévy metric.</p>
<p>By abstract Cramér’s Theorem in Polish space (where we replace $X_i \in R$ by $\delta(Y_i) \in M_1(\Sigma)$), we know $L_n$ has LDP in $M_1(\Sigma)$ with convex rate function</p>
<script type="math/tex; mode=display">\Gamma^*(\nu)= \sup_{\phi\in C_b(\Sigma)}\{ \langle \phi, \nu \rangle - \Gamma(\phi) \}</script>
<p>where $\Gamma(\phi)= \log E[\exp(\langle \phi, \delta(Y) \rangle )] = \log \int_\Sigma \exp(\phi) d\mu$</p>
<p>Such rate function is difficult to compute, but Sanov’s Theorem says</p>
<script type="math/tex; mode=display">\Gamma^*(\nu) = KL( \nu, \mu )= \int_\Sigma \log \frac{d \nu}{d \mu} d\nu</script>
<p>Loosely speaking, for a closed set $\Gamma\subset M_1(\Sigma)$ and large $n$,
<script type="math/tex">1/n \log P( L_n \in \Gamma) \approx -\inf_{\nu\in \Gamma} KL (\nu, \mu).</script></p>
<h2 id="sanovs-theorem-for-stationary-gaussian-processes">Sanov’s Theorem for Stationary Gaussian Processes</h2>
<p>Now the data is a stationary Gaussian process $\{X_k\}$ ($-\infty < k < \infty$). We define the probability space $\Omega = \prod_{j = -\infty}^{\infty} \mathbb R_j$, with elements $\omega = \{x_j\} \in \Omega$ and $\omega(j) = x_j$, and $P$ is the stationary Gaussian process probability measure on $\Omega$ induced by $\{X_k\}$. It has mean $\mathbb E [X_k] = 0$ and covariance $\mathbb E[X_0 X_j] = \rho_j$.</p>
<p>How quickly does the empirical measure converge? Indeed we can still find an LDP for it. The main result is from <a href="https://projecteuclid.org/euclid.cmp/1103941986">Donsker and Varadhan, 1986</a>.</p>
<p>Bochner’s Theorem says we can decompose the covariance function by frequency: $Cov(X_0, X_j)=\rho_j = \frac{1}{2\pi} \int_0^{2\pi} e^{i j \theta} f(\theta) d\theta$, where we call $f(\theta)$ the <em>spectral density</em>. It is continuous on $[0, 2\pi]$ with $f(0) = f(2\pi)$.</p>
<p>Let $T$ be the shift operator on $\Omega$, i.e., $(T\omega) (j) = x_{j+1}$. From $\omega$ we
construct $\omega^{(n)}$ by periodically repeating the coordinates $x_1, \dots, x_n$:</p>
<script type="math/tex; mode=display">\dots, x_1, \dots, x_n, x_1,\dots,x_n, x_1, \dots, x_n, \dots</script>
<p>This defines a map $\pi_n$ from $\Omega$ to $M_{s}$:</p>
<script type="math/tex; mode=display">\pi_n(\omega) := \frac{1}{n}(\delta_{\omega^{(n)}} + \delta_{T\omega^{(n)}} + \ldots + \delta_{T^{n-1}\omega^{(n)}}).</script>
<p>$M_{s}$ is the space of all stationary measures on $\Omega$.
$Q_n = P \pi_n^{-1}$ is the probability measure on $M_{s}$ induced by $\pi_n$:
$Q_n (A)= P(\omega: \pi_n(\omega)\in A). $</p>
<p>Then $Q_n$ satisfies the LDP with good rate function $H_f( R )$.
$H_f( R )$ is effectively the entropy of the stationary process $R$ with respect to the stationary Gaussian process $(X_{k})_{k=-\infty}^{\infty}$.</p>
<p>For $R \in M_{s}$ and $A \subset \mathbb R$, we let $R(A\vert \omega) = R(X_{0} \in A \vert X_{-1}, X_{-2}, \dots)$ be the regular conditional probability distribution of $X_{0}$ given the entire past. Denote by $r(y\vert \omega)$ the corresponding density. This gives the explicit form of the rate:
<script type="math/tex">H_{f}( R ) = \mathbb E^R \left\{ \int_{-\infty}^{\infty} r(y\vert \omega) \log r(y\vert \omega) dy\right\} + \frac{1}{2} \log 2\pi
+ \frac{1}{4\pi} \int_0^{2\pi} \frac{dG(\theta)}{f(\theta)}
+ \frac{1}{4\pi} \int_0^{2\pi} \log f(\theta) d\theta.</script></p>
<h3 id="sketch-the-proof">Sketch of the proof:</h3>
<p>By Fourier expansion, we write
$\sqrt{f(\theta)} = \sum_{n = -\infty}^{\infty} a_n e^{in\theta}.$
Let $(\xi_k)$ be a sequence of independent Gaussian random variables with mean 0 and variance 1. Then, by Parseval’s theorem, $(X_k)_{k=-\infty}^{\infty}$ defined by</p>
<script type="math/tex; mode=display">X_k = \sum_{n = -\infty}^{\infty} a_{n-k} \xi_n = \sum_{n = -\infty}^{\infty} a_n \xi_{n+k}</script>
<p>is a stationary Gaussian process with mean 0 and covariance $ E[X_{0} X_{j}] = \rho_{j} = \frac{1}{2\pi} \int_{0}^{2\pi} e^{ij \theta} f(\theta) d\theta .$</p>
<p>Let $b_j = a_j \left(1 - \frac{\vert j\vert}{N} \right)$ for $\vert j\vert < N$. For each positive integer $N$, define a new process $(X_k^N)_{k=-\infty}^{\infty}$ by
<script type="math/tex">% <![CDATA[
X_k^N = \sum_{\vert j\vert<N} b_{j} \xi_{j+k} = \sum_{\vert j\vert <N} a_j \left(1 - \frac{\vert j\vert }{N} \right) \xi_{j+k}, %]]></script></p>
<p>where $X_k^{N}$ is the Cesaro mean of the partial sums of $X_{k}$, i.e.,</p>
<script type="math/tex; mode=display">% <![CDATA[
X_{k}^{N} = \frac{1}{N} \sum_{i = 0}^{N-1} \sum_{j=-i}^{i} a_{j} \xi_{j+k} = \sum_{\vert j\vert <N} a_j \left(1 - \frac{\vert j\vert }{N} \right) \xi_{j+k}. %]]></script>
<p>We define $F: \Omega \to \Omega$ by</p>
<script type="math/tex; mode=display">(F(\omega))(j) = \sum_{k=-\infty}^{\infty} a_{k} x_{j+k}.</script>
<p>$F$ maps $(\xi_{k}) $ to $(X_{k})$.</p>
<p>We define $F_N: \Omega \to \Omega$ such that</p>
<script type="math/tex; mode=display">% <![CDATA[
(F_N(\omega)) (j) = \sum_{\vert k\vert < N} b_k x_{j+k}. %]]></script>
<p>The mapping $F_{N}$ induces a corresponding map $\tilde F_N: M_s \to M_s$.</p>
<p>Let $\mu$ be the measure on $\Omega$ induced by $(\xi_k)$. Define $Q_n$ on $M_{s}$ such that $Q_n(A) := \mu\{\omega: \pi_n \cdot F(\omega) \in A\}$.
Define $Q_n^N$ on $M_{s}$ such that $Q_n^N(A) := \mu\{\omega: \pi_n \cdot F_N(\omega) \in A\}$.
Define $\tilde Q_n^N$ on $M_{s}$ such that</p>
<script type="math/tex; mode=display">\tilde Q_n^N(A) := \mu\{\omega: \tilde F_N \cdot \pi_n (\omega) \in A\}.</script>
<p>Recall that $\pi_n(\omega) = \frac{1}{n}(\delta_{\omega^{(n)}} + \delta_{T\omega^{(n)}} + \ldots + \delta_{T^{n-1}\omega^{(n)}})$.
Let <script type="math/tex">\tilde F_{N} \cdot \pi_n(\omega):= \frac{1}{n}[\delta_{F_N(\omega^{(n)})} + \delta_{F_N(T\omega^{(n)})} + \ldots + \delta_{F_N(T^{n-1}\omega^{(n)})}]</script>
and <script type="math/tex">\pi_n \cdot F_{N} (\omega):= \frac{1}{n}[\delta_{(F_N(\omega))^{(n)}} + \delta_{T(F_N(\omega))^{(n)}} + \ldots + \delta_{T^{n-1}(F_N(\omega))^{(n)}}].</script></p>
<p>We apply Donsker Theorem and obtain LDP for $\tilde F_{N} \cdot \pi_n$:</p>
<p>The total variation gap $\vert \vert \tilde F_{N} \cdot \pi_n - \pi_n \cdot F_{N} \vert \vert_{TV} $ is $o(1)$, which further bounds the Lévy metric between them:</p>
<script type="math/tex; mode=display">d(\tilde F_{N} \cdot \pi_n , \pi_n F_{N} ) = o(1).</script>
<p>So they are <em>exponentially equivalent</em>.
This leads to the LDP for $Q_n^N$ using some triangle inequality.</p>
<p>Likewise, we claim that $Q_n^N$ is an <em>exponential approximation</em> of $Q_n$, using some triangle inequality and the contraction theorem, and therefore the LDP for $Q_n$ follows.</p>
<p>P.S. Gonzalo Mena reminded me of the connection between Sanov’s Theorem and exponential families, which I just learned from these <a href="https://www.tau.ac.il/~tsirel/Courses/LargeModerateDev/lect2.pdf">lecture notes</a> by Tsirelson.</p>
<h2 id="any-large-deviation-is-done-in-the-least-unlikely-of-all-the-unlikely-ways">“Any large deviation is done in the least unlikely of all the unlikely ways!”</h2>
<p>For any measure $\mu$ on $\mathcal{X}$ and a function $u$: $\mathcal{X} \to R$ we can define a tilted measure</p>
<script type="math/tex; mode=display">\mu_u (x) \propto \mu (x) \exp ( u(x) )</script>
<p>We can prove</p>
<script type="math/tex; mode=display">\mu_u = \arg\min_{\nu :\int u d\nu \geq \int u d \mu_u} KL(\nu, \mu)</script>
<p>Further, if $\mathcal{X}=\{1, \dots, d\}$, we endow $\mathcal{X}^n$ with the probability measure $\mu_n$, and count the frequencies $\eta_n(j) = 1/n \sum_{k=1}^n1_{\{x_k=j\}}$
for each realization of $X$.</p>
<p>Now conditioning on the event $E_n=\{\int u d\eta_n \geq c\} $, the random measure $\eta_n$ converges in probability to the tilted measure $\mu_{tu}$, where $t > 0$ is such that $\int u d\mu_{tu} = c$.</p>
<p>This is because</p>
<script type="math/tex; mode=display">\mu_{tu} = \arg\min_{\nu :\int u d\nu \geq c} KL(\nu, \mu)</script>
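<p>The tilting can be made concrete on a finite space. The sketch below takes a fair die as a hypothetical example ($\mu$ uniform on $\{1,\dots,6\}$, $u(x)=x$, $c=4.5$), solves for $t$ by bisection, and compares the KL divergence of the tilted measure with that of another, arbitrarily chosen measure that satisfies the same mean constraint. It illustrates the minimizing property on one competitor only and is not a proof.</p>

```python
import numpy as np

def tilted(mu, u, t):
    """Tilted measure mu_{tu}(x), proportional to mu(x) * exp(t * u(x))."""
    log_w = np.log(mu) + t * u
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return w / w.sum()

def kl(nu, mu):
    """KL(nu, mu) for strictly positive probability vectors."""
    return float(np.sum(nu * np.log(nu / mu)))

def solve_tilt(mu, u, c, lo=0.0, hi=50.0, n_iter=200):
    """Bisection for t >= 0 with sum(u * mu_{tu}) = c (the tilted mean increases in t)."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if tilted(mu, u, mid) @ u < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu = np.full(6, 1.0 / 6.0)   # HYPOTHETICAL example: a fair die
u = np.arange(1.0, 7.0)      # u(x) = x
c = 4.5                      # constraint: mean at least 4.5

t = solve_tilt(mu, u, c)
mu_tu = tilted(mu, u, t)
# a competitor with the same mean 4.5, obtained by a linear (not exponential) tilt
nu = mu + (u - 3.5) / 17.5
print(t, mu_tu @ u, kl(mu_tu, mu), kl(nu, mu))
```

<p>Both measures have mean 4.5, but the exponentially tilted one is closer to $\mu$ in KL: the least unlikely of the unlikely ways.</p>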
<p>Generally, consider the set</p>
<script type="math/tex; mode=display">E_c=\{\int u d\nu \geq c\} \subset M_1(\mathcal{X}),</script>
<p>for which we know</p>
<script type="math/tex; mode=display">1/n \log P(L_n \in E_c ) \approx -KL(\mu_{tu}, \mu)</script>
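<p>Therefore, writing $B_\delta(\mu_{tu})$ for a $\delta$-neighborhood of $\mu_{tu}$, <script type="math/tex">Q_n(B_\delta(\mu_{tu}) \vert E_c ) = 1- Q_n(E_c \setminus B_\delta(\mu_{tu}))/ Q_n(E_c)</script></p>
<p>and as long as $\mu_{tu}$ is the unique minimizer in $\arg\min_{\nu :\int u d\nu \geq c} KL(\nu, \mu)$, the numerator decays at a strictly faster exponential rate than the denominator, so we can conclude <script type="math/tex">Q_n (B_\delta(\mu_{tu}) \vert E_c ) \to 1,</script>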
where $Q_n (A) = P(L_n \in A) $ is the induced measure on $M_1(\mathcal{X})$.</p>
<h1>Sample sd of indirect effects in a multilevel mediation model</h1>
<p>Yuling Yao, 2020-05-03. Original post: http://www.yulingyao.com/blog/2020/mediation</p>
<p>M asked me a question which essentially looks like this: in a mediation model, $a$ and $b$ are the regression coefficients along the mediation path, and the final quantity of interest is therefore the product $ab$.
In a multilevel model, we let both $a_j$ and $b_j$ vary across groups $j$, which we could model in Stan as a multivariate normal</p>
<script type="math/tex; mode=display">(a_j,b_j)^T \sim MVN ((a_0, b_0)^T, \Sigma),</script>
<p>According to literature xxx, we could estimate the expectation of ab in a typical group by (the sample mean of)</p>
<script type="math/tex; mode=display">a_0 b_0 + \sigma_{12}^2,</script>
<p>for $\sigma_{12}^2$ the off-diagonal element of $\Sigma$, i.e., the covariance between $a_j$ and $b_j$. Does it make sense to summarize the uncertainty by the sample sd of the draws above?</p>
<p>The answer is no. The formula really comes from a point estimation context, with</p>
<script type="math/tex; mode=display">E[ab]= E[a]E[b] + Cov(a,b) = a_0 b_0 + \sigma_{12}^2</script>
<p>The law of total variance says</p>
<script type="math/tex; mode=display">Var[ab]= E Var [ab| a_0 b_0 + \sigma_{12}^2 ] + Var E[ab| a_0 b_0 + \sigma_{12}^2 ]</script>
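<p>The decomposition can be checked by simulation. In the sketch below everything numeric is hypothetical: the “posterior draws” of $(a_0, b_0)$ are simply independent normals, $\Sigma$ is held fixed across draws, and the conditioning is on the full hyperparameters $(a_0, b_0, \Sigma)$, for which the conditional mean is the same quantity $a_0 b_0 + \sigma_{12}^2$. The conditional variance uses the standard formula for the product of a bivariate normal pair.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n_draws = 200_000

# HYPOTHETICAL "posterior draws" of the population-level means; Sigma is held fixed
a0 = rng.normal(0.5, 0.1, size=n_draws)
b0 = rng.normal(0.4, 0.1, size=n_draws)
var_a, var_b, cov_ab = 0.04, 0.04, 0.01  # entries of Sigma; cov_ab is the off-diagonal

# one new group (a_j, b_j) per draw
noise = rng.multivariate_normal([0.0, 0.0], [[var_a, cov_ab], [cov_ab, var_b]], size=n_draws)
a_j, b_j = a0 + noise[:, 0], b0 + noise[:, 1]

cond_mean = a0 * b0 + cov_ab  # E[ab | a0, b0, Sigma]
# Var[ab | a0, b0, Sigma] for a bivariate normal pair
cond_var = a0**2 * var_b + b0**2 * var_a + 2 * a0 * b0 * cov_ab + var_a * var_b + cov_ab**2

total_var = np.var(a_j * b_j)   # variance of the group-level indirect effect
within = cond_var.mean()        # first term: E Var[ab | ...]
between = np.var(cond_mean)     # second term: Var E[ab | ...]
print(total_var, within + between, between)
```

<p>In this toy setup the second term is only a small fraction of the total variance of $a_j b_j$.</p>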
<p>The sample sd of the $a_0 b_0 + \sigma_{12}^2$ draws only accounts for the second term, and I don’t think it means anything on its own. The easiest way to solve the problem is to obtain posterior draws of the group-level indirect effects $a_jb_j$ directly.</p>
<h1>If you encounter a recession in your 20s, you might be less likely to win the White House</h1>
<p>Yuling Yao, 2020-04-17. Original post: http://www.yulingyao.com/blog/2020/potus</p>
<p>I was talking to BenB about how unlucky we (Ben and I) would be amid a looming economic crisis and financial shortage just as we were about to enter the job market.</p>
<p>To support my straw-man argument that “we are quite screwed up”, I found that Eisenhower was 39 in 1929, while JFK was 12. If COVID-19 has as serious an economic effect as the Great Depression, then my generation (currently in our late 20s) would be the one sandwiched between JFK and Eisenhower.</p>
<p>OK life is still fine without the goal of being POTUS, but I am kinda curious, what is the cohort effect that is induced by a recession in early life? And, more pedantically, if we have to encounter some recession from time to time, which age would be the worst to face it?</p>
<p>I ask this question because I will graduate in 2021, so basically I know I am likely to be the worst. But again, you occasionally need a Stan model to prove your straw-man argument.</p>
<p>I fit an <em>age-period-cohort</em> model very similar to the one in <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/cohort_voting_20191017.pdf">Yair Ghitza et al</a>. They used it to analyze the generation effect induced by presidential approval rates.
Many years ago Yu-Sung and I also utilized a similar model to fit <a href="https://asiapolmeth.princeton.edu/sites/default/files/polmeth/files/su_chinese_happiness_generation.pdf">the generation effect of the Chinese people’s happiness</a>; we even copied the title.</p>
<p>For this toy dataset, I collected all presidents’ birth years, and <a href="https://datasets.socialhistory.org/dataset.xhtml?persistentId=hdl:10622/8FCYOX">US inflation-adjusted GDP</a>. The early historical data is imputed from an estimate by Clio-Infra.</p>
<p><img src="/blog/images/2020/hist.png" alt="histogram" title="histogram" /></p>
<p>It is tempting to claim that the presidents’ birth years are not distributed evenly, until I realize it is actually hard to make a precise test for such a claim. At first I thought it should be quite rare to see three presidents born in the same year. But suppose I sample 45 draws with replacement from the years 1732 (birth year of Washington) to 1961 (Obama), with each year equally likely to be sampled. The chance of sampling a prespecified year three times is about $\binom{45}{3}/230^3 \approx 0.001$, but after a multiple-testing adjustment over the 230 candidate years, the p value is quite big. How about a one-sample Kolmogorov-Smirnov test:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dgof::ks.test(president_birth_year, ecdf(1732:1961))
D = 0.10918, p-value = 0.8401
</code></pre></div></div>
<p>So I guess it is actually distributed quite evenly: no deep state, no conspiracy theory, folks! (That said, the KS test has low power, and I remember Persi Diaconis has a recent math paper on hypothesis testing for discrete uniform variables. In this dataset you could probably find some better test statistic to reject the null, but that is not the purpose of this post anyway.)</p>
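<p>The birthday-style question from before the KS test (how surprising is it that some year has three presidents?) can also be answered by brute force. The sketch below is a Monte Carlo estimate under the same idealized null of 45 independent uniform draws from 230 years; the number of simulations and the seed are arbitrary.</p>

```python
import numpy as np

def prob_some_year_has_three(n_people=45, n_years=230, n_sims=20_000, seed=3):
    """Monte Carlo P(at least one year is drawn 3+ times) under uniform birth years."""
    rng = np.random.default_rng(seed)
    years = np.sort(rng.integers(0, n_years, size=(n_sims, n_people)), axis=1)
    # after sorting, a year appears 3+ times iff some entry equals the one two places later
    has_triple = np.any(years[:, 2:] == years[:, :-2], axis=1)
    return has_triple.mean()

p_triple = prob_some_year_has_three()
print(p_triple)
```

<p>Unlike the prespecified-year calculation, this accounts for the fact that any of the 230 years could have been the crowded one, so it plays the role of the multiple-testing adjustment.</p>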
<p>Return to the cohort effect. Let’s assume the GDP growth rate, denoted by $x_j$ in year $j$, is accumulated in one’s early life via an age-varying accumulation rate $\gamma_a$ at age $1\leq a \leq A$; I truncate at $A=35$ in an ad hoc way. The generation born in year $k$ will receive a total effect</p>
<script type="math/tex; mode=display">\alpha_k= \sum_{i=1}^A \gamma_i x_{k+i-1}.</script>
<p>We assume a Poisson observational model
with mean $\exp ( \alpha_k + \alpha_0 )$:</p>
<script type="math/tex; mode=display">y_k \sim \mathrm{Poisson} (\exp ( \alpha_k + \alpha_0 )).</script>
<p>Here $y_k$ is the count of presidents who were born in year $k$. We smooth the estimate with an AR prior $\gamma_i \sim \mathrm{normal}(\gamma_{i-1}, \tau), \tau \sim \mathrm{normal}(0,1)$, where $\tau$ is constrained to be positive.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data {
  int<lower=0> A;              // number of ages to accumulate
  int<lower=0> S;              // number of birth years (cohorts)
  array[S] int<lower=0> y;     // count of presidents born in each year (array syntax, Stan 2.26+)
  matrix[S, A] growth_at_age;  // row s: GDP growth rate at ages 1..A of cohort s
}
parameters {
  vector<lower=0>[A] gamma;    // age-varying accumulation rate
  real c;                      // intercept, alpha_0 in the text
  real<lower=0> tau;           // scale of the AR prior on gamma
}
transformed parameters {
  vector[S] agg_effect;        // aggregated cohort effect plus intercept
  agg_effect = growth_at_age * gamma + rep_vector(c, S);
}
model {
  // Poisson likelihood with log link, equivalent to poisson(exp(agg_effect))
  y ~ poisson_log(agg_effect);
  gamma[1] ~ normal(0, 0.5);
  gamma[A] ~ normal(0, 0.1);
  for (a in 2:A)
    gamma[a - 1] ~ normal(gamma[a], tau);
  tau ~ normal(0, 1);          // half-normal, since tau is constrained positive
}
</code></pre></div></div>
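<p>The Stan program takes a matrix <code>growth_at_age</code> as data, but the post does not show how it is built. Here is a minimal sketch, assuming the growth rates are stored as a plain list ordered by year and that row $k$ should hold $x_k, \dots, x_{k+A-1}$, as in the formula for $\alpha_k$. The helper names are hypothetical and the numbers are toy values, not the GDP series used here.</p>

```python
def growth_matrix(growth, n_ages):
    """Row k holds growth[k], ..., growth[k + n_ages - 1] (ages 1..A of cohort k)."""
    n_cohorts = len(growth) - n_ages + 1
    if n_cohorts < 1:
        raise ValueError("need at least n_ages years of growth data")
    return [list(growth[k:k + n_ages]) for k in range(n_cohorts)]


def cohort_effect(growth, gamma):
    """alpha_k = sum_i gamma_i * x_{k+i-1} for every cohort with complete data."""
    rows = growth_matrix(growth, len(gamma))
    return [sum(g * x for g, x in zip(gamma, row)) for row in rows]


toy_growth = [0.01, 0.03, -0.02, 0.04, 0.02]   # made-up yearly growth rates
toy_gamma = [0.5, 0.3, 0.2]                    # made-up accumulation rates, A = 3
print(growth_matrix(toy_growth, 3))
print(cohort_effect(toy_growth, toy_gamma))
```

<p>In this sketch, cohorts born in the last $A-1$ years of the series are dropped because their early-life window is incomplete; how the original analysis treats them is not stated.</p>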
<p><img src="/blog/images/2020/recession.png" alt="stan_fit" title="stan fit" /></p>
<p>Hmmm, I don’t want to overstate the result, as it is quite a weak effect anyway, plus this is inference based on 45 data points. But basically, from the graph, there is a positive relation between how likely people born in a certain year are to become president and the GDP growth rate in their early life. If I am allowed to exaggerate the finding, I will point out that there are two local maxima at ages 16 and 23 in the posterior mean of $\gamma$, around the middle school and college graduation ages, which is probably equivalent to my age now given the inflation. Not good for me, folks. At older ages (>30), the accumulation rate is smaller.</p>
<p>The right-hand side shows the accumulated cohort effect. In particular, people born around 1900 (in their late twenties in 1929) are less fortunate, and this is in line with the gap in observed POTUS birth years around 1900 in the previous histogram.</p>
<p>In the end I will offer a few counterarguments to reassure people of my generation who might be worried about the current situation and its long-term effect:</p>
<ol>
<li>45 is a super small sample size.</li>
<li>Nothing causal is analyzed here.</li>
<li>COVID is real, and more uncertainty is yet to come. History is never stationary.</li>
<li>Many model assumptions are horrible. Voting for president is not multinomial sampling, and there are many other factors that should be included.</li>
<li>Instead of a global comparison, the local comparison is more important. To have a blessed career, you are not competing with Lincoln and Washington; you only need a relative advantage over your neighboring generations. Maybe I should fit a spline with a basis function that has one peak and two bottoms.</li>
</ol>
<p>In short, as a generation, facing COVID-19 early in one’s career is hardly blessed. But it is like a “slightly increased risk” result from 23andMe gene screening: set aside the non-changeable (I mean this is the time we are praying for a <em>do</em> operator) generation effect, there is much else we can do.</p>Yuling YaoI was talking to BenB about how unlucky we (Ben and me) would be amid a looming economic crisis and financial shortage when we were about to enter the job market.Should I do laundry and get grocery delivery at the same day or separate days?2020-03-29T00:00:00+00:002020-03-29T00:00:00+00:00http://www.yulingyao.com/blog/2020/laundry<p>So I have stayed in my room for more than three weeks, and by staying in the room I mean strictly staying in the room without even once touching the knob of the outer door, at the risk of sounding creepy.</p>
<p>It seems I have only two exposures to outside risks: I need grocery delivery from Whole Foods Market, and I need to do laundry every week. Arguably both events carry extremely low risk, but at the risk of sounding even more creepy, if I do want to minimize the risk, should I do laundry and get grocery delivery on the same day or on separate days of the week?</p>
<p>There are obvious arguments for both options:</p>
<p>The infection chance, as a function of the virus amount in the aerosol or on any other surface, has to be (nearly) strictly convex. This is due to both the fact that (a) the probability lives in [0,1], and an inverse-logit type transformation carries linear functions to convex ones in the left part (unless I anticipate a more than 0.5 chance of infection in one laundry trip!), and (b) biologically the virus amount only becomes meaningful after reaching a certain threshold. As a consequence, if the potential virus amounts coming from laundry and grocery delivery are independent (and therefore linearly additive), separating them into two days strictly decreases the expected chance of actual infection by Jensen’s inequality:</p>
<script type="math/tex; mode=display">Pr(\mathrm{infection} | \mathrm{virus~amount~} A + \mathrm{virus~amount~} B ) > Pr(\mathrm{infection} | \mathrm{virus~amount~} A) + Pr(\mathrm{infection} |\mathrm{virus~amount~} B )</script>
<p>On the other hand, a concentrated exposure can be sensible too. For one thing, I can save some PPE (masks and gloves) by conducting both events at the same time. Further, there are some concavities here: for example, the amount of elevator usage is concave, as I can arrange to carry the delivery package right after I finish the laundry in the basement. In general the marginal risk elevation as a function of the number of exposures is likely decreasing. This is like airplane safety: if the fixed cost (airplane crash rate during take-off/landing; elevator use in laundry) dominates the variable cost, another Jensen’s inequality kicks in and
you do want to choose a direct flight rather than two connecting ones!
<script type="math/tex">% <![CDATA[
Pr(\mathrm{virus~exposure} | \mathrm{event~}A + \mathrm{event~}B ) < Pr(\mathrm{virus~exposure} | A) + Pr(\mathrm{virus~exposure} | B ) %]]></script></p>
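<p>The two opposing effects can be put into one toy calculation. This is a sketch under loud assumptions that are not in the post: the dose-response curve is a Hill function (zero at zero dose, convex for small doses), each outing adds a fixed "elevator" dose on top of its own dose, and infections on separate days are independent. All function names and numbers are hypothetical.</p>

```python
def infection_prob(dose, half_dose=1.0, hill=2.0):
    """Hypothetical dose-response: Hill curve, 0 at zero dose, convex for small doses."""
    d = dose ** hill
    return d / (half_dose ** hill + d)


def weekly_risk_separate(laundry, grocery, fixed):
    """Two outings on different days; each pays the fixed dose once."""
    p1 = infection_prob(laundry + fixed)
    p2 = infection_prob(grocery + fixed)
    return 1.0 - (1.0 - p1) * (1.0 - p2)


def weekly_risk_same_day(laundry, grocery, fixed):
    """One outing; the doses add up but the fixed dose is paid only once."""
    return infection_prob(laundry + grocery + fixed)


for fixed in (0.0, 0.3):
    same = weekly_risk_same_day(0.1, 0.1, fixed)
    apart = weekly_risk_separate(0.1, 0.1, fixed)
    better = "separate days" if apart < same else "same day"
    print(f"fixed dose {fixed}: same day {same:.4f}, separate {apart:.4f} -> {better}")
```

<p>With no fixed dose the convexity argument wins and separate days are safer; once the shared fixed dose is large enough, the same-day plan wins, even though the curve is still convex over the doses involved.</p>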
<p>I think in the airplane example, the risk is so low that the first factor is negligible (Jensen’s inequality can also be expressed through a second-order Taylor expansion, but a logit-type function is nearly linear at the very left end). In the COVID-19 case, however, both factors weigh in. So there is probably an optimal solution with a randomized decision each week.</p>Yuling YaoSo I have stayed in my room for more than three weeks, and by stay in the room I mean strictly staying in the room without even touching the knob of the outdoor once – at the risk of sounding creepy.If we only see the pledged delegates, who wins Iowa?2020-02-14T00:00:00+00:002020-02-14T00:00:00+00:00http://www.yulingyao.com/blog/2020/pledged_delegates<p>Let’s recall the eight-school example– the posterior distribution of the school effect is essentially indistinguishable from 0, which by and large says there is little school-specific effect.</p>
<p>In spite of that, a parent seeing this result would still have to pick one school to send their kids to the next morning. They could roll a die and pick one at random, given that they are so sure these schools are just the same and do not bother to even select. But it is also rational to stick to the one with the largest posterior mean or, more sophisticatedly, to model “picking which school” in a decision theory framework.</p>
<p>The ordering of the posterior means may not be the same as that of the empirical means, which suggests that the most intuitive choice, picking the school with the largest empirical mean, may not always be optimal.</p>
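<p>A small calculation shows how the ordering can flip. The sketch below uses the standard eight-schools estimates and standard errors, but conditions on fixed hyperparameters $\mu=8$ and $\tau=3$, which are illustrative values chosen for this example and not a fitted posterior. Given $\mu$ and $\tau$, each school’s posterior mean is $\mu + \frac{\tau^2}{\tau^2+\sigma_j^2}(y_j-\mu)$, so noisier schools are shrunk more.</p>

```python
# Eight-schools data: estimated effects and their standard errors (schools A..H)
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]
schools = "ABCDEFGH"


def conditional_posterior_means(y, sigma, mu, tau):
    """Posterior means of school effects given fixed (assumed) mu and tau."""
    means = []
    for y_j, s_j in zip(y, sigma):
        weight = tau**2 / (tau**2 + s_j**2)   # shrinkage weight on the data
        means.append(mu + weight * (y_j - mu))
    return means


post = conditional_posterior_means(y, sigma, mu=8.0, tau=3.0)  # mu, tau are assumptions
best_empirical = max(range(8), key=lambda j: y[j])
best_posterior = max(range(8), key=lambda j: post[j])
print("largest empirical mean:", schools[best_empirical])
print("largest posterior mean:", schools[best_posterior])
```

<p>Under these assumed hyperparameters the school with the largest raw estimate also has a large standard error, so it is shrunk below a school with a smaller but more precise estimate. A full analysis would average over the posterior of $\mu$ and $\tau$ rather than fix them.</p>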
<p>Practically, however, we have to make decisions anyway, even without a reasonable model. I am thinking about this question in the Iowa context. Suppose that in the year 2800 AD some archaeologist finds only half a page of the New York Times from 2020, which writes that Mayor Pete won the Iowa caucus by a solid one more delegate than Sanders. The archaeologist is trying to uncover the whole story, but the remaining details are missing from the unearthed paper. With the information at hand, if the archaeologist is forced to make an inference, he has to guess that Mayor Pete also won the popular vote.</p>
<p>Another example is the coronavirus. Suppose the same archaeologist reads about the daily updated number of new confirmed cases of coronavirus on the back of that unearthed half page of the New York Times, and he finds the number spiked to 15,152 on Feb 12, 2020, almost 5 times the number of the day before. We actually know it is because China changed its reporting rule and started to include “clinically diagnosed” cases in its figures, and that 13,332 of the new cases fall under that classification. But with a much smaller sigma field at hand, the archaeologist has to guess that Feb 12, 2020 AD was probably the worst single day, no matter what change-point-detection model he is willing to try.</p>
<p>All these examples are the consequence of a lack of modeling: how the state delegates are counted, how the confirmed cases are collected, etc. To be fair, that archaeologist would make the correct inference on average: I suppose the probability of winning the popular vote but losing the overall election can be calculated explicitly and should be a small number.</p>
<p>The point is, even in a Bayesian decision theory framework, the decision is often chosen through an optimization procedure: it is suggested not to make binary decisions, but if you eventually have to, you do make a binary decision. In many cases the uncertainty inherited from this last-step point optimization could be understated.</p>Yuling YaoLet’s recall the eight-school example– the posterior distribution of the school effect is essentially indistinguishable from 0, which by and large says there is little school-specific effect.Something I learned from the book “shortest way home”2020-01-06T00:00:00+00:002020-01-06T00:00:00+00:00http://www.yulingyao.com/blog/2020/pete<p>A new book by the interesting candidate, Mayor Pete Buttigieg.</p>
<h1 id="manipulating-millions-of-data">manipulating millions of data</h1>
<p>Buttigieg was in charge of data analysis at McKinsey for grocery prices. According to Pete, he acquired not only the capability of <em>computer programming</em>, which refers to being able to use Microsoft Excel and Microsoft Access, but also the ability to understand the <em>nature of data</em>, which is defined as</p>
<blockquote>
<p>By manipulating millions of data, I (Pete) could weave stories about possible future, and gather insights on which ideas were good or bad. I could simulate millions of shoppers going up and down the aisles of thousands of stores, and in my mind I pictured their habits shifting as a well-placed price cut subtly changed their perception of our clients as a better place to shop.</p>
</blockquote>
<p>First, in terms of analytical linguistics, the verb <em>manipulating</em> seems inappropriate. I think Pete actually means “analyzing”. After all, it is McKinsey in Chicago, not in Ukraine. I shouldn’t be picky, but wait, Pete, you graduated with first-class honours from the Oxford PPE program!</p>
<p>Second, it is a clear overstatement that any data analysis could even <em>weave stories about possible future</em> in this particular context. I guess he effectively fitted a factor model, but any macro-level prediction seems too noisy to be useful, and I did not find any appropriate treatment in his analysis, if he was making a causal assertion to begin with.
If some consulting company told you that they could grant you <em>insights on which ideas were good or bad</em>, chances are that they have both overfitted and misunderstood causal inference.</p>
<h1 id="sharpe-ratio-of-his-first-move">Sharpe ratio of his first move</h1>
<p>Buttigieg’s career started with challenging Richard Mourdock for state treasurer. It was a strategically intelligent move: Mourdock was a stubbornly extreme conservative and had recently failed in a lawsuit over the bankruptcy of Chrysler.</p>
<p>If the political market were efficient, such an arbitrage opportunity would have been picked up and filled. But it is not: there was no other Democratic candidate against Mourdock in his 2010 reelection. Hmm, the Sharpe ratio of such a campaign is arguably extremely high.</p>
<h1 id="when-to-declare">When to declare</h1>
<p>Buttigieg claims that:</p>
<blockquote>
<p>In American political culture, you are not supposed to admit you have any interest in running for office until the moment you declare. … (while in the UK) I would often meet students who made it clear they would stand for Parliament at the earliest opportunity.</p>
</blockquote>
<p>and</p>
<blockquote>
<p>A politician’s account of how he first came to run for office is supposed to begin with a ritual mention of having been urged to do so by others.</p>
</blockquote>
<p>These do not sound true to me. Maybe Washington was once urged by Jefferson to become the president, but I have very little reason to believe in such romantic arcadianism in this modern individualist America. We arguably knew either Joe or John Kennedy would one day be the president from day zero, much earlier than we knew anything about whether the Duke of Windsor would abdicate.</p>
<p>I am wondering whether, back in JFK’s time, folks had the same concern that the life trajectory of that charming young president seemed too well-calculated and calibrated, to the extent that it did not even sound sincere. If running MCMC and hitting the target region in only 20 steps with no rejection sounds like cheating, it is cheating.</p>
<p>Is this purely envy of early career achievement? Is it the persecutory delusion, developed in our subconscious, that an abnormally straight trajectory is associated with all conspiracy theories?</p>
<p>Even so, unfortunately, this could one day be a burden for Mayor Pete.</p>Yuling YaoA new book by the interesting candidate, Mayor Pete Buttigieg.A utopian kitsch coated with xenophobia and chauvinism2019-12-18T00:00:00+00:002019-12-18T00:00:00+00:00http://www.yulingyao.com/blog/2019/union<p>I received an <a href="https://twitter.com/GWCUAW/status/1207120122160328705">email</a> from the Columbia Student Union regarding some specious statement, so I wrote a quick reply as follows.</p>
<p>(I did some self-censorship and replaced certain words with **. I will not tell the reason why I did so.)</p>
<blockquote>
<p>To whom it may concern,</p>
<p>Re the letter you sent earlier.</p>
<p>With all due respect, many of the student workers, including me, might not completely agree with you on this letter. We are worried that a potentially biased and overconfident conclusion on behalf of the union would rather undermine the solidarity of our community. From a strategic perspective, it is not clear to me why it is beneficial, to either the union or the university workers it claims to aid, to divert the propaganda focus towards international affairs, which does not seem to be the particular expertise of UAW in the first place.</p>
<p>More ideologically, I do notice there is an irreconcilable contradiction for the union regarding international relationships. As an analogy, I am occasionally equally shocked by some apparently progressive figure like Senator Warren condemning US manufacturers for investing in foreign countries and thereby compromising working families in Detroit. From my standpoint such a tone is no less xenophobic than a physical wall. International workers must have been confused: it is so difficult to figure out when they belong to the innocent victims of the immigration policy and serve as a reminder of social injustice, and when they are suddenly guilty of threatening the vulnerable labor movement by volunteering to become the hostages of Apple or Uniqlo. That said, xenophobia is xenophobia, no matter whether it is coated with Caucasian supremacism or a seemingly progressive populism, or even when it is accompanied by a chauvinistic salute towards another group of straw men in Southeast Asia.</p>
<p>I could understand why the union chooses to condemn ** affairs, among others, at this particular moment. If it is the bill that essentially both the honorable AOC and Mr. Cruz can agree upon, then there is little reason why the union should not exploit the topic to appeal to folks on a bipartisan basis. However, if the truth might matter, narratives in ** are complicated by multiple confusions, at least from my understanding. What we have heard through various channels is a total mess that consists of a certain appeal to democracy as well as, if not much more, fundamentalist separatism, street violence and organized crime, conservative pro-colonialism, and rumors of anti-**-hate-speech from local residents, let alone the socioeconomic development pattern in East Asia. Without a thorough fact check, it is irresponsible to post an official statement by taking any of these arguments for granted. It is therefore hardly convincing that an emotional and ideological conclusion could be made appropriately in a few lines on your email list. Even if the letter originated from a most harmless utopian socialist enthusiasm, it might have, and indeed has, resulted in unnecessary tensions due to its tone of ignorant kitsch.</p>
<p>All that being said, I am less worried about the straw man argument on this particular issue; I am more concerned, viewing the issue in a broader context, about the overwhelming approval in the Senate and the House and indeed across all of mainstream society, including a not-even-relevant UAW. On the brighter side, it does manifest a united solidarity, to a level that even the founders of this country never achieved. On the other hand, populist loathing and pretentious prejudice are also looming around the corner. Even during the very era of McCarthyism, there were always dissenting opinions both in the establishment Capitol and around the media, which fortunately or unfortunately have totally disappeared or been suppressed in this AOC w/ Cruz w/ UAW case. Is that the curse of the simmering US-Sino rivalry that we have to face? Is that the resurgence of Sen. McCarthy, in a Dem version? Is it the prophecy of a new iron curtain that separates not only ideologies and beliefs but also what is supposed to be the truth?</p>
<p>Although I appreciate the public discussion it provokes, I still feel bad about receiving this email. I wish the Union could contribute to the community in a more constructive manner.</p>
<p>Sincerely Yours,</p>
</blockquote>Yuling YaoI received an email from the Columbia Student Union regarding some specious statement, so I wrote a quick reply as follows.The second coming of Cauchy2019-11-29T00:00:00+00:002019-11-29T00:00:00+00:00http://www.yulingyao.com/blog/2019/cauchy<h1 id="a-good-cauchy">A good Cauchy</h1>
<p>I am preparing a paper in which I use an <a href="https://www.yulingyao.com/blog/2019/stacking/">old example</a> (which I thought I was the first to come up with, a few years ago in an email communication with Andrew). I have posted it before. It starts with a Cauchy likelihood</p>
<script type="math/tex; mode=display">y\sim Cauchy(\theta,1)</script>
<p>and assuming half of the true data is generated from Cauchy(-10,1) and the remaining half from Cauchy(10,1), or in other words the true data-generating process (DG) possesses</p>
<script type="math/tex; mode=display">\theta \sim 1/2 (\delta(-10)+ \delta(10)),</script>
<p>the Bayesian posterior will be bimodal, while only one of the modes will predominate. But that is fine: if the DG is bimodal, how can the posterior not reflect the bimodality?</p>
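<p>The bimodality is easy to see on a grid. The following is a rough sketch, not the analysis from the paper: it assumes a flat prior on $\theta$, 50 simulated points per component, and a fixed random seed, and it simply evaluates the unnormalized log posterior around each of the two centers. The helper names are made up for this illustration.</p>

```python
import math
import random


def simulate_mixture(n_per_mode, seed=0):
    """Draw n_per_mode points each from Cauchy(-10, 1) and Cauchy(10, 1)."""
    rng = random.Random(seed)
    draws = []
    for center in (-10.0, 10.0):
        for _ in range(n_per_mode):
            # inverse-CDF sampling of a standard Cauchy
            draws.append(center + math.tan(math.pi * (rng.random() - 0.5)))
    return draws


def log_posterior(theta, y):
    """Unnormalized log posterior of theta under Cauchy(theta, 1) and a flat prior."""
    return -sum(math.log1p((yi - theta) ** 2) for yi in y)


def best_on(grid, y):
    """Grid point with the highest log posterior, and that value."""
    return max(((t, log_posterior(t, y)) for t in grid), key=lambda pair: pair[1])


y = simulate_mixture(50, seed=0)
left_grid = [-15 + 0.05 * i for i in range(201)]    # -15 to -5
right_grid = [5 + 0.05 * i for i in range(201)]     # 5 to 15
left_mode, left_lp = best_on(left_grid, y)
right_mode, right_lp = best_on(right_grid, y)
print(f"left mode near {left_mode:.2f}, right mode near {right_mode:.2f}")
print(f"log posterior at 0 minus best mode: {log_posterior(0.0, y) - max(left_lp, right_lp):.1f}")
print(f"log posterior gap between the two modes: {abs(left_lp - right_lp):.1f}")
```

<p>The last printed gap is on the log scale, so even a modest difference between the two modes translates into one of them carrying most of the posterior mass, which is the sense in which one mode predominates.</p>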
<p>And we have some other way to adjust it so as to “recover the true DG from the wrong model and wrong inference”, which however is irrelevant to today’s blog.</p>
<h1 id="a-bad-cauchy">A bad Cauchy</h1>
<p>What I found today is that an essentially similar toy example was introduced by Persi Diaconis and David Freedman in their famous <a href="https://projecteuclid.org/euclid.aos/1176349830">1986 paper</a> as a counterexample. Suppose we have a location parameter $\theta$ and we observe</p>
<script type="math/tex; mode=display">X= \theta+ \epsilon.</script>
<p>with a prior $\theta \sim N(0,1)$ and the error distribution $\epsilon\sim DP(MC)$, where DP is a Dirichlet process, $M$ is a scalar constant, and $C$ is the base measure: a standard Cauchy.</p>
<p>In this example, if the true DG of $\epsilon$ is two points:
<script type="math/tex">\epsilon = 1/2 (\delta(-a)+ \delta(a)).</script></p>
<p>then the Bayesian posterior of $\theta$ is asymptotically supported only on $\theta_0\pm \sqrt{a^2-1}$, which differs from the true value $\theta_0$.</p>
<p>Alternatively, if the base measure of the Dirichlet process is normal, the posterior of $\theta$ does converge to the true value $\theta_0$.</p>
<h1 id="example--or-counterexample">example or counterexample?</h1>
<p>The funny thing is that we are using this example in the opposite way: Diaconis and Freedman call the Cauchy one a counterexample because the inference fails to converge to the true value. When I construct this model, it is rather the normal behavior that is the pitfall: effectively we are using a one-component normal to approximate a two-component normal mixture, and how can anything be more wrong than a posterior density concentrated at the midpoint?</p>
<p>This is the inevitable <a href="http://www.stat.columbia.edu/~gelman/research/published/copss.pdf">pluralist’s dilemma</a>.</p>Yuling YaoA good Cauchy