A very short introduction on the large deviation principle
Posted by Yuling Yao on May 06, 2020. I took this seminar class on the Large Deviation Principle (LDP) by Sumit. Below I summarize some results that I personally find most relevant (to what I am doing now). Most results are from the book Large Deviations Techniques and Applications (Dembo and Zeitouni, 2009).
From Law of Large Numbers To The Large Deviation Principle
Given a family of probability measures $\{\mu_{\epsilon}\}$ on a space $(\mathcal{X}, \mathcal{B})$, instead of a limiting measure (for example $\mu_{\epsilon}(\Gamma) \to 0$), we may also be interested in how quickly such convergence happens. The large deviation principle describes the limiting rate of such a sequence, where the rate is characterized by a lower semicontinuous mapping $I$ from $\mathcal{X}$ to $[0, \infty]$, which we call a rate function.
Definition: $\mu_{\epsilon}$ satisfies the large deviation principle with a rate function $I$ if, for all sets $\Gamma\in \mathcal{B}$,

$$-\inf_{x \in \Gamma^o} I(x) \leq \liminf_{\epsilon \to 0} \epsilon \log \mu_{\epsilon}(\Gamma) \leq \limsup_{\epsilon \to 0} \epsilon \log \mu_{\epsilon}(\Gamma) \leq -\inf_{x \in \bar\Gamma} I(x).$$
Consider a concrete example: if $S_n$ is the sample average of iid standard Gaussian random variables $X_1, \dots, X_n$, we know $\sqrt{n}\, S_n \sim N(0, 1)$. Indeed, as long as the CLT holds, we know $P(\vert S_n\vert \geq \delta) \approx P(\vert N(0,1)\vert \geq \delta\sqrt n )$, which goes to 0 for any $\delta>0$. Moreover, for this toy case we can replace the approximation by an identity, and it leads to

$$\lim_{n\to\infty} \frac{1}{n} \log P(\vert S_n\vert \geq \delta) = -\frac{\delta^2}{2}.$$
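For this Gaussian toy case the rate can be checked numerically without any simulation, since the tail is exactly a complementary error function. A minimal sketch (the choice of $\delta$ and the grid of $n$ are arbitrary):

```python
import math

def log_tail(n, delta):
    # P(|S_n| >= delta) = P(|N(0,1)| >= delta * sqrt(n)) = erfc(delta * sqrt(n) / sqrt(2))
    return math.log(math.erfc(delta * math.sqrt(n) / math.sqrt(2)))

delta = 0.5
for n in [10, 100, 1000, 2000]:
    # -log P / n approaches the rate delta^2 / 2 = 0.125 from above as n grows
    print(n, -log_tail(n, delta) / n)
```

The polynomial prefactor of the Gaussian tail is what makes the convergence slow (order $\log n / n$), which is exactly the part the LDP ignores.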
In general, this precise rate is way beyond what a CLT can describe. A motivating example I have in mind is importance sampling: we draw $x_i$ from a proposal distribution $q$, and we can estimate $E_p h(x)$ by $S_n=1/n \sum_{i=1}^n h(x_i)r(x_i)$ with $r=p/q$, followed by self-normalization. We do know $S_n \to E_p h(x)$, but how fast? How can we characterize the probability that a large estimation error happens, $P(\vert S_n - E_p h(x) \vert \geq \delta)$? Indeed, even if $r$ has a finite second moment and the CLT holds, this large deviation probability still depends on the distributions of both $r$ and $h$.
Another practical situation that I have recently considered is sequential design/active learning. For example, in a clinical trial we may adaptively sample until an interim decision boundary is reached (say some “p value” becomes “significant”). Aside from designing the hypothesis test, we can use $P(\vert S_n\vert \geq \delta)$ to compute the expected stopping time.
For the purpose of many proofs, we present an equivalent definition (equivalent when $\mathcal{B}$ contains the Borel sigma field of $\mathcal{X}$):

$\mu_{\epsilon}$ satisfies the large deviation principle with a rate function $I(\cdot)$ if:

For all closed sets $F \subset \mathcal{X}$,

$$\limsup_{\epsilon \to 0} \epsilon \log \mu_{\epsilon}(F) \leq -\inf_{x \in F} I(x).$$

For all open sets $G \subset \mathcal{X}$,

$$\liminf_{\epsilon \to 0} \epsilon \log \mu_{\epsilon}(G) \geq -\inf_{x \in G} I(x).$$
Empirical average of IID samples: Cramér’s Theorem
If we draw $X_1, \dots, X_n$ iid from a $d$-dimensional real-valued distribution $\mu$ and compute the empirical average $S_n=1/n \sum_{i=1}^n X_i$, of course we know $S_n\to E[X]$. The question is, how quickly.

Cramér’s Theorem states that the law of $S_n$, denoted by $\mu_n$, satisfies the LDP with a convex rate function $\Lambda^*(\cdot)$.
To define $\Lambda^*$, we first define the log moment generating function

$$\Lambda(\lambda) = \log E\left[\exp(\langle \lambda, X_1 \rangle)\right],$$
where $\langle, \rangle$ is the inner product.
The desired rate function $\Lambda^*$ is its Fenchel–Legendre transform (the largest gap between the linear function and the log moment generating function):

$$\Lambda^*(x) = \sup_{\lambda \in R^d} \left\{ \langle \lambda, x \rangle - \Lambda(\lambda) \right\}.$$

In particular, in 1D we have

$$\Lambda^*(x) = \sup_{\lambda \in R} \left\{ \lambda x - \Lambda(\lambda) \right\}.$$
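As a sanity check, for a Bernoulli($p$) variable the transform can be computed by brute-force grid search and compared with the closed form $\Lambda^*(x) = x\log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p}$, the KL divergence between two Bernoullis. A sketch (the grid bounds and resolution are ad hoc):

```python
import math

p = 0.3  # Bernoulli success probability, an arbitrary choice

def Lambda(lam):
    # log moment generating function of Bernoulli(p)
    return math.log(1 - p + p * math.exp(lam))

def Lambda_star(x):
    # Fenchel-Legendre transform sup_lambda { lambda * x - Lambda(lambda) }, by grid search
    return max(lam / 100 * x - Lambda(lam / 100) for lam in range(-2000, 2001))

def kl_bernoulli(x, q):
    # closed form of the rate: KL between Bernoulli(x) and Bernoulli(q)
    return x * math.log(x / q) + (1 - x) * math.log((1 - x) / (1 - q))

for x in [0.1, 0.3, 0.5, 0.8]:
    print(x, Lambda_star(x), kl_bernoulli(x, p))
```

Note that $\Lambda^*(p) = 0$: the rate vanishes exactly at the LLN limit, as it should.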
Cramér’s Theorem can also be extended to weak dependence such as Markov chains, as well as to martingales.

For example, for real valued iid $X_1, \dots, X_n$ and a function $Z= g_n (X_1, \dots, X_n)$ that satisfies the bounded differences condition $\vert g_n(X_1, \dots, X_{k}, \dots, X_n) - g_n(X_1, \dots, X_{k}', \dots, X_n)\vert <1$, the concentration inequality gives

for $H$ the KL divergence between two Bernoullis.
Finally we can extend the result beyond $R^d$:
Cramér’s Theorem for abstract empirical measures

We assume $\mu_n$ is the law of $S_n= \frac{1}{n} \sum_{i=1}^n X_i$ on a locally convex, Hausdorff, topological real vector space $\mathcal{X}$, such that there exists a Polish space $\Xi \subset \mathcal{X}$ with $\mu(\Xi)=1$. Then $\mu_n$ satisfies the LDP in both $\Xi$ and $\mathcal{X}$ with rate function $\Lambda^*$.
Transformation of LDPs
Contraction principle of a mapping

Let $\mathcal{X}$ and $\mathcal{Y}$ be two Hausdorff topological spaces and $f: \mathcal{X} \to \mathcal{Y}$ a continuous map. If $\{\mu_\epsilon\}$ satisfies the LDP with rate function $I$, then $\{\mu_{\epsilon} \circ f^{-1}\}$ satisfies the LDP with rate function

$$I'(y) = \inf \left\{ I(x) : x \in \mathcal{X},\ f(x) = y \right\}.$$
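A crude numerical illustration: take the standard Gaussian rate $I(x)=x^2/2$ and $f(x)=x^2$, so the contracted rate should be $I'(y)=y/2$ for $y \geq 0$ and $+\infty$ for $y < 0$. A sketch (the grid range and matching tolerance are ad hoc):

```python
import math

def I(x):
    # rate function of the standard Gaussian sample mean
    return x * x / 2

def f(x):
    # the continuous map being pushed forward through
    return x * x

def contracted_rate(y, lo=-10.0, hi=10.0, n=200001, tol=1e-3):
    # numerically approximate inf { I(x) : f(x) = y } over a fine grid
    h = (hi - lo) / (n - 1)
    vals = [I(lo + i * h) for i in range(n) if abs(f(lo + i * h) - y) < tol]
    return min(vals) if vals else math.inf

print(contracted_rate(4.0))   # preimage is {-2, 2}, so the inf is I(2) = y/2 = 2.0
print(contracted_rate(-1.0))  # empty preimage: rate is +infinity
```

The empty-preimage case returning $+\infty$ is not an edge-case hack: it is exactly how the rate function encodes impossible events.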
LDP from exponential approximation
Assume two families of random variables $\{Z_\epsilon\}$ and $\{Z_\epsilon’\}$ with joint laws $P_{\epsilon}$ have marginal probability measures $\{\mu_\epsilon\}$ and $\{\mu_\epsilon’\}$ on a metric space $(\mathcal{Y}, d)$. These two families of probability measures are exponentially equivalent if, for every $\delta > 0$,

$$\limsup_{\epsilon \to 0} \epsilon \log P_{\epsilon}(\Gamma_{\delta}) = -\infty,$$

where the set $\Gamma_{\delta}= \{ (y, \tilde y ): d(y, \tilde y )> \delta \}$.

Then if one family satisfies the LDP, the same LDP holds for the other.

In practice we often approximate a distribution by a sequence of simplified distributions.
Laplace approximation: Varadhan’s Integral
In the normal case of Cramér’s Theorem, $I(x)=x^2 / 2\sigma^2$. Is $I(x)$ more relevant than the inverse variance in some generalized Laplace approximation, especially when the variance is not even defined?

First, suppose $\mu_{\epsilon}$ is on $R$, and assume the LDP in the form $\epsilon \log \mu_{\epsilon} (X<x_0 ) \to - I(x_0)$; taking the derivative in $x_0$ we have

For any $\phi(x)$, we run a Taylor expansion at $\bar x = \arg\max_x \{\phi(x) - I(x)\}$ and obtain

Hence we compute the integral by
Now, in a general space, suppose $\{\mu_{\epsilon}\}$ satisfies the LDP with rate function $I(\cdot)$ on a space $\mathcal{X}$, and assume $\phi: \mathcal{X} \to R$ is a continuous function. If, further, either the tail condition

$$\lim_{M\to\infty} \limsup_{\epsilon \to 0} \epsilon \log \int_{\{\phi \geq M\}} \exp(\phi(x)/\epsilon) d\mu_{\epsilon} = -\infty,$$

or the moment condition, for some $\gamma>1$,

$$\limsup_{\epsilon \to 0} \epsilon \log \int_{\mathcal{X}} \exp(\gamma \phi(x)/\epsilon) d\mu_{\epsilon} < \infty$$

holds, then

$$\lim_{\epsilon \to 0} \epsilon \log \int_{\mathcal{X}} \exp(\phi(x)/\epsilon) d\mu_{\epsilon} = \sup_{x \in \mathcal{X}} \{\phi(x) - I(x)\}.$$
Varadhan’s Integral can often be used to approximate the normalization constant.
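Varadhan's limit can be checked numerically. A sketch with $\mu_\epsilon = N(0,\epsilon)$ (so $I(x)=x^2/2$) and $\phi(x)=\cos(x)$, where $\sup_x\{\cos x - x^2/2\} = 1$; the quadrature grid is an ad hoc choice, and the sum is done in log space to avoid overflow:

```python
import math

def varadhan_lhs(eps, phi, lo=-10.0, hi=10.0, n=200001):
    # eps * log \int exp(phi(x)/eps) dmu_eps(x) with mu_eps = N(0, eps),
    # computed by a Riemann sum; log-sum-exp trick keeps exp() in range
    h = (hi - lo) / (n - 1)
    logs = [phi(lo + i * h) / eps - (lo + i * h) ** 2 / (2 * eps)
            - 0.5 * math.log(2 * math.pi * eps) for i in range(n)]
    m = max(logs)
    return eps * (m + math.log(sum(math.exp(l - m) for l in logs) * h))

for eps in [0.1, 0.01, 0.001]:
    # approaches sup_x { cos(x) - x^2/2 } = 1 as eps -> 0
    print(eps, varadhan_lhs(eps, math.cos))
```

The gap from 1 shrinks linearly in $\epsilon$; that $O(\epsilon)$ term is the log-prefactor of the Laplace approximation, which the LDP scaling discards.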
Varadhan’s Integral generalizes the MGF to nonlinear functions. We consider the inverse problem:

Define $\Gamma_{f}= \lim_{\epsilon \to 0} \epsilon \log \int_{\mathcal{X}} \exp(f(x)/\epsilon) d\mu_{\epsilon}.$
Bryc’s inverse lemma: Suppose $\mu_{\epsilon}$ is exponentially tight and $\Gamma_{f}$ exists for every continuous and bounded $f \in C_b(\mathcal{X})$. Then $\mu_{\epsilon}$ satisfies the LDP with the good rate function (the largest difference between the function and its log-sum-exp functional)

$$I(x) = \sup_{f \in C_b(\mathcal{X})} \{ f(x) - \Gamma_{f} \},$$

and dually

$$\Gamma_{f} = \sup_{x \in \mathcal{X}} \{ f(x) - I(x) \}.$$
We may restrict $ C_b(\mathcal{X})$ to only linear functionals if $\mathcal{X}$ is a topological vector space.
Sanov’s Theorem for empirical measures
The LLN for the empirical mean of IID samples motivates Cramér’s Theorem. Likewise, we know the empirical measure usually converges to the actual distribution, and Sanov’s Theorem answers how quickly.

Consider iid random variables $Y_1, \dots, Y_n$ taking values in $\Sigma$, where $\Sigma$ is a Polish space. $Y_i$ has probability measure $\mu \in M_1(\Sigma)$, where $M_1(\Sigma)$ is the space of all probability measures on $\Sigma$. We may estimate $\mu$ empirically by

$$L_n = \frac{1}{n} \sum_{i=1}^n \delta_{Y_i}.$$
$L_n$ is also viewed as elements in $M_1(\Sigma)$.
We equip $M_1(\Sigma)$ with the weak topology (consider open sets generated by open balls $\{\nu: \vert\int \phi d\nu - x \vert < \delta\}$ for all bounded continuous $\phi$). $M_1(\Sigma)$ is a Polish space when equipped with the Lévy metric.
By the abstract Cramér’s Theorem in a Polish space (where we replace $X_i \in R$ by $\delta(Y_i) \in M_1(\Sigma)$), we know $L_n$ satisfies the LDP in $M_1(\Sigma)$ with the convex rate function

$$\Lambda^*(\nu) = \sup_{\phi \in C_b(\Sigma)} \left\{ \int_\Sigma \phi \, d\nu - \Gamma(\phi) \right\},$$

where $\Gamma(\phi)= \log E[\exp(\langle \phi, \delta(Y) \rangle )] = \log \int_\Sigma \exp(\phi) d\mu.$
Such a rate function is difficult to compute, but Sanov’s Theorem says it coincides with the relative entropy:

$$\Lambda^*(\nu) = H(\nu \vert \mu) = \int_\Sigma \log \frac{d\nu}{d\mu} \, d\nu.$$

Loosely speaking, for a closed set $\Gamma\subset M_1(\Sigma)$,

$$P(L_n \in \Gamma) \approx \exp\left(-n \inf_{\nu \in \Gamma} H(\nu \vert \mu)\right).$$
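For a coin-flip example the decay rate can be checked exactly. A sketch: with $\mu = $ Bernoulli$(1/2)$ and $\Gamma = \{\nu : \int x\, d\nu \geq c\}$, the Sanov infimum is attained at Bernoulli$(c)$, and the binomial tail can be computed with exact big-integer arithmetic:

```python
import math

def kl(a, b):
    # KL divergence between Bernoulli(a) and Bernoulli(b)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def tail_prob(n, c):
    # exact P(empirical mean of n fair coin flips >= c), via big-integer binomials
    k0 = math.ceil(n * c)
    return sum(math.comb(n, k) for k in range(k0, n + 1)) / 2 ** n

n, c = 2000, 0.7
rate_hat = -math.log(tail_prob(n, c)) / n
# Sanov: inf over {nu : mean >= c} of KL(nu || mu) is kl(c, 1/2) here
print(rate_hat, kl(c, 0.5))
```

The empirical rate sits slightly above the Sanov rate: the Chernoff bound $P \leq e^{-n\,\mathrm{KL}}$ holds at every finite $n$, and the gap is the usual $O(\log n / n)$ prefactor.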
Sanov’s Theorem for Stationary Gaussian Processes
Now the data is a stationary Gaussian process $\{X_k\}$ ($-\infty < k < \infty$). We define the probability space $\Omega = \prod_{j = -\infty}^{\infty} \mathbb R_j$; $\omega = \{x_j\} \in \Omega$ with $\omega(j) = x_j$, and $P$ is the stationary Gaussian process probability measure on $\Omega$ induced by $\{X_k\}$. It has mean $\mathbb E [X_k] = 0$ and covariance $\mathbb E[X_0 X_j] = \rho_j$.

How quickly does the empirical measure converge? Indeed we can still find an LDP for it. The main result is from Donsker and Varadhan, 1986.
Bochner’s Theorem says we can decompose the covariance by frequency: $Cov(X_0, X_j)=\rho_j = \frac{1}{2\pi} \int_0^{2\pi} e^{i j \theta} f(\theta) d\theta$, where we call $f(\theta)$ the spectral density. It is continuous on $[0, 2\pi]$ with $f(0) = f(2\pi)$.
Let $T$ be the shift operator on $\Omega$, i.e., $T(\omega)(j) = x_{j+1}$. We construct from $\omega$ a periodized sequence $\omega^{(n)}$, which is defined to be

This defines a map $\pi_n$ from $\Omega$ to $M_{s}$:

$$\pi_n(\omega) = \frac{1}{n}\left(\delta_{\omega^{(n)}} + \delta_{T\omega^{(n)}} + \cdots + \delta_{T^{n-1}\omega^{(n)}}\right).$$

$M_{s}$ is the space of all stationary measures on $\Omega$, and $Q_n = P \circ \pi_n^{-1}$ is the probability measure on $M_{s}$ induced by $\pi_n$: $Q_n (A)= P(\omega: \pi_n(\omega)\in A).$
Then $Q_n$ satisfies the LDP with a good rate function $H_f( R )$. $H_f( R )$ is effectively the entropy of the stationary process $R$ with respect to the stationary Gaussian process $(X_{k})_{k=-\infty}^{\infty}$.

For $R \in M_{s}$ and $A \subset \mathbb R$, we let $R(A\vert \omega) = R(X_{0} \in A \vert X_{-1}, X_{-2}, \dots)$ be the regular conditional probability distribution of $X_{0}$ given the entire past, and denote by $r(y\vert \omega)$ the corresponding density. This gives the explicit form of the rate:
Sketch of the proof:
By Fourier expansion, we write $\sqrt{f(\theta)} = \sum_{n = -\infty}^{\infty} a_n e^{in\theta}.$ Let $(\xi_k)$ be a sequence of independent Gaussian random variables with mean 0 and variance 1. Then, by Parseval’s theorem, the process $(X_k)_{k=-\infty}^{\infty}$ defined by

$$X_k = \sum_{n = -\infty}^{\infty} a_n \xi_{k+n}$$
is a stationary Gaussian process with mean 0 and covariance $ E[X_{0} X_{j}] = \rho_{j} = \frac{1}{2\pi} \int_{0}^{2\pi} e^{ij \theta} f(\theta) d\theta .$
For each positive integer $N$, let $b_j = a_j \left(1 - \frac{\vert j\vert}{N} \right)$ for $\vert j\vert < N$, and define a new process $(X_k^N)_{k=-\infty}^{\infty}$ by

$$X_k^N = \sum_{\vert j \vert < N} b_j \xi_{k+j},$$

where $X_k^{N}$ is the Cesàro mean of the partial sums of $X_{k}$, i.e.,
We define $F: \Omega \to \Omega$ by
$F$ maps $(\xi_{k}) $ to $(X_{k})$.
We define $F_N: \Omega \to \Omega$ such that
The mapping $F_{N}$ induces a corresponding map $\tilde F_N: M_s \to M_s$.
Let $\mu$ be the measure on $\Omega$ induced by $(\xi_k)$. Define $Q_n$ on $M_{s}$ by $Q_n(A) := \mu\{\omega: \pi_n ( F(\omega)) \in A\}$, and define $Q_n^N$ on $M_{s}$ by $Q_n^N(A) := \mu\{\omega: \pi_n (F_N(\omega)) \in A\}$. Define $\tilde Q_n^N$ on $M_{s}$ such that
Recall that $\pi_n(\omega) = \frac{1}{n}(\delta_{\omega^{(n)}} + \delta_{T\omega^{(n)}} + \ldots + \delta_{T^{n-1}\omega^{(n)}})$.
We apply the Donsker–Varadhan theorem and obtain an LDP for $\tilde F_{N} \circ \pi_n$:

The total variation gap $\Vert \tilde F_{N} \circ \pi_n - \pi_n \circ F_{N} \Vert_{TV}$ is $o(1)$, which further bounds the Lévy metric between them.

So they are exponentially equivalent. This leads to the LDP for $Q_n^N$ using a triangle inequality.

Likewise, we claim that $Q_n^N$ is an exponential approximation of $Q_n$, using a triangle inequality and the contraction theorem, and therefore the LDP of $Q_n$ follows.
P.S. Gonzalo Mena reminds me of the connection between Sanov’s Theorem and exponential families, which I just learned from these lecture notes by Tsirelson.
“Any large deviation is done in the least unlikely of all the unlikely ways!”
For any measure $\mu$ on $\mathcal{X}$ and a function $u: \mathcal{X} \to R$, we can define a tilted measure

$$\frac{d\mu_{tu}}{d\mu}(x) = \frac{\exp(t u(x))}{\int_{\mathcal{X}} \exp(t u) \, d\mu}.$$
We can prove
Further, if $\mathcal{X}=\{1, \dots, d\}$, we endow $\mathcal{X}^n$ with the product probability measure $\mu_n$, and count the frequency $\eta_n(j) = 1/n \sum_{k=1}^n 1_{\{x_k=j\}}$ for each realization of $X$.

Now, conditioning on the event $E_n=\{\int u \, d\eta_n \geq c\}$, the random measure $\eta_n$ converges in probability to the tilted measure $\mu_{tu}$, where $t > 0$ is such that $\int u \, d\mu_{tu} = c$.
This is because
Generally, consider the set
for which we know
Therefore
as long as $\mu_{tu}$ is the unique minimizer of $\min_{\nu :\int u d\nu \geq c} KL(\nu, \mu)$, we can conclude, where $Q_n (A) = P(L_n \in A) $ is the induced measure on $M_1(\mathcal{X})$.
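A small sketch of the tilting computation (hypothetical numbers: $\mu$ uniform on a die $\{1,\dots,6\}$, $u(x)=x$, threshold $c=4.5$), solving for the tilt $t$ by bisection:

```python
import math

support = list(range(1, 7))   # a die: X in {1, ..., 6}
mu = [1 / 6] * 6              # mu uniform, an arbitrary choice
c = 4.5                       # conditioning threshold on the mean

def tilted(t):
    # tilted measure: mu_tu(x) proportional to exp(t * u(x)) * mu(x), with u(x) = x
    w = [m * math.exp(t * x) for x, m in zip(support, mu)]
    z = sum(w)
    return [wi / z for wi in w]

def mean(nu):
    return sum(x * p for x, p in zip(support, nu))

# bisection for the t > 0 with int u d(mu_tu) = c; mean(tilted(t)) is increasing in t
lo, hi = 0.0, 10.0
for _ in range(200):
    mid = (lo + hi) / 2
    if mean(tilted(mid)) < c:
        lo = mid
    else:
        hi = mid
t = (lo + hi) / 2
nu = tilted(t)
# KL(mu_tu || mu) = t*c - log Z(t) is then the Sanov rate of the event {mean >= c}
rate = sum(p * math.log(p / m) for p, m in zip(nu, mu))
print(t, mean(nu), rate)
```

The resulting $\mu_{tu}$ shifts mass toward larger faces of the die: conditioned on an unlikely large mean, every face adjusts "in the least unlikely way" rather than a few extreme draws carrying the excess.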