A very short introduction on the large deviation principle

Posted by Yuling Yao on May 06, 2020.       Tag: tutorial  

I took this seminar class on Large Deviation Principle (LDP) by Sumit. I summarize some following results that I personally think most relevant (to what I am doing now). Most results are from the book Large Deviations Techniques and Applications (Dembo and Zeitouni, 2009).

From Law of Large Numbers To The Large Deviation Principle

Given a probability measures ${\mu_{\epsilon}}$ on a space $(\mathcal{X}, \mathcal{B})$, instead of a limiting measure ( for example $\mu_{\epsilon}\ (\Gamma) \to 0$), we may also be interested in how quick such convergence happen. The Large deviation principle describes the limiting rate of such sequence, where the rate is characterized by a lower-semicontinuous mapping I from $\mathcal{X}$ to $[0, \infty]$, which we call a rate function.

Definition: $\mu_{\epsilon}$ satisfies the large deviation principle with a rate function I, if for all set $\Gamma\in \mathcal{B}$,

Consider a concrete example, if $S_n$ is the sample average of iid standard Gaussian random variables $X_1, \dots, X_n$, we known $S_n / \sqrt{n} = N(0, 1)$. Indeed as long as CLT holds, we know $P(\vert S_n\vert \geq \delta) \to 1- P(\vert N(0,1)\vert>\delta\sqrt n )$ which is 0 for any $\delta>0$. However, for this toy case, we can write replace the limit by identity and it leads to

In general this precise rate is way beyond what a CLT can describe. A motivating example I have in mind is importance sampling: We draw $x_i$ from a proposal distribution $q$, and we can estimate $E_p h(x)$ by $S_n=1/n \sum_{i=1}^n h(x_i)r(x_i).$ with $r=p/q$ followed by self-normalization. We do know $S_n \to E_p h(x)$, but how fast is it? How can we describe characterize some large estimation error happens: $P(\vert S_n- E_p h(x) \vert \geq \delta)$? Indeed, even if $r$ has finite second moment and CLT holds, such large deviation probability still depends on the distribution of both $r$ and $h$.

Another practical situation that I recently consider is sequential design/active learning. For example in clinical trial we may adaptively sample until a interim decision boundary is reached (say some “p value” is “significant”). Aside from design hypothesis testing, we shall use $P(\vert S_n\vert \geq \delta)$ to compute the expected stopping time.

For the purpose of many proofs, we present a equivalent (equivalent when $\mathcal{B}$ contains the Boreal sigma filed of $\mathcal{X}$) definition:

$\mu_{\epsilon}$ satisfies the large deviation principle with a rate function $I()$, if

  1. For all closed set $F \subset \mathcal{X}$ ,

  2. For all open set $G \subset \mathcal{X}$ ,

Empirical average of IID samples: Cramér’s Theorem

If we draw $X_1, \dots, X_n$ iid from the a $d$-dimensional real valued distribution $\mu$, we compute the empirical average $S_n=1/n \sum_{i=1}^n X_i$, of course we know $S_n\to E[X]$. The question is, how quick.

Cramére’s Theorem states that the law of $S_n$, denoted by $\mu_n$, satisfies LDP with a convex rate function $\Lambda^*(\cdot)$.

To define $\Lambda^*$, we first define the log moment generating function

where $\langle, \rangle$ is the inner product.

The desired rate function $\Lambda^*$ is its Fenchel-Legendre transform (the difference max between log sum exp and sum):

In particular in 1-D, we have

The Cramére’s Theorem can also be extended to weak-dependence such as Markov chains, as well as martingales.

For example, for real valued iid $X_1, \dots, X_n$ and a function $Z= g_n (X_1, \dots, X_n)$ that satisfy $\vert g_n(X_1, \dots, X_{k}, X_n) - g_n(X_1, \dots, X_{k’}, X_n)\vert <1$, then contraction inequality has

for H the KL between two Bernoullis.

Finally we can extend the result beyond $R^d$:

Cramére’s Theorem for abstract empirical measure

We assume $\mu_n$ is the law of $S_n= \frac{1}{n} \sum_{i=1}^n X_i$ on a locally convex, Hausdorff, topological real vector space $\mathcal{X}$ such that there exists a polish space $\Xi \subset \mathcal{X}$ such that $\mu(\Xi)=1$.Then $\mu_n$ has LDP in both $\Xi$ and $\mathcal{X}$ with rate function $\Lambda^*$.

Transformation of LDPs

contraction inequality of a mapping

Let $\mathcal{X}$ and $\mathcal{Y}$ be two Hausdorff topological space and $f: \mathcal{X} \to \mathcal{Y}$ a continuous map. If ${\mu_\epsilon}$ satisfy LDP with rate function $I$, then ${\mu_{\epsilon} f^{-1}}$ satisfies LDP with rate function

LDP from exponential approximation

Assuming two random variables ${Z_\epsilon}$ and ${Z_\epsilon’}$ with joint law $P_{\epsilon}$ have marginal probability measures ${\mu_\epsilon}$ and ${\mu_\epsilon’}$ on a metric space $(\mathcal{Y}, d)$. These two probability measure families are exponential equivalent if

where the set $\Gamma_{\delta}= { (y, \tilde y ): d(y, \tilde y )> \delta }$.

Then the same LDP holds for ${\mu_\epsilon}$ and ${\mu_\epsilon’}$.

In practice we often approximate a distribution by a series of simplified distribution.

Laplace approximation: Varadhan’s Integral

In the normal case of Cramére’s Theorem, $I(x)=x^2 / 2\sigma^2$. Does $I(x)$ more reverent than inverse variance in some generalized Laplace approximation, especially when the variance is not even defined?

First, $\mu_{\epsilon}$ is on R, and we assume LDP: $\epsilon \log \mu_{\epsilon} (X<x_0 )= - I(x_0)$, take derivative on $x_0$ we have

For any $\phi(x)$ We run Taylor expansion at $\bar x = \arg\max \phi(x)- I(x)$ and we have

Hence we compute the integral by

Now, in the general space, Suppose ${\mu_{\epsilon}}$ satisfies LDP with rate $I()$ on space $\mathcal{X}$, and assume $\phi: \mathcal{X} \to R$ is any continuous function. With further either the tail condition

or for some $\gamma>1$ holds

then

Varadhan’s Integral can often be used to approximate the normalization constant.

Varadhan’s Integral generalizes the MGF to any non-linear functions. We consider the invserse problem:

Define $\Gamma_{f}= \lim_{\epsilon \to 0} \log \int_x \exp(f(x)/\epsilon) d\mu_{\epsilon} $

Bryc inverse lemma: Suppose $\mu_{\epsilon}$ are exponentially tight tight and $\Gamma_{f}$ exists for all continuous and bounded $f \in C_b(\mathcal{X})$. Then $\mu_{\epsilon}$ has good rate function (largest difference between sum and log sum exp)

and dually

We may restrict $ C_b(\mathcal{X})$ to only linear functionals if $\mathcal{X}$ is a topological vector space.

Sanov’s Theorem for empirical measures

The LLN of the empirical mean of IID samples that motives the Cramére’s Theorem. Likewise, we know usually the empirical process converges the actual distribution. And Sanov’s Theorem answers how quick it is.

Consider iid random variables $Y_1, \dots, Y_n$ to be $\Sigma$-valued, where $\Sigma$ is a Polish space. $Y_i$ has probability measure $\mu \in M_1(\Sigma)$, where $M_1(\Sigma)$ is the space of all probability measures on $\Sigma$. We may estimate $\mu$ empirically by

$L_n$ is also viewed as elements in $M_1(\Sigma)$.

We equip $M(\sigma)$ with weak topology (consider open set generated by open balls ${\nu: \vert\int \phi d\nu - x \vert < \delta}$ for all bounded continuous $\phi$. ) $M_1(\sigma)$ is a Polish space equipped with levy metric.

By abstract Cramér’s Theorem in Polish space (where we replace $X_i \in R$ by $\delta(Y_i) \in M_1(\Sigma)$), we know $L_n$ has LDP in $M_1(\Sigma)$ with convex rate function

where $\Gamma(\phi)= \log E[\exp(\langle \phi, \delta(Y) \rangle )] = \log \int_\Sigma \exp(\phi) d\mu$

Such rate function is difficult to compute, but Sanov’s Theorem says

Loosing speaking, for a closed set $\Gamma\subset M_1(\Sigma)$

Sanov’s Theorem for Stationary Gaussian Processes

Now the data is a sequence of stationary Gaussian process: ${X_k}$ ($-\infty < k < \infty$). We define the probability space $\Omega = \prod_{j = -\infty}^{\infty} \mathbb R_j$. $\omega = {x_j} \in \Omega$ with $\omega(j) = x_j$ and $P$ is that stationary Gaussian process probability measure on $\Omega$ induced by ${X_k}$. It has mean $\mathbb E [X_k] = 0$ and covariance: $\mathbb E[X_0 X_j] = \rho_j$.

How quick doe the empirical measure converge? Indeed we can still find LDP for it. The main result is from Donsker and Varadhan, 1986.

Bochner’s Theorem says we can decompose the eigenfunction by frequency $Cov(X_0, X_j)=\rho_j = \frac{1}{2\pi} \int_0^{2\pi} e^{i j \theta} f(\theta) d\theta$, where we call $f(\theta)$ the spectral density. It is continuous on $[0, 2\pi]$ with $f(0) = f(2\pi)$.

Let $T$ be a shift operator on $\Omega$, i.e., $T(\omega) (j) = x_{j+1}$. We construct $\omega^{(n)}$ by $\omega$, which is defined to be

This define a map $\pi_n$ from $\Omega$ to $M_{s}$:

$M_{s}$ is the space of all stationary measure on $\Omega$. $Q_n = \pi_n P^{-1}$ is probability measure on $M_{s}$ induced by $\pi_n$: $Q_n (A)= P(\omega: \pi_n(\omega)\in A). $

Then $Q_n$ satisfies LPD with good rate function $H_f( R )$. $H_f( R )$ is effectively the entropy of the stationary process $R$ with respect to the stationary Gaussian process $(X_{k})_{k=\infty}^{\infty}$.

For $R \in M_{s}$ and $A \subset R$, we let $R(A\vert \omega) = R(X_{0} \in A \vert X_{-1}, X_{-2}, \dots)$ be the regular conditional probability distribution of $X_{0}$ given the entire past. Denote by $r(y\vert \omega)$ the corresponding density. This gives the explicti form of the rate:

Sketch the proof:

By Fourier expansion, we write $\sqrt{f(\theta)} = \sum_{n = -\infty}^{\infty} a_n e^{in\theta}.$ Let $(\xi_k)$ be a sequence of independent Gaussian random variables with mean 0 and variance 1. Then, by Parseval’s theorem, $(X_k)_{k=-\infty}^{\infty}$ defined by

is a stationary Gaussian process with mean 0 and covariance $ E[X_{0} X_{j}] = \rho_{j} = \frac{1}{2\pi} \int_{0}^{2\pi} e^{ij \theta} f(\theta) d\theta .$

Let $b_j = a_j \left(1 - \frac{\vert \vert j\vert}{N} \right)$ for $\vert j\vert < N$. For each positive integer $N$, define a new process $(X_k^N)_{k=-\infty}^{\infty}$ by \vert

where $X_k^{N}$ is the Cesaro mean of the partial sums of $X_{k}$, i.e.,

We define $F: \Omega \to \Omega$ by

$F$ maps $(\xi_{k}) $ to $(X_{k})$.

We define $F_N: \Omega \to \Omega$ such that

The mapping $F_{N}$ induces a corresponding map $\tilde F_N: M_s \to M_s$.

Let $\mu$ be the measure on $\Omega$ induced by $(\xi_k)$. Define $Q_n$ on $M_{s}$ such that $Q_n(A) := \mu{\omega: \pi_n \cdot F(\omega) \in A}$. Define $Q_n^N$ on $M_{s}$ such that $Q_n^N(A) := \mu{\omega: \pi_n \cdot F_N(\omega) \in A}$. Define $\tilde Q_n^N$ on $M_{s}$ such that

Recall that $\pi_n(\omega) = \frac{1}{n}(\delta_{\omega^{(n)}} + \delta_{T\omega^{(n)}} + \ldots + \delta_{T^{n-1}\omega^{(n)}})$. Let and

We apply Donsker Theorem and obtain LDP for $\tilde F_{N} \cdot \pi_n$:

We total variation gap of $\vert \vert \tilde F_{N} \cdot \pi_n - \pi_n F_{N} \vert \vert_{TV} $ is $o(1)$, which further bounds the levy metric between them

So they are exponentially equivalent. This leads to the LDP for $Q_n^N$ using some triangle inequality.

Likewise, we claim from $Q_n^N$ is exponentially approximation of $Q_n$ using some triangle inequality and contraction theorem, and therefore LPD of $Q_n$ applies.

   

P.S. Gonzalo Mena reminds me the connection of Sanov’s Theorem and exp-family, which I just learned from these lecture notes by Tsirelson.

“Any large deviation is done in the least unlikely of all the unlikely ways!”

For any measure $\mu$ on $\mathcal{X}$ and a function $u$: $\mathcal{X} \to R$ we can define a tilted measure

We can prove

Further, if $\mathcal{X}={1, \dots, d}$, we endow $\mathcal{X}^n$ with the probability measure $\mu_n$, and count frequency $\eta_n(j) = 1/n \sum_{k=1}^n1_{{x_k=j}}$ for each realization of $X$.

Now conditioning on event $E_n={\int u d\eta_n \geq c} $, the random measure $\eta_n$ converges in probability to the tilted measure $\mu_{tu}$ where t > 0 is such that $\int u d\mu_{tu} = c$.

This is because

Generally, consider the set

for which we know

Therefore

as long as $\mu_{tu}$ is a unique minimizer of $\arg\min_{\nu :\int u d\nu \geq c} KL(\nu, \mu)$, we can conclude where $Q_n (A) = P(L_n \in A) $ is the induced measure on $M_1(\mathcal{X})$.