Monte Carlo estimate of quantile

Posted by Yuling Yao on Nov 24, 2020.       Tag: computation  

This comes a lot in Monte Carlo computation: we are only given finite draws but we want to compute extreme quantiles.

Our go-to estimate is the sample quantile. Given posterior draws $\theta_1, \dots,\theta_S$, the $[S\alpha]$-th sample (ordered increasingly) $\theta_{[S\alpha]}$ is an asymptotically unbiased estimate of the $\alpha$ quantile of the distribution, which we shall call $\theta^*_\alpha$. Using some basic theory from order statistics, the variance of this term is (asymptotically)

\(\frac{\alpha (1-\alpha)}{f^2(\theta^*_\alpha)}+ o(S^{-1}).\) This can be derived from the CLT of empirical CDF. Inside this expression, the nominator is bounded from above, while the denominator can go very ugly if the tail probability $f(\theta^*_\alpha)$ is small. This approximation is more complicated is $\theta_i$ is drawn from a Markov chain, but this is the basic idea why tail quantile estimate is hard.

Can we improve sample quantiles?

The quantile estimate is as good as CDF estimate

To link the estimation error in CDF and in quantile, a useful approximation is Bahadur representation:

\[\theta_{[S\alpha]}= \theta^*_\alpha + \frac{\alpha- \hat F_S(\theta^*_\alpha)}{f(\theta^*_\alpha)} + O(S^{-3/4}\log\log S)\]

The remainder term has quicker convergence rate than the usual root S. Hence, to have a good quantile estimate is almost the same as to have a good CDF estimate. The empirical CDF is a Monte Carlo sum $\hat F(u)= 1/S \sum_{i=1}^S c(\theta_i)$, for $c(\theta_i)=I(\theta_{(i)}>u)$, an indicator function. The usual variance reduction technique applies here and we may use add a mean-zero control variate in the Monte Carlo sum. For example we can use parametric model to learn $\theta_{(i)}$, and subtract the deterministic part from the summand—in particular, if $\theta$ is discrete, and can first sum over the categories with high probability and only run Monte Carlo on the remaining categories. This step can also be viewed as Rao-Balckwellization.

control variate

Typically $\theta$ is continuous, yet we may still choose to use numerical methods to compute the bulk probability $\Pr(a<\theta<b)$, and run Monte Carlo on the tail intervals:

\[\hat F(u)= \begin{cases} 1/S\sum_{i=1}^S I(\theta_{(i)}>u), ~if ~ u<a; \\ \hat F (a)+ \Pr(a<\theta<u) , ~if ~ a<u<b;\\ \hat F (a)+ \Pr(a<\theta<b)+ (1-1/S\sum_{i=1}^S I(a< \theta_{(i)}< u)), ~if ~ u>b \\ \end{cases}\]

something like this. The point if we can reduce some variance by computing the bulk $\Pr(a<\theta<u)$ deterministically.

regularize the tail

We can even do better in the tail. For example, we can estimate the polynomial tail and use the its expected value instead.

A parametric model, however, might not necessarily extrapolate well in the extreme tail. Even in the lucky case where we can evaluate the pointwise marginal density exactly $p(\theta_i)$, and we could fit a parametric model $p(\theta_i)\approx f(\theta_i)$. No matter how flexible $f$ is in fitting the observed samples, it relies on certain tail shape assumptions. A cubic spline assumes a cubic tail. A gaussian process regression on $(x=\theta_i, y=p(\theta_i)$ does not model the tail directly, but the mean function of $x^\star$ is $k(x, x^\star) C$ where $C$ is the matrix not relevant to $x^\star$. Hence the whole expression models the gaussian tail ($\log pdf(x)= O(x^2)$) if $k(,)$ is squared-exponential tail—it is not trivial, why should the R.V. being Gaussian if I use GP to fit its pdf?

L estimate

We can use all order statistics instead of only one. That is, $1/S \sum_{i=1}^S \theta_{(i)}w_i$ where $w_i$ is optimized such that the variance of this summand is minimized, while the constraint $1/S \sum_{i=1}^S w_i \hat F^{-1}(\theta_{(i)})= \alpha$ would ensure that the final estimate is asymptotically unbiased.

This estimate can be viewed as a special case of the previous section with multiple control variate (all order statistics).

It seems we have many cute techniques and we seldom apply any of them in practice. A natural question is that what is the most general approach? Maybe the starting point is to think about the UMVUE of the population quantile given both samples and the marginal density.