Yuling Yao’s Blog

Bayesian prediction and the likelihood principle

2024-05-19T00:00:00+00:00

Bertrand and I have recently finished our review paper on Bayesian prediction. One of my favorite part there is the distinction between inferential and predictive Bayes.

A Bayesian procedure is such that we treat observed data as a realization of random variables, and we calculate the conditional distribution of unknown quantities given the observed data. In this sense, there is no difference between posterior inference $p(\theta \vert y)$ and posterior prediction $p(y_{n+1} \vert y)$ as both parameters $\theta$ and next unseen data point $y_{n+1}$ can be viewed as random variables.

Bayesian prediction makes statements on future/unseen data: $p(y_{n+1} \vert y_{1:n})$, however the innocent index notation $i$ has already masked one additional task in prediction: to define the replication. In frequentist calculation, the repeated sampling $y_i \vert \theta$ needs to be specified in any inference, while Bayesian \emph{inference} does not repeated sampling $y_i \vert \theta$: we can define a Bayesian inference when there is one single data point. Bayesian prediction, including its formulation, validation, and evaluation of Bayesian prediction, needs to specify the sampling procedure. As a bold statement, Bayesian prediction is beyond the likelihood principle. To explain, consider four examples:

Example 1: Binomial vs Negative binomial likelihood. The likelihood principle states that when there are two data generating experiments both involving $\theta$ and outcomes $y_1$ and $y_2$, if the outcome-experiment pair has the same likelihood $p_1(y_1 \vert \theta) \propto p_2(y_2 \vert \theta)$ (viewed as a function of $ \theta$), then these two experiments and two observations will provide the same inference information about $\theta$. As a classic example, when we observe $y=9$ positives from $n=10$ trials, it does not matter if the data is generated from $y ~ \sim \mathrm{Binomial}(\theta, n=10)$ or $n-y ~ \sim \mathrm{NegativeBinomial}(\theta, y=9)$ as they provide the same posterior inference of $\theta \vert y,n$. But these two sampling procedures define two distinct replications: either we can fix $n$ and predict a new $y$ or fix $y$ and predict a new $n$, and will impact the evaluation of the prediction. The marginal likelihood of two procedures are different and cannot be compared directly (as they involve different sampling space); the posterior predictive check will result in different tail probabilities, for example, given $\theta=0.7$, the tail probably of $n-y = 1$ in the Negative Binomial predictive distribution is .15, while the tail probably of $y = 9$ in the Binomial predictive distribution is .03.

Example 2: sequential prediction. Consider a sequence of observed data $y_{1}, \dots, y_n$ distributed from a process $y_1\sim \mbox{normal}(0, \sigma), ~~ y_{i+1} \sim \mbox{normal}( x_i, \sigma),~ i\geq 2$, where $x_i = \frac{1}{i} \sum_{j=1}^{i} y_j$ is the running mean and $\sigma$ is the only unknown parameter to infer. When making predictions for a new dataset given a parameter value $\theta$, there are two ways to define the replication process $(y_{1}^{\mathrm{rep}},y_{2}^{\mathrm{rep}}, \ldots, y_{n}^{\mathrm{rep}})$, (a) treat ${x_1, x_n}$ as fixed design matrix, for each $i$ sample $y_{i+1}^{\mathrm{rep}}$ from $\mbox{normal}(\bar x_i, \sigma)$ as in a usual regression model; (b) treat ${x_1, x_n}$ as functions of outcome $y$, for each $i$ compute $x_i = \frac{1}{i} {j=1}{i} y_j^{\mathrm{rep}}$ and then sample $y_{i+1}^{\mathrm{rep}}$ from $\mbox{normal}(\bar x_i, \sigma)$. These these two sampling processes entails the same likelihood given existing date, and thereby will not impact Bayesian inference of $\sigma \vert y$, but this sampling difference is relevant to Bayesian model check.

Example 3: two-by-two contingency table. It is a classic task to test independence on a two-by-two table where we observe counts in four cells spanned by a binary treatment and a binary outcome indicator. The frequentist test needs to specify which margin is fixed, for example the Fisher’s exact test is only exact when both margins are fixed. The Bayesian inference takes a shortcut as the posterior inference will not change by conditioning on a contrasts that has been satisfied by observed data. That is $p(\theta \vert Y= y^{obs} ) =p(\theta \vert T(Y)= T(y^{obs}))$, where $T(\cdot)$ is any function of the data to be constrained. Clearly the resulting sampling distribution of future replication will generally be different, $p(Y^{\mathrm{rep}} \vert y^{\mathrm {obs}}, T(Y^{\mathrm{rep}})= T(y^{\mathrm{obs}}) ) \neq p(Y^{\mathrm{rep}} \vert y^{\mathrm {obs}} )$. Once again, the sampling aspect is ignorable in Bayesian inference but relevant in Bayesian prediction and model check.

Example 4: data-dependent stopping time. Bayesian inference has a key advantage in handling data-dependent stopping: as long as the parameters of the data are a priori independent of the parameters of the data-dependent stopping rule, and if all data used to make stopping decisions are recorded for data analysis, the stopping rule is ignorable and we can carry a standard Bayesian inference. Suppose we collect data $y_{1}, \dots, y_n$ from a generating process $y_i \sim \mathrm{normal}(\mu, 1), p(\mu)\sim \mathrm{normal} (0,1)$. Whether the observed data is collected from IID sampling or from a sequential data collection $n= \min {n: \frac{1}{n}\sum_{i=1}^n y_i >2}$ is not relevant to Bayesian inference of $\theta \vert y$, but is matter to forming future replications $y^{\mathrm{rep}}$ and the examination of such replications, which is even more needed for data-dependent stopping for its sensitive on model misspecification (rosenbaum and rubin 1984).

In all these examples, and in applied Bayesian modeling at large, the definition of replication is never automated, but is often guided by the realistic prediction task. Even in a simple linear regression with exchangeable observations $D={y_{1}, \ldots, ,y_{n}}$, the sampling procedure requires to specify if the sample size is fixed or random and if the design matrix is fixed. And even after these aspect are decided, the definition of replication or prediction can still differ between (a) predicting the next unseen data $y_{n+1} \vert D$, and (b) replicating the whole dataset and predict $(y^{\mathrm{rep}}_1, \ldots, y^{\mathrm{rep}}_n) \vert D$ , and such sampling assumption does have different implications in practice. To name a few:

The model check, model evaluation and model averaging typically depends on how the replication is defined;
Even in Bayesian computation, a realm that is traditionally only “inferential”, can deepen on the sampling distribution: In BUGS and JAX, you need to define it in the program; In ABC and simulation-based inference, an appropriate choice of sampling distribution can in theory make your computation more efficient.

StanCon 2023

2023-06-22T00:00:00+00:00

Last month I was at WashU campus for the Bayesian-for-nuclear-physics workshop. Today I am in WashU again: This time it is for StanCon 2023. Here is some of the random takeaway I obtain from the conference:

John Kruschke discussed Barg: the Bayesian analysis reporting guidelines. It is a decision-theory-orientated framework of workflow, including guidance for data analysts, reviewers and method developers. For example, instead of looking at the confidence interval only, John proposes to look at the posterior coverage posterior probability of region of practical equivalence, an approach sharing a similar spirit to the traditional power calculation. Personally, I am not the largest fan of placing central importance on nested models in data analysis, but I sympathize with the ultimate fantasy to have a fully automated data analysis process.

Ben Goodrich presents his ongoing work on the numerical approximation of log likelihood. The goal is to evaluate the log density fewer times in sampling. Ben proposed a simple change of variable, instead of sampling theta, we sample x to be the inverse cdf (quantile) of theta in the prior, such that x has uniform [0,1] prior. Now instead of evaluating the log likelihood $p(x)$ directly, Ben would like to approximate the log likelihood function by some numerical approximation. To this end, Ben used Kolmogorov representation theorem. It seems there are still many gaps in practice, such as high dimensional quantile transformation and the smooth implementation of the Kolmogorov representation theorem. I guess an active learning approach that iteratively approximates the log density and sample therein would be interesting. I asked Ben why we need the Kolmogorov representation theorem in light of the seemingly more nueral network and normaling flow, Ben told me that it seems a nicer approach in noise-free functional approximation.

William Gillespie from Metrum presents their R package bbr.bayes. It facilitates Bayesian analysis for pharmacometrics using stan and nonmem, the famous PK analysis software. The overall workflow seems to resemble the usual bayesian workflow, and talk is quite a high level so I am curious what their secrete sauce is in actual PK modeling (for example, they use loo, but I could imagine there are easily non-iid/nested outcomes in PK modeling, which requires non-iid loo).

Edward Roualdes presents BridgeStan, an r and python wrapper that computes log density and its gradient, or the next-generation rstan. I have been following and using some parts of BridgeStan since last summer. In (my shallow) retrospect, the lack of the ability to easily return gradients of an arbitrary log density or even an arbitrary function, and the lack of exposure of stan sampler in high-level functions, had made stan miss the chance to monopolize the field of automatic differentiable programming language, while at the same time paved the way for its counterparts such as jax and tfp. BridgeStan is certainly overdue to the stan community and will be most welcome to methodology developers.

Jeff Soules introduces MCMC-Monitor: browser-based monitoring of stan. The kool part is that it can run a stan model on a server and allow users to monitor the samples on a different machine on the fly.

Arya Pourzanjani presents a clever way to model summary statistics. The observations are some tumor progressive data: the number of patients got 20% tumor size growth from published results. Arya used a clever stan modeling technique to infer the fine-grained tumor growth curve by converting the “20% tumor size growth” into linear-constrained parameter vectors. I asked Arya how is his method in relation to abc, and Arya told us abc would be more general and automated, while such smart hacking requires human engineering but is arguably more efficient. It reminds me of the difficulty of likelihood-free or simulation-based inference: likelihood-free is like the GPL-license, even if the majority part the model is likelihood-exact, as long as there is a tiny part of the observation is likelihood-free, then you need to run likelihood-free inference on all the data and model. How to incorporate likelihood-free and -exact inference jointly? I do not know.

Siddhartha Chib gives a talk on conditional moment models. The talk is based on his 2018 Jasa and 2021 Jrss-b paper. The basic idea is simple: how to do Bayesian linear regression $y=x\beta_1 +z\beta_2+ \epsilon$ in the existence of endogeneity—i.e., $\mathrm{E}(\epsilon x) \neq 0$, in which $z$ is the instrument variables such that $\mathrm{E}(\epsilon z) = 0$. In light the two cultures battle, Sib’s approach is certainly on the reduced-form-for-robustness side. The most reduced form in linear regression is probably to only set the first-moment condition: such that $\mathrm E \epsilon (x, z, 1) = 0$ means linear regression without endogeneity while $\mathrm E \epsilon (x, z, 1) = (b, 0)$ is a relaxed model allowing endogeneity. To make Bayesian inference without distributional assumptions, Sib further uses empirical likelihood to make Bayesian inference $\beta \mid y$—the idea is that given any $\beta$, you obtain its profile liklihood from an optimization procedure: $\max_w \prod w_i, s.t., \sum w_i \epsilon_i (x_i, z_i, 1) = 0$. Multiple this $\max_w \prod w_i \mid \beta$ with the prior of $\beta$, you obtain the posterior desnity evaluted at $\beta$. To test between the reduced model and the encompassing model using marginal liklihood, you obtain a valid endogeneity test. Set aside whether I may practically use it in linear regression, i am impressed by the mathematical beauty of the empirical likelihood method.

Will Landau talks about the file management in statistical modeling. He designed a cool pipeline tool targets in R that range the code and data file into a graph that reflects the dependence of the code. The graph representation enables automatic parallelization, file storage, and version control.

Wade Brorsen discusses optimal experimental design in agriculture. Experimental design is one of my favorite topic. The goal is to determine the optimal nitrogen amount in soil for crop yield. Arya reminds me that the model is similar to how pharmacometricians decides the optimal drug dosage in PKPD modeling.

Cameron Pfiffer introduces the model that describes how likely a reader will subscribed to the news site after paywall. To this end he models the probability that a user reads a news article, and the chance that of a subscription after paywall. It

Collin Cademartori presents “Two Challenges for Bayesian Model Expansion”—i think one of his main claim is that model expansion reduced identifiability, but it appears that his identifiability defincaiton is the mutual information between data and parameter. I do not know. I probably won’t be worried if the the correlation between data and parameter reduces from 0.9 to 0.8, and I probably won’t call that poor identifiability.

Nathaniel Haines discusses the loss rate modeling in insurance industry, which they are using in an investment firm to help investor to trade investment-grade securities. To incorporate various risk models, they have been using hierarchical staking in their pipeline, cool!

Finally, Bob wrapped the conference by pointing out a few ongoing promising directions in stan implementation, including generalized HMC, massive parallelization, and normalization flow VI.

Bayes is guaranteed to overfit, for any model, any prior, and every data point

2023-05-26T00:00:00+00:00

The myth

A few years ago I saw a StackOverflow question (but I cannot find it now): allegedly Andrew Gelman blogged that Bayesian models would not overfit and would only underfit (I also cannot find Andrew’s blog post now but I think what he meant by “Bayesian models only underfits” is that you can never fully model the true data process in the population).

It reminded me that I received several emails asking me why we need cross-validation for Bayesian models: since Bayesian inference is not set to be an empirical risk minimization or an M-estimate, there is no immediate reason that why the empirical training risk/loss needs to be under-estimate the test data risk. Such argument seems popular in the Bayesian world, for example, a quick google search of “Bayes does not overfit” shows this lecture notes from Edinburgh:

Fully Bayesian procedures can’t suffer from “overfitting”, because parameters aren’t fitted: Bayesian statistics only involves integrating or summing over uncertain parameters, not maximizing. The predictions can depend heavily on the model and choice of prior however.

The truth

I have a different view. Bayesian model does overfit.

Moreover, Bayes is guaranteed to overfit, regardless of the model (being correct or wrong) or the prior ( “strong” or uninformative).

Moreover, Bayes is guaranteed to overfit on every realization of training data, not just in expectation.

Moreover, Bayes is guaranteed to overfit on every single point of the training data, not just in the summation.

The guarantee

To see this¹, let’s work with a general model setting with exchangeable observations $y_1, \dots, y_n$: the data model is any $p(y \vert \theta)$ and you could have any prior $p(\theta)$. Wherever the model is correct or wrong is not relevant to our discussion here. The posterior predictive distribution of any future unseen data is $p(\cdot \vert y) = \int p(\cdot \vert \theta ) p(\theta \vert y) d\theta$. We will evaluate the prediction by its expected log score. The in-sample log score is

\[\sum_{i=1}^n \log p(y_i \vert y), ~~~ p(y_i \vert y)= \int p(y_i \vert \theta) p(\theta \vert y)d\theta.\]

It is a sum of $n$ terms. Each term, $p(y_i \vert y)$ is a (weighted) arithmetic mean of $p(y_i \vert \theta)$. You could imagine if you have access to $S$ Monte Carlo draws from $p(\theta \vert y)$, then this in sample individual predictive density is the arithmetic mean of $p(y_i \vert \theta_j)$, $j=1, … S$.

To evaluate the out-of-sample prediction, we can leave this $i$-th data out. The leave-one-out log predictive density is $\sum \log p(y_i \vert y_{-i})$, where the $i$-th term predictive density is

\[p(y_i \vert y_{-i})= \int p(y_i \vert \theta) p(\theta \vert y) \frac{p(y_i \vert \theta)^{-1}} {\int p(y_i \vert \theta^\prime p(\theta^{\prime} \vert y) )^{-1} d\theta^\prime} d\theta = \frac{1} {\int p(\theta \vert y) p(y_i \vert \theta)^{-1} d\theta},\]

that is the weighted harmonic mean of $p(y_i \vert \theta)$. Again, when you have $S$ Monte Carlo draws from $p(\theta \vert y)$, then the out-of-sample individual predictive density is precisely the harmonic mean of $p(y_i \vert \theta_j)$, $j=1, … S$.

Unless in degenerating cases (the posterior density is point mass), then the harmonic mean inequality guarantees a strict inequality $p(y_i \vert y_{-i}) < p(y_i \vert y)$, for any point $i$ and any model.

What is the most overfitting inference?

Bayes always overfit, how about other inferences?

From the harmonic mean inequality, you can think about an abstract math problem: the pointwise in-sample-score is the mean of $p(y_i \vert \theta)$, while the leave-one-out version is the inner product of $p(y_i \vert \theta)$ and $\frac{1/p(y_i \vert \theta)}{ \int 1/p(y_i \vert \theta) d\theta}$. The second term is a $\theta$-density that is proportional to $\frac{1}{p(y_i \vert \theta)}$.

The Bayesian update can be quantified by its sequential update: what you do when you see a data point $y_i$. Bayes multiplies the posterior by a factor ${p(y_i \vert \theta)}$, but you also need to normalize it cuz the density needs to have expectation 1. But there are other updates, for example, if you choose to update the posterior by ${p(y_i \vert \theta)}^{\alpha}$, such that your posterior is proportional to $p(y\vert \theta)^\alpha p(\theta)$, then the out-of-sample predictive density will be the inner product of $p(y_i \vert \theta)$ and a number proportional to $\frac{1}{p(y_i \vert \theta)^\alpha}$.

From Jensen’s inequality, it is easy to see that the generalization gap is strictly monotonic on $\alpha$. When $\alpha=0$, the inference always returns the prior and you have no overfitting. When $\alpha=\infty$, the posterior always returns the MLE, and you get the maximum amount of overfitting.

The math we need is elementary: suppose you have a sequence of (fixed) positive numbers: $a_1, \dots, a_J.$ There is another sequence of variable positive numbers: $b_1, \dots, b_J,$ and $\sum_{j=1}^J b_j=1$. We compute their inner product $\sum_{j=1}^J a_j b_j$. When $b_j=1/J$, you get the arithmetic mean, corresponding to the in-sample predictive density; when $b_j \propto 1/a_j$, you get the harmonic mean, corresponding to the out-of-sample Bayes predictive density; When $b_j = \mathbf{1} (a_j= \min(a_1, \dots, a_j))$, you get the $\min(a_j)$, which is the minimum you can obtain from this summation, corresponding to the out-of-sample MLE.

The bottom line:

Bayes is guaranteed to overfit any training data.
MLE is guaranteed to overfit more than Bayes.
Both of these two claims hold for any model and any prior. The claims hold point-wise, not just on average.
These strict inequalities come from the convexity of $f(x)=1/x, ~x>0$.

P.S.

This post is discussed a lot on Twitter. I thank all the readers.

To clarify, I think word overfit could mean two things:

Test error is always larger than training error, or/and,
A complicated model gives worse predictions than a simpler model in testing.

In this blog post, I only refer to (1), not (2). Perhaps a more accurate saying should be that Bayesian posterior prediction always has a positive generalization gap for any single point. In participial, I do not discuss whether a Bayesian prediction necessarily have a better or worse test prediction error than MLE.

PPS. When you do have the correct model, the Bayesian prediction is optimal in the sense of prior replication, see our review paper on Bayes prediction. But even then, the in-sample prediction error will be underestimated when you have finite sample size. Here, the underestimation is an issue of the model evaluation, not of the model design.

PPPS. Don’t get me wrong: I do use, like, and advocate Bayesian approaches. But when there is a distinction between treat-Bayes-rule-as-an-always-correct-blackbox and an open-minded view of use-bayes-and-bayesian-decision-theory-as-building-blocks-toward-model-improvement, I would prefer the latter view.

I am not aware of whether this justification has appeared before. If you would like to cite this “proof”, you can cite my recent review paper on Bayesian prediction where I add this simple math as a side note. ↩

How much saving do you need in the end? zero?

2022-12-30T00:00:00+00:00

I came across this book called “Die with zero”. Along with many other yolo ideas, the book prompts the attitude that one must maximize net fulfillment over net worth to the extent of “DIE WITH ZERO”. Indeed, as the book pointed out on the cover

In 1957, Nobel Prize Winner, Franco Modigliani, developed the Life-cycle Hypothesis showing the most optimal way of utilizing your wealth is to end with zero.

I don’t know. I would be moderately shocked if there was indeed an academic effort to show this circular-argument-type result. Seems very trivial.

At the risk of a pedantic tone, I think one common flaw is that people ignore uncertainty. I will show that even in this seemingly trivial problem, uncertainty leads to a surprise.

We need some math here. Suppose one has a fixed total income throughout their life, say $C$ dollars. For simplicity, we would assume this person has no investment nor loan; only spending: they spent $0 \leq Y \leq C$ dollars in his lifetime, such that he has $C-Y$ dollars left in the saving account. In addition, the total amount of necessary costs (food, medical, etc) in the late stage is denoted by $X$ dollars. Naturally, we would wish that the saving could cover such necessary costs, and the final balance sheet $C-Y-X$ is not negative. If $C-Y-X=0$ then a die-with-zero situation occurs.

It seems reasonable to assume that the utility comes from the following two parts:

whether the necessity is satisfied, that is $1(C-Y-X \geq 0)$. If $C-Y
the spending of leisure: It is a consumerism world, so it is certainly an axiom that the $Y$ dollars spending directly contributed to a utility increment equal to$Y$.

The combined utility is then $1(C-Y-X \geq 0) + Y$, subject to $0 \leq Y \leq C$.

The uncertainty comes from $X$: we do not know the necessary cost at death. Indeed, if X is known, then clearly

\[\mathrm{argmax}_{0 \leq Y \leq C} 1(C-Y-X \geq 0) + Y = C-X,\]

at which one dies with zero. So yes, we shall all die with zero if we have already calculated our life flawlessly.

In practice, $X$ is a random variable. The decision problem maximizes the expected utility

\[\mathrm{max}_{0 \leq Y \leq C} \mathrm{E}_{X} [ 1(C-Y-X \geq 0) + Y ].\]

You can check your intuition here: with this uncertainty on $X$, shall we expect more savings? That seems what grandma would say, no? But isn’t the utility a linear function and why would uncertainty matter at all?

We assume $X$ is a normal $(0, \sigma)$ random variable. The expected utility function can be written as $\Pr(X \leq C-Y) + Y$. The derivative of this function with respect to $Y$ is $- \frac{1}{\sigma \sqrt {2\pi}} \exp (-\frac{(C-Y)^2}{2\sigma^2}) + 1$.

If $\sigma=0$, then yes, there is no uncertainty, such that dying with zero is optimal.
If $\sigma$ is not too big, $\sigma \leq 1/ \sqrt {2\pi}$, then the optimum $\hat y= C-\sqrt{ - 2\sigma^2 \log (\sigma \sqrt {2\pi}) }$. The expected saving at death, is $\sqrt{ - 2\sigma^2 \log (\sigma \sqrt {2\pi}) } >0$, not zero. This is the price one pays for the uncertainty.
If $\sigma$ is too big, $\sigma > 1/ \sqrt {2\pi}$. Then the derivative of the objective function is always positive hence the optimal $\hat Y=C$. That is, since the future is just too chaotic, one simply adopts a yolo lifestyle and forgot about the necessity at all. The expected saving at death is either $-X$ or 0 depending on the healthcare system.

The bottom line: die-with-zero is an oversimplification. When there is moderate uncertainty of necessary cost, the optimum would prefer extra non-zero saving at death. When the uncertainty is too big, just yolo.

control variate other than the score

2022-12-08T00:00:00+00:00

In Bayesian computation, we use control variate to reduces Monte Carlo (MC) variance. The idea if we want to compute $E_{p} h(x)$ from MC draws $x_{1, \dots, S}$, instead of computing the sample mean of $h(x_i)$ , we seek a mean zero function m(x): $E_{p} m(x)= 0$, such that $h (x)- m(x)$ has lower variance.

As far as I know, most control variate takes the form of score function; Either the gradient of the log density, $\nabla_{x} \log p(x)$, or the stein gradient $\nabla \log p(x) g(x) + \nabla_{x} \cdot (g(x))$.

Are there other zero-mean functions? Well, at least another one: $\nabla_{x} p(x)$ because $E_p \nabla_{x} p(x)=0$ for all $p$. I might be ignorant, but I just notice this identity today.

Edit: Actually, $\nabla_{x} p(x)$ is still generated by the score function. Just take $g(x)=p(x)$ in the formula in the second paragraph, then we get the $\nabla_{x} p(x)$.

What is wrong with this marginalize-out trick

2022-08-23T00:00:00+00:00

Consider a normal-normal model with vector data $y$ and scalar parameter $\mu$ and $\sigma$ written in the following stan code¹:

data {
  int N;
  vector[N] y;
  real tau;
}
parameters {
  real mu;
  real sigma;
}
model {
  mu ~ normal(0, tau);
  y ~ normal(mu, sigma);
}

Tau is a fixed hyper-parameter, say 10. We can make inference on $\mu$ and $\tau$. That is easy.

But now I decide that I want to apply marginalization out trick to get rid of $\mu$. That is easy because it is a normal-normal model, such that the marginalized out model is

data {
  int N;
  vector[N] y;
  real tau;
}
parameters {
  real sigma;
}
model {
  y ~ normal(0, hypot(sigma, tau));
}

The problem is that this two models are not the same. Mathematically, the full joint model reads

\[y\vert \mu , \sigma \sim N(\mu, \sigma^2),~~ \mu\sim N(0, \tau^2),\]

It looks so tempting to marginalize out $\mu$ and write

\[y\vert \sigma \sim N(0, \sigma^2+ \tau^2).\]

But they just cannot be the same: The MAP estimate of model 1 is $\tau^2= Var(y)$ and $\tau^2= \sum_{i=1}^n y_i^2 / n - \tau^2$ for model 2.

The problem is y is a vector. $y_i$ are conditionally independent given $\mu$ and $\sigma$, but not so when only conditioning on $\sigma$. It is true that the marginal-marginal of $y_i$ is

\[y_i\sim N(0, \sigma^2+ \tau^2).\]

However, the joint-marginal is no longer factorizable. Indeed, Cov$(y_i, y_j)= \tau^2$. So the correct marginalized-out model $y \vert \sigma$ should be a MVN with mean 0 and a covariance matrix whose diagonals are $\sigma^2+ \tau^2$ and off-diagonals $\tau^2$.

The bottomline: Marginalization is a great trick to boost computing efficienty. But it is your obligation to validate the conditional independence after the marginalization.

Bob Carpenter wrote the code. Bob, Charles and I wasted one hour discussing this toy example. Please do not let our employee knows what we are doing. ↩

Alternaitves to two stage modeling

2022-08-22T00:00:00+00:00

Sometimes a model can be decomposed into modules and we may run inference separately. This task comes a lot in cut-feedback, SMC, causal inference (two stage regression), multiple imputation, and PK-PD modeling.

To have an easiest example, consider a Stan model with data y and parameter mu, sigma

y ~ normal (mu, sigma);

For some reason, we have already fitted mu from a different module or from a different dataset. We have obtained $\mu_1, \dots, \mu_S$. The goal is to make inference on $p(\sigma \vert y, \mu_1, \dots, \mu_S)$.

To be clear, till now we have already lost full-Bayeisanity now since we do not fit a joint model. But hey, we are inclusive of non-bayeisan methods.

There are three seemingly reasonable approaches to do for the second stage model:

Multiple imputation. We run the model y ~ normal (mu[i], sigma); separately for each $i$ and collect draws $p(\sigma \vert y, \mu_i)$; we then mix these draws altogether. We run this method in MI, Cut.
Plugin estimate. When we do two stage least-square fit, we simply plugin the first stage point estimate, say the posterior mean. This amount to a new model $y \sim normal (\bar {\mu}, \sigma)$, where $\bar {\mu}= 1/S \sum_{i=1}^S \mu_i$.
Mixed log likelihood. At least seemingly doable, we may also mix the log density from these draws, which in Stan reads

for (i in 1:S)
 	 target += 1/S * normal_lpdf (y | mu[i], sigma);

In this model example, the mixed-log-likelihood-approach is identical to the plugin estimate, although generally all these three methods will differ. Using the conditional variance formula, we can see that the multiple imputation delivers that largest estimate of $\sigma$.

OK, I know that in most cases approach 1 is the only acceptable answer. The justification is straight from the Bayes rule:

\[p(\sigma \vert y) = \int p(\sigma \vert y, \mu) p(\mu \vert y ) d\mu.\]

My controversial objective is that the Bayes rule is only relevant is we are running a joint model and infer $\mu$ and $\sigma$ together. That is SMC. But in a situation like Cut, we are placing doubt on the model in the first place, and still keep the obsession over this bayes rule seems a little bit stubborn to me.

Approach 1 and approach 3 differ in how they mix the conditional sampling model $p(y \vert \sigma, \mu)$. Approach 1 is using a mixture (coherent with the joint model)

\[p(y \vert \sigma) := \int (p(y \vert \sigma, \mu) p(\mu \vert \sigma) ) d\mu,\]

while approach 3 is using log-linear-pooling (this line does not correspond to any joint model):

\[\log p(y \vert \sigma) := \int \log p(y \vert \sigma, \mu) p(\mu \vert y) d\mu + Constant.\]

I wonder if this approach 3 has any actual application. I do not know.

Score matching, Bayesian predictions, tempering, and invariance

2022-08-20T00:00:00+00:00

Score matching

Suppose that we observe a sequence of data $y={y_i \in R_m \mid 1\leq i \leq n}$ coming independently from an unknown distribution $p_{true}$; we would like to evaluate a forecast given by a probabilistic density function $p(y)$. For example, we may use the logarithm score $\sum_{i=1}^n \log p(y_i)$ to assess the forecast.

But what if the predictive pdf is only known up to multiplicative constants? That is, we are only able to evaluate the unnormalized density $q(y) = p(y)/ c$. In a typical task of parameter inference, model selection, and model averaging, we are given a set of unnormalized forecasts indexed by $\theta$: ${q_\theta(\cdot) \mid \theta \in \Theta}$, where each element $q_\theta(\cdot)$ is a non-negative function on $R_m$, whose normalizing constant $c(\theta) = \int_{R_m} q_\theta(y) d y$ is unknown.

Since the seminal work by Hyvarinen (2005), score matching has been a powerful tool for evaluating unnormalized predictions. The main idea is that, the normalizing constant disappears by looking at the gradient of log unnormalized density. The ``gradient of log’’ of the pdf is often known as the score function to statisticians. We measure the difference score functions of the true data generating process $p_{true}$ and of the forecast $q_\theta$,

\[D(p_{true}, q_\theta) = \int_{R_m} \Vert \nabla \log p_{true}(y) - \nabla \log q_{\theta} (y) \Vert^2 p_{true}(y) dy,\]

hence the name score matching.

In practice, we do not know $p_{true}$; we only observes its samples $y_{1:n}$. A sample estimate of the divergence above is

\[H(y_{1:n}, q_{\theta}) =\frac{1}{n}\sum_{i=1}^n \left(\nabla_y \log q_{\theta} (y_i) + \frac{1}{2} \Delta_y \log q_{\theta} (y_i) \right).\]

In a larger universe of scoring rules, this $H(y, q_{\theta})$ is known as the Hyvarinen score. In the limiting case as sample size $n \to \infty$, this sample estimate converges in the sense that $H(y_{1:n}, q_{\theta}) \to D(p_{true}, q_\theta)$+ Constant, where the constant does not depend on $q_{\theta}$.

Unnormalized models in Bayesian statistics

There are three levels of unnormalized models in Bayesian statistics.

Level 1: A harmless normalization constant comes from the Bayes rule.

In classical parameter inference, the posterior density of a parameter is typically given in a unnormalized form: $p(\theta\vert y) \propto p(y\vert \theta) p(\theta)$, where the normalizing constant is the marginal likelihood $\int p(y\vert \theta) p(\theta) d \theta = p(y)$. For the purpose of the Bayesian computation, this normalizing constant is irrelevant in MCMC, variational inference, or importance sampling. Notably, with posterior draws $\theta_1, \dots, \theta_S$, the posterior predictive distribution is tractable and appropriately normalized,

\[p(\tilde y \vert y) = \int p(\tilde y \vert \theta)p(\theta \vert y) d \theta \approx \frac{1}{S} \sum_{i=1}^S p(\tilde y \vert \theta).\]

Level 2: Intractable posterior predictive distribution.

Sometimes we only know the posterior predictive density up to a constant. For example, in modern literature on calibration, we may address the potential overconfidence of a prediction via tempering, such that

\[p(\tilde y\vert y, \lambda)= \frac{1}{z(\lambda)} p(\tilde y \vert y)^\lambda, ~ z(\lambda)= \int p(\tilde y \vert y)^\lambda d \tilde y.\]

Intuitively, a smaller $\lambda \in (0,1)$ flatten the prediction, resulting in less confidence. The Hyvarinen score still applies.

Level 3: Intractable likelihood.

If the likelihood is also intractable, meaning we are only able to evaluate $q(y\mid \theta) \propto p(y\mid \theta),$ while the pointwise normalizing constant $z(\theta)= \int q(y\mid \theta) dy$ is unknown. This types of models are often called doubly intractable. For example, in alpha-liklihood, the likelihood function is

\[p(y\vert \theta, \lambda) \propto p(y\vert \theta)^ \lambda, ~ z(\lambda, \theta)= \int p(y\vert \theta)^ \lambda d y.\]

Aside from how to sample from a doubly intractable model, even if we do obtain posterior draws $\theta_1, \dots, \theta_S$, this time the posterior predictive distribution is \textbf{a mixture of unnormalized} densities:

\[p(\tilde y\vert y)= \int p(\tilde y\vert \theta \lambda) p( \theta \lambda \vert y) d\theta d\lambda = \sum_{s=1}^S \frac{1}{z(\lambda_s, \theta_s)} p^{\lambda_s}(\tilde y\vert \theta_s).\]

The Hyvarinen score does not apply to a mixture/summation of unnormalized densities. It is clear that the score function is not invariant under this procedure:

\[\nabla \log \left(\sum_{i=1}^S c_if_i(y) \right) \neq \nabla \log \left(\sum_{i=1}^S f_i(y)\right).\]

Matching for doubly intractable Bayesian predictions, or a mixture of unnormalized densities?

“Gradient of log” is a great operator because it throws away normalizing constants. That is, for any positive constant $c$ and any continuous density function $p(y)$,

\[{\frac{d}{dy} \log} ( {c} p(y) ) = {\frac{d}{dy} \log} ( p(y) ).\]

But what if we now want to evaluate a sum of unnormalized functions, $\sum_{i=1}^n p_i(y)$?

Does there exist a non-trivial operator, #, a mapping from $R^R$ to $R^R$, such that

\[{\color{red} \#} ( \sum_{i=1}^S {\color{orange}c_i} p_i(y) ) = {\color{red} \#} ( \sum_{i=1}^S p_i(y) ).\]

The answer is negative for any $S\geq 2$.

A heuristic proof is that we can write any function into Taylor series expansion. If an operator satisfies the propriety above, it will make any two function invariant, such that any two predictions are evaluated to be the same. That is not useful.

The bottomline:

Score matching is a useful tool for evaluating unnormalized models. The Hyvarinen score applies to a tempered mixture, but does not apply to a mixture of tempered densities, or any doubly-intractably Bayesian predictions.

Furthermore, we can mathematically prove that there is not any operator that we can use to match a mixture of unnormalized densities.

How to generate unbiased estimate of 1/E[x] using one random draw?

2022-04-19T00:00:00+00:00

Quiz: you are given ONE random draw $x$ that was drawn from a density $p(x)$. Could you produce an unbiased estimate of $1/E_p[X]$?

You might want to think about this quiz before reading my solution.

Apart from mathematical fun, this type of problem comes out in stochastic approximation, in which we needs an unbiased estimate using a very small number of Monte Carlo draws. The unbiasedness here means that this sampling step will be repeated many times, but each time you are only shown one sample point $x$, and we wish the estimate to be unbiased under repeated sampling.

Here, the obvious wrong answer is to use $1/x$. You can try

n=10000
x=rbeta(n,2,2)
mean(1/x)

It is clear that E $[1/x]= 3 $ while our desired quantity 1/E $[x]= 2 $. Indeed it is also clear that E $[1/x] >$ 1/E $[x]$ for positive $x$.

How about some Taylor series expansion? something like $1/E[p(X)] = 1- (E(x)-1) + O(E(x)-1)^2$? It is legitimate but then you get some crude approximation $2-x$, provided that I believe that $E(x) \approx 1$, which we typically do not know in the first place.

I find one solution from rejection sampling. The idea is that self-normalized importance sampling is only unbiased asymptotically, while rejection sampling is always unbiased even if you have MC size 1.

Here is the method. To make it work, I need to know the upper bound of $x$, it has to be a bounded variable. Say the upper bound is $c$. Each time I saw a realization $x$, then independent I generate a random number from uniform (0,1). If $u$ is smaller than $x/c$, accept, and report $1/x$. If $u$ is larger than $x/c$, do not report any estimate.

Then whenever I report the accepted $1/x$, it is an unbiased estimate of 1/E$[x] $.

Here is a demo code for $p(x)=$ Beta$(2,2)$:

n=1000000
x=rbeta(n,2,2)
u=runif(n,0,1)
Unbiased_estiamte=rep(NA, n)
Unbiased_estiamte[u



Statistics intuitions for intergals
2022-03-29T00:00:00+00:00
I have not done any  math for a long while.  Today I happen to need to compute an integral

\[S(k, \sigma) =\int_{0}^\infty x\log (x)/\sigma (1+kx/\sigma)^{(-1/k-1)} dx.\]

It is the expectation of $x\log x$ under generalized Pareto distribution. Surely it will be finite as long as $k <1$.

I tried for a while then I was very sure I cannot solve it. So I opened some symbolic integral tool and  the result turned out easy

\[S(k, \sigma)= \frac{\sigma \left( 1-\mathrm{HarmonicNumber}[-2+\frac{1}{k}] - \log(\frac{k}{\sigma}) \right)   }{1-k}.\]

Except that I do not understand what HarmonicNumber is. I think it is some special function, so I looked it up. Wikipedia told me that

  In mathematics, the n-th harmonic number is the sum of the reciprocals of the first n natural numbers:


\[H_{n}=1+{\frac {1}{2}}+{\frac {1}{3}}+\cdots +{\frac {1}{n}}=\sum _{k=1}^{n}{\frac {1}{k}}.\]

Except it is not helpful to me cuz apparently i have non-integer $n= -2+\frac{1}{k}$ here. I studied  complex analysis in college but I have never used it ever since. But that is ok, I trust my symbolic integral tool.

Indeed I only want to evaluate this integral near 1. Because the mean and variance of generalized Pareto distribution is of the order $O((1-k)^{-1})$ and $O((1-k)^{-2}(1-2k))$ respectively, my best conjecture is that this $S$ should be $O((1-k)^{-m}), 1\leq m \leq 2$ as k is close to 1.

So I searched one more minute I found that $H_{x}= \frac{\Gamma^\prime(x+1)}{\Gamma(x+1)}+\gamma$, in which $\gamma$ is the Euler constant and $\Gamma$ is the Gamma function.

The appearance of Euler constant and  Gamma function in applied statistics is like  a six pleat shirring on a shirt: fancy to the wearer but seldom useful to the audience.

It appeared that I needed the derivative of the Gamma function near 0. But I found that $\frac{\Gamma^\prime(x)}{\Gamma(x)}$ is itself called digamma function $\psi(x)$. OK, I am not proud for being ignorant here, but it is still fun to learn. So I looked up Wikipedia again and I found  $\psi(x)\approx \log x - 1/2x$. I plugged this approximation into my expression and is it not the same as my conjecture.  Ohh, of course, the $\psi(x)\approx \log x - 1/2x$ approximation is only applicable if  x is large. For small $x\approx 0$, I found that $\psi(x) \approx -1/x - \gamma$.  I plugged this in and then $\gamma$ cancelled out. So the final answer is that $S(k, \sigma) = \sigma k / (1-k)^2 $ + small order terms as k goes to 1. Done.

But this is not why I wrote this post. The point is that sometimes statistics intuitions can help to do tedious math. To be clear this math problem is only tedious to me cuz I am ignorant on digamma function or gamma function. I am sure the previous problem is trivial to Euler. That said, I have already used stats intuition once that I know the order must be between -1 and -2 because $x\log x$ is bounded between the first and second moments.

Indeed, a more statistically intuitive solution here is that I can simply replace the   generalized Pareto distribution by a Pareto distribution. This time I can do it by hand:

\[S(k,1)\approx
\int_{1}^{\infty} r \log r \frac{1}{k} r^{-1/k-1} dr= k(1-k)^{-2}.\]

This expression is different but has the same order I obtained using the digamma function when k is close 1, which can be used for many crude approximaitons.  Again, it would be nice if i have known more about gamma fucntion, but solving a tedious math by some simple statitics approximation is equally fun.