In the middle of March, I realized that I had not heard back from any of the schools I had interviewed with. That was not normal. Interviews are like earthquake rescues: if you are not rescued in the first 72 hours, time is against you. At the time I was writing a review paper on Bayesian prediction, and one key point I was making in the paper was that Bayesian prediction needs the sampling distribution while Bayesian inference does not. I told myself: hey, predicting the outcome of my applications is a good example. See, I had been waiting to hear back from these schools for about a month. Normally the random variable would be the binary result: offer or rejection. But here, I implicitly knew that the outcome had to be negative and the only random variable was the time of the announcement. Although the likelihood principle claims that the realized outcome has the same likelihood evaluation under these two situations, I still had a tiny bit of hope.
On Friday I emailed all the schools I had interviewed with and asked for the outcomes. Perhaps not surprisingly, I got negative news: none of the schools would make me an offer, and the collapse of the tiny uncertain hope entailed a point mass of sadness. That very night I had a dream. In the dream I was the one on the hiring committee, criticizing my own research in a conference room paneled with walnut wood. I forget the exact reasons I listed in that dream, but the conclusion was sound: this candidate had no potential and we had better vote against him. (To be fair, I do not think that is how the committee works, but hey, this is my dream.)
Yes, the classical five-stage model of grief is real. By Monday I was in the anger stage. I went to Overleaf and changed my affiliation to “not have a job” in all the drafts I was still writing. I also added an acknowledgment:
“The author thanks New York Department of Labor, for the unemployment benefits he will soon receive to support this paper.”
Yes, I did look up the unemployment benefits on Monday, which were supposedly 500 dollars per week.
I started to talk to colleagues and friends about my sad news, and I kept asking them what I could do: tech, finance, or stay at home. Reality was still better than my dream after all. I received many warm words of encouragement, and they were warmer than “I encourage you to resubmit.” I forced myself to think about what I would do next. The journey of thinking was not pleasant. It was mixed with a high dosage of regret, and the regret grew more and more counterfactual: why had I not practiced my job talk better, as if that would have made a difference; why had I done a postdoc at all; and why had I chosen the academic path, given how brutal it appears. See, it is hard to make a decision when you condition on counterfactuals other than the observed data, but I guess sometimes the observed data are just too depressing to watch, so we either run unrealistic predictions in fantasy, or repeatedly check the model in regret.
Supposedly stage 5 should be acceptance. But I have not arrived there yet as I write, five days after hearing the results. My impression is that pure acceptance, or the lack of grief, is an idealized approximation. The grief stays around for a long while, presumably with some slow logit tail, or presumably it is eventually washed out by the next grief. I do not know. I lost the job with which I could conduct that research.
John Kruschke discussed BARG: the Bayesian Analysis Reporting Guidelines. It is a decision-theory-oriented framework for the workflow, including guidance for data analysts, reviewers, and method developers. For example, instead of looking only at the confidence interval, John proposes looking at the posterior probability of the region of practical equivalence (ROPE), an approach sharing a similar spirit with the traditional power calculation. Personally, I am not the biggest fan of placing central importance on nested models in data analysis, but I sympathize with the ultimate fantasy of a fully automated data analysis process.
Ben Goodrich presents his ongoing work on numerical approximation of the log likelihood. The goal is to evaluate the log density fewer times during sampling. Ben proposes a simple change of variables: instead of sampling theta, we sample x, defined as the prior CDF evaluated at theta, so that x has a uniform [0,1] prior. Now instead of evaluating the log likelihood directly, Ben would like to approximate the log likelihood function numerically; to this end, he uses the Kolmogorov representation theorem. It seems there are still many practical gaps, such as the high-dimensional quantile transformation and a smooth implementation of the Kolmogorov representation theorem. I guess an active learning approach that iteratively approximates the log density and samples from that approximation would be interesting. I asked Ben why we need the Kolmogorov representation theorem in light of the seemingly more popular neural networks and normalizing flows; Ben told me that it seems a nicer approach for noise-free functional approximation.
William Gillespie from Metrum presents their R package bbr.bayes. It facilitates Bayesian analysis for pharmacometrics using Stan and NONMEM, the famous PK analysis software. The overall workflow seems to resemble the usual Bayesian workflow, and the talk is quite high-level, so I am curious what their secret sauce is in actual PK modeling (for example, they use loo, but I can imagine there are easily non-iid/nested outcomes in PK modeling, which would require a non-iid loo).
Edward Roualdes presents BridgeStan, an R and Python wrapper that computes the log density and its gradient: in other words, the next-generation RStan.
I have been following and using parts of BridgeStan since last summer. In (my shallow) hindsight, the inability to easily return the gradient of an arbitrary log density, or even of an arbitrary function, and the lack of exposure of the Stan sampler in high-level functions, made Stan miss its chance to monopolize the field of automatically differentiable programming languages, and at the same time paved the way for counterparts such as JAX and TFP. BridgeStan is certainly overdue and will be most welcome to methodology developers.
Jeff Soules introduces MCMC-Monitor: browser-based monitoring of Stan. The cool part is that you can run a Stan model on a server and monitor the samples from a different machine on the fly.
Arya Pourzanjani presents a clever way to model summary statistics. The observations are tumor progression data: the number of patients with 20% tumor size growth, taken from published results. Arya uses a clever Stan modeling technique to infer the fine-grained tumor growth curve by converting the “20% tumor size growth” statements into linearly constrained parameter vectors. I asked Arya how his method relates to ABC, and Arya told us ABC would be more general and automated, while this smart hacking requires human engineering but is arguably more efficient. It reminds me of the difficulty of likelihood-free or simulation-based inference: likelihood-free is like the GPL license; even if the majority of the model is likelihood-exact, as long as a tiny part of the observations is likelihood-free, you need to run likelihood-free inference on all the data and the model. How to combine likelihood-free and likelihood-exact inference jointly? I do not know.
Siddhartha Chib gives a talk on conditional moment models, based on his 2018 JASA and 2021 JRSS-B papers. The basic idea is simple: how to do Bayesian linear regression $y=x\beta_1 +z\beta_2+ \epsilon$ in the presence of endogeneity, i.e., $\mathrm{E}(\epsilon x) \neq 0$, where $z$ contains the instrumental variables such that $\mathrm{E}(\epsilon z) = 0$. In light of the two-cultures battle, Chib’s approach is certainly on the reduced-form-for-robustness side. The most reduced form in linear regression is probably to set only the first-moment conditions: $\mathrm{E} [\epsilon (x, z, 1)] = 0$ gives linear regression without endogeneity, while $\mathrm{E} [\epsilon (x, z, 1)] = (b, 0, 0)$ is a relaxed model allowing endogeneity. To make Bayesian inference without distributional assumptions, Chib further uses empirical likelihood to infer $\beta \mid y$: given any $\beta$, you obtain its profile likelihood from an optimization procedure, $\max_w \prod_i w_i$, s.t. $\sum_i w_i \epsilon_i (x_i, z_i, 1) = 0$. Multiply this $\max_w \prod_i w_i \mid \beta$ by the prior of $\beta$, and you obtain the posterior density evaluated at $\beta$. Comparing the reduced model and the encompassing model by their marginal likelihoods then yields a valid endogeneity test. Setting aside whether I would practically use it in linear regression, I am impressed by the mathematical beauty of the empirical likelihood method.
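As a side note, the profile empirical likelihood step is easy to prototype. Below is a toy sketch of my own (not Chib’s implementation) for a single moment condition $\mathrm{E}[x-\mu_0]=0$, with the weights additionally constrained to sum to one; it solves the standard Lagrange dual, in which the optimal weights are $w_i = 1/(n(1+\lambda g_i))$. The data and the moment condition are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)

def log_el(mu0):
    """Log profile empirical likelihood for the moment condition E[x - mu0] = 0."""
    g = x - mu0
    n = len(g)
    # dual problem: minimize -sum(log(1 + lam * g_i)) subject to 1 + lam * g_i > 0
    lo = -1.0 / g.max() + 1e-6
    hi = -1.0 / g.min() - 1e-6
    dual = lambda lam: -np.sum(np.log1p(lam * g))
    lam = minimize_scalar(dual, bounds=(lo, hi), method="bounded").x
    w = 1.0 / (n * (1.0 + lam * g))  # optimal weights; they sum to 1 at the solution
    return np.sum(np.log(w))

# the profile is highest (uniform weights, -n log n) at the sample mean,
# and drops as the hypothesized mean moves away
print(log_el(x.mean()), log_el(x.mean() + 0.5))
```

As expected, the profile attains its maximum $-n\log n$ when $\mu_0$ equals the sample mean, and decays as $\mu_0$ moves away.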
Will Landau talks about file management in statistical modeling. He designed a cool pipeline tool in R, targets, which arranges the code and data files into a graph that reflects the dependency structure of the code. The graph representation enables automatic parallelization, file storage, and version control.
Wade Brorsen discusses optimal experimental design in agriculture. Experimental design is one of my favorite topics. The goal is to determine the optimal amount of nitrogen in the soil for crop yield. Arya reminds me that the model is similar to how pharmacometricians decide the optimal drug dosage in PKPD modeling.
Cameron Pfiffer introduces a model that describes how likely a reader is to subscribe to a news site after hitting the paywall. To this end he models the probability that a user reads a news article, and the chance of a subscription after the paywall.
Collin Cademartori presents “Two Challenges for Bayesian Model Expansion.” I think one of his main claims is that model expansion reduces identifiability, but it appears that his definition of identifiability is the mutual information between data and parameter. I do not know. I probably would not be worried if the correlation between data and parameter dropped from 0.9 to 0.8, and I probably would not call that poor identifiability.
Nathaniel Haines discusses loss rate modeling in the insurance industry, which they are using in an investment firm to help investors trade investment-grade securities. To incorporate various risk models, they have been using hierarchical stacking in their pipeline. Cool!
Finally, Bob wrapped up the conference by pointing out a few promising ongoing directions in the Stan implementation, including generalized HMC, massive parallelization, and normalizing flow VI.
A few years ago I saw a StackOverflow question (which I cannot find now): allegedly Andrew Gelman blogged that Bayesian models would not overfit and would only underfit. (I also cannot find Andrew’s blog post now, but I think what he meant by “Bayesian models only underfit” is that you can never fully model the true data process in the population.)
It reminded me that I have received several emails asking why we need cross-validation for Bayesian models: since Bayesian inference is not set up as empirical risk minimization or an M-estimate, there is no immediate reason why the empirical training risk/loss should underestimate the test risk. Such an argument seems popular in the Bayesian world; for example, a quick Google search of “Bayes does not overfit” turns up these lecture notes from Edinburgh:
Fully Bayesian procedures can’t suffer from “overfitting”, because parameters aren’t fitted: Bayesian statistics only involves integrating or summing over uncertain parameters, not maximizing. The predictions can depend heavily on the model and choice of prior however.
I have a different view. Bayesian models do overfit.
Moreover, Bayes is guaranteed to overfit, regardless of the model (correct or wrong) and the prior (“strong” or uninformative).
Moreover, Bayes is guaranteed to overfit on every realization of training data, not just in expectation.
Moreover, Bayes is guaranteed to overfit on every single point of the training data, not just in the summation.
To see this^{1}, let’s work with a general model setting with exchangeable observations $y_1, \dots, y_n$: the data model is any $p(y \vert \theta)$ and you could have any prior $p(\theta)$. Whether the model is correct or wrong is not relevant to the discussion here. The posterior predictive distribution of any future unseen data is $p(\cdot \vert y) = \int p(\cdot \vert \theta ) p(\theta \vert y) d\theta$. We will evaluate the prediction by its expected log score. The in-sample log score is
\[\sum_{i=1}^n \log p(y_i \vert y), ~~~ p(y_i \vert y)= \int p(y_i \vert \theta) p(\theta \vert y)d\theta.\]It is a sum of $n$ terms. Each term, $p(y_i \vert y)$, is a (weighted) arithmetic mean of $p(y_i \vert \theta)$. You could imagine that if you have access to $S$ Monte Carlo draws from $p(\theta \vert y)$, then this in-sample individual predictive density is the arithmetic mean of $p(y_i \vert \theta_j)$, $j=1, \dots, S$.
To evaluate the out-of-sample prediction, we can leave this $i$-th data point out. The leave-one-out log predictive density is $\sum_{i=1}^n \log p(y_i \vert y_{-i})$, where the $i$-th individual predictive density is
\[p(y_i \vert y_{-i})= \int p(y_i \vert \theta) p(\theta \vert y) \frac{p(y_i \vert \theta)^{-1}} {\int p(y_i \vert \theta^\prime)^{-1} p(\theta^{\prime} \vert y) d\theta^\prime} d\theta = \frac{1} {\int p(\theta \vert y) p(y_i \vert \theta)^{-1} d\theta},\]that is, the weighted harmonic mean of $p(y_i \vert \theta)$. Again, when you have $S$ Monte Carlo draws from $p(\theta \vert y)$, the out-of-sample individual predictive density is precisely the harmonic mean of $p(y_i \vert \theta_j)$, $j=1, \dots, S$.
Except in degenerate cases (a point-mass posterior), the harmonic mean inequality guarantees the strict inequality $p(y_i \vert y_{-i}) < p(y_i \vert y)$, for every point $i$ and every model.
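This inequality is easy to verify by simulation. Here is a sketch in Python (my own toy example, a conjugate normal model with made-up numbers), comparing at each training point the arithmetic mean (in-sample) and the harmonic mean (leave-one-out) of the pointwise likelihood over posterior draws:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, sigma, tau = 20, 1.0, 2.0
y = rng.normal(1.0, sigma, size=n)

# conjugate posterior of mu for the model y ~ N(mu, sigma^2), mu ~ N(0, tau^2)
post_prec = n / sigma**2 + 1 / tau**2
post_mean = (y.sum() / sigma**2) / post_prec
theta = rng.normal(post_mean, np.sqrt(1 / post_prec), size=4000)

# pointwise likelihood p(y_i | theta_j): an (n, S) matrix
lik = norm.pdf(y[:, None], loc=theta[None, :], scale=sigma)

in_sample = lik.mean(axis=1)          # arithmetic mean: p(y_i | y)
loo = 1.0 / (1.0 / lik).mean(axis=1)  # harmonic mean: p(y_i | y_{-i})

# the strict inequality holds at every single training point
assert np.all(loo < in_sample)
```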
Bayes always overfits; how about other inferences?
From the harmonic mean inequality, you can think about an abstract math problem: the pointwise in-sample-score is the mean of $p(y_i \vert \theta)$, while the leave-one-out version is the inner product of $p(y_i \vert \theta)$ and $\frac{1/p(y_i \vert \theta)}{ \int 1/p(y_i \vert \theta) d\theta}$. The second term is a $\theta$-density that is proportional to $\frac{1}{p(y_i \vert \theta)}$.
The Bayesian update can be quantified by its sequential step: what you do when you see a data point $y_i$. Bayes multiplies the posterior by a factor ${p(y_i \vert \theta)}$, but you also need to normalize it because the density needs to integrate to 1. There are other updates too: for example, if you choose to update the posterior by ${p(y_i \vert \theta)}^{\alpha}$, such that your posterior is proportional to $p(y\vert \theta)^\alpha p(\theta)$, then the out-of-sample predictive density will be the inner product of $p(y_i \vert \theta)$ and a $\theta$-density proportional to $\frac{1}{p(y_i \vert \theta)^\alpha}$.
From Jensen’s inequality, it is easy to see that the generalization gap is strictly monotonic in $\alpha$. When $\alpha=0$, the inference always returns the prior and you have no overfitting. When $\alpha=\infty$, the posterior concentrates at the MLE, and you get the maximum amount of overfitting.
The math we need is elementary: suppose you have a sequence of (fixed) positive numbers: $a_1, \dots, a_J.$ There is another sequence of variable positive numbers: $b_1, \dots, b_J,$ with $\sum_{j=1}^J b_j=1$. We compute their inner product $\sum_{j=1}^J a_j b_j$. When $b_j=1/J$, you get the arithmetic mean, corresponding to the in-sample predictive density; when $b_j \propto 1/a_j$, you get the harmonic mean, corresponding to the out-of-sample Bayes predictive density; when $b_j = \mathbf{1} (a_j= \min(a_1, \dots, a_J))$, you get $\min_j(a_j)$, the minimum you can obtain from this summation, corresponding to the out-of-sample MLE.
This post was discussed a lot on Twitter. I thank all the readers.
To clarify, I think the word overfit could mean two things: (1) the in-sample error underestimates the out-of-sample error, i.e., there is a positive generalization gap; (2) the fitted model gives worse test predictions than some alternative procedure.
In this blog post, I only refer to (1), not (2). Perhaps a more accurate phrasing is that Bayesian posterior prediction always has a positive generalization gap at every single point. In particular, I do not discuss whether a Bayesian prediction necessarily has a better or worse test prediction error than the MLE.
PPS. When you do have the correct model, the Bayesian prediction is optimal in the sense of prior replication; see our review paper on Bayesian prediction. But even then, the in-sample prediction error will be underestimated when you have a finite sample size. Here, the underestimation is an issue of the model evaluation, not of the model design.
PPPS. Don’t get me wrong: I do use, like, and advocate Bayesian approaches. But when there is a distinction between treat-Bayes-rule-as-an-always-correct-blackbox and an open-minded view of use-bayes-and-bayesian-decision-theory-as-building-blocks-toward-model-improvement, I would prefer the latter view.
I am not aware of whether this justification has appeared before. If you would like to cite this “proof”, you can cite my recent review paper on Bayesian prediction where I add this simple math as a side note. ↩
In 1957, Nobel Prize winner Franco Modigliani developed the life-cycle hypothesis, by which the optimal way of utilizing your wealth is to end with zero.
I don’t know. I would be moderately shocked if there was indeed an academic effort to show this circular-argument-type result. Seems very trivial.
At the risk of a pedantic tone, I think one common flaw is that people ignore uncertainty. I will show that even in this seemingly trivial problem, uncertainty leads to a surprise.
We need some math here. Suppose one has a fixed total income throughout their life, say $C$ dollars. For simplicity, assume this person has no investments or loans, only spending: they spend $Y$ dollars ($0 \leq Y \leq C$) in their lifetime, leaving $C-Y$ dollars in the savings account. In addition, the total amount of necessary costs (food, medical, etc.) in the late stage of life is denoted by $X$ dollars. Naturally, we would wish the savings to cover such necessary costs, so that the final balance $C-Y-X$ is not negative. If $C-Y-X=0$, a die-with-zero situation occurs.
It seems reasonable to assume that the utility comes from two parts: a fixed reward for solvency at death, $1(C-Y-X \geq 0)$, and the enjoyment of lifetime spending, which we take to be linear in $Y$.
The combined utility is then $1(C-Y-X \geq 0) + Y$, subject to $0 \leq Y \leq C$.
The uncertainty comes from $X$: we do not know the necessary cost at death. Indeed, if $X$ is known, then clearly
\[\mathrm{argmax}_{0 \leq Y \leq C} 1(C-Y-X \geq 0) + Y = C-X,\]at which one dies with zero. So yes, we shall all die with zero if we have already calculated our life flawlessly.
In practice, $X$ is a random variable. The decision problem maximizes the expected utility
\[\mathrm{max}_{0 \leq Y \leq C} \mathrm{E}_{X} [ 1(C-Y-X \geq 0) + Y ].\]You can check your intuition here: with this uncertainty in $X$, should we expect more savings? That seems to be what grandma would say, no? But isn’t the utility a linear function, and why would uncertainty matter at all?
We assume $X$ is a normal $(0, \sigma)$ random variable. The expected utility can then be written as $\Pr(X \leq C-Y) + Y$. Its derivative with respect to $Y$ is $- \frac{1}{\sigma \sqrt {2\pi}} \exp (-\frac{(C-Y)^2}{2\sigma^2}) + 1$. Setting this derivative to zero has an interior solution only when $\sigma \sqrt{2\pi} < 1$; when $\sigma$ is large, the derivative is positive everywhere and the optimum is simply $Y=C$.
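A small numerical sketch of both regimes (all numbers are hypothetical), grid-maximizing the expected utility $\Phi((C-Y)/\sigma) + Y$ over spending levels $Y$:

```python
import numpy as np
from scipy.stats import norm

C = 100.0
Y = np.linspace(0.0, C, 200001)  # spending grid with step 0.0005

def optimal_saving(sigma):
    """Return C minus the Y that maximizes E[utility] = Pr(X <= C - Y) + Y."""
    utility = norm.cdf((C - Y) / sigma) + sigma * 0 + Y  # expected utility on the grid
    return C - Y[np.argmax(utility)]

print(optimal_saving(0.2))  # moderate uncertainty: keep some saving (about 0.235)
print(optimal_saving(5.0))  # large uncertainty: spend everything, saving 0
```

With $\sigma=0.2$ the optimum leaves about $0.235$ in savings, matching the interior solution of the derivative condition; with $\sigma=5$ the optimum is to spend everything.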
The bottom line: die-with-zero is an oversimplification. When there is moderate uncertainty in the necessary cost, the optimum prefers extra non-zero savings at death. When the uncertainty is too big, just yolo.
As far as I know, most control variates take the form of the score function: either the gradient of the log density, $\nabla_{x} \log p(x)$, or the Stein operator $\nabla_{x} \log p(x) \cdot g(x) + \nabla_{x} \cdot g(x)$.
Are there other zero-mean functions? Well, at least one more: $\nabla_{x} p(x)$, because $\mathrm{E}_p [\nabla_{x} p(x)]=0$ for all (smooth, vanishing-at-the-boundary) $p$. I might be ignorant, but I only noticed this identity today.
Edit: Actually, $\nabla_{x} p(x)$ is still generated by the score function. Just take $g(x)=p(x)$ in the Stein operator in the second paragraph, and we get $2\nabla_{x} p(x)$, which is proportional to $\nabla_{x} p(x)$.
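A quick Monte Carlo sanity check of the identity $\mathrm{E}_p [\nabla_x p(x)] = 0$, using a standard normal for $p$ (a sketch in Python, my own toy example):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=1_000_000)

# for p = N(0, 1): grad_x p(x) = -x * phi(x); its mean under p should be zero
grad_p = -x * norm.pdf(x)
print(grad_p.mean())  # close to 0
```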
data {
int<lower=0> N;
vector[N] y;
real<lower=0> tau;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
mu ~ normal(0, tau);
y ~ normal(mu, sigma);
}
Tau is a fixed hyperparameter, say 10. We can make inference on $\mu$ and $\sigma$. That is easy.
But now I decide that I want to apply the marginalization trick to get rid of $\mu$. That seems easy because it is a normal-normal model, such that the marginalized model is
data {
int<lower=0> N;
vector[N] y;
real<lower=0> tau;
}
parameters {
real<lower=0> sigma;
}
model {
y ~ normal(0, hypot(sigma, tau));
}
The problem is that these two models are not the same. Mathematically, the full joint model reads
\[y\vert \mu , \sigma \sim N(\mu, \sigma^2),~~ \mu\sim N(0, \tau^2),\]It looks so tempting to marginalize out $\mu$ and write
\[y\vert \sigma \sim N(0, \sigma^2+ \tau^2).\]But they just cannot be the same: the MAP estimate of $\sigma^2$ is the sample variance of $y$ in model 1, but $\sum_{i=1}^n y_i^2 / n - \tau^2$ in model 2.
The problem is that $y$ is a vector. The $y_i$ are conditionally independent given $\mu$ and $\sigma$, but not when conditioning only on $\sigma$. It is true that the marginal of each single $y_i$ is
\[y_i\sim N(0, \sigma^2+ \tau^2).\]However, the joint marginal is no longer factorizable. Indeed, Cov$(y_i, y_j)= \tau^2$ for $i \neq j$. So the correct marginalized model $y \vert \sigma$ should be a multivariate normal with mean 0 and a covariance matrix whose diagonal entries are $\sigma^2+ \tau^2$ and off-diagonal entries $\tau^2$.
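Here is a numeric check of this claim (a sketch with made-up numbers): integrating $\mu$ out by brute-force Monte Carlo agrees with the multivariate normal with off-diagonal $\tau^2$, not with the iid $N(0, \sigma^2+\tau^2)$ product.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(3)
sigma, tau = 1.0, 2.0
y = np.array([1.0, 2.0, 0.5, -1.0, 1.5])
n = len(y)

# brute-force marginalization: p(y | sigma) = E_{mu ~ N(0, tau)} prod_i N(y_i | mu, sigma)
mu = rng.normal(0.0, tau, size=500_000)
log_lik = norm.logpdf(y[:, None], loc=mu[None, :], scale=sigma).sum(axis=0)
log_mc = np.log(np.mean(np.exp(log_lik)))

# tempting but wrong marginal: independent N(0, sigma^2 + tau^2)
log_iid = norm.logpdf(y, 0.0, np.sqrt(sigma**2 + tau**2)).sum()

# correct marginal: MVN with diagonal sigma^2 + tau^2 and off-diagonal tau^2
cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
log_mvn = multivariate_normal(np.zeros(n), cov).logpdf(y)

print(log_mc, log_mvn, log_iid)  # the Monte Carlo value matches log_mvn, not log_iid
```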
The bottom line: marginalization is a great trick to boost computing efficiency. But it is your obligation to validate the conditional independence structure after the marginalization.
Bob Carpenter wrote the code. Bob, Charles, and I wasted one hour discussing this toy example. Please do not let our employer know what we are doing. ↩
For the easiest example, consider a Stan model with data y and parameters mu, sigma:
y ~ normal (mu, sigma);
For some reason, we have already fitted mu from a different module or a different dataset, and obtained draws $\mu_1, \dots, \mu_S$. The goal is to make inference on $p(\sigma \vert y, \mu_1, \dots, \mu_S)$.
To be clear, by now we have already lost full Bayesianity, since we do not fit a joint model. But hey, we are inclusive of non-Bayesian methods.
There are three seemingly reasonable approaches for the second-stage model:
Multiple imputation. We run the model y ~ normal (mu[i], sigma);
separately for each $i$ and collect draws from $p(\sigma \vert y, \mu_i)$; we then mix these draws altogether. This is the approach used in multiple imputation (MI) and in Cut.
Plugin estimate. As in two-stage least squares, we simply plug in the first-stage point estimate, say the posterior mean. This amounts to a new model $y \sim normal (\bar {\mu}, \sigma)$, where $\bar {\mu}= \frac{1}{S} \sum_{i=1}^S \mu_i$.
Mixed log likelihood. At least seemingly doable: we may also mix the log densities over these draws, which in Stan reads
for (i in 1:S)
  target += 1.0 / S * normal_lpdf(y | mu[i], sigma);
In this model example, the mixed-log-likelihood approach is identical to the plugin estimate, although in general all three methods differ. Using the conditional variance formula, we can see that multiple imputation delivers the largest estimate of $\sigma$.
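A quick sketch of approaches 1 and 2 on simulated data (hypothetical numbers; for simplicity I use the per-draw maximum likelihood estimate of $\sigma$ rather than full posterior draws). Since $\frac{1}{n}\sum_i (y_i-\mu_s)^2 = \frac{1}{n}\sum_i (y_i-\bar y)^2 + (\bar y - \mu_s)^2$, every imputation-specific estimate is at least the plugin one:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.0, 1.0, size=200)

# pretend these are the first-stage posterior draws of mu
mu_draws = rng.normal(y.mean(), 0.3, size=500)

# approach 2: plug in the posterior mean of mu
sigma_plugin = np.sqrt(np.mean((y - mu_draws.mean()) ** 2))

# approach 1: fit sigma separately for each draw mu_i, then mix the fits
sigma_mi_draws = np.sqrt(np.mean((y[:, None] - mu_draws[None, :]) ** 2, axis=0))
sigma_mi = sigma_mi_draws.mean()

# the between-draw variance of mu inflates the multiple-imputation estimate
assert sigma_mi > sigma_plugin
print(sigma_mi, sigma_plugin)
```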
OK, I know that in most cases approach 1 is the only acceptable answer. The justification is straight from the Bayes rule:
\[p(\sigma \vert y) = \int p(\sigma \vert y, \mu) p(\mu \vert y ) d\mu.\]My controversial objection is that the Bayes rule is only relevant if we are running a joint model and inferring $\mu$ and $\sigma$ together. That is SMC. But in a situation like Cut, we are placing doubt on the model in the first place, and keeping the obsession with this Bayes rule seems a little bit stubborn to me.
Approach 1 and approach 3 differ in how they mix the conditional sampling model $p(y \vert \sigma, \mu)$. Approach 1 uses a mixture (coherent with the joint model)
\[p(y \vert \sigma) := \int (p(y \vert \sigma, \mu) p(\mu \vert \sigma) ) d\mu,\]while approach 3 is using log-linear-pooling (this line does not correspond to any joint model):
\[\log p(y \vert \sigma) := \int \log p(y \vert \sigma, \mu) p(\mu \vert y) d\mu + \mathrm{Constant}.\]I wonder if this approach 3 has any actual application. I do not know.
Suppose that we observe a sequence of data $y=\{y_i \in R^m \mid 1\leq i \leq n\}$ coming independently from an unknown distribution $p_{true}$; we would like to evaluate a forecast given by a probability density function $p(y)$. For example, we may use the logarithm score $\sum_{i=1}^n \log p(y_i)$ to assess the forecast.
But what if the predictive pdf is only known up to a multiplicative constant? That is, we are only able to evaluate the unnormalized density $q(y) = p(y)/ c$. In a typical task of parameter inference, model selection, or model averaging, we are given a set of unnormalized forecasts indexed by $\theta$: $\{q_\theta(\cdot) \mid \theta \in \Theta\}$, where each element $q_\theta(\cdot)$ is a non-negative function on $R^m$ whose normalizing constant $c(\theta) = \int_{R^m} q_\theta(y) d y$ is unknown.
Since the seminal work of Hyvarinen (2005), score matching has been a powerful tool for evaluating unnormalized predictions. The main idea is that the normalizing constant disappears when we look at the gradient of the log unnormalized density. The “gradient of log” of the pdf is often known to statisticians as the score function. We measure the difference between the score functions of the true data-generating process $p_{true}$ and of the forecast $q_\theta$,
\[D(p_{true}, q_\theta) = \int_{R^m} \Vert \nabla \log p_{true}(y) - \nabla \log q_{\theta} (y) \Vert^2 p_{true}(y) dy,\]hence the name score matching.
In practice, we do not know $p_{true}$; we only observe its samples $y_{1:n}$. A sample estimate of the divergence above is
\[H(y_{1:n}, q_{\theta}) =\frac{1}{n}\sum_{i=1}^n \left(\Vert \nabla_y \log q_{\theta} (y_i) \Vert^2 + 2 \Delta_y \log q_{\theta} (y_i) \right).\]In the larger universe of scoring rules, this $H(y_{1:n}, q_{\theta})$ is known as the Hyvarinen score. In the limit as the sample size $n \to \infty$, this sample estimate converges in the sense that $H(y_{1:n}, q_{\theta}) \to D(p_{true}, q_\theta)$ + Constant, where the constant does not depend on $q_{\theta}$.
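As a sanity check, here is a finite-difference sketch (my own toy code) of the sample score $\frac{1}{n}\sum_i (\Vert \nabla \log q(y_i)\Vert^2 + 2\Delta \log q(y_i))$ in one dimension; the multiplicative constant indeed drops out:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=2000)

def hyvarinen(log_q, y, eps=1e-4):
    """H = mean( (d/dy log q)^2 + 2 d^2/dy^2 log q ), by central finite differences."""
    grad = (log_q(y + eps) - log_q(y - eps)) / (2 * eps)
    lap = (log_q(y + eps) - 2 * log_q(y) + log_q(y - eps)) / eps**2
    return np.mean(grad**2 + 2 * lap)

log_q = lambda t: -0.5 * t**2               # unnormalized N(0, 1): q = exp(-y^2 / 2)
log_cq = lambda t: np.log(7.3) + log_q(t)   # the same density times a constant c = 7.3

print(hyvarinen(log_q, y), hyvarinen(log_cq, y))  # identical: the constant drops out
```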
There are three levels of unnormalized models in Bayesian statistics.
In classical parameter inference, the posterior density of a parameter is typically given in an unnormalized form: $p(\theta\vert y) \propto p(y\vert \theta) p(\theta)$, where the normalizing constant is the marginal likelihood $\int p(y\vert \theta) p(\theta) d \theta = p(y)$. For the purpose of Bayesian computation, this normalizing constant is irrelevant in MCMC, variational inference, or importance sampling. Notably, with posterior draws $\theta_1, \dots, \theta_S$, the posterior predictive distribution is tractable and appropriately normalized,
\[p(\tilde y \vert y) = \int p(\tilde y \vert \theta)p(\theta \vert y) d \theta \approx \frac{1}{S} \sum_{s=1}^S p(\tilde y \vert \theta_s).\]Sometimes we only know the posterior predictive density up to a constant. For example, in the modern literature on calibration, we may address the potential overconfidence of a prediction via tempering, such that
\[p(\tilde y\vert y, \lambda)= \frac{1}{z(\lambda)} p(\tilde y \vert y)^\lambda, ~ z(\lambda)= \int p(\tilde y \vert y)^\lambda d \tilde y.\]Intuitively, a smaller $\lambda \in (0,1)$ flattens the prediction, resulting in less confidence. The Hyvarinen score still applies.
The likelihood itself may also be intractable, meaning we are only able to evaluate $q(y\mid \theta) \propto p(y\mid \theta),$ while the pointwise normalizing constant $z(\theta)= \int q(y\mid \theta) dy$ is unknown. This type of model is often called doubly intractable. For example, in the alpha-likelihood, the likelihood function is
\[p(y\vert \theta, \lambda) \propto p(y\vert \theta)^ \lambda, ~ z(\lambda, \theta)= \int p(y\vert \theta)^ \lambda d y.\]Aside from how to sample from a doubly intractable model, even if we do obtain posterior draws $(\theta_1, \lambda_1), \dots, (\theta_S, \lambda_S)$, this time the posterior predictive distribution is a mixture of unnormalized densities:
\[p(\tilde y\vert y)= \int p(\tilde y\vert \theta, \lambda) p( \theta, \lambda \vert y) d\theta d\lambda \approx \frac{1}{S} \sum_{s=1}^S \frac{1}{z(\lambda_s, \theta_s)} p(\tilde y\vert \theta_s)^{\lambda_s}.\]The Hyvarinen score does not apply to a mixture/summation of unnormalized densities. It is clear that the score function is not invariant under this operation:
\[\nabla \log \left(\sum_{i=1}^S c_if_i(y) \right) \neq \nabla \log \left(\sum_{i=1}^S f_i(y)\right).\]“Gradient of log” is a great operator because it throws away normalizing constants. That is, for any positive constant $c$ and any continuous density function $p(y)$,
\[{\frac{d}{dy} \log} ( {c} p(y) ) = {\frac{d}{dy} \log} ( p(y) ).\]But what if we now want to evaluate a sum of unnormalized functions, $\sum_{i=1}^S p_i(y)$?
Does there exist a non-trivial operator, #, a mapping from $R^R$ to $R^R$, such that
\[{\color{red} \#} ( \sum_{i=1}^S {\color{orange}c_i} p_i(y) ) = {\color{red} \#} ( \sum_{i=1}^S p_i(y) ).\]The answer is negative for any $S\geq 2$.
A heuristic proof is that we can write any function as a Taylor series expansion, a sum of terms with coefficients. An operator satisfying the property above would be invariant to all these coefficients, and would therefore evaluate any two predictions to be the same. That is not useful.
Score matching is a useful tool for evaluating unnormalized models. The Hyvarinen score applies to a tempered mixture, but does not apply to a mixture of tempered densities, or to any doubly-intractable Bayesian prediction.
Furthermore, we can mathematically prove that no operator exists that matches a mixture of unnormalized densities.
Here is the quiz: observing a single draw $x$ from a distribution $p$, how do we construct an unbiased estimate of $1/\mathrm{E}[x]$? You might want to think about this quiz before reading my solution.
Apart from the mathematical fun, this type of problem comes up in stochastic approximation, in which we need an unbiased estimate using a very small number of Monte Carlo draws. The unbiasedness here means that the sampling step will be repeated many times, but each time you are only shown one sample point $x$, and we wish the estimate to be unbiased under repeated sampling.
Here, the obvious wrong answer is to use $1/x$. You can try
n=10000
x=rbeta(n,2,2)
mean(1/x)
It is clear that $\mathrm{E} [1/x]= 3$ while our desired quantity is $1/\mathrm{E} [x]= 2$. Indeed, by Jensen’s inequality, $\mathrm{E} [1/x] > 1/\mathrm{E} [x]$ for any non-degenerate positive $x$.
How about some Taylor series expansion? Something like $1/\mathrm{E}[x] = 1- (\mathrm{E}[x]-1) + O((\mathrm{E}[x]-1)^2)$? It is legitimate, but then you get some crude approximation $2-x$, provided that I believe $\mathrm{E}[x] \approx 1$, which we typically do not know in the first place.
I found one solution from rejection sampling. The idea is that self-normalized importance sampling is only unbiased asymptotically, while rejection sampling is unbiased even with a Monte Carlo size of 1.
Here is the method. To make it work, I need to know an upper bound of $x$: it has to be a bounded variable, say with upper bound $c$. Each time I see a realization $x$, I independently generate a random number $u$ from uniform $(0,1)$. If $u$ is smaller than $x/c$, I accept and report $1/x$. If $u$ is larger than $x/c$, I do not report any estimate.
Then whenever I report an accepted $1/x$, it is an unbiased estimate of $1/\mathrm{E}[x]$. To verify: $\mathrm{E}[1/x \mid \mathrm{accept}] = \mathrm{E}[(1/x)(x/c)] / \mathrm{E}[x/c] = 1/\mathrm{E}[x]$.
Here is a demo code for $p(x)=$ Beta$(2,2)$:
n=1000000
x=rbeta(n,2,2)
u=runif(n,0,1)
# the upper bound is c = 1 since a Beta variable lives in (0, 1)
unbiased_estimate=rep(NA, n)
unbiased_estimate[u<x]= (1/x)[u<x]
#check the answer:
mean(unbiased_estimate, na.rm = T)- 1/mean(x)
The quantity of interest is the expectation of $x\log x$ under the generalized Pareto distribution. Surely it is finite as long as $k <1$.
I tried for a while and then I was very sure I could not solve it by hand. So I opened a symbolic integration tool, and the result turned out easy:
\[S(k, \sigma)= \frac{\sigma \left( 1-\mathrm{HarmonicNumber}[-2+\frac{1}{k}] - \log(\frac{k}{\sigma}) \right) }{1-k}.\]Except that I do not understand what HarmonicNumber
is. I think it is some special function, so I looked it up. Wikipedia told me that, in mathematics, the $n$-th harmonic number is the sum of the reciprocals of the first $n$ natural numbers:
\[H_{n}=1+{\frac {1}{2}}+{\frac {1}{3}}+\cdots +{\frac {1}{n}}=\sum _{k=1}^{n}{\frac {1}{k}}.\]
Except it is not helpful to me cuz apparently I have a non-integer $n= -2+\frac{1}{k}$ here. I studied complex analysis in college but have never used it since. But that is OK; I trust my symbolic integration tool.
Indeed I only want to evaluate this integral near $k=1$. Because the mean and variance of the generalized Pareto distribution are of order $O((1-k)^{-1})$ and $O((1-k)^{-2}(1-2k)^{-1})$ respectively, my best conjecture is that this $S$ should be $O((1-k)^{-m}), 1\leq m \leq 2$, as $k$ is close to 1.
So I searched one more minute I found that $H_{x}= \frac{\Gamma^\prime(x+1)}{\Gamma(x+1)}+\gamma$, in which $\gamma$ is the Euler constant and $\Gamma$ is the Gamma function.
The appearance of the Euler constant and the Gamma function in applied statistics is like a six-pleat shirring on a shirt: fancy to the wearer but seldom useful to the audience.
It appeared that I needed the derivative of the Gamma function near 0. But I found that $\frac{\Gamma^\prime(x)}{\Gamma(x)}$ is itself called the digamma function $\psi(x)$. OK, I am not proud of being ignorant here, but it is still fun to learn. So I looked up Wikipedia again and found $\psi(x)\approx \log x - \frac{1}{2x}$. I plugged this approximation into my expression, and it did not match my conjecture. Ohh, of course: the $\psi(x)\approx \log x - \frac{1}{2x}$ approximation is only applicable for large $x$. For small $x\approx 0$, I found $\psi(x) \approx -1/x - \gamma$. I plugged this in and the $\gamma$ cancelled out. So the final answer is that $S(k, \sigma) = \sigma k / (1-k)^2 $ + smaller-order terms as $k$ goes to 1. Done.
But this is not why I wrote this post. The point is that sometimes statistical intuition can help with tedious math. To be clear, this math problem is only tedious to me cuz I am ignorant about the digamma and gamma functions; I am sure the problem is trivial to Euler. That said, I have already used statistical intuition once: I knew the order must be between $-1$ and $-2$ because $x\log x$ is bounded between the first and second moments.
Indeed, a more statistically intuitive solution is that I can simply replace the generalized Pareto distribution with a Pareto distribution. This time I can do the integral by hand:
\[S(k,1)\approx \int_{1}^{\infty} r \log r \frac{1}{k} r^{-1/k-1} dr= k(1-k)^{-2}.\]This expression is different but has the same order as what I obtained using the digamma function when $k$ is close to 1, which is enough for many crude approximations. Again, it would be nice if I had known more about the gamma function, but solving tedious math by a simple statistical approximation is equally fun.
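The Pareto version is easy to check numerically (a quick sketch with scipy; $k=0.5$ here is just an arbitrary example value, and the identity holds for any $k<1$):

```python
import numpy as np
from scipy.integrate import quad

k = 0.5
# the Pareto(1/k) expectation of r * log(r), integrated over r in (1, infinity)
integrand = lambda r: r * np.log(r) * (1.0 / k) * r ** (-1.0 / k - 1.0)
value, _ = quad(integrand, 1.0, np.inf)

print(value, k / (1.0 - k) ** 2)  # both equal 2.0 for k = 0.5
```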