<p><em>Yuling Yao’s Blog — Bayesian Statistics, Machine Learning</em></p>
<h1>Note on “model diversity”</h1>
<p><em>2021-04-12</em></p>
<p>In my <a href="https://statmodeling.stat.columbia.edu/2021/01/26/hierarchical-stacking-part-ii/">previous blog post</a> on hierarchical stacking, reader “Chaos” pointed me to Gavin Brown’s Ph.D. thesis on Negative Correlation (NC) learning, which has a good characterization of why diversity matters for stacking and stacking-like approaches.</p>
<p>So I took a look at that thesis. In the NC framework we are combining $K$ point estimates $f_{1}, \dots, f_{K}$</p>
\[f_{ens} (x)=\sum_{k=1}^K w_k f_k(x)\]
<p>and try to minimize the MSE of the ensemble. With uniform weights $w_k = 1/K$, the MSE decomposes as</p>
\[\mathrm{MSE}= \Big(\frac{1}{K}\sum_{i=1}^K \mathrm{E} (f_i(x)-y)\Big)^2 + \frac{1}{K^2} \sum_{i=1}^K \mathrm{Var}(f_i) + \frac{1}{K^2} \sum_{i=1}^K \sum_{j\neq i} \mathrm{Cov}(f_i, f_j).\]
<p>The three terms read</p>
\[\overline{\mathrm{bias}}^2 + \frac{1}{K} \overline{\mathrm{variance}} + \Big(1-\frac{1}{K}\Big) \overline{\mathrm{covariance}},\]
<p>where the bars denote averages over the $K$ ensemble members.</p>
<p>The intuition is that we want to maximize the diversity; perhaps that means minimizing the correlation?</p>
<p>An alternative decomposition is based on the <em>ambiguity</em>. The MSE of the ensemble is</p>
\[\mathrm{MSE}= \mathrm{E} \vert f_{ens} (x)- y \vert ^2 = \sum_{i=1}^K w_i E (f_i(x)-y)^2 - \sum_{i=1}^K w_i E (f_i(x)- f_{ens}(x))^2.\]
<p>Here the term</p>
\[\sum_{i=1}^K w_i E (f_i(x)- f_{ens}(x))^2\]
<p>is called the <em>ambiguity</em>. Compared with the covariance decomposition, the ambiguity involves a single sum and so is more tractable.</p>
<p>That is the basis of negative correlation (NC) learning, in which we train $K$ neural nets. Instead of minimizing the individual error $\mathrm{E} (f_i(x)-y)^2$, we minimize the error minus the ambiguity (or plus the covariance):
\(\min\; (f_i-y)^2 + \lambda (f_i - f_{ens}) \sum_{j\neq i} (f_j -f_{ens}).\)</p>
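As a sanity check, the ambiguity decomposition above is an exact algebraic identity for any simplex weights. Here is a minimal numerical sketch with simulated member predictions (the data and weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 5
y = rng.normal(size=n)                       # targets
f = y + rng.normal(scale=0.5, size=(K, n))   # K member predictions
w = rng.dirichlet(np.ones(K))                # simplex weights (sum to 1)

f_ens = w @ f                                # weighted ensemble prediction

mse_ens = np.mean((f_ens - y) ** 2)
avg_err = np.sum(w * np.mean((f - y) ** 2, axis=1))        # weighted individual error
ambiguity = np.sum(w * np.mean((f - f_ens) ** 2, axis=1))  # weighted ambiguity

# ensemble MSE = weighted individual error - ambiguity, exactly
assert np.isclose(mse_ens, avg_err - ambiguity)
```

The identity holds pointwise, not just in expectation, since the cross term $\sum_i w_i (f_i - f_{ens})$ vanishes by construction.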
<p>Certainly the theoretical results derived there are brilliant. But there are a few reasons why we did not stop here.</p>
<p>First, the bias-variance trade-off only applies to MSE, while we want to quantify the ensemble richness with respect to any pre-specified utility. Of course, some Jensen-type inequality still holds.</p>
<p>Second, also because of the central role of MSE here, the ambiguity term does not apply to combining predictive distributions. We do not even have a concept of correlation there: say we have two random variables with $\mathrm{Corr}(x_1, x_2)=0.5$; what then is $\mathrm{Corr}(N(x_1,1) , N(x_2,1))$?</p>
<p>Third, this ambiguity term is only a good summary of the diversity when it stands next to the individual bias. Standing alone, the term is independent of the data (kind of like why we would attack PCA for ignoring $y$).</p>
<p>These reasons are why we propose a new metric in our hierarchical stacking paper: how often an individual model wins.</p>
<h1>I failed an interview for being Bayesian</h1>
<p><em>2021-04-07</em></p>
<p>I got interview feedback from a company. I initially thought my interviews went well, but it turned out that the company had a different opinion. Generally, it would be silly for me to post every job rejection. However, this particular story is special because (a) the process was quite lengthy, including 9 rounds, each nearly an hour long, and, more importantly, (b) as I am now informed, the main problem they had with me was that</p>
<blockquote>
<p>“You were not open-minded to solutions and first principles to the problem other than the ones that you are comfortable with.”</p>
</blockquote>
<p>I recall that during the interview there were a few case studies on data analysis. I proposed a complete workflow: how to design the experiment, how to make causal and decision-theoretic adjustments, how to build models, how to regularize, how to compute and approximate, and how to improve the model. This is what I do in applied data analysis all the time.</p>
<p>It turned out the interviewers were expecting some “first principles” such as</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if(prediction problem)
run a neural net;
if(causal inference)
run a t-test;
if(model evaluation)
compute AIC;
if(decision theory is involved)
flip a coin;
</code></pre></div></div>
<p>Hmm, if these magical simplifications are what I was not open-minded to, it is hard for me to feel regret.</p>
<p>Apart from reminding me of a paper review I once received (“Because all MCMC methods are not scalable to big data, your new development therein is not interesting”),
the feedback above might also suggest that this company itself is not open-minded to candidates who are capable of solving certain problems in ways that are unfamiliar to the company but might have been proven successful elsewhere. To be fair, this we-are-hiring-people-who-have-certain-skills-but-nothing-more attitude makes sense in business: these companies typically have a comprehensive pipeline for problem solving, and an entry-level employee should really focus on implementing the given pipeline. I blame the Neanderthal inside me for not being compatible.</p>
<p>This whole story echoes what Andrew used to say</p>
<blockquote>
<p>Making Bayes inference is the only correct thing to do when you have correct model and correct prior no matter who you are. Being Bayesian means you make Bayes inference anyway even all assumptions are wrong.</p>
</blockquote>
<p>from which you can tell where I got my “<em>not being open-minded</em>” from, if I were forced to accept such an accusation.</p>
<p>In short, I do not regret being assertive/creative in interviews. Or, to put it another way, I do not regret receiving doctoral-level training in applied statistics (in contrast to, say, taking two online courses on “mastering all machine learning and statistics and data science and programming in 14 days”), which grants me that assertiveness and creativity. As Winston Churchill pointed out:</p>
<blockquote>
<p>You failed an interview for proposing your own solutions? Good. That means you’ve stood up for something, sometime in your life.</p>
</blockquote>
<h1>Two approaches for online updates in the election forecast</h1>
<p><em>2021-03-03</em></p>
<p>The term online update here refers to updating a statistical model after some modeled outcomes are observed. A concrete example is election forecasting: state election results come in sequence, and that is when a website has to offer a “real-time update of our prediction”.</p>
<p>It seems there are two ways to do this. Approach 1 is a model-based update, or maybe we should call it a Bayesian update: the model provides a posterior predictive density Pr(CA, NY, …) over outcomes, and an online update becomes a conditional estimate, Pr(CA $\vert$ NY = observed outcome). In practice, we only need to keep the posterior simulation draws that lead to this outcome.</p>
<p>If the outcome is continuous (shares of the vote), the probability that any simulation draw matches the exact observation is zero. We could use some ABC method here, replacing the exact conditional probability by Pr(CA $\mid$ NY $\approx$ observed outcome), equipped with some chosen distance metric.</p>
<p>The problem with the simulation-based approach is that if we observe some tail event, say R winning NY, there is hardly any simulation draw matching this event. And even when the observed events are not rare, each update along the way discards some simulation draws. Either way, the update efficiency is limited by the number of simulation draws. A quick approximation is to further approximate the posterior outcome model Pr(CA, NY, …) by a multivariate normal, such that any conditional update has a closed-form solution, at the cost of less modeling flexibility.</p>
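The normal-approximation update above has a standard closed form (condition a multivariate normal on a subset of its coordinates). A minimal sketch; the state names, means, and covariance below are made-up numbers, not real forecasts:

```python
import numpy as np

def conditional_normal(mu, Sigma, obs_idx, obs_val):
    """Closed-form conditional of N(mu, Sigma) given coordinates obs_idx = obs_val."""
    idx = np.arange(len(mu))
    rest = np.setdiff1d(idx, obs_idx)
    gain = Sigma[np.ix_(rest, obs_idx)] @ np.linalg.inv(Sigma[np.ix_(obs_idx, obs_idx)])
    mu_cond = mu[rest] + gain @ (np.asarray(obs_val) - mu[obs_idx])
    Sigma_cond = Sigma[np.ix_(rest, rest)] - gain @ Sigma[np.ix_(obs_idx, rest)]
    return mu_cond, Sigma_cond

# hypothetical joint posterior for (CA, NY) vote shares
mu = np.array([0.52, 0.55])
Sigma = np.array([[0.010, 0.006],
                  [0.006, 0.010]])

# observe NY = 0.60; update the CA prediction in closed form
mu_CA, var_CA = conditional_normal(mu, Sigma, obs_idx=np.array([1]), obs_val=[0.60])
# observing NY above its mean pulls the CA mean up from 0.52 to 0.55,
# and shrinks its variance from 0.010 to 0.0064
```

Unlike the draw-filtering approach, this update never runs out of simulation draws, at the price of forcing joint normality.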
<p>The other way to do this update is a regression approach. Say we have a point prediction for each state, $y_{NY}, y_{CA}, \dots$, and sequentially we observe the actual outcomes, $\tilde y_{NY}, \tilde y_{CA}, \dots$; we could run a regression $\tilde y_{i} = \beta_{1} y_i + \beta_{0} + \epsilon_i, \epsilon_i \sim \mathrm{normal}(0, \sigma).$ The online update task then becomes a standard parameter-update problem as more and more data come in. This approach is compatible with point predictions and has the advantage of adjusting for systematic “polling bias” (think about 2016).</p>
<p>A further question is how to combine these two approaches. In particular, approach 1 (simulation draws) can make use of the posterior correlation between state outcomes. A plausible way is to replace the regression model by</p>
\[\tilde y_{i} = y_i + \beta_0 + \epsilon_i,\]
<p>but instead of the iid residuals in the regression model, this time we model $\epsilon_i$ as coming from a multivariate normal distribution whose correlation is adapted from the posterior predictions,</p>
\[\mathrm{Corr}(\epsilon_i, \epsilon_j)= \mathrm{Corr}(y_i, y_j).\]
<p>The extra $\beta_0$ term is still the systematic “polling bias” that was not seen by the existing model.</p>
<p>Certainly this new model is not ideal: the first two moments are oversimplified descriptions of a multi-dimensional density; the normal model cannot adapt to heavy-tailed predictions; the tail correlation is not necessarily the same as the overall correlation; etc.</p>
<h1>Career development</h1>
<p><em>2021-03-01</em></p>
<p>Recently I have gone through a few industry job applications. Here are a few examples that manifest how amusing this procedure can become:</p>
<ol>
<li>Company A desk-rejected me for a position I did not apply for: I applied for a full-time position, and I received an email from HR that they would not consider me for an internship (yes, I made the correct application; I checked my receipt), and this HR email was “no-reply”.</li>
<li>Fun fact: company A is famous for AI, ML, DL, and all fancy things. I wonder whether AI safety would be undermined when the basic database procedures are corrupted.</li>
<li>After finishing a very long interview with Company B, I received an email from HR saying the interview had been rearranged to another day. It turned out that in the first three minutes the interviewer had a connection issue and contacted HR. The interviewer managed to reconnect, but HR did not know.</li>
<li>Fun fact: company B is famous for AI, ML, DL, and all fancy things. I guess at least I could trust their differential privacy.</li>
<li>Company C asked effectively identical interview questions in two subsequent sessions. I cannot help but repeat the fact that company C is famous for AI, ML, DL, and all fancy things.</li>
<li>I am not extremely thrilled by the fact that most tech companies require no more statistical knowledge than the level of the first three chapters of “Regression and Other Stories”; not that it is not a good book. But I guess an MD would be depressed too if they were only examined on their babysitting skills.</li>
<li>To be fair, finance companies often had much harder math problems, until I learned afterwards that many, if not all, of their problems were from one published book, and this book is sold on Amazon. We would certainly eliminate all overfitting issues in machine learning if all test data were sold on Amazon.</li>
<li>Many tech companies desk-rejected me. I was later informed that an insider referral is often necessary for a resume to be picked from the pile in the first place. I guess a referral is not always bad: not everyone can afford a PPO plan.</li>
<li>It is like running a linear regression with covariate dimension 1e6, so we have to do some pre-screening, and we do so by sequentially picking the variables that have high insider correlation with the currently-selected variables. Don’t ask what insider correlation means. I don’t even know what it means in a human context.</li>
</ol>
<p>Well, let me take a step back. People make mistakes. I mess things up a lot in my work, if not always. So I should not laugh at those inconveniences I encountered. But the cynical part of me, which gets amplified after all these, has the following cynical advice for future job seekers. An ideal candidate in the modern job market should have an ivy degree so the hiring manager will be happy. However, during graduate school, instead of going to lectures, they should spend as much time as possible on</p>
<ol>
<li>building linkedin connections,</li>
<li>reading first 3 chapters of RAOS,</li>
<li>practicing leetcode,</li>
<li>buying that hedge fund book from Amazon,</li>
<li>praying that the company’s email database would work,</li>
<li>Besides, don’t do anything else. Weeeee!</li>
</ol>
<h1>Four open questions on ensemble methods</h1>
<p><em>2021-03-01</em></p>
<p>In a recent paper, I discussed a few open questions on ensemble methods:</p>
<blockquote>
<ol>
<li>Both BMA and stacking are restricted to a linear mixture form, would it be beneficial to consider other aggregation forms such as convolution of predictions or a geometric bridge of predictive densities?</li>
<li>Stacking often relies on some cross-validation, how can we better account for the finite sample variance therein?</li>
<li>While stacking can be equipped with many other scoring rules, what is the impact of the choice of scoring rule on the convergence rate and robustness?</li>
<li>Beyond current model aggregation tools, can we develop an automated ensemble learner that could fully explore and expand the space of model classes—for example, using an autoregressive (AR) model and a moving-average (MA) model to learn an ARMA model?</li>
</ol>
</blockquote>
<p>I think they are all important directions!</p>
<h1>A Bayesian reflection of “Invariance, Causality and Robustness”</h1>
<p><em>2021-02-23</em></p>
<p>I was reading Peter Bühlmann’s <em>Statistical Science</em> article <a href="https://arxiv.org/abs/1812.08233">“Invariance, Causality and Robustness”</a>. To be fair, he gave a short course in 2020 here at Columbia, but after reading this paper I guess I did not totally understand his lecture back then.</p>
<h2 id="going-beyond-the-potential-outcome">Going beyond the potential outcome</h2>
<p>The motivation to depart from the potential-outcome framework is that in open-ended observational data gathering we do not know which input is the treatment and which is a covariate. Or, to put it another way, rather than finding the effect of the cause, Bühlmann investigates the cause of the effect. Denote by $X$ all input variables and by $Y$ the outcome; one goal is to understand which variables in $X$ actually cause/causally impact $Y$.</p>
<p>Bühlmann defines causality as invariance under different environments. Assume an unconfounded situation, or equivalently that we have collected all possibly relevant input variables $X$. Per Bühlmann’s definition, we collect data from various environments (perturbations of the marginal of $x$, different countries, different experimental designs, etc.), and the $(x, y)$ relation that remains unchanged across all environments is then the causal one.</p>
<p>For a concrete example, consider the input $x$ to be a gene sequence of length 1000, $y$ some protein expression, and data collected from 10 “experiments”. We denote the data by $(x_{ijd}, y_{ij})$, $i=1,\dots, n$, $j=1, \dots, 10$, $d=1,\dots,1000$, so $i, j, d$ are indexes for data, environment, and covariate dimension respectively.</p>
<p>Bühlmann folds it into a multiple-testing problem: seek a subset $\mathcal{S}\subseteq \{1, \dots, 1000\}$ such that the conditional distribution of $y\mid X_\mathcal{S}$ is invariant, i.e.,</p>
\[y_{i j}= \sum_{d\in \mathcal{S}}\beta_{d} x_{ijd} + \mathrm{iid~noise}, ~~\forall j.\]
<p>Given a subset $\mathcal{S}$, we can certainly test this hypothesis by testing whether the regression coefficients are invariant across environments. To find this subset $\mathcal{S}$ (the true cause), we can test the hypothesis over all $2^{1000}$ subsets, which needs to be done through the lens of multiple testing. This is the basis of Bühlmann’s method.</p>
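A toy simulation (my own construction, not from the paper) shows what the invariance test exploits: regressing $y$ on its true parent gives a slope that is stable across environments, while regressing on a non-causal input (here, a noisy child of $y$) gives slopes that drift as the environment changes:

```python
import numpy as np

rng = np.random.default_rng(1)

def per_env_slope(x, y):
    # OLS slope of y on a single input (with intercept)
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

slopes_cause, slopes_noncause = [], []
for j in range(10):                                    # ten environments
    n = 1000
    x1 = rng.normal(loc=j, size=n)                     # causal input; its marginal shifts per environment
    y = 2.0 * x1 + rng.normal(size=n)                  # invariant mechanism y <- x1
    x2 = y + rng.normal(scale=0.2 * (j + 1), size=n)   # child of y; its link to y varies per environment
    slopes_cause.append(per_env_slope(x1, y))
    slopes_noncause.append(per_env_slope(x2, y))

# the conditional y | x1 is invariant: slopes cluster tightly around 2;
# the conditional y | x2 is not: slopes drift with the environment
assert np.std(slopes_cause) < np.std(slopes_noncause)
```

In Bühlmann's framework the stable slope flags $x_1$ as the causal parent; the drifting slope rules out $x_2$.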
<h2 id="bayesian-adaptation">Bayesian adaptation</h2>
<p>From a Bayesian perspective, if I collect data from 10 “environments”, a more natural way is to fit a hierarchical regression with all inputs included,</p>
\[y_{i j}= \sum_{d=1}^{1000} \beta_{jd} x_{ijd} + \mathrm{iid~ noise}, ~~ \beta_{j d}\sim\mathrm{normal}( \mu_{d}, \sigma_d), ~~\mathrm{other~priors}.\]
<p>Now if the true data-generating (DG) process is really the reduced form (i.e., there exists a sparse subset $X_{\mathcal{S}}$ of $y$’s parent nodes), then with a large sample size we should expect</p>
\[\sigma_d \to 0, ~~\forall d\in \mathcal{S}.\]
<p>To put it another way, we can interpret $\sigma_d$ as the extent to which the finding can be <em>transported</em> to other environments; therefore, $\sigma_d = 0$ means causality.</p>
<p>The nice thing of the Bayesian framework is</p>
<ol>
<li>We avoid multiple testing.</li>
<li>Generalization to non-linear models is straightforward.</li>
<li>In practice we will never observe $p (\sigma_d\mid y)= \delta(0)$. That said, we can place a horseshoe prior on $\sigma_1, \dots,\sigma_{1000}$ to enforce sparsity. I almost want to call this model an <em>automated cause finding</em>.</li>
</ol>
<p>An alternative Bayesian approach is some predictive projection (post-inference variable selection): after obtaining the full posterior $p(\sigma \mid y)$, we project it onto the subspace $\sigma_{d}=0$ for some $d$.</p>
<h2 id="other-thoughts">Other thoughts</h2>
<ol>
<li>There is <em>generally</em> no contradiction between prediction and causality. A good prediction requires knowing the true DG, and the true DG = causality. Conversely, knowing causality = knowing the true model = robust predictive extrapolation.</li>
<li>But certainly I can come up with a counter-example, which is why I used the word <em>generally</em>. Think about an example in which we know the true DG is $x_1 \to y$: $x_1$ is a car’s speed and $y$ is the distance it moves in 1 minute. But suppose we do not know physics, so we also collect all other inputs, such as the color, the weight, the number of wheels, … from many observed cars. Now if we do the automated cause finding, we will get $y=\beta_1 x_1$, the correct physical finding, and all other variables are irrelevant. But in principle we cannot measure the speed exactly, so we always have some measurement error. Let’s say $x_1$ = speed + normal(0, 0.01) and $x_2$ = speed + normal(0, 0.02). To improve the prediction, we would like to include both $x_1$ and $x_2$ in the linear model, at the cost of polluting the causal finding.</li>
<li>The framework of invariance = causality = true DG is a good justification for why we often propose <em>fake data simulation</em> in Bayesian modeling, even though we seldom make reference to causality. Fake data simulation = generating (fake) data from various environments, and the fitted model being desirable = invariance under environments.</li>
<li>One conceptual challenge is how to divide inputs and environments (analogous to dividing between covariates and treatment in the PO approach). If I collect surveys from 10 countries, should I view country as an input (a possible treatment/cause) or as an environment? So I guess after all there is some human decision here.</li>
</ol>
<h1>What is the optimal design of regression covariates?</h1>
<p><em>2021-02-23</em></p>
<p>Depends who you ask.</p>
<p>To be specific, assume the observations are $(x, y)$ pairs, $x\in [-1,1]$, and the model is</p>
\[y= \beta x + \mbox{normal}(0,\sigma).\]
<p>Suppose we also know that in the population of interest, the input is $x\sim$Uniform[-1,1].</p>
<p>Depending on who you ask, there are three goals to optimize:</p>
<h2 id="experiment-design-or-an-m-closed-view">Experiment design or an M-closed view</h2>
<p>The objective is to minimize the variance of $\hat \beta$, which is achieved by placing all the mass of $x$ at the two endpoints:</p>
\[x\sim .5 \delta(-1) + .5 \delta(1).\]
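For the no-intercept model, $\mathrm{Var}(\hat\beta) = \sigma^2/\sum_i x_i^2$, so pushing mass to the endpoints maximizes $\sum_i x_i^2$. A quick Monte Carlo sketch (the sample size and number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, beta = 100, 1.0, 1.5

def var_beta_hat(x, reps=4000):
    # Monte Carlo variance of the no-intercept OLS slope under a fixed design x
    est = []
    for _ in range(reps):
        y = beta * x + rng.normal(scale=sigma, size=n)
        est.append(x @ y / (x @ x))
    return np.var(est)

x_endpoint = rng.choice([-1.0, 1.0], size=n)   # mass on the two endpoints
x_uniform = rng.uniform(-1, 1, size=n)         # uniform over [-1, 1]

# the endpoint design gives the smaller variance of beta-hat
assert var_beta_hat(x_endpoint) < var_beta_hat(x_uniform)
```

The endpoint design gives $\sum_i x_i^2 = n$, its maximum possible value on $[-1,1]$, while the uniform design gives roughly $n/3$.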
<h2 id="active-learning--or-an-m-complete-view">Active learning or an M-complete view</h2>
<p>Now, what if the model is not correct? Isn’t it reckless that we do not even try to look around $x=0$? Adopting a covariate-shift perspective, we want to reweight the model to reflect the difference between the training and testing $x$, $w_i=\frac{p_{te}(x_i)}{p_{tr}(x_i)}$, and we would like to minimize the variance of the weighted OLS estimate: <code class="language-plaintext highlighter-rouge">target += w[i] * log_lik[i]</code>. Clearly, placing $x$ only on $-1$ and $1$ is bad: the importance weight $w$, and thereby the weighted estimate $\hat \beta$, would have infinite variance.</p>
<p>We can work out the optimal design</p>
\[p_{train}(x) \propto \vert x\vert.\]
<p>A further complication is the self-normalization of $w_i$.</p>
<h2 id="causal-inference--or-an-m-open-view">Causal inference or an M-open view</h2>
<p>In the derivation above we still need some belief model (i.e., we do not trust the linear model, so we use weighted OLS, but to derive the optimal design we have to approximate the true DG by this linear model anyway). In a model-free/minimal-assumption case, we would like to hedge against some minimax risk: what if the true model is $y= x + \mbox{normal}(0,0.1)$, except that inside the interval $x \in [-0.01,0.01]$, $y=100 x$? You never know.</p>
<p>The most conservative design is to minimize the variance of the importance weight, which is 0 and is achieved at</p>
\[p_{train} =p_{test} = \mbox{Uniform} [-1,1].\]
<p>In reality we have more complex models than a linear regression. Sadly, the literatures have developed somewhat separately in these areas. In sequential data gathering it can be more complicated still: we may adaptively change the design based on how the model fits the existing data.</p>
<h1>Best point mass approximation</h1>
<p><em>2021-02-20</em></p>
<p>It comes up a lot that we summarize a continuous distribution (often a posterior distribution of a parameter or a prediction) by a point mass (or a sharp spike), for (1) computation or memory cost, (2) easier communication, (3) being part of a larger approximate computation, such as in EM or dropout, or a combination of all of these.</p>
<p>How do we justify the choice of point summary? Here is a short list:</p>
<ol>
<li>The mean of posterior density minimizes the $L^2$ risk.</li>
<li>The mode of the posterior density minimizes the KL divergence to it. Well, this needs more explanation. Assume $p(\theta)$ is some continuous density on the real line with respect to the Lebesgue measure $\mu(\theta)$, while a point mass $\delta(\theta_0)$ is defined with respect to the counting measure $\mu_c(\theta)$; but defining a KL divergence does require the same measurable space. To overcome this ambiguity we can either (a) equip the real line with the summed measure $\mu_c(\theta)+ \mu(\theta)$, in which case the KL divergence reads $C - \log p(\theta_0)$, minimized at the mode, or (b) view $\delta(\theta_0) \approx \mathrm{normal}(\theta_0, \tau)$ for a very small but fixed $\tau$, in which case the KL divergence is also approximately $C(\tau) - \log p(\theta_0)$, again minimized at the mode. Put another way, the MAP is always the spiky variational-inference approximation to the exact posterior density.</li>
<li>The Wasserstein metric is different: it is legitimate to consider the Wasserstein metric between a point mass and a continuous density. The posterior median minimizes the Wasserstein metric of order 1, and the posterior mean minimizes the Wasserstein metric of order 2. Likewise, we can argue that the posterior median/mean is the variational-inference approximation to the exact posterior density under the Wasserstein metric.</li>
</ol>
<h1>Measuring extrapolation</h1>
<p><em>2021-02-19</em></p>
<h2 id="cramérrao-lower-bound">Cramér–Rao lower bound</h2>
<p>I will not call myself a theoretical statistician, but sometimes I still find mathematical statistics amusing, especially when it has practical implications. To start this blog post, I will begin with the Cramér–Rao lower bound: given a statistical model with likelihood $p(y|\theta)$, if $\hat \theta_n:= \hat\theta_n (y_1, \dots, y_n)$ is <strong>any</strong> regular unbiased estimate from $n$ iid data, then the asymptotic variance of $\hat\theta_n$ is lower bounded by the inverse Fisher information:</p>
\[Var(\hat \theta_n) \geq \frac{1}{I(\theta)} = \frac{1}{n Var_{\theta}(S(\theta, y))}.\]
<p>where $S(\theta, y_i)= \frac{\partial}{\partial \theta} \log p(y_i \mid \theta)$ is the score function: the gradient of the log likelihood, or the pointwise <code class="language-plaintext highlighter-rouge">grad_log_prob</code> in Stan.</p>
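As a concrete instance, consider the simplest possible model, $y_i \sim \mathrm{normal}(\theta, 1)$: the score is $y - \theta$, the per-observation Fisher information is 1, so the bound is $1/n$, and the sample mean attains it. A quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, theta, reps = 50, 0.0, 20000

# Model: y ~ normal(theta, 1). Score: d/dtheta log p(y|theta) = y - theta.
# Per-observation Fisher information I_1 = Var(y - theta) = 1,
# so the Cramér–Rao bound for any unbiased estimator is 1 / (n * I_1) = 1/n.
estimates = rng.normal(loc=theta, size=(reps, n)).mean(axis=1)  # MLE = sample mean
mc_var = estimates.var()

cr_bound = 1.0 / n
# the MLE attains the bound, up to Monte Carlo error
assert abs(mc_var - cr_bound) < 0.1 * cr_bound
```

For less friendly models the bound is still $1/(n I_1(\theta))$, but no unbiased estimator need attain it at finite $n$.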
<p>One clear limitation is that this bound only applies when $\hat \theta_n$ has finite variance and when $I(\theta)$ is positive definite. But in practice many estimates may still be useful even if they do not have finite variance. Put another way, we want an estimate to have a smaller asymptotic variance for a quicker convergence rate, but an estimate might still converge in a practical amount of time even if the CLT does not hold.</p>
<h2 id="semiparametric-efficiency">Semiparametric efficiency</h2>
<p>A more profound version of the Cramér–Rao lower bound is semiparametric efficiency. For convenience we now view the parameter $\theta$ as a $d$-dimensional vector, although we could extend to infinite-dimensional spaces too. We also assume there is a true $\theta_0$ which generates all data. Now we consider a regular, asymptotically linear estimate, such that</p>
\[\sqrt {n}(\hat \theta_n - \theta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n \phi (y_i) + o_p(1).\]
<p>$\phi (y_i)$ is the point-wise <em>influence function</em>. Again, the scaling $\frac{1}{\sqrt n}$ implies we want to apply some CLT.</p>
<p>The main efficiency theory is that the influence function must satisfy</p>
\[\mathrm{E}\big( \phi(y)\, S^T_{\theta} (y, \theta_0)\big)= I_{d\times d}.\]
<p>$S_{\theta} (y, \theta_0)$ is the score function; it has mean zero. We can divide the space into the linear span of $S_{\theta} (y, \theta_0)$ and its orthogonal complement $\Gamma$, because the equation above still holds if we add to or subtract from $\phi(y)$ any element of $\Gamma$. It also implies that the most efficient estimate (in terms of asymptotic variance) lives in the <em>tangent space</em>:</p>
\[\{\phi(\cdot):\phi(\cdot)= B \times S_{\theta} (\cdot, \theta_0)\}.\]
<p>For a regular parametric model, this bound is achieved by the MLE, in which case the coefficient is $B= I^{-1}(\theta)$.</p>
<h2 id="we-love-tight-bounds-as-much-as-hate-extrapolation">We love tight bounds as much as hate extrapolation</h2>
<p>Among many other lines, one of my favorite quotes from Andrew is</p>
<blockquote>
<p>Use the methods that you think will best solve your problem, and stay focused on the three key leaps of statistics: (1) Extrapolating from sample to population;
(2) Extrapolating from control to treatment conditions;
(3) Extrapolating from observed data to underlying constructs of interest. Whatever methods you use, consider directly how they address these issues.</p>
</blockquote>
<p>Among his three categories of extrapolation, (1) and (2) are kind of the same thing, and (3) is a different concept that focuses more on interpretation. For this blog post, I will only talk about the first two.</p>
<p>The running example I have is how extrapolation affects learning. It is also convenient for my blog post that Andrew explicitly uses the phrase “whatever methods”. We can design better and better models, better computation, better features, better neural network architectures, but there is some intrinsic learning bound that no method whatsoever can overcome, for the same reason that the CR bound holds for any estimate. Maybe we shall distinguish the <em>aleatoric</em> and <em>epistemic</em> learning bounds: the latter refers to what we could have learned better if we had a better model or a better algorithm, while the former depends on the configuration of the data: some tasks are just more difficult than others no matter what method we use. If we observe 100 points with input $x\sim$ normal(0,1), then making predictions at $x=0.5$ is clearly easier than at $x=1000$.</p>
<h2 id="out-of-distribution-detection">Out-of-distribution detection</h2>
<p>Out-of-distribution in the recent ML literature means that we have some training data (e.g., all existing road data from Tesla users in CA), and when we have a new application (driving on the moon), we should tell that this is beyond the training domain and be cautious about prediction over-confidence.</p>
<p>I do not like the term “out-of-distribution”, for in many situations the new test data is a given point, not a distribution (a linked but different task from covariate adaptation). I think a better name would be “out-of-support”. But that is also misleading, as we can always say that 10000 is literally in the support of normal(0,1), although too rare to be seen in the training data.</p>
<p>The real concept is how much extrapolation is harmful enough that we should stop, given the amount of training data we have. We tend to have method-specific, or epistemic, tools to detect this extrapolation level: when we do matching, we have various cute statistics to compute the covariate imbalance; when we run a regression, we could use the model-based posterior variance of the estimate as a proxy (think about a Gaussian process, in which we have larger uncertainty outside the data domain).</p>
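<p>The Gaussian process proxy can be made concrete: with a squared-exponential kernel, the posterior predictive variance is small near the training inputs and reverts to the prior variance far away. A toy sketch (hypothetical unit-scale kernel and five training points; a plain Gaussian elimination solver, since the system is tiny):</p>

```python
import math

xtrain = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical training inputs

def k(a, b):
    # squared-exponential kernel, unit length-scale and unit variance (assumed)
    return math.exp(-0.5 * (a - b) ** 2)

K = [[k(a, b) + (1e-6 if i == j else 0.0) for j, b in enumerate(xtrain)]
     for i, a in enumerate(xtrain)]

def solve(A, b):
    # Gaussian elimination with partial pivoting; fine for this 5x5 system
    A = [row[:] for row in A]
    b = b[:]
    m = len(b)
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    out = [0.0] * m
    for r in range(m - 1, -1, -1):
        out[r] = (b[r] - sum(A[r][c] * out[c] for c in range(r + 1, m))) / A[r][r]
    return out

def posterior_var(xstar):
    # GP posterior predictive variance: k(x*, x*) - k*' K^{-1} k*
    kv = [k(xstar, a) for a in xtrain]
    alpha = solve(K, kv)
    return k(xstar, xstar) - sum(kvi * ai for kvi, ai in zip(kv, alpha))

print(posterior_var(0.5), posterior_var(10.0))
```

<p>Inside the data range (x=0.5) the posterior variance is tiny; far outside (x=10) it is essentially the prior variance 1, which is exactly the “larger uncertainty outside the data domain” proxy.</p>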
<h2 id="extrapolation-as-a-description-of-data">Extrapolation as a description of data.</h2>
<p>But extrapolation is a description of the data configuration, and it will jeopardize “whatever method”. It is like the sample mean of $y$: we do not report two different sample means for matching and regression.</p>
<p>Return to the CR-bound, when we apply this bound to causal estimate, it is straightforward to obtain the asymptotic lower bound of any regular estimate of the local causal effect $\hat \mu $ by</p>
\[nVar (\hat \mu) \geq
E[ \frac{\sigma_{c}^2 (X)}{1-e(X)} + \frac{\sigma_{t}^2 (X)}{e(X)} ]\]
<p>$e(x)$ is the propensity score at $x$, and $\sigma_{t}^2 (X)$ and $\sigma_{c}^2 (X)$ are the conditional variances in the treated and control groups. It is clear how the variance is amplified if some $e(x)$ is close to 1 or 0. It is tempting to measure the intrinsic covariate extrapolation by this expectation.</p>
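<p>A quick Monte Carlo sketch of this expectation (a hypothetical setup: $X\sim$ normal(0,1), logistic propensity scores $e(x)=\mathrm{logit}^{-1}(\beta x)$, and $\sigma_t^2=\sigma_c^2=1$) shows how a steeper propensity model—worse overlap—blows the bound up:</p>

```python
import math
import random

random.seed(1)

def bound(slope, n=100_000):
    # Monte Carlo estimate of E[ sigma_c^2/(1-e(X)) + sigma_t^2/e(X) ]
    # with X ~ normal(0,1), e(x) = inverse-logit(slope * x), sigma_t = sigma_c = 1
    total = 0.0
    for _ in range(n):
        xv = random.gauss(0, 1)
        e = 1.0 / (1.0 + math.exp(-slope * xv))
        total += 1.0 / (1.0 - e) + 1.0 / e
    return total / n

print(bound(0.5), bound(3.0))
```

<p>In this logistic-normal setup the expectation is available in closed form, $E[2+2\cosh(\beta X)]=2+2e^{\beta^2/2}$, so going from $\beta=0.5$ to $\beta=3$ inflates the bound by roughly a factor of 40.</p>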
<h2 id="a-tighter-bound-via-k-hat">A tighter bound via k hat</h2>
<p>A nice property is that the right hand side is always finite—the importance weight has expectation 1. But that is misleading: applying the CR inequality already presumes a finite second moment of the score function, which is not necessarily true in this importance sampling application.</p>
<p>Loosely speaking, the order is more important than the constant. Zuckerberg probably (I guess, I do not know for sure) does not envy Bezos even though his net wealth is a fraction of the latter’s, as they share a similar growth rate; but the Rothschild family probably (again I guess) feels salty when they look into how the new money’s growth rate can easily beat their big-O constant. The point is that when the CLT holds, we have the square-root convergence rate. But really the extrapolation is more harmful when we do not even have a CLT for any method.</p>
<p>A tighter bound is to measure the $\hat k$ of the ratios $\frac{1}{1-e(X)}$ and $\frac{1}{e(X)}$ (detailed reasoning will come in a new paper). If $\hat k$ is smaller than 0.5, then we do have finite variance, so we could compare that variance term. If $\hat k>0.5$, this lower bound is useless, and in particular if $\hat k>0.7$, we should not expect any reasonable estimate from any finite sample and any algorithm.</p>
Yuling Yao

The likelihood principle in model check and model evaluation (2020-12-16) http://www.yulingyao.com/blog/2020/likelihood

<p>The likelihood principle is often phrased as an axiom in Bayesian statistics. My interpretation of the likelihood principle reads:</p>
<p>We are (only) interested in estimating an unknown parameter $\theta$, and there are two data generating experiments, both involving $\theta$, with observable outcomes $y_1$ and $y_2$ and likelihoods $p_1(y_1 \vert \theta)$ and $p_2(y_2 \vert \theta)$. If the outcome-experiment pairs satisfy $p_1(y_1 \vert \theta) \propto p_2(y_2 \vert \theta)$ (viewed as functions of $\theta$), then these two experiments and two observations provide the same amount of information about $\theta$.</p>
<p>Consider a classic example. Someone is running an A/B test and is only interested in the treatment effect, and he told his manager that among all n=10 respondents, y=9 saw an improvement (assuming the metric is binary). It is natural to estimate the improvement probability $\theta$ with an independent Bernoulli trial likelihood: $y\sim binomial (\theta\vert n=10)$. Other informative priors can exist but are not relevant to our discussion here.</p>
<p>What is relevant is that later the manager found that the experiment was not done appropriately. Instead of independent data collection, the experiment was designed to sequentially keep recruiting more respondents until $y=9$ are positive. The actual random outcome is n, while y is fixed. So the correct model is $10=n\sim$ negative binomial $(\theta\vert y=9)$.</p>
<p>Luckily, the likelihood principle kicks in, for
binomial_lpmf $(y\vert n, \theta) =$ neg_binomial_lpmf $(n-y\vert y, \theta)$ + constant. Hence no matter how the experiment is done, they yield the same inference.</p>
<p>At the abstract level, the likelihood principle says the information of $\theta$ can only be extracted via the likelihood, not from experiments that could have been done.</p>
<p>For example, in hypothesis testing, the type-1 error is about a hypothetical experiment (e.g., the null is $\theta=0$). A classic example is that one has two scales which return $y\sim$ N$(\theta, 1)$ or N$(\theta, 10000)$ respectively, and which scale is to be used is determined by a coin flip. But even if in one trial one knows the precise scale was used, the hypothesis testing still uses the inflated p-value $p= \Pr(\vert X_{mix}\vert >\vert y \vert)$, where $X_{mix}$ comes from the mixture density $X_{mix} \sim .5 N(0,1)+.5 N(0,10000)$.</p>
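<p>The inflation is easy to compute. A small sketch (hypothetical observation $y=2$; the imprecise scale has variance 10000, i.e., sd 100; the two-sided tail probability via the complementary error function):</p>

```python
import math

def two_sided_p(y, sd):
    # Pr(|N(0, sd^2)| > |y|), via the complementary error function
    return math.erfc(abs(y) / (sd * math.sqrt(2.0)))

y_obs = 2.0  # hypothetical observation
p_precise = two_sided_p(y_obs, 1.0)  # conditioning on the precise scale
p_mix = 0.5 * two_sided_p(y_obs, 1.0) + 0.5 * two_sided_p(y_obs, 100.0)

print(p_precise, p_mix)
```

<p>Conditioning on the precise scale would reject at the 5% level, while the unconditional mixture p-value is above 0.5—the same observation, with opposite conclusions.</p>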
<h2 id="what-can-go-wrong-in-model-check">What can go wrong in model check</h2>
<p>The likelihood is dual-purposed in Bayesian inference. For inference, it is just one component of the unnormalized density. But for model check and model evaluation, the likelihood function is what enables the generative model to generate posterior predictions of $y$.</p>
<p>In the binomial/negative binomial example, it is OK to stop at the inference of $\theta$. But as soon as we want to check the model, we do need to distinguish between the two possible sampling models and decide which variable ($n$ or $y$) is random.</p>
<p>Suppose we observe y=9 positive cases among n=10 trials and the estimated $\theta=0.9$; the likelihoods of the binomial and negative binomial models are</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> y=9
> n=10
> dnbinom(n-y,y,0.9)
0.3486784
> dbinom(y,n, 0.9)
0.3874205
</code></pre></div></div>
<p>Not really identical. But the likelihood principle does not require them to be identical. What is needed is a constant density ratio, and that is easy to verify:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> prob_list=seq(0.5,0.95,length.out = 100)
> dnbinom(n-y,y, prob=prob_list)/dbinom(y,n, prob=prob_list)
</code></pre></div></div>
<p>The result is a constant ratio, $0.9$.</p>
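<p>In fact the constant is exactly $y/n$ for any configuration $(y, n)$, which a quick stdlib check confirms (a Python sketch mirroring R’s dbinom/dnbinom parameterizations):</p>

```python
import math

def dbinom(y, n, p):
    # binomial pmf, matching R's dbinom(y, n, p)
    return math.comb(n, y) * p ** y * (1 - p) ** (n - y)

def dnbinom(fails, succ, p):
    # negative binomial pmf on the number of failures before the succ-th success,
    # matching R's dnbinom(fails, succ, p)
    return math.comb(fails + succ - 1, succ - 1) * p ** succ * (1 - p) ** fails

for y, n in [(9, 10), (3, 30), (50, 61)]:
    for p in [0.2, 0.5, 0.9]:
        print(y, n, p, dnbinom(n - y, y, p) / dbinom(y, n, p))  # always y/n
```

<p>The powers of $\theta$ and $1-\theta$ cancel, leaving $\binom{n-1}{y-1}/\binom{n}{y}=y/n$.</p>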
<p>However, the posterior predictive check (PPC) will have different p-values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> 1-pnbinom(n-y,y, 0.9)
0.2639011
> 1-pbinom(y,n, 0.9)
0.3486784
</code></pre></div></div>
<p>The difference between the PPC p-values can be even more dramatic with other $\theta$:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> 1-pnbinom(n-y,y, 0.99)
0.0042662
> 1-pbinom(y,n, 0.99)
0.9043821
</code></pre></div></div>
<p>Just very different!</p>
<p>Clearly using the Bayesian posterior of $\theta$ does not fix the issue. The problem is that the likelihood principle ensures a constant ratio as a function of $\theta$, not of $y_1$ or $y_2$.</p>
<h2 id="model-selection">Model selection?</h2>
<p>Unlike the unnormalized likelihood in the likelihood principle, the marginal likelihood in model evaluation is required to be normalized.</p>
<p>In the previous AB testing example, given data $(y,n)$, if we know that one and only one of the binomial or the negative binomial experiments was run, we may want to perform model selection based on the marginal likelihood. For simplicity we consider a point estimate $\hat \theta=0.9$. Then we obtain a likelihood ratio test with ratio $0.9$, slightly favoring the binomial model. Actually this marginal likelihood ratio is the constant $y/n$, independent of the posterior distribution of $\theta$. If $y/n=0.001$, then we get a Bayes factor of 1000 favoring the binomial model.</p>
<p>Except it is wrong. It is not sensible to compare a likelihood on $y$ and a likelihood on $n$.</p>
<h2 id="what-can-go-wrong-in-cross-validation">What can go wrong in cross-validation</h2>
<p>CV requires some loss function, and the same likelihood does not imply the same loss function (L2 loss, interval loss, etc.). For concreteness, we adopt log predictive densities for now.</p>
<p>CV also needs some part of the data to be exchangeable, which depends on the sampling distribution.</p>
<p>On the other hand, the computed LOO-CV log predictive density seems to depend on the data only through the likelihood. Using the two-model notation with $M1: p_1(\theta\vert y_1)$ and $M2: p_2(\theta\vert y_2)$,</p>
\[\text{LOOCV}_1= \sum_i \log p_1(y_{1i}\vert y_{1,-i}) = \sum_i \log \left( \int_{\theta} \frac{ p_\text{post} (\theta\vert M_1, y_1)}{ p_1(y_{1i}\vert \theta) } d\theta\right)^{-1},\]
<p>and replace all 1 with 2 in $\text{LOOCV}_2$.</p>
<p>The likelihood principle does say that $p_\text{post} (\theta\vert M_1, y_1)=p_\text{post} (\theta\vert M_2, y_2)$,
and if there is some generalized likelihood principle ensuring that $p_1 (y_{1i}\vert\theta)\propto p_2 (y_{2i} \vert\theta)$, then $\text{LOOCV}_1= \text{constant} + \text{LOOCV}_2$.</p>
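<p>That LOO-CV touches the data only through pointwise likelihoods can be checked numerically: the LOO predictive density satisfies $p(y_i\vert y_{-i}) = \left(\int p_\text{post}(\theta\vert y)/p(y_i\vert\theta)\, d\theta\right)^{-1}$. A sketch on a conjugate normal example (hypothetical data; prior $\theta\sim$ N(0,1), unit-variance likelihood) compares this identity against a direct refit without $y_i$:</p>

```python
import math

def npdf(x, m, s):
    # normal density
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

y = [0.3, -1.1, 0.8, 1.9, -0.4]  # hypothetical observations

def post(ys):
    # conjugate posterior of theta: prior N(0,1), likelihood y_i ~ N(theta, 1)
    return sum(ys) / (1 + len(ys)), 1.0 / (1 + len(ys))

def loo_direct(i):
    # refit without y_i: p(y_i | y_{-i}) = N(y_i; m_{-i}, 1 + v_{-i})
    m, v = post(y[:i] + y[i + 1:])
    return npdf(y[i], m, math.sqrt(1.0 + v))

def loo_is(i, grid_n=4001, lo=-10.0, hi=10.0):
    # identity: 1 / integral of p_post(theta | y) / p(y_i | theta), by trapezoid rule
    m, v = post(y)
    h = (hi - lo) / (grid_n - 1)
    total = 0.0
    for j in range(grid_n):
        t = lo + j * h
        w = 1.0 if 0 < j < grid_n - 1 else 0.5  # trapezoid endpoint weights
        total += w * npdf(t, m, math.sqrt(v)) / npdf(y[i], t, 1.0)
    return 1.0 / (total * h)

for i in range(len(y)):
    print(loo_direct(i), loo_is(i))
```

<p>The two columns agree, and notice the identity only ever evaluates the posterior and the pointwise likelihood $p(y_i\vert\theta)$—nothing else about the sampling scheme enters.</p>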
<p>Sure, but that is an extra assumption. And arguably the point-wise likelihood principle is such a strong assumption that it would hardly hold beyond toy examples.</p>
<p>The basic form of the likelihood principle does not have the notion of $y_i$. It is possible that $y_2$ and $y_1$ have different sample sizes: consider a meta-polling with many polls. Each poll is a binomial model with $y_i\sim binomial(n_i, \theta)$. If I have 100 polls, I have 100 data points. Alternatively I can view the data as $\sum {n_i}$ Bernoulli trials, and the sample size becomes $\sum_{i=1}^{100} {n_i}$.</p>
<p>Finally, just like the case of the marginal likelihood, even if all the conditions above hold, and regardless of this identity, it is conceptually wrong to compare $\text{LOOCV}_1$ with $\text{LOOCV}_2$. They are scoring rules on two different spaces (probability measures on $y_1$ and $y_2$ respectively) and should not be compared directly.</p>
<h2 id="ppc-again">PPC again</h2>
<p>Although it is a bad practice, we sometimes compare PPC p-values for the purpose of model comparison. In the y=9, n=10, $\hat \theta=0.99$ case, we can compute the two-sided p-values
$\min (\Pr(y_{sim} > y \vert n), \Pr(y_{sim} < y \vert n))$ for the binomial model and $\min (\Pr(n_{sim} > n \vert y), \Pr(n_{sim} < n \vert y))$ for the NB model respectively.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> min(pnbinom(n-y,y, 0.99), 1-pnbinom(n-y,y, 0.99) )
0.0042662
> min( pbinom(y,n, 0.99), 1-pbinom(y,n, 0.99))
0.09561792
</code></pre></div></div>
<p>In the marginal likelihood and log score cases, we know we cannot directly compare two likelihoods or two log scores when they are on two different sampling spaces. Here, the p-value is naturally normalized. Does that mean the NB model is rejected while the binomial model passes PPC?</p>
<p>Still we cannot. We should not compare p-values at all.</p>
<h2 id="the-likelihood-principle-and-the-sampling-distribution">The likelihood principle and the sampling distribution</h2>
<p>To avoid unfair comparisons of marginal likelihoods and log scores across two sampling spaces, a remedy is to consider a product space: both $y$ and $n$ are now viewed as random variables.</p>
<p>The binomial/negative binomial narratives specify two models: $p(n,y\vert \theta)= 1(n=n_{obs})\, p(y\vert n, \theta)$ and $p(n,y\vert \theta)= 1(y=y_{obs})\, p(n\vert y, \theta)$.</p>
<p>The ratio of these two densities only admits three values:
0, infinity, or the constant $y/n$.</p>
<p>If we observe several pairs of $(n, y)$, we can easily decide which margin is fixed. The harder problem is when we only observe one $(n,y)$. Based on the comparison of marginal likelihoods and log scores in the previous sections, it seems both metrics would still prefer the binomial model (now viewed as a sampling distribution on the product space).</p>
<p>Well, it is almost correct except that 1) the sample log score is not meaningful if there is only one observation, and 2) we need some prior on models to go from the marginal likelihood to the Bayes factor. After all, under both sampling models, the event admitting a nontrivial ratio, $1(y=y_{obs}) 1(n=n_{obs})$, has zero measure. We could do whatever we want at this point without affecting any asymptotic property in the almost-sure sense.</p>
Yuling Yao