Jekyll2021-11-22T20:10:19+00:00https://www.yulingyao.com/blog/feed.xmlYuling Yao’s BlogBayesian Statistics, Machine LearningYuling YaoMarginal likelihood and the Lindley paradox2021-11-22T00:00:00+00:002021-11-22T00:00:00+00:00https://www.yulingyao.com/blog/2021/BF<p>I read an arxiv preprint “History and Nature of the Jeffreys-Lindley Paradox” by Eric-Jan Wagenmakers and Alexander Ly. It is a comprehensive journey that reviews the development of the “Jeffreys-Lindley Paradox”, or what is typically called the Lindley Paradox: we can reject a point null at p = 0.0001 while the Bayes factor (BF) may favor this point null at BF = 1000.</p> <p>Wagenmakers and Ly pointed out two approaches to escape the Lindley Paradox: either to avoid using a point hypothesis in the Bayes test, or to avoid a vague prior. Notably, we may still have the Lindley Paradox when the null is a spiky continuous distribution rather than the point mass.</p> <p>To make the discussion concrete, consider a Bernoulli experiment $y\sim \mathrm{Bin} (n,\theta)$ in which we observe $y = 5100$ with $n=10000$. We specify a point null $\theta=.5$ and the alternative $\theta\neq .5$ with $\theta \sim$ Uniform (0,1). The p-value for the null is approximately $2\Pr(z&gt; (5100-5000) / (0.5 \sqrt{n}))= 2\Pr(z&gt;2)\approx 0.05$, small enough to reject at the 5% level, while the BF is roughly 11 in favor of the point null, as $\theta=.5$ predicts the outcome much better than the vague prior $\theta \sim$ Uniform (0,1).</p> <p>We have made this point in our hierarchical stacking article: a model being true or false is not directly related to it being good or bad in terms of data fitting. Indeed a wronger model may make a better prediction depending on your chosen metric. In the Lindley paradox, at least I think, a Bayesian shall not judge that $\theta=.5$ in light of a very big BF, because we know a priori that Pr(point null)=0.</p> <p>The marginal likelihood is only weakly related to how the model fits the data.
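The arithmetic in this binomial example is easy to check numerically. Below is a minimal pure-Python sketch (the function name `point_null_check` is mine; the p-value uses the normal approximation to the binomial):

```python
import math

def point_null_check(y, n, theta0=0.5):
    """Compare the z-test p-value with the Bayes factor for the point null
    theta = theta0 against theta ~ Uniform(0, 1) in a Binomial(n, theta) model."""
    # two-sided p-value from the normal approximation to the binomial count
    z = (y - n * theta0) / math.sqrt(n * theta0 * (1 - theta0))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # log Binomial pmf = marginal likelihood under the point null (log space to avoid underflow)
    log_pmf = (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
               + y * math.log(theta0) + (n - y) * math.log(1 - theta0))
    # under theta ~ Uniform(0,1), the marginal likelihood of y is 1/(n+1) for every y
    bf_01 = math.exp(log_pmf) * (n + 1)
    return z, p_value, bf_01

# y = 5100 out of n = 10000: p ≈ 0.05 rejects at the 5% level,
# yet the Bayes factor is about 10.8 in favor of the point null
z, p, bf = point_null_check(5100, 10000)
```

Running the same computation with a much larger $n$ at a fixed $z$-score makes the BF for the null arbitrarily large, which is the asymptotic form of the paradox.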
It reflects the average leave-q-out log predictive density when q varies from 1 to $n$, among which $q=n$ accounts for a non-proportional share because the prior typically has bad predictive power.</p> <p>To me, this irrelevance to the prediction task is the larger problem of BF: BF aims to test which model is more correct, rather than which model fits the data better. Worse, a model consists of two parts: the structure and the magnitude (the specific value in the prior). To appeal to BF, you need to do well on both parts. At some point, it is a test of the prior rather than a test of the model. In contrast, in hypothesis testing/LOO-model comparison/posterior predictive checks, the prior is not or is less relevant, because these approaches examine the predictive ability of the inferred model rather than the prior.</p> <p>BF/marginal likelihood does have its merit: we can easily trick empirical loss by using an overfitting model, in which the empirical loss approaches zero while BF will typically be very small because of the large/complex parameter space in the prior. In that sense, BF <em>never</em> overfits; BF <em>always</em> underfits.</p> <p>Can we make BF less sensitive to priors? Yes, use intrinsic BMA, or its $n=1$ limit, the pseudo-BMA (LOO-elpd weighting).</p> <p>Can we monitor empirical loss to test the model being <em>true</em> or <em>false</em> (other than <em>good</em> or <em>bad</em>)? Yes, stay tuned.</p>Yuling YaoI read an arxiv preprint “History and Nature of the Jeffreys-Lindley Paradox” by Eric-Jan Wagenmakers and Alexander Ly.
It is a comprehensive journey that reviews the development of the “Jeffreys-Lindley Paradox”, or what is typically called the Lindley Paradox: we can reject a point null at p = 0.0001 while the Bayes factor (BF) may favor this point null at BF = 1000.Terrace and gradient2021-10-05T00:00:00+00:002021-10-05T00:00:00+00:00https://www.yulingyao.com/blog/2021/gradient<p>I came across a paper <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4306294/">“The Adaptive Biasing Force Method: Everything You Always Wanted To Know but Were Afraid To Ask”</a> by Jeffrey Comer et al. When comparing the adaptive biasing force method (a gradient-based method) and importance-sampling-based methods (zero-order methods), the authors concluded that</p> <blockquote> <p>From a mathematical viewpoint, the adaptive biasing force method, just like adaptive biasing potential methods, is an adaptive importance-sampling procedure. There is, however, a salient difference between these two techniques. In the latter, the potential of mean force or, equivalently, the corresponding probability distribution along the transition coordinate is being adapted. In contrast, the former relies on biasing the force, i.e., the gradient of the potential. This difference is more important than it might appear at first sight, as potentials and probability distributions are global properties whereas gradients are defined locally. In terms of probability distributions, it means that the count of samples in the neighborhood of a given value of the transition coordinate is insufficient to estimate probability. Knowledge of the underlying probability distribution over a much broader range of $\xi$ is required. This may considerably impede efficient adaptation. In contrast, all that is needed to estimate the gradient is the knowledge of local behavior of the potential of mean force. Other regions along the transition coordinate do not have to be visited. Thus, in many instances, adaptation proceeds markedly faster.
Using a common metaphor, the difference between the adaptive biasing potential and adaptive biasing force methods can be compared to inundating the valleys of the free-energy landscape as opposed to plowing over its barriers to yield an approximately flat terrain, conducive to unhampered diffusion.</p> </blockquote> <p>I like the plowing metaphor. I found a photo of Rice Terraces in Yunnan:</p> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/2007_1206_Cleared_Hani_rice_terraces.jpg/640px-2007_1206_Cleared_Hani_rice_terraces.jpg" /> <p>which is in contrast to:</p> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/2017_Aerial_view_Hoover_Dam_4774.jpg/600px-2017_Aerial_view_Hoover_Dam_4774.jpg" /> <p>Aside from the context of free energy computation, the exact same reasoning implied by the previous metaphor suggests that the gradient-based method is often a dual alternative to the zero-order method:</p> <ol> <li>In survival analysis, the Nelson–Aalen estimator is sort of the gradient version of the Kaplan–Meier estimator (product limit).</li> <li>In optimization, finding the mode of a convex function is equivalent to finding the minimum of the abs(gradient) function.</li> <li>In cross-validation, the jackknife is the gradient alternative to importance sampling.</li> <li>In an optimization convergence test, we can either monitor whether the objective is stable, or whether the gradient becomes zero.</li> <li>In an MCMC convergence test, we can either monitor whether the sample draws have mixed, or whether the gradient of the log density has mean zero.</li> </ol> <p>Should we compute more gradients?</p>Yuling YaoI came across a paper “The Adaptive Biasing Force Method: Everything You Always Wanted To Know but Were Afraid To Ask” by Jeffrey Comer et al.
When comparing the adaptive biasing force method (a gradient-based method) and importance-sampling-based methods (zero-order methods), the authors concluded thatHow do we compare two numbers2021-09-15T00:00:00+00:002021-09-15T00:00:00+00:00https://www.yulingyao.com/blog/2021/number<p>I was reading an article on how a politician’s height can have a causal effect on electability. But then I realized we often use a different scale for comparing numbers when we know these numbers represent some physical objects.</p> <p>Here are two examples:</p> <ol> <li>As per Google, Pete Buttigieg’s height is 5’8 and Gavin Newsom’s is 6’3, which are on the relatively short and tall ends of the modern-day politician’s height spectrum respectively. With these two numbers in mind, certainly 6’3 is much bigger than 5’8, right?</li> <li>In the 2020 U.S. Presidential election, the Democratic share in TX was 47% and the Republican share was 52%. Hey, it was 47 and 52: what a tossup!</li> </ol> <p>The point is that 6’3 / 5’8 = 190.5 cm / 172.7 cm = 1.10, and 52 / 47 = 1.11. These two sets of comparisons have the same multiplicative difference, but why do we automatically read that 6’3 $\gg$ 5’8, while 52 $\approx$ 47?</p> <p>One explanation is some sort of anchoring effect. We encounter this arbitrary anchor choice in data visualization too: when comparing two coefficients, what $y$-axis scale are we using? Here, by looking at the multiplicative difference, we have implicitly included zero as the lower end of the $y$-axis. But an adult male politician’s height cannot be zero, so maybe implicitly we have a different lower end point, or anchor, say 5’6; then the actual multiplicative difference we are reading in our minds is (6’3 - 5’6) / (5’8 - 5’6) = 4.5.</p> <p>Another explanation is that we have mapped the parameters into some decision theory. When a computer reads 6’3, it is just some 32-bit integer. But we are not computers after all.
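The anchoring arithmetic above can be made explicit in two lines (heights in inches; the 5’6 anchor is the hypothetical floor discussed above):

```python
# Heights in inches: 6'3" = 75, 5'8" = 68; the anchor 5'6" = 66 is hypothetical.
tall, short, anchor = 75, 68, 66

ratio_from_zero = tall / short                          # ~1.10, same scale as 52/47 ~ 1.11
ratio_from_anchor = (tall - anchor) / (short - anchor)  # = 4.5: now a huge difference
```

The same pair of numbers reads as "about equal" or "hugely different" depending only on which lower endpoint the comparison implicitly subtracts off.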
We automatically generate a decision theory, in which the integer 6’3 is mapped to a masculine man wearing a Brooks Brothers suit and oxford shoes, while the number 52% is mapped to some annoying recounting and the reflection of 2000. None of this additional information is encoded in the numbers as they are presented.</p>Yuling YaoI was reading an article on how a politician’s height can have a causal effect on electability. But then I realized we often use a different scale for comparing numbers when we know these numbers represent some physical objects.MEBA—Make Empirical-Bayes Bayes again2021-08-19T00:00:00+00:002021-08-19T00:00:00+00:00https://www.yulingyao.com/blog/2021/meba<p>Assume there are some hyperparameters $\beta$ in a model involving data $y$. We have four ways to get some inference for $\beta$.</p> <h2 id="map-is-bad">MAP is bad</h2> <p>First, we have MAP, or empirical loss optimization. That is, for each $\beta$, we could train the model and obtain some in-sample loss $l(y_i \mid \beta )$. Then we minimize this loss: $\hat \beta_{MAP}= \arg\min_\beta \sum_{i} l(y_i \mid \beta ).$</p> <p>We could add some prior regularization $p(\beta)$ too, which modifies it to</p> $\hat \beta_{MAP}= \arg\min_\beta \sum_{i} l(y_i \mid \beta) - \log p(\beta).$ <h2 id="we-can-go-loo-or-we-can-go-bayes">We can go LOO, or we can go Bayes</h2> <p>The above procedure is attacked in two ways. One argument is that empirical loss optimization overfits because of the misuse of in-sample error. We can adjust for this error by using cross-validation. For example, combining leave-one-out CV and empirical loss optimization, we have</p> $\hat \beta_{LOO}= \arg\min_\beta \sum_{i} l( y_i \mid \beta, y_{-i}) - \log p(\beta).$ <p>This LOO step amounts to empirical Bayes when the empirical-Bayes objective is a LOO metric.</p> <p>Yet another attack on MAP is that it is a point estimate. “You overfit cuz you ignore the uncertainty”.
As an attempt to fix it, we have some generalized Bayesian step:</p> $\log p (\beta \mid y) = - \sum_{i} l(y_i \mid \beta ) + \log p(\beta).$ <h2 id="can-we-go-both">Can we go both?</h2> <p>It is natural to ask which one is better: Bayes or LOO-MAP? The answer depends. For example, in the context of regression, LASSO (where the hyperparameter is tuned by LOO) is much better than the Bayesian lasso (in which the hyperparameter is treated as a parameter to fit using the Bayes rule).</p> <p>But an even larger picture is a 2-by-2 table</p> <table> <thead> <tr> <th style="text-align: center">MAP 😱</th> <th style="text-align: center">LOO-MAP (Empirical Bayes) 😐</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><strong>Bayes</strong> 😐</td> <td style="text-align: center">😊</td> </tr> </tbody> </table> <p>We have two directions to improve MAP: either using LOO or using Bayes. But can we combine them? Can we reach that 😊 block?</p> <h2 id="bayesianize-the-empirical-bayes">Bayesianize the Empirical-Bayes</h2> <p>The idea is to define a posterior density via the leave-one-out likelihood:</p> $\log p (\beta \mid y)= - \sum_{i} l( y_i \mid \beta, y_{-i}) + \log p(\beta).$ <p>Is it justified to be full-Bayes? Yes. It can be viewed as data augmentation. Assuming there is a hold-out dataset, we could use one dataset to first obtain the conditional parameter inference $p( \theta \mid y, \beta)$; we then obtain exact Bayesian inference on the hyperparameter $\beta$ as $p( \beta \mid y, y^\prime)$ using the hold-out data $y^\prime$ and integrating out $\theta$. Now, instead of having this hold-out dataset, we integrate it out. That is the LOO-likelihood part.</p> <p>Is there an example in which this idea yields success? Yes, we have shown in our hierarchical stacking paper that this LOO-likelihood sampling (hierarchical stacking) yields better predictions than LOO-optimization (no-pooling stacking).</p> <p>Are there computational advantages over LOO-MAP?
Yes, LOO-MAP is often done by grid search. But we can now use gradient information (with respect to $\beta$) when sampling this density.</p> <p>Can we extend this to a general inference paradigm, which would sit parallel, if not above, to MAP, Bayes, and empirical Bayes? Highly promising. I am looking forward to that.</p>Yuling YaoAssume there are some hyperparameters $\beta$ in a model involving data $y$. We have four ways to get some inference for $\beta$.Decision theory is hard2021-06-04T00:00:00+00:002021-06-04T00:00:00+00:00https://www.yulingyao.com/blog/2021/decision<p>One mental challenge is decision-making with more than two options. To simplify the dilemma that your humble author is encountering, assume that you are on a long-haul flight and are asked by a friendly cabin crew which of the following dishes you would prefer:</p> <ol> <li>chicken tikka masala,</li> <li>chicken madras,</li> <li>apple pie.</li> </ol> <p>There is a limited supply and you are asked to order your preference, which is not necessarily honored. To be fair, I don’t think these dishes are on any actual menu, but the point of this example is that options (1) and (2) are nearly identical (from your humble author’s point of view).</p> <h2 id="selection-or--mixing">selection or mixing</h2> <p>One psychological confusion is the difficulty of distinguishing between “selection” and “mixing”. The ability to “order your preference” refers to first generating a list of latent preferences and then ordering them. Assume I have a sophisticated mind and I automatically self-normalize the latent preferences into $x_1, x_2, x_3$ such that $x_i\geq 0$ and $x_1+ x_2+ x_3=1$.</p> <p>But it is not clear how reliable our preference generation ability is. There are two orthogonal approaches. First, one by one: we figure out how much utility we would have when having only chicken tikka masala, and so on.
Maybe I prefer chicken overall much better than apple pie, so I will have $x_1=0.46, x_2=0.44, x_3=0.1$.</p> <p>Second, we can embed this discrete problem into a larger continuous problem: we imagine there is a tasting menu that mixes these three dishes, and we are considering the optimal mixing proportion. This time, because humans generally have concave utility functions, Jensen’s inequality vividly suggests I should not order two chicken curry dishes simultaneously. Then this optimal weight would inflate the preference for the third item, such as $x_1=0.3, x_2=0.3, x_3=0.4$. That is an order flip.</p> <h2 id="sequential-decision-making">sequential decision making</h2> <p>When it comes to sequential decision-making, it is even harder. Instead of an ordering of the list, we are now asked to bid for one dish at a time. Also assume my actual preference is 0.5, 0.1, 0.4. Because (1) and (2) are alike, my mental process might first distinguish between (1) and (2)—(1) is an easy win. It is like matching: if most coordinates match perfectly, ordering is easy. Then I will process the curry dishes and the apple pie, where I might have some struggle: they are just two very different items, and I can typically make up reasons for both of them. But anyway, I find curry better than apple pie after some self-fighting. So I tell the flight crew I will order (1).</p> <p>But then the flight crew checks the headcount in the kitchen, and dish (1) is sold out. So I am asked to choose between (2) and (3) again.</p> <p>A good mental process should be consistent in some way: the behavior of</p> <ul> <li>choosing between (2) and (3) conditioning on (1) being not available</li> <li>choosing between (2) and (3) if (1) had not been brought out at all</li> </ul> <p>should be the same. It is like how multinomial classification with $K$ categories is equivalent to $K-1$ binomial classifications.
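This consistency condition amounts to renormalizing the remaining preferences when an option drops out. A minimal sketch with the preference numbers assumed above (the function `conditional_choice` is mine, not a standard API):

```python
def conditional_choice(prefs, unavailable):
    """Renormalize latent preferences after removing unavailable options,
    mimicking a consistent (regret-free) sequential decision."""
    remaining = {k: v for k, v in prefs.items() if k not in unavailable}
    total = sum(remaining.values())
    return {k: v / total for k, v in remaining.items()}

prefs = {"tikka masala": 0.5, "madras": 0.1, "apple pie": 0.4}
# dish (1) is sold out; a consistent decision-maker renormalizes and picks (3)
post = conditional_choice(prefs, {"tikka masala"})
best = max(post, key=post.get)  # -> "apple pie", with renormalized weight 0.8
```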
If that is the case, I should pick item (3), for $x_3=0.4$.</p> <p>Except no, my mental process is often not a martingale. It is natural to be sad when learning (1) is not honored, and that will influence how I make my next-stage decision: I might tend to pick (2), just due to its similarity to (1), and this similarity compensates for my disappointment/regretfulness. Is it necessarily irrational? Maybe, but the disappointment is a real feeling, and maximizing the utility of the whole process, including the decision-making phase, is also a sensible goal.</p>Yuling YaoOne mental challenge is decision-making with more than two options. To simplify the dilemma that your humble author is encountering, assume that you are on a long-haul flight and are asked by a friendly cabin crew which of the following dishes you would preferWho is more Bayesian, Rubin or Pearl?2021-05-23T00:00:00+00:002021-05-23T00:00:00+00:00https://www.yulingyao.com/blog/2021/causal<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/65/Catplay-fight.JPG/320px-Catplay-fight.JPG" alt="cat-fight" /></p> <p>OK, the short answer is that they are both experts on Bayesian statistics, or at least in terms of using Bayesian inference.</p> <p>But we have argued that there are many levels of Bayesian practice. For example, a Gelman-school Bayesian statistician would be characterized by using Stan—and, more intrinsically, by the preference for a generative model that can often be mapped into a chunk of Stan code.</p> <p>The word “generative’’ can mean other things; here I mean that we have a probabilistic model for the outcome given input, and we could/want to generate predictions and simulation draws for this outcome. If we have such a model, then causal inference is trivial by imputing all potential outcomes. In contrast, reduced-form models do not model individual outcomes. For example, in ANOVA, we do not model $y$ at all; we make a hypothesis test.
In a Latin square design, we do not model the row effect or column effect (even if we can). In a least-squares fit, we do not have a probabilistic model for $y\vert x$: whether it has Gaussian noise or student-$t$(8) noise, a least-squares fit is equally fine.</p> <p>Many of Rubin’s approaches fall in the reduced-form modeling culture. I suspect this preference received some impact from Cochran, who did a lot on experimental design, and Cochran’s experimental design is pretty much a hedging tool to eliminate all environmental factors such that the treatment effect can be derived from a sample average. A similar goal applies when we are doing matching or weighting: we would like to mimic a randomized experiment after some procedure as much as possible, such that we do not have to model all other confounders with a parametric model. Both matching and weighting are not generative—perhaps matching can be translated to a $k$-nearest-neighbors type of model to be conceptually generative, but there is even an additional gap from $k$-nearest neighbors to a probabilistically generative model. A further development along this line is a rich literature on semiparametric theory, in which we automatically dismiss nuisance parameters in estimation and therefore do not focus on individual-level prediction. For example: regressing $y$ on the propensity scores appears to be efficient in the semiparametric sense, but it is often not a good individual-level prediction.</p> <p>A DAG, on the other hand, is automatically Bayesian and generative. It could typically be mapped to a Stan program (maybe with the exception of discrete variables, but those can be marginalized out too), and vice versa. In this sense, inference from a DAG is at least more “generatively Bayesian” than weighting, matching, double robustness, etc.</p> <p>But instead of viewing their graph as one of many plausible models, I find DAG people often obsessed with treating a graph as a description of (conditional) dependence.
The historical reason is probably computation: in the bygone days, we needed a lot of conditional independence to run Gibbs sampling or variational inference. Nowadays, in generic computation software such as Stan, conditional independence is not even a consideration. But this ease of computation does not ease DAG people’s obsession with (in)dependence. For example, one critical assumption to be made is the faithfulness condition: a DAG is mapped to a set of dependence or independence relations, and nothing more is allowed in the population. On the contrary, I don’t think a matching person would spend too much time discussing conditional independence: matching is sold for being distribution-free (in the same way that an LSE is prima facie distribution-free).</p> <p>Certainly there are some aspects of Pearl’s approach that are connected to semiparametric efficiency, or the reduced-form-model-for-efficiency culture, as when he recommended regressing only on back-door variables. A fully generative model would want to include all variables—I am not saying I want to include all variables and then use the regression coefficient as the causal effect; I am talking about using a good regression model for individual-level predictions, from which we can extract all potential outcomes.</p> <p>In light of these comparisons, the difference between Rubin’s and Pearl’s approaches to causal inference is orthogonal to the battle between build-a-generative-model-as-much-as-you-want vs reduced-form-for-robustness-and-efficiency. Perhaps another orthogonal dimension is between use-a-model-to-smooth-the-outcome-then-do-something-else vs directly-manipulate-data-to-obtain-an-algorithm.</p> <p>There is a third participating counterpart in this causal inference battle. I want to call it Gelman’s approach but I guess he hates human names in methods. The idea is that we would like to build a fully generative model (as big as you want) for individual-level prediction.
Then all causal questions, individual or average, are answered by imputation/generated quantities. Still, there are non-checkable assumptions to be made, such as unconfoundedness. However, we would not assume our model is the true model and derive a sequence of conditional independences therein. Rather, we would check and improve our model in terms of better individual-level predictions. Perhaps this view is indeed more “generative” than a DAG. A DAG can, but sometimes prefers not to, specify the joint distribution of all variables, and that is why it needs a “faithfulness assumption”—a DAG can be used to represent a family of generative models, in the way that a reduced-form LSE can represent both the normal- and student-$t$-error models.</p> <p>I think people in the foxhole have followed this fully-generative approach. Maybe it is time to develop more theories. For example, if I <em>know</em> the individual-level outcome $y$ depends on the treatment $z$, some related variables $x_1$, and another set of inputs $x_2$, and I also <em>know</em> that $z$ depends on $x_1$ but is independent of $x_2$, then, at the risk of being attacked by both Rubin and Pearl, a foxhole fully-generative Bayesian would still want an outcome model for $y$ that includes $x_2$, $x_1$, and $z$. This model is likely less efficient in estimating the ATE, but has the benefit of robustness against model misspecification. Is there any usage of the propensity score in this fully-generative approach? How to balance the goal of generic model fitting (robustness) and targeted estimation (efficiency) in the workflow?</p>Yuling YaoI passed my defense today2021-05-07T00:00:00+00:002021-05-07T00:00:00+00:00https://www.yulingyao.com/blog/2021/defense<p>I passed my <a href="https://www.yulingyao.com/pdf/thesis_YY.pdf">thesis</a> defense today. The capability of passing the defense per se appeared less exciting than I had imagined, in part because everyone passes the dissertation defense anyway.
It is like the p-value in a medical publication: readers already know it is significant before reading the paper, or the “Result’’ section of a conference paper: readers already know the experiment manifests “our proposed method beats all benchmarks (after some random-seed-hacking)”—an inevitable promise contains its own languishment and gloominess from the very beginning.</p> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/BhagavadGita-19th-century-Illustrated-Sanskrit-Chapter_1.20.21.jpg/320px-BhagavadGita-19th-century-Illustrated-Sanskrit-Chapter_1.20.21.jpg" /> <p>[Image of the Bhagavad Gita from the Hindu epic, with which Oppenheimer illustrated the idea of self-implied destruction.]</p> <p>Graduate school is analogous to an open-ended workflow: we cease the process before convergence. I have accumulated maybe 10 first/co-first-author papers, but I also have more unfinished papers waiting to be finished. I built a few methods and they generated ~800 citations, but I also know these methods have limitations and rarely scale to really big models and datasets. I started to establish myself professionally, but I also did not win a Nobel Prize for my thesis, like de Broglie, or get a faculty job at Berkeley or anything, like Andrew Gelman, or drop out of school and start a search engine or a social media company. I learned some more statistics, perhaps only a little bit more math, a hint of computer science, and some blog-writing skills, but I also forgot many things along the way, such as differential geometry and functional analysis. I read papers on a frequent basis, but I also have brand-new textbooks on my shelf that I bought a few years ago and have not read a single page of since. I have taken those classes, while I have never fully understood RKHS, decoupling, or semiparametrics.
I have tried from time to time, while I apprehend anything but stochastic PDEs, optimal transport, the pricing of Asian options, gradient flows, Riemannian manifolds, Cheeger’s inequality, or virtually the vast majority of things.</p> <p>A Cox-type quote here would be</p> <blockquote> <p>All graduate school experiences are unsound, but some are useful.</p> </blockquote> <p>which itself brings further debates on whether I need better diagnostics for my usefulness. The hesitation lies in that I cannot rerun the model even if the diagnostics indicate a horrible fit. In this sense, graduate school is not even a sophisticated workflow; it is a one-shot pre-asymptotic MCMC sampler: you get what you get. Slightly worse than a Markov chain, human beings have a longer memory, driving me to ask retrospective questions: If I had seen this local mode that I end up in, should I have adopted a different step size five years ago? If I had known that I would not fall in the typical set, should I have shifted my adaptation direction during warm-up?</p> <p>Clearly, I would have fewer regrets if I were Markovian, or if I abandoned the potential outcome framework.</p> <p>Anyway, I am glad and grateful for finishing this life-phase. I am pasting the acknowledgement from my thesis here:</p> <blockquote> <p>Starting with “this thesis would not have been possible without these people” would be cliché and kitsch. Such a claim misleadingly defines the causal effect of an unrealistic and irreproducible intervention, although part of this thesis addresses leave-one-out cross validation or influence analysis. That being said, the aid from my advisor, collaborators and friends throughout my graduate study and research is not a latent parameter to estimate; it is input that is palpable and treasurable.</p> </blockquote> <blockquote> <p>It would be unscalable to exhaustively list all the support I enjoyed from my doctoral advisor Andrew Gelman.
Aside from the inspirations I have learned from his blackboard and chalks, emails, overleaf editing logs, and those meetings with three hundred topics emerging in the same room, Andrew presents me with an example of what a good statistician could be. To the same extent that model fitting benefits from fake data simulation, my trajectory toward an aspiring researcher is boosted by these simulations of fake Andrew in my head—”What would Andrew say to a sloppy graph? What model would Andrew consider given this dataset? What insights would Andrew write down on his pocket notebook after reading these papers?”—Well, even this metaphor is adapted from him.</p> </blockquote> <blockquote> <p>I am grateful to Aki Vehtari for his guidance and discussions on many, if not all, of my research works. In addition to all the bibliographies he meticulously pointed me to, a simulated Aki would also appear in the previous imaginary loop: “What would Aki say to this new idea? Would Aki be happy with this software implementation?”</p> </blockquote> <blockquote> <p>I would like to thank collaborators with whom I have worked closely: Dan Simpson, Lex van Geen, Bob Carpenter, Yu-Sung Su, Jonah Gabry, Ben Bales, Jonathan Auerbach, Gregor Pirš, Charles Margossian, Collin Cademartori, and others. As a universal statistical principle applies, “the most important thing is what data you work with, not what you do with the data”, whose corollary lower bounds how much I have already learned from stacking all these collaborators.</p> </blockquote> <hr /> <p>P.S., it is very unrelated, but my text editor did not recognize the word “languishment” when I typed it, so I double-checked on Google. It appears Google Translate automatically translates this word into “语言” in Chinese, which means “language”. Hmmm, the Bayes classifier is no good.</p> <p><img src="/blog/images/2021/ScreenShot20210507.png" alt="gg" title="google" /></p>Yuling YaoI passed my thesis defense today.
The capability of passing the defense per se appeared less exciting than I had imagined, in part because everyone passes the dissertation defense anyway. It is like the p-value in a medical publication: readers already know it is significant before reading the paper, or the “Result’’ section of a conference paper: readers already know the experiment manifests “our proposed method beats all benchmarks (after some random-seed-hacking)”—an inevitable promise contains its own languishment and gloominess from the very beginning.Bipartisan vaccination?2021-04-17T00:00:00+00:002021-04-17T00:00:00+00:00https://www.yulingyao.com/blog/2021/vacc<p>I read a NYT graph article entitled <a href="https://www.nytimes.com/interactive/2021/04/17/us/vaccine-hesitancy-politics.html">Least Vaccinated U.S. Counties Have Something in Common: Trump Voters</a>. Apart from beautiful visualizations, their graphical comparisons seem persuasive in drawing two conclusions:</p> <ol> <li>States with larger Trump vote shares are likely to have more adults who are vaccine hesitant.</li> <li>States with larger Trump vote shares have a smaller share of fully vaccinated adult residents.</li> </ol> <p>Such an association is legit. But I am concerned that this article can have several misleading aspects.</p> <ol> <li><strong>The bipartisan gap, or the income gap, or the urban-rural gap?</strong> The share of Trump voters is associated with many other variables. Slightly rearranging which variable to show, one would equally draw a conclusion such as “<em>States with lower average incomes are likely to have more adults who are vaccine hesitant</em>”, or perhaps a conspiracy theorist might want to change the title to a thrilling “<em>Rural states are ignored in vaccine distribution</em>”. All these variables are quite related.
The root of conspiracy theories is to attribute all variation to one single variable, no matter whether it is race or voting.</li> <li><strong>The state level and individual level relation</strong>. To be fair, the article did not suggest “individual Trump voters are likely to be more vaccine hesitant”, but I am afraid many readers would interpret the state level associations in this way. Although this micro level explanation is plausible, we just cannot draw individual level relations from group level data. A famous example is that income is generally positively associated with education, but universities and colleges are probably among the lower end of sectors in terms of average salary, despite their high level of average employee education: the industry clustering is correlated with the outcome variable. The same reasoning applies here for individual level inference.</li> <li><strong>What can the county level comparison tell us?</strong> The article did do more analysis on county level comparisons, with a section title “Counties where more residents voted for Trump often have lower vaccination rates”. Based on the graph, the evidence is weak. If you look into states like Virginia, Oregon, or New Jersey, you would probably draw the opposite conclusion. Again, county level relations generally do not have to share the same direction as individual level or state level ones. Think about income, for example: richer states are more blue while richer voters are more Republican. In between, the county level relation can mix these two ends: many parts of Long Island are redder and richer because rich people choose to cluster there, while other upstate counties are red and poor because they are rural. Again, both of these two mechanisms are outcome-dependent clustering, but they result in divergent signs in county level relations. In this example, the county level comparison is mostly driven by (a) the rural-urban distinction and (b) elderly population shares.
I guess that is why the pattern in Vermont is mostly random.</li> <li><strong>Vaccine hesitation vs vaccine rollout</strong> The article mixes these two outcome variables: vaccine hesitation and vaccine rollout. But is the correlation between them a fixed constant? Since it (April) is still the early phase of vaccine distribution nationwide, is vaccine trust the main bottleneck of vaccine rollout? I think the answer is no, and this is more evident at the county level. Look at upstate NY. Vaccine hesitation is relatively high within the state, but the rollout rates are good. Hamilton County has 65% fully vaccinated adults (remember the optimal acceptance rate of an HMC sampler is indeed 0.651), the highest in the state. Sure, in the longer run when there is enough supply, the vaccine rollout rate will eventually become (1 − vaccine hesitation rate), and the two will thereby be perfectly correlated. But in the short term there are many other factors too. It seems the author has tried to fix this problem by showing that the average share of delivered doses reported as used is lower in the 10 most hesitant states; but this usage efficiency itself would be largely correlated with the rural-urban distinction.</li> </ol> <p>It is convenient/eye-catching to explain everything by the bipartisan gap. But we need more modeling for the full story.</p>Yuling YaoI read a NYT graph article entitled Least Vaccinated U.S. Counties Have Something in Common: Trump Voters. Apart from the beautiful visualizations, their graph comparison seems persuasive in drawing two conclusions:Note on “model diversity”2021-04-12T00:00:00+00:002021-04-12T00:00:00+00:00https://www.yulingyao.com/blog/2021/negativecorr<p>In my <a href="https://statmodeling.stat.columbia.edu/2021/01/26/hierarchical-stacking-part-ii/">previous blog post</a> on hierarchical stacking, reader “Chaos” pointed me to Gavin Brown’s Ph.D.
thesis on Negative Correlation (NC) Learning, which had a good characterization of the importance of diversity to stacking or stacking-like approaches.</p> <p>So I took a look at that thesis. In the NC framework we are combining $K$ point estimates $f_{1}, \dots, f_{K}$</p> $f_{ens} (x)=\sum_{k=1}^K w_k f_k(x)$ <p>and try to minimize the MSE of the ensemble. With uniform weights $w_k = 1/K$, the MSE decomposes as</p> $\mathrm{MSE}= \Big(\frac{1}{K}\sum_{i=1}^K \mathrm{E} (f_i(x)-y)\Big)^2 + \frac{1}{K^2} \sum_{i=1}^K \mathrm{Var}(f_i) + \frac{1}{K^2} \sum_{i=1}^K \sum_{j\neq i} \mathrm{Cov}(f_i, f_j).$ <p>The three terms read</p> $\mathrm{bias}^2 + \frac{1}{K} \mathrm{variance} + (1-\frac{1}{K}) \mathrm{covariance}.$ <p>The intuition is that we want to maximize the diversity; perhaps that means minimizing correlation?</p> <p>An alternative decomposition is based on the ambiguity. The MSE of the ensemble is</p> $\mathrm{MSE}= \mathrm{E} \vert f_{ens} (x)- y \vert ^2 = \sum_{i=1}^K w_i \mathrm{E} (f_i(x)-y)^2 - \sum_{i=1}^K w_i \mathrm{E} (f_i(x)- f_{ens}(x))^2.$ <p>Here the term</p> $\sum_{i=1}^K w_i \mathrm{E} (f_i(x)- f_{ens}(x))^2$ <p>is called the <em>ambiguity</em>. Compared with the covariance, the ambiguity has only one term and so is likely more tractable.</p> <p>That is the basis of negative correlation (NC) learning, in which we train $K$ neural nets. Instead of minimizing the individual $\mathrm{E} (f_i(x)-y)^2$, we minimize the error minus the ambiguity (or plus the covariance): $$\min (f_i-y)^2 + \lambda (f_i - f_{ens}) \sum_{j\neq i} (f_j -f_{ens})$$</p> <p>Certainly all the theoretical results derived here are brilliant. But there are some reasons why we did not stop here.</p> <p>First, the bias-variance trade-off only applies to MSE, while we want to quantify the ensemble richness with respect to any pre-specified utility. Of course, some Jensen-type inequality still holds.</p> <p>Second, also because of the central role of MSE here, the term ambiguity does not apply to combining predictive distributions.
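(As an aside: for point predictions the ambiguity decomposition above is an exact identity — expand the square and use $\sum_i w_i f_i = f_{ens}$ — which is easy to verify numerically. A minimal NumPy sketch with simulated data; the data and weights here are made up for illustration, not taken from the thesis.)

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 4, 100                        # K member models, n data points (made up)
y = rng.normal(size=n)               # targets
f = y + rng.normal(size=(K, n))      # member predictions, each noisy around y
w = np.array([0.1, 0.2, 0.3, 0.4])   # simplex weights, sum to 1

f_ens = w @ f                               # ensemble prediction at each point
mse_ens = np.mean((f_ens - y) ** 2)         # E |f_ens - y|^2
weighted_err = np.mean(w @ (f - y) ** 2)    # sum_i w_i E (f_i - y)^2
ambiguity = np.mean(w @ (f - f_ens) ** 2)   # sum_i w_i E (f_i - f_ens)^2

# the identity holds pointwise, hence also after averaging over the data
assert np.isclose(mse_ens, weighted_err - ambiguity)
```

Since the ambiguity is nonnegative, the ensemble MSE never exceeds the weighted average of the members’ MSEs — one way to see why averaging helps.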
We do not have the concept of correlation there: say we have two random variables with $\mathrm{Corr}(x_1, x_2)=0.5$; but what is $\mathrm{Corr}(N(x_1,1), N(x_2,1))$?</p> <p>Third, this ambiguity term is only a good summary of the diversity when it stands next to the individual bias. Standing alone, this term is independent of the data (kinda like why we would like to attack PCA for ignoring $y$).</p> <p>These reasons are why we propose a new metric in our hierarchical stacking paper: how often an individual model wins.</p>Yuling YaoIn my previous blog post on hierarchical stacking, reader “Chaos” pointed me to Gavin Brown’s Ph.D. thesis on Negative Correlation (NC) Learning, which had a good characterization of the importance of diversity to stacking or stacking-like approaches.I failed an interview for being Bayesian2021-04-07T00:00:00+00:002021-04-07T00:00:00+00:00https://www.yulingyao.com/blog/2021/job2<p>I got interview feedback from a company. I initially thought my interviews went well, but it turned out that the company had a different opinion. Generally, it would be silly for me to post every job rejection. However, this particular story was special because (a) the process was quite lengthy, including 9 rounds, each nearly an hour long, and more importantly (b) as I am now informed, the main problem they had with me was that</p> <blockquote> <p>“You were not open-minded to solutions and first principles to the problem other than the ones that you are comfortable with.”</p> </blockquote> <p>I recall that during the interview there were a few case studies on data analysis. I proposed a complete workflow: how to design the experiment, how to make causal and decision theory adjustments, how to build models, how to regularize, how to compute and approximate, and how to make model improvements.
This is what I do in my applied data analyses all the time.</p> <p>It turned out the interviewers were expecting some “first principles” such as</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (prediction problem) run a neural net;
if (causal inference) run a t-test;
if (model evaluation) compute AIC;
if (decision theory is involved) flip a coin;
</code></pre></div></div> <p>Hmm, if these magical simplifications are what I was not open-minded to, it is hard for me to regret.</p> <p>Apart from reminding me of a paper review I once received, “Because all MCMC methods are not scalable to big data, your new development therein is not interesting”, the feedback above might also seem to suggest that this company itself is not open-minded to candidates who are capable of solving certain problems in ways that are not familiar to the company but might have been proven successful elsewhere. To be fair, this we-are-hiring-people-who-have-certain-skills-but-nothing-more attitude makes sense in business: these companies typically have a comprehensive pipeline for problem solving, and an entry level employee should really focus on implementing the given pipeline. I blame the Neanderthal inside me for not being compatible.</p> <p>This whole story echoes what Andrew used to say:</p> <blockquote> <p>Making Bayes inference is the only correct thing to do when you have the correct model and the correct prior, no matter who you are. Being Bayesian means you make Bayes inference anyway, even when all assumptions are wrong.</p> </blockquote> <p>from which you could tell where I got my “<em>not being open-minded</em>” from, if I were forced to accept such an accusation.</p> <p>In short, I am not regretful for being assertive/creative in interviews.
Or to put it another way, I am not regretful for receiving doctoral level training in applied statistics (in contrast to, say, taking two online courses on “mastering all machine learning and statistics and data science and programming in 14 days”), which grants me that assertiveness and creativity. As Winston Churchill pointed out:</p> <blockquote> <p>You failed an interview for proposing your own solutions? Good. That means you’ve stood up for something, sometime in your life.</p> </blockquote>Yuling YaoI got interview feedback from a company. I initially thought my interviews went well, but it turned out that the company had a different opinion. Generally, it would be silly for me to post every job rejection. However, this particular story was special because (a) the process was quite lengthy, including 9 rounds, each nearly an hour long, and more importantly (b) as I am now informed, the main problem they had with me was that