<p>Yuling Yao’s Blog: Bayesian Statistics, Machine Learning</p>
<h1>Decision theory is hard (2021-06-04)</h1>
<p>One mental challenge is decision-making with more than two options. To simplify the dilemma that your humble author is encountering, assume you are on a long-haul flight and a friendly cabin crew member asks which of the following dishes you would prefer:</p>
<ol>
<li>chicken tikka masala,</li>
<li>chicken madras,</li>
<li>apple pie.</li>
</ol>
<p>There is a limited supply and you are asked to order your preference, which is not necessarily honored. To be fair, I don’t think these dishes are on any actual menu, but the point of this example is that options (1) and (2) are nearly identical (from your humble author’s point of view).</p>
<h2 id="selection-or--mixing">selection or mixing</h2>
<p>One psychological confusion is the difficulty of distinguishing between “selection” and “mixing”. The ability to “order your preference” refers to first generating a list of latent preferences and then sorting them. Assume I have a sophisticated mind and automatically self-normalize the latent preferences into $x_1, x_2, x_3$ such that $x_i\geq 0$ and $x_1+ x_2+ x_3=1$.</p>
<p>But it is not clear how reliable our preference-generation ability is. There are two orthogonal approaches. First, one by one: we figure out how much utility we would get from having only the chicken tikka masala, and so on. Maybe I prefer chicken overall much more than apple pie, so I will have $x_1=0.46, x_2=0.44, x_3=0.1$.</p>
<p>Second, we can embed this discrete problem into a larger continuous problem: we imagine there is a tasting menu that mixes these three dishes, and we consider the optimal mixing proportion. This time, because humans generally have concave (diminishing-returns) utility functions, Jensen’s inequality vividly suggests I should not order two chicken curry dishes simultaneously. The optimal weights would then inflate the preference for the third item, say $x_1=0.3, x_2=0.3, x_3=0.4$. That is an order flip.</p>
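<p>A toy numerical sketch of this order flip (my own construction and numbers, not a claim about anyone’s actual utility): treat the two curry dishes as near-perfect substitutes and give the tasting menu a concave, diminishing-returns utility.</p>

```python
import numpy as np

# Toy numbers of my own: the two curry dishes are near-perfect substitutes,
# and the utility of the mix is concave (square root = diminishing returns).
def utility(p1, p2, p3):
    return 0.9 * np.sqrt(p1 + p2) + 0.8 * np.sqrt(p3)

# One-by-one latent preferences (self-normalized) rank the curries on top:
one_by_one = np.array([0.46, 0.44, 0.10])

# Optimal mixing proportions: grid-search the total curry share q = p1 + p2,
# split evenly between the two curries.
q = np.linspace(0.0, 1.0, 100001)
best_q = q[np.argmax(utility(q / 2, q / 2, 1 - q))]
p_opt = np.array([best_q / 2, best_q / 2, 1 - best_q])
print(p_opt)                                  # about [0.279, 0.279, 0.441]
print(np.argmax(one_by_one), np.argmax(p_opt))  # 0 vs 2: an order flip
```

<p>Under the one-by-one reading a curry dish ranks first; under the mixing reading the dissimilar item (apple pie) gets the largest weight.</p>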
<h2 id="sequential-decision-making">sequential decision making</h2>
<p>When it comes to sequential decision-making, it is even harder. Instead of giving an ordering of the list, we are now asked to bid for one dish at a time. Also assume my actual preferences are 0.5, 0.1, 0.4. Because (1) and (2) are alike, my mental process might first distinguish between (1) and (2)—(1) is an easy win. It is like matching: if most coordinates match perfectly, ordering is easy.
Then I will process the curry dishes versus the apple pie, where I might struggle: they are just two very different items, and I can typically make up reasons for either of them. But anyway, I find curry better than apple pie after some self-fighting. So I tell the flight crew I will order (1).</p>
<p>But then the flight crew checks the headcount in the kitchen, and dish (1) is sold out. So I am asked to choose between (2) and (3) again.</p>
<p>A good mental process should be consistent in some way: the behavior of</p>
<ul>
<li>choosing between (2) and (3) conditioning on (1) being not available</li>
<li>choosing between (2) and (3) if (1) had not been brought out at all</li>
</ul>
<p>should be the same.
It is like multinomial classification with $K$ categories is equivalent to a $K-1$ binomial classifications. If that is the case, I should pick item (3) for $x_3=0.4$.</p>
<p>Except no, my mental process is often not a martingale. It is natural to be sad upon learning that (1) cannot be honored, and that will influence how I make my next-stage decision: I might tend to pick (2), just because its similarity to (1) compensates for my disappointment/regret. Is it necessarily irrational? Maybe, but the disappointment is a real feeling, and maximizing the utility of the whole process, including the decision-making phase, is also a sensible goal.</p>
<h1>Who is more Bayesian, Rubin or Pearl? (2021-05-23)</h1>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/65/Catplay-fight.JPG/320px-Catplay-fight.JPG" alt="cat-fight" /></p>
<p>OK, the short answer is that they are both experts in Bayesian statistics, or at least in terms of using Bayesian inference.</p>
<p>But we have argued that there are many levels of Bayesian practice. For example, a Gelman-school Bayesian statistician would be characterized by using Stan—and, more intrinsically, by a preference for a generative model that can often be mapped into a chunk of Stan code.</p>
<p>The word “generative’’ can mean other things; here I mean that we have a probabilistic model for the outcome given the input, and we could/want to generate predictions and simulation draws for this outcome. If we have such a model, then causal inference is trivial: we impute all potential outcomes. In contrast, reduced-form models do not model individual outcomes. For example, in ANOVA we do not model $y$ at all; we run a hypothesis test. In a Latin square design, we do not model the row effect or the column effect (even if we could). In a least-squares fit, we do not have a probabilistic model for $y\vert x$: whether it has Gaussian noise or student-$t$(8) noise, a least-squares fit is equally fine.</p>
<p>Many of Rubin’s approaches fall in the reduced-form modeling culture. I suspect this preference was influenced by Cochran, who did a lot on experimental design, and Cochran’s experimental design is pretty much a hedging tool to eliminate all environmental factors such that the treatment effect can be derived from a sample average. A similar goal applies when we are doing matching or weighting: we would like to mimic a randomized experiment as much as possible after some procedure, so that we do not have to model all the other confounders parametrically. Neither matching nor weighting is generative—perhaps matching can be translated into a $k$-nearest-neighbors type of model to be conceptually generative, but there is still an additional gap from $k$-nearest neighbors to a probabilistically generative model. A further development along this line is a rich literature on semiparametric theory, in which we automatically dismiss nuisance parameters in estimation and therefore do not focus on individual-level prediction. For example, regressing $y$ on the propensity score appears to be efficient in the semiparametric sense, but it is often not a good individual-level prediction.</p>
<p>A DAG, on the other hand, is automatically Bayesian and generative. It could typically be mapped to a Stan program (maybe with the exception of discrete variables, but those can be marginalized out too), and vice versa. In this sense, inference from DAG is at least more “generatively Bayesian” than weighting, matching, double robustness, etc.</p>
<p>But instead of viewing their graph as one of many plausible models, I find DAG people often obsessed with treating a graph as a description of (conditional) dependence. The historical reason is probably computational: in the bygone days, we needed a lot of conditional independence to run Gibbs sampling or variational inference. Nowadays, in generic computation software such as Stan, conditional independence is not even a consideration. But this ease of computation does not ease DAG people’s obsession with (in)dependence. For example, one critical assumption to be made is the faithfulness condition: a DAG is mapped to a set of dependence or independence relations, and nothing more is allowed in the population. On the contrary, I don’t think a matching person would spend much time discussing conditional independence: matching is sold for being distribution-free (in the same way that an LSE is prima facie distribution-free).</p>
<p>Certainly there are some aspects of Pearl’s approach that are connected to semiparametric efficiency, or the reduced-form-model-for-efficiency culture, as when he recommends regressing only on back-door variables. A fully generative model would want to include all variables—I am not saying I want to include all variables and then use the regression coefficient as the causal effect; I am talking about using a good regression model for individual-level predictions, from which we can extract all potential outcomes.</p>
<p>In light of these comparisons, the difference between Rubin’s and Pearl’s approaches to causal inference is orthogonal to the battle between build-a-generative-model-as-big-as-you-want and reduced-form-for-robustness-and-efficiency. Perhaps another orthogonal dimension is use-a-model-to-smooth-the-outcome-then-do-something-else versus directly-manipulate-data-to-obtain-an-algorithm.</p>
<p>There is a third participating counterpart in this causal inference battle. I want to call it Gelman’s approach, but I guess he hates human names in methods. The idea is that we would like to build a fully generative model (as big as you want) for individual-level prediction. Then all causal questions, individual or average, are answered by imputation/generated quantities. Still, there are non-checkable assumptions to be made, such as unconfoundedness. However, we would not assume our model is the true model and derive a sequence of conditional independencies therein. Rather, we would check and improve our model in terms of better individual-level predictions. Perhaps this view is indeed more “generative” than a DAG. A DAG can, but sometimes prefers not to, specify the joint distribution of all variables, and that is why it needs a “faithfulness assumption”—a DAG can be used to represent a family of generative models, in the way that a reduced-form LSE can represent both the normal- and student-$t$-error models.</p>
<p>I think people in the foxhole have followed this fully-generative approach. Maybe it is time to develop more theories. For example, if I <em>know</em> that the individual-level outcome $y$ depends on the treatment $z$, some related variables $x_1$, and another set of inputs $x_2$, and I also <em>know</em> that $z$ depends on $x_1$ but is<br />
independent of $x_2$, then at the risk of being attacked by both Rubin and Pearl,
a foxhole fully-generative Bayesian would still want an outcome model for $y$ that includes $x_2$, $x_1$, and $z$. This model is likely less efficient in estimating the ATE, but it has the benefit of robustness against model misspecification. Is there any use of the propensity score in this fully-generative approach? How do we balance the goal of generic model fitting (robustness) against targeted estimation (efficiency) in the workflow?</p>
<h1>I passed my defense today (2021-05-07)</h1>
<p>I passed my <a href="https://www.yulingyao.com/pdf/thesis_YY.pdf">thesis</a> defense today. The capability of passing the defense per se appeared less exciting than I had imagined, in part because everyone passes the dissertation defense anyway. It is like the p-value in a medical publication: readers already know it is significant before reading the paper; or the “Result’’ section of a conference paper: readers already know the experiment manifests “our proposed method beats all benchmarks (after some random-seed-hacking)”—an inevitable promise contains its own languishment and gloominess from the very beginning.</p>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/BhagavadGita-19th-century-Illustrated-Sanskrit-Chapter_1.20.21.jpg/320px-BhagavadGita-19th-century-Illustrated-Sanskrit-Chapter_1.20.21.jpg" alt="Bhagavad Gita illustration" />
<p>[Image of the Bhagavad Gita from the Hindu epic, with which Oppenheimer illustrated the idea of self-implied destruction.]</p>
<p>Graduate school is analogous to an open-ended workflow: we cease the process before convergence. I have accumulated maybe 10 first/co-first-author papers, but I also have more unfinished papers awaiting completion. I built a few methods and they generated ~800 citations, but I also know these methods have limitations and rarely scale to really big models and datasets. I started to establish myself professionally, but I also did not win a Nobel Prize for my thesis, like de Broglie, or get a faculty job at Berkeley or anything, like Andrew Gelman, or drop out of school and start a search engine or a social media company. I learned some more statistics, perhaps only a little more math, a hint of computer science, and some blog-writing skills, but I also forgot many things along the way, such as differential geometry and functional analysis. I read papers on a frequent basis, but I also have brand-new textbooks on my shelf that I bought a few years ago and have not read a single page of since. I have taken those classes, yet I have never fully understood RKHS, decoupling, or semiparametrics. I have tried from time to time, yet stochastic PDEs, optimal transport, the pricing of Asian options, gradient flows, Riemannian manifolds, Cheeger’s inequality—virtually the vast majority of things—remain beyond my comprehension.</p>
<p>A Cox-type quote here would be</p>
<blockquote>
<p>All graduate school experiences are unsound, but some are useful.</p>
</blockquote>
<p>which itself invites further debate on whether I need better diagnostics for my usefulness. The hesitation lies in the fact that I cannot rerun the model even if the diagnostics indicate a horrible fit. In this sense, graduate school is not even a sophisticated workflow; it is a one-shot pre-asymptotic MCMC sampler: you get what you get. Slightly worse than a Markov chain, human beings have longer memories, driving me to ask retrospective questions: If I had seen this local mode that I ended up in, should I have adopted a different step size five years ago? If I had known that I would not fall in the typical set, should I have shifted my adaptation direction during warm-up?</p>
<p>Clearly, I would have fewer regrets if I were Markovian, or if I abandoned the potential outcome frameworks.</p>
<p>Anyway, I am glad and grateful for finishing this life-phase. I am pasting my acknowledgement from my thesis here:</p>
<blockquote>
<p>Starting with “this thesis would not have been possible without these people” would be cliché and kitsch. Such a claim misleadingly defines the causal effect of an unrealistic and irreproducible intervention, although part of this thesis addresses leave-one-out cross-validation and influence analysis. That being said, the aid from my advisor, collaborators, and friends throughout my graduate study and research is not a latent parameter to estimate; it is input that is palpable and treasurable.</p>
</blockquote>
<blockquote>
<p>It would be unscalable to exhaustively list all the support I enjoyed from my doctoral advisor Andrew Gelman. Aside from the inspirations I have gleaned from his blackboard and chalks, emails, overleaf editing logs, and those meetings with three hundred topics emerging in the same room, Andrew presents me with an example of what a good statistician could be. To the same extent that model fitting benefits from fake-data simulation, my trajectory toward an aspiring researcher is boosted by these simulations of a fake Andrew in my head—”What would Andrew say to a sloppy graph? What model would Andrew consider given this dataset? What insights would Andrew write down in his pocket notebook after reading these papers?”—Well, even this metaphor is adapted from his.</p>
</blockquote>
<blockquote>
<p>I am grateful to Aki Vehtari for his guidance and discussions on many, if not all, of my research works. In addition to all bibliographies he meticulously pointed me to, a simulated Aki would also appear in the previous imaginary loop: “What would Aki say to this new idea? Would Aki be happy with this software implementation?”</p>
</blockquote>
<blockquote>
<p>I would like to thank collaborators with whom I have worked closely: Dan Simpson, Lex van Geen, Bob Carpenter, Yu-Sung Su, Jonah Gabry, Ben Bales, Jonathan Auerbach, Gregor Pirš, Charles Margossian, Collin Cademartori, and others. As the universal statistical principle goes, “the most important thing is what data you work with, not what you do with the data”, whose corollary lower-bounds how much I have already learned from stacking all these collaborators.</p>
</blockquote>
<hr />
<p>P.S. It is very unrelated, but my text editor did not recognize the word “languishment” when I typed it, so I double-checked on Google. It appears Google Translate automatically translates this word into “语言” in Chinese, which means “language”. Hmmm, the Bayes classifier is no good.</p>
<p><img src="/blog/images/2021/ScreenShot20210507.png" alt="gg" title="google" /></p>
<h1>Bipartisan vaccination? (2021-04-17)</h1>
<p>I read a NYT graphics article entitled <a href="https://www.nytimes.com/interactive/2021/04/17/us/vaccine-hesitancy-politics.html">Least Vaccinated U.S. Counties Have Something in Common: Trump Voters</a>. Apart from the beautiful visualizations, their graphical comparison seems persuasive in drawing two conclusions:</p>
<ol>
<li>States with larger Trump vote shares are likely to have more adults who are vaccine hesitant.</li>
<li>States with larger Trump vote shares have a smaller share of fully vaccinated adult residents.</li>
</ol>
<p>Such an association is legitimate. But I am concerned that this article has several misleading aspects.</p>
<ol>
<li><strong>The partisan gap, or the income gap, or the urban-rural gap?</strong> The share of Trump voters is associated with many other variables. By slightly rearranging which variable to show, one could equally draw a conclusion such as “<em>States with lower average incomes are likely to have more adults who are vaccine hesitant</em>”, or perhaps a conspiracy theorist might want to change the title to a thrilling “<em>Rural states are ignored in vaccine distribution</em>”. All these variables are quite related. The root of conspiracy theories is to attribute all variation to one single variable, whether it is race or voting.</li>
<li><strong>The state-level and individual-level relation</strong>. To be fair, the article did not suggest “individual Trump voters are likely to be more vaccine hesitant”, but I am afraid many readers would interpret the state-level associations in this way. Although this micro-level explanation is plausible, we just cannot draw individual-level relations from group-level data. A famous example is that income is generally positively associated with education, yet universities and colleges are probably among the lower end of sectors in terms of average salaries, despite the high average education of their employees: industry clustering is correlated with the outcome variable. The same reasoning applies here for individual-level inference.</li>
<li><strong>What can the county-level comparison tell us?</strong> The article did do more analysis on county-level comparisons, with a section titled “Counties where more residents voted for Trump often have lower vaccination rates”. Based on the graph, this is weak evidence. If you look into states like Virginia, Oregon, or New Jersey, you would probably draw the opposite conclusion. Again, county-level relations generally do not have to share the direction of individual-level or state-level ones. Think about income, for example: richer states are more blue, while richer voters are more Republican. In between, the county-level relation can mix these two ends: many parts of Long Island are red and rich because rich people choose to cluster there, while other red and poor upstate counties are so because they are rural. Both of these mechanisms are outcome-dependent clustering, but they result in divergent signs in county-level relations. In this example, the county-level comparison is mostly driven by (a) the rural-urban distinction and (b) elderly population shares. I guess that is why the pattern in Vermont is mostly random.</li>
<li><strong>Vaccine hesitancy vs. vaccine rollout.</strong> The article mixes these two outcome variables: vaccine hesitancy and vaccine rollout. But is the correlation between them a fixed constant? Since it (April) is still the early phase of nationwide vaccine distribution, is vaccine trust the main bottleneck of vaccine rollout? I think the answer is negative, and it is more evident at the county level. Look at upstate NY. Vaccine hesitancy is relatively high within the state, but the rollout rates are good. Hamilton County has 65% of adults fully vaccinated (remember the optimal acceptance rate of an HMC sampler is indeed 0.651), the highest in the state. Sure, in the longer run, when there is enough supply, the vaccine rollout rate will eventually become (1 - vaccine hesitancy rate) and the two will thereby be perfectly correlated. But in the short term there are many other factors too. It seems the author has tried to fix this problem by showing that the average share of delivered doses reported as used is lower in the ten most hesitant states, but this usage efficiency itself would be largely correlated with the rural-urban distinction.</li>
</ol>
<p>It is convenient and eye-catching to explain everything by the partisan gap. But we need more modeling for the full story.</p>
<h1>Note on “model diversity” (2021-04-12)</h1>
<p>In my <a href="https://statmodeling.stat.columbia.edu/2021/01/26/hierarchical-stacking-part-ii/">previous blog post</a> on hierarchical stacking, reader “Chaos” pointed me to Gavin Brown’s Ph.D. thesis on Negative Correlation (NC) Learning, which has a good characterization of the importance of diversity to stacking and stacking-like approaches.</p>
<p>So I took a look at that thesis. In the NC framework we are combining $K$ point estimates $f_{1}, \dots, f_{K}$</p>
\[f_{ens} (x)=\sum_{k=1}^K w_k f_k(x)\]
<p>and try to minimize the MSE of the ensemble. For uniform weights $w_k = 1/K$, the bias-variance-covariance decomposition reads</p>

\[\mathrm{MSE}= \Big(\frac{1}{K}\sum_{i=1}^K \mathrm{E} f_i(x)-y\Big)^2 + \frac{1}{K^2} \sum_{i=1}^K \mathrm{Var}(f_i) + \frac{1}{K^2} \sum_{i=1}^K \sum_{j\neq i} \mathrm{Cov}(f_i, f_j).\]
<p>The three terms read</p>
\[\mathrm{bias}^2 + \frac{1}{K} \mathrm{variance} + (1-\frac{1}{K}) \mathrm{covariance}.\]
<p>The intuition is that we want to maximize diversity; perhaps that means minimizing the correlation?</p>
<p>An alternative decomposition is the ambiguity decomposition. The MSE of the ensemble is</p>
\[\mathrm{MSE}= \mathrm{E} \vert f_{ens} (x)- y \vert ^2 = \sum_{i=1}^K w_i E (f_i(x)-y)^2 - \sum_{i=1}^K w_i E (f_i(x)- f_{ens}(x))^2.\]
<p>Here the term</p>
\[\sum_{i=1}^K w_i E (f_i(x)- f_{ens}(x))^2\]
<p>is called the <em>ambiguity</em>. Compared with the covariance, the ambiguity involves only a single sum (no pairwise terms), so it is likely more tractable.</p>
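<p>The ambiguity decomposition is in fact an exact pointwise identity whenever the weights sum to one, which a few lines of simulation can confirm (the data-generating numbers below are arbitrary choices of mine):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Numerically check: ensemble MSE equals the weighted average individual
# error minus the weighted ambiguity, when the weights sum to one.
n, K = 5000, 3
y = rng.normal(size=n)
noise_scales = np.array([[0.5], [1.0], [1.5]])       # one row per model
f = y + rng.normal(scale=noise_scales, size=(K, n))  # K imperfect predictors
w = np.array([0.5, 0.3, 0.2])                        # sums to one

f_ens = w @ f
mse_ens = np.mean((f_ens - y) ** 2)
avg_err = np.sum(w * np.mean((f - y) ** 2, axis=1))
ambiguity = np.sum(w * np.mean((f - f_ens) ** 2, axis=1))
print(mse_ens, avg_err - ambiguity)  # equal up to floating-point error
```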
<p>That is the basis of negative correlation (NC) learning, in which we train $K$ neural nets. Instead of minimizing the individual $\mathrm{E} (f_i(x)-y)^2$, we minimize the error minus the ambiguity (or, equivalently, plus the covariance penalty):
\(\min_{f_i}\; (f_i-y)^2 + \lambda \, (f_i - f_{ens}) \sum_{j\neq i} (f_j -f_{ens}).\)</p>
<p>Certainly all the theory results derived here are brilliant. But there are some reasons why we did not stop here.</p>
<p>First, the bias-variance trade-off only applies to the MSE, while we want to quantify ensemble richness with respect to any pre-specified utility. Of course, some Jensen-type inequality still holds.</p>
<p>Second, also because of the central role of the MSE here, the term ambiguity does not apply to combining predictive distributions. We do not have a concept of correlation there: say we have two random variables with $\mathrm{Corr}(x_1, x_2)=0.5$, but what is $\mathrm{Corr}(\mathrm{N}(x_1,1) , \mathrm{N}(x_2,1))$?</p>
<p>Third, this ambiguity term is only a good summary of the diversity when it stands next to the individual bias. Standing alone, this term is independent of the data (kinda like why we attack PCA for ignoring $y$).</p>
<p>These reasons are why we propose a new metric in our hierarchical stacking paper: how often an individual model wins.</p>
<h1>I failed an interview for being Bayesian (2021-04-07)</h1>
<p>I got interview feedback from a company. I initially thought my interviews went well, but it turned out that the company had a different opinion. Generally, it would be silly for me to post every job rejection. However, this particular story is special because (a) the process was quite lengthy, including 9 rounds, each nearly an hour long, and, more importantly, (b) as I am now informed, the main problem they had with me was that</p>
<blockquote>
<p>“You were not open-minded to solutions and first principles to the problem other than the ones that you are comfortable with.”</p>
</blockquote>
<p>I recall that during the interview there were a few case studies on data analysis. I proposed a complete workflow: how to design the experiment, how to make causal and decision-theoretic adjustments, how to build models, how to regularize, how to compute and approximate, and how to improve the model. This is what I do in applied data analysis all the time.</p>
<p>It turned out the interviewers were expecting some “first principles” such as</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if(prediction problem)
run a neural net;
if(causal inference)
run a t-test;
if(model evaluation)
compute AIC;
if(decision theory is involved)
flip a coin;
</code></pre></div></div>
<p>Hmm, if these magical simplifications are what I was not open-minded to, it is hard for me to feel regret.</p>
<p>Apart from reminding me of a paper review I once received (“Because all MCMC methods are not scalable to big data, your new development therein is not interesting”),
the feedback above might also seem to suggest that this company itself is not open-minded to candidates who are capable of solving certain problems in ways that are not familiar to the company but might have been proven successful elsewhere. To be fair, this we-are-hiring-people-who-have-certain-skills-but-nothing-more attitude makes sense in business: these companies typically have a comprehensive pipeline for problem solving, and an entry-level employee should really focus on implementing the given pipeline. I blame the Neanderthal inside me for not being compatible.</p>
<p>This whole story echoes what Andrew used to say</p>
<blockquote>
<p>Making Bayesian inference is the only correct thing to do when you have the correct model and the correct prior, no matter who you are. Being Bayesian means you make Bayesian inference anyway, even when all assumptions are wrong.</p>
</blockquote>
<p>from which you could tell where I got my “<em>not being open-minded</em>” from, if I were forced to accept such an accusation.</p>
<p>In short, I am not regretful for being assertive/creative in interviews. Or, to put it another way, I am not regretful for receiving doctoral-level training in applied statistics (in contrast to, say, taking two online courses on “mastering all machine learning and statistics and data science and programming in 14 days”), which grants me that assertiveness and creativity. As Winston Churchill pointed out:</p>
<blockquote>
<p>You failed an interview for proposing your own solutions? Good. That means you’ve stood up for something, sometime in your life.</p>
</blockquote>
<h1>Two approaches for online updates in the election forecast (2021-03-03)</h1>
<p>The term “online update” here refers to updating a statistical model after some modeled outcome is observed. A concrete example is election forecasting: state election results come in sequence, and that is when some website has to offer a “real-time update of our prediction”.</p>
<p>It seems there are two ways to do this task. Approach 1 is a model-based update, or maybe we should call it a Bayesian update—the model provides a posterior outcome predictive density Pr(CA, NY, …). An online update becomes a conditional estimate: Pr(CA $\vert$ NY = observed outcome). In practice, we only need to collect the posterior simulation draws that lead to this outcome.</p>
<p>If the outcome is continuous (shares of the vote), the probability is zero for any simulation draw to match the exact observations. We could use some ABC method here and approximate the exact conditional probability by Pr(CA $\mid$ NY $\approx$ observed outcome), equipped with some chosen distance metric.</p>
<p>The problem with the simulation-based approach is that if we observe some tail event, say R winning NY, there is hardly any simulation draw matching this event. Or, simply, when the number of events is large, every step of the update discards some simulation draws. For either reason, the update efficiency is limited by the number of simulation draws. A quick fix is to further approximate the posterior outcome model Pr(CA, NY, …) by a multivariate normal model, such that any conditional update has a closed-form solution, at the cost of less modeling flexibility.</p>
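<p>As a sketch of the closed-form conditional update under the normal approximation (the means, scales, correlation, and observed value below are all made-up illustration numbers, not a real forecast):</p>

```python
import numpy as np

# Approximate the joint posterior of (CA share, NY share) as bivariate
# normal, then condition on the observed NY share in closed form.
mu = np.array([0.63, 0.58])                       # means: CA, NY
Sigma = np.array([[0.02**2, 0.5 * 0.02 * 0.03],   # sds 0.02 and 0.03,
                  [0.5 * 0.02 * 0.03, 0.03**2]])  # correlation 0.5
ny_observed = 0.61

# Conditional normal: E[CA | NY] = mu_CA + S_01 / S_11 * (obs - mu_NY)
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (ny_observed - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(mu_cond, np.sqrt(var_cond))  # the CA mean shifts up, the sd shrinks
```

<p>Observing NY one standard deviation above its mean pulls the CA estimate up by half a CA standard deviation, as the correlation of 0.5 dictates.</p>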
<p>Another way to do this update task is a regression approach. Say we have a point prediction for each state, $y_{NY}, y_{CA}, \dots$, and sequentially we observe the actual outcomes, $\tilde y_{NY}, \tilde y_{CA}, \dots$; we could run a regression $\tilde y_{i} = \beta_{1} y_i + \beta_{0} + \epsilon_i, \epsilon_i \sim \mathrm{normal}(0, \sigma).$ Then the online update task becomes the standard parameter-update problem with more and more data coming in. This approach is compatible with point predictions and has the advantage of adjusting for systematic “polling bias” (think about 2016).</p>
<p>A further question is how to combine these two approaches. In particular, approach 1 (simulation draws) can make use of the posterior correlation between state outcomes. A plausible way is to replace the regression model by</p>
\[\tilde y_{i} = y_i + \beta_0 + \epsilon_i,\]
<p>But instead of the iid residuals in the regression model, this time we model $\epsilon_i$ as from a multivariate normal distribution, whose covariance is adapted from the posterior predictions,</p>
\[\mathrm{Corr}(\epsilon_i, \epsilon_j)= \mathrm{Corr}(y_i, y_j).\]
<p>The extra $\beta_0$ term is still the systematic “polling bias” that is not seen by the existing model.</p>
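<p>A rough sketch of this correlated-residual update (the posterior draws, residual scale, and state count are all invented; the $\beta_0$ bias term is dropped for brevity):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical posterior predictive draws for four states (draws x states),
# used only to estimate the between-state correlation of the residuals
draws = rng.multivariate_normal(
    mean=[0.55, 0.48, 0.60, 0.52],
    cov=0.01 * np.array([[1.0, 0.8, 0.2, 0.1],
                         [0.8, 1.0, 0.2, 0.1],
                         [0.2, 0.2, 1.0, 0.5],
                         [0.1, 0.1, 0.5, 1.0]]),
    size=5000,
)
corr = np.corrcoef(draws, rowvar=False)   # Corr(eps_i, eps_j) := Corr(y_i, y_j)
sigma = 0.02                              # assumed residual scale
cov = sigma ** 2 * corr

pred = draws.mean(axis=0)                 # point predictions y_i
resid_obs = np.array([0.53]) - pred[:1]   # state 0 has reported its outcome

# conditional mean of the unreported residuals given the observed one;
# strongly correlated states receive larger adjustments
gain = cov[1:, :1] @ np.linalg.inv(cov[:1, :1])
updated = pred[1:] + gain @ resid_obs
```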
<p>Certainly this new model is not ideal: the first two moments are oversimplified descriptions of a multi-dimensional density; the normal model cannot adapt to heavy-tailed predictions; the tail correlation is not necessarily the same as the overall correlation; etc.</p>Yuling YaoThe term online update here is referred to updating a statistical model after certain modeled outcome is observed. A concrete example is in election forecast: the state election result comes in sequence, and that is when some website has to offer “real-time update of our prediction”.Career development2021-03-01T00:00:00+00:002021-03-01T00:00:00+00:00https://www.yulingyao.com/blog/2021/job<p>Recently I have gone through a few industry job applications. Here are a few examples that manifest how amusing this procedure can become:</p>
<ol>
<li>Company A desk-rejected me for a position I did not apply for (i.e., I applied for a full-time position and received an email from HR saying they would not consider me for an internship; yes, I submitted the correct application, as I checked my receipt; and the HR email was “no-reply”).</li>
<li>Fun fact: company A is famous for AI, ML, DL, and all fancy things. I wonder whether AI safety would be undermined if the basic database procedure got corrupted.</li>
<li>After finishing a very long interview with Company B, I received an email from HR saying this interview had been rearranged to another day. It turned out that in the first three minutes the interviewer had had a connection issue and contacted HR. The interviewer managed to get back in, but HR did not know.</li>
<li>Fun fact: company B is famous for AI, ML, DL, and all fancy things. I guess at least I could trust their differential privacy.</li>
<li>Company C asked effectively identical interview questions in two subsequent sessions. I cannot help but repeat the fact that company C is famous for AI, ML, DL, and all fancy things.</li>
<li>I am not extremely thrilled by the fact that most tech companies require no more statistical knowledge than the level of the first three chapters of “Regression and Other Stories” (not saying it is not a good book). But I guess an MD would be depressed too if they were only examined on their knowledge of babysitting.</li>
<li>To be fair, finance companies often had much harder math problems, until I learned afterwards that many, if not all, of their problems were from one published book, and this book is sold on Amazon. We would certainly eliminate all overfitting issues in machine learning if all test data were sold on Amazon.</li>
<li>Many tech companies desk-rejected me. I was later informed that an insider referral was often necessary for a resume to be picked from the pile in the first place. I guess a referral is not always bad: not everyone can afford a PPO plan.</li>
<li>It is like running a linear regression with covariate dimension = 1e6: we have to run some pre-screening, and we do so by sequentially picking the variables that have high insider correlation with the currently-selected variables. Don’t ask what an insider correlation means. I don’t even know what it means in a human context.</li>
</ol>
<p>Well, let me take a step back. People make mistakes. I mess up things a lot in my work, if not always. So I should not laugh at the inconveniences I encounter. But the cynical part of me, which has been amplified by all of this, has the following cynical advice for future job seekers. An ideal candidate in the modern job market should have an Ivy degree so the hiring manager will be happy. However, during graduate school, instead of going to lectures, they should spend as much time as possible on</p>
<ol>
<li>building linkedin connections,</li>
<li>reading the first 3 chapters of RAOS,</li>
<li>practicing leetcode,</li>
<li>buying that hedge fund book from Amazon,</li>
<li>praying that the company’s email database would work,</li>
<li>Besides, don’t do anything else. Weeeee!</li>
</ol>Yuling YaoRecently I have gone through a few industry job applications. Here is a few examples that manifest how amusing this procedure can becomeFour open questions on ensemble methods2021-03-01T00:00:00+00:002021-03-01T00:00:00+00:00https://www.yulingyao.com/blog/2021/openproblem<p>In a recent paper I wrote, I discussed a few open questions on ensemble methods:</p>
<blockquote>
<ol>
<li>Both BMA and stacking are restricted to a linear mixture form; would it be beneficial to consider other aggregation forms, such as a convolution of predictions or a geometric bridge of predictive densities?</li>
<li>Stacking often relies on some cross-validation; how can we better account for the finite sample variance therein?</li>
<li>While stacking can be equipped with many other scoring rules, what is the impact of the scoring rule choice on the convergence rate and robustness?</li>
<li>Beyond current model aggregation tools, can we develop an automated ensemble learner that could fully explore and expand the space of model classes—for example, using an autoregressive (AR) model and a moving-average (MA) model to learn an ARMA model?</li>
</ol>
</blockquote>
<p>I think they are all important directions!</p>Yuling YaoIn a recent paper I wrote, I discussed a few open questions on ensemble methods: Both BMA and stacking are restricted to a linear mixture form, would it be beneficial to consider other aggregation forms such as convolution of predictions or a geometric bridge of predictive densities? Stacking often relies on some cross-validation, how can we better account for the finite sample variance therein? While stacking can be equipped with many other scoring rules, what is the impact of the scoring rule choice on the convergence rate and robustness? Beyond current model aggregation tools, can we develop an automated ensemble learner that could fully explore and expand the space of model classes—for example, using an autoregressive (AR) model and a moving-average (MA) model to learn an ARMA model?A Bayesian reflection of “Invariance, Causality and Robustness”2021-02-23T00:00:00+00:002021-02-23T00:00:00+00:00https://www.yulingyao.com/blog/2021/causal<p>I was reading Peter Bühlmann’s Statistical Science article <a href="https://arxiv.org/abs/1812.08233">“Invariance, Causality and Robustness”</a>. To be fair, he gave a short course in 2020 here at Columbia, but after reading this paper I guess I did not totally understand his lecture last time.</p>
<h2 id="going-beyond-the-potential-outcome">Going beyond the potential outcome</h2>
<p>The motivation to depart from the potential outcome framework is that in open-ended observational data gathering we do not know which input is the treatment and which is a covariate. Or, put another way, rather than finding the effect of a cause, Bühlmann is investigating the cause of an effect. Denoting by $X$ all input variables and by $Y$ the outcome, one goal is to understand which variables in $X$ actually cause/causally impact $Y$.</p>
<p>Bühlmann defines causality as invariance under different environments. Assume a no-confounding situation, or equivalently that we have collected all possibly relevant input variables $X$. Per Bühlmann’s definition, we would like to collect data from various environments (perturbations of the marginal distribution of $x$, different countries, different experimental designs, etc.), and the $(x, y)$ relation that remains unchanged across all environments is then the causality.</p>
<p>For a concrete example, consider an input $x$ that is a gene sequence of length 1000, an outcome $y$ that is some protein expression, and data collected from 10 “experiments”. We denote the data by $(x_{ijd}, y_{ij})$, $i=1,\dots, n$, $j=1, \dots, 10$, $d=1,\dots,1000$, so $i, j, d$ index data, environment, and covariate dimension respectively.</p>
<p>Bühlmann folds it into a multiple testing problem: seek a subset $\mathcal{S}\subseteq \{1, \dots, 1000\}$ such that the conditional distribution of $y\mid X_\mathcal{S}$ is invariant, i.e.,</p>
\[y_{i j}= \sum_{d\in \mathcal{S}}\beta_{d} x_{ijd} + \mathrm{iid~noise}, \quad \forall j.\]
<p>Given a subset $\mathcal{S}$, we can certainly test this hypothesis by testing whether the regression coefficients are invariant across environments. To find this subset $\mathcal{S}$ (the true cause), we can test the hypothesis over all $2^{1000}$ subsets, which needs to be done through the lens of multiple testing. This is the basis of Bühlmann’s method.</p>
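<p>The flavor of this search can be shown with a brute-force toy version (three inputs rather than 1000, and a crude coefficient-spread statistic standing in for a proper invariance test; the data-generating process is made up):</p>

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# toy data with 3 candidate inputs over 6 environments: x0 is the true
# cause, x1 is irrelevant noise, and x2 is a descendant of y whose
# dependence on y varies with the environment index j
def make_env(j, n=1000):
    x0 = rng.normal(size=n)
    x1 = rng.normal(size=n)
    y = 2.0 * x0 + rng.normal(scale=0.5, size=n)
    x2 = 0.3 * j * y + rng.normal(size=n)
    return np.column_stack([x0, x1, x2]), y

envs = [make_env(j) for j in range(6)]

def instability(subset):
    """Spread of per-environment OLS coefficients; near 0 means invariant."""
    betas = [np.linalg.lstsq(x[:, subset], y, rcond=None)[0] for x, y in envs]
    return np.ptp(np.array(betas), axis=0).max()

# brute-force search over all nonempty subsets for the most invariant one
subsets = [list(s) for r in (1, 2, 3) for s in itertools.combinations(range(3), r)]
best = min(subsets, key=instability)
```

<p>Any subset containing the descendant $x_2$ shows environment-dependent coefficients, while regressing on the true cause alone is stable; a real analysis would replace the spread statistic with a formal test and correct for multiplicity.</p>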
<h2 id="bayesian-adaptation">Bayesian adaptation</h2>
<p>From a Bayesian perspective, if I collect data from 10 “environments”, a more natural way is to fit a hierarchical regression with all inputs included,</p>
\[y_{i j}= \sum_{d=1}^{1000} \beta_{jd} x_{ijd} + \mathrm{iid~ noise}, ~~ \beta_{j d}\sim\mathrm{normal}( \mu_{d}, \sigma_d), ~~\mathrm{other~priors}.\]
<p>Now if the true data generating process (DG) is really the reduced form (i.e., there exists a sparse subset $X_{\mathcal{S}}$ of $y$’s parent nodes), then with a large sample size, we should expect</p>
\[\sigma_d \to 0, ~~\forall d\in \mathcal{S}.\]
<p>Put another way, we can interpret $\sigma_d$ as the extent to which the finding can be <em>transported</em> to other environments, and therefore, $\sigma_d = 0$ means causality.</p>
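<p>A crude, non-Bayesian stand-in for this idea: fit per-environment least squares and take the between-environment spread of the coefficient estimates as a rough proxy for $\sigma_d$ (the simulated data below are made up; a real analysis would fit the hierarchical model itself):</p>

```python
import numpy as np

rng = np.random.default_rng(7)
J, n, D = 10, 500, 5   # environments, observations per environment, inputs

# input 0 has an invariant (causal) coefficient, input 1's coefficient
# drifts across environments, and inputs 2-4 are irrelevant
betas = np.empty((J, D))
for j in range(J):
    x = rng.normal(size=(n, D))
    drift = rng.normal()   # environment-specific effect of x1
    y = 1.5 * x[:, 0] + drift * x[:, 1] + rng.normal(scale=0.1, size=n)
    betas[j] = np.linalg.lstsq(x, y, rcond=None)[0]

# between-environment spread: large only for the drifting coefficient
sigma_hat = betas.std(axis=0)
```

<p>Here $\hat\sigma_1$ is large while the rest are tiny: the drifting coefficient is flagged as non-transportable. (Separating the causal input 0 from the irrelevant inputs 2-4 would additionally use the coefficient means $\mu_d$.)</p>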
<p>The nice things about the Bayesian framework are:</p>
<ol>
<li>We avoid multiple testing.</li>
<li>Generalization to non-linear models is straightforward.</li>
<li>In practice we will never observe $p (\sigma_d\mid y)= \delta(0)$ exactly. Instead, we can place a horseshoe prior on $\sigma_1, \dots,\sigma_{1000}$ to encourage sparsity. I almost want to call this model an <em>automated cause finding</em>.</li>
</ol>
<p>An alternative Bayesian approach is some predictive projection (post-inference variable selection): after obtaining the full posterior $p(\sigma \mid y)$, we project it onto the subspace $\sigma_{d}=0$ for some $d$.</p>
<h2 id="other-thoughts">Other thoughts</h2>
<ol>
<li>There is <em>generally</em> no contradiction between prediction and causality. A good prediction requires knowing the true DG, and the true DG = causality. Conversely, knowing causality = knowing the true model = robust predictive extrapolation.</li>
<li>But certainly I can come up with a counter-example, which is why I used the word <em>generally</em>. Think about an example in which we know the true DG is $x_1 \to y$: $x_1$ is a car’s speed and $y$ is the distance it moves in 1 minute. But suppose we do not know physics, so we collect all other inputs, such as the color, the weight, the number of wheels… from many observed cars. Now if we do the automated cause finding, we will get $y=\beta_1 x_1$, the correct physical finding, and all other variables are irrelevant. But in practice we cannot measure the speed exactly, so we always have some measurement error. Let’s say $x_1$ = speed + normal (0,0.01) and $x_2$ = speed + normal (0,0.02). To improve the prediction, we would like to include both $x_1$ and $x_2$ in the linear model, i.e., at the cost of polluting the causal finding.</li>
<li>The framework of Invariance = Causality = true DG is a good justification for why we often advocate <em>fake data simulation</em> in Bayesian modeling, even though we seldom make reference to causality. Fake data simulation amounts to generating (fake) data from various environments, and a model that remains desirable across them amounts to invariance under environments.</li>
<li>One conceptual challenge is how to divide the inputs from the environments (analogous to dividing covariates from the treatment in the PO approach). If I collect surveys from 10 countries, should I view country as an input (a possible treatment/cause) or as an environment? So I guess, after all, there is some human decision here.</li>
</ol>Yuling YaoI was reading Peter Bühlmann’s statistical science article “Invariance, Causality and Robustness”. To be fair, he gave a short course in 2020 here in Columbia, but after reading this paper I guess I did not totally understand his lecture last time.