Note on "model diversity"

Posted by Yuling Yao on Apr 12, 2021. Tag: modeling

In my previous blog post on hierarchical stacking, reader “Chaos” pointed me to Gavin Brown’s Ph.D. thesis on Negative Correlation (NC) Learning, which has a good characterization of the importance of diversity in stacking and stacking-like approaches.

So I took a look at that thesis. In the NC framework we are combining $K$ point estimates $f_{1}, \dots, f_{K}$

\[f_{ens} (x)=\sum_{k=1}^K w_k f_k(x)\]

and try to minimize the MSE of the ensemble. For the uniform weights $w_k = 1/K$, the MSE decomposes as

\[\mathrm{MSE}= \mathrm{E} \left(f_{ens}(x)-y\right)^2 = \left(\frac{1}{K}\sum_{i=1}^K \left(\mathrm{E} f_i(x)-y\right)\right)^2 + \frac{1}{K^2} \sum_{i=1}^K \mathrm{Var}(f_i) + \frac{1}{K^2} \sum_{i=1}^K \sum_{j\neq i} \mathrm{Cov}(f_i, f_j).\]

The three terms read

\[\mathrm{bias}^2 + \frac{1}{K} \mathrm{variance} + (1-\frac{1}{K}) \mathrm{covariance}.\]

The intuition is that we want to maximize diversity; perhaps that means minimizing the correlation between the members?
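
To make the algebra concrete, here is a small numerical sketch (in Python; the member means, covariance, and number of replications are made-up choices) checking that the three terms add up to the MSE of a uniformly weighted ensemble:

```python
import numpy as np

# A made-up simulation checking the bias-variance-covariance decomposition
# for a uniformly weighted ensemble (w_k = 1/K); mu, Sigma, and R are arbitrary.
rng = np.random.default_rng(0)
K, R = 4, 200_000                              # ensemble size, number of replications
y = 1.0                                        # fixed target
mu = np.array([0.8, 1.1, 1.3, 0.9])            # member means, so biases are mu - y
A = rng.normal(size=(K, K))
Sigma = A @ A.T / K                            # some positive-definite covariance
f = rng.multivariate_normal(mu, Sigma, size=R) # draws of the K predictions, shape (R, K)

mse_direct = np.mean((f.mean(axis=1) - y) ** 2)    # E (f_ens - y)^2 by Monte Carlo

bias_bar = np.mean(mu - y)                         # average bias
C = np.cov(f, rowvar=False)                        # empirical K x K covariance
var_term = np.trace(C) / K**2                      # (1/K^2) sum_i Var(f_i)
cov_term = (C.sum() - np.trace(C)) / K**2          # (1/K^2) sum_{i != j} Cov(f_i, f_j)
mse_decomposed = bias_bar**2 + var_term + cov_term

print(mse_direct, mse_decomposed)   # the two numbers agree up to Monte Carlo error
```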

An alternative decomposition uses the so-called ambiguity. The MSE of the ensemble (with weights $w_i$ summing to 1) can be written as

\[\mathrm{MSE}= \mathrm{E} \left( f_{ens} (x)- y \right)^2 = \sum_{i=1}^K w_i \mathrm{E} (f_i(x)-y)^2 - \sum_{i=1}^K w_i \mathrm{E} (f_i(x)- f_{ens}(x))^2.\]

Here the term

\[\sum_{i=1}^K w_i E (f_i(x)- f_{ens}(x))^2\]

is called the ambiguity. Compared with the covariance, which involves a double sum over pairs of models, the ambiguity involves only a single sum, so it is likely more tractable.
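
Since this identity holds pointwise whenever the weights sum to one, it is easy to check numerically. A minimal sketch, with arbitrary predictions, targets, and weights:

```python
import numpy as np

# Minimal pointwise check of the ambiguity decomposition; the predictions,
# targets, and weights below are arbitrary illustrative numbers.
rng = np.random.default_rng(1)
K, N = 3, 10
f = rng.normal(size=(K, N))            # K member predictions at N inputs
y = rng.normal(size=N)                 # targets
w = np.array([0.5, 0.3, 0.2])          # weights summing to 1

f_ens = w @ f                          # weighted ensemble prediction
err_ens = (f_ens - y) ** 2             # ensemble squared error
err_avg = w @ (f - y) ** 2             # weighted average of individual squared errors
ambiguity = w @ (f - f_ens) ** 2       # weighted spread of members around the ensemble
print(np.allclose(err_ens, err_avg - ambiguity))   # True: error = average error - ambiguity
```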

That is the basis of negative correlation (NC) learning, in which we train $K$ neural nets. Instead of minimizing each individual error $\mathrm{E} (f_i(x)-y)^2$, we minimize the error minus the ambiguity (or, equivalently, plus a covariance-like penalty):

\[\min_{f_i} \; (f_i-y)^2 + \lambda (f_i - f_{ens}) \sum_{j\neq i} (f_j -f_{ens}).\]
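
Here is a rough sketch of what the resulting update looks like, with $K$ linear learners standing in for the neural nets and with $f_{ens}$ treated as fixed when taking the gradient (the usual NC simplification); the data, $\lambda$, and learning rate are arbitrary:

```python
import numpy as np

# Rough sketch of the NC learning update, with K linear learners standing in for
# neural nets; data, lambda, and learning rate are arbitrary illustrative choices.
rng = np.random.default_rng(2)
N, D, K = 200, 5, 4
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.3 * rng.normal(size=N)   # synthetic regression data
W = 0.1 * rng.normal(size=(K, D))                        # one weight vector per learner
lam, lr = 0.5, 0.05

for _ in range(500):
    F = X @ W.T                       # (N, K): member predictions
    f_ens = F.mean(axis=1)            # uniform ensemble prediction
    # Per-member gradient of the NC loss w.r.t. its prediction, treating f_ens as
    # fixed and using sum_{j != i}(f_j - f_ens) = -(f_i - f_ens) for a uniform average:
    G = (F - y[:, None]) - lam * (F - f_ens[:, None])
    W -= lr * (G.T @ X) / N           # one gradient step for every learner at once

F = X @ W.T
print(np.mean((F.mean(axis=1) - y) ** 2))   # ensemble training MSE after fitting
```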

Certainly all the theoretical results derived there are brilliant. But there are a few reasons why we did not stop here.

First, the bias-variance-covariance trade-off only applies to the MSE, while we want to quantify the richness of the ensemble with respect to any pre-specified utility. Of course, some Jensen-inequality argument still holds.

Second, also because of the central role of the MSE here, the ambiguity term does not apply to combining predictive distributions. We do not have a concept of correlation there: say we have two random variables with $\mathrm{Corr}(x_1, x_2)=0.5$; but what is $\mathrm{Corr}(\mathrm{N}(x_1,1) , \mathrm{N}(x_2,1))$?

Third, the ambiguity term is only a good summary of the diversity when it stands next to the individual bias. Standing alone, it does not depend on the data $y$ (kinda like why we would attack PCA for ignoring $y$).

For these reasons, we propose a new metric in our hierarchical stacking paper: how often an individual model wins.
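
As a rough illustration of that kind of summary (not necessarily the exact definition in the paper), given a matrix of pointwise log predictive densities we can simply count how often each model attains the best score:

```python
import numpy as np

# Hypothetical sketch of the "how often a model wins" summary (not necessarily the
# paper's exact definition): given pointwise log predictive densities, one row per
# model and one column per data point, count how often each model scores best.
rng = np.random.default_rng(3)
K, N = 3, 1000
lpd = rng.normal(loc=[[-1.0], [-1.1], [-1.3]], scale=0.5, size=(K, N))  # made-up scores

winner = lpd.argmax(axis=0)                       # index of the winning model per point
win_rate = np.bincount(winner, minlength=K) / N   # fraction of points each model wins
print(win_rate)
```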