Some difficulty of fitting splines on latent variables

Posted by Yuling Yao on Sep 16, 2020.       Tag: modeling  
In general, B-splines are sensitive to the placement of the boundary knots, and the unknown support of latent variable models amplifies this sensitivity.

By spline I mean B-spline, and by B-spline I mean using a piecewise-polynomial basis in regression. It might also be called kernel regression.

I think there are three distinct usages of splines:

  1. Interpolation: we force the spline to pass exactly through given (x, y) points, and make predictions for new x.

  2. Smoothing: instead of exact interpolation, we allow some fitting error, but we also penalize the roughness of the fitted curve (the integral of the squared second-order derivative in 1D, or of the Laplacian in 2D); the short scipy sketch after this list contrasts 1 and 2.

  3. Filtering: it works like 2 in that we regress the observations on the basis-function matrix, but instead of viewing the penalization term as a bias-variance tradeoff, we shift the goal to recovering a latent surface. The bias in 2 is now interpreted as observation noise, and the latent surface is now the true value.
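To make the contrast between 1 and 2 concrete, here is a minimal scipy sketch (the data and the smoothing level `lam` are made up for illustration; `make_smoothing_spline` needs a recent scipy):

```python
import numpy as np
from scipy.interpolate import make_interp_spline, make_smoothing_spline

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# 1. Interpolation: the cubic spline is forced through every (x, y) point.
interp = make_interp_spline(x, y, k=3)

# 2. Smoothing: fitting error is allowed, traded off against the integrated
#    squared second derivative via the penalty weight lam.
smooth = make_smoothing_spline(x, y, lam=1.0)

# Predictions at new x for both fits.
x_new = np.linspace(0, 10, 200)
y_interp, y_smooth = interp(x_new), smooth(x_new)
```

Usage 3 runs the same machinery; what changes is only the interpretation of the residual and of the fitted curve.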

Now, in certain models, we model some latent variables by splines. For example, the observations $y_{1i}, y_{2i}, i=1, \dots, n$ are modeled by two Gaussian processes, \(y_{1i}\sim \mathrm{gp} (f_{1i}, \sigma), \quad y_{2i}\sim \mathrm{gp} (f_{2i}, \sigma),\) but the actual goal is to model the autoregressive dynamic \(f_{2i} \sim \mathrm{~some~function~of~} f_{1i}.\)
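To be concrete, the version I have in mind below writes that unknown function as a spline expansion (the additive normal error here is just one convenient choice):

\[ f_{2i} = \sum_{j=1}^{J} \beta_j B_j(f_{1i}) + \epsilon_i, \qquad \epsilon_i \sim \mathrm{normal}(0, \tau), \]

where the $B_j$ are B-spline basis functions with prefixed knots and the $\beta_j$ are regression coefficients.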

It is actually straightforward to do that in Stan. We just model the autoregression using splines.
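Here is a minimal numpy sketch of the mechanics (the helper `bspline_basis` and all numbers are hypothetical, made up for illustration; in Stan the same Cox-de Boor recursion would be written as a user-defined function). The key point is that $B_j(f_{1i})$ is a function of a parameter, so it has to be rebuilt whenever $f_1$ changes:

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """All B-spline basis functions evaluated at x, via the Cox-de Boor recursion.

    The boundary knots are repeated `degree` extra times, giving
    len(knots) + degree - 1 basis functions; every basis function is
    zero outside [knots[0], knots[-1]].
    """
    x = np.asarray(x, dtype=float)
    t = np.concatenate([np.repeat(knots[0], degree), knots,
                        np.repeat(knots[-1], degree)])
    # order-1 (piecewise constant) basis on half-open intervals [t_j, t_{j+1})
    B = np.array([(t[j] <= x) & (x < t[j + 1]) for j in range(len(t) - 1)],
                 dtype=float).T
    for k in range(1, degree + 1):
        B_new = np.zeros((x.size, len(t) - k - 1))
        for j in range(len(t) - k - 1):
            d1, d2 = t[j + k] - t[j], t[j + k + 1] - t[j + 1]
            left = (x - t[j]) / d1 * B[:, j] if d1 > 0 else 0.0
            right = (t[j + k + 1] - x) / d2 * B[:, j + 1] if d2 > 0 else 0.0
            B_new[:, j] = left + right
        B = B_new
    return B

rng = np.random.default_rng(0)
knots = np.linspace(-4, 4, 11)              # prefixed knots
x_obs = rng.normal(size=200)

# With observed x the basis matrix is data: compute it once and store it.
B_obs = bspline_basis(x_obs, knots)

# With a latent f1 the basis matrix depends on parameters: inside HMC it has
# to be rebuilt at every leapfrog step, for every draw of f1.
f1 = x_obs + 0.1 * rng.normal(size=200)     # stand-in for the current draw of f1
beta = rng.normal(size=B_obs.shape[1])      # spline coefficients (also parameters)
f2_hat = bspline_basis(f1, knots) @ beta    # "some function of f1", as a spline
```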

However, there are still some challenges here:

  1. For observed values, the spline basis matrix $B_j(x_i)$ can be pre-computed and stored. The basis expansion becomes a feature extraction, and the fit is as simple as a regression. But for latent-variable modeling, the basis matrix $B_j(f_i)$ has to be re-evaluated for every new value of the parameter $f_i$. That is a lot of evaluation if either the number of basis functions or the sample size is big.
  2. The usual B-spline regression can be GREATLY improved in efficiency by sparse matrices. In one dimension, suppose the knots are placed at the data quantiles with probabilities $1/K, \dots, (K-1)/K$; then each row of $B_j(x_i)$ has only an $O(1/K)$ fraction of non-zero elements. In two dimensions with a tensor product basis, that is $O(1/K^2)$. The $B$ matrix is almost always sparse for latent-variable splines too. But how can we use sparse matrices when the matrix changes at every iteration? I mean, we can still convert that $B$ to a sparse matrix, but we have to redo the conversion in every iteration.
  3. The most horrible part is that every spline basis function is zero outside the boundary knots, but we do not really know the boundary of the latent variables because they are latent. That is bad. If the sampler or the optimizer enters a region where the latent variables $\{f_i\}$ are inferred to spread wider than the knots (which are usually prefixed), the gradient of the coefficients in front of the $B$ matrix becomes zero. The optimizer gets stuck, and HMC degenerates into a random walk (see the toy example after this list).
  4. A careful analysis shows that the region where the above gradients are non-zero is indeed a very thin slice of the joint space. One solution is to trim the parameter space to match it: put a soft constraint on the span of the latent variables so that they stay inside the boundary knots. I think in general B-splines are sensitive to the boundary knots, and the unknown support amplifies such sensitivity.
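To see points 2 and 3 concretely, here is a small numpy check, reusing the hypothetical `bspline_basis` helper from the sketch above (the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
K = 20
knots = np.quantile(x, np.linspace(0, 1, K + 1))    # boundary knots at the data range

# Point 2: each row of B has only ~(degree + 1) non-zero entries out of K + 3.
B = bspline_basis(x, knots, degree=3)               # helper from the earlier sketch
print(B.shape, np.mean(B > 0))                      # (1000, 23), roughly 4/23

# Point 3: a latent draw that wanders past the prefixed boundary knots
# produces an all-zero row of basis values.
f_out = np.array([x.max() + 1.0])
print(bspline_basis(f_out, knots, degree=3))        # all zeros
# Since the fitted value is B @ beta, an all-zero row is constant in beta
# (and flat in f), so that observation contributes zero gradient to both.
```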