Some difficulties of fitting splines on latent variables
Posted by Yuling Yao on Sep 16, 2020.

By spline I mean B-spline, and by B-spline I mean using a piecewise-polynomial basis in regression. It might also be called kernel regression.
I think there are three distinct usages of splines:

Interpolation: we force the spline to pass through given (x, y) points, and make predictions at new x.

Smoothing: instead of exact interpolation, we allow some fitting error, but we also penalize the roughness of the fitted curve (the integral of the squared second-order derivative in 1D, or of the Laplacian in 2D).

Filtering: it works like 2 in that we regress the observations on the basis-function matrix, but instead of viewing the penalization term as a bias-variance tradeoff, we shift the goal to recovering a latent surface. The bias in 2 is now interpreted as observation noise, and the latent surface is now the true value.
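To make the smoothing penalty in usage 2 concrete, here is a minimal Python sketch of a discrete (P-spline style) roughness penalty on the spline coefficients; using second differences as a stand-in for the integrated squared second derivative is my simplification, not something the post specifies:

```python
def second_diff_penalty(coef):
    """P-spline-style roughness penalty: sum of squared second differences
    of the spline coefficients, a common discrete stand-in for the
    integral of the squared second derivative of the fitted curve."""
    return sum((coef[j + 2] - 2 * coef[j + 1] + coef[j]) ** 2
               for j in range(len(coef) - 2))

smooth = second_diff_penalty([1.0, 2.0, 3.0, 4.0])  # linear trend: no roughness
wiggly = second_diff_penalty([1.0, 3.0, 1.0, 3.0])  # oscillating: penalized
```

Adding this penalty (times a tuning weight) to the squared fitting error gives the smoothing objective; with weight zero we are back to plain interpolation-style fitting.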
Now, in certain models, we model some latent variables by splines. For example, the observations $y_{1i}, y_{2i}, i=1, \dots, n$ are modeled by two Gaussian processes \(y_{1i}\sim \mathrm{gp} ({f_{1i}}, \sigma), \quad y_{2i}\sim \mathrm{gp} (f_{2i}, \sigma),\) but the actual goal is to model the autoregressive dynamic \(f_{2i} \sim \mathrm{~some~function~of~} f_{1i}.\)
It is actually straightforward to do this in Stan: we just model the autoregression using splines.
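A minimal sketch of what that joint log density looks like, written in Python rather than Stan for illustration; the normal observation model, the Cox-de Boor basis helper, the knot placement, and all names here are my assumptions, not the post's actual code, and the GP priors on $f_1, f_2$ are omitted for brevity:

```python
import math

def bspline(x, t, p, j):
    """Cox-de Boor recursion: j-th B-spline basis of degree p on knots t."""
    if p == 0:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    left = right = 0.0
    if t[j + p] != t[j]:
        left = (x - t[j]) / (t[j + p] - t[j]) * bspline(x, t, p - 1, j)
    if t[j + p + 1] != t[j + 1]:
        right = (t[j + p + 1] - x) / (t[j + p + 1] - t[j + 1]) * bspline(x, t, p - 1, j + 1)
    return left + right

def normal_lpdf(y, mu, sigma):
    return -0.5 * ((y - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def joint_lpdf(y1, y2, f1, f2, coef, t, p, sigma, tau):
    """Observation model plus the latent spline regression
    f2 ~ normal(spline(f1; coef), tau)."""
    lp = 0.0
    for i in range(len(y1)):
        lp += normal_lpdf(y1[i], f1[i], sigma)
        lp += normal_lpdf(y2[i], f2[i], sigma)
        # spline evaluated at a *latent* input f1[i], not at data:
        mu = sum(c * bspline(f1[i], t, p, j) for j, c in enumerate(coef))
        lp += normal_lpdf(f2[i], mu, tau)
    return lp
```

The key structural point is in the `mu` line: the basis functions are evaluated at parameters, which is exactly where the challenges below come from.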
However, there are still some challenges here:
 For observed values, the spline basis matrix $B_j(x_i)$ can be precomputed and stored. The basis expansion becomes a feature extraction, and the fit is as simple as a regression. But for latent modeling, the basis matrix $B_j(f_i)$ has to be re-evaluated for every new value of the parameter $f_i$. That is a lot of evaluation if either the number of basis functions or the sample size is big.
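A sketch of the contrast, assuming a Cox-de Boor helper and uniform knots of my own choosing: the observed-x matrix is built once, while the latent-f matrix must be rebuilt at every iteration, at a cost of $n \times K$ basis evaluations each time:

```python
def bspline(x, t, p, j):
    """Cox-de Boor recursion: j-th B-spline basis of degree p on knots t."""
    if p == 0:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    left = right = 0.0
    if t[j + p] != t[j]:
        left = (x - t[j]) / (t[j + p] - t[j]) * bspline(x, t, p - 1, j)
    if t[j + p + 1] != t[j + 1]:
        right = (t[j + p + 1] - x) / (t[j + p + 1] - t[j + 1]) * bspline(x, t, p - 1, j + 1)
    return left + right

def basis_matrix(xs, t, p):
    """n-by-K basis matrix; each call costs n * K basis evaluations."""
    K = len(t) - p - 1
    return [[bspline(x, t, p, j) for j in range(K)] for x in xs]

t = [float(k) for k in range(8)]  # uniform knots, K = 4 cubic basis functions
p = 3

# Observed-x regression: build B once before sampling, reuse it forever.
B_data = basis_matrix([3.2, 3.5, 3.8], t, p)

# Latent-f regression: f changes at every iteration, so B must be rebuilt.
for f in ([3.1, 3.4, 3.9], [3.2, 3.6, 3.7]):  # two mock sampler iterations
    B_latent = basis_matrix(f, t, p)          # re-evaluated each time
```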
 The usual B-spline regression can be GREATLY improved in efficiency by sparse matrices. In 1D, suppose the $K$ knots are the data quantiles with probabilities $1/K, \dots, (K-1)/K$; then each row of $B_j(x_i)$ has only an $O(1/K)$ fraction of nonzero elements. For 2D with a tensor product of bases, that fraction is $O(1/K^2)$. The $B$ matrix is almost always sparse for latent-variable splines too. But then how can we exploit sparsity when the matrix changes every iteration? I mean, we can still convert $B$ to a sparse matrix, but we have to redo the conversion in every iteration.
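The sparsity pattern comes from local support: a degree-$p$ B-spline row has at most $p+1$ nonzeros, so the nonzero fraction is $(p+1)/K = O(1/K)$. A minimal sketch, again assuming a Cox-de Boor helper and uniform knots of my choosing, storing each row as a dict of its nonzero entries (the conversion that would have to be redone per iteration for latent inputs):

```python
def bspline(x, t, p, j):
    """Cox-de Boor recursion: j-th B-spline basis of degree p on knots t."""
    if p == 0:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    left = right = 0.0
    if t[j + p] != t[j]:
        left = (x - t[j]) / (t[j + p] - t[j]) * bspline(x, t, p - 1, j)
    if t[j + p + 1] != t[j + 1]:
        right = (t[j + p + 1] - x) / (t[j + p + 1] - t[j + 1]) * bspline(x, t, p - 1, j + 1)
    return left + right

def sparse_rows(xs, t, p):
    """Each basis-matrix row as a dict {j: B_j(x)}, keeping only the
    at-most (p + 1) nonzero entries out of K columns."""
    K = len(t) - p - 1
    return [{j: b for j in range(K) if (b := bspline(x, t, p, j)) != 0.0}
            for x in xs]

t = [float(k) for k in range(20)]  # uniform knots, K = 16 cubic basis functions
rows = sparse_rows([5.5, 9.2, 12.7], t, 3)
# each row has at most 4 nonzeros out of 16 columns: O(1/K) density
```

For fixed data this conversion happens once; for latent inputs the nonzero *pattern* itself moves with $f$, so both the values and the structure must be recomputed every iteration.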
 The most horrible part is that the spline is zero outside its boundary knots, but we do not really know the boundary of the latent variables, because they are latent. That is bad. If the sampler or the optimizer enters a region where the latent variables $\{f_i\}$ are inferred to be wider than the knots, which are often fixed in advance, the gradient of the coefficients in front of the $B$ matrix becomes zero. The optimizer gets stuck, or HMC degenerates into a random walk.
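The vanishing gradient is easy to see: the derivative of the fitted value $\sum_j c_j B_j(f)$ with respect to each coefficient $c_j$ is just $B_j(f)$, and every $B_j(f)$ is identically zero once $f$ leaves the knot range. A sketch under the same assumed Cox-de Boor helper and uniform knots:

```python
def bspline(x, t, p, j):
    """Cox-de Boor recursion: j-th B-spline basis of degree p on knots t."""
    if p == 0:
        return 1.0 if t[j] <= x < t[j + 1] else 0.0
    left = right = 0.0
    if t[j + p] != t[j]:
        left = (x - t[j]) / (t[j + p] - t[j]) * bspline(x, t, p - 1, j)
    if t[j + p + 1] != t[j + 1]:
        right = (t[j + p + 1] - x) / (t[j + p + 1] - t[j + 1]) * bspline(x, t, p - 1, j + 1)
    return left + right

def coef_gradient(f, t, p):
    """d/dc_j of sum_j c_j * B_j(f) is simply B_j(f): the basis values."""
    K = len(t) - p - 1
    return [bspline(f, t, p, j) for j in range(K)]

t = [float(k) for k in range(8)]  # spline support is [0, 7)

inside = coef_gradient(3.5, t, 3)   # latent value inside the knot span
outside = coef_gradient(9.0, t, 3)  # latent value drifted past the boundary
# `inside` has nonzero entries; `outside` is all zeros, so the sampler
# receives no gradient signal for the coefficients at all.
```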
 A careful analysis shows that the region where the above gradients are nonzero is indeed a very thin slice of the joint space. One solution is to trim the space to match it: put a soft constraint on the span of the latent variables. I think B-splines are in general sensitive to the boundary knots, and the unknown support amplifies that sensitivity.
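One hedged way to implement such a soft constraint is a smooth barrier penalty subtracted from the log density, near zero inside the knot span and roughly linear in the violation outside it; the softplus form and the `scale` parameter here are my choices for illustration, not the post's prescription:

```python
import math

def soft_span_penalty(fs, lo, hi, scale=0.1):
    """Smooth barrier keeping latent values inside [lo, hi]: approximately
    zero inside, approximately (violation) outside, differentiable everywhere.
    Subtract this from the log density (or add as a penalty to the loss)."""
    pen = 0.0
    for f in fs:
        pen += math.log1p(math.exp((lo - f) / scale))  # penalize f < lo
        pen += math.log1p(math.exp((f - hi) / scale))  # penalize f > hi
    return scale * pen

inside = soft_span_penalty([3.0, 4.0, 5.0], 0.0, 7.0)   # all within the knots
outside = soft_span_penalty([3.0, 4.0, 9.0], 0.0, 7.0)  # one value past them
```

Unlike a hard constraint, this keeps the gradient informative just outside the boundary, which is exactly the region where the plain spline gradient dies.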