Hi Jakob,
Thanks for asking this question. It’s one of those cases where there are some handy numerical tricks to speed things up, but I don’t think they’ve been documented anywhere for greta (though this came up in an old, closed GitHub issue https://github.com/greta-dev/greta/issues/281). Here are a few things to consider:
bad proposals
You probably already saw this, but you can handle numerical issues like this failed Cholesky decomposition as bad proposals (divergent transitions in Stan terminology) by setting one_by_one = TRUE
in the call to mcmc. Though you don’t want to have many (maybe no more than 1%?) bad samples during the sampling phase as it can lead to biased inferences. Sampling from the prior above gives me bad proposals during sampling, so we need to do something else.
only the diagonal
This won’t actually help with the Cholesky decomposition issue, but it helps with thinking about the next step. In Stan, it’s recommended to use quad_form_diag()
using a vector of diagonal elements, rather than the naive matrix multiplies. This skips some multiplications by zero, which makes little difference with a matrix of dimension 2, but can make a big difference as the matrix increases in size. You can do something similar in greta like this (I tidied up a couple of other little things and made it work with generic matrix dimension):
# a not particularly fast version of quad_form_diag, but faster than the matrix
# multiplies for large matrices
quad_form_diag <- function (mat, vec) {
mat <- sweep(mat, 1, vec, "*")
sweep(mat, 2, vec, "*")
}
M <- 2
N <- 10
tau <- exponential(0.5, dim = M)
Omega <- lkj_correlation(3, M)
Sigma <- quad_form_diag(Omega, tau)
ab <- multivariate_normal(
mean = zeros(1, M),
Sigma = Sigma,
dimension = M,
n_realisations = N
)
I’m not planning to include this quad_form_diag()
function in greta, because a) It’s not needed when you propagate the Cholesky factors in the next step, and b) I’m hoping this type of operation can soon be automatically detected and done internally - something I’ll talk about at the end.
propagating the Cholesky factors
When greta samples the parameters for Omega
, it actually works with the Cholesky factor of the matrix, to get around the symmetry constraint of the matrix. I.e. it samples a matrix U
and then forms \Omega = U' U:
Omega <- t(Omega_U) %*% Omega_U
Note: R uses the upper triangular matrix, but you’ll more commonly see the lower triangular matrix L = U' in discussions of Cholesky factorisation.
multivariate_normal()
also requires the Cholesky factor of Sigma
to compute the density, which it usually obtains by doing a Cholesky decomposition on Sigma
. This operation is susceptible to numerical under/overflow, so can fail for some values of the parameters - which is what the errors are about. But if we can get the Cholesky factor of Sigma
without doing this decomposition, we can avoid those errors. In fact, the Cholesky factor of Sigma can be computed from the Cholesky factor of Omega
and the vector of standard deviations tau
, as:
Sigma_U = sweep(Omega_U, 2, tau, '*')
This is the same trick the Stan manual suggests, using diag_pre_multiply()
on the vector of standard deviations and the Cholesky factor of the correlation matrix.
To do this you need Omega_U
. The way to do this in greta is just to do: Omega_U <- chol(Omega)
. greta has an internal mechanism (representations) to skip operations like this when the result is already known. Instead of doing a Cholesky decomposition on Omega
, it just grabs the Cholesky factor that it was working with already, and uses that. You can see this if you plot the model:
Omega <- lkj_correlation(3, M)
Omega_U <- chol(Omega)
m <- model(Omega_U)
plot(m)
The nameless variable in the bottom left is a vector-veersion of the Cholesky factor, it’s reshaped into a matrix, and then converted into the symmetric matrix Omega
(which is the target of the LKJ distribiution), using internal functions flat_to_chol()
and chol_to_symmetric()
. But Omega_U
is just a copy of the Cholesky factor, no need to decompose it.
So we can get to Sigma_U
without doing a Cholesky decomposition, and then use an internal greta function to turn Sigma_U
into Sigma
and pass it into multivariate normal. The function chol_to_symmetric()
keeps track of the fact that Sigma
has a Cholesky factor representation, so that we don’t need a Cholesky decomposition at any point:
chol_to_symmetric <- .internals$utils$greta_array_operations$chol_to_symmetric
M <- 2
N <- 10
tau <- exponential(0.5, dim = M)
Omega <- lkj_correlation(3, M)
Omega_U <- chol(Omega)
Sigma_U <- sweep(Omega_U, 2, tau, "*")
Sigma <- chol_to_symmetric(Sigma_U)
ab <- multivariate_normal(
mean = zeros(1, M),
Sigma = Sigma,
dimension = M,
n_realisations = N
)
A plot might make this clearer:
m <- model(ab)
plot(m)
You can see the representations being used to pass the Cholesky factors instead of the symmetric matrices at two points.
This still hits an occasional (but much rarer) numerical issue with a matrix solve in computing multivariate normal, but that’s taken care of by one_by_one = TRUE
.
whitening decomposition
The above code should work well in a model when the multivariate normal variables are observed, rather than being unknown variables. I.e. when doing this:
distribution(y) <- multivariate_normal(
mean = zeros(1, M),
Sigma = Sigma,
dimension = M,
n_realisations = N
)
But when ab
is an unknown parameter in the model it’s possible to improve sampling again by a ‘whitening’ decomposition (whitening as in removing correlation from; akin to converting to white-noise signal), which is like a multivariate version of hierarchical decentring. This is also mentioned in the Stan manual: https://mc-stan.org/docs/2_18/stan-users-guide/multivariate-hierarchical-priors-section.html
The complete code for this is:
M <- 2
N <- 10
tau <- exponential(0.5, dim = M)
Omega <- lkj_correlation(3, M)
Omega_U <- chol(Omega)
Sigma_U <- sweep(Omega_U, 2, tau, "*")
z <- normal(0, 1, dim = c(N, M))
ab <- z %*% Sigma_U
This samples well, and has no numerical issues when I run it.
future plans
One of the design principles of greta is that the user shouldn’t have to think about this sort of computational and numerical issue when modelling. For relatively common scenarios like this it should just work. That’s why the representations mechanism is there, so that we can use numerical tricks when available. greta doesn’t have both a poisson
and a poisson_log
distribution like Stan (the latter for working with the log of the rate parameter, for numerical stability) - greta just has poisson()
and uses the log representation when it’s available.
Ultimately, I’d like for greta to be able to do that for more complex sets of operations, like in this case. Then the user could just write the naive code like in your example, and greta would handle all the tricks: the diagonalisation; the efficient diagonal multiplication; the propagation of Cholesky factors through the matrix multiplication; and the whitening decomposition if (and only if) the resulting greta array is used as a parameter, not a distribution over data. That quite technically tricky though, so will require a bit of thinking and refactoring.
example model
If someone reading this would like to contribute to greta, adding this last bit of code as an example model on the website would be really helpful! You can find instructions to do that here: https://github.com/greta-dev/greta/blob/master/.EDIT_WEBSITE.md#adding-example-models