I mean, if I imagine the log likelihood of a GMM then I see no discrete parameters in it. So the HMC would just simply work fine - according to my very intuitive understanding of the situation - right ? So as long as the log likelihood is a continuous function of its parameters Greta works just fine, right ? Because the derivatives need to be calculated for HMC and that can be done just fine if the parameters themselves are continuous, right ?

HOWEVER !

Is the non-identifiability problem solvable in Greta ?

Yepp, regarding the first part, you are correct. If you marginalize the latent variables Z in âZ -> Xâ, you dont have discrete parameters and could implement the GMM in Stan just like in Greta.

Regarding identifiability, I believe your concerns are spot on (but I havenât followed the latest greta dev, so please correct me if I am wrong): the best way to go is probably to use non-exchangeable priors. That is, Iâd follow Michaelâs suggestion here.

Thanks @dirmeier , so there is no âorderedâ way of solving the issue ?

It seems that the ânon-exchangableâ solutions is not very âreliableâ (taken from):

This example clearly demonstrates the subtle challenge of trying to resolve labeling degeneracy with most non-exchangeable prior distributions. When there are many data the mixture likelihood will be very informative and it can easily overwhelm even strong prior information. The posterior will no longer be symmetric between the degenerate modes, but the disfavored modes will still have sufficiently significant posterior mass to require exploration in the fit.

However reading the rest of the post, it seems that the " Breaking the Labeling Degeneracy by Enforcing an Ordering" is the most reliable way to solve this issue. Can that be done in Greta nowadays ?

I am pretty new to the world of HMC but somehow I feel that Greta is perhaps a bit âeasierâ / âsimplerâ than Stan, since it uses âjustâ plain R functions, and that gives me the intuition that it is a simple solution to HMC.

Also, I can in principle fully understand Greta if I âjustâ read thesource code, which is âjustâ âsimpleâ R, right ? No magic ? No compiler behind the scenes, etc âŚ, right ?

To my knowledge there is no âorderedâ way, no. Maybe @nick has something planned.
And yes, non-exchangeable priors often donât work to well either.

There is not a huge difference in building models in Greta vs Stan, but yeah, Greta is surely more intuitive. Regarding code, Greta is certainly not easier than Stan, cause much of the computation is done by tensorflow and tensorflow-probability, and Greta wraps around those.

HMC itself is fairly easy to implement in R (see https://arxiv.org/abs/1111.4246 or Radford Nealâs work https://arxiv.org/abs/1206.1901), so if you understand GMMs, youâll also manage to understand HMC. Maybe start with a simple model first and try implementing it, like a normal observation model with Laplace prior on the mean). Do that with MH, then extend it to HMC. âŚ

Thanks for your interest in greta! I just wanted to make a few points:

Discrete parameters are a problem for any MCMC software that uses HMC which includes both Stan and greta. Marginalising out discrete parameters is the way to correct this problem. In greta we have recently implemented first class mixture models (here) which solve many common discrete variable issues and we are currently working on ways to make more complicated marginalisation easier (see this branch) but this remains a work in progress.

Mixture models are hard for all Bayesian, irrespective of software. Ordered constraints or non-exchangeable priors need to be used to ensure inference is reliable. (At the moment ordered constraints are not supported in greta)

Stan and greta are complicated in different ways and on different levels (I donât think one is really more or less complicated than the other) greta does, however, have a much simpler and more intuitive R facing interface.

I was looking for it and have not found it âŚ and I was thinking it does not exist. But it does that is a relief âŚ ok, please forget my previous answerâŚ as I said I am new to this world âŚ and got a bit lost.

Quick question: - does Greta use automatic symbolic differentiation for the HMC ?

In my understanding HMC is the following : combination of MC (Metropolis) and MD (molecular dynamics, with energy conservation).

For molecular dynamics the gradient is needed, so the question is, how does Greta calculate this gradient ? Symbolically using automatic differentiation ? Or somehow numerically ?

I have âtwoâ main and conceptually different questions:

At this moment I would like to create a little proof of concept using HMC for fitting GMMs, is the âfirst class mixture model supportâ makes fitting a 3D GMM (with 4 components) with unrestricted covariance matrix âfoolproofâ, where âfoolproofâ means no problem with non-identifiability (say, the âorderingâ solution). Is it possible to do so at this very moment with Greta (âeasilyâ = âquicklyâ) or should I rather try to do it in STAN ?

âmore intuitive R facing interfaceâ => yes, that is what makes me consider Greta as a âlearningâ language (and later also a production language), the reason for this is that I imagine that Great has an excellent documentation because it âinheritsâ the documentation from âRâ automatically, right ?

In other words " greta models are written directly in R, so thereâs no need to learn another language like BUGS or Stan" - as written on website

=> so this means, that Greta has an excellent documentation âfor freeâ, right ?

I mean, if Greta reuses âRâ then the already existing, excellent, detailed, precise documentation (and support) for âRâ is also exactly the same for Greta, right ?

In other words, Greta inherits the documentation from âRâ for free, right ?

In other words, if I learn âRâ then I can sit down, specify a likelihood function and start my HMC simulation, right ?

In other words, in order to specify the likelihood function I do not need to learn any new language and I can also use all existing R functions for specifying my likelihood function ? Is it so ?

There is no rush to answer at all, I understand that I am asking lot of questions here, I am thankful to you and to @dirmeier for taking your (plural) time and answering my questions in great detail and care.

So, the reason I deleted my above post is, that I am not an expert on autodiff, but I give it a try, since you asked. Maybe someone who knows more can correct me:

Symbolic diff is not the same as auto diff. Tensorflow uses reverse-mode auto diff (I believe), where for each element of a program you need to âregister a gradientâ. So TF does it, not Greta.

Due to TF needing to register gradients, I donât believe you can (yet) come up with your own likelihoods in Greta but need to use the distributions that are provided.

About your other questions:

Going through the Stan or Greta code will give you almost nothing. The 2 papers I posted are perfect intros into HMC/NUTS and will give you the tools to develop any GMM. In addition I recommend Michael Betancourtâs intros on HMC. I know computer scientists love reading code (I am one myself), but one should first understand the theory and then the intricacies of the implementation. I.e. donât look at the code, but do the math for the GMM yourself and then implement it in R. The advantage of probabilistic languages is that this part will be done for you and you only need to define models.

The interface @Voltemand was referring to is the language how you write models (I believe). If you know R already, you can write Greta code, too. Stan is in that sense more general, bcs models are written into â.stanâ files. It donât understand what you mean with âfree, inherited documentationâ? Why donât you have a look at the example page where we made Stan vs. Greta comparisons. What language do you prefer? If itâs python you might also be interested in Edward/Edward2.

Both Stan and Greta are perfectly able to fit GMMs. I personally use Greta for larger (deep) models, and Stan for smaller more complex ones.

Thank you, yes, basically by âinheritingâ the documentation from R I was meaning that, for example,
if I look at this piece of code, then it exactly tells me what Greta is doing. If I can read R then I know what I am doing if I am using Greta. Sometimes the code is the most accurate documentation

âone should first understand the theoryâ -> yes, indeed, I am still trying to understand the
part below (from here).

Marginalizing out the discrete assignments yields a likelihood that depends on only continuous parameters, making it amenable to state-of-the-art tools like Stan. Moreover, modeling the latent mixture probabilities instead of the discrete assignments admits more precise inferences as a consequence of the Rao-Blackwell theorem. From any perspective the marginalized mixture likelihood is the ideal basis for inference.

To be honest this is something I indeed do not get yet ^^^ - need to think about this part.

Yes, indeed, fair point, I did look, itâs a very nice piece of set of examples.

The language is actually a tiny little detail. Most important is that there is good documentation for the library, and now I more or less understand the concept of Greta, and it seems that the R code is very clear and readable ( I have great difficulty in reading C++ code ). I usually like to understand how a piece
of software works. Greta seems to be very understandable, and that is good. I think for now I might be ok with STAN / ScalaStan, maybeâŚ I still need to try it though (ScalaStan I mean), but the understandability of the R code of Greta is a huge huge + (at least in my personal opinion). I am a big fan of source code reading

Yes, this is a very good point, I did not know this, I need to look into this.

HmmâŚ I think I had a quick look at some HMC paper and I think I got what it isâŚ pretty quickly âŚ it all came originally from statistical physics , there it is called hybrid monte carlo, and I think I did some homework writing some MC code in 2002, as a homework exercise, in Fortran :), later I was doing all sorts of MC and MD computer simulations for solid state physics for some 8 years, so after a few minutes of reading a few lines from one of those papers I think I got what the point of HMC is. I wrote lot of papers using it - it just had a different name and the energy term was a bit different - so I feel totally home - one of my favorite book in 2002 was Computer Simulation of Liquids, I went to sleep and woke up with this book nice memories .

It is very interesting to see how this method is used for Bayesian inference ! Pretty clever , but totally makes lot of sense. I guess 10 years of learning/researching these things gave some intuition about these topics but I have to admit, this is the first time I am using HMC for Bayesian inference.

Yes, I have read the Bishop book a few times, it was not a very easy readâŚ but in the end I did learn a few things from it. The GMM part was covered by a course at Aalto (Espoo, Finland) called âAdvanced Probabilistic Modelsâ, one of my favorite courses I took, but also maybe the most difficult one in âMLâ

So yes, my questions do sound a bit confusing or strange I admit but the answers you @dirmeier and @Voltemand gave plus a little bit of reading Gretaâs source code clarified a LOT of things. I still need to look into automatic vs symbolic differentiation but I got a very nice high level understanding about Greta from your answers, @dirmeier and @Voltemand !

Thank you for taking your times and giving detailed answers. It is very good to have a clear(er) high level intuition on how Greta works.

So yes, I think the pieces are starting to come together nicely in my head, this is quite a relief ! Thanks !

This conversation has moved on a bit from the original question, but I just wanted to point out that thereâs now a mixture() function in greta for setting up mixture models (including GMMs), using the marginalisation approach: https://greta-stats.org/reference/mixture.html

Re. ordered variables, yes Iâm planning to provide a nice interface that sort of constraint in a future version of greta. For now, you can make your own ordered variables with something like this:

k <- 5
x <- c(variable(), variable(lower = 0, dim = k - 1)
x <- cumsum(x)

Defining ordered variables where you want to place priors on each of the elements is tricky. Itâs usually doable, but itrequires doing some maths, so a simpler interface for creating them would be preferable.

Hereâs a script I was playing around with for maximum-likelihood GMMs in greta, fitted to the old faithful dataset. The two functions at the top just wrap greta code to set up positive-definite and simplex variables without priors. Hope this helps!

# maximum-likelihood estimation of a two-component Gaussian mixture model for
# the old faithful eruptions data
# a symmetric positive-definite variable greta array
covar <- function (n = 2) {
L <- zeros(n, n)
idx <- lower.tri(L, diag = TRUE)
L[idx] <- variable(dim = sum(idx))
L %*% t(L)
}
# a simplex variable greta array
simplex <- function (n = 2) {
if (n < 2) {
stop ("nah")
} else if (n == 2) {
x <- uniform(0, 1)
x <- c(x, 1 - x)
} else {
x <- uniform(n)
x <- x / sum(x)
}
x
}
library(greta)
# NB the likelihood surface of the model is multimodal (non-convex), so this doesn't always converge correctly!
faithful <- scale(faithful)
n <- nrow(faithful)
# add some ordering to the gaussian means, the second element of mu2 must be greater than the second element of mu1
mu1 <- variable(dim = c(1, 2))
mu2 <- mu1 + cbind(variable(), variable(lower = 0))
sigma1 <- covar(2)
d1 <- multivariate_normal(mu1, sigma1,
n_realisations = n)
sigma2 <- covar(2)
d2 <- multivariate_normal(mu2, sigma2,
n_realisations = n)
weights <- simplex(2)
distribution(faithful) <- mixture(d1, d2, weights = weights)
m <- model(mu1, mu2, sigma1, sigma2)
o <- opt(m, adjust = FALSE)
plot(faithful, pch = 16, cex= 0.5, col = grey(0.7))
library(car)
ellipse(as.vector(o$par$mu1), o$par$sigma1, 1.96)
ellipse(as.vector(o$par$mu2), o$par$sigma2, 1.96)