GMM in Greta, is non-identifiability a show stopper in Greta?

jhegedus42 · January 23, 2019, 12:05pm

Hi,

Is there a way to use Greta to fit a GMM ?

I read that Greta cannot sample from discrete distributions, but STAN cannot sample from them either but
it can do GMM by marginalising.

I mean, if I imagine the log likelihood of a GMM then I see no discrete parameters in it. So the HMC would just simply work fine - according to my very intuitive understanding of the situation - right ? So as long as the log likelihood is a continuous function of its parameters Greta works just fine, right ? Because the derivatives need to be calculated for HMC and that can be done just fine if the parameters themselves are continuous, right ?

HOWEVER !

Is the non-identifiability problem solvable in Greta ?

More specifically, these problems :

Regards,

Jozsef

dirmeier · January 23, 2019, 9:46pm

Yepp, regarding the first part, you are correct. If you marginalize the latent variables Z in “Z -> X”, you dont have discrete parameters and could implement the GMM in Stan just like in Greta.

Regarding identifiability, I believe your concerns are spot on (but I haven’t followed the latest greta dev, so please correct me if I am wrong): the best way to go is probably to use non-exchangeable priors. That is, I’d follow Michael’s suggestion here.

jhegedus42 · January 23, 2019, 10:09pm

Thanks @dirmeier , so there is no “ordered” way of solving the issue ?

It seems that the “non-exchangable” solutions is not very ‘reliable’ (taken from):

This example clearly demonstrates the subtle challenge of trying to resolve labeling degeneracy with most non-exchangeable prior distributions. When there are many data the mixture likelihood will be very informative and it can easily overwhelm even strong prior information. The posterior will no longer be symmetric between the degenerate modes, but the disfavored modes will still have sufficiently significant posterior mass to require exploration in the fit.

However reading the rest of the post, it seems that the " Breaking the Labeling Degeneracy by Enforcing an Ordering" is the most reliable way to solve this issue. Can that be done in Greta nowadays ?

I am pretty new to the world of HMC but somehow I feel that Greta is perhaps a bit ‘easier’ / ‘simpler’ than Stan, since it uses ‘just’ plain R functions, and that gives me the intuition that it is a simple solution to HMC.

Also, I can in principle fully understand Greta if I “just” read thesource code, which is “just” “simple” R, right ? No magic ? No compiler behind the scenes, etc …, right ?

Cheers

Jozsef

dirmeier · January 23, 2019, 10:46pm

To my knowledge there is no “ordered” way, no. Maybe @nick has something planned.
And yes, non-exchangeable priors often don’t work to well either.

There is not a huge difference in building models in Greta vs Stan, but yeah, Greta is surely more intuitive. Regarding code, Greta is certainly not easier than Stan, cause much of the computation is done by tensorflow and tensorflow-probability, and Greta wraps around those.

HMC itself is fairly easy to implement in R (see https://arxiv.org/abs/1111.4246 or Radford Neal’s work https://arxiv.org/abs/1206.1901), so if you understand GMMs, you’ll also manage to understand HMC. Maybe start with a simple model first and try implementing it, like a normal observation model with Laplace prior on the mean). Do that with MH, then extend it to HMC. …

Voltemand · January 23, 2019, 11:43pm

Hi jhegdus42,

Thanks for your interest in greta! I just wanted to make a few points:

Discrete parameters are a problem for any MCMC software that uses HMC which includes both Stan and greta. Marginalising out discrete parameters is the way to correct this problem. In greta we have recently implemented first class mixture models (here) which solve many common discrete variable issues and we are currently working on ways to make more complicated marginalisation easier (see this branch) but this remains a work in progress.
Mixture models are hard for all Bayesian, irrespective of software. Ordered constraints or non-exchangeable priors need to be used to ensure inference is reliable. (At the moment ordered constraints are not supported in greta)
Stan and greta are complicated in different ways and on different levels (I don’t think one is really more or less complicated than the other) greta does, however, have a much simpler and more intuitive R facing interface.

All the best!

jhegedus42 · January 24, 2019, 2:29pm

Ok, I wrote an answer a few minutes ago and then I realised that I found what I was looking for. I was looking for the Stan reference manual. For example, something like this : http://www.datascienceassn.org/sites/default/files/Stan%20Modeling%20Language%20User's%20Guide%20and%20Reference%20Manual%2C%20Version%202.2.0.pdf

I was looking for it and have not found it … and I was thinking it does not exist. But it does that is a relief … ok, please forget my previous answer… as I said I am new to this world … and got a bit lost.

dirmeier · January 24, 2019, 3:32pm

Rather take the recent ones :

jhegedus42 · January 24, 2019, 6:54pm

@dirmeier yes, thanks

Quick question: - does Greta use automatic symbolic differentiation for the HMC ?

In my understanding HMC is the following : combination of MC (Metropolis) and MD (molecular dynamics, with energy conservation).

For molecular dynamics the gradient is needed, so the question is, how does Greta calculate this gradient ? Symbolically using automatic differentiation ? Or somehow numerically ?

Cheers,

Jozsef

jhegedus42 · January 24, 2019, 7:49pm

Dear @Voltemand,

I have “two” main and conceptually different questions:

At this moment I would like to create a little proof of concept using HMC for fitting GMMs, is the “first class mixture model support” makes fitting a 3D GMM (with 4 components) with unrestricted covariance matrix “foolproof”, where “foolproof” means no problem with non-identifiability (say, the “ordering” solution). Is it possible to do so at this very moment with Greta (“easily” = “quickly”) or should I rather try to do it in STAN ?
“more intuitive R facing interface” => yes, that is what makes me consider Greta as a “learning” language (and later also a production language), the reason for this is that I imagine that Great has an excellent documentation because it “inherits” the documentation from “R” automatically, right ?
- In other words " greta models are written directly in R, so there’s no need to learn another language like BUGS or Stan" - as written on website
  - => so this means, that Greta has an excellent documentation “for free”, right ?
  - I mean, if Greta reuses “R” then the already existing, excellent, detailed, precise documentation (and support) for “R” is also exactly the same for Greta, right ?
  - In other words, Greta inherits the documentation from “R” for free, right ?
  - In other words, if I learn “R” then I can sit down, specify a likelihood function and start my HMC simulation, right ?
  - In other words, in order to specify the likelihood function I do not need to learn any new language and I can also use all existing R functions for specifying my likelihood function ? Is it so ?

There is no rush to answer at all, I understand that I am asking lot of questions here, I am thankful to you and to @dirmeier for taking your (plural) time and answering my questions in great detail and care.

Kind Regards,

Jozsef

dirmeier · January 25, 2019, 9:00am

So, the reason I deleted my above post is, that I am not an expert on autodiff, but I give it a try, since you asked. Maybe someone who knows more can correct me:

Symbolic diff is not the same as auto diff. Tensorflow uses reverse-mode auto diff (I believe), where for each element of a program you need to “register a gradient”. So TF does it, not Greta.
Due to TF needing to register gradients, I don’t believe you can (yet) come up with your own likelihoods in Greta but need to use the distributions that are provided.

About your other questions:

Going through the Stan or Greta code will give you almost nothing. The 2 papers I posted are perfect intros into HMC/NUTS and will give you the tools to develop any GMM. In addition I recommend Michael Betancourt’s intros on HMC. I know computer scientists love reading code (I am one myself), but one should first understand the theory and then the intricacies of the implementation. I.e. don’t look at the code, but do the math for the GMM yourself and then implement it in R. The advantage of probabilistic languages is that this part will be done for you and you only need to define models.
The interface @Voltemand was referring to is the language how you write models (I believe). If you know R already, you can write Greta code, too. Stan is in that sense more general, bcs models are written into “.stan” files. It don’t understand what you mean with “free, inherited documentation”? Why don’t you have a look at the example page where we made Stan vs. Greta comparisons. What language do you prefer? If it’s python you might also be interested in Edward/Edward2.
Both Stan and Greta are perfectly able to fit GMMs. I personally use Greta for larger (deep) models, and Stan for smaller more complex ones.

Voltemand · January 25, 2019, 9:14am

That was exactly exactly what I was referring to @dirmeier. Thanks for your great response - I agree with all of it.

jhegedus42 · January 26, 2019, 1:42am

@dirmeier

Thank you, yes, basically by “inheriting” the documentation from R I was meaning that, for example,
if I look at this piece of code, then it exactly tells me what Greta is doing. If I can read R then I know what I am doing if I am using Greta. Sometimes the code is the most accurate documentation

“one should first understand the theory” -> yes, indeed, I am still trying to understand the
part below (from here).

Marginalizing out the discrete assignments yields a likelihood that depends on only continuous parameters, making it amenable to state-of-the-art tools like Stan. Moreover, modeling the latent mixture probabilities instead of the discrete assignments admits more precise inferences as a consequence of the Rao-Blackwell theorem. From any perspective the marginalized mixture likelihood is the ideal basis for inference.

To be honest this is something I indeed do not get yet ^^^ - need to think about this part.

Yes, indeed, fair point, I did look, it’s a very nice piece of set of examples.

The language is actually a tiny little detail. Most important is that there is good documentation for the library, and now I more or less understand the concept of Greta, and it seems that the R code is very clear and readable ( I have great difficulty in reading C++ code ). I usually like to understand how a piece
of software works. Greta seems to be very understandable, and that is good. I think for now I might be ok with STAN / ScalaStan, maybe… I still need to try it though (ScalaStan I mean), but the understandability of the R code of Greta is a huge huge + (at least in my personal opinion). I am a big fan of source code reading

Yes, this is a very good point, I did not know this, I need to look into this.

Hmm… I think I had a quick look at some HMC paper and I think I got what it is… pretty quickly … it all came originally from statistical physics , there it is called hybrid monte carlo, and I think I did some homework writing some MC code in 2002, as a homework exercise, in Fortran :), later I was doing all sorts of MC and MD computer simulations for solid state physics for some 8 years, so after a few minutes of reading a few lines from one of those papers I think I got what the point of HMC is. I wrote lot of papers using it - it just had a different name and the energy term was a bit different - so I feel totally home - one of my favorite book in 2002 was Computer Simulation of Liquids, I went to sleep and woke up with this book nice memories .

It is very interesting to see how this method is used for Bayesian inference ! Pretty clever , but totally makes lot of sense. I guess 10 years of learning/researching these things gave some intuition about these topics but I have to admit, this is the first time I am using HMC for Bayesian inference.

Yes, I have read the Bishop book a few times, it was not a very easy read… but in the end I did learn a few things from it. The GMM part was covered by a course at Aalto (Espoo, Finland) called “Advanced Probabilistic Models”, one of my favorite courses I took, but also maybe the most difficult one in “ML”

So yes, my questions do sound a bit confusing or strange I admit but the answers you @dirmeier and @Voltemand gave plus a little bit of reading Greta’s source code clarified a LOT of things. I still need to look into automatic vs symbolic differentiation but I got a very nice high level understanding about Greta from your answers, @dirmeier and @Voltemand !

Thank you for taking your times and giving detailed answers. It is very good to have a clear(er) high level intuition on how Greta works.

So yes, I think the pieces are starting to come together nicely in my head, this is quite a relief ! Thanks !

Cheers,

Jozsef

nick · February 6, 2019, 9:43pm

This conversation has moved on a bit from the original question, but I just wanted to point out that there’s now a mixture() function in greta for setting up mixture models (including GMMs), using the marginalisation approach: https://greta-stats.org/reference/mixture.html

Re. ordered variables, yes I’m planning to provide a nice interface that sort of constraint in a future version of greta. For now, you can make your own ordered variables with something like this:

k <- 5
x <- c(variable(), variable(lower = 0, dim = k - 1)
x <- cumsum(x)

Defining ordered variables where you want to place priors on each of the elements is tricky. It’s usually doable, but itrequires doing some maths, so a simpler interface for creating them would be preferable.

Here’s a script I was playing around with for maximum-likelihood GMMs in greta, fitted to the old faithful dataset. The two functions at the top just wrap greta code to set up positive-definite and simplex variables without priors. Hope this helps!

# maximum-likelihood estimation of a two-component Gaussian mixture model for
# the old faithful eruptions data

# a symmetric positive-definite variable greta array
covar <- function (n = 2) {
  L <- zeros(n, n)
  idx <- lower.tri(L, diag = TRUE)
  L[idx] <- variable(dim = sum(idx))
  L %*% t(L)
}

# a simplex variable greta array
simplex <- function (n = 2) {
  if (n < 2) {
    stop ("nah")
  } else if (n == 2) {
    x <- uniform(0, 1)
    x <- c(x, 1 - x)
  } else {
    x <- uniform(n)
    x <- x / sum(x)
  }
  x
}

library(greta)

# NB the likelihood surface of the model is multimodal (non-convex), so this doesn't always converge correctly!

faithful <- scale(faithful)
n <- nrow(faithful)

# add some ordering to the gaussian means, the second element of mu2 must be greater than the second element of mu1
mu1 <- variable(dim = c(1, 2))
mu2 <- mu1 + cbind(variable(), variable(lower = 0))

sigma1 <- covar(2)
d1 <- multivariate_normal(mu1, sigma1,
                          n_realisations = n)

sigma2 <- covar(2)
d2 <- multivariate_normal(mu2, sigma2,
                          n_realisations = n)

weights <- simplex(2)

distribution(faithful) <- mixture(d1, d2, weights = weights)

m <- model(mu1, mu2, sigma1, sigma2)

o <- opt(m, adjust = FALSE)

plot(faithful, pch = 16, cex= 0.5, col = grey(0.7))
library(car)
ellipse(as.vector(o$par$mu1), o$par$sigma1, 1.96)
ellipse(as.vector(o$par$mu2), o$par$sigma2, 1.96)