Introducing rgreta


#1

Hi all,

Over the last few weeks I’ve been working on a package called rgreta which aims to provide simple simulation of greta models. I have just pushed the project to github and you can find it here.

rgreta provides a new function simulate which will simulate the output of a greta model by sampling parameters and running operations.

Example usage (from the README):

We define a simple greta model:

library(greta)

int <- normal(0, 10)
coef <- normal(0, 10)
sd <- cauchy(0, 3, truncation = c(0, Inf))

mu <- int + coef * attitude$complaints

distribution(attitude$rating) <- normal(mu, sd)

m <- model(int, coef, sd)

We can then simulate from the model as is:

devtools::install_github("Voltemand/rgreta")
library(rgreta) 
simulations <- simulate(m)
head(simulations) 

or we can specify model parameters

simulations <- simulate(m, list(int = 1, coef = 1, sd = 1)
head(simulations) 

This package is very much in the essential feature adding/bug crushing stage so I’d love to hear about any bugs or ideas.

The major limitation of the package at the moment is that it only supports sampling from Normal and Cauchy distributions and running addition and multiplication operations. If there is interest in the package however it should be very easy to add more.

All the best,
Jeffrey


#2

I think it might help if you elaborate on the use cases that rgreta is designed to help with. At the moment, it seems the simulate command is equivalent to something like:

m <- model(int, coef, sd)
draws = mcmc(m)
draws %>% as.matrix() %>% dplyr::as_tibble() %>%
  mutate(simulate = int + coef * attitude$complaints)

I am guessing the goal is to allow for easier prior predictive and posterior predictive checks (which would be great), but it would be helpful to hear a little more about the motivation of the package.


#3

Awesome!

This is functionality I was planning to add to greta. Having i.i.d. random sampling from distributions will be super-useful for doing posterior predictive checks, simulating from priors for sanity checking, and also for generating sensible starting values. It great that you’ve already made such an impressive start!

If this were a part of greta, you could add a method to each distribution class with the corresponding random generator in R (or, ideally, r-tensorflow) code.

I’d be very happy to help out where I can if you want to try adding this stuff in a feature branch of greta?

Nick


#4

Hi ajf,

Thanks for your question.

The code you give isn’t quite doing the same thing as simulate(m) the code you give draws from the posterior distributions of the parameters and then calculates from each draw a value for an intermediate parameter in the model.

Each call to simulate, on the other hand, will draw a value for each of the parameters from their priors and then pass these values through the model eventually drawing a value for the ‘data’. Each call to simulate (at this stage) therefore generates an independent sample of the data from the data generating process. For the model I’ve described each call to simulate is equivalent to running the following R code:

int <- rnorm(1, 0, 10)
coef <- rnorm(1, 0, 10)
sd <- rhalfcauchy(0, 3) # pseudo-code rhalfcauchy is not in base R
mu <- int + coef * attitude$complaints
data <- rnorm(length(attitude$rating), mu, sd)

+ simulate can fix certian paraemters in the model easily!

As to my motivation, I primarily wanted to see if was possible to simulate greta models in this way. I do think though that simulation - prior, posterior checks and the more recent SBC are really important parts of a bayesian workflow. Also I like to, when I’m starting an analysis, draw simulated data from the true process with certain parameter values and see if mcmc can recover them. The simulate function partially simplifies this process.

All the best!


#5

Hi Nick,

That would be great - I’d be very happy to work on a branch. How should I go about adding a branch from a git perspective? Just proceed as normal for a bug fix etc. or something different?

I think the first thing to do would be to add sampling for all the distributions. After that it would just be a matter of working out an interface for the checks and then implementing the basic simulation algorithm of rgreta

All the best!

p.s. I’ve had a bit of trouble managing multiple git branches in the past so it might be best if you could review my pull request at some stage - just in case I accidentally delete it…


#6

Great!

Yes, exactly like for a bugfix: fork the main greta repo on GitHub, clone it, and the create a branch locally. Very happy to review PRs, or help with git troubles if I can. Just tag me on GitHub!


#7

Okay I’ve had a bit of a go at an implementation (you can see it here) but I’ve run into a few problems. (By the way, not sure if here is the best place to have these sort of discussions but not sure where else)

The basic idea for simulating a distribution is a new method for every distribution: tf_sample() which is called from the wrapper sample() at the distribution node level. For example for the normal distribution:

tf_sample = function() {

      parameters <- lapply(self$parameters, function(x) x$value())
      dist <- tfp$distributions$Normal(loc = parameters$mean,
                                       scale = parameters$sd)
      tf$Session()$run(dist$sample(self$dim))

    }

I’m not sure if there is any better infrastructure in greta to do this tf$Session$run idea - it does feel a bit inelegant.

Does this look like a somewhat sensible idea?

A big issue with this approach is how we sample distributions not implemented in tensorflow_probability or ones for which we don’t use it. Some ideas I had:

  • Simulate in R. Issue - how to broadcast?
  • Simulate using distribution identities - i.e. F is the ratio of Chi squared
  • Don’t simulate and throw an error if we are told to

Do you have any ideas?

All the best!


#8

Great! Probably best if you open an issue in the main GitHub repo about adding sampling, copy this stuff over, then we can discuss it there.

I wasn’t thinking you’d execute a tf session inside those methods, just return the tensor for the random elements (so end the function with dist$sample(self$dim)).

Then the function that does simulation of the whole model can put that tensor wherever it needs to go. E.g. when having the variable node define itself on the dag inside the simulation function, it can use this method, and place the resulting tensor in the dag’s tf_environment with the appropriate name (rather than collecting from the free state vector and transforming it, like it would we defining the model for inference). Whenever the tensorflow session is executed, that will then generate a different IID random draw from this distribution.

For the distributions that don’t have TF probability random number generators, we’ll have to code our own in tensorflow code. It’s normally easiest to generate IID standard uniform variables and push them through the distribution’s inverse CDF. E.g. for a logistic distribution:

tf_rlogis <- function (n, location = 0, scale = 1) {
  u <- tf$random$uniform(shape(n))
  z <- log(u / (1 - u))
  z * scale + location
}

# get tensor for result
draws <- tf_rlogis(1000, 2, 1.3)

# compare with stats::rlogis
par(mfrow = c(2, 1))
sess <- tf$Session()
hist(sess$run(draws))
hist(rlogis(1000, 2, 1.3))