[Feedback requested!] proposed simulate() interface

Thanks! Yeah, that would probably be the most common use case. Though I worry that it would be non-intuitive given other functions operating on models (mcmc, opt, etc.) always return summaries of the greta arrays named in the model call. Not sure which is best.

I’ve been thinking again about the marginalisation interface in the feature/marginalise branch, and also doing predictive simulations (when you get new data to predict to after model fitting). In both of those use cases, you might want to create a new variable greta array after creating the model and doing MCMC, and then simulate their values given the MCMC samples (like you can already do for deterministic greta arrays with calculate()). You can’t do that with the current interface, since you have to pass in the model object, which wouldn’t have the new greta arrays.

In light of that, maybe having greta arrays be the simulation targets would be better, like in @cboettig’s post. so then the interface would look like this (no model passed in):

simulate(targets, nsim = 1, seed = NULL, values = list(), precision = c("double", "single"), ...)

and you’d use it like this:

# build a greta model
n <- 10
y <- rnorm(n)
y <- as_data(y)

library(greta)
sd <- lognormal(1, 2)
mu <- normal(0, 1, dim = n)
distribution(y) <- normal(mu, sd)
m <- model(mu, sd)

# prior simulation:

# simulate one random draw of mu and sd from the model prior:
sims <- simulate(list(mu, sd))

# 100 simulations of mu and sd)
sims <- simulate(list(mu, sd), nsim = 100)

# 100 simulations of y and mu
# NB: y must be a greta array here
sims <- simulate(list(y, mu), nsim = 100)

# conditional prior simulation:

# 100 simulations of y and mu, with sd fixed at 3
sims <- simulate(list(y, mu), nsim = 100, values = list(sd = 3))

# posterior simulation:

# generate posterior samples from the model as usual
draws <- mcmc(m)

# draw 100 simulations of y, conditioned on 100 random posterior
# samples of all the model parameters
sims <- simulate(y, nsim = 100, values = draws)

This way you would have to explicitly pass in the things you want to simulate. It wouldn’t be able to guess what you want with defaults, but at least you wouldn’t be confused!

A downside of this is that it doesn’t match the pattern for the simulate S3 class, where the first argument is a (fitted) model object.

So maybe it should be renamed to something else (sample is already taken) or maybe this functionality should be rolled into the same function as calculate (enabling a list of targets in that function)?

Goddamnit, sane API design is much harder than writing code!

Ooh… I do like the proposal of

sample(targets, nsim = 1, seed = NULL, values = list(), precision = c("double", "single"), ...)

I agree that this is less confusing than the original proposal passing the model object.

Would you rather have a separate random_sample() function with the interface you propose, or have calculate() do sampling when calculating stochastic variables?

In the latter option calculate() would accept a single target or a list of targets, and have seed and nsim arguments (defaulting to 1 for a list, or the number of draws). This would be backward compatible with the current interface. It would only error about values not being provided if they were variable greta arrays without distributions, otherwise it would sample missing values from their priors.

There’s definitely some overlap in the use of these two functions, and most arguments are identical, so might be simpler to combine them.

What would be very useful is to be able to do a posterior predictive distribution along of the line of the “generated quantities” section in a Stan program (or perhaps this should be done via “calculate”?), e.g.

# ... see prior model
distribution(y) <- normal(mu, sd)
draws <- mcmc(m)

y_posterior_predictive = normal(mu,sd)
sims <- simulate(m, nsim = 100, targets = list(y_posterior_predictive), values = draws)

Yes exactly. If we decide to wrap this into calculate, then doing:

sims <- calculate(y, draws)

would achieve this.

Hi Nick,

Looping back to this again. Is feature/simulate or feature/marginalise the right branch to test this out on?

Was just exploring how this would look in the context of gp’s here: http://rpubs.com/cboettig/greta-gp . I wrote down examples manually in base R of what I’m trying to do, (super simple stuff here of drawing from the prior etc), but not clear what the corresponding greta syntax would be?

This is on feature/simulate but the interface decided on here isn’t yet implemented.

Just had a very quick look at your doc. I don’t think you’ll be able to achieve what you want (sampling from the analytical expression for the conditional distribution) with greta.gp::gp, it’s just not designed to work that way. You can evaluate the kernels as covariance matrices though, and write the expression for the conditional density yourself?

Chiming in here only to give this a bit of a bump. Having the ability to simulate observations at each iteration (or – better yet – a subset of MCMC samples using the proposed nsim arg) would be huge for us (esp. as it relates to posterior predictive checks). I’d be more than happy to test implementations of simulate() as soon as the interface is decided on!

Great, thanks! I think rolling this into calculate() is the current plan. I should be able to implement that in January, when I should have some time for a decent development sprint on greta. Will keep you posted on this thread.

Thanks Nick, this is great news. And no critical urgency – we’ve got a hack that does what we need, it’s just embarrassingly inelegant and woefully slow. @vlandau and I would be happy to help provide feedback and test whenever it makes sense. Feel free to tag us here or in the issues pages.

Quick update as I’m working on this now - PR here.

I’m planning to implement both a limited simulate(model, nsim = 100) to simulate from the whole prior model (IFF it is a complete generative model, so no variable()'s), and the interface above for simulating conditional on fixed values (a list or draws) via calculate(). So the following would work:

library(greta)
y <- as_data(y)
sd <- lognormal(1, 2)
mu <- normal(0, 1, dim = n)
distribution(y) <- normal(mu, sd)

# a named list giving draws from the model prior of the objects: mu, sd
calculate(mu, sd, nsim = 100)

# draws from the prior over the observed data y (which must be a greta
# array)
calculate(y, nsim = 100)

# draws from the prior over the observed data y, *conditional on a fixed
# value of mu*
calculate(y, values = list(mu = 3), nsim = 100)

m <- model(mu, sd)
draws <- mcmc(m)

#  draws from the *posterior* over the observed data y for a random
# subset of 100 posterior samples
calculate(y, values = draws, nsim = 100)

# a named list giving draws from the model prior of the objects: mu, sd, y
# (all of the greta arrays in this environment that are associated with
# the  model) this is a convenient wrapper for calculate(mu, sd, y,
# nsim = 100), which will be used internally
simulate(m, nsim = 100)

With this interface, it might be possible for people to be tripped up by simulation when they expect they are calculating something deterministic. E.g. if they were to do:

beta <- normal(0, 1, dim = 3)
eta <- X %*% beta
prob <- plogis(x)
calculate(prob)

not realising that prob is stochastic (depends on values for beta). I’m imagining this could happen when users are beginners, or have a more complicated model, where it’s not obvious that the target is going to be stochastic.

My proposed solution is to set a default of nsim = NULL in calculate(), in which case the above would error, with a message telling the user to provide either an nsim or values argument. I.e. nsim = NULL is equivalent to a ‘don’t simulate’ argument.

Alrighty, these features are now all implemented in the branch feature/simulate, which you can install with: remotes::install_github("greta-dev/greta@feature/simulate"). The example for calculate shows how to do all the new things: prior simulation, prior simulation conditional on some fixed values of parameters, and posterior simulation (simulating new data given posterior samples).

There are a bunch of tests in there, but it took a lot of thinking and hacking to implement (the same model/greta code implies very different DAGs for inference, prior simulation, and posterior simulation), so I’d really appreciate people trying it out and reporting bugs, so I know what else to test for.

Not all distributions can be sampled from at the moment, so the next step will be implementing sampling for more of those (e.g. mixture, joint, LKJ, Wishart), and some tests that the samples are correct.

One thing to bear in mind is that the interface to calculate has changed in a way that isn’t completely backwards-compatible. You can now pass in multiple greta arrays via a dots argument, but need to explicitly name the other arguments, like values.

OK, more tests implemented to make sure the samples are coming from the right distributions. Sampling is now implemented for every distribution with the exception of the hypergeometric distribution, and the truncated f distribution (both of which are tricky to sample from within tensorflow).

This is pretty much ready to be merged into master!

Merged into master now - thanks everyone who helped design the interface, and reported bugs!

The main dev branch isn’t going to be released imminently though, so there’s time for more feedback and bug reports.

Thanks Nick, the new simulate and calculate interfaces work super well, and should greatly simplify code we were using to develop posterior predictive checks and derived quantities. Looked like a giant lift based on the stream of commits – extremely impressive work!

This might be a bit if a niche question, but curious if you’ve got thoughts as to how you might simulate observations conditional on greta arrays created after MCMC. You’ve got an example in the documentation for calculate() that looks like:

# you can use calculate on greta arrays created even after the inference on
# the model - e.g. to plot response curves
petal_length_plot <- seq(min(iris$Petal.Length),
                         max(iris$Petal.Length),
                         length.out = 100)
mu_plot <- alpha + petal_length_plot * beta
mu_plot_draws <- calculate(mu_plot, values = draws)

Without falling back to base R’s rnorm, is there an easy way in greta to make draws for a new y on the basis of the vector of means mu_plot (and of course the other parameters in the model)? The reason we’re interested is that we’d like to create posterior predictive checks for observations withheld from the model.

The reason it is tricky seems to involve how the greta array for the responses (y <- as_data(iris$Petal.Width)) is used. All of the observations that fed into distribution(y) are used in estimation, so there’s no clear way to use greta to simulate new observations for a separate validation or testing partition of iris…? I suspect I’m articulating all of this very poorly, so if I’m not making sense please let me know. I could try to share a more complete example.

Hi Luke,

Great, I’m really glad it’s working for you!

Shooting from the hip here, but is this what you want?:

y_test <- normal(mu_plot, sd)
calculate (y_test, values = draws, nsim = 1000)

If not, please do share an example and I’ll take a closer look!

Hell of a shot! :man_facepalming: Feel bad I bothered you with this. Those two lines are exactly what I needed!

1 Like

Ha! Great, glad that helped.

Just started testing calculate() for prior and posterior predictive checks. Its AMAZING!! And fast. Thanks for this feature.

Great, I’m glad it’s helpful. I’m pretty pleased with it!