[Feedback requested!] proposed simulate() interface

nick · August 18, 2019, 12:21am

A feature we’ve been wanting for a long time is the ability to draw IID samples from a model prior (and ideally also simulate IID samples from distributions conditioned on MCMC samples of a model posterior). This would make it much easier to do prior predictive checks (do the priors lead to plausible parameters and datasets?) and posterior predictive checks (if we simulate data from the posteriors, does it look like the data we observed?) without having to write a bunch of custom code. There’s some work going on now into the internal changes that are needed to make this happen (here and here).

At this stage, it would be really helpful to get some feedback on what the user interface to this functionality would look like. I have a proposal below which I think covers these use cases and remains both straightforward and consistent with other simulate methods.

What do you think?

Does this cover your use cases?
Is it obvious how to use it?
Is there any other related functionality that should be included in this or related functions?

Here’s some example documentation for the proposed interface; a simulate S3 method (from the generic stats::simulate()) for objects of class ‘greta_model’:

simulate.greta_array (greta)

Simulate Responses From greta_model Object

Description

Simulate responses from a “greta_array” model object, either from the prior or conditioned on the posterior or some fixed values.

Usage

## S3 method for class 'greta_model'
simulate(object, nsim = 1, seed = NULL, targets = list(), values = list(), precision = c("double", "single"), ...)

Arguments

object: a greta_model object
nsim: positive integer scalar - the number of responses to simulate
seed: an optional seed to be used in set.seed immediately before the simulation so as to generate a reproducible sample
targets: an optional list of greta arrays for which to generate samples. If not specified, the greta arrays named in the call to model will be used.
values: a named list giving fixed values of greta arrays in the model to hold constant, or an mcmc.list`` object returned bymcmc`.
precision: the floating point precision to use when calculating values
...: optional additional arguments, none are used at present

Details

If the values argument is not provided, simulate() draws nsim independent samples from the model prior for all of the target greta arrays - either those named in the call to model() or those in targets, if provided.

If values are provided, samples of the target values will be independent, conditional on these provided values. If values is a named list of values for some greta arrays, each of the nsim independent samples of the target greta arrays will be drawn conditional on those fixed values. If an mcmc.list object (created by fitting a greta model with mcmc) is passed to values, then nsim samples of the target greta arrays will be returned, each of them conditioned on a different, randomly selected posterior sample from values.

Value

A named list of vectors, matrices or arrays containing independent samples of the named greta arrays. The number of samples will be prepended as the first dimension of the greta array, so that a vector fo samples is returned for each scalar greta array, and a matrix is returned for each vector greta array etc.

Examples

# build a greta model
n <- 10
y <- rnorm(n)
y <- as_data(y)

library(greta)
sd <- lognormal(1, 2)
mu <- normal(0, 1, dim = n)
distribution(y) <- normal(mu, sd)
m <- model(mu, sd)

# prior simulation:

# simulate one random draw of mu and sd from the model prior:
sims <- simulate(m)

# 100 simulations of mu and sd)
sims <- simulate(m, nsim = 100)

# 100 simulations of y and mu (greta arrays names in targets)
# NB: y must be a greta array here
sims <- simulate(m, nsim = 100, targets = list(y, mu))

# conditional prior simulation:

# 100 simulations of y and mu, with sd fixed at 3
sims <- simulate(m, nsim = 100, targets = list(y, mu), values = list(sd = 3))

# posterior simulation:

# generate posterior samples from the model as usual
draws <- mcmc(m)

# draw 100 simulations of y, conditioned on 100 random posterior
# samples of all the model parameters
sims <- simulate(m, nsim = 100, targets = list(y), values = draws)

Voltemand · August 18, 2019, 1:09am

I like it! The only thing I think could be a little confusing is the two types of arguments to values. Maybe it could be two seperate arguments instead: values and draws?

nick · August 18, 2019, 1:33am

Yeah, I can see that. It’s difficult to know the ideal design when two inputs are semantically similar and mutually exclusive (you can’t specify both at once), but are different data types. Neither this, nor having two separate arguments, seem ideal.

Other options:

Define a new values object type, have values take that input type (so that the input type is always the same), and provide functions to coerce both the named list and mcmc.list into values objects.
Split this into two separate simulate functions, one for each input type.

Both of these bloat the API and require more understanding from the user though, so the variable input types might be the lesser of two evils.

This values argument is the same as the current interface to calculate(). I think it’s important to have this interface be consistent with that one, since they both achieve similar things.

Another option would be to have the conditional simulation happen in that function. Though I think it makes more sense here.

Voltemand · August 18, 2019, 2:02am

Good points! Given all of that I think having the two argument types is definitely the best option. It’s only really a minor complaint anyway.

howard · August 19, 2019, 5:52pm

Looks good. Can values allow for a change in the training “y” values as well?

cboettig · August 19, 2019, 5:59pm

Thanks Nick, this looks really nice to me. I think the mechanism you’ve settled on to handle both prior, conditional prior, and posterior predictions makes sense to me. The examples really help. Definitely looking forward to this.

I wonder if you’re considering exposing any of the lower-level interface for the IID draws? e.g. a sample() method? Or would simulate() be the only user-facing function introduced here?

nick · August 19, 2019, 9:42pm

Yes, you can simulate y with this. They don’t get placed back in the y object though. Is that what you meant? The ability to build a model, change the values in one of the greta arrays and then do inference with the new values?

If so, that should definitely be possible, though I’m not sure what the interface to that would be. A different function to this one though.

nick · August 19, 2019, 9:51pm

Great! The internal sampling is all done in tensorflow, so the lowest-level functionality wouldn’t be useful to users of greta.

Adding a different version of this interface that acts on greta arrays might be possible. Like:

x <- normal(0, lognormal(0, 1))
simulate(x)

but this would essentially be the same as:

x <- normal(0, lognormal(0, 1))
simulate(model(x))

Is that the sort of thing you were thinking?

cboettig · August 19, 2019, 10:10pm

Thanks Nick! Yeah, I was thinking along those lines, but I think you’re right that it would be cleaner to restrict the simulate method to the model object after all like you show above.

mdscheuerell · August 20, 2019, 7:45pm

I think this is a great addition to greta’s functionality. Nice work, Nick.

I agree with @Voltemand’s comment about the options for the values arg, but it’s not a big deal.

[minor point] I’d recommend moving the call to library(greta) in the Examples section up to the top because as_data() depends on it.

ajf · August 23, 2019, 7:22pm

This will be very useful! My only feedback would be to consider the defaults for targets for the predictive checks. Typically, I know what to expect from the distributions on mu and sigma. So, simulating from them for my prior predictive is not often that useful and simulating from them from the posterior is meaningless, as you already simulated them. By default, I often want to know the effect of my prior choices on the outcome scale or my posterior predicitions on the outcome scale e.g. y.

So possibly, make all variables with either a likelihood specification (e.g. distribution(y) <- normal(mu, sd)) or a calculated node (e.g. mu = a + b * x) be the default targets.

Great stuff!! Thanks for making an awesome package more awesome.

nick · August 27, 2019, 4:47am

Thanks! Yeah, that would probably be the most common use case. Though I worry that it would be non-intuitive given other functions operating on models (mcmc, opt, etc.) always return summaries of the greta arrays named in the model call. Not sure which is best.

I’ve been thinking again about the marginalisation interface in the feature/marginalise branch, and also doing predictive simulations (when you get new data to predict to after model fitting). In both of those use cases, you might want to create a new variable greta array after creating the model and doing MCMC, and then simulate their values given the MCMC samples (like you can already do for deterministic greta arrays with calculate()). You can’t do that with the current interface, since you have to pass in the model object, which wouldn’t have the new greta arrays.

In light of that, maybe having greta arrays be the simulation targets would be better, like in @cboettig’s post. so then the interface would look like this (no model passed in):

simulate(targets, nsim = 1, seed = NULL, values = list(), precision = c("double", "single"), ...)

and you’d use it like this:

# build a greta model
n <- 10
y <- rnorm(n)
y <- as_data(y)

library(greta)
sd <- lognormal(1, 2)
mu <- normal(0, 1, dim = n)
distribution(y) <- normal(mu, sd)
m <- model(mu, sd)

# prior simulation:

# simulate one random draw of mu and sd from the model prior:
sims <- simulate(list(mu, sd))

# 100 simulations of mu and sd)
sims <- simulate(list(mu, sd), nsim = 100)

# 100 simulations of y and mu
# NB: y must be a greta array here
sims <- simulate(list(y, mu), nsim = 100)

# conditional prior simulation:

# 100 simulations of y and mu, with sd fixed at 3
sims <- simulate(list(y, mu), nsim = 100, values = list(sd = 3))

# posterior simulation:

# generate posterior samples from the model as usual
draws <- mcmc(m)

# draw 100 simulations of y, conditioned on 100 random posterior
# samples of all the model parameters
sims <- simulate(y, nsim = 100, values = draws)

This way you would have to explicitly pass in the things you want to simulate. It wouldn’t be able to guess what you want with defaults, but at least you wouldn’t be confused!

A downside of this is that it doesn’t match the pattern for the simulate S3 class, where the first argument is a (fitted) model object.

So maybe it should be renamed to something else (sample is already taken) or maybe this functionality should be rolled into the same function as calculate (enabling a list of targets in that function)?

Goddamnit, sane API design is much harder than writing code!

cboettig · August 27, 2019, 3:28pm

Ooh… I do like the proposal of

sample(targets, nsim = 1, seed = NULL, values = list(), precision = c("double", "single"), ...)

I agree that this is less confusing than the original proposal passing the model object.

nick · August 27, 2019, 10:23pm

Would you rather have a separate random_sample() function with the interface you propose, or have calculate() do sampling when calculating stochastic variables?

In the latter option calculate() would accept a single target or a list of targets, and have seed and nsim arguments (defaulting to 1 for a list, or the number of draws). This would be backward compatible with the current interface. It would only error about values not being provided if they were variable greta arrays without distributions, otherwise it would sample missing values from their priors.

There’s definitely some overlap in the use of these two functions, and most arguments are identical, so might be simpler to combine them.

howard · September 8, 2019, 11:51pm

What would be very useful is to be able to do a posterior predictive distribution along of the line of the “generated quantities” section in a Stan program (or perhaps this should be done via “calculate”?), e.g.

# ... see prior model
distribution(y) <- normal(mu, sd)
draws <- mcmc(m)

y_posterior_predictive = normal(mu,sd)
sims <- simulate(m, nsim = 100, targets = list(y_posterior_predictive), values = draws)

nick · September 9, 2019, 1:44am

Yes exactly. If we decide to wrap this into calculate, then doing:

sims <- calculate(y, draws)

would achieve this.

cboettig · October 25, 2019, 9:37pm

Hi Nick,

Looping back to this again. Is feature/simulate or feature/marginalise the right branch to test this out on?

Was just exploring how this would look in the context of gp’s here: http://rpubs.com/cboettig/greta-gp . I wrote down examples manually in base R of what I’m trying to do, (super simple stuff here of drawing from the prior etc), but not clear what the corresponding greta syntax would be?

nick · October 25, 2019, 11:49pm

This is on feature/simulate but the interface decided on here isn’t yet implemented.

Just had a very quick look at your doc. I don’t think you’ll be able to achieve what you want (sampling from the analytical expression for the conditional distribution) with greta.gp::gp, it’s just not designed to work that way. You can evaluate the kernels as covariance matrices though, and write the expression for the conditional density yourself?

lzachmann · November 13, 2019, 7:57pm

Chiming in here only to give this a bit of a bump. Having the ability to simulate observations at each iteration (or – better yet – a subset of MCMC samples using the proposed nsim arg) would be huge for us (esp. as it relates to posterior predictive checks). I’d be more than happy to test implementations of simulate() as soon as the interface is decided on!

nick · November 13, 2019, 9:47pm

Great, thanks! I think rolling this into calculate() is the current plan. I should be able to implement that in January, when I should have some time for a decent development sprint on greta. Will keep you posted on this thread.