Greta Sampling seems very slow

I just recently installed Greta using the instructions in this post: Installing Greta: Robust instructions as of March 31, 2019

I am running a simple Poisson-gamma hierarchical model, and the sampler seems very slow compared to other MCMC samplers such as JAGS and Stan. Is it possible I have something configured incorrectly that’s causing it to be slow?

Here’s my R code for this model:

## load required packages and set basic options
################################################################################

library("tidyverse"); theme_set(theme_minimal())
library("parallel"); options(mc.cores = detectCores())
library("greta")
library("bayesplot")
library("bench")


## generate/specify data
################################################################################

theta <- 5 # poisson theta

set.seed(1)

(y <- rpois(1, theta))
y <- as_data(y)



## specify greta model
################################################################################

theta <- gamma(3,1)
distribution(y) <- poisson(theta)

greta_model <- model(theta)

plot(greta_model)

## configure model settings
################################################################################

n_chains <- 4
n_iter <- 1e4L
n_warmup <- 1e3L


## fit model
################################################################################

  
  
greta_fit <- mcmc(
  model = greta_model, n_samples = n_iter,
  warmup = n_warmup, chains = n_chains
)
  
  
  
## assess fit
################################################################################

summary(greta_fit)

The code looks fine to me.

For small models like this, I would expect JAGS and Stan to be much faster at sampling than greta. greta is designed and optimised for models with large vectorised and linear algebra operations, since it can parallelise each of those operations across multiple CPUs, or on GPUs. But there’s a computational overhead to each operation in greta, so it’ll never be as fast as a compiled C++ for loop - which is how JAGS and Stan work.

That said, it took about 30s to draw 10,000 samples using your greta code on my laptop, which in my experience is less time than Stan usually needs just to compile its model code.

Just as another example of how greta works very differently from JAGS and Stan: if you were to run 400 chains of 100 iterations each (i.e. the same total number of posterior samples), it takes about half as long as 4 chains of 10,000 iterations, even though it’s still doing 1,000 warmup steps on each of those chains.

That’s because greta vectorises everything (even across chains) and splits those larger operations across CPUs, so it can handle them much more efficiently. This makes a real difference in big models, and when you have access to more than a handful of CPU cores.
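For illustration, that many-short-chains run corresponds to a call like the one below (the numbers are the ones from the paragraph above; greta_fit_many is just a placeholder name):

## same total number of posterior samples, drawn as many short chains
## that greta can vectorise across and run simultaneously
greta_fit_many <- mcmc(
  greta_model,
  n_samples = 100,   # iterations retained per chain
  warmup = 1000,     # warmup iterations are still run on every chain
  chains = 400
)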

I’m trying to implement a spatial capture-recapture model in greta, but I think I must be doing something wrong because it is taking a very long time to compile and run, and I haven’t even tried the full model yet. Is there some way to improve the code below?

Simulate a dataset:

## Simulate activity centers for N=100 individuals
N <- 100
xlim <- ylim <- c(0,1)
s <- cbind(runif(N, xlim[1], xlim[2]),
           runif(N, ylim[1], ylim[2]))

## Trap locations
x0 <- seq(0.25, 0.75, length=7)
x <- cbind(rep(x0, each=7), rep(x0, times=7))
J <- nrow(x) ## nTraps

plot(s, xlim=c(0,1), ylim=c(0,1))
points(x, pch=3)

## Parameters of encounter model
lam0 <- 0.3
sigma <- 0.1

K <- 5 ## nOccasions
lambda <- dist <- matrix(NA, N, J)

## Encounter histories
yall <- array(NA, c(N, J))  ## counts for all N individuals at each of the J traps
for(j in 1:J) {
    dist[,j] <- sqrt((s[,1]-x[j,1])^2 +
                     (s[,2]-x[j,2])^2)
    lambda[,j] <- lam0*exp(-dist[,j]^2 / (2*sigma^2))
    yall[,j] <- rpois(N, lambda[,j]*K)
}

## Discard data on individuals not detected
detected <- rowSums(yall) > 0
(n <- sum(detected))
y <- yall[detected, ]

Clean up and load greta (hopefully with 16 cores enabled):

## stash objects that aren't needed to fit the model
stashenv <- new.env()
obj.list <- sapply(ls(), get)
list2env(obj.list, envir=stashenv)
keep.obj <- c("y", "n", "J", "K", "x", "xlim", "ylim", "detected")
rm(list=ls()[!ls() %in% keep.obj])
ls()


library(parallel)
getOption("mc.cores")
options("mc.cores"=16)

library(greta)

Specify model:

lam0 <- uniform(0, 3)
sigma <- uniform(0, 2)
psi <- beta(1,1)

## Model for the captured individuals (will deal with N later)
s.obs <- greta_array(dim=c(n, 2))
for(i in 1:n) {
    ## s.obs[i,1:2] <- joint(uniform(xlim[1], xlim[2]),
    ##                       uniform(ylim[1], ylim[2]))
    s.obs[i,1] <- uniform(xlim[1], xlim[2])
    s.obs[i,2] <- uniform(ylim[1], ylim[2])
}

d2.obs <- zeros(n, J)
lambda.obs <- zeros(n, J)
for(j in 1:J) {
    d2.obs[,j] <- (s.obs[,1]-x[j,1])^2 + (s.obs[,2]-x[j,2])^2
    lambda.obs[,j] <- lam0*exp(-d2.obs[,j]/(2*sigma^2))*K
}

distribution(y) <- poisson(lambda.obs)

Compile and sample:

system.time({
    gm <- model(lam0, sigma)
})
## CRAN version
##    user  system elapsed 
## 601.162 360.196 180.352 
## github version
##     user   system  elapsed 
## 1228.589  720.429  280.871 
system.time({
    mc <- mcmc(gm, n_samples=1000, warmup=200, chains=1, n_cores=16)
}) ## This has been running for >1 hr with no sign of a progress bar yet

Session info:

> detectCores()
[1] 32

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] greta_0.3.1.9010

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        lattice_0.20-38   codetools_0.2-16  listenv_0.8.0    
 [5] prettyunits_1.1.1 future_1.16.0     digest_0.6.23     crayon_1.3.4     
 [9] rappdirs_0.3.1    grid_3.6.1        R6_2.4.1          jsonlite_1.6     
[13] magrittr_1.5      coda_0.19-3       tfruns_1.4        rlang_0.4.4      
[17] progress_1.2.2    whisker_0.4       reticulate_1.14   vctrs_0.2.2      
[21] tools_3.6.1       hms_0.5.3         compiler_3.6.1    base64enc_0.1-3  
[25] pkgconfig_2.0.3   tensorflow_2.0.0  globals_0.12.5   

For some reason, it looks like only 1 core is being used.

Eventually, I will need to add a lot of complexity to this model, so if it is this slow now, I’m afraid it won’t be viable for the full SCR model, which will involve data augmentation and then marginalization of the indicator variable z out of the likelihood… assuming that it still isn’t possible to sample discrete variables in greta.

Any ideas for improvement? Thanks for the great software!

I managed to get this working in a reasonable amount of time after fixing a typo and condensing the greta model a bit. I’ll post a full SCR model later, once I have it ready.

Hi Richard,

Thanks for posting and sorry for the slow reply - was out hiking last week.

The trick with greta is to vectorise everything and banish all for-loops. greta (i.e. tensorflow) can use multiple cores by splitting up large single operations across cores, but it can’t do anything with for-loops, since they break everything up into tiny operations. In fact, the overhead in the way tensorflow does things means that for-loops run on the R side turn out to be incredibly slow. Hence only one core being used, and incredibly slow sampling, in your model. As the sizes of those operations (e.g. the numbers of traps and individuals) increase, greta will be able to use more of your cores.

Here’s a vectorised version of your model and run times on my macbook:

library(greta)

lam0 <- uniform(0, 3)
sigma <- uniform(0, 2)
psi <- beta(1,1)

## Model for the captured individuals (will deal with N later)
# s.obs <- greta_array(dim=c(n, 2))
# for(i in 1:n) {
#   ## s.obs[i,1:2] <- joint(uniform(xlim[1], xlim[2]),
#   ##                       uniform(ylim[1], ylim[2]))
#   s.obs[i,1] <- uniform(xlim[1], xlim[2])
#   s.obs[i,2] <- uniform(ylim[1], ylim[2])
# }

x_coords <- uniform(xlim[1], xlim[2], dim = n)
y_coords <- uniform(ylim[1], ylim[2], dim = n)
s.obs <- cbind(x_coords, y_coords)

# d2.obs <- zeros(n, J)
# lambda.obs <- zeros(n, J)
# for(j in 1:J) {
#   d2.obs[,j] <- (s.obs[,1]-x[j,1])^2 + (s.obs[,2]-x[j,2])^2
#   lambda.obs[,j] <- lam0*exp(-d2.obs[,j]/(2*sigma^2))*K
# }

d2.obs <- rdist(s.obs, x) ^ 2  # pairwise squared distances between activity centres and traps
lambda.obs <- lam0 * exp(-d2.obs / (2 * sigma ^ 2)) * K

distribution(y) <- poisson(lambda.obs)
system.time({
  gm <- model(lam0, sigma)
})
# user  system elapsed 
# 0.405   0.008   0.415 
system.time({
  mc <- mcmc(gm, n_samples=1000, warmup=200, chains=1, n_cores=16)
})
# 16 cores were requested, but only 8 are available.
# warmup ===========================================   200/200 | eta:  0s | 4% bad 
# sampling =========================================== 1000/1000 | eta:  0s          
# user  system elapsed 
# 64.285  11.911  25.055 

As you add more complexity into the model, you’ll probably want to use %*% (matrix multiplication) and sweep() for the mapping from activity centres to traps.
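For instance, here’s a rough sketch of the kind of thing I mean (hypothetical, not part of the model above: the trap covariate matrix X_trap, the number of covariates p, and the coefficient vector beta_trap are all made up for illustration):

p <- 2                                        # hypothetical number of trap-level covariates
X_trap <- matrix(rnorm(J * p), J, p)          # hypothetical trap covariate matrix (J x p)
beta_trap <- normal(0, 1, dim = p)            # effects of those covariates on the log encounter rate

## trap-specific log baseline encounter rate via matrix multiplication (J x 1)
log_lam0_trap <- log(lam0) + X_trap %*% beta_trap

## add each trap's log baseline rate to the matching column of the distance
## term with sweep(), then exponentiate; this would take the place of
## lambda.obs above
log_lambda <- sweep(-d2.obs / (2 * sigma ^ 2), 2, log_lam0_trap, FUN = "+")
lambda.cov <- exp(log_lambda) * K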

I’ve done a little thinking about SCR in greta (with help from some folks who know more about SCR than I do) and am planning a greta extension package that rolls up common components into helpful functions. That’s a way down my to-do list at the moment, but I’m very happy to keep chatting about SCR models on here!

Thanks Nick. I’m moving this conversation over here where I posted code for fitting an SCR model that includes N.