Apparent memory leak in greta

My code runs MCMC sampling multiple times, once for each of a few datasets. The issue I am having is that R ends up using GBs of memory and brings the system to a halt.
I am using the development version of greta.

I wrote a similar, smaller example to show what happens:

library(greta)
library(titanic)
library(pryr)

# function to run mcmc and get samples
drawsample = function(X, y, w_mean, w_sigma){
    # Prior
    w = multivariate_normal(t(w_mean), w_sigma, n_realisations = 1)
    
    # define distribution over output
    linear = X %*% t(w)
    p = ilogit(linear)
    distribution(y) = bernoulli(p)
    
    # define model
    m = model(w)
    
    # draw samples
    draws <- mcmc(m, n_samples = 1000, chains = 1)
    return(draws)
}

cols = c('Age', 'Parch', 'Fare', 'Survived')
data = titanic_train[cols]
data = data[complete.cases(data), ]

X = data[c('Age', 'Parch', 'Fare')]
y = data$Survived

num_features = 3
w_mean = zeros(num_features)
w_sigma = diag(num_features)

for(iter in 1:5){
    print(mem_change(drawsample(X,y, w_mean, w_sigma)))
    print(mem_used())
}

The output I get looks like this:

warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
169 kB
191 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
816 kB
192 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
588 kB
192 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
775 kB
193 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
1.18 MB
193 MB

What could be causing this?

EDIT: Upon further investigation, it seems to me that all the greta variables I define are leaking memory, and not just the mcmc function.

After going through this post, I think one possible cause of the memory leak is incorrect handling of tensorflow-probability. Maybe greta needs to release some resource when using tfp, but doesn’t?

Thanks for this! I think I know what’s happening here.

In your code, the object w_mean is a greta array, whereas all the other objects passed into drawsample() are R objects. Because w_mean is a greta array, it keeps track of the other greta arrays to which it is connected (the unnamed ones created in drawsample()). So rather than creating a new DAG of the model each time, greta thinks you want to add more greta arrays to the previous one. By the fifth iteration, you are simultaneously fitting all 5 models, which takes more memory.

If I’m right, then you should see all the conjoined models by doing plot(model(w_mean)), and if you instead define w_mean as w_mean <- rep(0, num_features) (an R object rather than a greta array), you shouldn’t see the memory increasing.
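To make that check explicit, here is a minimal sketch of a guard you could call at the top of drawsample(). It assumes greta arrays carry the class "greta_array"; assert_plain_r is a hypothetical helper, not part of greta.

```r
# Hypothetical guard (not a greta function): stop if any argument is a greta
# array. Assumes greta arrays have class "greta_array".
assert_plain_r <- function(...) {
  args <- list(...)
  is_greta <- vapply(args, inherits, logical(1), what = "greta_array")
  if (any(is_greta)) {
    stop("greta arrays passed in from outside: ",
         paste(names(args)[is_greta], collapse = ", "))
  }
  invisible(TRUE)
}

num_features <- 3
w_mean  <- rep(0, num_features)        # plain numeric vector
w_sigma <- base::diag(num_features)    # plain identity matrix

# Passes silently for plain R objects; errors if either were a greta array
assert_plain_r(w_mean = w_mean, w_sigma = w_sigma)
```

With w_mean built by zeros() instead, the guard should stop with an error naming w_mean.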

Would you mind checking to see whether that’s what’s going on?

I changed w_mean to use rep and similarly changed w_sigma to use base::diag. This does bring down the mem_change value, but I am not sure it completely solves the problem. Here’s the output I get for 50 iterations after the changes:

warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
345 kB
163 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
319 kB
163 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
314 kB
163 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
356 kB
163 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
305 kB
164 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
364 kB
164 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
358 kB
164 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
320 kB
165 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
380 kB
165 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
335 kB
165 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
359 kB
165 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
394 kB
166 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
337 kB
166 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
400 kB
166 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
380 kB
167 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
362 kB
167 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
414 kB
167 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
366 kB
167 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
425 kB
168 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
432 kB
168 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
374 kB
168 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
445 kB
169 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
400 kB
169 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
405 kB
169 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
455 kB
169 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
395 kB
170 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
463 kB
170 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
448 kB
170 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
414 kB
171 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
476 kB
171 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
427 kB
171 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
452 kB
171 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
486 kB
172 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
431 kB
172 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
493 kB
172 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
465 kB
173 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
453 kB
173 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
513 kB
173 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
455 kB
173 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
521 kB
174 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
523 kB
174 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
465 kB
174 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
534 kB
175 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
494 kB
175 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
497 kB
175 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
541 kB
175 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
488 kB
176 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
551 kB
176 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
537 kB
176 MB
warmup ====================================== 1000/1000 | eta:  0s          
  sampling ====================================== 1000/1000 | eta:  0s          
506 kB
177 MB

I couldn’t get the plots to show for some reason. Even when I call the plot function, no plot appears. None of my packages were available either, so I guess a system update may have messed up my setup.

Making the same changes in other similar code does not show any improvement.

OK, yeah that is strange. I can replicate that on my machine too. Would you mind posting this on the GitHub issue tracker?

It’s certainly possible that it’s a Python/TensorFlow thing: objects being created in Python and not being deleted (or garbage collected).
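One rough way to probe this: pryr::mem_used() only counts memory in R’s own heap, so anything held on the Python side would not show up in those numbers. The sketch below compares the R heap with the resident memory of the whole process. It assumes the ps package is installed; mem_snapshot is a hypothetical helper name.

```r
# Hypothetical helper: R heap usage vs. resident memory of the whole process
mem_snapshot <- function() {
  r_heap_mb <- sum(gc()[, 2])                        # column 2 of gc(): Mb in use
  rss_mb <- ps::ps_memory_info()[["rss"]] / 1024^2   # bytes -> MB
  c(r_heap_mb = r_heap_mb, process_rss_mb = rss_mb)
}

before <- mem_snapshot()
# ... call drawsample(X, y, w_mean, w_sigma) here ...
after <- mem_snapshot()
after - before
```

If process_rss_mb keeps growing across iterations while r_heap_mb stays roughly flat, that would point to memory held outside R.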

As a workaround for your analysis, would it be possible to run each model in a different R session? E.g. with processx, or some higher-level interface. I would expect that to solve this issue.
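A minimal sketch of that workaround, assuming the callr package (a higher-level interface built on processx) is installed. run_isolated and fit_once are hypothetical names, and the model code is copied from the example above.

```r
# Hypothetical helper: run fit_fun(...) in a fresh R process that exits when
# the call returns, so memory held by that process goes back to the OS.
run_isolated <- function(fit_fun, ...) {
  callr::r(fit_fun, args = list(...))
}

# Model code from the example above; greta is loaded inside the child process
fit_once <- function(X, y, w_mean, w_sigma) {
  library(greta)
  w <- multivariate_normal(t(w_mean), w_sigma, n_realisations = 1)
  p <- ilogit(X %*% t(w))
  distribution(y) <- bernoulli(p)
  draws <- mcmc(model(w), n_samples = 1000, chains = 1)
  as.matrix(draws)   # send back a plain matrix of samples
}

# Usage (w_mean and w_sigma should be plain R objects, not greta arrays):
# samples <- run_isolated(fit_once, X = X, y = y,
#                         w_mean = w_mean, w_sigma = w_sigma)
```

Each call pays the cost of starting R and loading greta again, so this trades speed for a clean memory footprint per model.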

P.S. Not sure why the plot didn’t work for you; here’s what it looked like for me:

I created an issue on GitHub. I’ll try processx in a week or two. Thanks for the help!