What does Greta run in parallel, and how?

What does it mean in practice that Greta is “parallel” / “uses TensorFlow”?

Which parts of HMC can be run in parallel with Greta? Why is TensorFlow such a big deal?

I might be missing something.

For example, in the case of an HMM, what can Greta run in parallel?

Is there some special “keyword” that indicates that a given loop should be distributed?

Say I have an HMM and 50,000 × 100 observations (100 is the length of the HMM chain, 50,000 is the number of sequences I have observed). Would it be possible in this case to parallelise over the 50,000 observed sequences?

This is not a very obvious issue.

Or is it that TensorFlow simply ‘just’ parallelises all vector calculations? Say vector–matrix multiplication? Matrix–matrix multiplication? That’s nice if there are a lot of parameters, but what about a lot of observations (assumed to be independent)?

In my understanding, yes, TensorFlow Probability parallelises all vector calculations by optimising hardware processes (CPU / GPU). It is a matter of improving efficiency / runtime. MCMC chains are also executed in parallel, but I suppose that is set up on top of the TFP backend. A “lot of observations” will presumably scale predominantly with memory. I do know that cross-validation procedures can be parallelised, whereby observations are split, but in general they are kept together. Try benchmarking TF on a CPU compared to a GPU using a computationally intensive problem. I might try it myself some time soon; I will come back here if I do.

Exactly, tensorflow distributes single operations; so something like a (large) matrix multiplication can be spread out over many CPUs, or onto a GPU. That’s different to how most MCMC software parallelises, which is to run multiple chains and put each chain on a separate CPU. Because greta uses tensorflow, we can parallelise across many more CPUs, and so make more use of HPC hardware.
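For concreteness, here is a minimal sketch (my own, with made-up sizes) of a greta model whose per-iteration cost is dominated by a single large matrix multiplication, which is exactly the kind of operation tensorflow can spread across cores or onto a GPU:

```r
# minimal sketch: a linear regression whose log density is dominated by one
# large matrix multiplication (sizes are arbitrary, chosen just to be big
# enough for the parallelism to matter)
library(greta)

n_obs <- 10000
n_cov <- 200
X <- matrix(rnorm(n_obs * n_cov), n_obs, n_cov)
y <- rnorm(n_obs)

beta  <- normal(0, 1, dim = n_cov)
sigma <- lognormal(0, 1)

mu <- X %*% beta                    # one big matmul per density evaluation
distribution(y) <- normal(mu, sigma)

m <- model(beta, sigma)
draws <- mcmc(m, n_samples = 1000)  # tensorflow distributes the matmul work
```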

Prior to version 0.3, greta did exactly what you describe @franciscolima - it ran each chain in parallel as a separate process, and each of those processes could use multiple CPUs. From 0.3.0 we actually do inference on all the chains simultaneously, by adding an extra dimension to all objects used to compute the model, with each element of that dimension representing a different chain. So a greta matrix, like `a <- variable(dim = c(5, 3))`, is internally converted to a 3D array. E.g. with 4 chains, `a` would be represented as a 4 x 5 x 3 array. Operations on that object are then broadcast along the first dimension. All the MCMC chains are then run in step, so iteration n happens for all chains at exactly the same time.
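To make that concrete, here is a minimal sketch (using a standard normal prior rather than a bare variable(), just so the model has a density to sample; the inspection at the end assumes the coda-style output greta returns):

```r
library(greta)

# a 5 x 3 parameter matrix; internally, with 4 chains, greta/TFP batch this
# as a 4 x 5 x 3 tensor and run iteration n of all chains at the same time
a <- normal(0, 1, dim = c(5, 3))

m <- model(a)
draws <- mcmc(m, n_samples = 1000, chains = 4)

# from R you still just see 4 chains over the 15 elements of a
length(draws)    # 4 chains
dim(draws[[1]])  # 1000 samples x 15 parameters
```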

This is a bit hard to get your head around, but it’s one of the really useful features of tensorflow and is generally more efficient than the old way. E.g. it means we can tune samplers using the information from all 4 chains at once. The fact that we can do this is due to the magical powers and hard work of the tensorflow and tensorflow-probability teams, the latter of whom coded up the simultaneous MCMC sampler code.

In the HMM example, provided the computationally demanding operations in the HMM code are vectorised across the 50,000 observed sequences, tensorflow can distribute those computations.
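As a rough illustration of what “vectorised across sequences” might look like (this is just my sketch, not an official greta HMM recipe; hooking the resulting log-likelihood into a model is a separate step that isn’t shown), each time step of the forward recursion below is a handful of large N-row tensor operations rather than a loop over the 50,000 sequences:

```r
library(greta)

# N sequences, T time steps, K hidden states, V observation symbols
N <- 50000; T_len <- 100; K <- 3; V <- 4
y <- matrix(sample(V, N * T_len, replace = TRUE), N, T_len)  # fake observations

# row-wise softmax to build simplex-valued parameter matrices
softmax_rows <- function(x) {
  ex <- exp(x)
  ex / (rowSums(ex) %*% ones(1, ncol(x)))
}
init  <- softmax_rows(normal(0, 1, dim = c(1, K)))  # 1 x K initial state probs
trans <- softmax_rows(normal(0, 1, dim = c(K, K)))  # K x K transition matrix
emit  <- softmax_rows(normal(0, 1, dim = c(K, V)))  # K x V emission matrix

# forward recursion, vectorised over all N sequences, with rescaling
alpha   <- ones(N, 1) %*% init                      # N x K
log_lik <- zeros(N, 1)
for (t in seq_len(T_len)) {
  Yt <- outer(y[, t], seq_len(V), "==") + 0         # N x V one-hot data matrix
  if (t > 1) alpha <- alpha %*% trans               # one big matmul per step
  alpha <- alpha * (Yt %*% t(emit))                 # emission probabilities
  norm  <- rowSums(alpha)                           # N x 1 scaling factors
  log_lik <- log_lik + log(norm)
  alpha <- alpha / (norm %*% ones(1, K))            # rescale to avoid underflow
}
# log_lik is now an N x 1 greta array of per-sequence log-likelihoods
```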

There are some nuances to when tensorflow can distribute things efficiently, and when HPC hardware like GPUs are more efficient. You can find some information in the tensorflow github issues and stack overflow, but it’s generally best to benchmark things to be sure!


I’ve noticed that as I increase the number of CPUs (e.g., scaling up the size of a VM), the percent utilization of each core goes down. For example, for a simple logistic regression problem with about 55k observations, I was getting ~70% utilization of each core on a 4-core machine, but when bumping up to a 16-core machine, each core was only utilized at ~33%. This means that compute time doesn’t scale linearly with the number of processors (and in fact there seem to be strongly diminishing returns).
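For reference, a minimal sketch of that kind of model in greta (simulated data; the number of covariates and the priors here are placeholders, not my actual problem):

```r
library(greta)

# ~55k observations, a handful of covariates (placeholder sizes)
n <- 55000
k <- 10
X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))
y <- rbinom(n, 1, 0.5)

beta <- normal(0, 10, dim = k)
p <- ilogit(X %*% beta)
distribution(y) <- bernoulli(p)

m <- model(beta)
# timed on VMs with different core counts; per-core utilization watched in htop
system.time(draws <- mcmc(m, chains = 4))
```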

I’m guessing this is something on the tf-probability end and not something that can be “fixed” on the greta side? I’m just curious if you know what is going on here, and why? Is it actually cheaper to just use the cores inefficiently than to deal with the overhead of a potentially more complex scheduler that would ensure full CPU usage?

Figured this is related to this thread so I’d tack it on here instead of opening a new one. Thanks for any insights!

Sorry to revive this. I’m interested in Greta’s ability to parallelize models across many cores but I have not been able to find any clear benchmarks on this. Are there recent comparisons of the same models running on Stan with 1 core per chain vs Greta with 200 (for example) cores total? Perhaps even using ESS as a metric? Thanks!


Hi @mgn,

We haven’t benchmarked this recently - it would be particularly interesting given the recent changes in TF2. However, greta vs Stan isn’t always a clear winner: greta is very good at certain kinds of modelling tasks, and might be slower than Stan in others. We can look into this in the future (it would be very interesting to compare TF2 to Stan), but at the moment we need to get greta working with TF2.

@nick might have more to say on this!