What does greta run in parallel, and how?


#1

What does it mean in practice that greta is “parallel” / “uses TensorFlow”?

Which parts of HMC can be run in parallel with greta? Why is TensorFlow such a big deal?

I might be missing something.

For example, in the case of an HMM, what can greta run in parallel?

Is there some special “keyword” that says a given loop should be distributed?

Say I have an HMM and 50,000 x 100 observations (100 is the length of the HMM chain; 50,000 is the number of sequences I have observed). Would it be possible in this case to parallelise over the 50,000 sequences?

This is not a very obvious issue.

Or is it that TensorFlow ‘just’ parallelises all vector calculations? Say vector-matrix multiplication? Matrix-matrix multiplication? That’s nice if there are a lot of parameters, but what about a lot of observations (assumed to be independent)?


#2

In my understanding, yes: TensorFlow Probability parallelises all vector calculations by optimising hardware use (CPU / GPU); it is a matter of improving efficiency / runtime. MCMC chains are also executed in parallel, but I suppose that is set up on top of the TFP backend. A “lot of observations” will presumably scale predominantly with memory. I do know that cross-validation procedures can be parallelised, whereby observations are split, but in general observations are kept together. Try benchmarking TensorFlow on CPU versus GPU with a computationally intensive problem (a sketch of what I mean is below). I might try it myself some time soon, and I will come back here if I do.
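
In case it helps, here is a minimal sketch of the kind of benchmark I mean: a greta regression model whose likelihood is dominated by a large matrix product. The sizes and priors here are arbitrary, purely for illustration:

```r
library(greta)

# arbitrary sizes, chosen only to make the matrix product expensive
n <- 10000
p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

beta  <- normal(0, 1, dim = p)
sigma <- lognormal(0, 1)
mu    <- X %*% beta               # the dominant operation in every HMC step
distribution(y) <- normal(mu, sigma)

m <- model(beta, sigma)

# time sampling on CPU; rerun with a GPU build of TensorFlow to compare
system.time(
  draws <- mcmc(m, warmup = 100, n_samples = 100, chains = 4)
)
```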


#3

Exactly: TensorFlow distributes single operations, so something like a (large) matrix multiplication can be spread out over many CPUs, or run on a GPU. That’s different from how most MCMC software parallelises, which is to run multiple chains and put each chain on a separate CPU. Because greta uses TensorFlow, we can parallelise across many more CPUs, and so make more use of HPC hardware.
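
To make that concrete, here is a tiny sketch at the raw TensorFlow level, using the R tensorflow package with an eager (TF2-style) build; the sizes are arbitrary. Watching a process monitor while it runs should show the single matmul op spread across all available cores, or running on the GPU if one is configured:

```r
library(tensorflow)

# two large random matrices; a single matmul op that
# TensorFlow distributes internally across cores / the GPU
a <- tf$random$normal(c(4000L, 4000L))
b <- tf$random$normal(c(4000L, 4000L))
result <- tf$matmul(a, b)
```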

Prior to version 0.3, greta did exactly what you describe @franciscolima: it ran each chain as a separate process, and each of those processes could use multiple CPUs. From 0.3.0 we actually do inference on all the chains simultaneously, by adding an extra dimension to all objects used to compute the model, with each element of that dimension representing a different chain. So a greta matrix, like a <- variable(dim = c(5, 3)), is internally converted to a 3D array; e.g. with 4 chains, a would be represented as a 4 x 5 x 3 array. Operations on that object are then broadcast along the first dimension. All the MCMC chains are run in step, so iteration n happens for all chains at exactly the same time.
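
In greta code that batching is invisible; you just ask for chains (here using a normal prior rather than a bare variable(), so the model has a density to sample):

```r
library(greta)

# a 5 x 3 greta array, as seen by the user
a <- normal(0, 1, dim = c(5, 3))
m <- model(a)

# internally a becomes a 4 x 5 x 3 tensor, one slice per chain,
# and each sampler iteration advances all 4 chains at once
draws <- mcmc(m, chains = 4)
```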

This is a bit hard to get your head around, but it’s one of the really useful features of TensorFlow, and it is generally more efficient than the old way. For example, it means we can tune samplers using information from all 4 chains at once. The fact that we can do this is down to the magical powers and hard work of the TensorFlow and TensorFlow Probability teams, the latter of whom wrote the simultaneous MCMC sampler code.

In the HMM example, provided the computationally demanding operations in the HMM code are vectorised across the 50,000 observed sequences, TensorFlow can distribute those computations.
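
As a rough, untested sketch of what that vectorisation might look like (all names, sizes, and priors here are made up for illustration), a scaled forward algorithm can be written so that each time step is a single matrix operation over all N sequences at once, which is exactly the shape of computation TensorFlow can spread across hardware:

```r
library(greta)

# hypothetical sizes: N sequences of length T_len, K hidden states, S symbols
N <- 50000; T_len <- 100; K <- 3; S <- 4

# fake observed symbols, just so the sketch runs
obs <- matrix(sample.int(S, N * T_len, replace = TRUE), N, T_len)

# parameters as greta arrays, normalised to probabilities
init  <- uniform(0, 1, dim = K)
init  <- init / sum(init)
trans <- uniform(0, 1, dim = c(K, K))
trans <- sweep(trans, 1, rowSums(trans), "/")
emit  <- uniform(0, 1, dim = c(K, S))
emit  <- sweep(emit, 1, rowSums(emit), "/")

# forward probabilities for all N sequences together: an N x K matrix
alpha   <- sweep(t(emit[, obs[, 1]]), 2, init, "*")
scales  <- rowSums(alpha)
log_lik <- sum(log(scales))
alpha   <- sweep(alpha, 1, scales, "/")

for (t in 2:T_len) {
  # one N x K by K x K matrix product covers all sequences at this step
  alpha   <- (alpha %*% trans) * t(emit[, obs[, t]])
  scales  <- rowSums(alpha)              # rescale to avoid underflow
  log_lik <- log_lik + sum(log(scales))
  alpha   <- sweep(alpha, 1, scales, "/")
}
```

The big matrix product and the elementwise rescaling are single large operations over all 50,000 sequences, rather than an R loop over sequences; hooking log_lik into a greta model (e.g. via a custom distribution) is a separate question I’ll leave aside here.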

There are some nuances to when TensorFlow can distribute things efficiently, and when HPC hardware like a GPU is more efficient. You can find some information in the TensorFlow GitHub issues and on Stack Overflow, but it’s generally best to benchmark things to be sure!