Exactly, tensorflow distributes single operations, so something like a (large) matrix multiplication can be spread out over many CPUs, or run on a GPU. That's different from how most MCMC software parallelises, which is to run multiple chains and put each chain on a separate CPU. Because greta uses tensorflow, we can parallelise across many more CPUs, and so make better use of HPC hardware.
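For instance, a large matrix multiplication written as a greta operation is handed to tensorflow as a single op, which tensorflow is then free to split across cores or a GPU. A minimal sketch (the dimensions are just illustrative):

```r
library(greta)

# one large matrix multiplication, expressed as a single greta operation;
# tensorflow can spread this one op across multiple CPU cores or a GPU
x <- variable(dim = c(2000, 2000))
y <- x %*% x
```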
Prior to version 0.3, greta did exactly what you describe @franciscolima - it ran each chain in parallel as a separate process, and each of those processes could use multiple CPUs. From 0.3.0 we actually do inference on all the chains simultaneously, by adding an extra dimension to all objects used to compute the model, with each element of that dimension representing a different chain. So a greta matrix, like `a <- variable(dim = c(5, 3))`, is internally converted to a 3D array. E.g. with 4 chains, `a` would be represented as a 4 x 5 x 3 array, and operations on that object are broadcast along the first dimension. All the MCMC chains are then run in step, so iteration n happens for all chains at exactly the same time.
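To illustrate the batching idea, here's a plain base-R sketch (this is just the concept, not greta's actual internals):

```r
# a 5 x 3 matrix replicated along a leading "chain" dimension of size 4
n_chains  <- 4
a_single  <- matrix(rnorm(15), nrow = 5, ncol = 3)
a_batched <- array(rep(a_single, each = n_chains),
                   dim = c(n_chains, 5, 3))
dim(a_batched)  # 4 5 3

# an elementwise operation on the batched array acts on all chains at
# once; each chain's slice matches the single-chain result
b_batched <- exp(a_batched)
all(b_batched[2, , ] == exp(a_single))  # TRUE
```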
This is a bit hard to get your head around, but it's one of the really useful features of tensorflow, and it's generally more efficient than the old approach. E.g. it means we can tune samplers using information from all 4 chains at once. The fact that we can do this is due to the magical powers and hard work of the tensorflow and tensorflow-probability teams; the latter wrote the simultaneous MCMC sampler code.
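From the user's side nothing changes: you just ask `mcmc()` for several chains, and they are all run simultaneously in the one tensorflow graph. A minimal example (the model is just a toy):

```r
library(greta)

# toy model: the mean of some observed data
y  <- as_data(rnorm(10))
mu <- normal(0, 1)
distribution(y) <- normal(mu, 1)
m <- model(mu)

# all 4 chains are sampled simultaneously in one tensorflow graph,
# so the sampler can adapt using information from every chain
draws <- mcmc(m, n_samples = 1000, chains = 4)
```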
In the HMM example, provided the computationally demanding operations in the HMM code are vectorised across the 50,000 observed sequences, tensorflow can distribute those computations.
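For instance, one vectorised step might update all sequences with a single matrix operation rather than a loop over sequences. A hypothetical sketch, not the actual code from that HMM example (the state count and transition matrix are made up):

```r
# hypothetical sketch of a vectorised HMM forward step; names and
# numbers are illustrative, not from the original example
n_seq    <- 50000
n_states <- 3

# forward probabilities for every sequence, held in one matrix
alpha <- matrix(1 / n_states, nrow = n_seq, ncol = n_states)

# a made-up transition matrix
transition <- matrix(c(0.9, 0.1, 0.0,
                       0.1, 0.8, 0.1,
                       0.0, 0.1, 0.9),
                     nrow = n_states, byrow = TRUE)

# a single matrix multiplication propagates all 50,000 sequences at
# once; expressed as one op like this, tensorflow can distribute it
alpha <- alpha %*% transition
```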
There are some nuances to when tensorflow can distribute things efficiently, and when HPC hardware like GPUs is more efficient. You can find some information in the tensorflow GitHub issues and on Stack Overflow, but it's generally best to benchmark things to be sure!
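A crude way to benchmark is just to time the same sampling run under different setups (this assumes the toy model `m` from above):

```r
# crude timing comparison: rerun the same sampling and record wall time
timing <- system.time(
  draws <- mcmc(m, n_samples = 1000, chains = 4)
)
timing["elapsed"]  # seconds of wall-clock time
```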