Size and number of leapfrog steps in HMC

hrlai · July 8, 2022, 3:18am

Greetings, fellow greta users who spend lots of time tuning the number of leapfrog steps (the Lmin and Lmax arguments in hmc()) and their stepsize (epsilon).

Across datasets, I have used as few as ~10 and as many as ~100 leapfrog steps (L), and stepsizes (epsilon) ranging from the default 0.1 down to 0.001. Most if not all of the time, I have no idea what I am doing. The guidelines in the helpfiles and this forum, and @nick’s scattered suggestions, have generally served me well (search “tune”, “leapfrog”, “stepsize” etc. in the forum). There are many technical papers on tuning these parameters, but they mostly point out that static HMC (what we have in greta currently) demands fine tuning and then sell NUTS as an easier alternative.

I really like greta so I need to befriend the static HMC sampler… but for someone like me, it’s really hard to imagine what a stepsize actually is. Does it have a unit? For example, if I scale my response and/or explanatory variables, that changes the scale of my coefficients, so do I also need to change the stepsize and the number of steps (since the explored parameter space has a different scale…)? Or is this unnecessary, because diag_sd also rescales the posterior space… ??? In other words, I don’t know what I’m doing because I don’t know what a stepsize of 10 means relative to the scales of my parameters and data… is 10 too large or too small?

Anyway, I’m not asking for a definitive guide here (because there is none…), but just thought I’d gauge how people usually tune their HMC sampler day to day. I usually increase the number of steps in increments of 5, with or without decreasing the stepsize at the same time. Still don’t know what I’m doing.

nick · July 8, 2022, 4:44am

You don’t need to manually tune the epsilon or diag_sd! Those are automatically tuned during the warm-up phase. The only thing to manually tune is the number of leapfrog steps (L).
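As a rough illustration of why epsilon and diag_sd go together, here is a minimal Python sketch (not greta code, and not greta's implementation). It runs a plain leapfrog integrator on a 1-D Normal(0, sigma) target and reports the energy error, which is what drives rejections in HMC. The function name `energy_error` and the way `diag_sd` is applied (the sampler moves in the rescaled coordinate `q / diag_sd`) are assumptions made for the example.

```python
def energy_error(sigma, epsilon, n_steps, diag_sd=1.0):
    """Absolute change in total energy after a leapfrog trajectory.

    Target: 1-D Normal(0, sigma). The sampler works in z = q / diag_sd,
    where diag_sd is a hypothetical per-parameter scale estimate.
    """
    def potential(z):
        q = z * diag_sd
        return 0.5 * (q / sigma) ** 2  # negative log density, up to a constant

    def grad(z):
        q = z * diag_sd
        return diag_sd * q / sigma ** 2  # chain rule: dU/dz = diag_sd * dU/dq

    z, p = sigma / diag_sd, 1.0  # start one sd from the mode, unit momentum
    h_start = potential(z) + 0.5 * p ** 2

    p -= 0.5 * epsilon * grad(z)      # initial half step for momentum
    for step in range(n_steps):
        z += epsilon * p              # full step for position
        if step < n_steps - 1:
            p -= epsilon * grad(z)    # full step for momentum
    p -= 0.5 * epsilon * grad(z)      # final half step for momentum

    return abs(potential(z) + 0.5 * p ** 2 - h_start)


# epsilon = 0.1 on a parameter whose posterior sd is 0.01: the trajectory diverges
print(energy_error(sigma=0.01, epsilon=0.1, n_steps=10))
# same epsilon once the parameter is rescaled by its sd: the energy is nearly conserved
print(energy_error(sigma=0.01, epsilon=0.1, n_steps=10, diag_sd=0.01))
```

In this sketch the stepsize only has meaning relative to the scale of the parameter: once the scale estimate matches the posterior sd, the same epsilon behaves the same way whatever units the data were in.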

More leapfrog steps means better sampling, but each sample takes longer to evaluate. Execution time is linear in the number of leapfrog steps, so L=20 will take about twice as long to run as L=10 (in theory). There’s an Lmin and an Lmax because it’s a good idea to jitter the number a bit to avoid the sampler oscillating between two points in parameter space. In practice, varying Lmin and always setting Lmax=Lmin+5 should be all you need to do to resolve that.
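Both points can be seen in a small Python sketch (again not greta code; the function names are made up for the example). It counts gradient evaluations per trajectory on a standard normal target, and shows that a fixed L which happens to match the period of the trajectory brings the sampler back to where it started, whatever momentum was drawn. Drawing L uniformly between Lmin and Lmax is an assumption about how the jitter is applied.

```python
import random


def leapfrog_std_normal(q, p, epsilon, n_steps):
    """Leapfrog on a standard normal target.

    Returns the end position and the number of gradient evaluations used.
    """
    grad_evals = 0

    def grad(x):
        nonlocal grad_evals
        grad_evals += 1
        return x  # gradient of the potential 0.5 * x**2

    p -= 0.5 * epsilon * grad(q)      # initial half step for momentum
    for step in range(n_steps):
        q += epsilon * p              # full step for position
        if step < n_steps - 1:
            p -= epsilon * grad(q)    # full step for momentum
    p -= 0.5 * epsilon * grad(q)      # final half step for momentum
    return q, grad_evals


def jittered_steps(rng, l_min, l_max):
    """Draw the number of leapfrog steps uniformly from l_min..l_max (inclusive)."""
    return rng.randint(l_min, l_max)


# Cost: one gradient evaluation per step, plus one extra per trajectory.
for n_steps in (10, 20):
    print(n_steps, leapfrog_std_normal(1.0, 0.5, 0.1, n_steps)[1])

# With epsilon = 0.1, 63 steps is roughly one full period (2 * pi / 0.1),
# so the trajectory ends close to its starting position for any momentum.
for p0 in (-1.0, 0.5, 2.0):
    print(p0, leapfrog_std_normal(1.0, p0, 0.1, 63)[0])

# Jittering L breaks the resonance: end positions now differ between draws.
rng = random.Random(1)
for p0 in (-1.0, 0.5, 2.0):
    n_steps = jittered_steps(rng, 58, 63)
    print(n_steps, leapfrog_std_normal(1.0, p0, 0.1, n_steps)[0])
```

A fixed L at half the period has a similar effect, sending the position to roughly its mirror image each iteration, which is one way to read the "oscillating between two points" remark.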

So in the first instance, I would make sure you have enough warmup to learn epsilon and diag_sd well. The tuning parameters are learned jointly from the warm-up samples across all chains. So running lots of chains (10-20) is a good option to improve the warmup tuning. That’s often better than a longer warmup because running lots of chains is often cheap in greta (and tensorflow probability), so long as you are not running low on memory. For smallish models, sampling with about 10 chains often takes no longer than sampling with 1 chain. Even on a single core.

If it’s still sampling poorly, try increasing Lmin (and setting Lmax=Lmin+5), though I wouldn’t push Lmin above about 30 (normally 5-10 is fine). If it’s not sampling fairly well with that many leapfrog steps, there’s probably something wrong with your model definition, or there is some trick like hierarchical decentring that can help you.
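For anyone unfamiliar with the term, hierarchical decentring means writing a group-level effect as `mu + tau * z` with `z` standard normal, instead of drawing it directly from Normal(mu, tau). A minimal Python sketch of the idea, with made-up function names and values, and no greta involved:

```python
import random
import statistics


def centred_draw(rng, mu, tau):
    """Centred form: the effect is drawn directly from Normal(mu, tau)."""
    return rng.gauss(mu, tau)


def decentred_draw(rng, mu, tau):
    """Decentred form: draw a unit-scale z, then shift and scale it."""
    z = rng.gauss(0.0, 1.0)  # this is the quantity the sampler would explore
    return mu + tau * z


mu, tau, n = 3.0, 2.0, 20000  # illustrative values only
centred = [centred_draw(random.Random(i), mu, tau) for i in range(n)]
decentred = [decentred_draw(random.Random(n + i), mu, tau) for i in range(n)]

# Both forms describe the same distribution for the effect...
print(statistics.mean(centred), statistics.stdev(centred))
print(statistics.mean(decentred), statistics.stdev(decentred))
```

The two forms imply the same model, but in the decentred one the sampled variable `z` has unit scale no matter what `tau` is, which tends to be an easier geometry for a sampler with a single stepsize.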

There’s lots of lore/dark arts around practical use of MCMC, so hopefully this is helpful. It would probably be a good idea for us to put some of this (and explaining the many chains thing) in a user guide somewhere.

njtierney · July 8, 2022, 10:42pm

Added a note here: https://github.com/greta-dev/greta/issues/541

hrlai · July 27, 2022, 9:58pm

Thank you @nick for the summary. Reading it forced me to think about the modelling aspects again, particularly the priors.

I decided not to brute force it by increasing Lmin, keeping it <= 30. After a few struggles I realised that the parameters have quite different scales (i.e., the intercepts are an order of magnitude larger than the main effects, with a few interaction effects that are either weak or weakly identified from the noisy data). So I changed the variance of these coefficients from being fixed to being hierarchical, i.e., using a Normal-gamma prior in this case for some added regularisation. At first pass the chains improved a lot! But they weren’t ideal. Then I realised that the gamma prior for the variance was not informative enough to prevent the sampler from sampling relatively large positive variances. So I changed from gamma(2, 2) to gamma(10, 10) and it finally did the job, without changing the posterior median too much.
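To see how much tighter the second prior is, here is a small Python sketch comparing the two, assuming the arguments are shape and rate (worth checking against the helpfile). The function names are made up for the example, and the tail formula only holds for integer shapes.

```python
import math


def gamma_mean_sd(shape, rate):
    """Mean and standard deviation of a gamma(shape, rate) distribution."""
    return shape / rate, math.sqrt(shape) / rate


def gamma_upper_tail(shape, rate, x):
    """P(X > x) for an integer shape, via the Erlang (Poisson sum) identity."""
    lam = rate * x
    return math.exp(-lam) * sum(lam ** k / math.factorial(k) for k in range(shape))


for shape, rate in ((2, 2), (10, 10)):
    mean, sd = gamma_mean_sd(shape, rate)
    tail = gamma_upper_tail(shape, rate, 2.0)
    print(f"gamma({shape}, {rate}): mean={mean:.3f} sd={sd:.3f} P(X > 2)={tail:.4f}")
```

Both priors have mean 1, but the sd drops from about 0.71 to about 0.32, and the prior mass above 2 falls from roughly 9% to well under 1%, so large variances are much harder to reach.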

Lesson learned: if tuning the sampler doesn’t work well, then there is probably some inefficiency due to model specification. Probably the best place to start digging is the priors… ?

njtierney · July 28, 2022, 3:13am

Thanks so much for that, @hrlai - sounds like you were able to explore some good options to get a good result!