Should I Have Used Mixture To Marginalize an Unobserved Discrete RV?

Hi All:

I recently mimicked a Richard McElreath blog entry about marginalizing out an unobserved discrete random variable. His work is in Stan, so I reproduced the workflow in Greta and found it much easier, simpler, and more elegant.

My biggest fear is that it looks so easy that maybe I am missing something. Going into the exercise, I thought I would use a mixture() distribution with weights determined by the data. If you take a look, you will see that is not what I had to do. It was really quite astounding how it just worked. Anyway, check out my write-up at:

Blog Entry Using Greta

and feel free to compare to McElreath’s original: http://elevanth.org/blog/2018/01/29/algebra-and-missingness/

I appreciate any feedback.

The minimum code to reproduce the work extracted from the blog entry is pasted below for convenience.


This code generates some sample data:

library(dplyr)
library(greta)
# Actual Values That Models Will Try To Recover
actual_p_u = 0.5; actual_p_s = 1; actual_sigma = 0.75
set.seed(1)
N_children <- 51
s <- rbinom( N_children , size=1 , prob=actual_sigma )
s_obs <- s   # copy to get censored obs
s_obs[ sample( 1:N_children , size=21 ) ] <- NA  ##censored obs NA
tea <- rbinom( N_children , size=1 , prob= s*actual_p_s + (1-s)*actual_p_u )
dataDF = tibble(s_obs,tea)
nonCensoredDataDF = dataDF %>% filter(!is.na(s_obs))
censoredDataDF = dataDF %>% filter(is.na(s_obs))
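
A quick look at the simulated data (not part of the blog post, just a sanity check on the censoring) might be:

table(dataDF$s_obs, useNA = "ifany")  # 21 of the 51 stabling observations are censored (NA)
head(dataDF)  # one row per child: stabling status (possibly NA) and tea drinking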

And this code runs the greta model:

library(greta)
s_i = as_data(nonCensoredDataDF$s_obs) #data
tea_i = as_data(nonCensoredDataDF$tea) #data
tea_j = as_data(censoredDataDF$tea) #data

p_s = beta(2,2) #prior
p_u = beta(2,2) #prior
sigma = beta(2,2) # hyperprior

pi_i = s_i * p_s + (1 - s_i) * p_u #operation for i
pi_j = sigma * p_s + (1 - sigma) * p_u # operation for j

distribution(s_i) = bernoulli(sigma) #likelihood i
distribution(tea_i) = bernoulli(pi_i) #likelihood i
distribution(tea_j) = bernoulli(pi_j) #likelihood j

m = model(p_s,p_u,sigma) #model
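
For reference, drawing posterior samples from this model might look roughly like the following (the sampler settings are arbitrary and not part of the original workflow):

draws = mcmc(m, warmup = 1000, n_samples = 1000)  # HMC samples from the joint posterior
summary(draws)  # posterior summaries for p_s, p_u and sigma; compare to the actual values above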

And here is the model block of the analogous Stan code:

model{
  // priors
  p_cheat ~ beta(2,2);
  p_drink ~ beta(2,2);
  sigma ~ beta(2,2);

  // probability of tea
  for ( i in 1:N_children ) {
    if ( s[i] == -1 ) {
      // ox unobserved
      target += log_mix( 
                  sigma , 
                  bernoulli_lpmf( tea[i] | p_drink ) , 
                  bernoulli_lpmf( tea[i] | p_cheat ) );
    } else {
      // ox observed
      tea[i] ~ bernoulli( s[i]*p_drink + (1-s[i])*p_cheat );
      s[i] ~ bernoulli( sigma );
    }
  }//i
}

Firstly, congrats on being the first to post on the forum! :tada:

Yes, your model is correct. Because this model deals with point probabilities and Bernoulli distributions, you can construct the mixture directly from the probabilities, as you have done. To check this, we can look at the contributions to the model likelihood for the data points where tea drinking was observed but stabling wasn’t, i.e. for tea_j.

The Bernoulli distribution with probability p has density p if y = 1, and density 1 - p if y = 0. In a mixture model, we compute a weighted sum of the component densities, so for a two-component Bernoulli mixture with component probabilities p_1 and p_2, and corresponding weights w_1 and w_2, when y = 1 we have density p_1 w_1 + p_2 w_2, and when y = 0 we have density (1 - p_1) w_1 + (1 - p_2) w_2.
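
Written out (this is just the previous paragraph in symbols), the density of the two-component Bernoulli mixture for an observation y is

w_1 \, p_1^{y} (1 - p_1)^{1 - y} + w_2 \, p_2^{y} (1 - p_2)^{1 - y},

which reduces to p_1 w_1 + p_2 w_2 when y = 1 and to (1 - p_1) w_1 + (1 - p_2) w_2 when y = 0.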

In this case, we have that:
p_1 = p_s (p_s)
w_1 = \sigma (sigma)
p_2 = p_u (p_u)
w_2 = 1 - \sigma (1 - sigma).

So when y = 1 the mixture model would give us p_s \sigma + p_u (1 - \sigma)
and when y = 0 it would give us: (1 - p_s) \sigma + (1 - p_u) (1 - \sigma)

In your model, you define \pi_j = p_s \sigma + p_u (1 - \sigma) (pi_j in your code), and then plug this in as the parameter of a single Bernoulli distribution. So when y = 1 the density is \pi_j, which is equal to p_s \sigma + p_u (1 - \sigma), just as in the mixture model. When y = 0, the density is 1 - \pi_j, which is 1 - (p_s \sigma + p_u (1 - \sigma)). Rearranging the y = 0 case from the mixture model above, we can see that this gives the same density:

(1 - p_s) \sigma + (1 - p_u) (1 - \sigma)
= \sigma - p_s \sigma + (1 - \sigma) - p_u (1 - \sigma)
= (1 - \sigma) + \sigma - p_s \sigma - p_u (1 - \sigma)
= 1 - p_s \sigma - p_u (1 - \sigma)
= 1 - (p_s \sigma + p_u (1 - \sigma))
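
The same equivalence is easy to check numerically; here is a small sketch in R with arbitrary parameter values (not from the original post):

# arbitrary example values for the check
p_s <- 0.9; p_u <- 0.3; sigma <- 0.6
pi_j <- p_s * sigma + p_u * (1 - sigma)
for (y in 0:1) {
  mix    <- dbinom(y, 1, p_s) * sigma + dbinom(y, 1, p_u) * (1 - sigma)  # mixture density
  single <- dbinom(y, 1, pi_j)                                           # single Bernoulli density
  print(c(y = y, mixture = mix, single = single))  # the two densities agree for y = 0 and y = 1
}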

So your approach is the same as the mixture model approach.

For completeness, here is the code for the mixture model version. The data and priors are the same; all we’ve done is remove the definition of pi_j and change the distribution for tea_j to be a mixture. As we’d expect, this gives the same results.

pi_i = s_i * p_s + (1 - s_i) * p_u

distribution(s_i) = bernoulli(sigma)
distribution(tea_i) = bernoulli(pi_i)
distribution(tea_j) = mixture(bernoulli(p_s),
                              bernoulli(p_u),
                              weights = c(sigma, 1 - sigma))
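
To convince yourself of that, one option (again a sketch, with arbitrary sampler settings) is to build and sample this version and compare the posterior summaries against the original formulation:

m_mix = model(p_s, p_u, sigma)
draws_mix = mcmc(m_mix, warmup = 1000, n_samples = 1000)
summary(draws_mix)  # should match the non-mixture model up to Monte Carlo error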

Thanks for the confirmation. Greta is really slick.
