Bayesian Logistic Regression in greta

I am trying to implement Bayesian logistic regression in greta, but I'm stuck on defining the likelihood. Here is the code so far:

library(greta)

# data
summary(iris)
x = within(iris, rm(Species))
y = iris$Species

# prior
w_mean = t(integer(5))
w_sigma = diag(5)
prior = multivariate_normal(w_mean, w_sigma)

# likelihood
distribution(y) = # how to define logistic regression likelihood
    
# model
m = model(prior)

# sampling
draws <- mcmc(m, n_samples = 1000)

Because iris has 4 predictors, I have taken the prior to be a multivariate normal distribution with 5 means (one extra for the bias) and a 5x5 identity matrix as the covariance.

Now I need to define the likelihood, which is given as mentioned here. I am confused as to how I would implement it.

The main source of confusion is how the prior gets multiplied with the likelihood to form the posterior, and how the posterior itself is optimized (how gradient descent happens). I tried defining some functions like this:

sigmoid = function(wt, x){
    # dot product of the weight vector and one row of features, mapped to (0, 1)
    result = 1 / (1 + exp(-sum(wt * x)))
    return(result)
}

likelihood = function(wt, dataset){
    temp = 1
    for(i in seq_len(nrow(dataset))){              # loop over rows, not columns
        xn = c(1, unlist(dataset[i, 1:4]))         # prepend a 1 for the bias term
        tn = dataset$Species[i]                    # would need recoding to 0/1
        yn = sigmoid(wt, xn)
        temp = temp * (yn ^ tn) * (1 - yn) ^ (1 - tn)
    }
    return(temp)
}


distribution(y) = likelihood(w, iris) # what would be w

But I can’t figure out what w would be (it needs to be found via gradient descent), or how the prior plays its role here.

There’s an example of logistic regression on the website here: https://greta-stats.org/articles/example_models.html#logistic-regression

I think I can see where your confusion is coming from. greta (like Stan, JAGS, BUGS etc.) doesn’t require the user to compute the posterior; it just requires them to write out the generative model for the data. So it’s more like the statistical notation you would use to define a model, and not like the mathematical notation you would use to define the posterior.

In greta, you set the prior when creating the variable (e.g. alpha <- normal(0, 2)), and you set the likelihood when you state a distribution over observed variables (e.g. distribution(y) <- bernoulli(p)). Then greta does the work of combining them into a function that maps from the parameter values to the posterior density, which can be used for inference.
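
For your iris setup, a minimal sketch of that style might look like the following (the binary recoding of Species, the choice of versicolor as the positive class, and the normal(0, 10) priors are all my assumptions, not something greta requires):

library(greta)

# sketch only: collapse the three iris species to a binary 0/1 outcome,
# since logistic regression needs a binary response
x <- as_data(as.matrix(iris[, 1:4]))
y <- as_data(as.integer(iris$Species == "versicolor"))

# priors: set when the parameter variables are created
alpha <- normal(0, 10)             # intercept (the "bias" term)
beta <- normal(0, 10, dim = 4)     # one coefficient per predictor

# likelihood: a distribution over the observed data
p <- ilogit(alpha + x %*% beta)
distribution(y) <- bernoulli(p)

# greta combines the prior and likelihood into the posterior for you
m <- model(alpha, beta)
draws <- mcmc(m, n_samples = 1000)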


Thanks a lot! This clarifies most of my doubts.

However, this is the first time I have come across such notation. I am wondering if this is mathematically equivalent to the other notational form, i.e.

posterior \propto likelihood \times prior

where we define the prior and the likelihood, and the likelihood is maximized by the library? Is this just a notational change, with the underlying computation being equivalent?

Right, I’d call that mathematical notation for Bayes’ theorem. greta does that bit for you, and lets you sample or maximise the posterior (or the likelihood, if you don’t set priors and tell greta::opt() not to include Jacobian adjustments).
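
To make that concrete, a rough sketch of the inference calls (m is the model object from your code; adjust is, as far as I know, the opt() argument that controls the Jacobian adjustments):

draws <- mcmc(m, n_samples = 1000)   # sample the posterior

map <- opt(m)                        # maximise the posterior (MAP estimate)

# for a maximum likelihood estimate, create the parameters with variable()
# rather than with priors, and switch off the Jacobian adjustments
mle <- opt(m, adjust = FALSE)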

By statistical notation, I mean the equations commonly seen for defining a statistical model, which really describe the hypothesised generative process for the data. E.g. the notation for that Bayesian logistic regression example would look something like:

y_i \sim Bernoulli(p_i)
logit(p_i) = \alpha + X_i \beta
\alpha \sim N(0, 10)
\beta_k \sim N(0, 10)

where X_i denotes row i of design matrix X. This implicitly defines both the priors and the likelihood. You can pull them out and combine them into the posterior manually, but greta will do that bit for you.

Here’s the code again:

alpha <- normal(0, 10)
beta <- normal(0, 10, dim = n_env)
linear_predictor <- alpha + env %*% beta
p <- ilogit(linear_predictor)
distribution(occupancy) <- bernoulli(p)

Note some differences between the greta code and the notation:

  • the lines are in the opposite order (it’s code, so we need to create objects before we use them)
  • the greta code vectorises across the elements of p and y, rather than iterating over i (see the sketch after this list)
  • we use the inverse logit link on the right hand side, rather than the logit link on the left hand side (again, because it’s code)
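
To illustrate the vectorisation point, here’s a small sketch with made-up data (n_sites, n_env and the random env matrix are placeholders standing in for the example’s real data):

n_sites <- 100
n_env <- 3
env <- matrix(rnorm(n_sites * n_env), n_sites, n_env)

alpha <- normal(0, 10)
beta <- normal(0, 10, dim = n_env)

p <- ilogit(alpha + env %*% beta)
dim(p)   # 100 1 -- one p_i per row of env, with no explicit loop over i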

Am I right that when we write X_i \beta, greta is either taking the transpose of one of them, or that when we define normal(0, 1, dim = n) it makes a column vector?

Second, when I tried to extend this to a multivariate normal distribution, I ended up with code like this:

n_env = 10 # some constant
alpha = normal(0, 1)
b_mean = integer(n_env)
b_sigma = diag(n_env)
beta = multivariate_normal(t(b_mean), b_sigma, n_realisations = 1)
linear_predictor = alpha + x %*% t(beta)

Not sure if this is right though, or if it even makes sense.

In the examples here, we have

mu <- normal(0, 1)
Sig <- diag(4) 
theta <- multivariate_normal(t(rep(mu, 4)), Sig)

I guess this is for example purposes only. I don’t think there is a need to have the mean be a distribution itself when we use the multivariate distribution, or is it necessary?

Right, by default greta arrays created with one dimension are made as column vectors, so X_i is a row vector and \beta is a column vector.
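
For example (just a sketch to show the default orientations):

beta <- normal(0, 10, dim = 3)
dim(beta)                       # 3 1 -- a column vector
x_i <- t(c(1.2, 0.5, -0.3))     # one row of X as a 1 x 3 matrix
dim(x_i %*% beta)               # 1 1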

The multivariate normal code makes sense. That formulation would enable you to specify prior covariances/correlations between the regression coefficients (by changing b_sigma to be non-diagonal), but with a diagonal covariance it’s the same as having a vector of independent variables. Specifically, it’s equivalent to this prior:

beta <- normal(0, 1, dim = c(1, n_env))

(note I explicitly create a row vector here).
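
So, reusing your n_env and x, either of these (just a sketch) gives a 1 x n_env row vector of coefficients with independent standard normal marginals, and both plug into the same linear predictor:

beta_mvn <- multivariate_normal(t(integer(n_env)), diag(n_env))   # multivariate form, diagonal covariance
beta_ind <- normal(0, 1, dim = c(1, n_env))                       # independent normals

linear_predictor <- alpha + x %*% t(beta_mvn)   # same shape with either prior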


The other example is different because it’s a type of hierarchical model, where information is shared between the regression coefficients. E.g. if the data implies that the first regression coefficient should be very large, the model will make the rest of the regression coefficients larger (AKA partial pooling).

Like you say, that’s just an example to show what you can do. There aren’t many cases where that is a sensible model.
