Gaussian process regression introduced how we can use Gaussian processes for regression modelling when we assume normally distributed data. In that scenario, deriving the posterior GP was especially easy, because it has a closed-form solution. In this notebook we will extend the GP framework to binary classification. Most of the material is based on Rasmussen and Williams (2006), but I also recommend Betancourt (2020) as a resource. Throughout this notebook, we will use Stan to fit GPs. Feedback and comments are welcome!

suppressMessages({
  library(tidyverse)
  library(ggthemes)
  library(colorspace)

  library(rstan)
  library(bayesplot)
})

set.seed(23)
color_scheme_set("darkgray")
options(mc.cores = 4L)

Latent GPs

To make GPs fit into a classification framework we merely need to adapt the observation model. The GP prior itself stays the same, i.e., it is fully specified by a mean function \(m\) and a covariance function \(k\):

\[\begin{align*} f(\mathbf{x}) & \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')) \end{align*}\]
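
Throughout we will use the exponentiated-quadratic covariance function, which is what Stan provides as cov_exp_quad and what all models below rely on. For reference, a minimal R version of this kernel could look as follows (the helper name cov.exp.quad is ours, not part of any package):

cov.exp.quad <- function(x, x_prime = x, alpha, rho) {
  # pairwise squared distances between the two sets of inputs
  d2 <- outer(x, x_prime, function(a, b) (a - b)^2)
  alpha^2 * exp(-d2 / (2 * rho^2))
}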

For classification we will use a data set \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\) with binary labels \(y_i \in \{0, 1\}\).

To map the latent GP into the domain of \(Y\), we squash it through an inverse link function, exactly as for GLMs. Most commonly we use the logit link, whose inverse is the logistic function:

\[\begin{equation} \pi(x) = \frac{1}{1+\exp(-x)} \end{equation}\]
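
In R, the logistic function is available as plogis, and it is also what the binomial family object used below returns as its inverse link. A quick sanity check (inv.logit is just a throwaway helper for the comparison):

inv.logit <- function(x) 1 / (1 + exp(-x))
# both comparisons should return TRUE
all.equal(inv.logit(c(-2, 0, 3)), plogis(c(-2, 0, 3)))
all.equal(inv.logit(c(-2, 0, 3)), binomial()$linkinv(c(-2, 0, 3)))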

Data

Let’s generate some data and try to fit it using Stan. First we define some parameters:

n.init <- 1000
alpha  <- 5
rho    <- .1
x.init <- seq(-1, 1, length.out=n.init)

We’ll use a Bernoulli observation model (i.e., a binomial with a single trial) and the aforementioned logit link:

\[\begin{align*} p(\mathbf{y} \mid f, \mathbf{x}) = \prod_{i=1}^n \text{Bernoulli}\left(y_i \mid \text{logit}^{-1} \left(f(\mathbf{x}_i) \right) \right). \end{align*}\]
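
To make the observation model concrete: given a vector of latent GP values, simulating binary outcomes in R is a one-liner (f.example below is just a made-up latent vector; in a moment we let Stan generate both the latent GP and the observations):

f.example <- c(-2, 0, 2)  # hypothetical latent values
y.example <- rbinom(length(f.example), size = 1, prob = plogis(f.example))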

Then we load a model file that generates data for us.

obs.model <- "_models/gp_generate_binary.stan"
cat(readLines(obs.model), sep = "\n")
data {
  int<lower=1> n;
  real x[n];

  real<lower=0> alpha;
  real<lower=0> rho;
}

transformed data {
  matrix[n, n] K = cov_exp_quad(x, alpha, rho)
      + diag_matrix(rep_vector(1e-10, n));
  matrix[n, n] L_K = cholesky_decompose(K);
}

parameters {}
model {}

generated quantities {
  vector[n] f = multi_normal_cholesky_rng(rep_vector(0, n), L_K);
  vector[n] y;
  for (i in 1:n) {
    y[i] = bernoulli_logit_rng(f[i]);
  }
}
yf.init <- stan(
  obs.model,
  data=list(n=n.init, x=x.init, alpha=alpha, rho=rho),
  iter=1,
  chains=1,
  algorithm="Fixed_param"
)

We get the data from Stan by extracting them by name. In addition to the latent function and the observations, we also compute the latent means, i.e., the sampled GP squashed through our inverse link function. Conveniently, R already provides a couple of families and their respective link functions:

binomial.family <- binomial()

f.init <- extract(yf.init)$f[1,]
y.init <- extract(yf.init)$y[1,]
mu.init <- binomial.family$linkinv(f.init)

linear.predictor.df <- data.frame(x=x.init, mu=mu.init)

Having sampled both the data and the latent GP, we subsample \(n = 100\) points, put everything in a data frame, and visualize the data together with the true latent GP squashed through the inverse link.

set.seed(23L)

n <- 100L
idxs <- sort(sample(seq(f.init), n, replace=FALSE))

x <- x.init[idxs]
y <- y.init[idxs]

D <- data.frame(x=x, y=y)
ggplot() +
  geom_point(data=D, aes(x, y), size=1) +
  geom_line(data=linear.predictor.df, aes(x, mu)) +
  theme_tufte() +
  theme(
    axis.text = element_text(colour = "black", size = 15),
    strip.text = element_text(colour = "black", size = 15)
  ) +
  xlab(NULL) +
  ylab(NULL)

Posterior GP

Since the prior is not conjugate to the likelihood, the posterior does not have an analytical form and thus needs to be approximated, either deterministically, e.g., using a Laplace approximation, expectation propagation, or variational inference, or stochastically using sampling.

Conveniently, using Stan, we can do full Bayesian inference of the GP as well as the kernel hyperparameters. The Stan code to do all of this for us is shown below:

posterior.model <- "_models/gp_posterior_binomial.stan"
cat(readLines(posterior.model), sep = "\n")
functions {
 vector reparam(vector f_tilde, real[] x, real alpha, real rho)  {
    int n = num_elements(x);
    vector[n] f;
    matrix[n, n] L_K;
    matrix[n, n] K = cov_exp_quad(x, alpha, rho)
        + diag_matrix(rep_vector(1e-10, n));
    L_K = cholesky_decompose(K);
    f = L_K * f_tilde;
    return f;
 }
}

data {
  int<lower=1> n;
  real x[n];
  int y[n];

  int<lower=1> n_star;
  real x_star[n_star];
}


parameters {
  vector[n] f_tilde;
  real<lower=0> alpha;
  real<lower=0> rho;
}

transformed parameters {
  vector[n] f = reparam(f_tilde, x, alpha, rho);
}

model {
  f_tilde ~ std_normal();
  rho ~ inv_gamma(25, 5);
  alpha ~ normal(0, 3);

  y ~ bernoulli_logit(f);
}
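
A short note on the reparam function: instead of placing a multivariate normal prior on \(f\) directly, the model samples standard normal variables \(\tilde{f}\) and scales them with the Cholesky factor of the kernel matrix. This non-centered parameterization yields the same prior, since

\[\begin{align*} \tilde{f} \sim \mathcal{N}(0, I), \quad f = L_K \tilde{f} \quad \Rightarrow \quad f \sim \mathcal{N}\left(0, L_K L_K^T\right) = \mathcal{N}(0, K), \end{align*}\]

but it is typically much easier for Stan’s sampler to explore, because the strong correlations induced by \(K\) are moved out of the parameter space.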

Let’s run it and see if our inferences are accurate.

posterior <- stan(
  posterior.model,
  data=list(n=n, x=x, y=y, n_star=n.init, x_star=x.init),
  iter=2000,
  chains=4, 
  seed = 23
)
print(posterior, pars=c("alpha", "rho", "lp__"))
Inference for Stan model: gp_posterior_binomial.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
alpha   4.65    0.04 1.41   2.31   3.63   4.49   5.53   7.74  1401 1.00
rho     0.15    0.00 0.02   0.12   0.14   0.15   0.16   0.19   847 1.01
lp__  -73.82    0.20 7.51 -89.07 -78.70 -73.67 -68.53 -59.76  1483 1.00

Samples were drawn using NUTS(diag_e) at Wed Mar 24 23:01:18 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The inferences look good: high effective sample sizes as well as \(\hat{R}\) values of one. Let’s plot the traces as well as the histograms of the posteriors of the kernel hyperparameters.

bayesplot::mcmc_trace(posterior, pars=c("rho", "alpha")) +
  theme_tufte() +
  theme(
    axis.text = element_text(colour = "black", size = 15),
    strip.text = element_text(colour = "black", size = 15)
  ) +
  scale_color_discrete_sequential(l1 = 1, l2 = 60) +
  xlab(NULL) +
  ylab(NULL) +
  guides(color=FALSE)

bayesplot::mcmc_hist_by_chain(posterior, pars=c("rho", "alpha")) +
  theme_tufte() +
  theme(
    axis.text = element_text(colour = "black", size = 15),
    strip.text = element_text(colour = "black", size = 15)
  ) +
  xlab(NULL) +
  ylab(NULL)

Let’s also have a look at the mean of the posterior of the latent GP.

samples <- extract(posterior, "f")$f

mu.posterior <-  binomial.family$linkinv(samples)
mu.posterior.mean <- apply(mu.posterior, 2, mean)
mu.posterior.quantiles <- apply(mu.posterior, 2, quantile, prob=c(0.05, 0.95))

mu.posterior <- data.frame(
    x=x,
    mu=mu.posterior.mean,
    mu.quantiles = t(mu.posterior.quantiles)
  ) %>%
  set_names(c("x", "mu", "lower", "upper"))

ggplot() +
  geom_ribbon(data=mu.posterior, aes(x, ymin=lower, ymax=upper), fill="#A1A6C8") +
  geom_point(data=D, aes(x, y), size=1) +
  geom_line(data=mu.posterior, aes(x, mu, color="darkblue")) +
  geom_line(data=linear.predictor.df, aes(x, mu, colour="black")) +
  scale_colour_manual(
    values=c("black", "darkblue"), 
    breaks=c("black", "darkblue"), 
    labels=c("True latent GP", "Posterior mean")) +
  theme_tufte() +
  theme(
    axis.text = element_text(colour = "black", size = 15),
    legend.text = element_text(colour = "black", size = 15),
    legend.title = element_blank()
  ) +
  xlab(NULL) +
  ylab(NULL)

Predictive posterior

Finally, let’s also make some predictions. Computing the predictive posterior of a GP classification model can be a bit awkward in Stan. The manual suggests modelling the predictive posterior distribution as

\[ p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{x}^* , \mathbf{x}) = \frac{ p(\mathbf{y}^* , \mathbf{y} \mid \mathbf{x}^*, \mathbf{x})}{p(\mathbf{y} \mid \mathbf{x})} \propto p(\mathbf{y}^* , \mathbf{y} \mid \mathbf{x}^*, \mathbf{x}), \] i.e., jointly with the observed data. That can be inconvenient, because often we only want to compute the posterior once and then use it for multiple predictions. Hence, here we first compute the posterior over the hyperparameters and the latent GP and then use it for prediction as:

\[ p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{x}^* , \mathbf{x}) = \int p(\mathbf{y}^* \mid f^*) \, p(f^* \mid f, \mathbf{x}^*, \mathbf{x}) \, p(f \mid \mathbf{y}, \mathbf{x}) \, \mathrm{d}f^* \, \mathrm{d}f . \] So we first compute \(p(f \mid \mathbf{y}, \mathbf{x})\) (which we did above) and then compute \(p(f^* \mid f, \mathbf{x}^*, \mathbf{x})\). The model code is below.

predictive.posterior.model <- 
  "_models/gp_predictive_posterior_binomial.stan"
cat(readLines(predictive.posterior.model), sep = "\n")
functions {
  vector f_star_rng(vector f, real[] x, real[] x_star, real alpha, real rho) {
    int n = size(x);
    int n_star = size(x_star);
    matrix[n, n] K = cov_exp_quad(x, alpha, rho)
        + diag_matrix(rep_vector(1e-10, n));
    matrix[n, n] L_K = cholesky_decompose(K);

    matrix[n_star, n] K_star = cov_exp_quad(x_star, x, alpha, rho);
    matrix[n_star, n_star] K_star_star = cov_exp_quad(x_star, alpha, rho);
    matrix[n, n_star] A_star = mdivide_left_tri_low(L_K, K_star');

    vector[n_star] f_star_mean = K_star * inverse(K) * f;
    matrix[n_star, n_star] f_star_cov =  K_star_star
        - A_star' * A_star
        + diag_matrix(rep_vector(1e-10, n_star));

    vector[n_star] f_star = multi_normal_cholesky_rng(
        f_star_mean,
        f_star_cov
    );
    return f_star;
  }
}

data {
  int<lower=1> n;
  real x[n];

  int<lower=1> n_star;
  real x_star[n_star];

  vector[n] f;
  real<lower=0> alpha;
  real<lower=0> rho;
}

generated quantities {
  vector[n_star] f_star;
  f_star = f_star_rng(f, x, x_star, alpha, rho);
}

For the computation of \(p(f^* \mid f, \mathbf{x}^*, \mathbf{x})\) we need to compute the covariances \(K(\mathbf{x}^* , \mathbf{x})\) and \(K(\mathbf{x}^* , \mathbf{x}^* )\), for which we need the kernel hyperparameters. We can either iterate over every posterior sample \(f, \alpha, \rho \sim p(f, \alpha, \rho \mid \mathcal{D})\) and sample from the posterior predictive \(p(f^* \mid f, \mathbf{x}^*, \mathbf{x})\) for each of them, or we can just compute the posterior means and plug them in, which is what we do here. From the posterior means, we compute the mean \(\mathbb{E}[f^* \mid f, \mathbf{x}^*, \mathbf{x}]\) and variance \(\operatorname{Var}[f^* \mid f, \mathbf{x}^*, \mathbf{x}]\) and then sample from \(p(f^* \mid f, \mathbf{x}^*, \mathbf{x})\).
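
Concretely, \(p(f^* \mid f, \mathbf{x}^*, \mathbf{x})\) is the usual Gaussian conditional of a noise-free GP, which is exactly what the f_star_rng function above computes:

\[\begin{align*} f^* \mid f, \mathbf{x}^*, \mathbf{x} \sim \mathcal{N}\left( K(\mathbf{x}^*, \mathbf{x}) \, K(\mathbf{x}, \mathbf{x})^{-1} f, \; K(\mathbf{x}^*, \mathbf{x}^*) - K(\mathbf{x}^*, \mathbf{x}) \, K(\mathbf{x}, \mathbf{x})^{-1} K(\mathbf{x}, \mathbf{x}^*) \right) \end{align*}\]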

samples <- extract(posterior)

alpha.posterior.mean <- mean(samples$alpha)
rho.posterior.mean   <- mean(samples$rho)
f.posterior.mean     <- apply(samples$f, 2, mean)
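
For completeness, here is a sketch of the fully Bayesian alternative mentioned above, computed directly in R rather than Stan: for every posterior draw we condition on that draw’s \(f\), \(\alpha\) and \(\rho\) and compute the predictive mean of the latent GP. The cov.exp.quad helper from the beginning is our own, and for brevity we only propagate predictive means rather than full draws of \(f^*\):

n.draws   <- 200L                               # use a subset of the posterior draws
draw.idxs <- sample(nrow(samples$f), n.draws)

f.star.means <- sapply(draw.idxs, function(s) {
  K      <- cov.exp.quad(x, x, alpha=samples$alpha[s], rho=samples$rho[s]) +
    diag(1e-10, length(x))
  K.star <- cov.exp.quad(x.init, x, alpha=samples$alpha[s], rho=samples$rho[s])
  # predictive mean for this draw: K(x*, x) K(x, x)^{-1} f
  drop(K.star %*% solve(K, samples$f[s, ]))
})

# average the per-draw predictive probabilities over the posterior draws
mu.star.bayes <- rowMeans(binomial.family$linkinv(f.star.means))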

Since the model doesn’t contain any parameters, we run a single chain with the “Fixed_param” algorithm.

predictive.posterior <- stan(
  predictive.posterior.model,
  data=list(
    n=n, x=x, y=y, n_star=n.init, x_star=x.init,
    f=f.posterior.mean, alpha=alpha.posterior.mean, rho=rho.posterior.mean),
  iter=1000,
  warmup=0,
  chains=1,
  algorithm="Fixed_param"
)

We then compute the mean of the predictive posterior, plot it (red) and compare it to the true latent function (black).

samples <- extract(predictive.posterior)$f_star

mu.p.posterior <- binomial.family$linkinv(samples)
mu.p.posterior.mean <- apply(mu.p.posterior, 2, mean)
mu.p.posterior.quantiles <- apply(
  mu.p.posterior, 2, quantile, prob=c(0.025, 0.975)
)

mu.p.posterior <- data.frame(
  x=x.init,
  mu=mu.p.posterior.mean,
  mu.quantiles = t(mu.p.posterior.quantiles)
) %>%
  set_names(c("x", "mu", "lower", "upper"))

ggplot() +
  geom_point(data=D, aes(x, y), size=1) +
  geom_line(data=mu.p.posterior, aes(x, mu, color="darkred")) +
  geom_line(data=mu.posterior, aes(x, mu, color="darkblue")) +
  geom_line(data=linear.predictor.df, aes(x, mu, colour="black")) +
  scale_colour_manual(
    values=c("black", "darkblue" , "darkred"), 
    breaks=c("black", "darkblue", "darkred"), 
    labels=c("True latent GP", "Posterior mean", "Posterior predictive\nmean")) +
  theme_tufte() +
  theme(
    axis.text = element_text(colour = "black", size = 15),
    legend.text = element_text(colour = "black", size = 15),
    legend.title = element_blank()
  ) +
  xlab(NULL) +
  ylab(NULL)

Session info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_CH.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_CH.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_CH.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bayesplot_1.7.2      rstan_2.21.2         StanHeaders_2.21.0-5
 [4] colorspace_1.4-1     ggthemes_4.2.4       forcats_0.5.0       
 [7] stringr_1.4.0        dplyr_1.0.1          purrr_0.3.4         
[10] readr_1.3.1          tidyr_1.1.1          tibble_3.0.3        
[13] ggplot2_3.3.2        tidyverse_1.3.0     

loaded via a namespace (and not attached):
 [1] httr_1.4.2         jsonlite_1.7.0     modelr_0.1.8       RcppParallel_5.0.2
 [5] assertthat_0.2.1   stats4_4.0.2       blob_1.2.1         cellranger_1.1.0  
 [9] yaml_2.2.1         pillar_1.4.6       backports_1.1.8    glue_1.4.1        
[13] digest_0.6.25      rvest_0.3.6        htmltools_0.5.0    plyr_1.8.6        
[17] pkgconfig_2.0.3    broom_0.7.0        haven_2.3.1        scales_1.1.1      
[21] processx_3.4.3     generics_0.0.2     farver_2.0.3       ellipsis_0.3.1    
[25] withr_2.2.0        cli_2.0.2          magrittr_1.5       crayon_1.3.4      
[29] readxl_1.3.1       evaluate_0.14      ps_1.3.4           fs_1.5.0          
[33] fansi_0.4.1        xml2_1.3.2         pkgbuild_1.1.0     tools_4.0.2       
[37] loo_2.3.1          prettyunits_1.1.1  hms_0.5.3          lifecycle_0.2.0   
[41] matrixStats_0.56.0 V8_3.2.0           munsell_0.5.0      reprex_0.3.0      
[45] callr_3.4.3        compiler_4.0.2     rlang_0.4.7        grid_4.0.2        
[49] ggridges_0.5.2     rstudioapi_0.11    labeling_0.3       rmarkdown_2.6     
[53] gtable_0.3.0       codetools_0.2-16   inline_0.3.15      DBI_1.1.0         
[57] curl_4.3           reshape2_1.4.4     R6_2.4.1           gridExtra_2.3     
[61] lubridate_1.7.9    knitr_1.29         stringi_1.4.6      parallel_4.0.2    
[65] Rcpp_1.0.5         vctrs_0.3.2        dbplyr_1.4.4       tidyselect_1.1.0  
[69] xfun_0.16         

References

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. The MIT Press. http://www.gaussianprocess.org/gpml.