Many real-world problems involve optimization of functions that are difficult, or costly, to evaluate. For instance, in deep learning, finding the optimal hyperparameters and architecture of a neural network is a cumbersome process that should ideally be automated with as few function evaluations (i.e., network trainings) as possible. For problems like these, Bayesian optimization (BO) offers a unifying framework in which the function to evaluate is approximated using a surrogate model.

In this notebook, we introduce the basics of BO. More specifically, we are interested in optimizing some function \(f: \mathcal{X} \rightarrow \mathbb{R}\) on some set \(\mathcal{X} \subset \mathbb{R}^p\), i.e., to find

\[\begin{equation} \mathbf{x}_{\max} = \arg \max_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}) \end{equation}\]

We assume that we are allowed to query \(f\) at points \(\mathbf{x}_i\), which yields a noisy observation \(y_i\) for every query \(i\). We will approximate \(f\) using a probabilistic model, namely a Gaussian process (GP). To optimize \(f\), we fit a GP to the currently available data \(\mathcal{D} = \{\mathbf{x}_i, y_i\}_{i=1}^n\), find the next point \(\mathbf{x}_{n + 1}\) with some appropriate acquisition function, query it to receive \(y_{n + 1}\), and then repeat the procedure for a certain number of iterations \(N\).
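Schematically, the loop we will implement looks roughly as follows (only a sketch: the grid x.init, the proposal function acquire, and the target function f are all defined further below).

# sketch of the BO loop: acquire() proposes the next point to query and
# f() is the (expensive) function to optimize, both defined later on
x.d <- runif(1, -5, 5)
y.d <- f(x.d)
for (i in seq_len(N)) {
  x.next <- acquire(x.d, y.d, x.init)$x.next
  x.d <- c(x.d, x.next)
  y.d <- c(y.d, f(x.next))
}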

Throughout this notebook, we will use Stan to fit the surrogate model. Some introductory references on Bayesian optimization include Snoek, Larochelle, and Adams (2012), Shahriari et al. (2015) and Frazier (2018). Feedback and comments are welcome!

suppressMessages({
  library(ggplot2)
  library(cowplot)
  library(pracma)
  library(rdist)

  library(rstan)
})

set.seed(23)

For convenience, we define a method to plot the GP fit and data.

scatterplot <- function(x.true, y.true, f_star = NULL, f_star_var = NULL, data = NULL) {
  df <- data.frame(x = x.true, y = y.true)

  g <- ggplot() +
    theme(
      axis.line.x = element_line(color = "black", size = .25),
      axis.line.y = element_line(color = "black", size = .25),
      axis.text = element_text(size = 12),
      axis.title = element_text(size = 13),
      legend.text = element_text(size = 13)
    )

  if (!is.null(f_star) && !is.null(f_star_var)) {
    df_star <- data.frame(
      x = x.true,
      y = f_star,
      lower = f_star - sqrt(f_star_var),
      upper = f_star + sqrt(f_star_var)
    )
    g <- g +
      geom_ribbon(
        data = df_star,
        aes(x, ymin = lower, ymax = upper, fill = "#DCBCBC")
      ) +
      geom_line(data = df_star, aes(x, y, color = "darkred"))
  }

  if (!is.null(data)) {
    g <- g + geom_point(data = data, aes(x, y, color = "darkred"))
  }

  g +
    geom_line(data = df, aes(x, y, color = "darkgrey")) +
    scale_color_manual(
      breaks = c("darkgrey", "darkred", "black"),
      labels = c("Function to optimize", "Posterior", "Acquisition function"),
      values = c("darkgrey", "darkred", "black")
    ) +
    scale_fill_manual(
      breaks = c("lightgrey", "darkred", "#DCBCBC"),
      values = c("lightgrey", "darkred", "#DCBCBC")
    ) +
    labs(color = NULL) +
    guides(fill = FALSE) +
    theme_cowplot()
}

Bayesian optimization

Usually the function to maximize is difficult or infeasible to query, and we want to keep the number of function evaluations as small as possible. In this example, we demonstrate Bayesian optimization on the function \(f(x) = \cos(4x) + \exp(-x^2/2)\), defined below.

f <- function(x) cos(4 * x) + exp(-(x**2) / 2)

Without loss of generality, we constrain the optimization to a set of \(m\) points in the interval \(x \in [-5, 5]\), such that we don’t need to write too much boilerplate code, i.e., we just evaluate these points and take maxima or minima instead of optimizing over the function itself.

m <- 1000

x.init <- seq(-5, 5, length.out = m)
y.init <- f(x.init)

The function to optimize is shown below:

scatterplot(x.init, y.init)

To optimize this function, we define functions to fit a surrogate model and to make predictions using the model fit. As mentioned above, we use Stan to implement the GP. The Stan code is fairly short. It takes the data, a set of points for which predictions should be made, and the kernel hyperparameters. The model and parameters blocks are empty, since here we confine ourselves to not fitting a posterior over the kernel hyperparameters.

surrogate.model.file <- "_models/bo-surrogate.stan"
cat(readLines(surrogate.model.file), sep = "\n")
data {
  int<lower=1> N;
  real x[N];
  vector[N] y;

  int<lower=1> N_star;
  real x_star[N_star];

  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}

parameters {}

model {}

generated quantities {
  vector[N_star] f_star;
  vector[N_star] f_star_cov;

  {
    matrix[N, N] K =  cov_exp_quad(x, alpha, rho)
       + diag_matrix(rep_vector(square(sigma), N));
    matrix[N, N] L_K = cholesky_decompose(K);

    vector[N] L_K_div_y = mdivide_left_tri_low(L_K, y);
    vector[N] K_div_y = mdivide_right_tri_low(L_K_div_y', L_K)';

    matrix[N, N_star] k_x_x_star = cov_exp_quad(x, x_star, alpha, rho);

    matrix[N, N_star] v_pred = mdivide_left_tri_low(L_K, k_x_x_star);
    matrix[N_star, N_star] cov_f2 = cov_exp_quad(x_star, alpha, rho) -
        v_pred' * v_pred
        + diag_matrix(rep_vector(1e-10, N_star));

    f_star = (k_x_x_star' * K_div_y);
    f_star_cov = diagonal(cov_f2);
  }
}
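For reference, the generated quantities block implements the standard GP predictive equations for the noisy-observation model,

\[\begin{equation} \boldsymbol{\mu}_* = K_*^\top \left(K + \sigma^2 I \right)^{-1} \mathbf{y}, \qquad \Sigma_* = K_{**} - K_*^\top \left(K + \sigma^2 I \right)^{-1} K_*, \end{equation}\]

where \(K\), \(K_*\) and \(K_{**}\) are the exponentiated quadratic kernel (with marginal standard deviation \(\alpha\) and length scale \(\rho\)) evaluated between the observed points, between observed and prediction points, and between the prediction points, respectively. f_star holds \(\boldsymbol{\mu}_*\), and f_star_cov holds the diagonal of \(\Sigma_*\).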

Stan requires us to compile the model once before we can use it. We do that like this:

gp <- stan_model(file = surrogate.model.file)

The predict function takes the compiled Stan model, the data, and the points at which to predict.

predict.gp <- function(gp, x.d, y.d, x.star) {
  dat <- list(
    N = length(x.d),
    x = array(x.d),
    y = array(y.d),
    N_star = length(x.star),
    x_star = array(x.star),
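    # kernel hyperparameters are held fixed here rather than inferred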
    rho = 1,
    alpha = 1,
    sigma = 1
  )

  pred <- rstan::sampling(
    gp, dat,
    chains = 1, algorithm = "Fixed_param", iter = 1, show_messages = FALSE, refresh = 0
  )

  ext <- rstan::extract(pred)
  f.star <- as.vector(ext$f_star)
  f.star.var <- as.vector(ext$f_star_cov)


  list(f.star, f.star.var)
}

Acquisition function

As the acquisition function we use the upper confidence bound (see Snoek, Larochelle, and Adams (2012)), which is computed from the predictive posterior mean \(\mu(\mathbf{x})\) and variance \(\sigma^2(\mathbf{x})\) of the surrogate model:

\[\begin{equation} a(\mathbf{x}) = \mu(\mathbf{x}) + \kappa \sigma(\mathbf{x}) \end{equation}\]

where \(\kappa\) can be tuned to balance between exploitation and exploration: \(\kappa = 0\) corresponds to pure exploitation of the posterior mean, while larger values favor points with high posterior uncertainty.

acquisition.function <- function(gp, x.d, y.d, x.init, conf) {
  preds <- predict.gp(gp, x.d, y.d, x.init)
  f.star <- preds[[1]]
  f.star.var <- preds[[2]]
  ucb <- f.star + conf * sqrt(f.star.var)
  ucb
}

Finally, we define a function that proposes the next point to evaluate:

acquire <- function(x.d, y.d, x.init, conf = 2.0) {
  preds <- predict.gp(gp, x.d, y.d, x.init)
  f.star <- preds[[1]]
  f.star.var <- preds[[2]]
  ucb <- acquisition.function(gp, x.d, y.d, x.init, conf)
  x.next <- x.init[which.max(ucb)]

  list(
    x.next = x.next,
    ucb = ucb,
    f.star = f.star,
    f.star.var = f.star.var
  )
}

Use case

We start with a random point on the interval defined above and query it against the function that we want to optimize.

set.seed(23)

x.d <- runif(1, -5, 5)
y.d <- f(x.d)

Then we train the surrogate model and use the acquisition function to propose a new point.

iter <- acquire(x.d, y.d, x.init)
x.n <- iter$x.next
ucb <- iter$ucb
f.star <- iter$f.star
f.star.var <- iter$f.star.var

For the acquisition of the first point, we first fit the GP to the data (consisting of a single observation). The posterior mean is shown in red and the posterior variance in light red. We then compute the acquisition function on the set of \(m\) points in the interval \(x \in [-5, 5]\) and take its maximizer. The entire acquisition function is shown as a black line; its maximizer is the black point.

Note how the acquisition function has a dent around the data point \((x_1, y_1)\): the posterior variance (and in this case also the mean) is small there, which reduces the value of the acquisition function.

scatterplot(
  x.init, y.init, f.star, f.star.var,
  data = data.frame(x = x.d, y = y.d)
) +
  geom_line(
    data = data.frame(x = x.init, y = ucb),
    aes(x, y, color = "black")
  ) +
  geom_point(
    data = data.frame(x = x.n, y = max(ucb)),
    aes(x, y)
  )

We then evaluate the point and add it to the data.

y.n <- f(x.n)
x.d <- c(x.d, x.n)
y.d <- c(y.d, y.n)

We can repeat this process of querying points and then acquiring new ones as often as we want, until we think we have a good estimate of the maximum of \(f\) (since we don’t know its functional form, we cannot verify this exactly).

iter <- acquire(x.d, y.d, x.init)

x.n <- iter$x.next
ucb <- iter$ucb
f.star <- iter$f.star
f.star.var <- iter$f.star.var

The data now consists of two observations, hence the acquisition function has two dents here.

scatterplot(
  x.init, y.init, f.star, f.star.var,
  data = data.frame(x = x.d, y = y.d)
) +
  geom_line(
    data = data.frame(x = x.init, y = ucb),
    aes(x, y, color = "black")
  ) +
  geom_point(
    data = data.frame(x = x.n, y = max(ucb)),
    aes(x, y)
  )

Let’s repeat this procedure a couple of times, i.e., propose a point, evaluate it, and repeat.

for (i in seq(20)) {
  y.n <- f(x.n)
  x.d <- c(x.d, x.n)
  y.d <- c(y.d, y.n)
  iter <- acquire(x.d, y.d, x.init)
  x.n <- iter$x.next
}

For the last iteration, we extract the acquisition function as well as the posterior mean and variance.

ucb <- iter$ucb
f.star <- iter$f.star
f.star.var <- iter$f.star.var

Then we plot the results of the last iteration.

scatterplot(
  x.init, y.init, f.star, f.star.var,
  data = data.frame(x = x.d, y = y.d)
) +
  geom_line(
    data = data.frame(x = x.init, y = ucb),
    aes(x, y, color = "black")
  ) +
  geom_point(
    data = data.frame(x = x.n, y = max(ucb)),
    aes(x, y)
  )

This worked nicely. While some of the points are spread across the entire domain \(\mathcal{X}\), an increasing number concentrates around the function’s maximum.
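A simple point estimate of the maximizer can, for instance, be read off from the data evaluated so far:

# the best observation so far serves as an estimate of the maximizer
x.best <- x.d[which.max(y.d)]
y.best <- max(y.d)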

Session info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_CH.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_CH.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_CH.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.21.2         StanHeaders_2.21.0-5 rdist_0.0.5         
[4] pracma_2.3.3         cowplot_1.0.0        ggplot2_3.3.2       

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5         pillar_1.4.6       compiler_4.0.2     prettyunits_1.1.1 
 [5] tools_4.0.2        pkgbuild_1.1.0     digest_0.6.25      jsonlite_1.7.0    
 [9] evaluate_0.14      lifecycle_0.2.0    tibble_3.0.3       gtable_0.3.0      
[13] pkgconfig_2.0.3    rlang_0.4.7        cli_2.0.2          parallel_4.0.2    
[17] curl_4.3           yaml_2.2.1         xfun_0.16          loo_2.3.1         
[21] gridExtra_2.3      withr_2.2.0        dplyr_1.0.1        stringr_1.4.0     
[25] knitr_1.29         generics_0.0.2     vctrs_0.3.2        stats4_4.0.2      
[29] grid_4.0.2         tidyselect_1.1.0   inline_0.3.15      glue_1.4.1        
[33] R6_2.4.1           processx_3.4.3     fansi_0.4.1        rmarkdown_2.6     
[37] farver_2.0.3       callr_3.4.3        purrr_0.3.4        magrittr_1.5      
[41] codetools_0.2-16   matrixStats_0.56.0 ps_1.3.4           scales_1.1.1      
[45] ellipsis_0.3.1     htmltools_0.5.0    assertthat_0.2.1   colorspace_1.4-1  
[49] labeling_0.3       V8_3.2.0           stringi_1.4.6      RcppParallel_5.0.2
[53] munsell_0.5.0      crayon_1.3.4      

References

Frazier, Peter I. 2018. “A Tutorial on Bayesian Optimization.” http://arxiv.org/abs/1807.02811.

Shahriari, Bobak, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. 2015. “Taking the Human Out of the Loop: A Review of Bayesian Optimization.” Proceedings of the IEEE 104 (1): 148–75. https://doi.org/10.1109/JPROC.2015.2494218.

Snoek, Jasper, Hugo Larochelle, and Ryan P Adams. 2012. “Practical Bayesian Optimization of Machine Learning Algorithms.” In Advances in Neural Information Processing Systems, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Vol. 25. https://papers.nips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html.