Simulation-based calibration (SBC; Talts et al. (2020)) is a method for visually validating Bayesian inferences. SBC is useful for detection of either misspecified models, inaccurate computation or bugs in the implementation of a probabilistic program. So SBC is not a validation for the inference itself, but the technical aspect of the program.

SBC

Consider a generative model:

\[\begin{align} \tilde{\theta} & \sim P(\theta) \\ \tilde{y} & \sim P(y \mid \tilde{\theta}) \\ \{ \theta_1, \dots, \theta_L \} & \sim P(\theta \mid \tilde{y}) \end{align}\]

The rank of a prior sample \(\tilde{\theta}\) in comparison to an exact posterior sample \(\{ \theta_1, \dots, \theta_L \}\):

\[\begin{align} r(\{ \theta_1, \dots, \theta_L \}, \tilde{\theta}) = \sum_l^L \mathbb{I}(\theta_l < \tilde{\theta}) + 1 \end{align}\]

is a discrete-uniform random variable in \([1, L + 1 ]\) (see the original paper for a proof). Thus we can use this as a testing procedure if our inferences work.

We follow Algorithm 2 from the original paper and implement SBC in Stan.

suppressMessages({
  library(tidyverse)
  library(ggthemes)
  library(colorspace)

  library(rstan)
})

set.seed(23)
options(mc.cores = parallel::detectCores())

Implementation

The idea behind SBC is fairly simple. For \(N\) iterations we run a while loop to create a posterior sample that has at least a target effective sample size \(n_{eff}\) (which is set by us). We do so by resampling and thinning until the posterior sample for every iteration reaches the target \(n_{eff}\) (within the while loop). Then we compute the rank for every parameter using the sum as defined above. That is it. If the inference worked, the ranks are uniformly distributed.

sbc <- function(model, data) {
  ranks <- matrix(0, N, 2, dimnames = list(NULL, c("mu", "sigma")))
  for (n in seq(N))
  {
    thin <- init_thin
    while (thin < max_thin) {
      fit <- suppressWarnings(
        sampling(model,
          data = data,
          chains = 1, iter = 2 * thin * L,
          thin = thin, control = list(adapt_delta = 0.99), refresh = 0
        )
      )
      n_eff <- summary(fit)$summary["lp__", "n_eff"]
      if (n_eff >= target_neff) break
      thin <- 2 * thin
    }
    ranks[n, ] <- apply(rstan::extract(fit)$idsim, 2, sum) + 1
  }
  ranks
}

Use cases

We start with a simple example where we set two parameters, generate data from them and then compare the posterior to these parameters. The comparison is done in the generated quantities block.

model.file <- "_models/sbc_1.stan"
cat(readLines(model.file), sep = "\n")
data {
    int<lower=1> n;
    vector[n] x;
}

transformed data {
    real beta_sim = normal_rng(0, 1);
    real<lower=0> sigma_sim = lognormal_rng(0, 1);
    
    vector[n] y_sim;
    for (i in 1:n)
        y_sim[i] = normal_rng(x[i] * beta_sim, sigma_sim);
}

parameters {
    real beta;
    real<lower = 0> sigma;
}

model {
    beta ~ normal(0, 1);
    sigma ~ lognormal(0, 1);
    y_sim ~ normal(x * beta, sigma);
}

generated quantities {
    int idsim[2] = { beta < beta_sim, sigma < sigma_sim };
}

We run the loop \(5000\) times and sample \(100\) times. We also set some other parameters that Stan or SBC needs, such as init_thin which specifies the period for saving samples or target_neff which is the effective sample size we want to have.

N <- 5000
L <- 100
init_thin <- 1
max_thin <- 64
target_neff <- .8 * L

We also define a histogram plotting method for the ranks.

plot.fit <- function(fit) {
  as.data.frame(fit) %>%
    tidyr::gather("param", "value") %>%
    ggplot() +
    geom_histogram(aes(value), bins = 30) +
    facet_grid(. ~ param) +
    theme_tufte() +
    theme(
        axis.text = element_text(colour = "black", size = 15),
        strip.text = element_text(colour = "black", size = 15)
    ) +
    xlab(NULL) +
    ylab(NULL)
}

Then we run SBC and plot the ranks. We create some artifical data for the model first.

n <- 10
x <- rnorm(n)

fit <- sbc(model, list(x = x, n = n))
plot.fit(fit)

Given the low number of trials, the histogram looks fairly uniform (as expected since the model was specified correctly). Next we test some pathological cases to see if the ranks change with mis-specified models.

model.file <- "_models/sbc_2.stan"
cat(readLines(model.file), sep = "\n")
data {
    int<lower=1> n;
    vector[n] x;
}

transformed data {
    real beta_sim = normal_rng(0, 1);
    real<lower=0> sigma_sim = lognormal_rng(0, 1);
    
    vector[n] y_sim;
    for (i in 1:n)
        y_sim[i] = normal_rng(x[i] * beta_sim, sigma_sim);
}

parameters {
    real beta;
    real<lower = 0> sigma;
}

model {
    beta ~ normal(0, 10);
    sigma ~ lognormal(0, 5);
    y_sim ~ student_t(10, x * beta, sigma);
}

generated quantities {
    int idsim[2] = { beta < beta_sim, sigma < sigma_sim };
}

And another pathological example:

model.file <- "_models/sbc_3.stan"
cat(readLines(model.file), sep = "\n")
data {
    int<lower=1> n;
    vector[n] x;
}

transformed data {
    real beta_sim = normal_rng(0, 11);
    real<lower=0> sigma_sim = lognormal_rng(0, 10);
    
    vector[n] y_sim;
    for (i in 1:n)
        y_sim[i] = student_t_rng(10, beta_sim, sigma_sim);
}

parameters {
    real beta;
    real<lower = 0> sigma;
}

model {
    beta ~ normal(0, 1);
    sigma ~ lognormal(0, 1);
    y_sim ~ normal(x * beta, sigma);
}

generated quantities {
    int idsim[2] = { beta < beta_sim, sigma < sigma_sim };
}

Session info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_CH.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_CH.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_CH.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rstan_2.21.2         StanHeaders_2.21.0-5 colorspace_1.4-1    
 [4] ggthemes_4.2.4       forcats_0.5.0        stringr_1.4.0       
 [7] dplyr_1.0.1          purrr_0.3.4          readr_1.3.1         
[10] tidyr_1.1.1          tibble_3.0.3         ggplot2_3.3.2       
[13] tidyverse_1.3.0     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5         lubridate_1.7.9    prettyunits_1.1.1  ps_1.3.4          
 [5] assertthat_0.2.1   digest_0.6.25      V8_3.2.0           R6_2.4.1          
 [9] cellranger_1.1.0   backports_1.1.8    reprex_0.3.0       stats4_4.0.2      
[13] evaluate_0.14      httr_1.4.2         pillar_1.4.6       rlang_0.4.7       
[17] curl_4.3           readxl_1.3.1       rstudioapi_0.11    callr_3.4.3       
[21] blob_1.2.1         rmarkdown_2.6      labeling_0.3       loo_2.3.1         
[25] munsell_0.5.0      broom_0.7.0        compiler_4.0.2     modelr_0.1.8      
[29] xfun_0.16          pkgconfig_2.0.3    pkgbuild_1.1.0     htmltools_0.5.0   
[33] tidyselect_1.1.0   gridExtra_2.3      codetools_0.2-16   matrixStats_0.56.0
[37] fansi_0.4.1        crayon_1.3.4       dbplyr_1.4.4       withr_2.2.0       
[41] grid_4.0.2         jsonlite_1.7.0     gtable_0.3.0       lifecycle_0.2.0   
[45] DBI_1.1.0          magrittr_1.5       scales_1.1.1       RcppParallel_5.0.2
[49] cli_2.0.2          stringi_1.4.6      farver_2.0.3       fs_1.5.0          
[53] xml2_1.3.2         ellipsis_0.3.1     generics_0.0.2     vctrs_0.3.2       
[57] tools_4.0.2        glue_1.4.1         hms_0.5.3          parallel_4.0.2    
[61] processx_3.4.3     yaml_2.2.1         inline_0.3.15      rvest_0.3.6       
[65] knitr_1.29         haven_2.3.1       

References

Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2020. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration.” http://arxiv.org/abs/1804.06788.