While reading some papers on copulas, I stumbled upon the paper by Wilson and Ghahramani (2010), in which they introduce copula processes. The paper shows how copulas can be used to encode correlation between marginally non-Gaussian random variables using a Gaussian process. Conceptually, this seemed to me to be the same as a pushforward from a GP measure to another measure. After reading the paper, I was a bit puzzled about what advantage a copula process offers over modelling the random variates conditional on a latent Gaussian process, which ultimately led to the creation of this notebook.

Hence, in this notebook we try to reproduce the Gaussian Copula Process Volatility model that Wilson and Ghahramani (2010) introduced in their paper and compare it to a model that uses a latent GP for parameterization directly. Since working with correlated latent variables can easily frustrate our analysis, we use the Hilbert-space approximation to the GP which was recently introduced in Solin and Särkkä (2020) and Riutort-Mayol et al. (2020).

We begin by briefly reviewing copulas and then implement the two models in the probabilistic programming language Stan. Feedback and comments are welcome!

suppressMessages({
  library(tidyverse)
  library(ggthemes)
  library(colorspace)
  
  library(cmdstanr)
  library(posterior)
  library(bayesplot)
})

set.seed(42)
color_scheme_set("darkgray")

Copula processes

Following the introduction by Wilson and Ghahramani (2010), consider a multivariate random variable \(Y = (Y_1, \dots, Y_Q)\) with cumulative distribution function (cdf) \(F(y_1, \dots, y_Q)\) and marginal cdfs \(F_1, \dots, F_Q\). According to Sklar’s theorem, there exists a Q-copula \(C: [0, 1]^Q \rightarrow [0, 1]\) such that

\[\begin{equation} F(y_1, \dots, y_Q) = C(F_1(y_1), \dots, F_Q(y_Q)) \end{equation}\]

Hence, with continuous marginals, \(C\) is given via the \(Q\) probability integral transforms \(u_i = F_i(y_i)\), i.e.,

\[\begin{equation} C(u_1, \dots, u_Q) = P(U_1 \le u_1, \dots, U_Q \le u_Q) \end{equation}\]

where \(U_i \sim \text{Uniform}(0, 1)\) for all \(i = 1, \dots, Q\). For instance, we can construct a Gaussian copula from a set of Gaussian random variables \(Y\) via

\[\begin{align} C(u_1, \dots, u_Q) &= \phi(\Phi^{-1} \circ u_1, \dots, \Phi^{-1} \circ u_Q) \\ &= \phi(\Phi^{-1} \circ F_1(y_1), \dots, \Phi^{-1} \circ F_Q(y_Q)) \end{align}\]

where \(\phi\) is the cdf of a multivariate Gaussian (with appropriate mean and covariance) and \(\Phi^{-1}\) is the quantile function of a (univariate) standard normal.
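As a quick numerical sketch of this construction (my own illustration, not taken from the paper), we can draw correlated Gaussians, map them to uniform marginals via the probability integral transform, and push the uniforms through, say, an exponential quantile function. The marginals are then exponential, but the dependence structure is still the Gaussian copula; since all transforms are monotone, the rank correlation is preserved:

rho.true   <- 0.8
Sigma.true <- matrix(c(1, rho.true, rho.true, 1), 2, 2)

# correlated Gaussians ...
z <- MASS::mvrnorm(10000, mu=c(0, 0), Sigma=Sigma.true)
# ... mapped to uniform marginals via the probability integral transform ...
u <- pnorm(z)
# ... and pushed through an exponential quantile function
x <- qexp(u, rate=1)

# the rank correlations agree (up to sampling noise)
cor(z, method="spearman")[1, 2]
cor(x, method="spearman")[1, 2]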

The generalization from a Gaussian copula to a Gaussian copula process is then straightforward by assuming a Gaussian process measure instead of a Gaussian measure. Specifically, if there exists a mapping \(\Psi\), such that

\[\begin{align} \Psi(Y) \sim GP(\cdot, K) \end{align}\]

is a Gaussian process with some covariance function \(K\), then we call

\[\begin{align} Y \sim GCP(\Psi, \cdot, K) \end{align}\]

a Gaussian copula process (for instance, \(\Psi = \Phi^{-1} \circ F\) for some cdf \(F\)). For more details on copulas and copula processes, we refer the reader to Wilson and Ghahramani (2010).
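For a single margin, such a mapping \(\Psi\) is easy to check numerically. As a small sketch (again my own, with an arbitrarily chosen rate), \(\Psi = \Phi^{-1} \circ F\) maps exponential variates to standard normals:

y.exp <- rexp(10000, rate=2)
# Psi = qnorm composed with pexp, where pexp is the cdf F of the exponential
psi <- qnorm(pexp(y.exp, rate=2))
# psi should be approximately standard normal
c(mean(psi), sd(psi))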

Stochastic volatility

Consider the data generating process \(y_t \sim \mathcal{N}(0, \sigma^2_t)\) on a grid of time points \(t \in [1, 3]\), for which we want to accurately estimate the standard deviations \(\sigma_t\) and their correlation structure. We first simulate the data and visualize it.

N <- 500
times  <- seq(1, 3, length.out=N)
sigmas <- sin(times) * cos(times**2) + 1
y      <- rnorm(n=length(sigmas), 0, sigmas)
data.frame(y=y, t=times, sigma=sigmas) %>% 
  tidyr::pivot_longer(cols = c(y, sigma)) %>%
  dplyr::mutate(name = factor(name, levels = c("y", "sigma"))) %>%
  ggplot() +
  geom_line(aes(t, value), color="black") +
  facet_grid(. ~ name) +
  theme_tufte() +
  theme(
    axis.text = element_text(colour = "black", size = 15),
    strip.text = element_text(colour = "black", size = 15)
  ) +
  xlab(NULL) +
  ylab(NULL)

Fitting a copula process

Following Wilson and Ghahramani (2010), we model the standard deviations as a Gaussian copula process. We use the following generative model for the data set:

\[\begin{align} f & \sim GP(0, K)\\ \sigma & = g(f, \omega) \\ y_t & \sim \mathcal{N}(0, \sigma^2_t) \end{align}\]

where \(g\) is a warping function with parameters \(\omega = (\beta, \gamma)\):

\[\begin{align} g(f, \omega) = \sum_{i=1}^{M} \beta_i^2 \log \left(\exp(f + \gamma_i) + 1 \right) \end{align}\]

and \(K = k(t, t')\) is a covariance function defined on the time points. Note that we changed the definition of the warping function a bit from the definition in the paper to avoid potential weak identifiability issues when we don’t have much data. The standard deviations above are indeed GCP distributed, since (following the paper):

\[\begin{align} P(\sigma_1 \le a_1, \dots, \sigma_Q \le a_Q) &= P\left(g^{-1}(\sigma_1) \le g^{-1}(a_1), \dots, g^{-1}(\sigma_Q) \le g^{-1}(a_Q)\right) \\ &= \phi \left( g^{-1}(a_1), \dots, g^{-1}(a_Q) \right) \\ &= \phi \left( \Phi^{-1} \circ F(a_1), \dots, \Phi^{-1} \circ F(a_Q) \right) \\ &= C(u_1, \dots, u_Q) \end{align}\]

which we recognize as the copula defined above.
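To get some intuition for the warping function, below is a quick sketch of \(g\) with two components and arbitrarily chosen (hypothetical) values for \(\beta\) and \(\gamma\). Each term is a scaled softplus, so \(g\) is positive and monotonically increasing in \(f\), which is what the derivation above needs for \(g^{-1}\) to exist:

# direct R transcription of the warping function g
warp.g <- function(f, beta, gamma) {
  rowSums(sapply(seq_along(beta), function(i) {
    beta[i]**2 * log(exp(f + gamma[i]) + 1)
  }))
}

f.grid <- seq(-3, 3, length.out=100)
data.frame(f=f.grid, g=warp.g(f.grid, beta=c(1, 0.5), gamma=c(0, 1))) %>%
  ggplot() +
  geom_line(aes(f, g), color="black") +
  theme_tufte()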

Low-rank approximation

Fitting this model in Stan can be frustratingly slow and inefficient due to the difficulty of sampling a correlated latent variable and the high memory footprint. For that reason, we approximate the latent variable \(f\) using the low-rank representation recently introduced by Solin and Särkkä (2020) and Riutort-Mayol et al. (2020). For the model above, we will use an exponentiated quadratic covariance function

\[\begin{align} k(x, x') = \alpha^2 \exp \left(-0.5 \frac{(x - x')^2}{\rho^2} \right) \end{align}\]

Solin and Särkkä (2020) observed that this kernel (or, more generally, any stationary kernel) can be approximated as

\[\begin{align} k(x, x') \approx \sum_{j=1}^{J} S\left(\sqrt{\lambda_j} \right) \xi_j(x) \xi_j(x') \end{align}\]

where \(S(\cdot)\) is the spectral density of the covariance function, and \(\lambda_j\) and \(\xi_j\) are the \(j\)th eigenvalue and eigenfunction of the Laplace operator on the interval \([-L, L]\). For the exponentiated quadratic covariance the spectral density is

\[ S(\omega) = \alpha^2 \sqrt{2 \pi} \rho \exp \left(-0.5 (\rho \omega)^2 \right) \] the \(j\)th eigenvalue is

\[ \lambda_j = \left(\frac{j\pi}{2L}\right)^2 \] and the \(j\)th eigenfunction is

\[ \xi_j(x) = \frac{1}{\sqrt{L}} \sin\left(\sqrt{\lambda_j}(x + L)\right) \] Riutort-Mayol et al. (2020) use this result to approximate the latent variable \(f\) as

\[\begin{align} f &\approx \sum_{j=1}^{J} \sqrt{ S\left(\sqrt{\lambda_j} \right) } \xi_j(x) \beta_j \\ & = \Xi(x) \left( \sqrt{ S\left(\sqrt{\lambda} \right) } \odot \beta \right) \end{align}\]

where \(\beta\) is a vector of standard normals, \(\Xi\) is the matrix of eigenfunctions \(\xi_j\), and \(\sqrt{ S(\sqrt{\lambda}) }\) is the vector of the square roots of all \(J\) spectral densities.
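To convince ourselves that the approximation is accurate, we can compare the exact exponentiated quadratic kernel to its low-rank approximation on a small grid. This is a sketch with arbitrarily chosen (hypothetical) values for \(\alpha\), \(\rho\), the boundary \(L\) and the number of basis functions \(J\):

alpha <- 1
rho   <- 0.25
L.dom <- 4
J     <- 50

spec.dens <- function(w) alpha**2 * sqrt(2 * pi) * rho * exp(-0.5 * (rho * w)**2)
lambda.j  <- function(j) ((j * pi) / (2 * L.dom))**2
xi.j      <- function(x, j) sin(sqrt(lambda.j(j)) * (x + L.dom)) / sqrt(L.dom)

x.grid <- seq(-1, 1, length.out=100)
# eigenfunction matrix and the spectral densities evaluated at sqrt(lambda_j)
Xi  <- sapply(seq_len(J), function(j) xi.j(x.grid, j))
spd <- spec.dens(sqrt(lambda.j(seq_len(J))))

K.exact  <- alpha**2 * exp(-0.5 * outer(x.grid, x.grid, "-")**2 / rho**2)
K.approx <- Xi %*% diag(spd) %*% t(Xi)

# the largest absolute deviation should be close to zero
max(abs(K.exact - K.approx))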

Fitting the model

We implement the model using Stan and fit the posterior using its dynamic Hamiltonian Monte Carlo sampler. The model code of the copula process using the Hilbert space approximation by Solin and Särkkä (2020) is shown below.

cp.stan.file <- "./_models/cp_volatility_model.stan"
cat(readLines(cp.stan.file), sep="\n")
functions {
    // warping function g(f, omega): a sum of K scaled softplus terms
    vector warp(vector f, real[] beta, real[] gamma, int K) {
        vector[size(f)] y = rep_vector(0, size(f));
        for (i in 1:K) {
            y += (beta[i] ^ 2) * log(exp(f + gamma[i]) + 1.0);
        }
        return y;
    }

    // j-th eigenvalue of the Laplace operator on [-L, L]
    real lambda(real L, int j) {
        return ((j * pi()) / (2 * L)) ^ 2;
    }

    // spectral density of the exponentiated quadratic covariance
    real S(real w, real alpha, real rho) {
        return alpha^2 * sqrt(2 * pi()) * rho * exp(-0.5 * (rho * w)^2);
    }

    // j-th eigenfunction of the Laplace operator on [-L, L]
    vector phi(vector x, real L, int j) {
        return 1 / sqrt(L) * sin(sqrt(lambda(L, j)) * (x + L));
    }
}

data {
    int<lower=1> N;
    vector[N] times;
    vector[N] y;
    int<lower=1> Q;
    real<lower=0> L;
    int<lower=1> K;
}


transformed data {
    matrix[N, Q] Pmat;

    for (j in 1:Q) {
        Pmat[, j] = phi(times, L, j);
    }
}

parameters {
    vector[Q] beta_spd;

    real<lower=0> beta[K];
    real<lower=0> gamma[K];
    real<lower=0> rho;
}


transformed parameters {
    vector[N] f;
    vector<lower=0>[N] sigma;
    {
        vector[Q] spd_diag;
        for(j in 1:Q) {
            // note: alpha is fixed to 1.0 here
            spd_diag[j] = sqrt(S(sqrt(lambda(L, j)), 1.0, rho));
        }
        f = Pmat * (spd_diag .* beta_spd);
    }

    sigma = warp(f, beta, gamma, K);
}

model {
    beta_spd ~ std_normal();
    beta ~ std_normal();
    gamma ~ inv_gamma(5, 5);
    rho ~ inv_gamma(5, 5);

    y ~ normal(0, sigma);
}

Fitting this model using cmdstanr is done like this:

m <- cmdstanr::cmdstan_model(cp.stan.file)

fit <- m$sample(
  data=list(N = length(y),
            times = times,
            y = y,
            Q = 50,
            L = 5 / 2 * max(times),
            K = 1),
  seed=123,
  chains=4,
  parallel_chains=4
)

Thanks to the approximation, fitting the model was fairly fast. The diagnostics also show that the sampler didn’t have any problems and that the chains converged.

fit$cmdstan_diagnose()
Processing csv files: /tmp/RtmpbDK62H/cp_volatility_model-202106202310-1-81f151.csv, /tmp/RtmpbDK62H/cp_volatility_model-202106202310-2-81f151.csv, /tmp/RtmpbDK62H/cp_volatility_model-202106202310-3-81f151.csv, /tmp/RtmpbDK62H/cp_volatility_model-202106202310-4-81f151.csv

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

To assess the quality of the fit, let’s compare the true standard deviations with the posterior estimates. First, we extract the posterior from the fit object:

sigmas.hat <- fit$draws("sigma")
sigmas.hat.mean <- apply(as_draws_matrix(sigmas.hat), 2, mean)
sigmas.hat.ints <- t(apply(as_draws_matrix(sigmas.hat), 2, quantile, c(0.05, 0.95)))

Then we define a function to plot everything:

plot.sigmas <- function(df) { 
    ggplot(df) +
    geom_ribbon(
      aes(x=time, ymin = sigma.hat.lower, ymax = sigma.hat.upper), fill="#A1A6C8"
    ) +
    geom_line(aes(time, sigma, color="black")) +
    geom_line(aes(time, sigma.hat, color="darkblue"))  +
    scale_color_manual(values=c("black", "darkblue"),
                       labels=c("Sigma", expression(paste(widehat("Sigma"))))) +
    theme_tufte() +
    theme(
      axis.text = element_text(colour = "black", size = 15),
      strip.text = element_text(colour = "black", size = 15),
      legend.text = element_text(colour = "black", size = 15),
      legend.title = element_blank()
    ) +
    xlab(NULL) +
    ylab(NULL)
}

The figure shows the true standard deviations in black, the posterior means in blue, and the 90% posterior intervals as a shaded band.

tibble(time=times,
       sigma=sigmas,
       sigma.hat=sigmas.hat.mean,
       sigma.hat.lower=sigmas.hat.ints[, 1],
       sigma.hat.upper=sigmas.hat.ints[, 2]) %>%
  plot.sigmas()

Finally, let’s look at the MSE between the true standard deviations and their estimated posterior means.

mean((sigmas - sigmas.hat.mean)**2)
[1] 0.008700944

Fitting a latent Gaussian process

An alternative way of modelling this data set is to model the standard deviations explicitly, i.e., to endow them with a prior and use a latent GP to model the mean of this prior:

\[\begin{align} f & \sim GP(0, K)\\ \sigma & \sim \text{Normal}^+\left(\exp(f + \gamma), \tau\right) \\ y_t & \sim \mathcal{N}(0, \sigma^2_t) \end{align}\]

where \(\gamma\) is an intercept and the exponential keeps the prior mean positive. To help Stan a bit with sampling from the posterior, we use a non-centered parameterization for \(\sigma\). The model code is shown below:

gp.stan.file <- "./_models/gp_volatility_model.stan"
cat(readLines(gp.stan.file), sep="\n")
functions {
    real lambda(real L, int j) {
        return ((j * pi()) / (2 * L)) ^ 2;
    }

    real S(real w, real alpha, real rho) {
        return alpha^2 * sqrt(2 * pi()) * rho * exp(-0.5 * (rho * w)^2);
    }

    vector phi(vector x, real L, int j) {
        return 1 / sqrt(L) * sin(j  * pi()/(2 * L) * (x + L));
    }
}

data {
    int<lower=1> N;
    vector[N] times;
    vector[N] y;
    int<lower=1> Q;
    real<lower=0> L;
}


transformed data {
    matrix[N, Q] Pmat;

    for (j in 1:Q) {
        Pmat[, j] = phi(times, L, j);
    }
}

parameters {
    vector[Q] beta_spd;
    real<lower=0> alpha;
    real<lower=0> rho;
    real gamma;
    real<lower=0> sigma_sigma;

    vector<lower=0>[N] sigma_tilde;
}


transformed parameters {
    vector[N] f;
    vector<lower=0>[N] sigma;
    {
        vector[Q] spd_diag;
        for(j in 1:Q) {
            spd_diag[j] = sqrt(S(sqrt(lambda(L, j)), alpha, rho));
        }
        f = Pmat * (spd_diag .* beta_spd);
    }
    // non-centered parameterization: the latent GP models the mean,
    // sigma_tilde adds half-normal noise of scale sigma_sigma
    sigma = exp(f + gamma) + sigma_sigma * sigma_tilde;
}


model {
    beta_spd ~ std_normal();
    alpha ~ std_normal();
    rho ~ inv_gamma(5, 5);
    gamma ~ normal(0, 1);

    sigma_sigma  ~ std_normal();
    sigma_tilde  ~ std_normal();

    y ~ normal(0, sigma);
}

We fit the model as above:

m <- cmdstanr::cmdstan_model(gp.stan.file)

fit <- m$sample(
  data=list(N = length(y),
            times = times,
            y = y,
            Q = 50,
            L = 5 / 2 * max(times)),
  seed=123,
  chains =4,
  parallel_chains=4
)

The diagnostics of the fit show that the sampler had few problems; however, we observe a single divergent transition.

fit$cmdstan_diagnose()
Processing csv files: /tmp/RtmpbDK62H/gp_volatility_model-202106202311-1-044a2f.csv, /tmp/RtmpbDK62H/gp_volatility_model-202106202311-2-044a2f.csv, /tmp/RtmpbDK62H/gp_volatility_model-202106202311-3-044a2f.csv, /tmp/RtmpbDK62H/gp_volatility_model-202106202311-4-044a2f.csv

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
1 of 4000 (0.025%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete.

Let’s also have a look at the posteriors again, as above.

sigmas.hat <- fit$draws("sigma")
sigmas.hat.mean <- apply(as_draws_matrix(sigmas.hat), 2, mean)
sigmas.hat.ints <- t(apply(as_draws_matrix(sigmas.hat), 2, quantile, c(0.05, 0.95)))

tibble(time=times,
       sigma=sigmas,
       sigma.hat=sigmas.hat.mean,
       sigma.hat.lower=sigmas.hat.ints[, 1],
       sigma.hat.upper=sigmas.hat.ints[, 2]) %>%
  plot.sigmas()

Finally, let’s again have a look at the MSE between the true standard deviations and their estimated posterior means.

mean((sigmas - sigmas.hat.mean)**2)
[1] 0.02125182

Discussion

Interestingly, the fit using the latent GP is considerably worse than the fit using the copula process: the standard deviations, which fluctuate around 1, are drastically overestimated. This result is all the more surprising since, for the CP model, the warping function was not fine-tuned and the priors were not elicited carefully but chosen semi-arbitrarily. Fitting the GP model, on the other hand, required a bit more fine-tuning; in particular, parameterizing the standard deviations via a latent mean seemed a bit awkward.

Session info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_CH.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_CH.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_CH.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_CH.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bayesplot_1.7.2     posterior_0.1.6     cmdstanr_0.4.0.9000
 [4] colorspace_2.0-1    ggthemes_4.2.4      forcats_0.5.1      
 [7] stringr_1.4.0       dplyr_1.0.1         purrr_0.3.4        
[10] readr_1.3.1         tidyr_1.1.1         tibble_3.1.2       
[13] ggplot2_3.3.4       tidyverse_1.3.0    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           lubridate_1.7.9      ps_1.6.0            
 [4] assertthat_0.2.1     digest_0.6.27        utf8_1.2.1          
 [7] plyr_1.8.6           R6_2.5.0             cellranger_1.1.0    
[10] ggridges_0.5.3       backports_1.2.1      reprex_0.3.0        
[13] evaluate_0.14        httr_1.4.2           pillar_1.6.1        
[16] rlang_0.4.11         readxl_1.3.1         data.table_1.14.0   
[19] rstudioapi_0.13      blob_1.2.1           checkmate_2.0.0     
[22] rmarkdown_2.6        labeling_0.4.2       munsell_0.5.0       
[25] broom_0.7.0          compiler_4.0.2       modelr_0.1.8        
[28] xfun_0.16            pkgconfig_2.0.3      htmltools_0.5.1.1   
[31] tidyselect_1.1.0     tensorA_0.36.2       fansi_0.5.0         
[34] crayon_1.4.1         dbplyr_1.4.4         withr_2.4.2         
[37] grid_4.0.2           distributional_0.2.2 jsonlite_1.7.2      
[40] gtable_0.3.0         lifecycle_1.0.0      DBI_1.1.1           
[43] magrittr_2.0.1       scales_1.1.1         cli_2.5.0           
[46] stringi_1.4.6        farver_2.1.0         fs_1.5.0            
[49] xml2_1.3.2           ellipsis_0.3.2       generics_0.1.0      
[52] vctrs_0.3.8          tools_4.0.2          glue_1.4.2          
[55] hms_1.1.0            processx_3.5.2       abind_1.4-5         
[58] yaml_2.2.1           rvest_1.0.0          knitr_1.29          
[61] haven_2.3.1         

References

Riutort-Mayol, Gabriel, Paul-Christian Bürkner, Michael R Andersen, Arno Solin, and Aki Vehtari. 2020. “Practical Hilbert Space Approximate Bayesian Gaussian Processes for Probabilistic Programming.” arXiv Preprint arXiv:2004.11408.

Solin, Arno, and Simo Särkkä. 2020. “Hilbert Space Methods for Reduced-Rank Gaussian Process Regression.” Statistics and Computing 30 (2): 419–46.

Wilson, Andrew G, and Zoubin Ghahramani. 2010. “Copula Processes.” In Advances in Neural Information Processing Systems, edited by J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta. Vol. 23.