
Some notes on MRFs and CRFs. The notes are collected from these references:

  • Koller and Friedman (2009)
  • Murphy (2012)
  • Sutton and McCallum (2012)
  • Wainwright and Jordan (2008)
  • Prince (2012)
  • Barber (2012)
  • Rue and Held (2005)
  • Wasserman (2013)

I make no guarantees about the correctness or completeness of this document.

Markov random fields

Markov random fields represent the joint probability distribution of a multivariate random variable \(Y\) as an undirected graph \(G = (\mathcal{V}, \mathcal{E})\), where every node \(i\) represents a component \(Y_i\). The JPD of \(Y\) factorizes over the (maximal) cliques of the graph as:

\[ P(Y) \propto \prod_C^{\mathcal{C}} \psi_C(\mathbf{y}_{C}), \] where \(\mathcal{C}\) is the set of (maximal) cliques of the graph. The \(\psi_C\) are called potentials and \(Y_{C}\) are the random variables defined by clique \(C\). A clique \(C\) is a fully connected subset of the vertex set \(\mathcal{V}\), i.e. \((i, j) \in \mathcal{E}\) for all \(i, j \in C\).

For instance, in the graph above we would have one maximal clique \(C = \{X, Y, Z\}\) and three cliques \(C_1 = \{X, Y\}\), \(C_2 = \{X, Z\}\) and \(C_3 = \{Y, Z\}\) which are each defined over the edges of the graph.

A potential is an affinity function \(\psi_C: ( \otimes_{i \in C} Y_{i}) \rightarrow \mathbb{R}_+\). Notice that we factorize over potentials rather than local probability distributions as in BNs, because we cannot simply apply the chain rule. In most textbooks the set \(\mathcal{C}\) of cliques is the set of all maximal cliques, but we can also include smaller cliques, because we can always redefine the potential of a maximal clique as the product of the potentials of the smaller cliques it contains.
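
As a small illustration of the factorization (my own example, with arbitrary potential values), we can compute the joint of the triangle graph above by brute force, multiplying the edge potentials over all configurations and normalizing:

# tiny pairwise MRF on the triangle graph X - Y - Z with binary states {-1, 1};
# the potential values are arbitrary and only serve as an illustration
psi <- function(a, b) exp(0.5 * a * b)  # one potential per edge/clique {a, b}

states <- expand.grid(x = c(-1, 1), y = c(-1, 1), z = c(-1, 1))

# unnormalized joint: product of the potentials of the cliques {X,Y}, {X,Z}, {Y,Z}
unnorm <- with(states, psi(x, y) * psi(x, z) * psi(y, z))

# the partition function is the sum over all configurations
Z <- sum(unnorm)
joint <- unnorm / Z
cbind(states, joint)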

Just as in directed graphical models, undirected graphical models (UGMs) encode conditional independence (CI) relations. CI relations are often more naturally expressed in UGMs, at the cost of interpretability of the parameters, and parameter estimation is in general computationally harder than in DAG models. Note that for general undirected graphs the potentials do not relate to conditional or marginal distributions as in Bayesian networks. For more details I recommend Wainwright and Jordan (2008) and Barber (2012) as (great) resources.

Conditional independence relationships in an MRF are defined as:

  • pairwise Markov property: \(Y_i \ \bot \ Y_j \mid {Y}_{-i,-j} \Leftrightarrow e_{ij} = 0\),
  • local Markov property: \(Y_i \ \bot \ Y \setminus \{\text{MB}(i) \cup Y_i\} \mid \text{MB}(i)\),
  • global Markov property: \(Y_{A} \ \bot \ Y_{B} \mid Y_{C}\), where \(A\), \(B\) and \(C\) are sets of nodes (and not necessarily cliques) such that \(C\) separates \(A\) from \(B\) in the graph.

The Hammersley-Clifford theorem formalizes this:

Theorem: A positive distribution \(P(Y) > 0\) satisfies the CI properties of an undirected graph \(G\) iff \(P\) can be represented as a product of factors, one per maximal clique.

What a fantastic result! The proof can for instance be found in Koller and Friedman (2009).

The same theorem implies that any probability measure that satisfies these Markov properties also defines a Gibbs distribution, i.e.:

\[\begin{align} P(Y ; \boldsymbol \theta) & \propto \exp \left(-\sum_{C} E(\mathbf{y}_{C} ; \boldsymbol \theta_{C} ) \right), \end{align} \] where \(E(\mathbf{y}_{C} ; \boldsymbol \theta_{C})\) is the energy of clique \(C\), so that every clique potential \(\psi_C(\mathbf{y}_C) = \exp(-E(\mathbf{y}_{C} ; \boldsymbol \theta_{C}))\) is strictly positive.

For specific parametrizations (and when we don't have higher-order interactions), we are free to factorize over the edges of the graph instead of the maximal cliques (a pairwise MRF), which will be treated later. We also switch interchangeably between the two definitions and parametrizations of the models, i.e. we always use some \(\boldsymbol \theta\) to denote parameters.

Log-linear models

For discrete variables we can define the potentials using probability tables. A more general approach is to define the potentials as log-linear functions: \[ \log \psi_C(\mathbf{y}_{C} ; \boldsymbol \theta_{C}) = f(\mathbf{y}_{C})^T \boldsymbol \theta_{C} \] Models parametrized like this are called log-linear models.
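
To make this concrete, here is a small sketch (with made-up features and parameters) of a log-linear pairwise potential for binary states, using indicator features for agreement and disagreement:

# log-linear pairwise potential for binary states y_i, y_j in {-1, 1}:
# features are indicators for "agree" and "disagree", theta are their weights
f <- function(yi, yj) c(agree = as.numeric(yi == yj),
                        disagree = as.numeric(yi != yj))
theta <- c(agree = 1.0, disagree = -0.5)

log.psi <- function(yi, yj) sum(f(yi, yj) * theta)
psi <- function(yi, yj) exp(log.psi(yi, yj))

# tabulate the resulting 2x2 potential
outer(c(-1, 1), c(-1, 1), Vectorize(psi))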

Random field priors

In some scenarios we are interested in using a Markov random field as a suitable prior \(P(Y)\) over a parameter of some observation model \(P(X \mid Y)\). Our goal is then to infer the posterior \(P(Y \mid X)\), or at least the mode of the posterior.

For the rest of this chapter we will use pairwise Markov random fields, i.e. cliques \(C\) consist of edges only. Furthermore we will assume binary labels \(Y_i \in \{-1, 1\}\). The energy \(E\) of a clique \(C\) is defined as \(\theta_{ij}(y_i, y_j)\), where \(i\) and \(j\) denote the nodes in the graph. The energy between two adjacent nodes is usually defined to be symmetric: \[ E(\mathbf{y}_C; \theta_C) = \theta_{ij}(y_i, y_j) = \theta_{ji}(y_j, y_i) > 0. \] For discrete \(Y\) the observation model usually has the form of a mixture: \[ P(X \mid Y) = \sum_i \pi_i f_i(X \mid Y, \theta_i). \] For instance, if we are using binary data: \[ \begin{align} P(X \mid Y = -1) &= \text{Bernoulli}(\theta_0),\\ P(X \mid Y = 1) &= \text{Bernoulli}(\theta_1). \end{align} \] In the case of continuous pixel data, a normal distribution would be a better choice.

Having specified the observation model, we maximize the posterior like this: \[ \begin{align} \hat{Y} & = \arg\max_{Y} P(Y \mid X) \\ &= \arg\max_{Y} P(X \mid Y) P(Y) \\ & = \arg\max_{Y} \sum_i^N \log P(X_i \mid Y_i) + \sum_C \log P(Y_C) \\ & = \arg\min_{Y} \left[ - \sum_i^N \log P(X_i \mid Y_i) + \sum_{(i, j) \in \mathcal{E}} \theta_{ij}(y_i, y_j) \right]. \end{align} \] Note that we can omit the partition function here, because it does not influence the maximization. Our pairwise MRF thus consists of two terms: one for the local evidence, i.e. the data, and one for the smoothing between the nodes.

In the special case of MRFs with submodular costs for the different combinations of adjacent labels, we can do MAP inference efficiently using graph cuts (Prince 2012). Pairwise potentials are submodular iff \[ \theta_{-1,1} + \theta_{1, -1} - \theta_{1,1} -\theta_{-1,-1} \ge 0. \] Otherwise the problem is in general NP-hard.
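
The condition can be checked directly for a given table of pairwise costs; the values below are made up:

# check submodularity of pairwise costs theta(y_i, y_j) for labels {-1, 1};
# rows/columns are ordered as (-1, 1), the values are made up for illustration
theta <- matrix(c(0.0, 1.0,
                  1.0, 0.0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("-1", "1"), c("-1", "1")))

is.submodular <- function(theta) {
  theta["-1", "1"] + theta["1", "-1"] - theta["1", "1"] - theta["-1", "-1"] >= 0
}
is.submodular(theta)  # TRUE: this Ising-type smoothing cost is submodular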

The Ising model

The Ising model is an example of a pairwise Markov random field from statistical physics. In its original form it was used to model the spin behaviour of atoms, i.e. to represent states \(Y_i \in \{ -1, 1\}, i \in 1, \dots, n\) in a grid. The Ising model defines the pairwise potentials as: \[ \psi_{ij} = \begin{pmatrix} \exp(w_{ij})& \exp(-w_{ij}) \\ \exp(-w_{ij}) & \exp(w_{ij}) \end{pmatrix}, \] where we often take all edges to have the same weight. If we take all edge weights to be positive, \(w_{ij} > 0\), neighboring random variables tend to have the same value; the potentials are then submodular and MAP inference can be done in polynomial time using graph cuts (as we have seen above). If we instead encourage adjacent nodes to have different values, i.e. \(w_{ij} < 0\), inference is in general NP-hard. See Murphy (2012) for this result.

In a denoising task we can combine the Ising prior \(P(Y)\) with an additional data term \(P(X \mid Y)\) representing some local evidence, such that \[\begin{align} P(Y \mid X) & \propto P(X \mid Y) P(Y)\\ & \propto \left[ \prod_i^N P(X_i \mid Y_i) \right] \left[ \prod_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j) \right]. \end{align} \] If we want to factor in local evidence, we need to specify a suitable observation model; as discussed, this can for instance be a normal model. Posterior inference in such a model can for instance be done using Gibbs sampling, where the conditional posterior is \[ P(Y_i \mid Y_{-i}, X) \propto P(X_i \mid Y_i) \prod_{j \in \text{MB}(i)} \psi_{ij}(y_i, y_j). \] The derivation is relatively straightforward: we only need to keep the potentials defined by the Markov blanket of node \(Y_i\), which in the undirected setting is just its set of neighbors, together with the node's own evidence term. I'll quickly demonstrate how the posterior of an Ising model can be inferred using Gibbs sampling. The demo is largely based on Murphy (2012).

set.seed(23)

# we assume `x_train` holds the MNIST training images (28 x 28 pixels),
# e.g. as returned by keras::dataset_mnist()
im <- t(apply(x_train[1,,], 2, rev))
im[im > 0] <- 1
im[im == 0] <- -1

# binarize to {-1, 1} and corrupt the image with noise
m <- mean(im)
im2 <- (im > m) + -1 * (im < m)
noisy.im <- im2 + matrix(runif(28 * 28), 28, 28)

sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

N <- nrow(noisy.im)
M <- ncol(noisy.im)
J <- .01

# local evidence P(X_i | Y_i = -1) and P(X_i | Y_i = 1) under a normal observation model
neg.norm <- dnorm(as.vector(noisy.im), -1, 1)
pos.norm <- dnorm(as.vector(noisy.im), 1, 1)
local.evidence <- matrix(c(neg.norm, pos.norm), ncol = 2)
most.likely <- apply(local.evidence, 1, which.max)

# initialize every label with its pointwise most likely state
X <- matrix(1, N, M)
X[most.likely == 1] <- -1

set.seed(23)
for (i in seq(100))
{
  for (ix in seq(N))
  {
    for (iy in seq(M))
    {
      # column-major index of pixel (ix, iy) and of its four grid neighbors
      pos <- ix + N * (iy - 1)
      neighbors <- pos + c(-1, 1, -N, N)
      if (ix == 1) neighbors <- neighbors[!neighbors %in% (pos - 1)]
      if (ix == N) neighbors <- neighbors[!neighbors %in% (pos + 1)]
      if (iy == 1) neighbors <- neighbors[!neighbors %in% (pos - N)]
      if (iy == M) neighbors <- neighbors[!neighbors %in% (pos + N)]
      # full conditional P(Y_i = 1 | Y_-i, X): Ising edge term plus log-odds of the local evidence
      edge.potential <- 2 * J * sum(X[neighbors])
      node.potential <- log(local.evidence[pos, 2]) - log(local.evidence[pos, 1])
      p <- sigmoid(edge.potential + node.potential)
      # Gibbs step: sample the new state from its full conditional
      X[pos] <- ifelse(runif(1) < p, 1, -1)
    }
  }
}
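
To inspect the result, we can for instance plot the noisy image next to the denoised labels with base R graphics:

# compare the noisy input with the Gibbs-denoised reconstruction
par(mfrow = c(1, 2))
image(noisy.im, main = "noisy image", axes = FALSE)
image(X, main = "denoised labels", axes = FALSE)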

The Potts model

The Potts model generalizes the Ising model to the multinoulli case, i.e. to discrete states \(Y_i \in \{1, \dots, K\}\), with the potential function (here for \(K = 3\)): \[ \psi_{ij} = \begin{pmatrix} \exp(w_{ij}) & 1 & 1 \\ 1 & \exp(w_{ij}) & 1 \\ 1 & 1 & \exp(w_{ij}) \end{pmatrix}. \] For edge weights \(w_{ij} >0\) the Potts model, just as the Ising model, encourages neighboring random variables to have the same value. The Potts model is often used for image segmentation, where the random field serves as a prior \(P(Y)\) on the true image labels and the image itself again has an observation model \(P(X \mid Y)\). Thus the posterior is:

\[\begin{align} P(Y \mid X) & \propto P(X \mid Y) P(Y) \\ & \propto \left[ \prod_i P(X_i \mid Y_i) \right] \left[ \prod_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j) \right], \end{align} \] where \(P(X_i = x \mid Y_i = k)\) is the probability of observing the pixel value \(x\) given that its true label is \(k\).
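
To make the potential concrete, here is a small helper (my own sketch; the weight is arbitrary) that constructs the \(K \times K\) Potts potential:

# Potts pairwise potential for K discrete states: exp(w) if the states agree, 1 otherwise
potts.potential <- function(K, w) {
  psi <- matrix(1, K, K)
  diag(psi) <- exp(w)
  psi
}
potts.potential(K = 3, w = 0.5)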

Gaussian graphical models

In Gaussian graphical models (Koller and Friedman 2009; Rue and Held 2005; Murphy 2012) the underlying joint distribution is a multivariate Gaussian:

\[\begin{align} P(Y = \mathbf{y}) & \propto \left[ \prod_{(i, j) \in \mathcal{E}} \psi_{ij}(y_i, y_j) \right] \left[ \prod_i \psi_i(y_i) \right]\\ & = \exp \left( -\frac{1}{2} \mathbf{y}^T \boldsymbol \Lambda \mathbf{y} + (\boldsymbol \Lambda \boldsymbol \mu)^T \mathbf{y} \right), \end{align} \] where \(\boldsymbol \Lambda\) is the precision matrix and \(\boldsymbol \mu\) the mean vector of the Gaussian. The undirected graph encoding the conditional independence relations has a natural interpretation in the Gaussian case: an edge \(e_{ij}\) is present exactly when the precision entry \(\Lambda_{ij}\), and hence the partial correlation between \(Y_i\) and \(Y_j\), is non-zero. That means: \[ Y_i \ \bot \ Y_j \ \mid Y_{-ij} \Leftrightarrow \Lambda_{ij} = 0 \]

Rue and Held (2005) introduce GGMs via an autoregressive process: \[\begin{align} y_t = \phi y_{t-1} + \epsilon_t, \\ |\phi| < 1,\\ \epsilon_t \sim \mathcal{N}(0, 1). \end{align} \] GGMs are fairly commonly used in computational biology. Let's try to estimate the structure and parameters, i.e. the partial correlations, of a multivariate Gaussian using CVXR. The estimation is adapted from its documentation.
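
The notes assume some data set \(\boldsymbol X\); as a stand-in, we can simulate one from a sparse (tridiagonal) precision matrix corresponding to the AR(1)-type process above (the dimensions and parameter values below are arbitrary):

library(MASS)

set.seed(23)

n <- 10       # number of variables, i.e. the dimension of Lambda
n.obs <- 200  # number of observations

# tridiagonal precision matrix of an AR(1)-type process with phi = 0.5
phi <- 0.5
Lambda.true <- diag(1 + phi^2, n)
Lambda.true[cbind(1:(n - 1), 2:n)] <- -phi
Lambda.true[cbind(2:n, 1:(n - 1))] <- -phi

# sample observations from the corresponding Gaussian
X <- mvrnorm(n.obs, mu = rep(0, n), Sigma = solve(Lambda.true))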

We first compute the empirical covariance matrix \(\boldsymbol S\) of some data set \(\boldsymbol X\).

S <- cov(X)

Then we set up the semidefinite program, the graphical lasso (J. Friedman, Hastie, and Tibshirani 2008), in order to estimate the precision matrix \(\boldsymbol \Lambda\). The graphical lasso maximizes the following \(\ell_1\)-regularized log-likelihood: \[ \log \det \boldsymbol \Lambda - \text{tr}(\boldsymbol S \boldsymbol \Lambda) - \alpha || \boldsymbol \Lambda ||_1 \]

In CVXR we formalize the equivalent constrained formulation, where the \(\ell_1\) penalty enters as a constraint on \(\lVert \boldsymbol \Lambda \rVert_1\) (for a suitable \(\alpha\) the two formulations coincide):

library(CVXR)

# estimate the precision matrix by maximizing the penalized log-likelihood,
# with the l1 penalty imposed as a constraint on the entries of Lambda
alpha   <- 4
Lambda  <- Semidef(n)
obj     <- Maximize(log_det(Lambda) - matrix_trace(Lambda %*% S))
constraints <- list(sum(abs(Lambda)) <= alpha)

prob   <- Problem(obj, constraints)
result <- solve(prob)

Lambda.hat <- result$getValue(Lambda)

Let’s see how the two matrices look.
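
With the simulated data from above, we can for instance compare the estimate against the precision matrix that generated the data:

# heatmaps of the generating and the estimated precision matrix
par(mfrow = c(1, 2))
image(Lambda.true, main = "true precision", axes = FALSE)
image(Lambda.hat,  main = "estimated precision", axes = FALSE)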

Conditional random fields

A conditional random field (Koller and Friedman 2009; Sutton and McCallum 2012) is a conditional distribution \(P(Y \mid X)\) over a graph with nodes \(Y \cup X\). That means we want to avoid putting a probability measure on \(X\) (since \(X\) is always observed anyway). Consequently we disallow potentials that only contain random variables from \(X\).

\[\begin{align} P(Y \mid X) & = \frac{1}{P(X)} P(Y, X) \\ & \propto \frac{1}{Z(X)} \prod_C^{|\mathcal{C}|} \psi_C(Y_C, X_C) \end{align} \]

where \(\mathcal{C}\) is the set of factors. Note that the partition function now depends on \(X\), in contrast to the definition of the MRF at the beginning, and may therefore be easier to compute, since we only need to sum over the configurations of \(Y\) for the observed \(X\).
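
As a small illustration (my own sketch, with made-up potentials and parameters), the conditional distribution of a short linear-chain CRF with binary labels can be computed by brute force; note that \(Z(X)\) is just a sum over the label configurations given the observed inputs:

# tiny linear-chain CRF with binary labels y_t in {-1, 1} and observed inputs x_t;
# a pairwise potential couples neighboring labels, a node potential couples label and input
x <- c(0.3, -1.2, 0.8)   # observed inputs (made up)
theta.pair <- 0.8        # weight on label agreement
theta.node <- 1.5        # weight on label-input agreement

log.potential <- function(y, x) {
  sum(theta.pair * y[-1] * y[-length(y)]) + sum(theta.node * y * x)
}

# enumerate all label configurations and normalize by Z(x), which depends on the observed x
configs <- as.matrix(expand.grid(rep(list(c(-1, 1)), length(x))))
log.unnorm <- apply(configs, 1, log.potential, x = x)
posterior <- exp(log.unnorm) / sum(exp(log.unnorm))
cbind(configs, posterior)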

References

Barber, David. 2012. Bayesian Reasoning and Machine Learning.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2008. “Sparse Inverse Covariance Estimation with the Graphical Lasso.” Biostatistics 9 (3). Oxford University Press: 432–41.

Koller, Daphne, and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques.

Murphy, Kevin. 2012. Machine Learning: A Probabilistic Perspective.

Prince, Simon. 2012. Computer Vision: Models, Learning, and Inference.

Rue, Havard, and Leonhard Held. 2005. Gaussian Markov Random Fields: Theory and Applications.

Sutton, Charles, and Andrew McCallum. 2012. “An Introduction to Conditional Random Fields.” Foundations and Trends in Machine Learning.

Wainwright, Martin J, and Michael I Jordan. 2008. “Graphical Models, Exponential Families, and Variational Inference.” Foundations and Trends in Machine Learning.

Wasserman, Larry. 2013. All of Statistics: A Concise Course in Statistical Inference.