This document briefly explains the EM algorithm. More specifically, I hope to shed light on the following:

  • how we infer the parameters of a probability distribution,
  • how we infer the parameters for a mixture of probability distributions,
  • why we want to use the EM algorithm for the latter scenario,
  • how we use the EM algorithm,
  • how we derive the E and M steps.

The first section briefly explains statistical inference for a data set of Poisson random variables. The second section explains how we can do inference in a scenario where we have a mixture of Poisson distributions. The last section explains the derivation of the EM algorithm as an iterative procedure that maximizes a lower bound of the observed data likelihood function.

At the end of the document you will find an explanation of the notation we use.

I make no guarantees about the correctness or completeness of this document.

An introduction to inference

When analysing a data set, we are often interested in finding a suitable data generating process and the parameters that govern this process.

For instance, consider a data set \(D\) like this:

## [1] 10  8 12 15 13 13

Let’s visualize this using a histogram:

  df <- data.frame(X=D)
  ggplot(df, aes(x = X)) +
    geom_histogram(position = "identity",  fill="darkgrey", binwidth=.5) +
    scale_y_discrete("Frequency")

We are apparently observing discrete data that has a tail on the right. For a start of our analysis we can, for instance, assume a Poisson distribution, which would mean: \[\begin{align} X \sim \text{Pois}(\lambda), \end{align}\] i.e. \(X\) is a random variable with \(X \in \mathbb{N}\) and \[\begin{align} P(X=x) = \frac{\lambda^x}{x!}\exp(-\lambda) \end{align}\]
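As a quick sanity check of this probability mass function (a small sketch, not part of the analysis itself), we can compare a hand-written implementation of the formula against R's built-in `dpois`:

```r
# hand-written Poisson pmf, following the formula above
pois.pmf <- function(x, lambda) lambda^x / factorial(x) * exp(-lambda)

# compare with R's built-in implementation at one point
pois.pmf(3, 10)
dpois(3, 10)
```

Both calls give the same probability, which confirms that the formula above is indeed the Poisson pmf.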

NOTE: assuming a Poisson distribution is our choice. Other choices are often also fine. We are trying to find a suitable model that describes our data well. While sometimes we are aware of the data generating process, or at least we think we are, in some cases we can only take a guess in the beginning, then fit a model, and afterwards check if our assumptions were justified.

After defining the data generating process (i.e. the Poisson distribution) we are interested in finding the best parameter \(\lambda\) for it. Generally, there can be multiple definitions of best; however, we are usually talking about the estimator that maximizes the probability of the data, the maximum likelihood estimate. That means we want to find: \[\begin{align} \hat{\lambda} = \arg \max_{\lambda} P(D ; \lambda) \end{align}\]

So we try to find the parameter with which the probability of the data is maximal. We put a hat on the \(\lambda\) to signal that it is an estimate and not the true parameter (which we don’t know and will probably never know; in fact, we don’t even know if our data is Poisson. Actually we know nothing.).

The probability of the data given the parameter is called likelihood, which is defined as: \[\begin{align} L({\theta}) &= P(D ; \theta) = \prod_i^N P(X_i ; \theta) \\ \ell({\theta}) &= \log P(D ; \theta) = \log \prod_i^N P(X_i ; \theta) = \sum_i^N \log P(X_i ; \theta) \end{align}\]

Here we replaced \(\lambda\) with \(\theta\). This is just convention and makes it easier for us and others to read. For the rest of this document, we will use \(\theta\) to describe the parameters of interest.

So the likelihood is a function of the parameters, not the data, even though the notation might suggest otherwise. For the purpose of maximum likelihood estimation, the data is regarded as a constant.

NOTE: the likelihood function is not a density, i.e. it generally does not sum to one.

In some cases, we can find an analytical solution for \(\hat{\theta}\), i.e. we can compute \(\frac{\partial \ell}{\partial \theta}\), set it to \(0\) and solve for \(\theta\). However, often we cannot do that and need to find the best estimate numerically.

Let us first visualize what we try to find, the maximum of the likelihood:
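One way to do this (a sketch; the grid and its range are arbitrary choices, and the data here are the values of \(D\) shown above) is to evaluate the log likelihood over a grid of candidate parameters and inspect where it peaks:

```r
# the values of the data set D shown above
D <- c(10, 8, 12, 15, 13, 13)

# evaluate the Poisson log likelihood on a grid of candidate parameters
grid <- seq(1, 30, by=0.1)
ll   <- sapply(grid, function(theta) sum(dpois(D, theta, log=TRUE)))

# the maximum lies at the grid point closest to the sample mean
grid[which.max(ll)]
mean(D)
```

Plotting `ll` against `grid` gives exactly the concave likelihood curve we are after.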

We will try to find this numerically by first defining the log likelihood function:

  poisson.log.likelihood <- function(theta, D)
    sum(log(dpois(x=D, theta)))

Then we use a numerical optimizer that comes with R to find the maximum:

  (theta.num <- optimize(poisson.log.likelihood,
                         interval=c(0, 1000),
                         D=D, maximum=TRUE)$maximum)
## [1] 9.998085
This matches the analytical solution, which we get by taking the derivative of the log likelihood, setting it to zero and solving for \(\theta\): \[\begin{align} \frac{\partial\ell}{\partial \theta} &= \frac{\partial \sum_i^N \log \frac{\theta^{X_i}}{X_i!}\exp(-\theta)}{\partial \theta} \\ &= \frac{\partial \sum_i^N \left( -\theta + X_i \log (\theta) - \log(X_i!) \right)}{\partial \theta} \\ &= -N + \frac{\sum_i^N X_i}{\theta} = 0 \\ \hat{\theta} &= \frac{\sum_i^N X_i}{N}, \end{align}\]

which is

  (theta.ana <- sum(D)/length(D))
## [1] 9.9981

So the maximum likelihood estimate, i.e. the parameter that describes our data best, is roughly \(\hat{\theta} = 10\). The likelihood function is concave in this scenario, which makes it generally easy to optimize.

To sum this up, here we were interested in

  • proposing a suitable model for a data set,
  • estimating the parameters that best describe that data by computing a maximum likelihood estimate.

Inference with hidden data

Now, consider a scenario like this, where we have another data set \(D\):

## [1] 13 11  8 14 14 12

As before, we are observing count data. Let’s visualize this again:

  df <- data.frame(X=D)
  ggplot(df, aes(x = X)) +
    geom_histogram(position = "identity",  fill="darkgrey", binwidth=.5) +
    scale_y_discrete("Frequency")

Here it is not so clear anymore which distribution we should use. However, we notice that the data is discrete and that there is a peak at \(3\) and another at around \(10\).

A natural choice for something like this is a mixture model, i.e. a distribution that consists of multiple components. For these data we see two peaks, so we might assume a mixture of two Poisson distributions. Again, this choice is up to the modeller and, for that matter, partially subjective. For data like the above, other mixtures might fit equally well.

If we assume a mixture of Poisson distributions we could indeed fit the data quite well:

A mixture of two Poisson distributions is formulated like this: \[\begin{align} X_0 &\sim \text{Pois}(\theta_0) = P_0\\ X_1 &\sim \text{Pois}(\theta_1) = P_1\\ X &= (1 -\Delta) \cdot X_0 + \Delta \cdot X_1\\ P(X ; \theta, \phi) &= \phi_0 P_0(X ; \theta_0) + \phi_1 P_1(X ; \theta_1), \end{align}\]

where \(\Delta \in \{0, 1 \}\). As before, \(X_0\) denotes data generated by a Poisson distribution \(P_0\) with parameter \(\theta_0\). \(X_1\) are points coming from another Poisson distribution \(P_1\) with parameter \(\theta_1\). The actual observed data \(X\) is either generated by \(P_0\), in which case it takes the value \(X_0\) with probability \(P(\Delta = 0) = \phi_0\), or by \(P_1\), in which case it takes the value \(X_1\) with probability \(P(\Delta = 1) = \phi_1\).

In R data from a Poisson mixture could be generated by:

  phis    <- c(0.3, 0.7)
  thetas  <- c(3, 10)
  N      <- 1000
  X0 <- rpois(N * phis[1], thetas[1])
  X1 <- rpois(N * phis[2], thetas[2])
  X  <- c(X0, X1)

So \(X\) is a mixture of variables \(X_0\) and \(X_1\), as you can see by the colour of the bars above.

As above, the goal is to maximize the (observed data) likelihood, i.e. to estimate the best parameters: \[\begin{align} L(\theta) &= \prod_i^N \left[ \phi_0 P_0(X_i ; \theta_0) + \phi_1 P_1(X_i ; \theta_1) \right] \\ \ell(\theta) &= \sum_i^N \log \left[ \phi_0 P_0(X_i ; \theta_0) + \phi_1 P_1(X_i ; \theta_1) \right] \end{align}\]
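The observed data log likelihood of the two-component mixture can be written down directly in R (a sketch; the data values and parameters here are hypothetical):

```r
# observed data log likelihood of a two-component Poisson mixture
mixture.log.lik <- function(D, thetas, phis) {
  sum(log(phis[1] * dpois(D, thetas[1]) + phis[2] * dpois(D, thetas[2])))
}

# hypothetical data and parameter values
D <- c(3, 2, 4, 10, 9, 11)
mixture.log.lik(D, thetas=c(3, 10), phis=c(0.5, 0.5))
```

Note the sum inside the logarithm: this is exactly what makes the function hard to maximize analytically.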

In the example above, we estimated the optimal \(\theta\) (or \(\lambda\)) numerically (and computed the solution analytically). We could also do this here, however, generally we need to include many constraints in the optimization, such as:

  • probabilities and mixing weights, such as \(\phi_0\) and \(\phi_1\), need to sum to one,
  • covariance matrices (e.g. in Gaussian mixtures) need to be positive definite.

Maximizing the above likelihood is also more complex than for a single Poisson, because of the sum within the logarithm. So, while an optimizer, such as gradient ascent, could do the job, it is in general not straightforward to implement.

The EM algorithm

It turns out that the problem can be easily solved using the Expectation Maximization algorithm. The EM algorithm is an iterative, alternating procedure of two steps.

  • E-step: compute a soft assignment \(r_{ik}\) of every data point \(X_i\) to each component \(P_k\)
  • M-step: use the soft assignments to re-estimate the parameters by maximizing the expected log likelihood

A soft assignment \(r_{ik}\) of a datum \(X_i\) to a component of a mixture model \(P_k\) is the probability of \(X_i\) being generated by \(P_k\). We also call this the responsibility of model \(P_k\) for observation \(X_i\).
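For a single observation, such a soft assignment can be computed in a few lines (a sketch with hypothetical parameter values):

```r
# hypothetical parameters of a two-component Poisson mixture
phis   <- c(0.3, 0.7)
thetas <- c(3, 10)

# soft assignment of the single observation x = 4 to both components
x  <- 4
re <- phis * dpois(x, thetas)
(r <- re / sum(re))
```

The responsibilities form a probability distribution over the components: they sum to one, and since \(x = 4\) is much closer to the first component's mean, most of the responsibility falls on \(P_0\).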

These steps are repeated until the likelihood converges, i.e. until we don’t see an improvement of the likelihood any more. More specifically, this means that the likelihood has approximately the same value at iteration \(t\) as at iteration \(t-1\). In that case we assume we have reached a maximum of the likelihood and estimated the optimal parameter values.
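In code, this convergence criterion could be sketched as follows (the dummy log likelihood sequence and the tolerance are made up and only stand in for the values a real EM run would produce):

```r
# stop when the log likelihood improves by less than a tolerance
converged <- function(lls, tol) {
  t <- length(lls)
  t > 1 && abs(lls[t] - lls[t - 1]) < tol
}

# dummy log likelihood values as an EM run might produce them
log.liks <- c(-500, -320, -290, -285.1, -285.002, -285.001, -285.001)
tol      <- 1e-2

converged(log.liks[1:3], tol)  # still improving
converged(log.liks, tol)       # improvement below tolerance
```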

Consider a set of latent random variables \(Z_i\) taking values \(0\) or \(1\) (just like \(\Delta\) above). We add these random variables to our original data set, such that our new data set consists of pairs \(\{(X_i, Z_i)\}\). If \(Z_i = 0\) then \(X_i\) comes from \(P_0\), otherwise from \(P_1\). We can represent this graphically as:


The process of introducing a new variable to our data is called data augmentation. The optimization of the observed data likelihood is, as discussed, difficult, but by enlarging our data with latent variables it can become easier. Here the latent variables represent the membership of a datum \(X_i\) to a component of the mixture.

From the graphical representation we can factorize \(P(X, Z)\) as: \[\begin{align} P(X, Z) = P(Z) P(X | Z), \end{align}\]

since the distribution of \(X\) depends on the value of \(Z\), while \(Z\) has no parents in the graph. The arrow means that the value of \(Z\) influences the value of \(X\).
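The factorization also tells us how to simulate from the model: first draw \(Z\) from \(P(Z)\), then draw \(X\) from \(P(X \mid Z)\) (ancestral sampling). A sketch with hypothetical parameter values:

```r
set.seed(42)

# hypothetical mixing weights and component parameters
phis   <- c(0.3, 0.7)
thetas <- c(3, 10)
N      <- 1000

# first sample the latent memberships Z from P(Z)
Z <- rbinom(N, size=1, prob=phis[2])

# then sample X conditional on Z from P(X | Z)
X <- rpois(N, thetas[Z + 1])

# about 70% of the samples stem from the second component
mean(Z)
```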

Complete data log likelihood

The log likelihood for this model is:

\[\begin{align} \ell(\theta, \phi) &= \log P(X , Z ; \theta, \phi) \\ &=\sum_i^N \log \left[ \left(\phi_0 P_0(X_i ; \theta_0)\right)^{1-Z_i} \cdot \left( \phi_1 P_1(X_i ; \theta_1)\right)^{Z_i} \right] \\ &\stackrel{P(X \mid Z)P(Z)}{=} \sum_i^N \log \left[ \underbrace{ P_0(X_i ; \theta_0)^{1-Z_i} \cdot P_1(X_i; \theta_1)^{Z_i}}_{P(X_i \mid Z_i ; \theta)} \right] + \log \left[ \underbrace{ \phi_0^{1-Z_i} \cdot \phi_1^{Z_i} }_{P(Z_i ; \phi)} \right] \\ &= \sum_i^N (1-Z_i) \log P_0(X_i ; \theta_0) + Z_i \log P_1(X_i ; \theta_1) + \sum_i^N (1-Z_i) \log \phi_0 + Z_i \log \phi_1\\ &= \sum_i^N \sum_{k \in \{0, 1 \} } \mathbb{I}(Z_i = k) \log P_k(X_i ;\theta_k) + \sum_i^N \sum_{k \in \{0, 1 \} } \mathbb{I}(Z_i = k) \log \phi_k\\ \end{align}\] However, we cannot maximize this either, because \(Z\) is not observed. But we can maximize the expected value of the likelihood (with respect to \(P(Z \mid X; \theta, \phi)\)): \[\begin{align} \hat{\theta}, \hat{\phi} &= \arg \max_{\theta, \phi} \mathbb{E}_{Z \mid X; \theta, \phi}[\ell(\theta, \phi)] \\ \end{align}\]

The expected log likelihood has the following form. It looks a little scary at first, but the expectations of the \(Z_i\) are the only expectations we have to compute; the terms \(\log P_k(X_i ; \theta_k)\) are constants with respect to the expectation.

\[\begin{align} \mathbb{E}_{Z \mid X; \theta, \phi}[\ell(\theta, \phi)] &= \mathbb{E} \left[ \sum_i^N (1-Z_i) \log P_0(X_i ; \theta_0) + Z_i \log P_1(X_i ; \theta_1) + \sum_i^N (1-Z_i) \log \phi_0 + Z_i \log \phi_1 \right] \\ &= \mathbb{E}\left[ \sum_i^N (1-Z_i) \log P_0(X_i ; \theta_0) + Z_i \log P_1(X_i ; \theta_1)\right] + \mathbb{E}\left[ \sum_i^N (1-Z_i) \log \phi_0 + Z_i \log \phi_1 \right] \\ &= \sum_i^N \mathbb{E}[1 - Z_i] \log P_0(X_i ; \theta_0) + \mathbb{E}[Z_i] \log P_1(X_i ; \theta_1) + \sum_i^N \mathbb{E}[1 - Z_i] \log \phi_0 + \mathbb{E}[Z_i] \log \phi_1 \\ &= \sum_i^N \sum_{k \in \{0, 1 \} } \mathbb{E}[\mathbb{I}(Z_i = k)] \log P_k(X_i ;\theta_k) + \sum_i^N \sum_{k \in \{0, 1 \} } \mathbb{E}[\mathbb{I}(Z_i = k)] \log \phi_k\\ &= \sum_i^N \sum_{k \in \{0, 1 \} } r_{ik} \log P_k(X_i ; \theta_k) + \sum_i^N \sum_{k \in \{0, 1 \} } r_{ik} \log \phi_k, \end{align}\]

where \(K=2\) since we assume two components. For brevity, we sometimes wrote \(\mathbb{E}\) when we meant \(\mathbb{E}_{Z \mid X; \theta, \phi}\).


In order to estimate \(\theta\) and \(\phi\) here, we first need to compute the expected values \(\mathbb{E}[\mathbb{I}(Z_i = k)]\) (the \(r_{ik}\)) with respect to \(P(Z \mid X; \theta, \phi)\). This is quite easy: \[\begin{align} r_{ik} &=\mathbb{E}_{Z \mid X; \theta, \phi}[\mathbb{I}(Z_i = k)] \\ &= \sum_{z} \mathbb{I}(z = k) P(Z_i = z \mid X_i; \theta, \phi) \\ &= P(Z_i =k \mid X_i; \theta, \phi) \\ &\stackrel{\text{Bayes' theorem}}{=} \frac{ P(Z_i = k ; \phi) P(X_i \mid Z_i=k; \theta)}{P(X_i ; \theta, \phi)} \\ &= \frac{\phi_k P_k(X_i ; \theta_k) }{\sum_{k'} \phi_{k'} P_{k'}(X_i ; \theta_{k'})} \\ \end{align}\]
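The key step above, that the expectation of an indicator equals the probability of the event, can be checked by simulation (a sketch with an arbitrary Bernoulli variable):

```r
set.seed(1)

# the expectation of an indicator equals the probability of the event:
# E[I(Z = 1)] = P(Z = 1)
p <- 0.25
Z <- rbinom(100000, size=1, prob=p)

# Monte Carlo estimate of E[I(Z = 1)], which should be close to p
mean(Z == 1)
```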

To compute the expectation above, however, we already need values for the parameters. So in order to estimate \(\mathbb{E}_{Z \mid X; \theta, \phi}[\ell(\theta, \phi)]\) we need \(\mathbb{E}[\mathbb{I}(Z_i = k)]\), and in order to estimate \(\mathbb{E}[\mathbb{I}(Z_i = k)]\) we need \(\theta\) and \(\phi\).

Auxiliary function

To solve this problem, let us define: \[\begin{align} Q(\theta^t, \phi^t , \theta^{t-1}, \phi^{t-1}) &= \mathbb{E}_{Z \mid X; \theta^{t-1}, \phi^{t-1}}[\ell(\theta^t, \phi^t)] \\ &= \sum_i^N \sum_{k \in \{0, 1 \} } r_{ik} \log P_k(X_i ; \theta_k^t) + \sum_i^N \sum_{k \in \{0, 1 \} } r_{ik} \log \phi_k^t \\ &= \sum_i^N \sum_{k \in \{0, 1 \} } \frac{\phi_k^{t-1} P_k(X_i ; \theta_k^{t-1}) }{\sum_{k'} \phi_{k'}^{t-1} P_{k'}(X_i ; \theta_{k'}^{t-1})} \log P_k(X_i ; \theta_k^{t}) + \sum_i^N \sum_{k \in \{0, 1 \} } \frac{\phi_k^{t-1} P_k(X_i ; \theta_k^{t-1}) }{\sum_{k'} \phi_{k'}^{t-1} P_{k'}(X_i ; \theta_{k'}^{t-1})} \log \phi_k^t \end{align}\]

We first compute the responsibilities \(r_{ik}\) (E-step) and then use these to maximize \(\mathbb{E}[\ell(\theta^t, \phi^t)]\) (M-step). At the beginning we start with random but reasonable parameter values for \(\theta^{0}\) and \(\phi^{0}\) and then iterate over these two steps until convergence. Since the \(\phi\) are probabilities, reasonable values lie in the interval \([0, 1]\) and the two parameters need to sum to one. For \(\theta\) we know that they need to be positive. In order for the EM to work, though, we need to pick two different values for \(\theta^0\), otherwise the two components would be identical.


Let’s finish doing the math. For the M-step, we need to maximize \(Q(\theta^t, \phi^t , \theta^{t-1}, \phi^{t-1})\) with respect to \(\theta\) and \(\phi\). Setting the partial derivative with respect to \(\theta_k\) to zero gives the maximum likelihood estimate \(\hat{\theta}_k\): \[\begin{align} \frac{\partial Q}{\partial \theta_k} &= \frac{\partial \sum_i^N r_{ik} \log P_k(X_i ; \theta_k)}{\partial \theta_k}\\ &=\frac{\partial \sum_i^N r_{ik} \log \frac{\theta_k^{X_i}}{X_i !} \exp(-\theta_k)}{\partial \theta_k} \\ &= \frac{\partial \sum_i^N r_{ik} \left( -\theta_k + X_i \log(\theta_k) -\log(X_i!) \right) }{\partial \theta_k} \\ &= \sum_i^N r_{ik} \left(-1 + \frac{X_i}{\theta_k} \right) \\ &= - \sum_i^N r_{ik} + \frac{1}{\theta_k} \sum_i^N r_{ik} X_i = 0 \\ \hat{\theta}_k &= \frac{\sum_i^N r_{ik} X_i}{\sum_i^N r_{ik}} \end{align}\] For the maximum likelihood estimate \(\hat{\phi}_k\) we need to consider that the \(\phi_k\) have to sum to one as they are mixing weights. To enforce this we add a Lagrange multiplier to the auxiliary function like this: \[\begin{align} Q(\theta^t, \phi^t , \theta^{t-1}, \phi^{t-1}) &= \mathbb{E}_{Z \mid X; \theta, \phi}[\ell(\theta^t, \phi^t)] + \color{red}{\lambda \left(\sum_k \phi_k - 1\right)}. \end{align}\] Since this constraint vanishes when taking the partial derivative of \(Q\) with respect to \(\theta\), it suffices to add it for the estimation of \(\hat{\phi}\) only: \[\begin{align} \frac{\partial Q}{\partial \phi_k} &= \frac{\partial \sum_i^N r_{ik} \log \phi_k + \lambda (\sum_k \phi_k - 1)}{\partial \phi_k} = \frac{\sum_i^N r_{ik}}{\phi_k} + \lambda = 0\\ \hat{\phi}_k &= \frac{\sum_i^N r_{ik} }{-\lambda} \\ \end{align}\] For this we first need to find \(\hat{\lambda}\): \[\begin{align} \frac{\partial Q}{\partial \lambda} &= \frac{\partial \sum_i^N r_{ik} \log \phi_k + \lambda (\sum_k \phi_k - 1)}{\partial \lambda} = \sum_k \phi_k - 1 = 0 \\ &\stackrel{\text{Use }\hat{\phi}_k \text{ from above}}{\Rightarrow} \frac{\sum_i^N \sum_k r_{ik} }{-\lambda} - 1 = 0 \\ \hat{\lambda} &= -N, \\ \end{align}\] where we used that \(\sum_k r_{ik} = 1\) for every \(i\). Now we can plug \(\hat{\lambda}\) into the computation of \(\hat{\phi}_k\): \[\begin{align} \hat{\phi}_k &\stackrel{\text{Use }\hat{\lambda} \text{ from above}}{=} \frac{\sum_i^N r_{ik} }{N} \end{align}\]

Computational inference

Now let’s estimate these parameters computationally from the data that we have:

  K <- 2
  (phis <- rep(1/K, K))
## [1] 0.5 0.5
  (thetas <- sample(1:20, 2))
## [1]  2 11
  for (i in seq(10)) {
    # E-step
    re   <- sapply(seq(K), function(k) phis[k] * dpois(D, thetas[k]))
    r_ik <- re / apply(re, 1, sum)
    Nk   <- apply(r_ik, 2, sum)

    # M-step
    thetas <- sapply(seq(K), function(k) {
      sum(r_ik[,k] * D) / sum(r_ik[,k])
    })
    phis  <- Nk / length(D)
  }

  cat(paste0("Mixing weights: ", paste(phis, collapse=", "), "\n"))  
## Mixing weights: 0.495259207484427, 0.504740792515573
  cat(paste0("Thetas: ", paste(thetas, collapse=", "), "\n"))  
## Thetas: 2.94541992427469, 9.99692067179055

With these few lines we were able to estimate the two mixing weights and the two parameters of the Poisson distributions.

As in the introductory example, we show that we can also use a numerical solver for the M-step to find the optimal \(\hat{\theta}\) instead of finding the new parameters analytically. For that we first define the objective function we want to optimize. This is the first sum of the auxiliary function \(Q\). Here, we don’t show the optimization for \(\hat{\phi}\), because we would need to add a non-trivial constraint for it to sum to \(1\).

  m.step.thetas <- function(r_ik, D, K) {
    # first sum of the auxiliary function that is optimized over
    Q <- function(thetas) {
      # sbplx is minimizing a function
      # In order to maximize Q we multiply it by -1
      (-1) * sum(sapply(seq(K), function(k) {
        r_ik[, k] * log(dpois(D, thetas[k]))
      }))
    }

    # Use a numerical optimizer to solve for the thetas;
    # start away from zero so the objective is finite
    pa <- nloptr::sbplx(c(1, 1), Q, lower=c(0, 0))
    pa$par
  }

We use the SUBPLEX algorithm for finding the maximum. Since SUBPLEX by default computes the minimum of a function, we multiply our objective function by \(-1\). SUBPLEX can make use of box constraints, e.g. so that \(\hat{\theta}\) has to be larger than zero.

Then we apply the same procedure as before but replace the computation of \(\hat{\theta}\) with a numerical procedure:

  K <- 2
  phis <- rep(1/K, K)
  thetas <- sample(1:20, 2)

  for (i in seq(10)) {
    # E-step
    re   <- sapply(seq(K), function(k) phis[k] * dpois(D, thetas[k]))
    r_ik <- re / apply(re, 1, sum)
    Nk   <- apply(r_ik, 2, sum)

    # M-step
    # Compute thetas using a numerical solver
    thetas <- m.step.thetas(r_ik, D, K)
    phis   <- Nk / length(D)
  }

  cat(paste0("Mixing weights with numerical solver: ", paste(phis, collapse=", "), "\n"))  
## Mixing weights with numerical solver: 0.492761289023676, 0.507238710976324
  cat(paste0("Thetas with numerical solver: ", paste(thetas, collapse=", "), "\n"))  
## Thetas with numerical solver: 2.93162208795547, 9.97559857368469

The results are nearly the same. For demonstration we replaced the maximization of the auxiliary function \(Q\) with respect to \(\theta\) by a numerical solver. In general finding the solutions analytically should always be preferred, because it is easier and usually faster.

Let’s try to match our prediction to the data. For that, we just show the original (empirical) data and compare it to sampled random variables using the estimated parameters:

  x <- 10000

  # sample data according to mixing weights and estimated thetas
  pois0 <- rpois(x * phis[1], thetas[1])
  pois1 <- rpois(x * phis[2], thetas[2])

  # create a data set for plotting
  df <- data.frame(
    X=c(D, pois0, pois1),
    S=c(rep("Empirical", length(D)),
        rep("Estimated mixture", length(pois0) + length(pois1))),
    C=c(rep("Empirical", length(D)),
        rep("Component 1", length(pois0)),
        rep("Component 2", length(pois1))))

  # colours for the two components
  cols <- c("#65ADC2", "darkgrey")

  # create two plots
  g1 <-  ggplot(df, aes(x = X, fill = S)) +
    geom_histogram(position = "identity", alpha = 0.5, binwidth=.5) +
    scale_y_discrete("Frequency") +
    scale_fill_manual(values=c("black", "#65ADC2")) +
    theme_minimal()
  g2 <-  ggplot(subset(df, S != "Empirical"), aes(x = X, fill = C)) +
    geom_histogram(position = "stack", binwidth=.5, alpha = 0.75) +
    scale_y_discrete("Frequency") +
    scale_fill_manual(values=cols) +
    theme_minimal()

  cowplot::plot_grid(g1, g2, ncol=2, align="vh")

The EM in general

Our goal in the model above was to compute \[\begin{align} \hat{\theta}, \hat{\phi} =\arg \max_{\theta, \phi} P(X ; \theta, \phi), \end{align}\] i.e. to maximize the observed data likelihood. As discussed, this is not an easy task. What is easy, however, is to compute a lower bound of \(P(X ; \theta, \phi)\): \[\begin{align} \ell( \theta, \phi; X) &= \log P(X ; \theta, \phi) \\ &= \log \sum_Z P(X, Z ;\theta, \phi) \\ &= \log \sum_Z q(Z) \frac{P(X, Z ; \theta, \phi)}{ q(Z) } \\ &= \log \mathbb{E}_{q(Z)} \left[ \frac{P(X, Z ; \theta, \phi)}{ q(Z) } \right] \\ &\stackrel{\text{Jensen's inequality}}{\ge} \mathbb{E}_{q(Z)} \left[ \log \frac{P(X, Z ; \theta, \phi)}{ q(Z) }\right] \\ &= \mathbb{E}_{q(Z)} \left[ \log P(X, Z ; \theta, \phi) \right] - \mathbb{E}_{q(Z)} \left[ \log q(Z) \right] \\ &= \mathbb{E}_{q(Z)} \left[ \ell ( \theta, \phi) \right] - \mathbb{E}_{q(Z)} \left[ \log q(Z) \right]. \end{align}\]

(We shall later denote the lower bound in the last line \(\text{LB}(\theta, \phi, q)\).) The inequality is of course only true because the \(\log\) is a concave function (Jensen’s inequality). We can define \(q(Z)\) essentially how we want, but a natural choice is \(q(Z) = P(Z \mid X ; \theta, \phi)\), since we can easily compute these values and it turns out to be the tightest lower bound.
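Jensen's inequality for the concave \(\log\), i.e. \(\log \mathbb{E}[Y] \ge \mathbb{E}[\log Y]\), can also be checked numerically (a sketch with an arbitrary positive random variable):

```r
set.seed(23)

# an arbitrary positive random variable Y
Y <- rexp(100000, rate=1)

# log of the expectation vs expectation of the log:
# by Jensen's inequality, log E[Y] >= E[log Y]
log(mean(Y))
mean(log(Y))
```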

Thus, we arrive at the same optimization problem as above, which is the (iterative) maximization of the expected hidden log likelihood with respect to \(q(Z) = P(Z \mid X;\theta, \phi)\) , i.e. our auxiliary function \(Q\): \[\begin{align} \hat{\theta}, \hat{\phi} &= \arg \max_{\theta, \phi} Q(\theta, \phi) \\ &= \arg \max_{\theta, \phi} \mathbb{E}_{Z \mid X; \theta, \phi}[\ell(\theta, \phi)] \\ \end{align}\]

Increase of likelihood

Why, however, is the EM guaranteed to increase the likelihood in every iteration (or stay constant)? Let’s start from the beginning. We learned that we are maximizing a lower bound of the observed data log likelihood. Above we argued that we set \(q(Z) = P(Z \mid X ; \theta, \phi)\), because this is the tightest lower bound. Let’s derive this:

\[\begin{align} \ell( \theta, \phi; X) &\stackrel{\text{From above}}{=} \log \sum_Z q(Z) \frac{P(X, Z ; \theta, \phi)}{ q(Z) }, \\ &\stackrel{\text{Jensen's inequality}}{\ge} \sum_Z q(Z) \log \frac{P(X, Z ; \theta, \phi)}{ q(Z) }. \end{align}\] We denote this bound as: \[\begin{align} \text{LB}(\theta, \phi, q)&= \sum_Z q(Z) \log \frac{P(X, Z ; \theta, \phi)}{ q(Z) }, \\ &= \sum_Z q(Z) \log \frac{P(X ; \theta, \phi) P(Z \mid X ; \theta, \phi)}{ q(Z) }, \\ &= \sum_Z q(Z) \log P(X ; \theta, \phi) + \sum_Z q(Z) \log \frac{ P(Z \mid X ; \theta, \phi)}{ q(Z) }, \\ &= \underbrace{\log P(X ; \theta, \phi)}_{\color{red}{\text{observed data log likelihood}}} - \mathbb{K}\mathbb{L}(q(Z) \mid\mid P(Z \mid X ; \theta, \phi)).\\ \log P(X ; \theta, \phi) &= \text{LB}(\theta, \phi, q) + \mathbb{K}\mathbb{L}(q(Z) \mid\mid P(Z \mid X ; \theta, \phi)), \end{align}\]

where \(\mathbb{K}\mathbb{L}\) is the Kullback-Leibler divergence which is always \(\ge 0\). So we see that our lower bound \(\text{LB}(\theta, \phi, q)\) and the observed data log likelihood, which we are trying to optimize, have a gap of the magnitude of \(\mathbb{K}\mathbb{L}\).
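The non-negativity of the Kullback-Leibler divergence is easy to check for discrete distributions (a sketch; the two distributions here are arbitrary):

```r
# Kullback-Leibler divergence between two discrete distributions q and p
kl <- function(q, p) sum(q * log(q / p))

q <- c(0.2, 0.3, 0.5)
p <- c(0.4, 0.4, 0.2)

kl(q, p)  # greater than zero, since q and p differ
kl(q, q)  # exactly zero, since the distributions are identical
```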

It is apparent that if we choose \(q(Z) = P(Z \mid X ; \theta, \phi)\), the Kullback-Leibler divergence is minimized (it becomes \(0\)). Since for this we need estimates for \(\theta\) and \(\phi\), we take the estimates from the previous iteration, so \(q^{t-1}(Z) = P(Z \mid X ; \theta^{t-1}, \phi^{t-1})\).

If we set \(q^{t-1}(Z) = P(Z \mid X ; \theta^{t-1}, \phi^{t-1})\), the Kullback-Leibler divergence is zero, so the lower bound is tight: \[\begin{align} \log P(X ; \theta^{t-1}, \phi^{t-1}) = \text{LB}(\theta^{t-1}, \phi^{t-1}, q). \end{align}\] It is worth noting, if it hasn’t become clear, that the lower bound \(\text{LB}(\theta, \phi, q)\) is precisely the difference of the auxiliary function and the entropy of \(Z\): \[\begin{align} \text{LB}(\theta, \phi, q) &= \sum_Z q(Z) \log \frac{P(X, Z ; \theta, \phi)}{ q(Z) } \\ &\stackrel{\text{see above}}{=} \mathbb{E}_{q(Z)} \left[ \ell ( \theta, \phi) \right] - \mathbb{E}_{q(Z)} \left[ \log q(Z) \right] \\ &= Q( \theta, \phi, \theta^{t-1}, \phi^{t-1}) - \mathbb{E}_{q(Z)} \left[ \log q(Z) \right], \end{align}\] where we used our definition of the auxiliary function from above: \[\begin{align} Q( \theta, \phi, \theta^{t-1}, \phi^{t-1}) &= \mathbb{E}_{q(Z)} \left[ \ell ( \theta, \phi) \right] \\ &= \mathbb{E}_{Z \mid X; \theta^{t-1}, \phi^{t-1}}[\ell(\theta, \phi)] \\ &= \sum_Z P(Z \mid X; \theta^{t-1}, \phi^{t-1}) \log P(X, Z ; \theta, \phi). \end{align}\]

Computing \(P(Z \mid X ; \theta^{t-1}, \phi^{t-1})\) is by the way exactly the E-step.

So the lower bound and the observed data likelihood have the same value after the E-step (given the current parameter estimates \(\theta^{t-1}\) and \(\phi^{t-1}\)). Next, in the M-step, we maximize the lower bound and thereby also increase the observed data likelihood. As a consequence, the observed data likelihood and the lower bound might have a gap again, so we use the next E-step to close it. Thus, the EM either increases the observed data likelihood in every iteration or has converged to a local maximum. This is a surprisingly good tool for debugging: if you do not see this monotonic increase, you certainly have a bug in your code.
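We can watch this monotonic increase in practice by tracking the observed data log likelihood over the EM iterations (a sketch on hypothetical simulated data; in a real implementation you would log these values anyway):

```r
set.seed(42)

# hypothetical data from a two-component Poisson mixture
D <- c(rpois(300, 3), rpois(700, 10))

obs.log.lik <- function(D, thetas, phis)
  sum(log(phis[1] * dpois(D, thetas[1]) + phis[2] * dpois(D, thetas[2])))

K        <- 2
phis     <- rep(1/K, K)
thetas   <- c(1, 15)
log.liks <- numeric(0)

for (i in seq(25)) {
  # E-step
  re   <- sapply(seq(K), function(k) phis[k] * dpois(D, thetas[k]))
  r_ik <- re / apply(re, 1, sum)

  # M-step
  thetas <- sapply(seq(K), function(k) sum(r_ik[, k] * D) / sum(r_ik[, k]))
  phis   <- apply(r_ik, 2, sum) / length(D)

  # record the observed data log likelihood after this iteration
  log.liks <- c(log.liks, obs.log.lik(D, thetas, phis))
}

# the log likelihood never decreases over the iterations
all(diff(log.liks) >= -1e-8)
```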


Notation

Symbol Meaning
\(X \sim P\) \(X\) is distributed according to distribution \(P\)
\(P(X)\) probability mass/density function of \(X\)
\(P(X, Y)\) joint probability distribution of \(X\) and \(Y\)
\(P(X = x)\) probability of observing \(X = x\)
\(P(X_i = x_i)\) probability of observing datum \(X_i\) with value \(X_i = x_i\)
\(P(X_i)\) short form of the above
\(\mathbb{E}(X)\) expected value of the random variable \(X\)
\(\mathbb{E}_Z(X)\) expected value of the random variable \(X\) with respect to \(P(Z)\)
\(\mathbb{E}_{Z \mid X}(X)\) expected value of the random variable \(X\) with respect to \(P(Z \mid X)\)
\(\mathbb{E}_{q(Z)}(X)\) expected value of the random variable \(X\) with respect to \(q(Z)\)
\(P(X; \theta )\) probability of \(X\) given parameter \(\theta\)
\(P(X \mid Z )\) probability of \(X\) given random variable \(Z\)
\(P(X \mid Z; \theta )\) probability of \(X\) given random variable \(Z\) and parameter \(\theta\)
\(L(\theta)\) likelihood function
\(\ell(\theta)\) log likelihood function
\(P(D ; \theta)\) probability of observing data \(D\) given parameter \(\theta\) (same as likelihood function)
\(\frac{\partial Q}{ \partial \theta}\) partial derivative of \(Q\) with respect to \(\theta\)
\(\mathbb{I}(x)\) indicator function, \(\mathbb{I}(x) = 1\) if \(x\) is true, else \(\mathbb{I}(x) = 0\)