This document briefly explains the EM algorithm. More specifically, I hope I can shed light on the following things:

- how we infer the parameters of a probability distribution,
- how we infer the parameters for a mixture of probability distributions,
- why we want to use the EM algorithm for the latter scenario,
- how we use the EM algorithm,
- how we derive the E and M steps.

I briefly explain statistical inference for a data set of Poisson random variables in the first section. The second section explains how we can do inference in a scenario where we have a mixture of Poisson distributions. The last section explains the derivation of the EM algorithm as an iterative procedure that maximizes a lower bound of the observed data likelihood function.

At the end of the document you find the explanation of the notation we are using.

**I do not take warranty for the correctness or completeness of this document.**

When analysing a data set, we are often interested in finding a suitable data generating process and the parameters that govern this process.

For instance, consider a data set \(D\) like this:

`head(D)`

`## [1] 10 8 12 15 13 13`
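For reference, a data set of this shape could have been simulated like this (the rate of 10 and the sample size are hypothetical choices for illustration; the true process behind \(D\) is exactly what we pretend not to know):

```r
set.seed(42)  # for reproducibility

# draw 1000 samples from a Poisson distribution with rate 10
D <- rpois(n = 1000, lambda = 10)
head(D)
```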

Let’s visualize this using a histogram:

```
df <- data.frame(X = D)
ggplot(df, aes(x = X)) +
  geom_histogram(position = "identity", fill = "darkgrey", binwidth = .5) +
  scale_y_continuous("Frequency") +
  theme_minimal()
```

**NOTE**: assuming a Poisson distribution is our choice. Other choices are often also fine. We are trying to *find* a suitable model that describes our data well. While sometimes we are aware of the data generating process, or at least we think we are, in some cases we can only take a guess in the beginning, then fit a model, and afterwards check if our assumptions were justified.

So we try to find the parameter for which the probability of the data is maximal. Moreover, we put a hat on the \(\lambda\) in order to signal that it is an estimate and not the true parameter (which we don’t know and will probably never know; in fact, we don’t even know whether our data is Poisson at all. Actually we know nothing.).

The probability of the data given the parameter is called the *likelihood*. Here we replaced \(\lambda\) with \(\theta\). This is just convention and makes it easier for us and others to read. For the rest of this document, we will use \(\theta\) to denote the parameters of interest.

So the likelihood is a function of the parameters, and *not* the data, even though the conditional probability might suggest that. For the sake of maximum likelihood estimation, the data is regarded as a constant.

**NOTE**: the likelihood function is not a density, i.e. it does generally **not** sum to one.

In some cases, we can find an analytical solution for \(\hat{\theta}\), i.e. we can compute \(\frac{\partial L}{\partial \theta}\), set it to \(0\) and solve for \(\theta\). However, often we cannot do that and need to find the best estimate numerically.

Let us first visualize what we try to find, the maximum of the likelihood:
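One way to do this, sketched here with simulated stand-in data (the grid of candidate rates is a hypothetical choice), is to evaluate the log likelihood over a grid of \(\theta\) values and plot the resulting curve:

```r
library(ggplot2)

# stand-in for the data set D from above
set.seed(42)
D <- rpois(n = 1000, lambda = 10)

# evaluate the Poisson log likelihood on a grid of candidate rates
thetas   <- seq(5, 15, by = 0.1)
log.liks <- sapply(thetas, function(theta) sum(dpois(x = D, lambda = theta, log = TRUE)))

# the curve peaks at the maximum likelihood estimate
ggplot(data.frame(theta = thetas, log.likelihood = log.liks),
       aes(x = theta, y = log.likelihood)) +
  geom_line() +
  theme_minimal()
```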

We will try to find this numerically by first defining the log likelihood function:

```
poisson.log.likelihood <- function(theta, D) {
  sum(log(dpois(x = D, lambda = theta)))
}
```

Then we use a numerical optimizer that comes with `R` to find the maximum:

```
(theta.num <- optimize(poisson.log.likelihood,
                       interval = c(0, 1000),
                       D = D,
                       maximum = TRUE)$maximum)
```

`## [1] 9.998085`

This is the same as the analytical solution:
\[\begin{align}
\frac{\partial\ell}{\partial \theta} &= \frac{\partial}{\partial \theta} \sum_i^N \log \frac{\theta^{X_i}}{X_i!}\exp(-\theta) \\
&= \frac{\partial}{\partial \theta} \sum_i^N \left( -\theta + X_i \log (\theta) - \log(X_i!) \right) \\
&= -N + \frac{\sum_i^N X_i}{\theta} \stackrel{!}{=} 0 \\
\Rightarrow \hat{\theta} &= \frac{\sum_i^N X_i}{N},
\end{align}\]
which is

`(theta.ana <- sum(D)/length(D))`

`## [1] 9.9981`

So the maximum likelihood estimate, i.e. the parameter that describes our data best, is roughly \(\hat{\theta} = 10\). The likelihood function is concave in this scenario, which makes it generally easy to optimize.

To sum this up, here we were interested in

- proposing a suitable model for a data set,
- estimating the parameters that best describe that data by computing a
*maximum likelihood estimate*.

(We shall later denote the lower bound in the last line \(\text{LB}(\theta, \phi, q)\).) The inequality is of course only true because the \(\log\) is a concave function (Jensen’s inequality). We can define \(q(Z)\) essentially however we want, but a natural choice is \(q(Z) = P(Z \mid X ; \theta, \phi)\), since we can easily compute these values and it turns out to give the tightest lower bound.

Thus, we arrive at the same optimization problem as above, which is the (iterative) maximization of the expected hidden log likelihood with respect to \(q(Z) = P(Z \mid X;\theta, \phi)\), i.e. our auxiliary function \(Q\): \[\begin{align} \hat{\theta}, \hat{\phi} &= \arg \max_{\theta, \phi} Q(\theta, \phi) \\ &= \arg \max_{\theta, \phi} \mathbb{E}_{Z \mid X; \theta, \phi}[\ell(\theta, \phi)] \end{align}\]

Why, however, is the EM guaranteed to increase the likelihood in every iteration (or keep it constant)? Let’s start from the beginning. We learned that we are maximizing a lower bound of the observed data log likelihood. Above we argued that we set \(q(Z) = P(Z \mid X ; \theta, \phi)\), because this gives the tightest lower bound. Let’s derive this:

\[\begin{align} \ell( \theta, \phi; X) &\stackrel{\text{From above}}{=} \log \sum_Z q(Z) \frac{P(X, Z ; \theta, \phi)}{ q(Z) }, \\ &\stackrel{\text{Jensen's inequality}}{\ge} \sum_Z q(Z) \log \frac{P(X, Z ; \theta, \phi)}{ q(Z) }. \end{align}\] We denote this bound as: \[\begin{align} \text{LB}(\theta, \phi, q) &= \sum_Z q(Z) \log \frac{P(X, Z ; \theta, \phi)}{ q(Z) }, \\ &= \sum_Z q(Z) \log \frac{P(X ; \theta, \phi) P(Z \mid X ; \theta, \phi)}{ q(Z) }, \\ &= \sum_Z q(Z) \log P(X ; \theta, \phi) + \sum_Z q(Z) \log \frac{ P(Z \mid X ; \theta, \phi)}{ q(Z) }, \\ &= \underbrace{\log P(X ; \theta, \phi)}_{\color{red}{\text{observed data log likelihood}}} - \mathbb{K}\mathbb{L}(q(Z) \mid\mid P(Z \mid X ; \theta, \phi)), \\ \log P(X ; \theta, \phi) &= \text{LB}(\theta, \phi, q) + \mathbb{K}\mathbb{L}(q(Z) \mid\mid P(Z \mid X ; \theta, \phi)), \end{align}\] where \(\mathbb{K}\mathbb{L}\) is the Kullback-Leibler divergence, which is always \(\ge 0\). So we see that the lower bound \(\text{LB}(\theta, \phi, q)\) and the observed data log likelihood, which we are trying to optimize, have a gap of the magnitude of the \(\mathbb{K}\mathbb{L}\) divergence.

It is apparent that if we choose \(q(Z) = P(Z \mid X ; \theta, \phi)\), the Kullback-Leibler divergence is minimized (it becomes \(0\)). Since for this we need estimates of \(\theta\) and \(\phi\), we take the estimates from the previous iteration, so \(q^{t-1}(Z) = P(Z \mid X ; \theta^{t-1}, \phi^{t-1})\).

If we set \(q^{t-1}(Z) = P(Z \mid X ; \theta^{t-1}, \phi^{t-1})\), the Kullback-Leibler divergence is zero, so the lower bound is tight: \[\begin{align} \log P(X ; \theta^{t-1}, \phi^{t-1}) = \text{LB}(\theta^{t-1}, \phi^{t-1}, q). \end{align}\] It is worth noting, if it hasn’t become clear, that the lower bound \(\text{LB}(\theta, \phi, q)\) is precisely the difference of the auxiliary function and the entropy of \(Z\): \[\begin{align} \text{LB}(\theta, \phi, q) &= \sum_Z q(Z) \log \frac{P(X, Z ; \theta, \phi)}{ q(Z) }, \\ &\stackrel{\text{see above}}{=} \mathbb{E}_{q(Z)} \left[ \ell ( \theta, \phi) \right] - \mathbb{E}_{q(Z)} \left[ \log q(Z) \right], \\ &= Q( \theta, \phi, \theta^{t-1}, \phi^{t-1}) - \mathbb{E}_{q(Z)} \left[ \log q(Z) \right], \end{align}\] where we used our definition of the auxiliary function from above: \[\begin{align} Q( \theta, \phi, \theta^{t-1}, \phi^{t-1}) &= \mathbb{E}_{q(Z)} \left[ \ell ( \theta, \phi) \right] \\ &= \mathbb{E}_{Z \mid X; \theta^{t-1}, \phi^{t-1}}[\ell(\theta, \phi)] \\ &= \sum_Z P(Z \mid X; \theta^{t-1}, \phi^{t-1}) \log P(X, Z ; \theta, \phi). \end{align}\] Computing \(P(Z \mid X ; \theta^{t-1}, \phi^{t-1})\) is, by the way, exactly the E-step.

So the lower bound and the observed data likelihood have the same value after the E-step (given the current parameter estimates \(\theta^{t-1}\) and \(\phi^{t-1}\)). Next, in the M-step, we maximize the lower bound and by that also increase the observed data likelihood. As a consequence, the observed data likelihood and the lower bound might have a *gap* again, so we again use the E-step to close it. Thus, the EM algorithm never decreases the observed data likelihood and converges to a local maximum. This is a surprisingly good tool for debugging, because if you do not see this monotonic increase you certainly have a bug in your code.
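To make this concrete, here is a minimal EM sketch for a two-component Poisson mixture. The data, starting values, and iteration count are hypothetical choices for illustration; the point is the alternation of E- and M-steps and the monotonically increasing observed data log likelihood, which we record as the debugging check described above:

```r
set.seed(42)

# hypothetical data: a mixture of two Poisson components with rates 3 and 12
D <- c(rpois(500, lambda = 3), rpois(500, lambda = 12))

pi.hat     <- c(0.5, 0.5)  # mixing weights (phi), initial guess
lambda.hat <- c(2, 10)     # component rates (theta), initial guess
log.liks   <- numeric(0)   # observed data log likelihood per iteration

for (iter in 1:50) {
  # E-step: posterior P(Z | X; theta, phi) for each datum and component
  dens  <- cbind(pi.hat[1] * dpois(D, lambda.hat[1]),
                 pi.hat[2] * dpois(D, lambda.hat[2]))
  gamma <- dens / rowSums(dens)

  # record the observed data log likelihood at the current estimates;
  # this sequence must never decrease
  log.liks <- c(log.liks, sum(log(rowSums(dens))))

  # M-step: maximize the expected complete data log likelihood
  pi.hat     <- colMeans(gamma)
  lambda.hat <- colSums(gamma * D) / colSums(gamma)
}
```

If `diff(log.liks)` ever dips below zero (up to numerical noise), something is wrong with the implementation.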

Symbol | Meaning |
---|---|
\(X \sim P\) | \(X\) is distributed according to distribution \(P\) |
\(P(X)\) | probability mass/density function of \(X\) |
\(P(X, Y)\) | joint probability distribution of \(X\) and \(Y\) |
\(P(X = x)\) | probability of observing \(X = x\) |
\(P(X_i = x_i)\) | probability of observing datum \(X_i\) with value \(X_i = x_i\) |
\(P(X_i)\) | short form of the above |
\(\mathbb{E}(X)\) | expected value of the random variable \(X\) |
\(\mathbb{E}_Z(X)\) | expected value of the random variable \(X\) with respect to \(P(Z)\) |
\(\mathbb{E}_{Z \mid X}(X)\) | expected value of the random variable \(X\) with respect to \(P(Z \mid X)\) |
\(\mathbb{E}_{q(Z)}(X)\) | expected value of the random variable \(X\) with respect to \(q(Z)\) |
\(P(X; \theta )\) | probability of \(X\) given parameter \(\theta\) |
\(P(X \mid Z )\) | probability of \(X\) given random variable \(Z\) |
\(P(X \mid Z; \theta )\) | probability of \(X\) given random variable \(Z\) and parameter \(\theta\) |
\(L(\theta)\) | likelihood function |
\(\ell(\theta)\) | log likelihood function |
\(P(D ; \theta)\) | probability of observing data \(D\) given parameter \(\theta\) (same as likelihood function) |
\(\frac{\partial Q}{ \partial \theta}\) | partial derivative of \(Q\) with respect to \(\theta\) |
\(\mathbb{I}(x)\) | indicator function, \(\mathbb{I}(x) = 1\) if \(x\) is true, else \(\mathbb{I}(x) = 0\) |