Advanced Machine Learning

17: Expectation Maximization

Outline for the lecture

  • Do we even need EM for GMM?
  • GMM estimation: a hack
  • MLE via EM

Do we even need EM for GMM?

Gaussian Mixture Model

Likelihood: $ \sum_{k=1}^K \pi_k\prob{N}{\vec{x}|\vec{\mu}_k, \bm{\Sigma}_k} $
\begin{align} \mbox{ for simplicity }\bm{\Sigma}_k & = \sigma^2 \bm{I}\\ \prob{p}{\vec{x}|y=k} & = \prob{N}{\vec{x}|\vec{\mu}_k, \sigma^2 \bm{I}}\\ \prob{p}{y=k} & = \pi_k\\ \mbox{parameters: } & \vec{\mu}_1, \dots, \vec{\mu}_K, \\ &\sigma^2, \\ & \pi_1, \dots, \pi_K \end{align}
But do we even need the hidden variables?
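A minimal generative sketch of this model (parameters below are illustrative, not from the lecture): draw the hidden label $y = k$ with probability $\pi_k$, then draw $\vec{x}$ from the chosen Gaussian; the mixture density marginalizes $y$ out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: K = 3 components in 2-D, shared isotropic
# covariance sigma^2 * I, mixing weights pi_k.
pi = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0],
                [3.0, 3.0],
                [-3.0, 2.0]])
sigma = 1.0

# Generative view: draw the hidden label y = k with probability pi_k,
# then draw x | y = k from N(mu_k, sigma^2 I).
N = 500
y = rng.choice(len(pi), size=N, p=pi)               # hidden variables
X = mus[y] + sigma * rng.standard_normal((N, 2))    # observed data

def gmm_density(x, pi, mus, sigma):
    """p(x) = sum_k pi_k N(x | mu_k, sigma^2 I), the hidden label summed out."""
    d = mus.shape[1]
    sq = ((x - mus) ** 2).sum(axis=1)                       # ||x - mu_k||^2
    comp = np.exp(-sq / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (d / 2)
    return float(pi @ comp)

print(gmm_density(np.array([0.5, 0.5]), pi, mus, sigma))
```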

Maximum (Log) Likelihood Estimation

\begin{align} \ln{ \prob{p}{\bm{X}|\{\vec{\mu}_k\}, \{\bm{\Sigma}_k\}}} &= \sum_{n=1}^N\ln\left\{\sum_{k=1}^K \pi_k\prob{N}{\vec{x}_n|\vec{\mu}_k, \bm{\Sigma}_k} \right\} \end{align}
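A short numerical sketch of this quantity (isotropic covariances $\sigma^2\bm{I}$ assumed; data and parameters made up for illustration). Note that the sum over $k$ sits inside the $\ln$, which is exactly what makes the expression awkward:

```python
import numpy as np
from scipy.special import logsumexp   # numerically stable log-sum-exp

def gmm_log_likelihood(X, pi, mus, sigma):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, sigma^2 I)."""
    N, d = X.shape
    sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    log_gauss = -0.5 * d * np.log(2 * np.pi * sigma**2) - sq / (2 * sigma**2)
    # the sum over k is inside the log, so the log does not decompose per term
    return logsumexp(np.log(pi)[None, :] + log_gauss, axis=1).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
print(gmm_log_likelihood(X, np.array([0.5, 0.5]),
                         np.array([[0.0, 0.0], [2.0, 2.0]]), 1.0))
```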

Difficult to optimize

  • Remember the exponential family? \[ \prob{p}{\vec{x}|\vec{\eta}} = \prob{h}{\vec{x}}\prob{g}{\vec{\eta}}e^{\vec{\eta}^T\prob{u}{\vec{x}}} \]
  • How easy the $\log$-likelihood was back then \begin{align} \log{\cal L} & = \sum \log \prob{p}{\vec{x}|\vec{\eta}}\\ & = \sum \log \prob{h}{\vec{x}} + \sum \log \prob{g}{\vec{\eta}} + \sum \vec{\eta}^T\prob{u}{\vec{x}} \end{align}
  • But the mixture $\sum_k \pi_k\prob{p}{\vec{x}|\vec{\eta}_k}$ is not in the exponential family (for contrast, a worked single-Gaussian example follows this list)
  • \begin{align} \ln{ \prob{p}{\bm{X}| \{\vec{\mu}_k\}, \{\bm{\Sigma}_k\}}} &= \sum_{n=1}^N\ln\left\{\sum_{k=1}^K \pi_k\prob{N}{\vec{x}_n|\vec{\mu}_k, \bm{\Sigma}_k} \right\} \end{align}
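As a concrete instance of the exponential family form above (a standard identity, not shown on the slide), the univariate Gaussian factors as

\begin{align} {\cal N}(x|\mu,\sigma^2) &= \underbrace{\frac{1}{\sqrt{2\pi}}}_{\prob{h}{x}}\, \exp\left(\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln\sigma^2\right), \quad \vec{\eta} = \left(\frac{\mu}{\sigma^2},\; -\frac{1}{2\sigma^2}\right)^T, \quad \prob{u}{x} = (x,\; x^2)^T \end{align}

so its $\log$-likelihood splits into simple sums, whereas for the mixture the $\ln$ cannot pass through the inner sum over $k$.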

Another problem (common to MLE)

  • Suppose $K = 2$
  • Suppose one $\vec{\mu}_k = \vec{x}_i$
  • What's going to happen with our MLE? \begin{align} {\cal N}(x_i | \mu_k, \sigma_k^2) &= \frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(x_i-\mu_k)^2}{2\sigma_k^2}} \end{align}
MLE collapse
With $\vec{\mu}_k = \vec{x}_i$ the exponent vanishes, so this likelihood term is unbounded:
\begin{align} {\cal N}(x_i | \mu_k, \sigma_k^2) &= \frac{1}{\sqrt{2\pi}}\frac{1}{\sigma_k} \to \infty \quad \mbox{as } \sigma_k \to 0 \end{align}
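A tiny numerical sketch of the collapse (made-up 1-D data): pin one component's mean on a data point and shrink its variance, and the log-likelihood grows without bound.

```python
import numpy as np

x = np.array([-1.0, 0.2, 1.5, 3.0])   # toy data
mu1, mu2 = x.mean(), x[0]             # mu_2 sits exactly on a data point
pi = 0.5

def log_lik(sigma2_collapsing, sigma2_other=1.0):
    n1 = np.exp(-(x - mu1)**2 / (2 * sigma2_other)) / np.sqrt(2 * np.pi * sigma2_other)
    n2 = np.exp(-(x - mu2)**2 / (2 * sigma2_collapsing)) / np.sqrt(2 * np.pi * sigma2_collapsing)
    return np.log(pi * n1 + pi * n2).sum()

for s2 in [1.0, 1e-2, 1e-4, 1e-8]:
    print(f"sigma_k^2 = {s2:.0e}  ->  log-likelihood = {log_lik(s2):.2f}")
# The values keep increasing as sigma_k^2 -> 0: the MLE "collapses".
```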

GMM estimation: a hack

Mixture of 2 Gaussians

  • \begin{align} \prob{p}{x_n|\mu_1, \mu_2, \sigma} &= \sum_{k=1}^2 \pi_k\prob{N}{x_n|\mu_k, \sigma} \end{align}
  • \begin{align} \prob{p}{x_n|\mu_1, \mu_2, \sigma} &= \sum_{k=1}^2 \pi_k\frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left(-\frac{(x_n-\mu_k)^2}{2\sigma^2}\right)} \end{align}
  • Let's assume $\pi_1 = \pi_2 = 1/2$ and pack the parameters into $\vec{\theta} = \{\mu_1, \mu_2, \sigma\}$

Mixture of 2 Gaussians

assuming known $\mu_k$ and $\sigma$

\begin{align} \prob{P}{k=1|x_n,\vec{\theta}} & = \frac{1}{1 + \exp{\left[-(w_1x_n+w_0)\right]}}\\ \prob{P}{k=2|x_n,\vec{\theta}} & = \frac{1}{1 + \exp{\left[+(w_1x_n+w_0)\right]}} \end{align}
recall logistic regression and softmax!
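A quick sketch checking this equivalence numerically. The slide does not spell out $w_1$ and $w_0$; the expressions below are one consistent choice, obtained from the ratio of the two Gaussians with $\pi_1 = \pi_2 = 1/2$, and should be read as filled-in assumptions.

```python
import numpy as np

mu1, mu2, sigma = -1.0, 2.0, 1.5      # assumed known parameters
x = np.linspace(-4, 5, 9)

# Direct Bayes rule with pi_1 = pi_2 = 1/2 (normalizers cancel: shared sigma)
n1 = np.exp(-(x - mu1)**2 / (2 * sigma**2))
n2 = np.exp(-(x - mu2)**2 / (2 * sigma**2))
post1_bayes = n1 / (n1 + n2)

# Logistic (sigmoid) form; w1, w0 derived from the ratio N_2 / N_1
w1 = (mu1 - mu2) / sigma**2
w0 = (mu2**2 - mu1**2) / (2 * sigma**2)
post1_sigmoid = 1.0 / (1.0 + np.exp(-(w1 * x + w0)))

print(np.allclose(post1_bayes, post1_sigmoid))   # True
```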

Mixture of 2 Gaussians

assuming known $\sigma$ but not $\mu_k$

  • \begin{align} \prob{p}{x_n|\mu_1, \mu_2, \sigma} &= \sum_{k=1}^2 \pi_k\frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left(-\frac{(x_n-\mu_k)^2}{2\sigma^2}\right)} \end{align}
  • \begin{align} \prob{p}{\bm{X}|\mu_1, \mu_2, \sigma} &= \underset{n}{\prod} \prob{p}{x_n|\mu_1, \mu_2, \sigma} \end{align}
  • As you'll show, the log-likelihood derivative is \begin{align} \frac{\partial {\cal L}}{\partial \mu_k} &= \sum_n \prob{P}{k|x_n,\vec{\theta}} \frac{x_n-\mu_k}{\sigma^2} \end{align}
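A sketch verifying this gradient against finite differences (assuming $\pi_1 = \pi_2 = 1/2$ and a shared, known $\sigma$, with synthetic data):

```python
import numpy as np

def log_lik(mu, x, sigma=1.0):
    """ln prod_n [ 0.5 N(x_n|mu_1, sigma^2) + 0.5 N(x_n|mu_2, sigma^2) ]"""
    comp = (np.exp(-(x[:, None] - mu[None, :])**2 / (2 * sigma**2))
            / np.sqrt(2 * np.pi * sigma**2))
    return np.log((0.5 * comp).sum(axis=1)).sum()

def grad_analytic(mu, x, sigma=1.0):
    """d ln L / d mu_k = sum_n P(k|x_n, theta) (x_n - mu_k) / sigma^2"""
    comp = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * sigma**2))
    resp = comp / comp.sum(axis=1, keepdims=True)          # P(k | x_n, theta)
    return (resp * (x[:, None] - mu[None, :])).sum(axis=0) / sigma**2

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])
mu = np.array([0.0, 1.0])

eps = 1e-6
num = np.array([(log_lik(mu + eps * e, x) - log_lik(mu - eps * e, x)) / (2 * eps)
                for e in np.eye(2)])
print(grad_analytic(mu, x), num)    # the two should agree to ~1e-6
```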

Mixture of 2 Gaussians

assuming known $\sigma$ but not $\mu_k$

  • As you'll show, the log-likelihood derivative is \begin{align} \frac{\partial {\cal L}}{\partial \mu_k} &= \sum_n \prob{P}{k|x_n,\vec{\theta}} \frac{x_n-\mu_k}{\sigma^2} \end{align}
  • As you'll show next$^{*}$ (treating the responsibilities $\prob{P}{k|x_n,\vec{\theta}}$ as constants) \begin{align} \frac{\partial^2 {\cal L}}{\partial \mu_k^2} &= -\sum_n \prob{P}{k|x_n,\vec{\theta}} \frac{1}{\sigma^2} \end{align}

Mixture of 2 Gaussians

assuming known $\sigma$ but not $\mu_k$

Using the good old Newton-Raphson update: $\mu_k \leftarrow \mu_k - \frac{\partial {\cal L}}{\partial \mu_k}\Big/\frac{\partial^2 {\cal L}}{\partial \mu_k^2}$
You will show \[ \mu_k = \frac{\sum_n \left(\prob{P}{k|x_n,\vec{\theta}} x_n\right)}{\sum_n \prob{P}{k|x_n,\vec{\theta}} } \]
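A sketch of the resulting iterative scheme on synthetic 1-D data (known shared $\sigma$, $\pi_1 = \pi_2 = 1/2$): alternate between computing responsibilities and applying the weighted-mean update above.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
sigma = 1.0
mu = np.array([0.0, 0.5])            # poor initial guess for (mu_1, mu_2)

for it in range(50):
    # responsibilities P(k | x_n, theta) with pi_1 = pi_2 = 1/2
    comp = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * sigma**2))
    resp = comp / comp.sum(axis=1, keepdims=True)
    # the Newton-Raphson fixed point from the slide: responsibility-weighted mean
    mu_new = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    if np.max(np.abs(mu_new - mu)) < 1e-8:
        break
    mu = mu_new

print(mu)    # close to (-2, 3), up to label ordering
```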

Compare to soft k-means: Iterations

Update
  • Re-estimate the $K$ cluster centers (a.k.a. centroids or means), assuming the soft memberships $r_k^n$ found in the previous step are correct. \begin{align} \hat{\vec{\mu}}_k &= \frac{\underset{n}{\sum}r_k^n \vec{x}_n}{R_k}\\ R_k &= \underset{n}{\sum} r_k^n \end{align}
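A sketch of one full soft k-means iteration. The responsibility $r_k^n$ here is assumed to be the softmax of negative squared distances with a stiffness $\beta$ (as in the earlier soft k-means lecture); treat that form as an assumption.

```python
import numpy as np

def soft_kmeans_update(X, centers, beta=1.0):
    """Soft responsibilities, then the weighted-mean update from the slide."""
    d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(axis=2)   # (N, K)
    r = np.exp(-beta * d2)
    r /= r.sum(axis=1, keepdims=True)                             # r_k^n
    R = r.sum(axis=0)                                             # R_k
    return (r[:, :, None] * X[:, None, :]).sum(axis=0) / R[:, None]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centers = X[rng.choice(len(X), 2, replace=False)]
for _ in range(20):
    centers = soft_kmeans_update(X, centers, beta=1.0)
print(centers)     # one center near (0, 0), the other near (5, 5)
```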

GMMs and k-means

  • responsibilities are posteriors over latents
  • update is maximization of the likelihood
Both "hacky" approaches to solve a latent variable problem use the same general technique
Expectation Maximization: a meta-algorithm

too important to simply skim!

EM paper

Convexity

Convexity

Shades of Convex

Convex
Strictly convex
Strongly convex

Convexity

Theorem:
if $f(x)$ is twice differentiable on $[a,b]$ and $f^{\prime \prime}(x) \ge 0$ on $[a,b]$ then $f(x)$ is convex on $[a,b]$.

Convexity of logarithm

Theorem:
$-\ln(x)$ is strictly convex on $(0, \infty)$
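This follows from the second-derivative test on the previous slide (a one-line check):

\begin{align} \frac{d^2}{dx^2}\left(-\ln x\right) &= \frac{1}{x^2} \gt 0 \quad \mbox{for all } x \in (0, \infty) \end{align}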

Jensen's inequality

Theorem:
Let $f$ be a convex function on an interval $I$. If $x_1, x_2, \dots, x_n \in I$ with $\lambda_1, \lambda_2, \dots, \lambda_n \ge 0$ and $\sum_{i=1}^n\lambda_i=1$, then \begin{align} f\left(\sum_{i=1}^n\lambda_ix_i\right) & \le \sum_{i=1}^n \lambda_i f(x_i) \end{align}

Thanks to Jensen's inequality

Since $\ln$ is concave ($-\ln$ is convex), applying Jensen's inequality to $-\ln$ flips the direction:
\begin{align} \ln\left(\sum_{i=1}^n\lambda_ix_i\right) & \ge \sum_{i=1}^n \lambda_i \ln{(x_i)} \end{align}
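A one-off numeric sanity check of this bound (random weights and points, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
lam = rng.dirichlet(np.ones(5))        # lambda_i >= 0, summing to 1
x = rng.uniform(0.1, 10.0, 5)          # points in (0, inf)

lhs = np.log(np.dot(lam, x))           # ln(sum_i lambda_i x_i)
rhs = np.dot(lam, np.log(x))           # sum_i lambda_i ln(x_i)
print(lhs >= rhs)                      # True: log of the average >= average of the logs
```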
Now we are ready

Derivation of EM algorithm

  • Our goal is to maximize the likelihood function: \[ \prob{L}{\vec{\theta}} = \ln \prob{P}{\bm{X}|\vec{\theta}} \]
  • Equivalently, if our estimate at step $n$ is $\vec{\theta}_n$, we want a new $\vec{\theta}$ such that \[ \prob{L}{\vec{\theta}} \gt \prob{L}{\vec{\theta}_n} \]
  • Equivalently again, we want to maximize the improvement \[ \prob{L}{\vth} - \prob{L}{\vec{\theta}_n} = \ln\prob{P}{\bm{X}|\vec{\theta}} - \ln\prob{P}{\bm{X}|\vec{\theta}_n} \]
  • Looks difficult!

Introducing random variables $\vec{z}$

  • \[ \prob{P}{\bm{X}|\vec{\theta}} = \sum_{\vec{z}} \prob{P}{\bm{X}|\vec{z}, \vth}\prob{P}{\vec{z}|\vth} \]
  • \[ \prob{L}{\vth} - \prob{L}{\vec{\theta}_n} = \ln\prob{P}{\bm{X}|\vec{\theta}} - \ln\prob{P}{\bm{X}|\vec{\theta}_n} \]
  • Becomes \[ \prob{L}{\vth} - \prob{L}{\vec{\theta}_n} = \ln \sum_{\vec{z}} \prob{P}{\bm{X}|\vec{z}, \vth}\prob{P}{\vec{z}|\vth} - \ln\prob{P}{\bm{X}|\vec{\theta}_n} \]
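Filling in the usual next step (a sketch of the standard derivation): multiply and divide by $\prob{P}{\vec{z}|\bm{X},\vth_n}$ inside the sum, apply the Jensen bound for $\ln$ from above, and call the resulting gap $\Delta(\vth|\vth_n)$, which is used on the next slides:

\begin{align} \prob{L}{\vth} - \prob{L}{\vth_n} &= \ln \sum_{\vec{z}} \prob{P}{\vec{z}|\bm{X},\vth_n}\, \frac{\prob{P}{\bm{X}|\vec{z},\vth}\prob{P}{\vec{z}|\vth}}{\prob{P}{\vec{z}|\bm{X},\vth_n}} - \ln\prob{P}{\bm{X}|\vth_n}\\ &\ge \sum_{\vec{z}} \prob{P}{\vec{z}|\bm{X},\vth_n}\, \ln \frac{\prob{P}{\bm{X}|\vec{z},\vth}\prob{P}{\vec{z}|\vth}}{\prob{P}{\vec{z}|\bm{X},\vth_n}\,\prob{P}{\bm{X}|\vth_n}} \overset{\text{def}}{=} \Delta(\vth|\vth_n) \end{align}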

Modifying the objective

EM objective

Define the lower bound

\begin{align} {\cal L}(\vth|\vth_n) & \overset{\text{def}}{=} \prob{L}{\vth_n} + \Delta(\vth|\vth_n)\\ \prob{L}{\vth} & \ge {\cal L}(\vth|\vth_n) \end{align}

When $\prob{L}{\vth} = {\cal L}(\vth|\vth_n)$

EM lower bound
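With $\Delta(\vth|\vth_n)$ as defined above, the bound touches the objective at the current estimate:

\begin{align} \Delta(\vth_n|\vth_n) &= \sum_{\vec{z}} \prob{P}{\vec{z}|\bm{X},\vth_n} \ln \frac{\prob{P}{\bm{X},\vec{z}|\vth_n}}{\prob{P}{\vec{z}|\bm{X},\vth_n}\,\prob{P}{\bm{X}|\vth_n}} = \sum_{\vec{z}} \prob{P}{\vec{z}|\bm{X},\vth_n} \ln 1 = 0 \quad\Rightarrow\quad {\cal L}(\vth_n|\vth_n) = \prob{L}{\vth_n} \end{align}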

What happens when we optimize $\cal L$

EM figure iterative

What happens when we optimize $\cal L$

EM formally
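The title refers to the standard formal statement; a sketch consistent with the bound above (terms independent of $\vth$ dropped inside the $\arg\max$):

\begin{align} \vth_{n+1} &= \arg\max_{\vth}\, {\cal L}(\vth|\vth_n) = \arg\max_{\vth} \underbrace{\sum_{\vec{z}} \prob{P}{\vec{z}|\bm{X},\vth_n}\, \ln \prob{P}{\bm{X},\vec{z}|\vth}}_{Q(\vth|\vth_n)}\\ \mbox{E-step: } & \mbox{ compute } \prob{P}{\vec{z}|\bm{X},\vth_n} \mbox{ and form } Q(\vth|\vth_n)\\ \mbox{M-step: } & \mbox{ set } \vth_{n+1} = \arg\max_{\vth} Q(\vth|\vth_n) \end{align}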

Kullback-Leibler divergence

KLD
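For reference, the standard definition of the divergence named above, for distributions $q$ and $p$ over the latents:

\begin{align} \mathrm{KL}\left(q \,\|\, p\right) &= \sum_{\vec{z}} q(\vec{z}) \ln \frac{q(\vec{z})}{p(\vec{z})} \ge 0, \quad \mbox{with equality iff } q = p \end{align}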

Yet another view of EM (ELBO)
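A sketch of the standard decomposition behind this view, for any distribution $q(\vec{z})$ over the latents:

\begin{align} \ln \prob{P}{\bm{X}|\vth} &= \underbrace{\sum_{\vec{z}} q(\vec{z}) \ln \frac{\prob{P}{\bm{X},\vec{z}|\vth}}{q(\vec{z})}}_{\mathrm{ELBO}(q,\,\vth)} + \mathrm{KL}\left(q(\vec{z}) \,\|\, \prob{P}{\vec{z}|\bm{X},\vth}\right) \end{align}

The E-step makes the KL term zero by setting $q(\vec{z}) = \prob{P}{\vec{z}|\bm{X},\vth_n}$; the M-step then maximizes the ELBO over $\vth$.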