Gaussian Mixture Models
We’ve discussed Gaussians a few times on this blog. In particular, we recently explored Gaussian process regression, a post I personally enjoyed writing because I learned so much while studying the topic. Today, we will continue our exploration of the Gaussian world with yet another machine learning model that bears the name of Gauss: Gaussian mixture models. After watching yet another inspiring video by mathematicalmonk on YouTube, I had been meaning to write about Gaussian mixture models for quite some time, and here it finally is. I would also like to thank ritvikmath for a great beginner-friendly explanation of GMMs and Expectation Maximization, as well as fiveMinuteStats for a wonderful exposition of the intuition behind the EM algorithm.
Without further ado, let’s jump right into it.
Gaussian Mixture Models
The motivating idea behind GMMs is that we can model seemingly complicated distributions as a convex combination of Gaussians, each defined by its own parameters. One visual analogy I found particularly useful is imagining each Gaussian as a hill or mountain on a contour map. If we have multiple hills adjacent to one another, we can essentially model the topography of the region as a combination of Gaussians. At the peaks, we would see circular contour lines, but where the hills meet, we might see different patterns, most likely circular contours overlapping with each other.
Latent Data Generation
The key point here is that the combination is convex; in other words, the mixing coefficients of the Gaussians should add up to one. If we consider the GMM to be a generative model, then we can imagine the generating process as follows:
- Sample a class index from a categorical distribution to determine the class in which the data point will belong
- Sample a data point from the Gaussian distribution that corresponds to the sampled class index
At this point, for clarity’s sake, let’s introduce some notation and concretize what has been elaborated above. First, we define a categorical distribution to represent the mixing coefficients. Let $\pi$ be an $m$-dimensional vector that parametrizes this categorical distribution. In other words,
\[\pi = (\pi_1, \pi_2, \cdots, \pi_m), \quad \sum_{i = 1}^m \pi_i = 1 \tag{1}\]This means that we assume the data to have $m$ different clusters. Each $\pi_i$ is then a mixing coefficient, and the constraint that they sum to one is what makes the combination of the underlying Gaussians convex. Now we can sample the cluster index, denoted as $Z$, from the categorical distribution as shown below. Note that we use $z$ for the cluster index, since it is considered a latent variable: we don’t observe it directly, yet it is deeply involved in the data generation process.
\[P(Z_i = k) = \pi_k \tag{2}\]Quite simply, the probability that a data point belongs to the $k$th cluster is given by $\pi_k$. The last step of data generation, as outlined in the bullet points above, is sampling a point from the corresponding Gaussian:
\[\mathcal{N}(\mu_k, \Sigma_k)\]where $\mu_k$ and $\Sigma_k$ are the mean vector and covariance matrix that parameterize the $k$th Gaussian in the mixture model.
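To make the generative story concrete, here is a minimal NumPy sketch of this two-step process. The particular values of `pi`, `mus`, and `covs` (a two-component, two-dimensional mixture) are made up purely for illustration, and the helper name `sample_gmm` is my own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a two-component, two-dimensional mixture.
pi = np.array([0.4, 0.6])                      # mixing coefficients, sum to one
mus = np.array([[0.0, 0.0], [3.0, 3.0]])       # one mean vector per component
covs = np.array([np.eye(2), 0.5 * np.eye(2)])  # one covariance matrix per component

def sample_gmm(n):
    # Step 1: sample latent cluster indices z ~ Categorical(pi).
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: sample each point from the Gaussian of its cluster.
    x = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return x, z

X, Z = sample_gmm(500)
```

Plotting `X` colored by `Z` would show the two overlapping “hills” from the contour-map analogy above.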
Marginal Probability
Now that we have an idea of how GMMs can serve as a generative model that describes the underlying data generation process, let’s think about the marginal probability—that is, the probability that some point $x$ in the sample space was generated by the GMM. After all, what we observe is the final output, not the latent variables. Therefore, it would be convenient to be able to come up with some expression for the marginal distribution.
We can come up with a simple expression using the law of total probability and marginalization.
\[p(x) = \sum_{z = 1}^{m} p(x \vert z) \, p(z) \tag{3}\]where the latent variable $z$ takes values from 1 to $m$. Notice that we can simplify $p(z)$, since it is simply the categorical distribution over the mixing coefficients. In other words,
\[p(x) = \sum_{i = 1}^m \pi_i \, p(x \vert z = i) \tag{4}\]We can also simplify the conditional probability expression, since we already know that each component follows a normal distribution. Therefore, we end up with
\[p(x) = \sum_{i = 1}^m \pi_i \mathcal{N}(x \vert \mu_i, \Sigma_i) \tag{5}\]And there we have it, the density function of the Gaussian mixture model! We have a convex combination of $m$ different Gaussian PDFs that compose the Gaussian mixture model.
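As a quick sanity check of equation (5), here is a hedged sketch that evaluates the mixture density with `scipy.stats.multivariate_normal`; it assumes the illustrative `pi`, `mus`, and `covs` arrays from the sampling snippet above, and the function name `gmm_density` is my own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pi, mus, covs):
    # p(x) = sum_i pi_i * N(x | mu_i, Sigma_i)  -- equation (5)
    return sum(
        w * multivariate_normal.pdf(x, mean=mu, cov=cov)
        for w, mu, cov in zip(pi, mus, covs)
    )

# Works on a single point or on a batch of points (rows of a 2-D array).
print(gmm_density(np.array([[0.0, 0.0], [3.0, 3.0]]), pi, mus, covs))
```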
Parameter Estimation
In the context of machine learning, the goal is to find the parameters of the model that best describe some given set of data. In the case of GMMs, this is no different. The parameters that we have to estimate are
\[\pi, \mu, \Sigma\]Of course, $\mu$ and $\Sigma$ are not a single vector and matrix, but collections of $m$ such objects; for notational simplicity, I opted to write them as shown above.
Maximum Likelihood Estimation
Given a collection of $n$ data points, denoted as $X$, we can now come up with the following expression:
\[\mathcal{L}(\theta \vert X) = \prod_{j = 1}^n \left( \sum_{i = 1}^m \pi_i \mathcal{N}(x_j \vert \mu_i, \Sigma_i) \right) \tag{6}\]$\mathcal{L}$ denotes the likelihood function, and $\theta$ is the collection of all parameters that define the mixture model. In other words,
\[\theta = \{\mu_1, \ldots, \mu_m, \Sigma_1, \ldots, \Sigma_m, \pi_1, \ldots, \pi_m \}\]All we are doing here is taking the marginal distribution we derived earlier and applying it to a situation in which we have $n$ data points instead of just one, which is the context we have been operating in so far. We multiply the probabilities under the assumption that the data points are independent.
From the perspective of maximum likelihood estimation, the goal then would be to maximize this likelihood to find the optimal parameters for $\pi, \mu$ and $\Sigma$.
Before we attempt MLE with the likelihood function, let’s first compute the log likelihood, as this often makes MLE much easier by turning products into sums. Let $\ell(\theta) = \log \mathcal{L}(\theta \vert X)$.
\[\ell(\theta) = \sum_{j = 1}^n \log \left( \sum_{i = 1}^m \pi_i \mathcal{N}(x_j \vert \mu_i, \Sigma_i) \right) \tag{7}\]And now we see a problem: because of the inner summation, the log cannot be pushed through to the individual Gaussian densities. This is bad news, since differentiating this expression with respect to $\theta$ becomes very tedious. The result is certainly not going to look pretty, but let’s try differentiating the log likelihood with respect to $\mu_k$ and setting the derivative equal to zero.
One piece of good news is that, thanks to the linearity of differentiation, the derivative passes straight through the outer summation. We ultimately end up with the expression below:
\[\sum_{j = 1}^n \frac{\pi_k \mathcal{N}(x_j \vert \mu_k, \Sigma_k) \Sigma_k^{-1} (x_j - \mu_k)}{\sum_{i = 1}^m \pi_i \mathcal{N}(x_j \vert \mu_i, \Sigma_i)} = 0 \tag{8}\]This is not pretty at all, and it seems extremely difficult, if not impossible, to solve analytically for $\mu_k$. This is why we can’t approach MLE the way we usually do, by directly computing derivatives, setting them equal to zero, and solving. This is where Expectation Maximization, or the EM algorithm, comes in.
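Although equation (7) resists a closed-form maximizer, it is easy to evaluate numerically, which is exactly what we will need later to monitor the convergence of EM. Below is a minimal sketch; computing the inner sum in log space with `logsumexp` is a numerical-stability choice on my part, not something required by the derivation.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, covs):
    # log_probs[j, i] = log(pi_i) + log N(x_j | mu_i, Sigma_i)
    log_probs = np.column_stack([
        np.log(pi[i]) + multivariate_normal.logpdf(X, mean=mus[i], cov=covs[i])
        for i in range(len(pi))
    ])
    # Equation (7): for each point, log of the inner sum over components, then sum over points.
    return logsumexp(log_probs, axis=1).sum()
```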
Expectation Maximization
Before we get into the details of the EM algorithm, it’s perhaps best to provide a very brief overview of how it works. A simple way to build intuition for EM is to think of the Gibbs sampler. Simply put, Gibbs sampling is a way of approximating some joint distribution given its conditional distributions. The underlying idea is to draw a sample from one conditional distribution and use that result, in turn, to generate a sample from the next conditional distribution. One might visualize this as a chain of circular dependence: each new sample is obtained from past samples and is then used to generate the next one. Repeat this process until convergence, and we are done.
It turns out that the Gibbs sampler can be considered a specific flavor of the EM method. Although I am still far from fully understanding the inner workings of the EM algorithm, the underlying idea is clear: given some sort of dependence relationship, much like the one we saw in the case of Gibbs sampling above, we can generate some quantity in one iteration and use it in the next. As we will see in this section, such a relationship can clearly be established for mixture models, which is why EM is so commonly used in the context of Gaussian mixture models.
Let’s begin by defining some quantities, the first being the posterior distribution. In other words, given a data point, what is the probability that it belongs to a certain cluster? Using the definition of conditional probability, we arrive at the following expression:
\[P(Z = k \vert x_a) = \frac{\pi_k \mathcal{N}(x_a \vert \mu_k, \Sigma_k)}{\sum_{i = 1}^m \pi_i \mathcal{N}(x_a \vert \mu_i, \Sigma_i)} = \gamma_{a, k} \tag{9}\]This represents the probability that a given point $x_a \in X$ belongs to the $k$th cluster. With this result, we can now rewrite the MLE calculation that we were performing earlier. Using the new $\gamma$ notation, we can simplify equation (8) down to
\[\sum_{j = 1}^n \gamma_{j, k} \Sigma_k^{-1} (x_j - \mu_k) = 0 \tag{10}\]We can then simplify this expression to derive a formula for $\mu_k$. An important trick here is to use the fact that the covariance matrix is positive definite and hence invertible: multiplying both sides by $\Sigma_k$ removes it entirely, so it plays no role in driving the expression to zero.
\[\sum_{j = 1}^n \left( \gamma_{j, k} x_j - \gamma_{j, k} \mu_k \right) = 0 \tag{11}\]With some algebraic manipulations, we arrive at
\[\hat{\mu_k} = \frac{\sum_{j = 1}^n \gamma_{j, k} x_j}{\sum_{j = 1}^n \gamma_{j, k}} \tag{12}\]Let’s introduce another notational device to simplify the expression even further. Let
\[N_k = \sum_{j = 1}^n \gamma_{j, k} \tag{13}\]Recall that $\gamma_{a, k}$ was defined to be the posterior probability that a given point $x_a$ belongs to the $k$th cluster. Since we are summing this quantity across all $n$ data points in the dataset $X$, we can interpret $N_k$ as the effective number of points in the dataset assigned to the $k$th cluster. We can now simplify the MLE estimate of the mean as
\[\hat{\mu_k} = \frac{1}{N_k}\sum_{j = 1}^n \gamma_{j, k} x_j \tag{14}\]But now we can observe something interesting. Notice that $\mu_k$ depends on $\gamma$. In turn, $\gamma$ is defined in terms of $\mu_k$. This is the very circular dependency that we discussed earlier when introducing the EM algorithm and comparing it with the Gibbs sampler. Now it becomes increasingly apparent why the EM algorithm is needed to find a converging solution for the MLE estimates.
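To make this circular dependence concrete in code, here is a sketch of the two halves: the responsibilities $\gamma$ of equation (9), computed from the current parameters, and the mean update of equation (14), computed from $\gamma$. The function names and the vectorized layout (an $n \times m$ matrix `gamma`) are my own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, covs):
    # weighted[j, k] = pi_k * N(x_j | mu_k, Sigma_k), the numerator of equation (9)
    weighted = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
        for k in range(len(pi))
    ])
    # Normalize each row so that gamma[j, :] sums to one, completing equation (9).
    return weighted / weighted.sum(axis=1, keepdims=True)

def update_means(X, gamma):
    # N_k = sum_j gamma[j, k]  -- equation (13)
    N = gamma.sum(axis=0)
    # mu_k = (1 / N_k) * sum_j gamma[j, k] * x_j  -- equation (14), for all k at once
    return (gamma.T @ X) / N[:, None]
```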
We can take a similar approach to calculate the MLE of the two remaining parameters, namely $\Sigma_k$ and $\pi_k$. The derivations are more involved, since $\Sigma$ is a matrix and $\pi$ is subject to the constraints that apply to any categorical distribution: all elements must be non-negative and must sum up to one (the latter constraint is typically handled with a Lagrange multiplier). For my own personal reference and the curious-minded, here is a link to a resource that contains the full derivation. But the fundamental idea is that we would start from the log likelihood function and work our way to the solution. The solutions are presented below:
\[\begin{align} \hat{\Sigma_k} &= \frac{1}{N_k} \sum_{j=1}^n \gamma_{j, k} (x_j - \mu_k) (x_j - \mu_k)^\top \\ \hat{\pi_k} &= \frac{N_k}{n} \end{align} \tag{15}\]So the full picture is now complete: given the inter-dependence of these derived quantities, we seek to optimize them iteratively using the Expectation Maximization algorithm. Specifically, the EM method works as follows (a minimal code sketch follows the list):
- Initialize the parameters with some value
- E-step: Estimate the posterior probability $\gamma$ using the other two parameters
- M-step: Re-estimate $\pi$, $\mu$, and $\Sigma$ using the $\gamma$ calculated in the previous step
- Repeat until convergence
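Putting the pieces together, here is a compact sketch of the full EM loop, reusing the `responsibilities` and `gmm_log_likelihood` helpers defined above. The initialization scheme and the convergence test on the log likelihood are deliberately simple and just one of many reasonable choices.

```python
import numpy as np

def fit_gmm(X, m, n_iter=100, tol=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Simple initialization: uniform weights, randomly chosen data points as means,
    # identity covariances. Many other initialization schemes are possible.
    pi = np.full(m, 1.0 / m)
    mus = X[rng.choice(n, size=m, replace=False)]
    covs = np.array([np.eye(d) for _ in range(m)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior responsibilities gamma[j, k]  -- equation (9)
        gamma = responsibilities(X, pi, mus, covs)
        # M-step: re-estimate pi, mu, and Sigma  -- equations (12) and (15)
        N = gamma.sum(axis=0)
        pi = N / n
        mus = (gamma.T @ X) / N[:, None]
        covs = np.array([
            (gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / N[k]
            for k in range(m)
        ])
        # Stop once the log likelihood (equation (7)) barely changes.
        ll = gmm_log_likelihood(X, pi, mus, covs)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mus, covs
```

On the synthetic data sampled earlier, `fit_gmm(X, m=2)` should recover means close to $(0, 0)$ and $(3, 3)$, up to a permutation of the two clusters.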
Conclusion
In today’s post, we took a deep dive into Gaussian mixture models. I find GMMs to be conceptually very intuitive and interesting at the same time. I’m also personally satisfied and glad that I have learned yet another ML/mathematical concept that starts with the word Gaussian. Much like how I felt when learning about Gaussian process regression, now I have an even greater respect for the Gaussian distribution, although I should probably be calling it normal instead, just like everybody else.
I’m also happy that I was able to write this blog post in just a single day. Of course, this is in large part due to the fact that I had spent some time a few weeks ago studying this material, but nonetheless I think I’m starting to find the optimal balance between intern dev work and self-studying of math and machine learning.
I hope you’ve enjoyed reading this post. In a future post, I will be implementing Gaussian mixture models in Python from scratch. Stay tuned for more!