MLE and KL Divergence

6 minute read

These days, I’ve been spending some time trying to read published research papers on neural networks to gain a more solid understanding of the math behind deep learning. This is a rewarding yet also a very challenging endeavor, mostly because I have not studied enough math to really understand all of what is going on.

While reading the groundbreaking research paper Wasserstein GAN by Martin Arjovsky, I came across this phrase:

… asymptotically, maximum likelihood estimation amounts to minimizing the Kullback-Leibler divergence…

I was particularly interested in the last portion of this sentence, that MLE amounts to minimizing KL divergence. We discussed MLE multiple time on this blog, including this introductory post and a related post on MAP. Neither is KL divergence an entirely unfamiliar topic. However, I had not thought about these two concepts together in one setting. In this post, let’s try to hash out what the quote from the paper means.

Revisiting Definitions

Let’s start with a very quick review of what MLE and KL divergence each are. After all, it’s been a while since I’ve written the linked posts, and for a fruitful, substantive discussion on this topic, it’s necessary to make sure that we have a solid grasp of what MLE and KL divergence are.

Maximum Likelihood Estimation

MLE is a technique used to find the optimal parameter of a distribution that best describes a set of data. To cut to the chase, this statement can be expressed as follows:

\[\theta_{MLE} = \mathop{\rm arg\,max}\limits_{\theta} P(x \vert \theta) \tag{1}\]

From here, we can start making assumptions, such as that observations in $x$ are i.i.d, which is the assumption that we make to build models such as naïve Bayes, and so on. For now, it suffices to clarify that the goal of maximum likelihood estimation is to find the optimal parameter of a distribution that best captures some given data.

Kullback-Leibler Divergence

KL divergence is a concept that arises from the field of information theory that is also heavily applied in statistics and machine learning. KL divergence is particularly useful because it can be used to measure the dissimilarity between to probability distributions.

The familiar equation for KL divergence goes as follows:

\[\begin{align} D_{KL}(P∥Q) &= \mathbb{E}_{x \sim P(x)}\left[\log \frac{P(x)}{Q(x)}\right] \\ &= \int_{- \infty}^\infty P(x) \log \frac{P(x)}{Q(x)} \, dx \end{align} \tag{2}\]

In Bayesian terms, KL divergence might be used to compare the prior and the posterior distribution, where $p$ represents the posterior and $q$, the prior. In machine learning, $p$ is often the true distribution which we seek to model, and $q$ is the approximation of that true distribution, which is also the prediction generated by the model.

Note that KL divergence is not a true measure of distance, since it is asymmetric. In other words,

\[D_{KL}(P∥Q) \neq D_{KL}(Q∥P)\]

The focus of this post is obviously not on distance metrics, and I plan on writing a separate post devoted to this topic. But as a preview of what is to come, here is an appetizer to get you interested. An alternative to KL divergence that satisfies the condition of symmetry is the Jensen-Shannon Divergence, which is defined as follows:

\[\text{JSD}(P∥Q) = \frac12 D_{KL}(P∥M) + \frac12 D_{KL}(M∥P) \tag{3}\]


\[M = \frac12(P + Q)\]

One can intuit JSD as being a measurement that somewhat averages the two asymmetric quantities of KL divergence. We will revisit JSD in the future when we discuss the mathematics behind GANs. But for now, it suffices to know what KL divergence is and what it measures.

The Proof

Now that we have reviewed the essential concepts that we need, let’s get down to the proof.


Let’s start with the statement of the parameter $\theta$ that minimizes the KL divergence between the two distribution $P(x \vert \theta^*)$ and the approximate distribution $P(x \vert \theta)$:

\[\begin{align} \theta_\text{min KL} &= \mathop{\rm arg\,min}\limits_{\theta} D_{KL}\left[P(x \vert \theta^*)∥P(x \vert \theta)\right] \\ &= \mathop{\rm arg\,min}\limits_{\theta} \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log \frac{P(x \vert \theta^*)}{P(x \vert \theta)}\right] \\ &= \mathop{\rm arg\,min}\limits_{\theta} \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log P(x \vert \theta^*) - \log P(x \vert \theta)\right] \end{align} \tag{4}\]

Not a lot has happened in this step, except for substituting the $D_{KL}$ expression with its definition as per (2). Observe that in the last derived expression in (4), the term $\log P(x \vert \theta^*)$ does not affect the argument of the minima, which is why it can safely be omitted to yield the following simplified expression:

\[\begin{align} \theta_\text{min KL} &= \mathop{\rm arg\,min}\limits_{\theta} \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[- \log P(x \vert \theta)\right] \\ &= \mathop{\rm arg\,max}\limits_{\theta} \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log P(x \vert \theta)\right] \end{align} \tag{5}\]

We can change the argument of the minima operator to the maxima given the negative sign in the expression for the expected value.

Law of Large Numbers

To proceed further, it is necessary to resort to the Law of Large Numbers, or LLN for short. The law states that the average of samples obtained from a large number of repeated trials should be close to the expected value of that random variable. In other words, the average will approximate the expected value as more trials are performed.

More formally, LLN might be stated in the following fashion. Suppose we perform an experiment involving the random variable $X$ and repeat it $n$ times. Then, we would obtain a set of indecent and identically distributed (i.i.d) samples as shown below:

\[X_1, X_2, \cdots , X_n\]

Then, LLN states that

\[\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i = \mu \tag{6}\]

A more precise statement of the law uses Chebyshev’s inequality:

\[P\left(\left \vert \frac{\sum_{i=1}^n X_i}{n} - \mu \right \vert \geq \epsilon \right) \leq \frac{\sigma^2}{n \epsilon^2} \tag{7}\]

For the curious, here is the general formulation of Chebyshev’s inequality outside the context of LLN:

\[P(\lvert X - \mu \rvert \geq k \sigma^2) \leq \frac{1}{k^2} \tag{8}\]

For the purpose of this post, it is not necessary to go into how Chebyshev’s inequality is derived or what it means. However, it isn’t difficult to see how one might reformulate (8) to derive (7) to prove the Law of Large Numbers. All that the inequality is saying is that no more than a certain fraction of samples can fall outside more than a certain distance away from the mean of the distribution.

With this understanding in mind, let’s return to the original problem and wrap up the proof.

Application to the Problem

Let’s apply the Law of Large Numbers to modify the expected value expression sitting in (5):

\[\begin{align} \theta_\text{min KL} &= \mathop{\rm arg\,max}\limits_{\theta} \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log P(x \vert \theta)\right] \\ &= \mathop{\rm arg\,max}\limits_{\theta} \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n \log P(x_i \vert \theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \log P(x \vert \theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} P(x \vert \theta) \\ &= \theta_{MLE} \end{align} \tag{9}\]

Voila! We have shown that minimizing the KL divergence amounts to finding the maximum likelihood estimate of $\theta$. This was not the shortest of journeys, but it is interesting to see how the two concepts are related. Indeed, it sort of makes intuitive sense to think that minimizing the distance between the true and approximated distribution is best done through maximum likelihood estimation, which is a technique used to find the parameter of the distribution that best describes given data.

I personally find little derivations and proofs like these to be quite interesting, which is why I plan on doing more posts on the mathematics of deep learning and its related concepts in the future. Thanks for reading, and catch you up in the next one.