Maximum A Posteriori Estimation

In a previous post on likelihood, we explored the concept of maximum likelihood estimation, a technique for estimating the parameters of a distribution. In today’s post, we will take a look at another technique, known as maximum a posteriori estimation, or MAP for short. MLE and MAP are distinct methods, but they are more similar than different. We will explore the shared mathematical underpinnings of the two methods to gain a better understanding of how distributions can be tweaked to best fit some given data. Let’s begin!

Maximum Likelihood Estimation

Before we jump right into comparing MAP and MLE, let’s refresh our memory on how maximum likelihood estimation worked. Recall that likelihood is defined as

\[\mathcal{L}(\theta \vert x) = P(X = x \vert \theta) \tag{1}\]

In other words, the likelihood of some model parameter $\theta$ given data observations $X$ is equal to the probability of seeing $X$ given $\theta$. Thus, likelihood and probability are intimately related concepts that describe the same landscape, only from different angles.

The objective of maximum likelihood estimation, then, is to determine the values for a distribution’s parameters such that the likelihood of observing some given data is maximized under that distribution. In the example in the previous post on likelihoods, we showed that MLE for a normal distribution amounts to setting $\mu$ to the sample mean and $\sigma^2$ to the sample variance. But this convenient closed form is specific to the Gaussian distribution. More generally, maximum likelihood estimation can be expressed as:

\[\begin{align} \theta_{MLE} &= \mathop{\rm arg\,max}\limits_{\theta} P(X \vert \theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \prod_{i = 1}^n P(x_i \vert \theta) \end{align} \tag{2}\]

It is not difficult to see why computing this quantity directly can be troublesome: because we are dealing with probabilities, which are by definition at most 1, their product quickly approaches 0, which can cause arithmetic underflow. Therefore, we typically use log likelihoods instead. Maximizing the log likelihood amounts to maximizing the likelihood itself, since log is a monotonically increasing function.

\[\begin{align} \theta_{MLE} &= \mathop{\rm arg\,max}\limits_{\theta} \log P(X \vert \theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \log \prod_{i = 1}^n P(x_i \vert \theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \sum_{i = 1}^n \log P(x_i \vert \theta) \end{align} \tag{3}\]

Finding the maximum can be achieved in multiple ways, such as by taking derivatives and setting them to zero, or via gradient-based optimization.
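To make this concrete, here is a minimal sketch of numerical MLE in the spirit of (3), assuming normally distributed data. The synthetic data, the log-parameterization of $\sigma$, and the use of scipy.optimize.minimize are illustrative choices on my part, not anything prescribed above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=500)  # synthetic observations (assumed)

def neg_log_likelihood(params):
    """Negated sum of log P(x_i | mu, sigma), as in (3), for a minimizer."""
    mu, log_sigma = params        # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_mle, sigma_mle = result.x[0], np.exp(result.x[1])
print(mu_mle, sigma_mle)  # approximately the sample mean and (biased) sample std
```

As expected for the Gaussian case, the recovered estimates coincide with the sample mean and the square root of the (biased) sample variance.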

Maximum A Posteriori

As the name suggests, maximum a posteriori is an optimization method that seeks to maximize the posterior distribution in a Bayesian context, which we dealt with in a previous post. Recall that Bayesian analysis starts from a number of components, namely the prior, likelihood, evidence, and posterior. Concretely,

\[\begin{align} P(\theta \vert X) &= \frac{P(X \vert \theta) P(\theta)}{P(X)} \propto P(X \vert \theta) P(\theta) \end{align} \tag{4}\]

The objective of Bayesian inference is to estimate the posterior distribution, which is often intractable to compute directly because of the evidence term in the denominator, by working with the product of the likelihood and the prior. This process can be repeated as more data flows in, which is how posterior updates are performed. We saw this mechanism in action with the example of a coin flip, using a binomial likelihood function and a beta prior, which form a conjugate pair.
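As a quick refresher, here is a minimal sketch of that conjugate updating; the prior pseudo-counts and the coin-flip batches below are made up for illustration.

```python
# Beta(a, b) prior on the coin bias; with a binomial likelihood, each batch
# of flips updates the posterior to Beta(a + heads, b + tails).
a, b = 2.0, 2.0                # assumed Beta prior pseudo-counts

batches = [(7, 10), (4, 10)]   # assumed (heads, flips) per incoming batch
for heads, flips in batches:
    a += heads                 # conjugacy: add observed heads to a
    b += flips - heads         # ...and observed tails to b
    print(f"posterior: Beta({a}, {b})")
```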

Then what does maximizing the posterior mean in the context of MAP? With some thinking, we can convince ourselves that maximizing the posterior distribution amounts to finding the optimal parameters of a distribution that best describe the given data set. This can be seen by simply interpreting the posterior from a conditional probability point of view: the posterior denotes the probability that the model parameter takes the value $\theta$ given data $X$. Put differently, the value of $\theta$ that maximizes the posterior is the parameter value that best explains the sample observations. This is why, at its heart, MAP is not so different from MLE: although MLE is frequentist while MAP is Bayesian, the underlying objectives of the two methods are fundamentally identical. And indeed, this similarity can also be seen through the math. Since the evidence $P(X)$ does not depend on $\theta$, it can be dropped from the maximization, leaving only the likelihood and the prior from (4):

\[\begin{align} \theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} P(\theta \vert X) \\ &= \mathop{\rm arg\,max}\limits_{\theta} P(X \vert \theta) P(\theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \log P(X \vert \theta) + \log P(\theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \log \prod_{i = 1}^n P(x_i \vert \theta) + \log P(\theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \sum_{i = 1}^n \log P(x_i \vert \theta) + \log P(\theta) \end{align} \tag{5}\]

And we see that (5) is almost identical to (3), the formula for MLE! The only difference is the additional term at the end: the log prior. What does this difference mean intuitively? Simply put, if we specify a prior distribution for the model parameter, the objective is no longer determined solely by the likelihood of each data point; it is also weighted by the specified prior. Consider the prior as an additional “constraint,” construed in a loose sense: the optimal parameter not only has to conform to the given data, but also must not deviate too much from the established prior.
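Here is a sketch of (5) in code: the MAP objective is the MLE objective of (3) plus a log-prior term. The standard normal prior on $\mu$, the fixed $\sigma$, and the small sample size are all assumptions chosen to make the prior’s pull visible.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=20)  # few points, so the prior matters

def neg_log_posterior(mu):
    """Negated (log-likelihood + log-prior), the MAP objective from (5)."""
    log_lik = np.sum(norm.logpdf(x, loc=mu, scale=2.0))  # sigma held fixed (assumed)
    log_prior = norm.logpdf(mu, loc=0.0, scale=1.0)      # assumed N(0, 1) prior
    return -(log_lik + log_prior)

mu_map = minimize_scalar(neg_log_posterior).x
print(mu_map, x.mean())  # the MAP estimate sits between the sample mean and 0
```

With only 20 observations, the prior noticeably pulls the estimate toward 0; as the sample grows, the likelihood term dominates and the MAP estimate converges toward the MLE.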

To get a more intuitive hold of the role that a Bayesian prior plays in MAP, let’s assume the simplest, most uninformative prior we can consider: the uniform distribution. A uniform prior encodes no belief about the value of the parameter, i.e. all values of $\theta$ are considered equally probable. The implication of this choice is that the prior collapses to a constant. Given the form of the MAP formula derived in (5), constants can safely be ignored, as they do not contribute to the maximization in any way. Concretely,

\[\begin{align} \theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} \sum_{i = 1}^n \log P(x_i \vert \theta) + \log P(\theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \sum_{i = 1}^n \log P(x_i \vert \theta) + C \\ &= \mathop{\rm arg\,max}\limits_{\theta} \sum_{i = 1}^n \log P(x_i \vert \theta) \end{align} \tag{6}\]

Therefore, in the case of a uniform prior, we see that MAP essentially boils down to MLE! This is an informative result that tells us that, at their core, MLE and MAP seek to perform the same operation. However, MAP, being a Bayesian approach, takes a specified prior into account, whereas the frequentist MLE relies on the data alone, since probabilities are considered objective limits of infinitely repeated trials rather than subjective beliefs, as a Bayesian statistician would hold.
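As a numerical sanity check of (6), here is a hypothetical coin-flip setup showing that a flat prior contributes only a constant, so the MAP and MLE estimates coincide. The observed counts are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

heads, flips = 7, 10  # assumed observations

def neg_log_lik(p):
    # Binomial log-likelihood up to a constant (the binomial coefficient)
    return -(heads * np.log(p) + (flips - heads) * np.log(1 - p))

def neg_log_post_flat(p):
    return neg_log_lik(p) + 0.0  # log of a uniform prior is a constant

kwargs = dict(bounds=(1e-6, 1 - 1e-6), method="bounded")
p_mle = minimize_scalar(neg_log_lik, **kwargs).x
p_map = minimize_scalar(neg_log_post_flat, **kwargs).x
print(p_mle, p_map)  # both ~ 0.7 = heads / flips
```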

I hope you enjoyed reading this post. See you in the next one!