An Introduction to Markov Chain Monte Carlo

17 minute read

Finally, here is the post that was promised ages ago: an introduction to Markov Chain Monte Carlo, or MCMC for short. It took a while for me to understand how MCMC models work, not to mention the task of representing and visualizing them in code. To add a bit more to the excuse, I did dabble in some other topics recently, such as machine learning models and information theory, which is also partially why this post was delayed quite a bit. Nevertheless, it’s finally here and ready to go. In this post, we will take a look at the Metropolis-Hastings algorithm, the simplest variant among the family of MCMC models. Let’s see what the Bayesian hype is all about.

Refresher on Bayesian Analysis

It’s been a while since we last discussed Bayesian inference, so it’s probably a good idea to start with a brief recap.

Bayesian Analysis

Bayesian statistics starts from Bayes’ theorem, a seminal statement in probability theory that can be expressed as follows:

\[P(A \vert B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \vert A) P(A)}{P(B)} \tag{1}\]

Alternatively, Bayes’ theorem can be stated more generally in the context of some partitioned sample space,

\[P(A_i \vert B) = \frac{P(A_i \cap B)}{P(B)} = \frac{P(B \vert A_i) P(A_i)}{\sum_j P(B \vert A_j) P(A_j)} \tag{2}\]

For simplicity, I omit the equation for the case involving continuous random variables. Simply put, the summation expression in the denominator would be replaced by an integral.

The power of the proposition underlying Bayes’ theorem really comes to light when we consider it in the context of Bayesian analysis. The objective of Bayesian statistical analysis is to update a preestablished belief, called the prior, into a new belief, known as the posterior, given a series of data observations, which enter the formula through the likelihood and the evidence. Concretely,

\[P(\theta \vert x) = \frac{P(x \vert \theta) P(\theta)}{P(x)} \tag{3}\]

This statement is equivalent to

\[P_\text{Posterior} = \frac{\mathcal{L} \cdot P_\text{Prior}}{P_\text{Evidence}} \tag{4}\]

where $\mathcal{L}$ denotes the likelihood function. In plain language, Bayesian statistics operates on the assumption that all probabilities are reflections of subjective beliefs about the distribution of some random variable. A prior expectation or belief we might have about this distribution is referred to as the prior. Then, we can update our prior belief based on sample observations, resulting in a posterior distribution. Roughly speaking, the posterior can be considered as the “average” between the prior and the observed data. This process, which we went over in detail in this post, is at the heart of Bayesian inference, a powerful tool through which data and distributions can be understood.
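
To make the pieces of (3) concrete, here is a minimal, hypothetical example (the coin probabilities, priors, and flip counts below are made up purely for illustration): we weigh two competing hypotheses about a coin against ten observed flips.

import numpy as np
from scipy.stats import binom

# Hypothetical setup: the coin is either fair (p = 0.5) or biased (p = 0.8),
# and we assign each hypothesis a prior probability of 0.5.
priors = np.array([0.5, 0.5])

# Likelihood of observing 8 heads in 10 flips under each hypothesis
likelihoods = np.array([binom(10, 0.5).pmf(8), binom(10, 0.8).pmf(8)])

# Evidence: the denominator of (3), summed over both hypotheses as in (2)
evidence = np.sum(likelihoods * priors)

# Posterior beliefs about each hypothesis
posteriors = likelihoods * priors / evidence
print(posteriors)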

Problem Statement

Theoretically, everything should work fine: given some prior and some sample observation data, we should be able to derive a posterior distribution for the random variable of interest. No big deal.

Or so it seems. If we take a look again at equation (3), we will realize that there is an evidence term sitting in the denominator that we have to calculate. The formula for the evidence can be expressed as follows:

\[P(x) = \int_\Theta P(x, \theta) \, \mathrm{d}\theta\]

Computing this quantity is not as easy as it may appear. Indeed, this is one of the reasons why the Bayesian way of thinking was eschewed for so long by statisticians: prior to the advent of calculators and computers, mathematicians had trouble deriving a closed-form expression for the evidence term with just pen and paper. We might consider options other than direct calculation, such as Monte Carlo approximation or deriving a proportionality expression by treating the evidence as a constant. Indeed, the latter is the approach we took in this post on Bayesian inference: using a beta prior and a binomial likelihood, we used the property of conjugacy to derive a formula for the posterior. However, this raises yet another question: what if the prior and likelihood do not have a conjugate relationship? What if we have a very messy prior or a complicated likelihood function, so convoluted that calculating the posterior is near impossible? Simple Monte Carlo approximation might not do because of a problem called the curse of dimensionality: the volume of the sample space increases exponentially with the number of dimensions. In high dimensions, the brute-force Monte Carlo approach may not be the most appropriate.
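
To see why the naive approach struggles, it helps to see what it looks like in the first place. Below is a minimal sketch (not something we will rely on later in this post): draw parameter values from the prior and average the likelihood of the data under each draw, which approximates the evidence integral above. The toy data, prior, and likelihood here are illustrative stand-ins. With a one-dimensional parameter this works fine; in high dimensions, the vast majority of prior draws land where the likelihood is vanishingly small, so the estimate becomes extremely noisy.

import numpy as np
from scipy.stats import norm

np.random.seed(0)
x = np.random.randn(20)                             # toy observations

# Draw parameter values from the prior, theta ~ N(0, 1)
prior_draws = np.random.normal(0, 1, size=10_000)

# Likelihood of the whole data set under each draw (unit variance assumed)
likelihoods = np.prod(norm.pdf(x[None, :], loc=prior_draws[:, None]), axis=1)

# Averaging the likelihood over prior draws approximates the evidence integral
evidence_estimate = likelihoods.mean()
print(evidence_estimate)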

Markov Chain Monte Carlo seeks to solve this conundrum of posterior derivation in high-dimensional sample spaces. And indeed, it does a pretty good job of it.

Markov Chain Monte Carlo

How does Markov Chain Monte Carlo get around the problem outlined above? To see this, we need to understand the two components that comprise Markov Chain Monte Carlo: Markov chains and Monte Carlo methods.

Markov Chains

We covered the topic of Markov chains in two posts, one on PageRank and the other on the game of chutes and ladders. Nonetheless, a quick recap would help. [Wikipedia] defines Markov chains as follows:

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

In other words, a Markov chain is a method of generating a sequence of random variables where each value probabilistically depends only on the previous value. Equivalently, the next value of the random variable depends only on its current state. To put this into context, we used Markovian analysis to assess the probability that a user on the internet would move from one site to another when analyzing Google’s PageRank algorithm. Markov chains also popped up when we dealt with chutes and ladders, since the next position of a player in the game only depends on their current position on the game board. These examples all demonstrate the Markov property, also known as memorylessness.
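
As a quick refresher in code, here is a minimal sketch of a two-state Markov chain; the transition probabilities below are made up for illustration. Notice that the next state is drawn using only the current state, nothing else: this is the Markov property in action.

import numpy as np

np.random.seed(42)

# Made-up transition matrix: row i gives P(next state | current state = i)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

state = 0
visits = np.zeros(2)
for _ in range(100_000):
    # The next state depends only on the current state
    state = np.random.choice(2, p=P[state])
    visits[state] += 1

# Long-run fraction of time spent in each state
print(visits / visits.sum())

The printed fractions stabilize to the same values no matter which state we start from, a property we will come back to later when we touch on stationary distributions.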

Later on, we will see how Markov chains come in handy when we decide to “jump” from one number to the next when sampling from the posterior distribution to derive an approximation of the parameter of interest.

Monte Carlo

We also explored Monte Carlo methods in some detail here on this blog. For those of you who haven’t read that post already, I highly recommend it, as we developed a good intuition for when Monte Carlo simulations can come in handy for tasks of varying difficulty. To cut to the chase, Monte Carlo methods are used to solve problems that are intractable or computationally expensive. Instead of systematically deriving a closed-form solution, we can opt for a scheme of random sampling and hope that, with a sufficient sample size, we will eventually be able to derive an approximation of the parameter. Although this seems stupid at first, it is an incredibly powerful approach to solving many problems, including the one presented here involving posterior calculation in Bayesian inference.
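
The textbook illustration of this idea is estimating $\pi$ by throwing random points at the unit square and counting how many land inside the quarter circle. The sketch below is purely illustrative, but it captures the flavor of the approach: no closed-form derivation, just random sampling and counting.

import numpy as np

np.random.seed(42)

# Throw random points into the unit square
n = 1_000_000
points = np.random.rand(n, 2)

# A point lies inside the quarter circle if x^2 + y^2 <= 1
inside = (points ** 2).sum(axis=1) <= 1.0

# The fraction of points inside approximates pi / 4
print(4 * inside.mean())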

Metropolis-Hastings

The [Metropolis-Hastings algorithm] is one of the first Markov Chain Monte Carlo models, developed in the mid-20th century to simulate particle movement. More advanced MCMC models have been introduced since; however, the Metropolis-Hastings algorithm still deserves our attention, as it demonstrates the basis of how many Markov Chain Monte Carlo models operate.

Let’s get into the details of the model. At the heart of Metropolis-Hastings is the proposal distribution,

\[\theta_\text{Proposed} \sim \mathcal{N}(\theta_\text{Current}, \sigma_0)\]

which we use to simulate the Markov chain random-walk part of the model. The proposal distribution is centered at the current value of the parameter, and its width $\sigma_0$, along with the starting value of the chain, can be chosen more or less arbitrarily. Theoretically, regardless of these choices, the MCMC model would give us the same result after infinitely many iterations of sampling. In the Metropolis-Hastings model considered here, the proposal distribution is assumed to be normal.

Next, we have to decide whether or not to accept the proposed value of $\theta$. Accepting a randomly sampled value means adding it to our list of historic observations; if we draw a histogram of the entries in this list, the hope is that we end up with a close approximation of the posterior distribution. Accepting a value is often referred to as a “jump,” because we can visualize this process as a random walk in the posterior sample space from point $\theta_\text{Current}$ to $\theta_\text{Proposed}$. The question is, how do we decide whether to jump or not? The answer lies in Bayes’ theorem:

\[P(\theta \vert x) = \frac{P(x \vert \theta) P(\theta)}{P(x)}\]

If, for example,

\[P(\theta_\text{Proposed} \vert x) > P(\theta_\text{Current} \vert x) \tag{5}\]

we should accept the value and perform the jump, because this means that the new proposed parameter does a better job of explaining the data than does the current one.

But recall the dilemma we discussed earlier: how do we compute the posterior? After all, the complexity of calculating the evidence was the very reason why scientists came up with MCMC in the first place. Well, here’s a clever trick: rearrange (5) as a ratio to get rid of the evidence term in the denominator. In other words, (5) can be re-expressed as

\[\frac{P(\theta_\text{Proposed} \vert x)}{P(\theta_\text{Current} \vert x)} > 1\]

which means

\[\frac{\frac{P(x \vert \theta_\text{Proposed}) P(\theta_\text{Proposed})}{P(x)}}{\frac{P(x \vert \theta_\text{Current}) P(\theta_\text{Current})}{P(x)}} = \frac{P(x \vert \theta_\text{Proposed}) P(\theta_\text{Proposed})}{P(x \vert \theta_\text{Current}) P(\theta_\text{Current})} > 1 \tag{6}\]

The evidence term nicely disappears, giving us an expression that we can easily evaluate! This is how Metropolis-Hastings resolves the dilemma of evidence computation: very simple yet surprisingly effective algebra.

Before we move into code, there is one correction to make. In (6), we derived an expression for the jump condition. That condition is not wrong per se, yet a slight modification has to be made to fully capture the gist of Metropolis-Hastings. The precise jump condition for the sampler goes as follows:

\[\frac{P(x \vert \theta_\text{Proposed}) P(\theta_\text{Proposed})}{P(x \vert \theta_\text{Current}) P(\theta_\text{Current})} > \alpha \tag{7}\]

where

\[\alpha \sim U(0, 1)\]

This simply means that we accept the proposed parameter if the quantity calculated in (7) is larger than a random number sampled from a uniform distribution between 0 and 1. This is why MCMC models involve a form of random walk: while leaving room for somewhat unlikely parameters to be selected, the model samples relatively more from regions of high posterior probability.
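
Putting (7) into pseudocode-like Python, a single accept/reject decision might look roughly like the sketch below. The prior and likelihood arguments are placeholder callables of my own naming, not part of any library; the full, concrete sampler appears in the next section.

import numpy as np
from scipy.stats import norm

def metropolis_step(theta_current, data, prior, likelihood, proposal_width=0.5):
    # Draw a proposal centered at the current value of theta
    theta_proposed = norm(theta_current, proposal_width).rvs()

    # Ratio of the Bayes numerators, as in (7); the evidence has cancelled out
    numerator = likelihood(data, theta_proposed) * prior(theta_proposed)
    denominator = likelihood(data, theta_current) * prior(theta_current)
    ratio = numerator / denominator

    # Accept the jump if the ratio exceeds a uniform random number in [0, 1]
    alpha = np.random.uniform(0, 1)
    return theta_proposed if ratio > alpha else theta_current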

Implementation in Python

Now that we have some understanding of how Markov Chain Monte Carlo and the Metropolis-Hastings algorithm work, let’s implement the MCMC sampler in Python. As per convention, listed below are the dependencies required for this demonstration.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
sns.set_style('darkgrid')
np.random.seed(123)

Let’s start by generating some toy data for our analysis.

data = np.random.randn(20)
data
array([-1.0856306 ,  0.99734545,  0.2829785 , -1.50629471, -0.57860025,
        1.65143654, -2.42667924, -0.42891263,  1.26593626, -0.8667404 ,
       -0.67888615, -0.09470897,  1.49138963, -0.638902  , -0.44398196,
       -0.43435128,  2.20593008,  2.18678609,  1.0040539 ,  0.3861864 ])

It’s always a good idea to plot the data to get a sense of its shape.

ax = plt.subplot()
sns.distplot(data, kde=False, ax=ax, color="skyblue")
_ = ax.set(xlabel='x', ylabel='Count')

The data looks roughly normal. This is because we created the toy data using the numpy.random.randn function, which generates random numbers from a standard normal distribution centered around 0. The task for this tutorial, given this data, is to estimate the posterior distribution of the mean, assuming we know the standard deviation of the data-generating distribution to be 1.

Analytical Posterior

The benefit of working with this simple toy example is that we can analytically derive a closed-form expression for the posterior distribution. This is because a normal prior is conjugate to a normal likelihood with known variance, meaning that the posterior distribution of the mean will also turn out to be normal. If you are wondering whether this property of conjugacy is related to the notion of conjugacy discussed above in the context of Bayesian inference, you are exactly correct: statisticians have a laundry list of distributions with conjugate relationships, accessible in this Wikipedia article. The bottom line is that we can calculate the posterior analytically, which essentially gives us an answer key against which we can evaluate our implementation of the Metropolis-Hastings algorithm. The equation for the posterior is presented below.

\[\mu \sim \mathcal{N}\left(\frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}} \left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i = 1}^n x_i}{\sigma^2} \right) , \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1} \right) \tag{8}\]

This assumes that the data is normally distributed with known variance

\[x_i \vert \mu \sim \mathcal{N}(\mu, \sigma) \tag{9}\]

and that the prior is normal, representable as

\[\mu \sim \mathcal{N}(\mu_0, \sigma_0) \tag{10}\]

For a complete mathematical derivation of (8), refer to this document.

As we will see later on, we use (9) to calculate the likelihood and (10) to calculate the prior. For now, however, let’s focus on the analytical derivation of the posterior. This can be achieved by translating the equation in (8) into code as shown below.

def normal_posterior(data, x, mu_prior, sigma_prior, sigma=1):
    """Analytic posterior density of the mean, evaluated at x, per equation (8)."""
    n = len(data)
    # Posterior variance: (1/sigma_0^2 + n/sigma^2)^-1
    sigma_posterior = (1/sigma_prior**2 + n/sigma**2)**-1
    # Posterior mean: precision-weighted combination of the prior mean and the data
    mu_posterior = sigma_posterior * (mu_prior/sigma_prior**2 + data.sum()/sigma**2)
    return norm(mu_posterior, np.sqrt(sigma_posterior)).pdf(x)

Let’s see what the posterior looks like given the prior $\mu \sim \mathcal{N}(0, 1)$, that is, $\mu_0 = 0$ and $\sigma_0 = 1$. For sample observations, we use the toy data set data we generated earlier.

ax = plt.subplot()
x = np.linspace(-1, 1, 100)
posterior = normal_posterior(data, x, 0, 1)
ax.plot(x, posterior, color="skyblue")
ax.set(xlabel=r'$\mu$', ylabel='Posterior Belief')
sns.despine()
plt.show()

There we have it: the posterior distribution given the sample observations and a normal prior. As expected, the result is normal, thanks to the property of conjugacy. Another observation we might make is that the posterior mean seems to be slightly larger than 0. This is an expected result, given that the mean of the numbers in our toy data set is larger than 0.

np.mean(data)
0.11441773195529023

Because the posterior can be intuited as an “average” between the observed data set and the prior, we would expect the posterior to be centered around a value greater than zero, which is indeed the case.
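
As a quick sanity check, we can plug our setup into equation (8): with $\mu_0 = 0$, $\sigma_0 = 1$, and $\sigma = 1$, the posterior mean reduces to $\sum_i x_i / (n + 1)$, which sits between the prior mean of 0 and the sample mean.

# Posterior mean from equation (8) with mu_0 = 0, sigma_0 = 1, sigma = 1
n = len(data)
posterior_mean = data.sum() / (n + 1)
print(posterior_mean)  # slightly below the sample mean, pulled toward the prior mean of 0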

Now that we have a full answer key to our problem, it’s time to build the Metropolis-Hastings sampler from scratch and compare the estimation generated by the sampler with the true analytical posterior we have just derived.

Metropolis-Hastings Sampler

Let’s go back to equation (7), which is the crux of the Metropolis-Hastings sampler. To recap, the MCMC sampler works by drawing a value from the proposal distribution, calculating the likelihood and prior for both the current and proposed values, and deciding whether the proposed value is worth accepting, i.e. whether it is worth making a jump in the random walk. All of this sounds pretty abstract when written in words, but it is a simple idea encapsulated by (7). Let’s build the Metropolis-Hastings sampler by implementing the algorithm described above in code as shown below. I looked at Thomas Wiecki’s implementation for reference and modified it to suit the purposes of this post.

def sampler(data, iter_num, mu_init=.5, proposal_width=.5, mu_prior_mu=0, mu_prior_sd=1, sigma=1):
    mu_current = mu_init
    posterior = [mu_current]
    for i in range(iter_num):
        # Draw a proposal centered at the current value of mu
        mu_proposal = norm(mu_current, proposal_width).rvs()
        # Likelihood of the data under the current and proposed values of mu
        likelihood_current = np.prod(norm(mu_current, sigma).pdf(data))
        likelihood_proposal = np.prod(norm(mu_proposal, sigma).pdf(data))
        # Prior density of the current and proposed values of mu
        prior_current = norm(mu_prior_mu, mu_prior_sd).pdf(mu_current)
        prior_proposal = norm(mu_prior_mu, mu_prior_sd).pdf(mu_proposal)
        # Numerators of Bayes' theorem; the evidence cancels in the ratio
        p_current = likelihood_current * prior_current
        p_proposal = likelihood_proposal * prior_proposal
        accept = p_proposal/p_current > np.random.rand() # Compare with random number sampled from [0, 1]
        if accept:
            mu_current = mu_proposal
        posterior.append(mu_current)
    return np.array(posterior)

Although Markov Chain Monte Carlo sounds complicated, it is really achieved by this single block of code. Of course, this code is limited in that it is only applicable to a very specific situation, namely the task of deriving the posterior given a normal prior and a normal likelihood with known variance. Nevertheless, we can glean a lot of insight from this fascinating function.

Let’s quickly test the sampler function by making it sample five estimations of the mean parameter.

sampler(data, 5)
array([ 0.5       ,  0.5       ,  0.5       , -0.12694033, -0.12694033,
       -0.19697469])

As expected, the sampler starts from 0.5, the default value of the mu_init argument. The first two proposals are rejected, so the chain stays at 0.5; the third proposal is accepted, jumping to roughly -0.127; the fourth is rejected; and the fifth is accepted, moving to roughly -0.197. With more iterations, we would expect the function to make many more jumps, gradually painting a picture of what the posterior should look like. In fact, we can create what is called a trace plot to see which values were sampled by the Metropolis-Hastings algorithm. Trace plots are important because they tell us whether or not our model is well-calibrated, i.e. whether a sufficient amount of state changes occur.

posterior = sampler(data, 15000, mu_init=1)
fig, ax = plt.subplots()
ax.plot(posterior, color="skyblue", label='Proposal Width = 0.5')
_ = ax.set(xlabel='Samples', ylabel=r'$\mu$')

The trace plot shows the values taken by the chain over 15,000 iterations. We see that there are fluctuations, indicating that state transitions occurred, but also that there seem to be values that the sampler preferred over others. Eyeballing the trace plot, the “comfort zone” seems to be slightly above 0, as we expect.

To illustrate the importance of trace plots, let’s look at an example of a bad setup: a proposal distribution with too small a width. The trace plot below shows that, although the MCMC model still samples many values, each step stays very close to the current state, making it take much longer for the sampler to explore a wide range of values and properly estimate the posterior. The bottom line is that setting the right proposal distribution is important, and trace plots are a good place to start when checking whether the proposal distribution is set up properly.

posterior_small = sampler(data, 15000, mu_init=1, proposal_width=.01)
fig, ax = plt.subplots()
ax.plot(posterior_small, color='skyblue', label='Proposal Width = 0.01')
_ = ax.set(xlabel='Samples', ylabel=r'$\mu$')
plt.legend()
plt.show()

Now, it’s time to look at the answer key and see how well our sampler has done. Let’s plot the posterior sampled by our Metropolis-Hastings sampler against the analytic posterior to see if they roughly match.

ax = plt.subplot()
sns.distplot(posterior[2000:], ax=ax, label='Estimated Posterior', color='skyblue')
x = np.linspace(-1, 1, 500)
posterior_dist = normal_posterior(data, x, 0, 1)
ax.plot(x, posterior_dist, color='silver', alpha=0.7, label='Analytic Posterior')
ax.set(xlabel=r'$\mu$', ylabel='Posterior Belief')
ax.legend()
plt.show()

Fantastic! Although the estimated posterior is not exactly equal to the analytic posterior, the two are quite similar. We could quantify how similar or different they are by using metrics such as [KL divergence], but for simplicity’s sake, let’s keep this post within the realm of Bayes as we have done so far.

This goes to show just how useful and powerful Markov Chain Monte Carlo can be: even with a complicated likelihood function in a high-dimensional space, we would be able to use a similar sampling scheme to estimate the posterior. What’s even more fascinating about Markov Chain Monte Carlo is that, regardless of the value we start off with, we will eventually be able to approximate the posterior. This is due to the Markov chain part of MCMC: one of the most interesting properties of well-behaved Markov chains is that, no matter where we start, we end up in the same [stationary distribution]. Together, these properties make MCMC models like Metropolis-Hastings incredibly useful for solving intractable problems.
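
If you want to convince yourself of the claim about starting points, a quick check with the sampler we just wrote does the trick: start the chain from a deliberately bad initial value and compare the resulting posterior means after discarding some burn-in samples.

# Start the chain far from the region of high posterior probability
posterior_far = sampler(data, 15000, mu_init=-5)

# After discarding burn-in, both chains should agree on the posterior mean
print(np.mean(posterior[2000:]), np.mean(posterior_far[2000:]))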

PyMC3

pymc3 is a library made specifically for Bayesian analysis. Of course, it includes functions that implement Markov Chain Monte Carlo models. Although building the Metropolis-Hastings algorithm from scratch was a worthy challenge, we can’t build models from scratch every time we want to conduct Bayesian analysis involving an intractable posterior, which is why packages like pymc3 always come in handy. With just a few lines of code, we can perform the exact same operation we performed above. In fact, let’s compare our Metropolis-Hastings sampler with the built-in functionality in the pymc3 library.

import pymc3 as pm

with pm.Model():
    mu = pm.Normal('mu', 0, 1)
    sigma = 1.
    returns = pm.Normal('returns', mu=mu, sd=sigma, observed=data)
    step = pm.Metropolis()
    trace = pm.sample(15000, step)
    
sns.distplot(trace[2000:]['mu'], label='PyMC3', color='silver')
sns.distplot(posterior[2000:], label='Manual', color='skyblue')
plt.xlabel(r'$\mu$')
plt.ylabel('Posterior Belief')
plt.legend()
plt.show()
Multiprocess sampling (4 chains in 4 jobs)
Metropolis: [mu]
Sampling 4 chains: 100%|██████████| 62000/62000 [00:05<00:00, 11199.61draws/s]
The number of effective samples is smaller than 25% for some parameters.

Pretty similar, wouldn’t you say?

Conclusion

Markov Chain Monte Carlo is a powerful method with which we can estimate intractable posterior distributions. It is undoubtedly one of the most important tools that a Bayesian statistician should have under their belt. And even if you are a frequentist, I still think MCMC models are worth looking at, because it’s cool to see just how easily we can estimate a distribution with little to no knowledge of the mathematics involved in calculating the posterior. It’s also fascinating to see how the marriage of two seemingly unrelated concepts that arose out of different contexts, Monte Carlo methods and Markov chains, can produce such a powerful algorithm.

In the next post, we will continue our journey down the Bayesian rabbit hole. Perhaps we will start with another application of Bayesian thinking in machine learning. If naive Bayes is your jam, make sure to tune in some other time.