Fisher Score and Information

10 minute read

Fisher’s information is an interesting concept that connects many of the dots that we have explored so far: maximum likelihood estimation, the gradient, the Jacobian, and the Hessian, to name just a few. When I first came across Fisher’s matrix a few months ago, I lacked the mathematical foundation to fully comprehend what it was. I’m still far from reaching that level of knowledge, but I thought I’d take a stab at it nonetheless. After all, I realized that sitting down to write a blog post about some concept forces me to study more, so it is a positive, self-reinforcing cycle. Let’s begin.

Fisher’s Score

Fisher’s score function is deeply related to maximum likelihood estimation. In fact, it’s something that we already know; we just haven’t defined it explicitly as Fisher’s score before.

Maximum Likelihood Estimation

First, we begin with the definition of the likelihood function. Assume some dataset $X$ where each observation is independently and identically distributed according to a true underlying distribution parametrized by $\theta$. Given this probability density function $f_\theta(x)$, we can write the likelihood function as follows:

\[p(x \vert \theta) = \prod_{i=1}^n f_{\theta}(x_i)\]

While it is sometimes the convention that the likelihood function be denoted as $\mathcal{L}(\theta \vert x)$, we opt for an alternative notation to reserve $\mathcal{L}$ for the loss function.

To continue, we know that the maximum likelihood estimate of the distribution’s parameter is given by

\[\begin{align} \theta_{MLE} &= \mathop{\rm arg\,max}\limits_{\theta} p(x \vert \theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \log p(x \vert \theta) \\ &= \mathop{\rm arg\,max}\limits_{\theta} \sum_{i=1}^n \log f_{\theta}(x_i) \end{align}\]

This is the standard drill we already know: take the derivative of the term inside the arg max, set it equal to zero, and voila! We have found the maximum likelihood estimate of the parameter.
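To make this concrete, here is a minimal sketch in Python that finds the MLE of a Gaussian mean both numerically and in closed form. The data are simulated, and the true mean of 2.0 is an assumption made purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated data; the "true" mu = 2.0 is an illustrative assumption.
rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.0, size=1000)

def neg_log_likelihood(mu):
    # Negative sum of log f_mu(x_i) for a unit-variance Gaussian;
    # constants are dropped since they do not affect the arg max.
    return 0.5 * np.sum((x - mu) ** 2)

result = minimize_scalar(neg_log_likelihood)
print(result.x)   # numerical MLE
print(x.mean())   # closed-form MLE: the sample mean
```

Both values should agree up to numerical precision, since the sample mean is exactly where the derivative of the log likelihood vanishes.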

A quick aside that may become useful later: maximizing the likelihood amounts to minimizing the loss function, since the negative log likelihood is a common choice of loss.

Fisher’s Score

Now here comes the definition of Fisher’s score function, which really is nothing more than what we’ve done above: it’s just the gradient of the log likelihood function.

\[s(\theta) = \nabla_\theta \log p(x \vert \theta)\]

In other words, we have already been implicitly using Fisher’s score to find the maximum of the likelihood function all along, just without explicitly using the term. Fisher’s score is simply the gradient or the derivative of the log likelihood function, which means that setting the score equal to zero gives us the maximum likelihood estimate of the parameter.
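Here is a minimal sketch of this idea, assuming a unit-variance Gaussian whose analytic score with respect to the mean is $\sum_i (x_i - \mu)$; the data are simulated purely for illustration.

```python
import numpy as np

# Simulated data from a unit-variance Gaussian (illustrative mean 2.0).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)

def score(mu):
    # Gradient of the log likelihood with respect to mu.
    return np.sum(x - mu)

print(score(0.0))       # far from the MLE: the score is large
print(score(x.mean()))  # at the MLE the score is (numerically) zero
```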

Expectation of Fisher’s Score

An important characteristic to note about Fisher’s score is the fact that the score, evaluated at the true value of the parameter, has an expected value of zero. Concretely, this means that given a true parameter $\theta_0$,

\[\mathbb{E}_{\theta_0}[s(\theta_0)] = 0\]

This might seem deceptively obvious: after all, the whole point of Fisher’s score and maximum likelihood estimation is to find a parameter value that would set the gradient equal to zero. This is exactly what I had thought, but there are subtle intricacies taking place here that deserve our attention. So let’s hash out exactly why the expectation of the score with respect to the true underlying distribution is zero.

To begin, let’s write out the full expression of the expectation in integral form.

\[\begin{align} \mathbb{E}_{\theta_0}[s(\theta)] &= \int_{- \infty}^\infty \nabla_\theta \log p(x \vert \theta) \cdot p(x \vert \theta_0) \, dx \\ &= \int_{- \infty}^\infty \frac{\nabla_\theta p(x \vert \theta)}{p(x \vert \theta)} \cdot p(x \vert \theta_0) \, dx \\ \end{align}\]

If we evaluate this integral at the true parameter, i.e. when $\theta = \theta_0$,

\[\begin{align} \int_{- \infty}^\infty \frac{\nabla_\theta p(x \vert \theta_0)}{p(x \vert \theta_0)} \cdot p(x \vert \theta_0) \, dx &= \int_{- \infty}^\infty \nabla_\theta p(x \vert \theta_0) \, dx \\ &= \nabla_\theta \int_{- \infty}^\infty p(x \vert \theta_0) \, dx \\ &= 0 \end{align}\]

The key part of this derivation is the use of the Leibniz rule, sometimes known as Feynman’s technique or differentiation under the integral sign. I most definitely plan to write a future post detailing an intuitive explanation of why this operation makes sense, but to prevent unnecessary digression, for now it suffices to use the rule to show that the expected value of Fisher’s score is zero at the true parameter.
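Before moving on, here is a quick Monte Carlo sanity check of this result, assuming a unit-variance Gaussian with a made-up true mean of 2.0: the per-sample score at the true parameter is $x - \mu_0$, and its average over draws from $p(x \vert \theta_0)$ should be close to zero.

```python
import numpy as np

# Draws from the true distribution; mu_0 = 2.0 is illustrative.
rng = np.random.default_rng(1)
mu_0 = 2.0
x = rng.normal(loc=mu_0, scale=1.0, size=100_000)

scores = x - mu_0     # per-sample score evaluated at the true parameter
print(scores.mean())  # ~ 0, up to Monte Carlo error
```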

Fisher’s Information Matrix

Things start to get a little more interesting (and more complicated) as we move on to the discussion of Fisher’s Information Matrix. There are two sides of the coin that we will consider in this discussion: Fisher’s information understood as the covariance matrix of the score function, and Fisher’s information understood as the Hessian of the negative log likelihood. The gist of it is that there are two different ways of understanding the same concept, and that they provide intriguing, complementary views on the information matrix.

Covariance

Before jumping into anything else, perhaps it’s instructive to review variance, covariance, and the covariance matrix. Here is a little cheat sheet to help you out (and my future self, who will most likely be reviewing this later as well).

An intuitive way to think about variance is to consider it as a measure of how far samples are from the mean. We square that quantity to prevent negative values from canceling out positive ones.

\[\begin{align} \text{Var}(X) &= \mathbb{E}[(X - \mu)^2] \\ &= \mathbb{E}[(X - \mathbb{E}[X])^2]\\ &= \mathbb{E}[X^2] - \mathbb{E}[X]^2 \end{align}\]

Covariance is just an extension of this concept applied to a comparison of two random variables instead of one. Here, we consider how two variables move in tandem.

\[\begin{align} \text{Cov}[X, Y] &= \mathbb{E}[(X - \mu_x)(Y - \mu_y)] \\ &= \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] \\ &= \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \end{align}\]

And the variance-covariance matrix is simply a matrix that contains information on the covariance of multiple random variables in a neat, compact matrix form.

\[\text{K} = \begin{pmatrix} \text{Cov}[X_1, X_1] & \text{Cov}[X_1, X_2]& \cdots & \text{Cov}[X_1, X_n] \\ \text{Cov}[X_2, X_1] & \text{Cov}[X_2, X_2]& \cdots & \text{Cov}[X_2, X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}[X_n, X_1] & \text{Cov}[X_n, X_2]& \cdots & \text{Cov}[X_n, X_n] \\ \end{pmatrix}\]

A closed-form expression for the covariance matrix $K$ given a random vector $X$, which follows immediately from the aforementioned definitions and some linear algebra, looks as follows:

\[\text{K} = \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\top]\]
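As a quick numerical check of this closed-form expression, here is a sketch that computes the covariance matrix of simulated two-dimensional data via the outer-product formula and compares it to NumPy’s built-in np.cov; the mean and covariance used to generate the data are arbitrary.

```python
import numpy as np

# Simulated 2-D data; each row of X is one draw of the random vector.
rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0.0, 1.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]],
                            size=100_000)

centered = X - X.mean(axis=0)
K = centered.T @ centered / len(X)   # E[(X - E[X])(X - E[X])^T]
print(K)
print(np.cov(X.T, bias=True))        # should agree
```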

Enough of the prologue and review, now we’re ready to start talking about Fisher.

Fisher’s Information

The information matrix is defined as the covariance matrix of the score function as a random vector. Concretely,

\[\begin{align} \text{I}(\theta) &= \text{K}_{s(\theta)} \\ &= \mathbb{E}[(s(\theta) - 0)(s(\theta) - 0)^\top] \\ &= \mathbb{E}[s(\theta)s(\theta)^\top] \end{align}\]

Note that the 0’s follow straight from the earlier observation that the expected value of the score is zero at the true parameter. Intuitively, Fisher’s information tells us how certain we are about our estimate of the parameter $\theta$. This can be seen by recognizing the apparent similarity between the definition of the covariance matrix above and the definition of Fisher’s information.

In fact, the inverse of Fisher’s information matrix provides a lower bound on the variance of any unbiased estimator of $\theta$, a result known as the Cramér-Rao Lower Bound. For the purposes of this post, I won’t get deep into what CRLB is, but there are interesting connections we can make between Fisher’s information, CRLB, and the likelihood, which we will get into later.
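To ground this a bit, here is a Monte Carlo sketch of $\text{I}(\theta) = \mathbb{E}[s(\theta)s(\theta)^\top]$ for a Gaussian parametrized by $\theta = (\mu, \sigma)$, whose Fisher information is known in closed form to be $\text{diag}(1/\sigma^2, 2/\sigma^2)$; the particular values of $\mu$ and $\sigma$ are illustrative.

```python
import numpy as np

# Illustrative true parameters for a Gaussian model.
rng = np.random.default_rng(3)
mu, sigma = 2.0, 1.5
x = rng.normal(loc=mu, scale=sigma, size=200_000)

# Per-sample score vectors (d/dmu, d/dsigma of log f), stacked as rows.
s = np.stack([(x - mu) / sigma**2,
              (x - mu)**2 / sigma**3 - 1 / sigma], axis=1)

I = s.T @ s / len(x)                          # E[s s^T] by Monte Carlo
print(I)
print(np.diag([1 / sigma**2, 2 / sigma**2]))  # analytic result
```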

Empirical Fisher’s Information

Because Fisher’s information requires computing an expectation over some probability distribution, it is often intractable. Therefore, given some dataset, we often use the empirical Fisher as a drop-in substitute for Fisher’s information. The empirical Fisher is defined quite simply as follows:

\[\frac1n \sum_{i=1}^n \nabla \log p(x_i \vert \theta) \nabla \log p(x_i \vert \theta)^\top\]

In other words, it is simply an unweighted average of the outer products of the per-observation scores: the expectation under the model is replaced by an average over the observed data. Although this is a subtlety, it is worth clarifying nonetheless.
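Here is the same Gaussian model again, this time computing the empirical Fisher at the MLE of a finite simulated dataset rather than as an expectation under the model; the dataset and parameters are, as before, illustrative.

```python
import numpy as np

# A finite simulated dataset (illustrative true parameters).
rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=1.5, size=5_000)

# Fit by maximum likelihood: sample mean and (biased) sample std.
mu_hat = x.mean()
sigma_hat = x.std()

# Per-observation score vectors at the fitted parameters.
s = np.stack([(x - mu_hat) / sigma_hat**2,
              (x - mu_hat)**2 / sigma_hat**3 - 1 / sigma_hat], axis=1)

empirical_fisher = s.T @ s / len(x)   # (1/n) sum_i grad_i grad_i^T
print(empirical_fisher)
```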

Negative Log Likelihood

Something that may not be immediately apparent, yet is nonetheless true and very important, about Fisher’s information is the fact that it is the negative expected value of the second derivative of the log likelihood. In our multivariate context where $\theta$ is a vector, the second derivative is effectively the Hessian. In other words,

\[\begin{align} \text{I}(\theta) &= - \mathbb{E}\left[\frac{\partial}{\partial \theta} s(\theta) \right] \\ &= - \mathbb{E}[\text{H}_{\log p(x \vert \theta)}] \end{align}\]

You might be wondering how the information matrix can be defined in two ways, as a covariance and as a Hessian. Indeed, this threw me off quite a bit as well, and I struggled to find and understand a good resource that explained why this was the case. Thankfully, Mark Reid’s blog and an MIT lecture contained some very helpful pointers that got me a long way. The derivation is not the easiest, but I’ll try to provide a concise version based on my admittedly limited understanding of this topic.

Let’s start from some trivially obvious statements. First, from the definition of a PDF, we know that

\[\int_{x \in X} f_{\theta}(x) \, dx = 1\]

Therefore, both the first and second derivatives of this constant function with respect to $\theta$ are zero. In multivariate speak, the gradient is a zero vector and the Hessian is a zero matrix. Using the Leibniz rule we saw earlier, we can interchange the derivative and the integral to come up with the following expressions.

\[\int \frac{\partial}{\partial \theta} f_{\theta}(x) \, dx = 0 \\ \int \frac{\partial^2}{\partial \theta^2} f_{\theta}(x) \, dx = 0 \\\]

Granted, these expressions somewhat obscure the shape of the quantities we are dealing with, namely vectors and matrices, but they are concise and intuitive enough for our purposes. With these statements in mind, let’s now begin the derivation by taking a look at the second derivative of the log likelihood, which is the derivative of the score.

From the chain and quotient rules, we know that

\[\begin{align} \frac{\partial^2}{\partial \theta^2} \log p(x \vert \theta) &= \frac{\partial}{\partial \theta} \left[\frac{ \frac{\partial}{\partial \theta} p(x \vert \theta)}{p(x \vert \theta)} \right] \\ &= \frac{\frac{\partial^2}{\partial \theta^2}p(x \vert \theta)}{p(x \vert \theta)} - \frac{\frac{\partial}{\partial \theta} p(x \vert \theta)}{p(x \vert \theta)} \cdot \frac{\frac{\partial}{\partial \theta} p(x \vert \theta)}{p(x \vert \theta)} \\ \end{align}\]

This does not look good at all. However, let’s not fall into despair, since our goal is not to calculate the second derivative or the Hessian itself, but rather its negative expected value. In calculating the expected value, we will be using integrals, which is where the seemingly trivial statements we established earlier come in handy.

\[\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log p(x \vert \theta) \right] \\= \mathbb{E}\left[\frac{\frac{\partial^2}{\partial \theta^2}p(x \vert \theta)}{p(x \vert \theta)} - \frac{\frac{\partial}{\partial \theta} p(x \vert \theta)}{p(x \vert \theta)} \cdot \frac{\frac{\partial}{\partial \theta} p(x \vert \theta)}{p(x \vert \theta)} \right] \\\]

By linearity of expectation, we can split this expectation up into two pieces.

\[\mathbb{E}\left[\frac{\frac{\partial^2}{\partial \theta^2}p(x \vert \theta)}{p(x \vert \theta)} \right] - \mathbb{E}\left[\frac{\frac{\partial}{\partial \theta} p(x \vert \theta)}{p(x \vert \theta)} \cdot \frac{\frac{\partial}{\partial \theta} p(x \vert \theta)}{p(x \vert \theta)}\right]\]

Let’s use integrals to express the first expectation.

\[\begin{align} \mathbb{E}\left[\frac{\frac{\partial^2}{\partial \theta^2}p(x \vert \theta)}{p(x \vert \theta)} \right] &= \int \frac{\frac{\partial^2}{\partial \theta^2}p(x \vert \theta)}{p(x \vert \theta)} \cdot p(x \vert \theta) \, dx \\ &= \int \frac{\partial^2}{\partial \theta^2}p(x \vert \theta) \, dx \\ &= 0 \end{align}\]

The good news is that the terms inside the integrand cancel each other out, and by the Leibniz rule and the interchange of the integral and the derivative, the first expectation evaluates to zero. This ultimately leaves us with

\[\begin{align} \mathbb{E}[\text{H}_{\log p(x \vert \theta)}] &= - \mathbb{E}\left[\frac{\frac{\partial}{\partial \theta} p(x \vert \theta)}{p(x \vert \theta)} \cdot \frac{\frac{\partial}{\partial \theta} p(x \vert \theta)}{p(x \vert \theta)}\right] \\ &= - \mathbb{E}[\nabla \log p(x \vert \theta) \nabla \log p(x \vert \theta)^\top] \\ &= - \mathbb{E}[s(\theta) s(\theta)^\top] \\ &= - \text{I}(\theta) \end{align}\]

Therefore we have established that

\[\text{I}(\theta) = - \mathbb{E}[\text{H}_{\log p(x \vert \theta)}]\]

And we’re done!
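As a final sanity check, here is a Monte Carlo sketch verifying that $-\mathbb{E}[\text{H}]$ matches the covariance form of the information matrix for the same Gaussian example with $\theta = (\mu, \sigma)$; the per-sample Hessian entries of the log density are written out in closed form, and all numbers are illustrative.

```python
import numpy as np

# Illustrative true parameters for the Gaussian model.
rng = np.random.default_rng(5)
mu, sigma = 2.0, 1.5
x = rng.normal(loc=mu, scale=sigma, size=200_000)

d = x - mu
H = np.empty((len(x), 2, 2))
H[:, 0, 0] = -1 / sigma**2                        # d^2/dmu^2 log f
H[:, 0, 1] = H[:, 1, 0] = -2 * d / sigma**3       # d^2/(dmu dsigma) log f
H[:, 1, 1] = -3 * d**2 / sigma**4 + 1 / sigma**2  # d^2/dsigma^2 log f

print(-H.mean(axis=0))                        # -E[H] by Monte Carlo
print(np.diag([1 / sigma**2, 2 / sigma**2]))  # analytic I(theta)
```

The output should match the covariance-based estimate from the earlier sketch, up to Monte Carlo error.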

Conclusion

In this post, we took a look at Fisher’s score and the information matrix. There are a lot of concepts that we can build on from here, such as the Cramér-Rao Lower Bound or natural gradient descent, both of which are interesting concepts at the intersection of machine learning and statistics.

Although the derivation is by no means mathematically robust, it nonetheless vindicates a notion that is not necessarily obvious, yet makes a lot of intuitive sense in hindsight. I personally found this video by Ben Lambert to be particularly helpful in understanding the connection between likelihood and information. The gist of it is simple: if we consider the Hessian or the second derivative to be indicative of the curvature of the likelihood function, the variance of our estimate of the optimal parameter $\theta$ is larger when the curvature is smaller, and vice versa. In a sense, the larger the value of the information matrix, the more certain we are about the estimate, and thus the more information we have about the parameter.

I hope you enjoyed reading this post. Catch you in another post, most likely on the Leibniz rule, then natural gradient descent!