Moments in Statistics
The word “moment” has many meanings. Most commonly, it connotes a slice of time. In physics, a moment refers to the rotational tendency of an object, much as torque measures the change in an object’s angular momentum. As statisticians, however, we are interested in what “moment” means in math and statistics. In this post, we will attempt to shed new light on the topic of probability distributions through moment generating functions, or MGFs for short.
Introduction to Moments
The mathematical definition of moments is actually quite simple.
- First Moment: \(\mathbf{E}(X)\)
- Second Moment: \(\mathbf{E}(X^2)\)
- Third Moment: \(\mathbf{E}(X^3)\)
- Fourth Moment: \(\mathbf{E}(X^4)\)
And of course, we can imagine how the list continues: the \(n\)th moment of a random variable is \(\mathbf{E}(X^n)\). It is worth noting that the first moment corresponds to the mean of the distribution, \(\mu\). The second moment is related to the variance, since \(\sigma^2 = \mathbf{E}(X^2) - (\mathbf{E}(X))^2\). The third moment relates to the symmetry of the distribution, or the lack thereof, a quality that goes by the name of skewness. The fourth moment relates to kurtosis, a measure of how heavy the tails of a distribution are. Higher kurtosis corresponds to more frequent outliers, while lower kurtosis signifies that the distribution deviates little from its center. As you can see, the common theme is that moments capture the defining features of a distribution, which is why they are such a convenient way to summarize it.
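To make this concrete, here is a minimal numerical sketch, assuming `numpy` and `scipy` are installed, that estimates the first four raw moments of a sample and ties them back to the familiar summary statistics. The choice of sample and seed is arbitrary and purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # a right-skewed sample, chosen arbitrarily

# Raw moments E(X^n), estimated by sample averages
raw = [np.mean(x ** n) for n in (1, 2, 3, 4)]
print("first four raw moments:", np.round(raw, 2))

# The familiar summary statistics are built from these moments
mean = raw[0]
variance = raw[1] - raw[0] ** 2  # sigma^2 = E(X^2) - (E(X))^2
print("mean:", round(mean, 2), "variance:", round(variance, 2))

# Skewness and kurtosis, which the third and fourth moments feed into
print("skewness:", round(stats.skew(x), 2), "excess kurtosis:", round(stats.kurtosis(x), 2))
```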
Moment Generating Function
As the name suggests, an MGF is a function that generates the moments of a distribution. More specifically, we can calculate the \(n\)th moment of a distribution simply by taking the \(n\)th derivative of its moment generating function and plugging in 0 for the parameter \(t\). We will see what \(t\) is in a moment when we look at the formula for the MGF. This all sounds good, but one might ask why we want an MGF in the first place. Well, given that moments convey the defining properties of a distribution, a moment generating function is essentially an all-in-one package: when it exists, it encodes every bit of information about the distribution in question.
Enough of the abstract, let’s get more specific by taking a look at the mathematical formula for an MGF.
\[M_X(t) = \mathbf{E}(e^{tX}) = \sum_x e^{tx} f(x) \tag{1}\]Equation (1) applies when \(X\) is a discrete random variable with probability mass function \(f(x)\). If \(X\) is a continuous random variable with density \(f(x)\), we take an integral instead.
\[M_X(t) = \mathbf{E}(e^{tX}) = \int e^{tx} f(x) \, dx \tag{2}\]Now, you might be wondering how taking the \(n\)th derivative of \(\mathbf{E}(e^{tX})\) gives us the \(n\)th moment of the distribution for the random variable \(X\). To convince ourselves of this statement, we need to start by looking at the Taylor polynomial for the exponential.
\[e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \tag{3}\]One way to convince ourselves of this expression is to differentiate it term by term: the derivative of the series is the series itself, exactly as we would expect of \(e^x\). From here, we can deduce that
\[e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots\]which follows from (3) by substituting \(x = tX\). Now that we have an expression for \(e^{tX}\), we can calculate \(\mathbf{E}(e^{tX})\), which we might recall is the definition of a moment generating function.
\[\begin{align*} M_X(t) &= \mathbf{E}(e^{tX}) \\ &= \mathbf{E}\left(1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots \right) \\ &= 1 + t\mathbf{E}(X) + \frac{t^2 \mathbf{E}(X^2)}{2!} + \frac{t^3 \mathbf{E}(X^3)}{3!} + \cdots \end{align*} \tag{4}\]where the last equality holds by the linearity of expectation.
All the magic happens when we differentiate this function with respect to \(t\).
\[\begin{align*} \frac{d}{dt}M_X(t) &= \frac{d}{dt}\left[1 + t\mathbf{E}(X) + \frac{t^2 \mathbf{E}(X^2)}{2!} + \frac{t^3 \mathbf{E}(X^3)}{3!} + \cdots \right] \\ &= \mathbf{E}(X) + t\mathbf{E}(X^2) + \frac{t^2 \mathbf{E}(X^3)}{2!} + \cdots \end{align*} \tag{5}\]At \(t = 0\), all terms in (5) except for the very first one go to zero, leaving us with
\[\frac{d}{dt}M_X(t) \Big\rvert_{t = 0} = \mathbf{E}(X)\]In other words, differentiating the MGF once and plugging in 0 for \(t\) leaves us with the first moment, as expected. If we differentiate the function again and do the same,
\[\frac{d^2}{dt^2}M_X(t) \Big\rvert_{t = 0} = \left[\mathbf{E}(X^2) + t\mathbf{E}(X^3) + \cdots \right]\Big\rvert_{t = 0} = \mathbf{E}(X^2) \tag{6}\]And by induction, we can see how the \(n\)th derivative of the MGF at \(t = 0\) gives us the \(n\)th moment of the distribution, \(\mathbf{E}(X^n)\).
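Here is a quick symbolic check of this fact, a minimal sketch assuming `sympy` is installed. It uses the standard normal distribution, whose MGF is known to be \(e^{t^2/2}\); repeated differentiation at \(t = 0\) should reproduce its moments \(0, 1, 0, 3\).

```python
import sympy as sp

t = sp.symbols('t')

# MGF of the standard normal distribution: M(t) = exp(t^2 / 2)
M = sp.exp(t**2 / 2)

# The n-th derivative at t = 0 reproduces the n-th moment.
# For the standard normal these are 0, 1, 0, 3 (odd moments vanish by symmetry).
for n in range(1, 5):
    moment = sp.diff(M, t, n).subs(t, 0)
    print(f"E(X^{n}) = {moment}")
```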
Examples with Distributions
The Poisson Distribution
The easiest way to demonstrate the usefulness of the MGF is with an example. For fun, let’s revisit a distribution we examined a long time ago on this blog: the Poisson distribution. To briefly recap, the Poisson distribution can be considered a variation of the binomial distribution in which the number of trials, \(n\), diverges to infinity while the probability of success in each trial is set to \(\frac{\lambda}{n}\). This is why the Poisson distribution is frequently used to model how many times a random event occurs in a given time frame.
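As a quick numerical illustration of this limiting relationship, here is a minimal sketch, assuming `scipy` is available, showing that the Binomial\((n, \lambda/n)\) probabilities approach the Poisson\((\lambda)\) probabilities as \(n\) grows. The values of \(\lambda\) and \(n\) are arbitrary.

```python
import numpy as np
from scipy import stats

lam = 3.0
k = np.arange(10)  # a handful of counts to compare

# Binomial(n, lam/n) should approach Poisson(lam) as n grows
for n in (10, 100, 10_000):
    binom_pmf = stats.binom.pmf(k, n, lam / n)
    poisson_pmf = stats.poisson.pmf(k, lam)
    print(n, np.max(np.abs(binom_pmf - poisson_pmf)))  # the gap shrinks with n
```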
Here is the probability mass function of the Poisson distribution. Note that \(x\) denotes the number of occurrences of the random event in question.
\[P(X = x) = \frac{\lambda^x e^{- \lambda}}{x!} \tag{7}\]The task here is to obtain the mean of the distribution, i.e. to calculate the first moment, \(\mathbf{E}(X)\). The traditional, no-brainer way of doing this would be to refer to the definition of expected values to compute the sum
\[\mathbf{E}(X) = \sum_{x = 0}^\infty x \frac{\lambda^x e^{- \lambda}}{x!} \tag{8}\]Computing this sum is not difficult, but it requires some clever manipulation. The \(x = 0\) term vanishes, and for \(x \geq 1\) we can use \(\frac{x}{x!} = \frac{1}{(x - 1)!}\) to simplify the factorial in the denominator before pulling the constant factors out of the sum.
\[\begin{align*} \mathbf{E}(X) &= \sum_{x \geq 1} \frac{\lambda^x e^{- \lambda}}{(x - 1)!} \\ &= \lambda e^{- \lambda} \sum_{x \geq 1} \frac{\lambda^{x - 1}}{(x - 1)!} \\ &= \lambda e^{- \lambda} e^\lambda \\ &= \lambda \end{align*}\]where the remaining sum collapses to \(e^\lambda\) thanks to the variant of the Taylor series for the exponential function we looked at earlier:
\[e^{a} = \sum_{x = 0}^\infty \frac{a^x}{x!} \tag{9}\]Therefore, we have confirmed that the mean of a Poisson distribution is equal to \(\lambda\), which aligns with what we know about the distribution.
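If you would like to double-check this result numerically, here is a minimal sketch in plain Python that truncates the infinite sum in (8). The value of \(\lambda\) is arbitrary.

```python
import math

lam = 4.0

# Truncate the infinite sum in (8); the factorial in the denominator
# makes the terms decay extremely fast, so 100 terms is plenty.
mean = sum(x * lam**x * math.exp(-lam) / math.factorial(x) for x in range(100))
print(mean)  # ~4.0, i.e. lambda, matching the closed-form derivation
```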
Another way we can calculate the first moment of the Poisson is through its MGF. This might sound a lot more complicated than just computing the expected value the familiar way demonstrated above, but in fact, MGFs are surprisingly easy to calculate, sometimes even easier than using the definition of expectation. Let’s begin by writing down the MGF from its definition.
\[M(t) = \sum_{x = 0}^\infty e^{tx} \frac{\lambda^x e^{- \lambda}}{x!}\]Let’s factor out \(e^{- \lambda}\), which does not depend on \(x\), and combine \(e^{tx} \lambda^x\) into a single power, \((\lambda e^t)^x\).
\[M(t) = e^{- \lambda} \sum_{x = 0}^\infty \frac{(\lambda e^t)^x}{x!} \tag{10}\]Again, we refer to equation (9) to realize that the sigma expression simplifies into an exponential. In other words,
\[\sum_{x = 0}^\infty \frac{(\lambda e^t)^x}{x!} = e^{\lambda e^t}\]From this observation, we can simplify equation (10) as follows:
\[M(t) = e^{- \lambda} e^{\lambda e^t} = e^{\lambda (e^t - 1)} \tag{11}\]And there is the MGF of the Poisson distribution! All we have to do to obtain the first moment of the Poisson distribution, then, is to differentiate the MGF once and set \(t\) to 0. Using the chain rule,
\[M'(t) = \frac{d}{dt} e^{\lambda (e^t - 1)} = \lambda e^t e^{\lambda (e^t - 1)}\]At \(t = 0\),
\[M'(0) = \lambda \cdot 1 \cdot e^{\lambda (1 - 1)} = \lambda\]So we have confirmed again that the mean of a Poisson distribution is equal to \(\lambda\).
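As a sanity check, here is a minimal symbolic sketch, assuming `sympy`, that differentiates the MGF in (11) for us. The second derivative throws in the variance for free, since \(\sigma^2 = \mathbf{E}(X^2) - (\mathbf{E}(X))^2\).

```python
import sympy as sp

t, lam = sp.symbols('t lam', positive=True)

# Poisson MGF from (11): M(t) = exp(lam * (e^t - 1))
M = sp.exp(lam * (sp.exp(t) - 1))

first = sp.diff(M, t).subs(t, 0)       # E(X)
second = sp.diff(M, t, 2).subs(t, 0)   # E(X^2)

print(sp.simplify(first))              # lam, the mean
print(sp.simplify(second - first**2))  # lam again, the variance of a Poisson
```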
The Exponential Distribution
Let’s take another distribution as an example, this time the exponential distribution. We have not looked at the exponential distribution in depth previously, but it is closely related to the [Gamma distribution], which we derived in this post. Specifically, when the parameter \(\alpha = 1\) in a Gamma distribution, it is in effect an exponential distribution (a quick numerical check of this connection appears after the density below). Perhaps we will explore these relationships, along with the Erlang distribution, in a future post. For now, all we need to know is the probability density function of the exponential distribution, which is
\[f(x) = \lambda e^{- \lambda x}, \quad x \geq 0\]This time, the task is to obtain the third moment of the distribution, i.e. \(\mathbf{E}(X^3)\). But the fundamental approach remains identical: we can either use the definition of expected values to calculate the third moment, or compute the MGF and differentiate it three times. At a glance, the latter seems a lot more complicated. However, it won’t take long for us to see that calculating the MGF is sometimes as easy as, if not easier than, taking the expected-values approach.
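As an aside, here is the quick check of the Gamma connection promised above: a minimal sketch, assuming `scipy` is available, showing that the Gamma density with shape \(\alpha = 1\) coincides with the exponential density. The rate \(\lambda\) and the grid of points are arbitrary.

```python
import numpy as np
from scipy import stats

lam = 2.0
x = np.linspace(0, 5, 50)

# Gamma with shape a = 1 and scale 1/lambda vs. the exponential with rate lambda
gamma_pdf = stats.gamma.pdf(x, a=1, scale=1 / lam)
expon_pdf = stats.expon.pdf(x, scale=1 / lam)

print(np.allclose(gamma_pdf, expon_pdf))  # True: the two densities agree
```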
Let’s switch up the order and try the MGF method first.
\[M_X(t) = \mathbf{E}(e^{tX}) = \int_{x = 0}^\infty e^{tx} \lambda e^{- \lambda x} \, dx\]We can pull out the lambda and combine the exponential terms to get
\[M_X(t) = \lambda \int_{x = 0}^\infty e^{(t - \lambda)x} \, dx\]This is an easy integral, provided that \(t < \lambda\) so that it converges. Let’s proceed with the integration and evaluation sequence:
\[M_X(t) = \lambda \left[ \frac{1}{t - \lambda} e^{(t - \lambda)x} \right]_0^\infty = \lambda \left(0 - \frac{1}{t -\lambda}\right) = \frac{\lambda}{\lambda - t} \tag{12}\]Now, all we have to do is differentiate the result in (12) three times and plug in \(t = 0\). Although calculating the third-order derivative may sound intimidating, it may seem easier in comparison to evaluating the integral
\[\mathbf{E}(X^3) = \int_{x = 0}^\infty x^3 \lambda e^{- \lambda x} \, dx \tag{13}\]which would require us to use integration by parts, repeatedly. In the end, the third derivative of (12) at \(t = 0\) and the integral in (13) point at the same quantity, namely the third moment of the exponential distribution. Perhaps the effort of calculating either quantity is similar, and the question might just boil down to a matter of preference. Still, this example shows that the MGF is a robust method for calculating the moments of a distribution, and potentially a less laborious one than the brute-force approach of directly computing expected values.
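To close the loop, here is a minimal symbolic sketch, assuming `sympy`, that follows both routes and confirms they land on the same answer, \(6/\lambda^3\).

```python
import sympy as sp

t, x, lam = sp.symbols('t x lam', positive=True)

# Route 1: differentiate the MGF in (12) three times and set t = 0
M = lam / (lam - t)
via_mgf = sp.diff(M, t, 3).subs(t, 0)

# Route 2: evaluate the integral in (13) directly; sympy handles the
# repeated integration by parts for us
via_integral = sp.integrate(x**3 * lam * sp.exp(-lam * x), (x, 0, sp.oo))

print(sp.simplify(via_mgf), sp.simplify(via_integral))  # both equal 6/lam**3
```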
Conclusion
This was a short post on moments and moment generating functions. Moments were one of those terms that I had come across in Wikipedia articles or math stackexchange posts, but never had the chance to figure out. Hopefully, this post gave you some intuition behind the notion of moments, as well as how moment generating functions can be used to compute useful properties that describe a distribution.
In the next post, we will take a brief break from the world of distributions and discuss some topics in information theory that I personally found interesting. If you would like to dwell on questions like “how do we quantify randomness,” don’t hesitate to tune in again in a few days!