Principal Component Analysis


Principal component analysis is one of those techniques that I’ve always heard about somewhere, but didn’t have a chance to really dive into. PCA would come up in papers on GANs, tutorials on unsupervised machine learning, and of course, math textbooks, whether it be on statistics or linear algebra. I decided that it’s about time that I devote a post to this topic, especially since I promised one after writing about [singular value decomposition] on this blog some time ago. So here it goes.

Why PCA?

What do we need principal component analysis for? Or more importantly, what is a principal component to begin with?

Well, to cut to the chase, PCA is a way of implementing dimensionality reduction, a form of lossy compression. This simply means that we want to transform data living in high dimensional space into lower dimensions. Imagine having a dataset with hundreds of thousands of feature columns: it would take a lot of computing power to fit a machine learning model to that data and generate predictions. This is where PCA comes in: with PCA, we can figure out which directions are the most important and apply a transformation that compresses the data into lower dimensions, making it a lot more tractable and easier to work with.

And in case you’re still wondering, the principal components are precisely those new extracted dimensions used to represent the data in its compressed form!
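To make this concrete before we dive into the math, here is a minimal sketch of what compression looks like in practice, using scikit-learn’s `PCA` on a purely illustrative random dataset (the numbers mean nothing; the point is just the change in shape).

```python
# A minimal sketch of dimensionality reduction with scikit-learn's PCA.
# The dataset here is random and purely illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 100))  # 500 points living in 100-dimensional space

pca = PCA(n_components=2)        # keep only the first two principal components
X_compressed = pca.fit_transform(X)

print(X_compressed.shape)                    # (500, 2)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```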

Derivation with Residuals

Let’s derive PCA with some good old linear algebra tricks. I used Ian Goodfellow’s Deep Learning and a lecture slide from Columbia as references for this post.

Setup

The setup of a classic PCA problem might be summarized as follows. Suppose we have a dataset of $m$ points, each living in $d$-dimensional space. In other words,

\(\mathcal{D} = \{x_1, x_2, \cdots , x_m\}\) where

\[x_i = (x_i^{(1)}, x_i^{(2)}, \cdots , x_i^{(d)})^\top\]

Our goal is to find a way to compress the data into lower dimensional space $\mathbb{R}^l$ where $l < d$. We might imagine this as a transformation, i.e. the objective is to find a transformation

\[f: \mathbb{R}^d \mapsto \mathbb{R}^l\]

So that applying $f$ to $x_i$ yields a new vector $c_i = f(x_i)$ living in lower dimensional space. We can also imagine there being a reverse transformation, or a decoding function $g$, that achieves

\[g(f(x_i)) \approx x_i \tag{1}\]

Because PCA is in essence a linear transformation, it is most natural to express and understand it as a matrix. Let’s denote the matrix of the encoding transformation by $T$ and the matrix of the decoding by $D$. In other words,

\[c = f(x) = Tx \\x \approx g(c) = Dc \tag{2}\]

PCA makes a number of assumptions to simplify this problem. The most important assumption is that the columns of $D$ are orthogonal to each other. As we will see later in an alternate derivation with statistics, this has to do with the notion of covariance. Another restriction is that the columns of $D$ must have a Euclidean norm of one. This constraint is necessary for us to find a unique matrix $D$ that achieves the compression; otherwise, any scalar multiple of a valid $D$ would also work, leaving us with an infinite number of such matrices.

We make one more convenient assumption about the given data points, $x$. That is, $x$ is assumed to have a mean of zero, i.e. $\mathbb{E}[x] = 0$. If this is not the case, we can easily center the data by subtracting the mean.
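Here is a small NumPy sketch of this setup, with a hypothetical decoding matrix $D$ whose columns are orthonormal; I use $D^\top$ as the encoder here, anticipating the result we derive below.

```python
# A sketch of the encode/decode setup in (1) and (2), assuming NumPy.
# D is a hypothetical d x l matrix with orthonormal columns (D^T D = I),
# obtained here from a QR factorization of a random matrix.
import numpy as np

rng = np.random.default_rng(0)
d, l, m = 10, 3, 200

X = rng.normal(size=(m, d))
X = X - X.mean(axis=0)            # center the data so that E[x] = 0

D, _ = np.linalg.qr(rng.normal(size=(d, l)))   # columns are orthonormal
assert np.allclose(D.T @ D, np.eye(l))

x = X[0]                          # a single observation
c = D.T @ x                       # encode: f(x) = Tx, with T = D^T (derived later)
x_hat = D @ c                     # decode: g(c) = Dc
print(np.linalg.norm(x - x_hat))  # reconstruction error for this (non-optimal) D
```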

With this setup in mind, let’s finally start the derivation.

Minimizing Data Loss

As said earlier, the goal of PCA is to compress data (and be able to uncompress it) with as little loss of information as possible. We don’t want to compress data in a haphazard fashion; instead, we want the compression scheme to be able to preserve the structure of the data as much as possible in its lower dimensional representation. From this, we can come up with the following equation:

\[c^* = \mathop{\rm arg\,min}\limits_{c} \lVert x - g(c) \rVert_2 \tag{3}\]

In other words, the goal is to find the $c$ that minimizes the difference between the original data point and its reconstruction. Note that finding this optimal $c$ amounts to finding the $f$ that most effectively compresses the given data.

Instead of the L2 norm, let’s consider the squared L2 norm for convenience. Minimizing the L2 norm is equivalent to minimizing the squared L2 norm, so there is no semantic difference. Since the squared L2 norm of a vector $v$ is simply $v^\top v$, we can express the squared L2 norm version of (3) as follows:

\[\begin{align} c^* &= \mathop{\rm arg\,min}\limits_{c} (x - g(c))^\top(x - g(c)) \\ &= \mathop{\rm arg\,min}\limits_{c} x^\top x - x^\top g(c) - g(c)^\top x + g(c)^\top g(c) \\ &= \mathop{\rm arg\,min}\limits_{c} x^\top x - 2x^\top g(c) + g(c)^\top g(c) \\ &= \mathop{\rm arg\,min}\limits_{c} g(c)^\top g(c) - 2x^\top g(c) \end{align} \tag{4}\]

where the second to last equality is due to the fact that $x^\top g(c)$ and $g(c)^\top x$ are scalars that denote the same value. Also, since the argument of the minimum is taken with respect to $c$, we can omit the first term, which depends only on $x$.

It’s time to take derivatives. But in order to do so, we need to unpack $g(c)$, since we cannot differentiate it in its current abstract form. Using (2), we can reorganize (4) as follows:

\[\begin{align} c^* &= \mathop{\rm arg\,min}\limits_{c} (Dc)^\top Dc - 2x^\top Dc \\ &= \mathop{\rm arg\,min}\limits_{c} c^\top D^\top Dc - 2x^\top Dc \\ &= \mathop{\rm arg\,min}\limits_{c} c^\top c - 2x^\top Dc \end{align} \tag{5}\]

The last equality is due to the fact that we constrained the columns of $D$ to be unit vectors that are orthogonal to each other.

Taking the Gradient

Now we can take the gradient of the objective with respect to $c$ and set it equal to zero to find the minimum.

\[\nabla_c (c^\top c - 2 x^\top Dc) = 2c - 2 D^\top x = 0 \implies c = D^\top x \tag{6}\]

This tells us that the optimal way of compressing $x$ is simply by multiplying it by the transpose of the decoding matrix. In other words, we have found the transformation $T$ in (2).

For those of you who are confused about how gradients and matrix calculus work, here is a very short explanation. First, notice that $c^\top c$ is just a scalar, since $c$ is a column vector. Taking the gradient of this quantity with respect to $c$ gives another column vector with the same dimensions as $c$, with the following elements:

\[\nabla_c \, c^\top c = \left(\frac{d c_1^2}{d c_1}, \frac{d c_2^2}{d c_2}, \cdots ,\frac{d c_n^2}{d c_n}\right)^\top = 2c \tag{7}\]

The same line of thinking can be applied to the second term, $2x^\top Dc$. We know that $2x^\top D$ is a row vector, since its dot product with $c$ has to make sense dimensionally. The gradient with respect to $c$ should then give each of the elements of $2x^\top D$, but in column vector format, hence the need for a transpose.

In general, the rule of thumb is that the gradient of a scalar with respect to a vector or a matrix should return a vector or matrix of the same dimension.
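If you would like to convince yourself of (6) numerically, here is a quick sketch with a random orthonormal-column $D$ (purely illustrative): the closed-form code $D^\top x$ coincides with the least-squares solution of $\min_c \lVert x - Dc \rVert$.

```python
# A quick numerical check of (6): for a D with orthonormal columns, the
# optimal code c is exactly D^T x. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(1)
d, l = 10, 3
D, _ = np.linalg.qr(rng.normal(size=(d, l)))   # orthonormal columns
x = rng.normal(size=d)

c_closed_form = D.T @ x                                  # our derived encoder
c_least_squares, *_ = np.linalg.lstsq(D, x, rcond=None)  # argmin_c ||x - Dc||

print(np.allclose(c_closed_form, c_least_squares))  # True
```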

Reconstructing Original Data

Recall from (1) that reconstruction can be achieved by applying compression followed by a decoding operation:

\[g(f(x_i)) \approx x_i \tag{1}\]

Since we know that $f$ amounts to multiplication by $D^\top$ and $g$ to multiplication by $D$ by definition, we can express (1) in a different way.

\[DD^\top x \approx x \tag{8}\]

In retrospect, this is somewhat intuitive, since $D$ can roughly be thought of as a pseudo-orthonormal matrix: its columns are orthonormal, but there is no guarantee that it is square.

Now, all that is left is to find the matrix $D$. The way to go about this is to reconsider (3), the notion of minimizing data loss, given our findings in (6). In other words,

\[D^* = \mathop{\rm arg\,min}\limits_{D} \lVert X - X DD^\top \rVert_F^2 \tag{9}\]

Instead of considering a single observation, here we consider the design matrix in its entirety. Note that $X$ is a design matrix each of whose rows corresponds to a single observation. And because we are dealing with matrices, the Euclidean norm was replaced with its matrix equivalent, the Frobenius norm. Expanding the squared Frobenius norm and using $D^\top D = I$, we get $\lVert X - XDD^\top \rVert_F^2 = \text{Tr}(XX^\top) - \text{Tr}(XDD^\top X^\top)$. The first term is a constant with respect to $D$, so it can safely be dropped; the negative sign on the remaining term lets us change the argument of the minimum into an argument of the maximum.

\[D^* = \mathop{\rm arg\,max}\limits_{D} \lVert X DD^\top \rVert_F^2 \tag{10}\]

The Frobenius norm of a real matrix can be calculated as

\[\lVert A \rVert_F = \sqrt{\text{Tr}(A A^\top)} \tag{11}\]

Therefore,

\[\begin{align} D^* &= \mathop{\rm arg\,max}\limits_{D} \text{Tr}(X D D^\top D D^\top X^\top) \\ &= \mathop{\rm arg\,max}\limits_{D} \text{Tr}(X D D^\top X^\top) \\ &= \mathop{\rm arg\,max}\limits_{D} \text{Tr}(D^\top X^\top XD) \end{align} \tag{12}\]

The last equality is due to a useful property of the trace: we can cyclically permute the matrices in a product without changing the value of the trace.

Let’s consider a single column in $D$, denoted as $d$. You might also imagine this as the situation where $l = 1$, meaning we want to compress each data point into a single scalar value. It is not difficult to see that the trace of $d^\top X^\top X d$, which is just the scalar itself in this one-dimensional case, is maximized over unit vectors when $d$ is the eigenvector of $X^\top X$ with the largest eigenvalue (this is the classic Rayleigh quotient argument). Indeed, plugging in a unit eigenvector $d$ with eigenvalue $\lambda$, we get

\[\begin{align} \text{Tr}(d^\top X^\top X d) &= d^\top X^\top X d \\ &= d^\top \lambda d \\ &= \lambda \lVert d \rVert_2^2 \\ &= \lambda \end{align} \tag{13}\]

Generalizing this result back to $D$, we see that $D$ is the matrix whose columns are the eigenvectors of $X^\top X$ corresponding to the $l$ largest eigenvalues, ordered by descending eigenvalue.
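To tie the whole derivation together, here is a minimal NumPy sketch on toy data that builds $D$ from the top eigenvectors of $X^\top X$, then compresses and reconstructs the data.

```python
# A sketch of PCA via the eigenvectors of X^T X, assuming NumPy only.
# The data and dimensions are arbitrary; X is centered first.
import numpy as np

rng = np.random.default_rng(2)
m, d, l = 300, 8, 2

X = rng.normal(size=(m, d)) @ rng.normal(size=(d, d))  # correlated toy data
X = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # sort into descending order
D = eigvecs[:, order[:l]]                    # d x l matrix of principal components

C = X @ D                 # compressed representation, m x l
X_hat = C @ D.T           # reconstruction, m x d
print(np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(X) ** 2)  # relative loss
```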

Derivation with Covariance

If you have had prior exposure to PCA, you might know that the standard way of obtaining principal components is by calculating the covariance matrix of the data and finding its eigenvectors. Here, I attempt to explain how and why the procedure outlined in the preceding section achieves essentially the same task, albeit through a different frame of thought.

Covariance Matrix

The unbiased sample covariance matrix is given by

\[\Sigma = \frac{1}{m - 1}X^\top X \tag{14}\]

Of course, this is operating under the assumption that $X$ has already been standardized such that the mean of the data is zero.
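As a quick sanity check, (14) agrees with NumPy’s built-in covariance on toy data (note that `np.cov` expects `rowvar=False` for a row-based design matrix):

```python
# A sanity check of (14) against NumPy's built-in covariance, on toy data.
import numpy as np

rng = np.random.default_rng(3)
m, d = 100, 4
X = rng.normal(size=(m, d))
X = X - X.mean(axis=0)                      # center so that the mean is zero

cov_manual = X.T @ X / (m - 1)              # equation (14)
cov_numpy = np.cov(X, rowvar=False)         # rows are observations

print(np.allclose(cov_manual, cov_numpy))   # True
```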

You might be thinking that the formulation in (14) looks different from the one introduced in the previous post on SVD. In that post, I stated that the covariance could be calculated as

\[\Sigma = \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\top] \tag{15}\]

(14) and (15) certainly look different. However, under the hood, they express the same quantity.

  • In (14), we assumed from the beginning that our data has a mean of zero. In (15), we make this assumption explicit by subtracting the mean from the data.
  • In (14), we assumed a row-based design matrix, where each data point is stored as a row vector. In (15), we assumed that each data point is stored as a column of the design matrix; hence the difference in the order of the transpose.
  • In (14), we are dealing with the unbiased sample covariance matrix, which is why we multiply by a factor of $1/(m - 1)$. In (15), this normalization is expressed as an expectation encapsulating the entire expression.

So in a nutshell, the conclusion we arrived at in the preceding section by minimizing the reconstruction residuals ultimately amounts to finding the covariance matrix and its eigenvectors. I find this to be the more dominant interpretation of PCA, since it is highly intuitive: the goal of PCA is to find the axes, or the principal components, that maximize the variance seen in the data.

setosa.io has some excellent visualizations on the notion of covariance and how it relates to PCA, so I highly recommend that you go check it out.

If we were to derive PCA from the get-go with the covariance approach, we would use an iterative procedure to find one principal component at a time. Specifically, our goal would be to find a unit vector $d$ that maximizes

\[\text{Var}(Xd) = d^\top \Sigma d \quad\text{subject to } d^\top d = 1 \tag{16}\]

Hence the problem is now framed as a constrained optimization problem.

Lagrange Multipliers

We use Lagrange multipliers to solve this constrained optimization problem. The intuition for the Lagrangian method is that the gradients of the objective and the constraint should be parallel to each other at the point of optimization.

\[d^* = \mathop{\rm arg\,max}\limits_{d} d^\top \Sigma d - \lambda (d^\top d - 1) \tag{17}\]

We go about this by taking the gradient of the argument with respect to $d$:

\[\begin{align} \nabla_d (d^\top \Sigma d - \lambda(d^\top d - 1)) &= (\Sigma + \Sigma^\top) d - 2 \lambda d \\ &= 2 \Sigma d - 2 \lambda d \\ &= 0 \end{align} \tag{18}\]

In the second line, we use the fact that the covariance matrix is by definition symmetric, so $\Sigma^\top = \Sigma$. Dividing through by the constant 2, we end up with

\[\Sigma d = \lambda d \tag{19}\]

And once again, we have shown that the principal components are the eigenvectors of the covariance matrix.
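Here is a small numerical illustration of this result on toy data: the variance of the data projected onto the top eigenvector of the covariance matrix equals the largest eigenvalue, and no random unit vector does better.

```python
# A numerical illustration of (16)-(19): among unit vectors, the top
# eigenvector of the covariance matrix maximizes the variance of the
# projected data. Toy data only.
import numpy as np

rng = np.random.default_rng(4)
m, d = 500, 6
X = rng.normal(size=(m, d)) @ rng.normal(size=(d, d))
X = X - X.mean(axis=0)
Sigma = X.T @ X / (m - 1)

eigvals, eigvecs = np.linalg.eigh(Sigma)
d1 = eigvecs[:, -1]                          # eigenvector with the largest eigenvalue

var_top = np.var(X @ d1, ddof=1)             # variance along the first PC
print(np.isclose(var_top, eigvals[-1]))      # True: the variance equals lambda_max

# no random unit vector should do better (up to numerical noise)
random_dirs = rng.normal(size=(1000, d))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
var_random = np.var(X @ random_dirs.T, axis=0, ddof=1)
print(var_random.max() <= var_top + 1e-9)    # True
```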

Iteration

But the procedure outlined above can be used to find only one principal component, namely the eigenvector with the largest eigenvalue. How do we go about finding multiple eigenvectors?

This can be done, once again, with Lagrangians, with the added caveat that we will have more trailing terms in the end. Let’s elaborate on this point further. Here, we assume that we have already obtained the first component, $d_1$, and our goal is to find the next component, $d_2$. With induction, we can easily see how this analysis would apply to finding $d_n$.

Simply put, the goal is to maximize $d_2^\top \Sigma d_2$ under the constraint that $d_2$ is orthogonal to $d_1$ while also satisfying the constraint that it is a unit vector. (In reality, the orthogonality constraint is automatically satisfied since the covariance matrix is symmetric, but we demonstrate this nonetheless.) Therefore,

\[d_2^* = \mathop{\rm arg\,max}\limits_{d_2} d_2^\top \Sigma d_2 - \lambda (d_2^\top d_2 - 1) - \phi d_2^\top d_1\tag{20}\]

Using Lagrangians,

\[\begin{align} &\nabla_{d_2}(d_2^\top \Sigma d_2 - \lambda (d_2^\top d_2 - 1) - \phi d_1^\top d_2) \\ &= (\Sigma + \Sigma^\top) d_2 - 2 \lambda d_2 - \phi d_1 \\ &= 2 \Sigma d_2 - 2 \lambda d_2 - \phi d_1 \\ &= 0 \end{align} \tag{21}\]

In the second equality, we again use the fact that the covariance matrix is symmetric. Dividing through by 2 and absorbing the constant into $\lambda$ and $\phi$, we are left with $\Sigma d_2 - \lambda d_2 - \phi d_1 = 0$.

If we left-multiply this result by $d_1^\top$,

\[d_1^\top \Sigma d_2 - \lambda d_1^\top d_2 - \phi d_1^\top d_1 = 0 \tag{22}\]

But since $d_1 \perp d_2$ by the orthogonality constraint, the second term goes to zero; the first term vanishes as well, because $d_1$ is an eigenvector of the symmetric matrix $\Sigma$, so that $d_1^\top \Sigma d_2 = (\Sigma d_1)^\top d_2 = \lambda_1 d_1^\top d_2 = 0$. Also, the last term reduces to $\phi$ since $\lVert d_1 \rVert_2 = 1$. This necessarily means that $\phi = 0$.

If we plug $\phi = 0$ back into the result of (21), we end up with the definition of an eigenvector again, but this time for $d_2$.

\[\Sigma d_2 - \lambda d_2 = 0 \tag{23}\]

Essentially, we iterate this process to find a specified number of principal components, which amounts to finding $l$ number of eigenvectors of the sample covariance matrix.
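For illustration, here is a sketch of the one-component-at-a-time idea in NumPy, using power iteration followed by deflation to extract each eigenvector in turn. In practice, a single call to `np.linalg.eigh` gives all of them at once; this is purely to mirror the iterative derivation.

```python
# A sketch of the iterative idea: find principal components one at a time via
# power iteration on the covariance matrix, followed by deflation.
# Purely illustrative, not production code.
import numpy as np

def top_component(S, n_iter=1000, seed=0):
    """Dominant eigenvector of a symmetric matrix S via power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=S.shape[0])
    for _ in range(n_iter):
        v = S @ v
        v /= np.linalg.norm(v)
    return v

def pca_iterative(X, l):
    """First l principal components of a centered design matrix X, one at a time."""
    S = X.T @ X / (X.shape[0] - 1)    # sample covariance matrix
    components = []
    for _ in range(l):
        d = top_component(S)
        lam = d @ S @ d               # the corresponding eigenvalue
        components.append(d)
        S = S - lam * np.outer(d, d)  # deflation: remove the component just found
    return np.column_stack(components)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
X = X - X.mean(axis=0)

D_iter = pca_iterative(X, l=2)
eigvals, eigvecs = np.linalg.eigh(X.T @ X / (X.shape[0] - 1))
D_eigh = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# the two sets of components agree up to sign; this should be close to identity
print(np.round(np.abs(D_iter.T @ D_eigh), 3))
```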

Relevance with Decompositions

A while back, we discussed both eigendecomposition as well as singular value decomposition, both of which are useful ways of decomposing matrices into discrete factors. In this section, we will see how PCA is essentially a way of performing and applying these decomposition techniques under the hood.

Eigendecomposition

Recall that eigendecomposition is a method of decomposing matrices as follows:

\[A = S \Lambda S^{-1} \tag{24}\]

where $\Lambda$ is a diagonal matrix of eigenvalues and $S$ is a matrix of eigenvectors.

PCA is closely related to eigendecomposition, and this should come as no surprise. Essentially, by finding the eigenvalues and eigenvectors of $X^\top X$, we are performing an eigendecomposition of the covariance matrix, which is just $X^\top X$ up to the constant factor $1/(m - 1)$:

\[X^\top X = W \Lambda W^\top \tag{25}\]

Notice that $W$ is a matrix whose columns are the principal components. Of course, in this case, $W$ is a square, full-rank matrix; to apply dimensionality reduction, we only keep the first $l$ columns of $W$. At any rate, it is clear that PCA involves an eigendecomposition of the covariance matrix.

Singular Value Decomposition

Eigendecomposition can only be applied to square, diagonalizable matrices. However, there is a more general method that works for any matrix, including non-square ones: singular value decomposition.

Here is a blueprint of SVD:

\[A = U \Sigma V^\top \tag{26}\]

where $\Sigma$ is a rectangular diagonal matrix containing the singular values, the square roots of the eigenvalues of $A^\top A$, with appropriate dimensional configurations to accommodate the shape of the original matrix.

We cannot perform eigendecomposition on $X$, since there is no guarantee that it is square; however, SVD is definitely an option. Assume that $X$ can be decomposed into $U$, $\Sigma$, and $W^\top$, where $\Sigma$ now denotes the matrix of singular values rather than the covariance matrix. Then the (unnormalized) covariance matrix $X^\top X$ becomes

\[\begin{align} X^\top X &= W \Sigma^\top U^\top U \Sigma W^\top \\ &= W \Sigma^\top \Sigma W^\top \\ &= W \Lambda W^\top \end{align} \tag{27}\]

And we end up in the same place as we did in (25). This is no surprise given that the derivation of SVD involves eigendecomposition.
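As a quick numerical confirmation on toy data, both routes produce the same principal directions, and the squared singular values match the eigenvalues of $X^\top X$:

```python
# A check that SVD of X and eigendecomposition of X^T X give the same
# principal directions, as in (25) and (27). Toy data, NumPy only.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)

# eigendecomposition route
eigvals, W_eig = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, W_eig = eigvals[order], W_eig[:, order]

# SVD route
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W_svd = Wt.T

print(np.allclose(s ** 2, eigvals))         # squared singular values = eigenvalues
print(np.allclose(np.abs(W_svd.T @ W_eig), np.eye(5), atol=1e-6))  # same directions up to sign
```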

Conclusion

In this post, we took a deep dive into the mathematics behind principal component analysis. PCA is a very useful technique used in many areas of machine learning. One of the most common applications is to apply PCA to a high-dimensional dataset before running a clustering algorithm. Dimensionality reduction makes clustering more tractable, since the data is re-expressed along the axes that capture the most variance.
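A sketch of this PCA-then-cluster pattern with scikit-learn might look like the following (the dataset and hyperparameters are placeholders, not recommendations):

```python
# A common pattern, sketched with scikit-learn: reduce dimensionality with
# PCA before clustering with K-means. The data and settings are placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 50))   # stand-in for a real high-dimensional dataset

pipeline = make_pipeline(
    StandardScaler(),              # center (and scale) the features
    PCA(n_components=10),          # keep the top 10 principal components
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
print(np.bincount(labels))         # cluster sizes
```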

Upon some more research, I also found an interesting paper showing that there is a solid mathematical relationship between K-means clustering and PCA. I haven’t read the paper from top to bottom, but instead skimmed a summary of it in a thread on Stack Overflow. It’s certainly a lot of information to take in, and I have no intention of covering this topic in this already rather lengthy post on PCA. So perhaps this discussion will be tabled for a later time, as interesting as it seems.

I hope you enjoyed reading this post. Amidst the chaos of the COVID-19 pandemic, let’s try to stay strong and find peace ruminating over some matrices and formulas. Trust me, it works better than you might think.