Flow Models

In this post, we will take a look at flow models, which I’ve been obsessed with while reading papers like Glow-TTS and VITS. This post is heavily based on this lecture video by Pieter Abbeel, as well as the accompanying problem sets for the course, available here.

Motivation

We want a model that satisfies the following:

  • Maps complex, intractable distributions onto a simple, tractable latent space
  • Enables easy sampling and generation

The two conditions are somewhat related in the sense that once you have a function (or a neural network that approximates such a function) that maps complex distributions to a tractable latent space, sampling can be performed immediately, provided that the mapping function is invertible. Invertibility is not something that can be easily assumed in deep learning and thus calls for some specific architectural decisions. Nonetheless, I find this formulation highly compelling and intuitive.

Change of Variables

To fully understand the mechanics of flow, we need to first revisit the change of variables formula. Let $X$ denote a random variable, and $f_\theta$, some monotonic, invertible function that maps $X$ to a latent space $Z$. In the simplest case, $f_\theta$ might be the CDF of $X$, and $Z$ might be a uniform distribution $U(0, 1)$. More generally, we have

\[z = f_\theta(x)\]

Note that there exists a one-to-one correspondence between the two random variables, which is important to guarantee invertibility.

Let $p(\cdot)$ denote the PDF of some random variable. Naively, one might think that

\[p(x) = p(z)\]

However, this fails to take into account the fact that a small interval in $x$ may be stretched or compressed when mapped into $z$ space. What is conserved is probability mass, $p(x) \, dx = p(z) \, dz$, so we need a correcting factor: the derivative of $z$ w.r.t. $x$.

\[p(x) = p(z) \left\lvert \frac{\partial f_\theta(x)}{\partial x} \right\rvert \tag{1}\]

More formally, we can see this by considering the CDF, which we will denote as $P(\cdot)$, and differentiating.

\[\begin{align} P(Z \leq z) &= P(f_\theta(X) \leq z) \\ &= P(X \leq f_\theta^{-1}(z)) \end{align} \tag{2}\]

(2) holds if $f$ is a monotonically increasing function. If it is a monotonically decreasing function, then

\[P(Z \leq z) = 1 - P(X \leq f_\theta^{-1}(z))\]

Differentiating both sides of the equation with respect to $z$, we get

\[\begin{align} p(z) &= \pm \, p(f_\theta^{-1}(z)) \frac{\partial f_\theta^{-1}(z)}{\partial z} \\ &= p(x) \left\lvert \frac{\partial x}{\partial z} \right\rvert \\ \end{align} \tag{3}\]

Rearranging (3) yields (1).

In a multi-dimensional context, the absolute value of the partial derivative term becomes the absolute value of the determinant of the Jacobian matrix.

\[p(x) = p(z) \frac{\text{vol}(dz)}{\text{vol}(dx)} = p(z) \left\lvert \text{det} \frac{dz}{dx} \right\rvert\]

We can understand the absolute value of the determinant as the factor by which a matrix, viewed as a linear transformation of coordinates, scales volume. In other words, it is the multivariate analogue of the slope term above.
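
To make (1) concrete, here is a small numerical sanity check of my own (not from the lecture), using the affine map $z = 2x + 3$. If $x \sim \mathcal{N}(0, 1)$, then $z \sim \mathcal{N}(3, 4)$, and the densities should satisfy $p(x) = p(z) \, \lvert dz/dx \rvert = 2 \, p(z)$.

import torch
from torch.distributions import Normal

# sanity check of (1) with z = 2x + 3 and x ~ N(0, 1), so z ~ N(3, 2^2)
x = torch.linspace(-3, 3, steps=7)
z = 2 * x + 3

p_x = Normal(0.0, 1.0).log_prob(x).exp()
p_z = Normal(3.0, 2.0).log_prob(z).exp()

# p(x) should equal p(z) * |dz/dx| = 2 * p(z)
print(torch.allclose(p_x, 2 * p_z))  # True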

Training

Flow is nothing more than a neural network that models $f_\theta$. It takes a random variable living in some complex, intractable space and maps it to a tractable distribution. In the case of normalizing flows, the target latent distribution is a normal distribution.

As is the case with any likelihood model, the goal is to fit a model that maximizes the log likelihood of data. Therefore, the objective is

\[\max \sum_i \log p(x_i) \tag{4}\]

We can substitute the likelihood with the expression in (1), written in terms of the transformed latent variable. Then, (4) is equivalent to

\[\max \sum_i \left[ \log p(f_\theta(x_i)) + \log \left\lvert \text{det} \frac{d f_\theta(x_i)}{d x} \right\rvert \right]\]

We train the flow model to minimize negative log likelihood, or equivalently, maximize log likelihood.
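
To make the objective concrete, below is a minimal sketch of the negative log likelihood computation. It assumes a hypothetical flow module that returns both $z = f_\theta(x)$ and the absolute value of the Jacobian determinant, which is the convention used by the affine coupling layer implemented later in this post.

import math
import torch

def nll_loss(flow, x):
    # `flow` is assumed to return (z, det), where `det` is the absolute
    # value of the Jacobian determinant of the forward transform
    z, det = flow(x)
    # log p(z) under a standard Gaussian latent, summed over dimensions
    log_p_z = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)
    # negative log likelihood, the negation of the objective above
    return -(log_p_z + det.log()).mean()

In practice, flow implementations typically return the log determinant directly for numerical stability; working with the raw determinant here just keeps the sketch closer to the equations above.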

A few remarks:

  • Notice that there is a Jacobian determinant sitting in the log likelihood term. This means that the flow model should model a function whose Jacobian determinant is easy to compute, which is not true of arbitrary functions.
  • In a normalizing flow, $f_\theta$ will essentially try to map as many data points as possible to the high-density region of the Gaussian around its mean.

Perks of Flow

Up to this point, you might think that the flow model is a very intricate machinery that comes with many constraints, e.g. invertibility, an easily computable Jacobian determinant, etc. Nonetheless, I think it has clear advantages in two respects.

Sampling

To sample from a flow model, all we have to do is sample from the latent prior distribution, such as a standard Gaussian, then simply send the sample through the inverse flow.
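
As a minimal sketch, assuming a trained flow object that exposes an inverse() method (as the affine coupling layer implemented later in this post does), sampling boils down to two lines; batch_size and hidden_size are placeholder values.

import torch

batch_size, hidden_size = 8, 10

# `flow` is assumed to be a trained flow model with an inverse() method
z = torch.randn(batch_size, hidden_size)  # sample from the Gaussian latent
x = flow.inverse(z)                       # send the sample down the inverse flow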

Combinations

One salient characteristic of a flow is that a combination of flows is also a flow. If you have a set of invertible, differentiable functions, a stack of such functions will also be differentiable and invertible.

\[z = f_k \circ f_{k - 1} \circ \cdots \circ f_1(x) \\ x = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_k^{-1} (z)\]

The capacity of a single flow layer is limited, but a deep stack gives the model enough expressive power to handle highly complex data distributions.
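
Below is a rough sketch of how such a composition might be implemented, assuming each layer returns (z, det) in the forward direction, as the affine coupling layer later in this post does; the overall determinant is then simply the product of the per-layer determinants (equivalently, the log determinants add up). The FlowStack class here is my own illustration, not a standard API.

import torch
from torch import nn

class FlowStack(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        # accumulate the product of per-layer Jacobian determinants
        det = x.new_ones(x.size(0))
        for layer in self.layers:
            x, layer_det = layer(x)
            det = det * layer_det
        return x, det

    def inverse(self, z):
        # invert each layer in reverse order
        for layer in reversed(self.layers):
            z = layer.inverse(z)
        return z

In practice, coupling-based flows also permute or swap the two halves between layers so that every dimension eventually gets transformed; I left that out to keep the sketch minimal.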

Model Architecture

Flow models must be invertible, which leads to some important considerations when motivating their architecture. For instance, we cannot use ReLU activations since they violate the invertibility requirement. Moreover, the Jacobian determinant should be easy to compute.

Inversion

The beautiful part of flow is that there is a simple way to resolve both conundrums: affine coupling layers. Let $d$ denote the dimensionality of the space on which we are applying the flow model. Then, the affine coupling layer can schematically be written as

\[z_{1:d/2} = x_{1:d/2} \\ \begin{align} z_{d/2:d} &= x_{d/2:d} \odot s_\theta(x_{1:d/2}) + t_\theta(x_{1:d/2}) \\ &= x_{d/2:d} \odot s_\theta(z_{1:d/2}) + t_\theta(z_{1:d/2}) \end{align} \tag{5}\]

In plain language, we can consider $f_\theta$ as a special transformation in which the top half of $z$ is just copied from $x$ without modification. The bottom half undergoes an affine transformation, whose scale and shift are computed from the top half of $x$. We can easily check that this transformation is indeed invertible:

\[x_{1:d/2} = z_{1:d/2} \\ x_{d/2:d} = \left( z_{d/2:d} - t_\theta(z_{1:d/2}) \right) \oslash s_\theta(z_{1:d/2}) \tag{6}\]

where $\oslash$ denotes element-wise division, mirroring the element-wise product $\odot$ in (5).

Affine coupling layers are invertible precisely because the top half of $z$ is equal to that of $x$: at inversion time, $s_\theta$ and $t_\theta$ can be recomputed from $z_{1:d/2}$. This demystifies the copying operation in (5), which may have appeared somewhat unintuitive and awkward at first.

In practice, it appears that flow layers take a slightly more complicated form than the conceptual architecture detailed above. For example, Real NVP proposes the following schema.

\[z_{1:d/2} = x_{1:d/2} \\ h = a \times \text{tanh}(s_\theta(x_{1:d/2})) + b \\ z_{d/2:d} = \text{exp}(h) \times x_{d/2:d} + g_\theta(x_{1:d/2})\]

where $a$ and $b$ are learned parameters, and $s_\theta$ and $g_\theta$ are learned functions, such as multi-layer perceptrons.
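
Here is a rough sketch of how this parameterization might look in code. The choice of two-layer MLPs for $s_\theta$ and $g_\theta$ and per-dimension parameters for $a$ and $b$ are my own assumptions for illustration; note that $s_\theta$ and $g_\theta$ are never inverted themselves, so they can be arbitrary networks.

import torch
from torch import nn

class RealNVPStyleCoupling(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        half = hidden_size // 2
        # s_net and g_net are free to be arbitrary networks, since they
        # are never inverted
        self.s_net = nn.Sequential(nn.Linear(half, half), nn.ReLU(), nn.Linear(half, half))
        self.g_net = nn.Sequential(nn.Linear(half, half), nn.ReLU(), nn.Linear(half, half))
        self.a = nn.Parameter(torch.ones(half))
        self.b = nn.Parameter(torch.zeros(half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        h = self.a * torch.tanh(self.s_net(x1)) + self.b
        z2 = torch.exp(h) * x2 + self.g_net(x1)
        # the Jacobian of z2 w.r.t. x2 is diagonal with entries exp(h),
        # so the log determinant is just h summed over dimensions
        return torch.cat((x1, z2), dim=1), h.sum(dim=1)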

Jacobian

Earlier, we noted that the determinant of the Jacobian matrix must be easy to compute. This is a non-trivial constraint that does not hold for arbitrary functions.

Fortunately, it turns out that the Jacobian is very easy to compute for an affine coupling layer. We can somewhat intuit this by considering the copy operation applied to the top half of the input. Given this operation, the upper left quadrant of the Jacobian is simply an identity matrix, and the upper right quadrant is zero, since $z_{1:d/2}$ does not depend on $x_{d/2:d}$.

\[\begin{align} \frac{\partial z}{\partial x} &= \begin{pmatrix} \frac{\partial z_{1:d/2}}{\partial x_{1:d/2}} & \frac{\partial z_{1:d/2}}{\partial x_{d/2:d}} \\ \frac{\partial z_{d/2:d}}{\partial x_{1:d/2}} & \frac{\partial z_{d/2:d}}{\partial x_{d/2:d}} \end{pmatrix} \\ &= \begin{pmatrix} I & 0 \\ \frac{\partial z_{d/2:d}}{\partial x_{1:d/2}} & \text{diag}(s_\theta(x_{1:d/2})) \end{pmatrix} \end{align}\]

Although there are still complicated terms in the lower left quadrant of the Jacobian, we do not need them to compute the determinant: the determinant of a lower triangular matrix is simply the product of its diagonal entries. The determinant of the Jacobian therefore collapses to the product of the entries in the lower right quadrant. We thus see how the affine coupling layer satisfies both the invertibility and the Jacobian determinant requirements.
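
Spelling this out in the notation above, the determinant and its log are simply

\[\text{det} \frac{\partial z}{\partial x} = \prod_{i=1}^{d/2} s_\theta(x_{1:d/2})_i, \qquad \log \left\lvert \text{det} \frac{\partial z}{\partial x} \right\rvert = \sum_{i=1}^{d/2} \log \left\lvert s_\theta(x_{1:d/2})_i \right\rvert\]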

Implementation

This is my attempt at a simple implementation of an affine coupling layer. Although I could have combined the forward() and inverse() functions to remove duplicate lines of code, I left them separate for clarity's sake.

import torch
from torch import nn

class AffineCouplingLayer(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        half_size, remainder = divmod(hidden_size, 2)
        assert remainder == 0, (
            f"Expected `hidden_size` to be even, but received {hidden_size}"
        )
        # a single linear layer produces both the scale `s` and the shift `t`
        self.fc = nn.Linear(half_size, hidden_size)

    def forward(self, x, inverse=False):
        if inverse:
            return self.inverse(x)
        # copy the top half; transform the bottom half conditioned on the top
        x1, x2 = x.chunk(2, dim=1)
        z1 = x1
        s, t = self.fc(x1).chunk(2, dim=1)
        z2 = x2 * s + t
        z = torch.cat((z1, z2), dim=1)
        # the Jacobian is lower triangular, so its determinant is the
        # product of the scale entries
        det = s.prod(dim=-1).abs()
        return z, det

    def inverse(self, z):
        # recompute `s` and `t` from the (copied) top half and undo the
        # affine transform on the bottom half
        z1, z2 = z.chunk(2, dim=1)
        x1 = z1
        s, t = self.fc(z1).chunk(2, dim=1)
        x2 = (z2 - t) / s
        x = torch.cat((x1, x2), dim=1)
        return x

This implementation is a close transcription of (5). z1 denotes $z_{1:d/2}$ and z2 denotes $z_{d/2:d}$, and likewise for the xs. The fully-connected layer self.fc computes the scale s and the shift t from x1, which are then used to transform x2 into z2. inverse() is a transcription of (6).

We can perform a quick sanity check on this implementation by performing a forward pass, as well as an inverse pass, and verifying that inverting the output of the forward pass recovers the original input.

batch_size = 8
hidden_size = 10
half_size = hidden_size // 2
x = torch.randn(batch_size, hidden_size)
l = AffineCouplingLayer(hidden_size)
z, det = l(x)
z.shape
torch.Size([8, 10])

We also get the determinants, which are scalar values: there are 8 of them, one per example, matching the batch size of the input.

det.shape
torch.Size([8])

We can check that the affine coupling layer leaves the top half of the input untouched.

torch.equal(x[:,:half_size], z[:,:half_size])
True

Trivially, we can also verify that the rest of the output has been modified by the layer.

torch.equal(x[:,half_size:], z[:,half_size:])
False

Most importantly, we can see that the layer is indeed invertible; that is, it recovers the original input given the output of the layer z.

torch.allclose(x, l(z, inverse=True))
True

We use torch.allclose() instead of torch.equal() due to floating point errors that can cause subtle differences in values. This is merely a technicality and does not affect the conclusion that affine coupling layers are fully invertible.

Conclusion

In this post, we discussed flow models. I personally find flow-based models extremely interesting, simply because deep neural networks are normally not something that we can invert like a simple mathematical function. After all, the precise reason why we use deep neural networks is that we want to model complex non-linear functions. Flow models seem to go against this intuition in some sense, while providing us with the tools to map highly complex data distributions to tractable latent distributions.

I hope you enjoyed reading this post. Catch you in the next one!