A Step Up with Variational Autoencoders

22 minute read

In a previous post, we took a look at autoencoders, a type of neural network that receives some data as input, encodes them into a latent representation, and decodes this information to restore the original input. Autoencoders are exciting in and of themselves, but things can get a lot more interesting if we apply a bit of twist. In this post, we will take a look at one of the many flavors of the autoencoder model, known as variational autoencoders, or VAE for short. Specifically, the model that we will build in this tutorial is a convolutional variational Autoencoder, since we will be using convolutional layers for better image processing.

The model architecture introduced in this tutorial was heavily inspired by the one outlined in François Chollet’s Deep Learning with Python, as well as that from a separate article on the Keras blog.


Let’s start by importing the modules necessary for this demonstration.

from tensorflow.keras import datasets
from tensorflow.keras import backend as K
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras import callbacks
from tensorflow.keras.losses import binary_crossentropy
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

The objective of today’s task is to build an autoencoder model that produces MNIST hand-written digits. The hidden dimension, or the latent space of the model, is going to a random vector living in two-dimensional space. Let’s specify this setup, along with some other miscellaneous configurations, before we proceed with constructing the model architecture.

image_shape = (28, 28, 1)
batch_size = 32
latent_dim = 2
kernel_size = 3
filters = 16
epochs = 30

The Model

It’s time to build our model… or not quite now. Before we start stacking layers for the encoder and the decoder, we need to define a sampling function that will perform the meat of the variational inference involved in VAE.

Sampling Function

Let’s start out by taking a look at the sampling function we will use to define one of the layers of the variational Autoencoder network.

def sampling(args):
    z_mean, z_log_var = args
    batch = K.shape(z_mean)[0]
    dim = K.int_shape(z_mean)[1]
    epsilon = K.random_normal(shape=(batch, dim))
    return z_mean + K.exp(0.5 * z_log_var) * epsilon

Simply put, the sampling() above below takes as arguments z_mean and z_log_var in the form of a bundled list. As you can guess from the name of the variables, these two parameters refer to the mean and log variance of the random vector living in our predefined latent space. Note that we are assuming a diagonal Gaussian here: in other words, the covariance matrix of the multi-dimensional Gaussian is assumed to be diagonal, meaning that each elements of the vector are independent. If any of this sounds foreign to you, I recommend that you read this post on the Gaussian distribution.

Let’s continue our discussion with the sampling function. The goal here is to sample a random vector in the latent space from the distribution specified by the two parameters, mean and log variance. The sampling process can be expressed as follows:

\[z = \mu_z + \epsilon \cdot \sigma_z \tag{1}\]

where $\mu_z$ denotes the mean, corresponding to z_mean, $\epsilon$ denotes a tensor of random numbers sampled from the standard normal distribution, and $\sigma_z$ denotes the standard deviation (we will see how this is related to z_log_var in just a moment). Essentially, the goal here is to use a resampling technique such that we can sample from a standard normal distribution centered around mean 0 and a standard deviation of 1, but consequentially sample from a distribution of $z$ living in the latent space.

If you are wondering how (1) translates to the return statement,

z_mean + K.exp(0.5 * z_log_var) * epsilon

then the following equation might resolve your curiosity. This is the promised elaboration on the relationship between log variance and standard deviation:

\[\begin{align} \text{exp}\left(0.5 \cdot \log \sigma^2 \right) &= \text{exp}(0.5 \cdot 2 \log \sigma) \\ &= \text{exp}(\log \sigma) \\ &= \sigma \end{align} \tag{2}\]

Therefore, multiplying 0.5 is just a simple algebraic manipulation to morph log variance to standard deviation. The reason why we use log variance instead of just variance or standard deviation is to ensure numerical stability in computation.

Now that this part has been cleared, let’s start stacking away layers!

The Encoder Network

Just like the autoencoder, VAEs are composed of two discrete components: the encoder and the decoder. Here, we take a look at the first piece of the puzzle, the encoder network.

There are several things to note about this model. First, I decided to use a for loop to simplify the process of stacking layers. Instead of repeating the same code over multiple lines, I found this approach to be more succinct and concise. Second, we define a custom layer at the end, shown as Lambda, that uses the sampling function we defined earlier. This is the final key that enables us to build an encoder model that receives as input a 28-by-28 image, then output a two-dimensional latent vector representation of that image to pass onto the decoder network.

inputs = Input(shape=image_shape)
x = inputs
for i in range(2):
    filters *= 2
    x = Conv2D(filters=filters,
shape = K.int_shape(x)
x = Flatten()(x)
x = Dense(16, activation='relu')(x)
z_mean = Dense(latent_dim, name='z_mean')(x)
z_log_var = Dense(latent_dim, name='z_log_var')(x)

z = Lambda(sampling, output_shape=(latent_dim,), name='z')([z_mean, z_log_var])

encoder = Model(inputs, [z_mean, z_log_var, z], name='encoder')

Below is the summary of what our model looks like. Note that the model outputs a total of three quantities: z_mean, z_log_var, and z. We need the first two parameters to later sample from the latent distribution; z , of course, is needed to train the decoder.

Model: "encoder"
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 28, 28, 1)]  0                                            
conv2d (Conv2D)                 (None, 14, 14, 32)   320         input_1[0][0]                    
conv2d_1 (Conv2D)               (None, 7, 7, 64)     18496       conv2d[0][0]                     
flatten (Flatten)               (None, 3136)         0           conv2d_1[0][0]                   
dense (Dense)                   (None, 16)           50192       flatten[0][0]                    
z_mean (Dense)                  (None, 2)            34          dense[0][0]                      
z_log_var (Dense)               (None, 2)            34          dense[0][0]                      
z (Lambda)                      (None, 2)            0           z_mean[0][0]                     
Total params: 69,076
Trainable params: 69,076
Non-trainable params: 0

The Decoder Network

The decoder network looks similar to the the encoder, except that much of the architecture is in reverse order. Most notably, we use Conv2DTranpose to undo the convolution done by the encoder. This allows us to effectively scale up the input back to its original dimension, which is what we want to do with a generative model like a VAE.

One subtly worth mentioning is the fact that we use a sigmoid activation in the end. This is because we want the pixel values of the output to be between 0 and 1, just as the original input was normalized before it was fed into the encoder network via division by 255.

latent_inputs = Input(shape=(latent_dim,))
x = Dense(np.prod(shape[1:]), activation='relu')(latent_inputs)
x = Reshape(shape[1:])(x)
for i in range(2):
    x = Conv2DTranspose(filters=filters,
    filters //= 2

outputs = Conv2DTranspose(filters=1,

decoder = Model(latent_inputs, outputs)

The summary of the decoder network is presented below:

Model: "decoder"
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 2)]               0         
dense_1 (Dense)              (None, 3136)              9408      
reshape (Reshape)            (None, 7, 7, 64)          0         
conv2d_transpose (Conv2DTran (None, 14, 14, 64)        36928     
conv2d_transpose_1 (Conv2DTr (None, 28, 28, 32)        18464     
conv2d_transpose_2 (Conv2DTr (None, 28, 28, 1)         289       
Total params: 65,089
Trainable params: 65,089
Non-trainable params: 0

The Variational Autoencoder

Now that we have both the encoder and the decode network fully defined, it’s time to wrap them together into one autoencoder model. This can simply achieved by defining the input as the input of the encoder—the normalized MNIST images—and defining the output as the output of the decoder when fed a latent vector. Concretely, this process might look as follows:

outputs = decoder(encoder(inputs)[2])
vae = Model(inputs, outputs)

Let’s look a the summary of the CVAE. Note that the encoder and the decoder look like individual layers in the grand scheme of the VAE architecture.

Model: "model_1"
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 28, 28, 1)]       0         
encoder (Model)              [(None, 2), (None, 2), (N 69076     
decoder (Model)              (None, 28, 28, 1)         65089     
Total params: 134,165
Trainable params: 134,165
Non-trainable params: 0

The Loss Function

We have almost everything we need, but there is one crucial step that is missing: compiling the model with an optimizer and a loss function. Normally, defining a loss function is very easy: in most cases, we use pre-made loss functions that are available through the TensorFlow API, such as cross entropy or mean squared error. In the case of variational autoencoders, however, this is not such an easy task: how do we judge the robustness or the effectiveness of the decoder, which is essentially a generative algorithm? Of course, we could stop training once the figures it generates becomes reasonable, i.e. the mock MNIST digits it creates looks compelling to the human eye. However, this is a subjective metric at best, and we can’t expect there to be a ML engineer peering at the screen, looking at the outputs of the decoder per each epoch.

To tackle this challenge, we need to dive into some math. Let’s take a look.

Back to Bayes

First, let’s carefully review what our goal is for this task. The motivating idea behind variational autoencoders is that we want to model a specific distribution, namely the distribution of the latent space given some input. As you recall, this latent space is a two dimensional vector modeled as a multivariate diagonal Gaussian. Using Bayes’ theorem, we can express this distribution as follows:

\[\begin{align} p(z \vert x) &= \frac{p(x, z)}{p(x)} \\ &= \frac{p(x \vert z) p(z)}{p(x)} \\ &= \frac{p(x \vert z) p(z)}{\int p(x \vert z) \, dz} \end{align} \tag{3}\]

By now, it is pretty clear what the problem its: the evidence sitting in the denominator is intractable. Therefore, we cannot directly calculate or derive $p(z \vert x)$ in its closed form; hence the need for variational inference.

Evidence Lower Bound

The best we can do is to find a distribution $q(z \vert x)$ that best approximates $p(z \vert x)$. How do we find this distribution? Well, we know one handy concept that measures the difference or the pseudo-distance between two distributions, and that is Kullback-Leibler divergence. As we discussed in this post on entropy, KL divergence tells us how different two distributions are. So the goal here would be find a distribution that minimizes the following expression:

\[D_{KL}[q(z \vert x) \parallel p(z \vert x)] = \mathbb{E}_{q}[\log q(z \vert x) - \log p(z \vert x)] \tag{4}\]

Using the definition of conditional probability, we can simplify (4) as follows:

\[\begin{align} D_{KL}[q(z \vert x) \parallel p(z \vert x)] &= \mathbb{E}_{q}[\log q(z \vert x) - \log p(x, z) + \log p(x)] \\ &= \mathbb{E}_{q}[\log q(z \vert x) - \log p(x, z)] + \log p(x) \end{align} \tag{5}\]

The trick is to notice that $\log p(x)$ is a constant that can break out of the expectation calculation. Let’s continue by deriving an expression for the evidence term.

\[\begin{align} \log p(x) &= D_{KL}[q(z \vert x) \parallel p(z \vert x)] - \mathbb{E}_{q}[\log q(z \vert x) - \log p(x, z)] \\ &= D_{KL}[q(z \vert x) \parallel p(z \vert x)] + \mathbb{E}_{q}[\log p(x, z) - \log q(z \vert x)] \end{align} \tag{6}\]

A useful property to know about KL divergence is the fact that it is always non-negative. We will get into why this is the case in a moment. For now, let’s assume non-negativity to be true and transform (6) into an inequality:

\[\log p(x) \geq \mathbb{E}_{q}[\log p(x, z) - \log q(z \vert x)] \tag{7}\]

The term on the right of the inequality is known as the Evidence Lower Bound, or ELBO for short. Why are we interested in ELBO? First, note that $\log p(x)$, the evidence, is a constant. Therefore, minimizing KL divergence amounts to maximizing ELBO. This is the key to variational inference: instead of calculating the intractable integral in (3), we can find a distribution $q$ that which minimizes KL divergence by maximizing ELBO, which is a tractable operation.

Non-negativity of Kullback-Leibler Divergence

Let’s prove why KL divergence is always greater or equal to zero, which is a condition we assumed to be true in the derivation of ELBO above. For the sake of completeness, I present two ways of proving the same property.

Jensen’s Inequality

In the context of probability, Jensen’s inequality can be summarized as follows. Given a convex function $f(x)$,

\[\mathbb{E}[f(x)] \geq f(\mathbb{E}[x]) \tag{8}\]

We won’t get into rigorous proofs here, but it’s not difficult to see why this inequality stands with some basic geometric intuition. Due to its bow-like shape, the expected value of a convex function evaluated across a given interval will always be greater or equal to the function evaluated at the expected value of the random variable.


How is Jensen’s inequality related to the non-negativity of KL divergence? Let’s return back to the definition of KL divergence. For simplicity and to reduce notational burden, we briefly depart from conditional probabilities $.(z \vert x)$ and return back to generic distributions $p$ and $q$.

\[\begin{align} D_{KL}[q(x) \parallel p(x)] &= \mathbb{E}_{q}[\log q(x) - \log p(x)] \\ &= \int_{- \infty}^\infty q(x) \log \frac{q(x)}{p(x)} \, dx \end{align} \tag{9}\]

Notice that the definition of KL divergence itself is an expected value expression. Also, note that $- \log(x)$ is a convex function—$\log(x)$ itself is concave, but the negative sign flips the concavity the other way. With these observations in mind, we can apply Jensen’s inequality to derive the following:

\[\begin{align} D_{KL}[q(x) \parallel p(x)] &= \int_{- \infty}^\infty q(x) \log \frac{q(x)}{p(x)} \, dx \\ &= \int_{- \infty}^\infty q(x) \left( - \log \frac{p(x)}{q(x)} \right) \, dx \\ & \geq - \log \int_{- \infty}^\infty q(x) \frac{p(x)}{q(x)} \, dx \\ &= - \log \int_{- \infty}^\infty p(x) \, dx \\ &= 0 \end{align} \tag{10}\]

Therefore, we have shown that KL divergence is always greater or equal to zero, which was our end goal.

Simpler Proof with Logarithms

There is another version of a proof that I found a lot more intuitive and easier to follow than the previous approach. This derivation was borrowed from this post.

We start from the simple observation that a logarithmic function is always smaller than a linear one. In other words,

\[\log x \leq x - 1 \tag{11}\]

This is no rocket science, and one can easily verify (11) by simply plotting the two functions on a Cartesian plane.

Using (11), we can proceed in a different direction from the definition of KL divergence.

\[\begin{align} D_{KL}[q(x) \parallel p(x)] &= \int_{- \infty}^\infty q(x) \log \frac{q(x)}{p(x)} \, dx \\ &= - \int_{- \infty}^\infty q(x) \log \frac{p(x)}{q(x)} \, dx \\ &\geq - \int_{- \infty}^\infty q(x) \left(\frac{p(x)}{q(x)} - 1 \right) \, dx \\ &= \int_{- \infty}^\infty p(x) - q(x) \, dx \\ &= \int_{- \infty}^\infty p(x) \, dx - \int_{- \infty}^\infty q(x) \, dx \\ &= 0 \end{align} \tag{12}\]

Once again, we have shown that KL divergence is positive!

Proving this isn’t really necessary in the grand scheme of exploring the mathematics behind VAEs, yet I thought it would help to have this adjunctive section to better understand KL divergence and familiarize ourselves with some standard algebraic manipulations that are frequently invoked in many derivations.

Let’s jump back into variational inference and defining the cost function with ELBO.

Loss with Gaussian Distributions

Recall from the setup of our Variational Autoencoder model that we have defined the latent vector as living in two-dimensional space following a multivariate Gaussian distribution. It’s time to apply the ELBO equation to this specific context and derive a closed-form expression of our loss function.

Let’s recall the formula for ELBO:

\[\log p(x) \geq \mathbb{E}_{q}[\log p(x, z) - \log q(z \vert x)] = \text{ELBO} \tag{7}\]

After some rearranging, we can decompose ELBO into two terms, one of which is a KL divergence:

\[\begin{align} \text{ELBO} &= \mathbb{E}_{q}[\log p(x, z) - \log q(z \vert x)] \\ &= \mathbb{E}_{q}[\log p(x \vert z) + \log p(z) - \log q(z \vert x)] \\ &= \mathbb{E}_{q}[\log p(z) - \log q(z \vert x)] - \mathbb{E}_q[\log p(x \vert z)] \\ &= - D_{KL}[q(z \vert x) \parallel p(z)] - \mathbb{E}_q[\log p(x \vert z)] \end{align} \tag{13}\]

Now, it’s finally time for us to dive deep into math: let’s unpack the closed form expression in (13). Note that the ELBO expression applies to just about any distribution, but since we chose a multivariate Gaussian to be the base distribution, we will see how it unfolds specifically in this context.

Let’s begin by assuming the distribution of our models to be Gaussian. Namely,

\[p(z) = \frac{1}{\sqrt{2 \pi \sigma_p^2}} \text{exp}\left(- \frac{(z - \mu_p)^2}{2 \sigma_p^2}\right) \tag{14}\]

Because $q$ is an approximation of $p$, we naturally assume the same model for the approximate distribution:

\[q(z \vert x) = \frac{1}{\sqrt{2 \pi \sigma_q^2}} \text{exp}\left(- \frac{(z - \mu_q)^2}{2 \sigma_q^2}\right) \tag{15}\]

Now we can derive an expression for the negative KL divergence sitting in the ELBO expression:

\[\begin{align} - D_{KL}[q(z \vert x) \parallel p(z)] &= \int_{- \infty}^\infty q(z \vert x) \log \frac{p(x)}{q(z \vert x)} \, dz \\ &= \mathbb{E}_q \left[\log \left( \frac{\frac{1}{\sqrt{2 \pi \sigma_p^2}} \text{exp}\left(- \frac{(z - \mu_p)^2}{2 \sigma_p^2}\right)}{\frac{1}{\sqrt{2 \pi \sigma_q^2}} \text{exp}\left(- \frac{(z - \mu_q)^2}{2 \sigma_q^2}\right)} \right) \right] \\ &= \mathbb{E}_q \left[ \log \left( \frac{\sigma_q}{\sigma_p} \right) - \frac{(z - \mu_p)^2}{2 \sigma_p^2} + \frac{(z - \mu_q)^2}{2 \sigma_q^2} \right] \end{align} \tag{16}\]

This may seem like a lot, but it’s really just plugging in the distributions into the definition of KL divergence as an expectation and using some convenient properties of logarithms to perform simple algebraic simplifications. To proceed further, observe that the first term is a constant that can escape out of the expectation:

\[\begin{align} - D_{KL}[q(z \vert x) \parallel p(z)] &= \log \left( \frac{\sigma_q}{\sigma_p} \right) + \mathbb{E}_q \left[ - \frac{(z - \mu_p)^2}{2 \sigma_p^2} + \frac{(z - \mu_q)^2}{2 \sigma_q^2} \right] \\ &= \log \left( \frac{\sigma_q}{\sigma_p} \right) - \frac{1}{2 \sigma_p^2} \mathbb{E}_q[(z - \mu_p)^2] + \frac{1}{2 \sigma_q^2} \mathbb{E}_q[(z - \mu_q)^2] \end{align} \tag{17}\]

From the definition of variance and expectation, we know that

\[\mathbb{E}_q[(z - \mu_q)^2] = \sigma_q^2 \tag{18}\]

Therefore, we can simplify (17) as follows:

\[\begin{align} - D_{KL}[q(z \vert x) \parallel p(z)] &= \log \left( \frac{\sigma_q}{\sigma_p} \right) - \frac{1}{2 \sigma_p^2} \mathbb{E}_q[(z - \mu_p)^2] + \frac{1}{2 \sigma_q^2} \mathbb{E}_q[(z - \mu_q)^2] \\ &= \log \left( \frac{\sigma_q}{\sigma_p} \right) - \frac{1}{2 \sigma_p^2} \mathbb{E}_q[(z - \mu_p)^2] + \frac12 \end{align} \tag{19}\]

Let’s zoom in on the expected value term in (19). Our goal is to use (18) again so that we can flesh out another one half from that term. This can be achieved through some clever algebraic manipulation:

\[\begin{align} \frac{1}{2 \sigma_p^2} \mathbb{E}_q[(z - \mu_p)^2] &= \frac{1}{2 \sigma_p^2} \mathbb{E}_q[(z - \mu_q + \mu_q - \mu_p)^2] \\ &= \frac{1}{2 \sigma_p^2} \mathbb{E}_q[(z - \mu_q)^2 + (\mu_q - \mu_p)^2 + 2(z - \mu_q)(\mu_q - \mu_p)] \\ &= \frac{1}{2 \sigma_p^2} \left\{ \mathbb{E}_q[(z - \mu_q)^2] + \mathbb{E}_q[(\mu_q - \mu_p)^2] + 2\mathbb{E}_q[(z - \mu_q)(\mu_q - \mu_p)] \right\} \end{align} \tag{20}\]

But since the the expected value of $(\mu_q - \mu_p)^2$ is constant and that of $(z - \mu_q)$ is zero,

\[\frac{1}{2 \sigma_p^2} \mathbb{E}_q[(z - \mu_p)^2] = \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2 \sigma_p^2} \tag{21}\]

We can now plug this simplified expression back into the calculation of KL divergence, in (19):

\[\begin{align} - D_{KL}[q(z \vert x) \parallel p(z)] &= \log \left( \frac{\sigma_q}{\sigma_p} \right) - \frac{1}{2 \sigma_p^2} \mathbb{E}_q[(z - \mu_p)^2] + \frac12 \\ &= \log \left( \frac{\sigma_q}{\sigma_p} \right) - \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2 \sigma_p^2} + \frac12 \end{align} \tag{22}\]

Since we will standardize our input such that $\mu_p = 0$ and $\sigma_p = 1$, we can plug these quantities into (22) and show that

\[\begin{align} - D_{KL}[q(z \vert x) \parallel p(z)] &= \log(\sigma_q) - \frac{\sigma_q^2 + \mu_q^2}{2} + \frac12 \\ &= \frac12 \left( 1 + \log \sigma_q^2 - \sigma_q^2 - \sigma_p^2 \right) \end{align} \tag{23}\]

We are almost done with deriving the expression for ELBO. I say almost, because we still have not dealt with the trailing term in (13):

\[\begin{align} \text{ELBO} &= - D_{KL}[q(z \vert x) \parallel p(z)] - \mathbb{E}_q[\log p(x \vert z)] \\ &= \frac12 \left( 1 + \log \sigma_q^2 - \sigma_q^2 - \sigma_p^2 \right) - \mathbb{E}_q[\log p(x \vert z)] \end{align} \tag{13}\]

At this point, it is extremely useful to recall the definition of cross entropy, which is generically defined as follows:

\[\begin{align} H(p, q) &= - \int_\mathbb{\chi} p(x) \log q(x) \, dx \\ &= - \mathbb{E}_p[\log q(x)] \end{align} \tag{24}\]

Therefore, we see that the trailing term in (13) is just a cross entropy between two distributions!

This was a circumlocutions journey, but that is enough math we will need for this tutorial. It’s time to get back to coding.

Model Compilation

All that math was for this simple code snippet shown below:

xent_loss = 28 * 28 * binary_crossentropy(K.flatten(inputs), 
kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
vae_loss = K.mean(xent_loss + kl_loss)

As you can see, this short code snippet shows, in essence, how we can define a compile a model with a custom loss function. In this case, xent_loss refers to the reconstruction loss, which is the cross entropy term we saw earlier. kl_loss, as you might expect, simply refers to KL divergence. Notice how there is a 0.5 multiplying factor in the kl_loss expression, just like we did when we derived it in the section above. With some keen observations and comparisons, you will easily see that the code is merely a transcription of (13), with some minor differences given dimensionality.

One important fact to note is that the gradient descent algorithm, by default, seeks to minimize the loss function. However, we discussed above how the objective of VAE is to maximize ELBO. Therefore, we modify ELBO into a loss function that is to be minimized by defining the loss function as the negative of ELBO. In other words, the cost function $J$ is defined as $- ELBO$; hence the difference in sign.

Testing the Model

It’s finally time to test the model. Let’s first begin with data preparation and preprocessing.

def load_data():
  (X_train, y_train), (X_test, y_test) = datasets.mnist.load_data()
  X_train = np.reshape(X_train, (len(X_train), 28, 28, 1))
  X_test = np.reshape(X_test, (len(X_test), 28, 28, 1))
  X_train, X_test = X_train.astype('float64') / 255., X_test.astype('float64') / 255.
  return X_train, y_train, X_test, y_test
X_train, y_train, X_test, y_test = load_data()

Now, we should have the training and test set ready to be fed into our network. Next, let’s define a simple callback application using the EarlyStopping monitor so that training can be stopped when no substantial improvements are being made to our model. This was included because training a VAE can take some time, and we don’t want to waste computing resources seeing only submarginal increments to the model performance.

early_stopping_monitor = callbacks.EarlyStopping(patience=2)

        validation_data=(X_test, None),

Training begins!

Train on 60000 samples, validate on 10000 samples
Epoch 1/30
60000/60000 [==============================] - 13s 210us/sample - loss: 191.6765 - val_loss: 170.1529
Epoch 2/30
60000/60000 [==============================] - 11s 180us/sample - loss: 163.9683 - val_loss: 160.2263
Epoch 3/30
60000/60000 [==============================] - 11s 181us/sample - loss: 159.0007 - val_loss: 158.0777
Epoch 4/30
60000/60000 [==============================] - 11s 180us/sample - loss: 156.8238 - val_loss: 156.3414
Epoch 5/30
60000/60000 [==============================] - 11s 181us/sample - loss: 155.4041 - val_loss: 154.7498
Epoch 6/30
60000/60000 [==============================] - 11s 181us/sample - loss: 154.2847 - val_loss: 153.9668
Epoch 7/30
60000/60000 [==============================] - 11s 180us/sample - loss: 153.4675 - val_loss: 153.8024
Epoch 8/30
60000/60000 [==============================] - 11s 179us/sample - loss: 152.7539 - val_loss: 152.6393
Epoch 9/30
60000/60000 [==============================] - 11s 181us/sample - loss: 152.2562 - val_loss: 152.6557
Epoch 10/30
60000/60000 [==============================] - 11s 180us/sample - loss: 151.7278 - val_loss: 151.7882
Epoch 11/30
60000/60000 [==============================] - 11s 179us/sample - loss: 151.3973 - val_loss: 151.6642
Epoch 12/30
60000/60000 [==============================] - 11s 177us/sample - loss: 150.9899 - val_loss: 151.3316
Epoch 13/30
60000/60000 [==============================] - 11s 177us/sample - loss: 150.6191 - val_loss: 152.0779
Epoch 14/30
60000/60000 [==============================] - 11s 179us/sample - loss: 150.3378 - val_loss: 151.6977

After 14 epochs, training has stopped, meaning that no meaningful improvements were being made.

Let’s visualize the representation of the latent space learned by the VAE. Visualizing this representation is easy in this case because we defined the latent space to be two-dimensional; in other words, all points can be plotted on a Cartesian plane. Let’s take a look:

X_test_encoded, _, _ = encoder.predict(X_test, batch_size=batch_size)
plt.figure(figsize=(10, 10))
plt.scatter(X_test_encoded[:, 0], X_test_encoded[:, 1], c=y_test)

This plot shows us how each numbers are distributed across the latent space. Notice that numbers that belong to the same class seem to be generally clustered around each other, although there is a messy region in the middle. This is a reasonable result: while we would expect ones to be fairly easy to distinguish from, say, eights, numbers like zeros and sixes might look very similar, and hence appear mixed as a lump in the fuzzy region in the middle.

One cool thing about VAEs is that we can use their learned representation to see how numbers slowly morph and vary across a specified domain. This is why VAEs are considered to be generative models: if we feed the VAE some two-dimensional vector living in the latent space, it will spit out a digit. Whether or not that digit appears convincing depends on the random vector the decoder was provided as input: if the vector is close to the learned mean, $\mu_q$, then the result will be convincing; if not, we might see a confusing blob of black and white.

Let’s see what exactly is going on in the fuzzy region of the image, because that is apparently where all the digits mingle together and seem indistinguishable from one another. Put differently, if we vary the random vector little by little across that region, we will be able to see how the digit slowly morphs into another number.

n = 15 
digit_size = 28
figure = np.zeros((digit_size * n, digit_size * n))
grid_x = np.linspace(-2, 2, n)
grid_y = np.linspace(-1, 3, n)

for i, yi in enumerate(grid_x):
    for j, xi in enumerate(grid_y):
        z_sample = np.array([[xi, yi]])
        x_decoded = decoder.predict(z_sample)
        digit = x_decoded[0].reshape(digit_size, digit_size)
        figure[i * digit_size: (i + 1) * digit_size,
               j * digit_size: (j + 1) * digit_size] = digit

plt.figure(figsize=(10, 10))
plt.imshow(figure, cmap='Greys_r')
plt.xticks([]); plt.yticks([])

How cool is that? We were able to get a VAE to show us how one digit can shift across a certain domain of the latent space. This is one of the many cool things we can do with a generative model like a variational autoencoder.


In this post, we took a deep dive into the math behind variational autoencoders. It was a long journey, but definitely worth it because it exposed us to many core concepts in deep learning and statistics. At the same time, I found it fascinating to see how a model could learn from a representation to generate numbers, as we saw in the very last figure.

In a future post, we will look at generative adversarial networks, or GANs, which might be considered as the pinnacle of generative models and a successor to autoencoders. GANs resemble autoencoders in that it is also composed of two models. One core difference, however, is that in GANs, the two models are in a competing relationship, whereas in autoencoders, the encoder and the decoder play distinct, complementary roles. If any of this sounds exciting, make sure to check out the next post.

I hope you enjoyed reading. Catch you up in the next one!