<p><em>Note: This blog post was completed as part of Yale’s CPSC 482: Current Topics in Applied Machine Learning.</em></p>
<p>“Turn right at 130 Prospect Street.”</p>
<p>If you’ve used Google Maps before, you will recall the familiar, smooth voice of the navigation assistant. At first glance, the voice appears to be a simple replay of human recordings. However, you will quickly realize that it is impossible to record the names of millions of streets, not to mention the billions of driving contexts in which they can appear.</p>
<p>Modern software, such as Google Maps or voice assistants, is powered by neural text-to-speech (TTS), a powerful technology that synthesizes human-sounding voices using machine learning. In this blog post, we will dive deep into the NeurIPS 2020 paper <a href="https://arxiv.org/abs/2005.11129">Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search</a>, which demonstrates one of the many ways in which deep neural networks can be used for natural TTS.</p>
<h2 id="neural-text-to-speech">Neural Text-to-Speech</h2>
<p>Modern neural TTS pipelines are typically composed of two components: an acoustic feature generator and a vocoder. The acoustic feature generator accepts text as input and outputs an acoustic representation, such as a mel-spectrogram. In the second stage of the pipeline, a neural vocoder accepts the acoustic representation as input and outputs a raw waveform. More generally, let $f$ and $g$ denote the acoustic feature generator and the vocoder, respectively. Given an input text $T$, neural TTS can be understood as a composite function that outputs a waveform $W$ via</p>
\[\begin{aligned}
&X = f(T) \\
&W = g(X),
\end{aligned}\]
<p>where $X$ denotes the intermediate acoustic representation. Schematically, the composition $g \circ f$ fully defines the two-stage TTS process.</p>
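<p>To make the composition concrete, here is a minimal sketch in Python. The two functions below are dummy stand-ins, not real models, but the pipeline structure $W = g(f(T))$ is exactly as described above.</p>

```python
import math


def acoustic_feature_generator(text):
    # stand-in for f: map each character to a dummy one-dimensional "frame"
    return [[float(ord(c))] for c in text]


def vocoder(frames):
    # stand-in for g: map each frame to a dummy waveform sample
    return [math.sin(frame[0]) for frame in frames]


def tts(text):
    # the two-stage pipeline: W = g(f(T))
    return vocoder(acoustic_feature_generator(text))


print(len(tts("hi")))  # 2
```

In a real system, $f$ would be an acoustic model like Glow-TTS and $g$ a neural vocoder, but the interface between the two stages is just this function composition.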
<p>In this blog post, we will explore the first stage of the pipeline, the acoustic feature generator, exemplified by Glow-TTS. This post will proceed as follows. First, we discuss generative flow models, the first core component of Glow-TTS. Second, we discuss the monotonic alignment search (MAS) algorithm. Third, we discuss the Glow-TTS pipeline as a whole by putting flow and MAS into a single picture. Last but not least, we conclude by considering some limitations of Glow-TTS and point to more recent literature charting exciting directions in the field of neural TTS.</p>
<h2 id="flow">Flow</h2>
<p>Text-to-speech is a conditional generative task, in which a model is given a sequence of tokens and produces a stream of utterances that matches the input text. Many neural TTS models employ generative models at their core, such as GANs, VAEs, transformers, or diffusion models, often borrowing from breakthroughs in other domains such as computer vision.</p>
<h3 id="change-of-variables">Change of Variables</h3>
<p>Glow-TTS is based on normalizing flow, which is a class of well-studied generative models. The theoretical basis of normalizing flows is the change of variables formula. Let $\mathbf{X}$ and $\mathbf{Y}$ denote random variables, each with PDF $f_\mathbf{X}$ and $f_\mathbf{Y}$, respectively. Let $h$ denote some invertible transformation such that $\mathbf{Y} = h(\mathbf{X})$. Typically, $f_\mathbf{X}$ is a simple, tractable prior distribution, such as a standard Gaussian, and we seek to apply $h$ to model some more complicated distribution given by $\mathbf{Y}$. Then, the change of variables formula states that</p>
\[\begin{aligned}
f_\mathbf{Y}(\mathbf{y})
&= f_\mathbf{X}(\mathbf{x}) \bigg| \det \frac{d \mathbf{x}}{d \mathbf{y}} \bigg| \\
&= f_\mathbf{X}(h^{-1}(\mathbf{y})) \bigg| \det \frac{d \mathbf{x}}{d \mathbf{y}} \bigg| \\
&= f_\mathbf{X}(h^{-1}(\mathbf{y})) \bigg| \det \frac{d h^{-1}(\mathbf{y})}{d \mathbf{y}} \bigg|,
\end{aligned}\]
<p>where $\det$ denotes the determinant and the derivative term represents the Jacobian.</p>
<p>A variation of this formula that allows for sampling from the base distribution can be written as follows:</p>
\[\begin{aligned}
f_\mathbf{Y}(\mathbf{y})
&= f_\mathbf{X}(\mathbf{x}) \bigg| \det \frac{d h^{-1}(\mathbf{y})}{d \mathbf{y}} \bigg| \\
&= f_\mathbf{X}(\mathbf{x}) \bigg| \det \left( \frac{d h(\mathbf{x})}{d \mathbf{x}} \right)^{-1} \bigg| \\
&= f_\mathbf{X}(\mathbf{x}) \bigg| \det \frac{d h(\mathbf{x})}{d \mathbf{x}} \bigg|^{-1}.
\end{aligned}\]
<p>The intuition behind the change of variables formula is that the probability mass of an interval in $\mathbf{X}$ should remain unchanged in the transformed $\mathbf{Y}$ space. The determinant of the Jacobian is a corrective term that accounts for the slope or the “sensitivity” of the transformation given by $h$.</p>
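<p>We can sanity-check the formula numerically. The sketch below is my own toy example, not from the paper: it transforms a standard Gaussian through the invertible map $h(x) = 2x + 1$ and verifies that the change of variables density matches the known $\mathcal{N}(1, 4)$ density of the transformed variable.</p>

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    # density of a univariate Gaussian
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


# invertible transformation h(x) = 2x + 1, so h^{-1}(y) = (y - 1) / 2
# and the Jacobian term is |d h^{-1}(y) / dy| = 1 / 2
y = 0.7
f_y = normal_pdf((y - 1) / 2) * 0.5

# Y = h(X) with X ~ N(0, 1) is distributed as N(1, 2^2), so the
# change of variables result should match the N(1, 4) density directly
assert abs(f_y - normal_pdf(y, mu=1.0, sigma=2.0)) < 1e-12
```

The factor of $1/2$ is exactly the "sensitivity" correction: $h$ stretches intervals by a factor of 2, so the density must shrink by the same factor to conserve probability mass.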
<h3 id="maximum-likelihood">Maximum Likelihood</h3>
<p>Normalizing flow models can then be understood as a collection of nested invertible transformations, i.e., $h_n \circ h_{n - 1} \circ \cdots \circ h_1$, where $n$ denotes the number of flow layers in the model.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> To better understand what this composite transformation achieves, let’s apply a logarithm to the change of variables formula.</p>
\[\log f_\mathbf{Y} (\mathbf{y}) = \log f_\mathbf{X} (\mathbf{x}) - \log \bigg| \det \frac{d h(\mathbf{x})}{d \mathbf{x}} \bigg|.\]
<p>To simplify notation, let $f_i$ denote the PDF of the $i$-th intermediate random variable in the composite transformation, with $\mathbf{x}_i = h_i(\mathbf{x}_{i - 1})$. Then, the nested transformation can be expressed as</p>
\[\begin{aligned}
\log f_n(\mathbf{x}_n)
&= \log f_{n - 1}(\mathbf{x}_{n - 1}) - \log \bigg| \det \frac{d h_n(\mathbf{x}_{n - 1})}{d \mathbf{x}_{n - 1}} \bigg| \\
&= \log f_{n - 2}(\mathbf{x}_{n - 2}) - \log \bigg| \det \frac{d h_n(\mathbf{x}_{n - 1})}{d \mathbf{x}_{n - 1}} \bigg| - \log \bigg| \det \frac{d h_{n - 1}(\mathbf{x}_{n - 2})}{d \mathbf{x}_{n - 2}} \bigg| \\
&= \cdots \\
&= \log f_0(\mathbf{x}_0) - \sum_{i = 1}^n \log \bigg| \det \frac{d h_i(\mathbf{x}_{i - 1})}{d \mathbf{x}_{i - 1}} \bigg|.
\end{aligned}\]
<p>The immediate implication of this exposition is that a repeated application of the change of variables formula provides a direct way of computing the likelihood of an observation from some complex, real-data distribution $f_n$ given a prior $f_0$ and a set of invertible transformations $h_1, h_2, \dots, h_n$. This conclusion illustrates the power of normalizing flows: they offer a direct way of measuring the likelihood of complex, high-dimensional data, such as ImageNet images, starting from a simple distribution, such as an isotropic Gaussian. Since the likelihood can be obtained directly, flow models are trained to maximize the log-likelihood, which is exactly the expression derived above.</p>
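<p>As a toy illustration of this telescoping sum, the sketch below (my own example, with made-up parameters) stacks two scalar affine “layers” standing in for real flow layers, and computes the exact log-likelihood of an observation by inverting them while accumulating log-determinants.</p>

```python
import math

# each scalar "flow layer" is h_i(x) = s * x + t, with log|det J| = log|s|
layers = [(2.0, 1.0), (0.5, -3.0)]  # (s, t) pairs, applied in order


def log_prior(x):
    # log density of a standard Gaussian prior f_0
    return -0.5 * (x ** 2 + math.log(2 * math.pi))


def log_likelihood(y):
    # invert the layers back to x_0, accumulating log-determinants
    log_det_sum = 0.0
    x = y
    for s, t in reversed(layers):
        x = (x - t) / s
        log_det_sum += math.log(abs(s))
    # log f_n(y) = log f_0(x_0) - sum_i log|det J_i|
    return log_prior(x) - log_det_sum


print(log_likelihood(0.3))
```

Here the composite map happens to be $y = x_0 - 2.5$, so the two log-determinant terms ($\log 2$ and $\log 0.5$) cancel; with a trainable model, these parameters would instead be adjusted by gradient descent to maximize this quantity over observed data.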
<h3 id="affine-coupling">Affine Coupling</h3>
<p>Although direct likelihood computation is a marked advantage of flow over other generative models, it comes with two clear limitations:</p>
<ul>
<li>All transformations must be invertible.</li>
<li>The determinant of the Jacobian must be easily computable.</li>
</ul>
<p>A number of methods have been proposed to satisfy these constraints. One of the most popular is the affine coupling layer. Let $d$ denote the cardinality of the embedding space. Given an input $\mathbf{x}$ and an output $\mathbf{z}$, the affine coupling layer can schematically be written as</p>
\[\begin{aligned}
\mathbf{z}_{1:d/2} &= \mathbf{x}_{1:d/2} \\
\mathbf{z}_{d/2:d}
&= \mathbf{x}_{d/2:d} \odot s_\theta(\mathbf{x}_{1:d/2}) + t_\theta(\mathbf{x}_{1:d/2}) \\
&= \mathbf{x}_{d/2:d} \odot s_\theta(\mathbf{z}_{1:d/2}) + t_\theta(\mathbf{z}_{1:d/2}).
\end{aligned}\]
<p>In other words, the affine coupling layer implements a special transformation in which the top half of $\mathbf{z}$ is simply copied from $\mathbf{x}$ without modification. The bottom half undergoes an affine transformation, where the weights and biases are computed from the top half of $\mathbf{x}$. We can easily check that this transformation is indeed invertible:</p>
\[\begin{aligned}
\mathbf{x}_{1:d/2} &= \mathbf{z}_{1:d/2} \\
\mathbf{x}_{d/2:d} &= (\mathbf{z}_{d/2:d} - t_\theta(\mathbf{z}_{1:d/2})) \oslash s_\theta(\mathbf{z}_{1:d/2}),
\end{aligned}\]
<p>where $\oslash$ denotes element-wise division. Conveniently, the affine coupling layer is not only invertible, but also enables efficient computation of the Jacobian determinant. This comes from the fact that the top half of the input is unchanged.</p>
\[\begin{align}
\mathbf{J}
&= \begin{pmatrix} \frac{d \mathbf{z}_{1:d/2}}{d \mathbf{x}_{1:d/2}} & \frac{d \mathbf{z}_{1:d/2}}{d \mathbf{x}_{d/2:d}} \\ \frac{d \mathbf{z}_{d/2:d}}{d \mathbf{x}_{1:d/2}} & \frac{d \mathbf{z}_{d/2:d}}{d \mathbf{x}_{d/2:d}} \end{pmatrix} \\
&= \begin{pmatrix} \mathbb{I} & 0 \\ \frac{d \mathbf{z}_{d/2:d}}{d \mathbf{x}_{1:d/2}} & \text{diag}(s_\theta(\mathbf{x}_{1:d/2})) \end{pmatrix}.
\end{align}\]
<p>Although $\mathbf{J_{21}}$ contains complicated terms, we do not have to consider them when computing $\det \mathbf{J}$: the determinant of a lower triangular matrix is simply the product of its diagonal entries. Hence, $\det \mathbf{J} = \det(\mathbf{J_{11}}) \det(\mathbf{J_{22}}) = \prod_i s_\theta(\mathbf{x}_{1:d/2})_i$, which is computationally tractable.</p>
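<p>The forward pass, inverse pass, and log-determinant can be sketched in a few lines. In this toy version, <code class="language-plaintext highlighter-rouge">s_theta</code> and <code class="language-plaintext highlighter-rouge">t_theta</code> are made-up fixed maps standing in for learned networks, and the scale is parametrized as $\exp(s)$, a common trick that guarantees invertibility and makes the log-determinant simply the sum of log-scales.</p>

```python
import math


def s_theta(x_top):
    # stand-in for a learned log-scale network
    return [0.5 * v for v in x_top]


def t_theta(x_top):
    # stand-in for a learned translation network
    return [v + 1.0 for v in x_top]


def coupling_forward(x):
    d = len(x)
    top, bottom = x[: d // 2], x[d // 2 :]
    s, t = s_theta(top), t_theta(top)
    # bottom half: element-wise affine transform with positive scales exp(s)
    z_bottom = [b * math.exp(si) + ti for b, si, ti in zip(bottom, s, t)]
    log_det = sum(s)  # log|det J| = sum of log-scales
    return top + z_bottom, log_det


def coupling_inverse(z):
    d = len(z)
    top, bottom = z[: d // 2], z[d // 2 :]
    s, t = s_theta(top), t_theta(top)
    x_bottom = [(b - ti) * math.exp(-si) for b, si, ti in zip(bottom, s, t)]
    return top + x_bottom


x = [0.2, -1.3, 0.7, 2.0]
z, log_det = coupling_forward(x)
assert all(abs(a - b) < 1e-12 for a, b in zip(x, coupling_inverse(z)))
```

Note that the inverse never needs to invert <code class="language-plaintext highlighter-rouge">s_theta</code> or <code class="language-plaintext highlighter-rouge">t_theta</code> themselves; because both are evaluated on the unchanged top half, they can be arbitrarily complex neural networks.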
<p>In practice, flow layers take a slightly more complicated form than the conceptual architecture detailed above. One easy and necessary modification is to shuffle which indices are left unchanged at each layer; otherwise, the top half of the input representation would never be altered, even after passing through $n$ layers. Another sensible modification is to apply a more complicated transformation. For example, <a href="https://arxiv.org/abs/1605.08803">Real NVP</a> proposes the following schema:</p>
\[\begin{aligned}
\mathbf{z}_{1:d/2} &= \mathbf{x}_{1:d/2} \\
h &= a \times \text{tanh}(s_\theta(\mathbf{x}_{1:d/2})) + b \\
\mathbf{z}_{d/2:d} &= \text{exp}(h) \times \mathbf{x}_{d/2:d} + g_\theta(\mathbf{x}_{1:d/2}).
\end{aligned}\]
<p>To summarize:</p>
<ul>
<li>Flow models are based on the change of variables formula, which offers a way of understanding the PDF of the transformed random variable.</li>
<li>Since flow models can directly compute the likelihood of the data distribution starting from a prior, they are trained to maximize the log-likelihood of observed data.</li>
<li>Many architectures, such as affine coupling layers, have been proposed to fulfill the invertibility and Jacobian determinant constraints of flow.</li>
</ul>
<p>Now that we have understood how flow works, let’s examine how flow is used in Glow-TTS.</p>
<h3 id="glow-tts">Glow-TTS</h3>
<p>Glow-TTS uses a flow-based decoder that transforms mel-spectrograms into a latent representation. As can be seen below in the architecture diagram, Glow-TTS accepts ground-truth mel-spectrograms (top of figure) and ground-truth text tokens (bottom of figure, shown as “a b c”) during training. Then, it runs the monotonic alignment search algorithm, which we will explore in the next section, to find an alignment between text and speech. The main takeaway is that the flow-based decoder transforms mel-spectrograms $\mathbf{y}$ to some latent vector $\mathbf{z}$, i.e., $f(\mathbf{y}) = \mathbf{z}$.</p>
<p><img src="https://production-media.paperswithcode.com/methods/Screen_Shot_2021-08-10_at_2.50.30_PM.png" /></p>
<p>At a glance, it might not be immediately clear why we would want to use a flow model for the decoder instead of, for instance, a CNN or a transformer. However, the inference procedure makes clear why we need a flow-based decoder. To synthesize a mel-spectrogram during inference, we estimate latent representations from user input text, then pass them on to the decoder. Since the decoder is invertible, we can reverse flow through the decoder to obtain a predicted mel-spectrogram, i.e., $f^{-1}(\hat{\mathbf{z}}) = \hat{\mathbf{y}}$, where $\hat{\cdot}$ denotes a prediction (as opposed to a ground truth). In Glow-TTS, invertibility offers an intuitive, elegant way of switching from training to inference.</p>
<p>The part that remains unexplained is how the model learns the latent representations and the relationship between text and acoustic features. This is explained by monotonic alignment search, which is the main topic of the next section.</p>
<h2 id="monotonic-alignment-search">Monotonic Alignment Search</h2>
<p>Proposed by Kim et al., Monotonic Alignment Search (MAS) is an algorithm for efficiently identifying the most likely alignment between speech and text.</p>
<p><img src="https://distill.pub/2017/ctc/thumbnail.jpg" /></p>
<p>Text-to-speech alignment refers to the correspondence between text and spoken audio. Consider a simple input, “hello!”, accompanied by a human recording of that sentence. We could imagine that the first 0.5 seconds of the audio corresponds to the first letter “h,” followed by 0.7 seconds of “e,” and so on. The process of attributing a specific text token to some time interval within the audio can be described as alignment search.</p>
<p>Finding an accurate alignment between speech and text is an incredibly important task in TTS. If an alignment discovered by the model is inaccurate, the model may skip words or repeat certain syllables, both of which are failure modes we want to avoid. One of the most salient features of MAS is that it prevents such failures by preemptively enforcing very specific yet sensible inductive biases in the alignment search algorithm.</p>
<h3 id="inductive-biases">Inductive Biases</h3>
<p>Let’s begin by enumerating a list of common-sense intuitions we have about TTS alignments.</p>
<ul>
<li>The model should “read” from left to right in a linear fashion.</li>
<li>The model always begins with the first letter and ends on the last letter.</li>
<li>The model should not skip any text.</li>
<li>The model should not repeat any text.</li>
</ul>
<p>Many previous alignment search methods do not necessarily enforce these constraints. For instance, Tacotron 2 uses sequence-to-sequence RNN attention to autoregressively build the alignment between speech and text. However, autoregressive alignment search often fails when long input texts are fed into the model, since errors accumulate throughout the text sequence, yielding a highly inaccurate alignment by the end of the iteration. On the other hand, MAS is not only non-autoregressive, but also designed specifically so that the discovered alignment never violates the set of inductive biases outlined above. This makes the model much more robust, even when the input sequence is arbitrarily long.</p>
<h3 id="dynamic-programming">Dynamic Programming</h3>
<p>At the heart of MAS is dynamic programming (DP), a common programming technique used to optimize runtime on problems that can be decomposed into recurring sub-problems sharing the same structure as their parent. DP offers a reasonably efficient way of solving many such problems, often in $O(n^d)$ time, where $n$ is the size of the input and $d$ denotes the DP dimensionality. While this section will not attempt to explain DP in full, we will consider a toy problem to motivate DP specifically in the context of MAS.</p>
<p>Consider a classic dynamic programming problem, where the goal is to find a monotonic path that maximizes the sum of scores given some score matrix. Here, “monotonic” means that from the current position, the path either moves diagonally down to the next row or steps to the cell to the right within the same row. While there might be many ways to approach this problem, here is one possible solution.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import copy


def find_maximum_sum_path(scores):
    # assumes num_cols >= num_rows, i.e., at least as many frames as tokens
    num_rows = len(scores)
    num_cols = len(scores[0])
    # copy to avoid overriding `scores`
    scores2 = copy.deepcopy(scores)
    # base case: the first row can only be entered from the left
    for j in range(1, num_cols):
        scores2[0][j] += scores2[0][j - 1]
    # dynamic programming: each cell is reached either diagonally
    # from the row above or from the left within the same row
    for i in range(1, num_rows):
        for j in range(i, num_cols):
            if i == j:
                # the only valid predecessor is the diagonal cell
                scores2[i][j] += scores2[i - 1][j - 1]
            else:
                scores2[i][j] += max(scores2[i - 1][j - 1], scores2[i][j - 1])
    # backtracking: recover the path from the bottom-right corner
    path = [[0 for _ in range(num_cols)] for _ in range(num_rows)]
    i = num_rows - 1
    for j in reversed(range(num_cols)):
        path[i][j] = 1
        if i != 0 and (i == j or scores2[i][j - 1] &lt; scores2[i - 1][j - 1]):
            i -= 1
    return path
</code></pre></div></div>
<p>Given the following <code class="language-plaintext highlighter-rouge">scores</code> matrix (named <code class="language-plaintext highlighter-rouge">grid</code> below), the function returns the following result:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">grid</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
<span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
<span class="p">]</span>
<span class="o">>>></span> <span class="n">find_maximum_sum_path</span><span class="p">(</span><span class="n">grid</span><span class="p">)</span>
<span class="p">[</span>
<span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
<span class="p">]</span>
</code></pre></div></div>
<p>It is not difficult to perform a manual sanity check to verify that the returned result is indeed the path that maximizes the sum of scores while adhering to the monotonicity constraint.</p>
<h3 id="likelihood-scores">Likelihood Scores</h3>
<p>Let’s take a step back and revisit the model architecture diagram presented above. On the left side of the diagram, we see an illustration of monotonic alignment search in action. Notice that this is exactly the problem we solved above: given some matrix of scores, find a monotonic path that maximizes the sum. Now, only a few missing pieces remain:</p>
<ul>
<li>What is the matrix of scores?</li>
<li>How does this relate to the flow-based decoder?</li>
</ul>
<p>It turns out that the two questions are closely related, and answering one will shed light on the other.</p>
<p>Recall that Glow-TTS deals with two input modalities during training: a string of text and its corresponding mel-spectrogram. The mel-spectrogram is decoded through the flow-based decoder. Similarly, the text is fed to a text encoder network, which outputs $\mathbf{\mu}$ and $\mathbf{\sigma}$ for each token of text. In other words, given <code class="language-plaintext highlighter-rouge">["h", "e", "l", "l", "o"]</code>, we would have a total of five mean and standard deviation vectors corresponding to each letter.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> We can denote them as $\mathbf{\mu_1}, \mathbf{\mu_2}, \dots, \mathbf{\mu_5}$, and $\mathbf{\sigma_1}, \mathbf{\sigma_2}, \dots, \mathbf{\sigma_5}$. Let’s also assume in this example that the corresponding mel-spectrogram spans a total of 100 frames. The output of the flow decoder would also be 100 vectors, denoted as $\mathbf{z_1}, \mathbf{z_2}, \dots, \mathbf{z_{100}}$.</p>
<p>Using these quantities, we can then construct a likelihood score matrix $P \in \mathbb{R}^{5 \times 100}$, whose entries are computed via $P_{ij} = \log \phi(\mathbf{z_j}; \mu_i, \sigma_i)$, where $\phi$ denotes the normal probability density function. Since $\sigma_i$ is a vector instead of a full matrix, we assume a diagonal covariance matrix, i.e., the dimensions are modeled as independent Gaussians. The intuition is that the value of $P_{ij}$ indicates how likely it is that the $i$-th character matches or aligns with the $j$-th mel-spectrogram frame: if the text and audio match, the likelihood score will be high, and vice versa. Log-likelihoods are used so that a summation of scores effectively models a product in probability space.</p>
<p>Given this context, we can now apply the solution to the monotonic path sum problem motivated in the previous section. Instead of some arbitrary <code class="language-plaintext highlighter-rouge">scores</code> matrix, we create the probability score matrix $P$ and use DP to discover the most likely monotonic alignment between speech and text. The alignment will satisfy the inductive biases we identified earlier due to the inherent design of MAS.</p>
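<p>A toy sketch of constructing $P$ might look as follows; the values of $\mu$, $\sigma$, and $\mathbf{z}$ are random stand-ins (with far smaller dimensions than a real model), since the point is only the shape of the computation.</p>

```python
import math
import random

random.seed(0)

D, L_text, L_mel = 4, 3, 6  # toy sizes: embedding dim, num tokens, num frames

# per-token statistics from the text encoder and per-frame latents from
# the flow decoder (random stand-ins here, not real model outputs)
mu = [[random.gauss(0, 1) for _ in range(D)] for _ in range(L_text)]
sigma = [[1.0] * D for _ in range(L_text)]  # unit scales for simplicity
z = [[random.gauss(0, 1) for _ in range(D)] for _ in range(L_mel)]


def log_normal(z_j, mu_i, sigma_i):
    # log density of a Gaussian with diagonal covariance:
    # a sum of per-dimension log pdfs
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((v - m) / s) ** 2
        for v, m, s in zip(z_j, mu_i, sigma_i)
    )


# P[i][j] scores how well token i explains frame j; this is the matrix
# the monotonic path search runs on
P = [[log_normal(z_j, mu[i], sigma[i]) for z_j in z] for i in range(L_text)]
print(len(P), len(P[0]))  # 3 6
```

This matrix has exactly the shape expected by the DP routine from the previous section, so the most likely monotonic alignment falls out of the same maximum-sum-path search.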
<p>It is worth noting that MAS is a generic alignment search algorithm that is independent of the flow-based model design. In particular, MAS was used without the flow decoder in <a href="https://arxiv.org/abs/2105.06337">Grad-TTS</a>. Popov et al. proposed using mel-spectrogram frames directly to measure the probability score given the mean and variance predicted from text. In other words, instead of using $\mathbf{z}$, the mel-spectrogram frames $\mathbf{y}$ were used. Grad-TTS is notable for its use of score-based generative models, which fall under the larger category of diffusion-based probabilistic models.</p>
<h2 id="glow-tts-pipeline">Glow-TTS Pipeline</h2>
<p>We can finally put flow and MAS together to summarize the overall pipeline of Glow-TTS.</p>
<h3 id="training">Training</h3>
<p>Given a pair of text and mel-spectrogram $(T, \mathbf{y})$, we feed $T$ into the text encoder $f_\text{text}$ and the mel-spectrogram $\mathbf{y}$ into the flow-based decoder $f_\text{mel}$ to obtain $f_\text{mel}(\mathbf{y}) \in \mathbb{R}^{D \times L_\text{mel}}$ and $f_\text{text}(T) = (\mu, \sigma)$, where $\mu, \sigma \in \mathbb{R}^{D \times L_\text{text}}$ and $D$ denotes the size of the embedding. We can then use MAS to obtain the most likely monotonic alignment $A^\star \in \mathbb{R}^{L_\text{text} \times L_\text{mel}}$. Since Glow-TTS is a flow-based model that enables direct computation of the likelihood, the model is simply trained to maximize the log-likelihood given by the sum of the entries of the log-likelihood score matrix $P$ selected by the alignment; $A^\star$ can intuitively be understood as a binary mask used to index $P$. Schematically, the final log-likelihood can be written as $l = \sum_{i = 1}^{L_\text{text}} \sum_{j = 1}^{L_\text{mel}}(P \odot A^\star)_{ij}$, where $\odot$ denotes the Hadamard product, i.e., the element-wise product of matrices. Since optimization problems in modern machine learning are typically framed as minimization, in practice we minimize the negative log-likelihood.</p>
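<p>Schematically, the training objective reduces to a masked sum; the numbers below are made up purely for illustration.</p>

```python
# toy log-likelihood matrix P (3 tokens x 4 frames) and a monotonic
# alignment mask A_star, as might come out of MAS (values made up)
P = [
    [-1.0, -2.0, -5.0, -6.0],
    [-4.0, -1.5, -1.2, -3.0],
    [-7.0, -4.0, -3.5, -0.8],
]
A_star = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# l = sum of P entries selected by the alignment mask (Hadamard product)
log_likelihood = sum(
    p * a for P_row, A_row in zip(P, A_star) for p, a in zip(P_row, A_row)
)
loss = -log_likelihood  # minimize the negative log-likelihood
print(loss)  # 5.0
```

Because $A^\star$ is binary and monotonic, the loss simply sums the log-likelihood of each frame under the single token it is aligned to.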
<p>Although not discussed in the sections above, Glow-TTS requires training a small sub-model, called a duration predictor, for inference. Because we do not have access to the ground-truth mel-spectrogram during inference, we need a model that can predict the best alignment $A^\star$ purely from text. This task is carried out by the duration predictor, which accepts $T$ as input and is trained to minimize the L2 distance between its predicted alignment $\hat{A}$ and the actual $A^\star$ discovered by MAS.</p>
<h3 id="inference">Inference</h3>
<p>During inference, the model has to output a predicted mel-spectrogram $\hat{\mathbf{y}}$ conditioned on the input text $T$. First, we use the learned text encoder to obtain the mean and variance, i.e., $f_\text{text}(T) = (\mu, \sigma)$. Then, we use the duration predictor to obtain predicted per-token durations. We can then sample from the $\mathcal{N}(\mu, \sigma^2)$ distribution according to those durations. Continuing the earlier example of <code class="language-plaintext highlighter-rouge">T = ["h", "e", "l", "l", "o"]</code>, let’s say that the predicted durations are <code class="language-plaintext highlighter-rouge">d_hat = [1, 3, 2, 1, 1]</code>. This means that we sample from $\mathcal{N}(\mu_\text{h}, \sigma_\text{h}^2)$ once, from $\mathcal{N}(\mu_\text{e}, \sigma_\text{e}^2)$ three times, and so on. By concatenating the results of sampling, we obtain $\hat{\mathbf{z}} \in \mathbb{R}^{D \times \hat{L}_\text{mel}}$, where $\hat{L}_\text{mel}$ denotes the number of predicted mel-spectrogram frames, which is effectively <code class="language-plaintext highlighter-rouge">sum(d_hat)</code>. Once we have $\hat{\mathbf{z}}$, we finally use the flow decoder to invert it into the mel-spectrogram space, i.e., $f_\text{mel}^{-1}(\hat{\mathbf{z}}) = \hat{\mathbf{y}}$.</p>
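<p>The sampling-by-duration step can be sketched as follows. The statistics and the name <code class="language-plaintext highlighter-rouge">d_hat</code> (the per-token durations predicted by the duration predictor) are hypothetical illustrations, not values from the paper.</p>

```python
import random

random.seed(0)

# toy per-token statistics from the text encoder (made-up numbers); in the
# real model these are D-dimensional vectors per input token
mu = {"h": [0.0, 0.1], "e": [1.0, -0.2], "l": [0.5, 0.5], "o": [-1.0, 0.3]}
sigma = {"h": [1.0, 1.0], "e": [0.5, 0.5], "l": [0.2, 0.2], "o": [1.0, 1.0]}

tokens = ["h", "e", "l", "l", "o"]
d_hat = [1, 3, 2, 1, 1]  # predicted number of frames per token

# sample d_hat[i] latent frames from N(mu_i, diag(sigma_i^2)) per token
# and concatenate them along the time axis
z_hat = []
for token, duration in zip(tokens, d_hat):
    for _ in range(duration):
        z_hat.append([random.gauss(m, s) for m, s in zip(mu[token], sigma[token])])

assert len(z_hat) == sum(d_hat)  # predicted mel length is 8 frames
# z_hat would then be inverted through the flow decoder: y_hat = f_mel^{-1}(z_hat)
```

Each token's distribution is sampled as many times as its predicted duration, which is exactly how the predicted durations replace the MAS alignment at inference time.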
<p>Sample diversity is an important concern in neural TTS. Just as humans can read a single sentence in many different ways by varying tone, pitch, and timbre, we ideally want a TTS model to be able to produce diverse samples. One way to achieve this in Glow-TTS is by varying the temperature parameter during sampling. In practice, sampling is performed through the reparametrization trick:</p>
\[\epsilon \sim \mathcal{N}(0, 1) \\
\mathbf{z} = \mu + T \cdot \epsilon \odot \sigma,\]
<p>where $T$ denotes the sampling temperature. Through listening tests and pitch contours, Kim et al. show that varying the temperature achieves diversity among samples produced by Glow-TTS.</p>
<h3 id="results">Results</h3>
<p>A marked advantage of Glow-TTS is that it is a parallel TTS model. This contrasts with existing autoregressive baselines, such as Tacotron 2. While autoregressive models require an iterative loop to condition the output of the current timestep on that of the previous timestep, parallel models produce the entire output in a single forward pass. In other words, the inference time of parallel models is nearly constant, whereas the runtime of autoregressive models scales linearly with the length of the input sequence. This is clear in the comparison figure taken from the Glow-TTS paper.</p>
<p><img src="https://media.arxiv-vanity.com/render-output/5100370/x6.png" /></p>
<p>Another pitfall of autoregressive models is that errors can accumulate throughout the iterative loop. If the model misidentifies an alignment between speech and text early in the input sequence, later alignments will also likely be incorrect. In parallel models, such error accumulation is not possible, since there is no iterative loop to begin with. Moreover, alignments found by Glow-TTS are made even more robust by the design of MAS, which systematically considers only those alignments that satisfy the monotonicity inductive bias. In the figure below, also taken directly from the Glow-TTS paper, Kim et al. show that Glow-TTS maintains a consistent character error rate, while that of Tacotron 2 increases proportionally with the length of the input sequence.</p>
<p><img src="https://media.arxiv-vanity.com/render-output/5100370/x9.png" /></p>
<p>Glow-TTS achieves competitive results on mean opinion score (MOS) listening tests. MOS tests are typically performed by randomly sampling a number of listeners and asking them to rate an audio sample on a scale of 1 to 5, where higher is better.</p>
<p>In the results table shown below, GT (ground truth) is rated most highly at 4.54. WaveGlow is a neural vocoder that transforms mel-spectrograms to waveforms. GT (Mel + WaveGlow) received 4.19, marginally below the GT waveform score, because using a neural vocoder necessarily introduces quality degradation and artifacts. Since even the best acoustic feature generator could not produce a mel-spectrogram that sounds more natural than a human recording, 4.19 can be considered the practical upper bound for any TTS model paired with WaveGlow. Glow-TTS comes close, scoring approximately 4 across various temperature settings. While the difference of 0.19 certainly suggests room for improvement, it is worth mentioning that Glow-TTS outperforms Tacotron 2, which had long been considered the competitive state-of-the-art TTS model.</p>
<p><img src="https://d3i71xaburhd42.cloudfront.net/4a028532ec2bd4930c5cb228aabae64f28def55f/6-Table1-1.png" /></p>
<h3 id="future-direction">Future Direction</h3>
<p>An emerging trend in neural TTS literature is end-to-end TTS modeling. Instead of the traditional two-stage pipeline composed of an acoustic feature generator and a neural vocoder, end-to-end models produce raw waveforms directly from text without going through an intermediate mel-spectral representation. One prime example is <a href="https://arxiv.org/abs/2106.06103">VITS</a>, an end-to-end speech model developed by the authors of Glow-TTS and published at ICML 2021. VITS is a combination of Glow-TTS and <a href="https://arxiv.org/abs/2010.05646">HiFi-GAN</a>, a neural vocoder. VITS uses largely the same MAS algorithm as Glow-TTS, and uses a variational autoencoding training scheme to combine the feature generator and the neural vocoder.</p>
<p>A benefit of end-to-end modeling is that the model is relieved of the mel-spectral information bottleneck. The mel-spectrogram is a hand-crafted representation, defined according to human knowledge of auditory perception. However, the spirit of deep learning is that no manual feature engineering should be necessary, provided sufficient data and modeling capacity. End-to-end models are free to choose their own intermediate representation that best accomplishes the task of synthesizing natural-sounding audio. Indeed, VITS outperforms Tacotron and Glow-TTS by considerable margins and almost matches ground-truth MOS ratings. This is certainly an exciting development, and we can expect more lines of work in this direction.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Glow-TTS is a flow-based neural TTS model that demonstrated how the invertibility of flows can be leveraged to produce mel-spectrograms from text-derived latent representations. By projecting mel-spectrograms and text into a common latent space and using MAS and maximum likelihood-based training, Glow-TTS is able to learn robust, hard monotonic alignments between speech and text. Like Tacotron 2, Glow-TTS is now considered a competitive baseline and is referenced in recent literature.</p>
<p>Neural TTS has seen exciting developments over the past few years, including general text-to-speech, voice cloning, singing voice synthesis, and prosody transfer. Moreover, given the rapid pace of development in other fields, such as natural language processing, automatic speech recognition, and multimodal modeling, we could see more interesting models that combine different approaches and modalities to perform a wide array of complex tasks. If anything remains clear, it is that we are living at an exciting time in the era of machine learning, and that the next few years will continue to see breakthroughs and innovations that will awe and surprise us, just like people a few decades ago would marvel at the simplest words:</p>
<p>“Turn right at 130 Prospect Street.”</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>While there are variations of normalizing flows, such as continuous flows or neural ODEs, for the sake of simplicity, we only consider discrete normalizing flows. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>In practice, most TTS models, including Glow-TTS, use phonemes as input instead of characters of text. We illustrate the example using characters for simplicity. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Jake TaeReflections and Expectations2021-12-27T00:00:00+00:002021-12-27T00:00:00+00:00https://jaketae.github.io/blog/2021<p>Last year, I wrote a <a href="https://jaketae.github.io/blog/2021/">blog post reflecting on the year 2020</a>. Re-reading what I had written then was surprisingly insightful, particularly because I could see how life had changed in some ways and remained unchanged in others. I decided to continue the tradition this year in the hopes of presenting my year-later self with the same joy and delight of reading a memoir of similar kind.</p>
<p>2021 was, in some ways, very similar to 2020. Despite the development and proliferation of vaccines, COVID-19 raged on, morphing into a new variant every few months. Masks and social distancing are now deeply embedded into our daily lives. Although booster shots and pill-type medications might change the dynamics of the pandemic, I personally think COVID is here to stay, at least for the foreseeable future.</p>
<p>After being discharged from the army in March of 2021, I spent roughly 6 months working as an intern at <a href="https://neosapience.com">Neosapience</a>, a Korean startup specializing in voice-over services and metaverse characters. This was also when I left <a href="https://www.rerent.co">ReRent</a>, a hospitality startup that I was fortunate enough to have worked for since the summer of 2020. ReRent immensely helped me learn and grow as a software developer, versed in <code class="language-plaintext highlighter-rouge">git</code> and GitHub, general web development, and Django, which has since become my favorite Python backend framework. It is also where I met valuable teammates, some of whom I met in person at Yale.</p>
<p>The transition from ReRent to Neosapience was a lot more than just a change of jobs. At Neosapience, I worked on machine learning research, an art of its own, entirely different from backend web development. Specifically, I was tasked with developing a singing voice synthesis model that, given lyrics and melodies, could “sing.” I still remember the frustration I felt when I was first trying to reproduce a reference paper I was provided as a baseline. There were parts of the paper that were ambiguous. The fact that it was a GAN-based model certainly did not help. I reached out to the authors in the hopes of gaining clarity, but received no response. Although I extrapolated parts of the model and trained it for a few days, the model only produced barely audible mumbles that could not have been further from the act of singing. I learned that ML was hard.</p>
<p>Thankfully, I was fortunate enough to have had more experienced co-workers as mentors who provided valuable pieces of advice. One of them suggested that I design a model of my own instead of blindly trying to reproduce the paper. As a demo of sorts, he showed me that a simple CNN model could sing better than the GAN I was trying to reproduce, with just a few minutes of training. Inspired by his progress, I began designing my own modules to experiment with a host of different architectures: CNNs, RNNs, transformers, and combinations thereof. I also explored various famous CNN architectures, such as InceptionNet and ResNeXT in search of inspiration and ideas.</p>
<p>Unexpectedly, the biggest success came from a very experimental model that was a direct adaptation of <a href="https://arxiv.org/abs/2105.01601">MLP-Mixer</a>, an architecture composed entirely of multi-layer perceptrons, or <code class="language-plaintext highlighter-rouge">nn.Linear</code> layers in PyTorch. This was a paper I presented during one of our weekly paper-reading meetings. Although the results produced by the final model still contained audible artifacts, we saw novelty in the fact that it was the first voice synthesis model exclusively composed of linear layers. This project culminated in my first ever publication <a href="https://arxiv.org/abs/2106.07886">MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis</a> in <a href="https://2021.ieeemlsp.org">IEEE Machine Learning for Signal Processing workshop</a>, now available on <a href="https://ieeexplore.ieee.org/document/9596184">IEEE Xplore</a>. By the end of my internship, I felt a lot more comfortable with various ML concepts and their implementations. This was also when I was involved with Hugging Face’s Flax/JAX community week event where my teammates and I developed <a href="https://github.com/jaketae/koclip">KoCLIP</a>, as well as <a href="https://bigscience.huggingface.co">BigScience</a>, a huge project by Hugging Face to reproduce a GPT-3-sized language model.</p>
<p>I came back to Yale with the explicit intent of majoring in Computer Science and Mathematics. While this was not a trivial decision, it was very clear and obvious to me that this was the academic path I wanted to pursue. I took CPSC 223, which is Yale’s signature data structures course taught in… barebones C. <code class="language-plaintext highlighter-rouge">malloc</code> and <code class="language-plaintext highlighter-rouge">free</code> are probably the functions I used the most this year, perhaps with the exception of <code class="language-plaintext highlighter-rouge">print</code>/<code class="language-plaintext highlighter-rouge">printf</code>s I used for lazy debugging. On top of CS classes, I also continued my involvement with ML in a few ways. For one thing, I co-authored my second paper, <a href="https://arxiv.org/abs/2110.02584">EdiTTS: Score-based Editing for Controllable Text-to-Speech</a>, with a co-worker at Neosapience. This was the first project in which I used Amazon Mechanical Turk for MOS measurements. I’m still waiting on the final decision from a conference to which I submitted this paper, but I’m happy about how it came out regardless.</p>
<p>More importantly, I was extremely fortunate to be given the opportunity to work as a software engineering intern at Hugging Face. This was an unbelievable achievement for me that I knew I did not deserve. As a self-taught newcomer and student to the field of ML, I only dreamed about working at Hugging Face when I was first learning about transformers. I still have not produced much output at HF largely due to the fact that my internship was part-time and very low time commitment-wise, but I’m still excited for the month of January, which is when I will be dedicating myself full time to Hugging Face and BigScience. I would also like to express gratitude to the engineer at Hugging Face who referred me to this position, and whom I now consider a mentor, <a href="https://twitter.com/stasbekman">Stas Bekman</a>.</p>
<p>This semester was perhaps the hardest one yet at Yale. All the classes I took required a lot of effort, time, or both. Admittedly, to fulfill my distribution requirement, I went out of my way and took HIST 271: European Intellectual History since Nietzsche, where I learned a ton about philosophy, from the Enlightenment all the way up to post-Modernism. I also enrolled in ASTR 110: Planets and Stars, which I frankly took for an easy science credit, only to realize that weekly problem sets took up more time than I had anticipated. MATH 241: Probability Theory was easy at first, but ramped up quite quickly at the end of the semester, to the point that I was floundering about during finals week. Nonetheless, I’m glad that the semester is over, and that I came out of it feeling more learned and knowledgeable than I was five months ago.</p>
<p>2021 was surely a roller coaster ride. It was a fruitful one, but it is also a miracle how it turned out the way it did. With experience, memories, and gratefulness at heart, I cannot wait to see what 2022 has in store.</p>Jake TaeScore Matching2021-12-26T00:00:00+00:002021-12-26T00:00:00+00:00https://jaketae.github.io/study/sliced-score-matching<p>Recently, I’ve heard a lot about score-based networks. In this post, I will attempt to provide a high-level overview of what scores are and how the concept of score matching gives rise to a family of likelihood-based generative models. This post is heavily adapted from <a href="https://yang-song.github.io/blog/2019/ssm/">Yang Song’s post on sliced score matching</a>.</p>
<h1 id="probability-model">Probability Model</h1>
<p>Given a parametrized real-valued function $f_\theta(\mathbf{x})$, we can derive a probability model $p_\theta(\mathbf{x})$ by introducing a normalizing constant $Z_\theta$.</p>
\[p_\theta (\mathbf{x}) = \frac{e^{- f_\theta (\mathbf{x})}}{Z_\theta} \\
Z_\theta = \int e^{- f_\theta (\mathbf{x})} \, d \mathbf{x}.\]
<p>In practice, $f_\theta$ is often an energy-based model (EBM).</p>
<p>We can then define the likelihood function as follows:</p>
\[\log p_\theta (\mathbf{x}) = - f_\theta (\mathbf{x}) - \log Z_\theta.\]
<p>However, one glaring problem with this formulation is that $Z_\theta$ is often intractable. Score-matching presents an elegant solution to bypass this problem.</p>
<h1 id="score-matching">Score-Matching</h1>
<p>To eliminate the intractable term, we consider the score, which is defined as the gradient of the log likelihood with respect to the random variable $\mathbf{x}$. Note that we are not taking the gradient with respect to the parameter $\theta$, which is typically the object of interest in processes such as MLE.</p>
\[\nabla_\mathbf{x} \log p_\theta (\mathbf{x}) = - \nabla_\mathbf{x} f_\theta (\mathbf{x}).\]
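<p>Since $Z_\theta$ does not depend on $\mathbf{x}$, it vanishes from the score entirely. As a quick sanity check (a minimal sketch, not from the original post), consider a standard Gaussian with energy $f(x) = x^2 / 2$; its score is $-x$, which autograd recovers directly:</p>

```python
import torch

# Energy of a standard (unnormalized) Gaussian: f(x) = x^2 / 2
def energy(x):
    return 0.5 * x ** 2

x = torch.tensor([1.5], requires_grad=True)
(grad,) = torch.autograd.grad(energy(x).sum(), x)
score = -grad  # score = grad_x log p(x) = -grad_x f(x); Z_theta drops out
assert torch.allclose(score, torch.tensor([-1.5]))
```

The normalizing constant never has to be computed, which is exactly why score-based methods are attractive for energy-based models.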
<p>The goal of score-matching, then, is to minimize the difference between $p_\text{data}$ and $p_\theta$ by minimizing the Fisher divergence between the two. For the sake of simplicity, we consider the 1-D case.</p>
\[\begin{align}
&\frac12 \mathbb{E}_{p_\text{data}} \lVert \nabla_x \log p_\text{data} (x) - \nabla_x \log p_\theta (x) \rVert^2_2 \\
&= \frac12 \int p_\text{data} (x) \left( \nabla_x \log p_\text{data} (x) - \nabla_x \log p_\theta (x) \right)^2 \, dx \\
&= \frac12 \int p_\text{data}(x) (\nabla_x \log p_\text{data}(x))^2 \, dx + \frac12 \int p_\text{data} (x) (\nabla_x \log p_\theta (x))^2 \, dx \\
& - \int p_\text{data}(x) \nabla_x \log p_\text{data}(x) \nabla_x \log p_\theta (x) \, dx .
\end{align}\]
<p>The equalities follow from the integral definition of expectation and expanding the square. Note that the first term is simply a constant with respect to $\theta$ and can be ignored during optimization.</p>
<p>Applying integration by parts on the last term,</p>
\[\begin{align}
& \int p_\text{data}(x) \nabla_x \log p_\text{data}(x) \nabla_x \log p_\theta (x) \, dx \\
&= \int p_\text{data}(x) \frac{\nabla_x p_\text{data}(x)}{p_\text{data} (x)} \nabla_x \log p_\theta (x) \, dx \\
&= \int \nabla_x \log p_\theta (x) \nabla_x p_\text{data} (x) \, dx \\
&= p_\text{data}(x) \nabla_x \log p_\theta(x) \bigg|^\infty_{- \infty} - \int p_\text{data}(x) \nabla^2_x \log p_\theta (x) \, dx \\
& \approx - \mathbb{E}_{p_\text{data}}[\nabla^2_x \log p_\theta (x)].
\end{align}\]
<p>The boundary term vanishes under the mild assumption that $p_\text{data}(x) \to 0$ as $x \to \pm \infty$. Putting all terms together,</p>
\[\begin{align}
&\frac12 \mathbb{E}_{p_\text{data}} \lVert \nabla_x \log p_\text{data} (x) - \nabla_x \log p_\theta (x) \rVert^2_2 \\
&= \mathbb{E}_{p_\text{data}}[\nabla^2_x \log p_\theta (x)] + \frac12 \mathbb{E}_{p_\text{data}} [(\nabla_x \log p_\theta (x))^2] + \text{const.} \\
&= \mathbb{E}_{p_\text{data}}[\nabla^2_x \log p_\theta (x) + \frac12 (\nabla_x \log p_\theta (x))^2] + \text{const.}
\end{align}\]
<p>We can easily extend this into a multidimensional context, the result of which is</p>
\[\mathbb{E}_{p_\text{data}} \left[\text{tr}(\nabla^2_\mathbf{x} \log p_\theta (\mathbf{x})) + \frac12 \lVert \nabla_\mathbf{x} \log p_\theta (\mathbf{x}) \rVert^2_2 \right] + \text{const.}\]
<h1 id="sliced-score-matching">Sliced Score-Matching</h1>
<p>We are specifically interested in instances where $f_\theta$ is parametrized as a neural network. Recall that</p>
\[\nabla_\mathbf{x} \log p_\theta (\mathbf{x}) = - \nabla_\mathbf{x} f_\theta (\mathbf{x}).\]
<p>Therefore, we can rewrite the score-matching objective as</p>
\[\mathbb{E}_{p_\text{data}} \left[- \text{tr}(\nabla^2_\mathbf{x} f_\theta (\mathbf{x})) + \frac12 \lVert \nabla_\mathbf{x} f_\theta (\mathbf{x}) \rVert^2_2 \right] + \text{const}.\]
<p>While the first-order gradient can be obtained via a single backpropagation pass, the Hessian trace $\text{tr}(\nabla^2_\mathbf{x} f_\theta (\mathbf{x}))$ is very computationally costly for high-dimensional data. To circumvent this problem, the authors propose random projections, which reduce the vector-valued scores down to scalar fields. Quoting Yang Song:</p>
<blockquote>
<p>We propose <strong>sliced score matching</strong> to greatly scale up the computation of score matching. The motivating idea is that one dimensional data distribution is much easier to estimate for score matching. We propose to project the scores onto random directions, such that the vector fields of scores of the data and model distribution become scalar fields. We then compare the scalar fields to determine how far the model distribution is from the data distribution. It is clear to see that the two vector fields are equivalent if and only if their scalar fields corresponding to projections onto all directions are the same.</p>
</blockquote>
<p>The random projection version of Fisher divergence is</p>
\[\frac{1}{2}\mathbb{E}_{p_\text{data}}[(\mathbf{v}^\intercal \nabla_\mathbf{x} \log p_\text{data}(\mathbf{x}) - \mathbf{v}^\intercal \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) )^2].\]
<p>Intuitively, the equation forces the two distributions to get closer along some random projection direction $\mathbf{v}$. Because the objective is an expectation over random projections, optimizing this quantity brings $p_\theta$ closer to the real data distribution.</p>
<p>The sliced score-matching objective under this revised Fisher divergence is</p>
\[\mathbb{E}_{p_\text{data}}\bigg[\mathbf{v}^\intercal \nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x})\mathbf{v} + \frac{1}{2} (\mathbf{v}^\intercal\nabla_\mathbf{x} \log p_\theta(\mathbf{x}))^2 \bigg] + \text{const}.\]
<p>The problem has now been reduced to a computationally tractable form.</p>
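<p>To make this concrete, here is a minimal PyTorch sketch of the sliced objective, where $p_\theta(\mathbf{x}) \propto e^{-f_\theta(\mathbf{x})}$ so that the score is $-\nabla_\mathbf{x} f_\theta(\mathbf{x})$. The function names and the toy quadratic energy are illustrative, not from the paper:</p>

```python
import torch

def ssm_loss(energy_fn, x):
    # Single-projection estimate of the sliced score matching objective,
    # with p(x) proportional to exp(-f(x)) so the score is -grad_x f(x).
    x = x.clone().requires_grad_(True)
    v = torch.randn_like(x)  # random projection directions
    f = energy_fn(x).sum()
    score = -torch.autograd.grad(f, x, create_graph=True)[0]
    # v^T (grad_x score) v, obtained with a second backward pass
    hvp = torch.autograd.grad((score * v).sum(), x, create_graph=True)[0]
    term1 = (hvp * v).sum(dim=-1)
    term2 = 0.5 * (score * v).sum(dim=-1) ** 2
    return (term1 + term2).mean()

# Toy quadratic energy, i.e. a standard Gaussian model
energy = lambda x: 0.5 * (x ** 2).sum(dim=-1)
loss = ssm_loss(energy, torch.randn(8, 4))
```

In practice, one would average over several projection vectors per sample and backpropagate through <code class="language-plaintext highlighter-rouge">loss</code> to update the parameters of a neural energy model.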
<p><em>This post was originally written in July, but polished into its current final form in December. If you spot any rough edges or details I missed, please feel free to reach out to me with corrections.</em></p>Jake TaeFlow Models2021-06-21T00:00:00+00:002021-06-21T00:00:00+00:00https://jaketae.github.io/study/flow<p>In this post, we will take a look at Flow models, which I’ve been obsessed with while reading papers like <a href="https://arxiv.org/abs/2005.11129">Glow-TTS</a> and <a href="https://arxiv.org/abs/2106.06103">VITS</a>. This post is heavily based on <a href="https://www.youtube.com/watch?v=JBb5sSC0JoY">this lecture video</a> by Pieter Abbeel, as well as the accompanying problem sets for the course, available <a href="https://github.com/rll/deepul/blob/master/homeworks/solutions/hw2_solutions.ipynb">here</a>.</p>
<h1 id="motivation">Motivation</h1>
<p>We want a model that satisfies the following:</p>
<ul>
<li>Simplifies complex, intractable distributions</li>
<li>Enables easy sampling and generation</li>
</ul>
<p>The two conditions are somewhat related in the sense that once you have a function (or a neural network that approximates such a function) that maps complex distributions to a tractable latent space, sampling can be performed immediately given that the mapping function is invertible. Invertibility is not something that can be easily assumed in deep learning and thus calls for some specific architectural decisions. Nonetheless, I find this formulation highly compelling and intuitive.</p>
<h1 id="change-of-variables">Change of Variables</h1>
<p>To fully understand the mechanics of flow, we need to first revisit the change of variables formula. Let $X$ denote a random variable, and $f_\theta$, some monotonic, invertible function that maps $X$ to a latent space $Z$. In the simplest case, $f_\theta$ might be the CDF of $X$, and $Z$ might be a uniform distribution $U(0, 1)$. More generally, we have</p>
\[z = f_\theta(x)\]
<p>Note that there exists a one-to-one correspondence between the two random variables, which is important to guarantee invertibility.</p>
<p>Let $p(\cdot)$ denote the PDF of some random variable. Naively, one might think that</p>
\[p(x) = p(z)\]
<p>However, this fails to take into account the fact that a small interval around $x$ may be stretched or compressed when mapped into $z$ space. Hence, we need a correcting factor, which is the derivative of $z$ w.r.t. $x$.</p>
\[p(x) = p(z) \left\lvert \frac{\partial f_\theta(x)}{\partial x} \right\rvert
\tag{1}\]
<p>More formally, we can see this by considering the derivative of the CDF, which we will denote as $P(\cdot)$.</p>
\[\begin{align}
P(Z \leq z)
&= P(f_\theta(X) \leq z) \\
&= P(X \leq f_\theta^{-1}(z))
\end{align}
\tag{2}\]
<p>(2) holds if $f$ is a monotonically increasing function. If it is a monotonically decreasing function, then</p>
\[P(Z \leq z) = 1 - P(X \leq f_\theta^{-1}(z))\]
<p>Differentiating both sides of the equation with respect to $z$, we get</p>
\[\begin{align}
p(z)
&= \pm \, p(f_\theta^{-1}(z)) \frac{\partial f_\theta^{-1}(z)}{\partial z} \\
&= p(x) \left\lvert \frac{\partial x}{\partial z} \right\rvert \\
\end{align}
\tag{3}\]
<p>Rearranging (3) yields (1).</p>
<p>In a multi-dimensional context, the absolute value of the derivative is replaced by the absolute value of the determinant of the Jacobian matrix.</p>
\[p(x) = p(z) \frac{\text{vol}(dz)}{\text{vol}(dx)} = p(z) \left\lvert \text{det} \frac{dz}{dx} \right\rvert\]
<p>We can understand the determinant of a matrix as measuring the factor by which it scales volume when viewed as a linear transformation of coordinates. In this sense, it is the multivariate analogue of the absolute slope in the one-dimensional case.</p>
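<p>We can verify (1) numerically with an illustrative affine map (a minimal sketch, not from the original post). The map $z = 2x + 1$ sends $X \sim \mathcal{N}(0, 1)$ to $Z \sim \mathcal{N}(1, 4)$, and the correcting factor is $|dz/dx| = 2$:</p>

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    # Density of a univariate normal distribution
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# z = f(x) = 2x + 1 maps N(0, 1) to N(1, 4); |df/dx| = 2
for x in [-1.3, 0.0, 0.7]:
    z = 2 * x + 1
    lhs = normal_pdf(x)                          # p(x)
    rhs = normal_pdf(z, mean=1.0, std=2.0) * 2   # p(z) |dz/dx|
    assert abs(lhs - rhs) < 1e-12
```

Without the factor of 2, the two densities would disagree, which is exactly the naive mistake described above.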
<h1 id="training">Training</h1>
<p>Flow is nothing more than a neural network that models $f_\theta$. It takes a random variable living in some complex, intractable space and maps it to a tractable distribution. In the case of normalizing flows, the target latent distribution is a normal distribution.</p>
<p>As is the case with any likelihood model, the goal is to fit a model that maximizes the log likelihood of data. Therefore, the objective is</p>
\[\max \sum_i \log p(x_i) \tag{4}\]
<p>We can substitute the likelihood with an expression using the latent transformed variable in (1). Then, (4) is equivalent to</p>
\[\max \sum_i \log p(f_\theta(x_i)) + \log \, \left\lvert \text{det} \frac{d f_\theta(x_i)}{d x} \right\rvert\]
<p>We train the flow model to minimize negative log likelihood, or equivalently, maximize log likelihood.</p>
<p>A few remarks:</p>
<ul>
<li>Notice that there is a jacobian sitting in the log likelihood term. This means that the flow model should model a function whose jacobian is easy to compute, which is usually not the case.</li>
<li>In a normalizing flow, $f_\theta$ will essentially try to place as many data points as possible near the mean of the Gaussian distribution, where the density is highest.</li>
</ul>
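<p>Concretely, assuming the flow returns both $z = f_\theta(x)$ and the per-sample log-determinant, the negative log likelihood under a standard normal prior can be sketched as follows (a minimal illustration with stand-in tensors in place of a real flow):</p>

```python
import math
import torch

def flow_nll(z, log_det):
    # Negative log-likelihood of x under a standard normal prior on z,
    # where z = f_theta(x) and log_det = log|det df_theta/dx| per sample.
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=-1)
    return -(log_pz + log_det).mean()

# Stand-ins for a flow's outputs on a batch of 4 two-dimensional samples
z = torch.randn(4, 2)
log_det = torch.randn(4)
loss = flow_nll(z, log_det)  # minimize this to maximize log likelihood
```

Minimizing this loss is exactly the maximum likelihood objective of (4), rewritten in terms of the latent variable and the Jacobian determinant.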
<h1 id="perks-of-flow">Perks of Flow</h1>
<p>Up to this point, you might think that the flow model is a very intricate machinery that comes with many constraints, e.g. invertibility and an easily computable Jacobian. Nonetheless, I think it has some clear advantages in two aspects.</p>
<h2 id="sampling">Sampling</h2>
<p>To sample from a flow model, all we have to do is sample from the latent prior distribution, such as a standard Gaussian, then simply pass the sample through the inverse flow.</p>
<h2 id="combinations">Combinations</h2>
<p>One salient characteristic of a flow is that a combination of flows is also a flow. If you have a set of invertible, differentiable functions, a stack of such functions will also be differentiable and invertible.</p>
\[z = f_k \circ f_{k - 1} \circ \cdots \circ f_1(x) \\
x = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_k^{-1} (z)\]
<p>The capacity of a single flow layer is limited, but a deep stack gives the model enough expressive power to handle highly complex data distributions.</p>
<h1 id="model-architecture">Model Architecture</h1>
<p>Flow models must be invertible, which leads to some important considerations when motivating their architecture. For instance, we cannot use ReLU activations since they violate the invertibility requirement. Moreover, the Jacobian should be easy to compute.</p>
<h2 id="inversion">Inversion</h2>
<p>The beautiful part of flow is that there is a simple way to resolve both conundrums: affine coupling layers. Let $d$ denote the dimensionality of the embedding space on which we are applying a flow model. Then, the affine coupling layer can schematically be written as</p>
\[z_{1:d/2} = x_{1:d/2} \\
\begin{align}
z_{d/2:d}
&= x_{d/2:d} \odot s_\theta(x_{1:d/2}) + t_\theta(x_{1:d/2}) \\
&= x_{d/2:d} \odot s_\theta(z_{1:d/2}) + t_\theta(z_{1:d/2})
\end{align}
\tag{5}\]
<p>In plain language, we can consider $f_\theta$ as a special transformation in which the top half of $z$ is just copied from $x$ without modification. The bottom half undergoes an affine transformation, where the weights and biases are computed from the top half of $x$. We can easily check that this transformation is indeed invertible:</p>
\[x_{1:d/2} = z_{1:d/2} \\
x_{d/2:d} = \left( z_{d/2:d} - t_\theta(z_{1:d/2}) \right) \oslash s_\theta(z_{1:d/2})
\tag{6}\]
<p>Affine coupling layers are invertible only because the top half of $z$ is equal to that of $x$. This demystifies the copying operation in (5), which may have appeared somewhat unintuitive and awkward initially.</p>
<p>In practice, it appears that flow layers take a slightly more complicated form than the conceptual architecture detailed above. For example, <a href="https://arxiv.org/abs/1605.08803">Real NVP</a> proposes the following schema.</p>
\[z_{1:d/2} = x_{1:d/2} \\
h = a \times \text{tanh}(s_\theta(x_{1:d/2})) + b \\
z_{d/2:d} = \text{exp}(h) \times x_{d/2:d} + g_\theta(x_{1:d/2})\]
<p>where $a$ and $b$ are learned parameters, and $s_\theta$ and $g_\theta$ are learned neural networks, such as multi-layer perceptrons.</p>
<h2 id="jacobian">Jacobian</h2>
<p>Earlier, we noted that the determinant of the jacobian matrix must be easy to compute. This is a non-trivial constraint that does not hold true in many cases.</p>
<p>Fortunately, it turns out that the Jacobian is very easy to compute given an affine coupling layer. We can somewhat intuit this by considering the copy-and-paste operation that is applied to the top half of the input. Given this operation, we can see that the upper left quadrant of the Jacobian will simply be an identity matrix.</p>
\[\begin{align}
\frac{\partial z}{\partial x}
&= \begin{pmatrix} \frac{\partial z_{1:d/2}}{\partial x_{1:d/2}} & \frac{\partial z_{1:d/2}}{\partial x_{d/2:d}} \\ \frac{\partial z_{d/2:d}}{\partial x_{1:d/2}} & \frac{\partial z_{d/2:d}}{\partial x_{d/2:d}} \end{pmatrix} \\
&= \begin{pmatrix} I & 0 \\ \frac{\partial z_{d/2:d}}{\partial x_{1:d/2}} & \text{diag}(s_\theta(x_{1:d/2})) \end{pmatrix}
\end{align}\]
<p>Although there are still complicated terms in the lower left block of the Jacobian, we do not have to consider them to compute the determinant: the determinant of a lower triangular matrix is simply the product of its diagonal entries. The determinant of the Jacobian therefore collapses to the product of the entries of $\text{diag}(s_\theta(x_{1:d/2}))$ in the lower right block. Hence, we see how the affine coupling layer satisfies both the invertibility and the Jacobian determinant requirements.</p>
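<p>We can check this numerically on a toy coupling map, using illustrative (not learned) choices for the scale and translation functions:</p>

```python
import torch

# Toy coupling map on R^2: z1 = x1, z2 = x2 * s(x1) + t(x1),
# with illustrative choices s(x1) = exp(x1) and t(x1) = sin(x1)
def coupling(x):
    x1, x2 = x[0], x[1]
    return torch.stack([x1, x2 * torch.exp(x1) + torch.sin(x1)])

x = torch.tensor([0.3, -1.2])
J = torch.autograd.functional.jacobian(coupling, x)
# J is lower triangular, so det(J) = product of its diagonal = 1 * exp(x1)
assert torch.isclose(torch.det(J), torch.exp(x[0]))
```

The messy lower-left entry of <code class="language-plaintext highlighter-rouge">J</code> never enters the determinant, mirroring the argument above.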
<h1 id="implementation">Implementation</h1>
<p>This is my attempt at a simple implementation of an affine coupling layer. Although I could have combined the <code class="language-plaintext highlighter-rouge">forward()</code> and <code class="language-plaintext highlighter-rouge">inverse()</code> functions to remove duplicate lines of code, for clarity’s sake, I left them separate.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="n">nn</span>
<span class="k">class</span> <span class="nc">AffineCouplingLayer</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">):</span>
<span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
<span class="n">half_size</span><span class="p">,</span> <span class="n">remainder</span> <span class="o">=</span> <span class="nb">divmod</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
        <span class="k">assert</span> <span class="n">remainder</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="sa">f</span><span class="s">"Expected `hidden_size` to be even, but received </span><span class="si">{</span><span class="n">hidden_size</span><span class="si">}</span><span class="s">"</span>
<span class="bp">self</span><span class="p">.</span><span class="n">fc</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">half_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">inverse</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
<span class="k">if</span> <span class="n">inverse</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">inverse</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x1</span><span class="p">,</span> <span class="n">x2</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">chunk</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">z1</span> <span class="o">=</span> <span class="n">x1</span>
<span class="n">s</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc</span><span class="p">(</span><span class="n">x1</span><span class="p">).</span><span class="n">chunk</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">z2</span> <span class="o">=</span> <span class="n">x2</span> <span class="o">*</span> <span class="n">s</span> <span class="o">+</span> <span class="n">t</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">z1</span><span class="p">,</span> <span class="n">z2</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">det</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">prod</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">).</span><span class="nb">abs</span><span class="p">()</span>
<span class="k">return</span> <span class="n">z</span><span class="p">,</span> <span class="n">det</span>
<span class="k">def</span> <span class="nf">inverse</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span>
<span class="n">z1</span><span class="p">,</span> <span class="n">z2</span> <span class="o">=</span> <span class="n">z</span><span class="p">.</span><span class="n">chunk</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">z1</span>
<span class="n">s</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc</span><span class="p">(</span><span class="n">z1</span><span class="p">).</span><span class="n">chunk</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x2</span> <span class="o">=</span> <span class="p">(</span><span class="n">z2</span> <span class="o">-</span> <span class="n">t</span><span class="p">)</span> <span class="o">/</span> <span class="n">s</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">((</span><span class="n">x1</span><span class="p">,</span> <span class="n">x2</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>
<p>This implementation is a close transcription of (5). <code class="language-plaintext highlighter-rouge">z1</code> denotes $z_{1:d/2}$; <code class="language-plaintext highlighter-rouge">z2</code>, $z_{d/2:d}$; and ditto the <code class="language-plaintext highlighter-rouge">x</code>s. The fully-connected layer <code class="language-plaintext highlighter-rouge">self.fc</code> acts as an affine transform. We condition the output <code class="language-plaintext highlighter-rouge">z2</code> on the result of the affine transform applied to <code class="language-plaintext highlighter-rouge">x1</code>. The <code class="language-plaintext highlighter-rouge">inverse()</code> method is a transcription of (6).</p>
<p>We can perform a quick sanity check on this implementation by performing a forward pass as well as an inverse pass, and verifying that inverting the output of the forward pass recovers the original input.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">hidden_size</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">half_size</span> <span class="o">=</span> <span class="n">hidden_size</span> <span class="o">//</span> <span class="mi">2</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">)</span>
<span class="n">l</span> <span class="o">=</span> <span class="n">AffineCouplingLayer</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">)</span>
<span class="n">z</span><span class="p">,</span> <span class="n">det</span> <span class="o">=</span> <span class="n">l</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">z</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torch.Size([8, 10])
</code></pre></div></div>
<p>We also get the determinants, which are scalar values: one per example. We get 8 values, matching the batch size of the example input.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">det</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torch.Size([8])
</code></pre></div></div>
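<p>As an aside, the determinant takes this simple product form because the Jacobian of a coupling layer is block-triangular: <code class="language-plaintext highlighter-rouge">z1</code> does not depend on the second half of the input at all. As a hedged sanity check, the sketch below re-implements the coupling transform in NumPy (the weights <code class="language-plaintext highlighter-rouge">W</code> and <code class="language-plaintext highlighter-rouge">b</code> are arbitrary stand-ins for <code class="language-plaintext highlighter-rouge">self.fc</code>) and compares a finite-difference Jacobian against the product of the scales.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
# Hypothetical stand-ins for the weights of self.fc: an affine map
# from the first half of the input to the scales s and shifts t.
W = rng.normal(size=(d, d // 2))
b = rng.normal(size=d)

def coupling(x):
    x1, x2 = x[: d // 2], x[d // 2 :]
    s, t = np.split(W @ x1 + b, 2)
    return np.concatenate([x1, x2 * s + t])

x = rng.normal(size=d)
s = np.split(W @ x[: d // 2] + b, 2)[0]

# Numerical Jacobian via central differences, one column per input dimension.
eps = 1e-6
J = np.stack(
    [(coupling(x + eps * e) - coupling(x - eps * e)) / (2 * eps) for e in np.eye(d)],
    axis=1,
)

# z1 does not depend on x2, so the Jacobian is block-triangular and
# its determinant reduces to the product of the scales s.
assert np.isclose(abs(np.linalg.det(J)), abs(s.prod()))
```

<p>This block-triangular structure is exactly why the layer above can return <code class="language-plaintext highlighter-rouge">s.prod(dim=-1).abs()</code> without ever materializing a Jacobian.</p>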
<p>We can check that the affine coupling layer leaves the first half of the input unchanged.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">x</span><span class="p">[:,:</span><span class="n">half_size</span><span class="p">],</span> <span class="n">z</span><span class="p">[:,:</span><span class="n">half_size</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True
</code></pre></div></div>
<p>Trivially, we can also verify that the rest of the output has been modified by the layer.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">torch</span><span class="p">.</span><span class="n">equal</span><span class="p">(</span><span class="n">x</span><span class="p">[:,</span><span class="n">half_size</span><span class="p">:],</span> <span class="n">z</span><span class="p">[:,</span><span class="n">half_size</span><span class="p">:])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>False
</code></pre></div></div>
<p>Most importantly, we can see that the layer is indeed invertible; that is, it recovers the original input given the output of the layer <code class="language-plaintext highlighter-rouge">z</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">torch</span><span class="p">.</span><span class="n">allclose</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">l</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">inverse</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>True
</code></pre></div></div>
<p>We use <code class="language-plaintext highlighter-rouge">torch.allclose()</code> instead of <code class="language-plaintext highlighter-rouge">torch.equal()</code> due to floating point errors that can cause subtle changes in values. This is merely a technicality and does not affect the conclusion that affine coupling layers are fully invertible.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this post, we discussed flow models. I personally find flow-based models extremely interesting, simply because deep neural networks are normally not something that we can invert like a simple mathematical function. After all, the precise reason why we use deep neural networks is that we want to model complex non-linear functions. Flow models seem to go against this intuition in some sense, while providing us with the tools to map highly complex data distributions to tractable priors.</p>
<p>I hope you enjoyed reading this post. Catch you up in the next one!</p>Jake TaeIn this post, we will take a look at Flow models, which I’ve been obsessed with while reading papers like Glow-TTS and VITS. This post is heavily based on this lecture video by Pieter Abbeel, as well as the accompanied problem sets for the course, available here.From ELBO to DDPM2021-05-17T00:00:00+00:002021-05-17T00:00:00+00:00https://jaketae.github.io/study/elbo<p>In this short post, we will take a look at variational lower bound, also referred to as the evidence lower bound or ELBO for short. While I have referenced ELBO in a <a href="https://jaketae.github.io/study/vae">previous blog post on VAEs</a>, the proofs and formulations presented in the post seems somewhat overly convoluted in retrospect. One might consider this a gentler, more refined recap on the topic. For the remainder of this post, I will use the terms “variational lower bound” and “ELBO” interchangeably to refer to the same concept. I was heavily inspired by <a href="https://www.youtube.com/watch?v=pStDscJh2Wo">Hugo Larochelle’s excellent lecture</a> on deep belief networks.</p>
<h1 id="concavity">Concavity</h1>
<p>One important property of the logarithm is that it is a concave function. A function $f$ is concave if it satisfies the following property:</p>
\[f\left( \sum \nolimits_i w_i x_i \right) \geq \sum \nolimits_i w_i f(x_i) \tag{1}\]
<p>In other words, if the function evaluated at a weighted average of inputs is always greater than or equal to the weighted average of the function evaluated at each input (with nonnegative weights that sum to one), the function is concave.</p>
<p>As a short detour, we discussed a similar concept in the context of variational autoencoders and Jensen’s inequality in an <a href="https://jaketae.github.io/study/vae/">earlier post</a>. In that post, I introduced the definition of convexity as follows:</p>
\[\mathbb{E}[f(x)] \geq f(\mathbb{E}[x]) \tag{2}\]
<p>While the notations used are slightly different, it is easy to see that this definition is almost the exact reverse of (1). A trivial corollary is that a function is both concave and convex if and only if it is affine.</p>
<p>Given this understanding, we can now revisit the logarithm and quickly verify that it is a concave function.</p>
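<p>As a quick numerical sanity check (not part of the original derivation), we can sample random positive inputs and random nonnegative weights that sum to one, and confirm that the logarithm satisfies (1):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0.1, 10.0, size=5)  # positive inputs, within log's domain
w = rng.dirichlet(np.ones(5))       # nonnegative weights summing to one

# Concavity of log: the log of a weighted average dominates
# the weighted average of the logs.
assert np.log(w @ x) >= w @ np.log(x)
```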
<h1 id="variational-lower-bound">Variational Lower Bound</h1>
<p>Before diving into a soup of equations, it’s important to remind ourselves of the problem setup. While ELBO is probably most commonly referenced in the context of variational autoencoders, I have recently seen it being mentioned in diffusion models as well. ELBO is a broad concept that can be applied to discuss any model with latent (hidden) representations, which we will denote as $h$ henceforth.</p>
<p>More concretely, given a model $p(x, h)$, we can write</p>
\[\begin{align}
\log p(x)
&= \log \left( \sum_{h} p(x, h) \right) \tag{2} \\
&= \log \left( \sum_{h} q(h \vert x) \frac{p(x, h)}{q(h \vert x)} \right) \tag{3} \\
& \geq \sum_{h} q(h \vert x) \log \frac{p(x, h)}{q(h \vert x)} \tag{4} \\
&= \sum_{h} q(h \vert x) \log p(x, h) - \sum_{h} q(h \vert x) \log q(h \vert x) \tag{5} \\
&= \mathbb{E}_q [\log p(x, h) - \log q(h \vert x)] \tag{6}
\end{align}\]
<p>(2) follows from the law of total probability, (3) is a simultaneous application of multiplication and division, (4) follows from the concavity of logarithms, (5) is an algebraic manipulation using the properties of logarithms, and (6) is a rewriting of the expression as an expectation under $q(h \vert x)$.</p>
<h2 id="equivalence-condition">Equivalence Condition</h2>
<p>In the formulation above, $q(h \vert x)$ can be understood as an approximation of a true distribution $p(h \vert x)$. Note that when $q(h \vert x) = p(h \vert x)$, we have an exact equality. Since</p>
\[\log p(x, h) = \log p(h \vert x) + \log p(x)\]
<p>We can substitute $q$ for $p$ and rewrite (5) as</p>
\[\begin{align}
\log p(x)
&= \sum_h p(h \vert x) (\log p(h \vert x) + \log p(x)) - \sum_h p(h \vert x) \log p(h \vert x) \\
&= \sum_h p(h \vert x) \log p(x)
\end{align}\]
<p>Since $p(x)$ does not depend on $h$, we can pull out the term from the summation, treating it as a constant, leaving us with</p>
\[\log p(x) \sum_h p(h \vert x)\]
<p>Using the law of total probability, we see that the summation totals to 1, leaving us with $\log p(x)$, which is what ELBO seeks to approximate.</p>
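<p>To make the equivalence condition concrete, here is a small hedged sketch with a made-up discrete joint distribution $p(x, h)$: when $q(h \vert x)$ equals the true posterior, the bound is tight, and for an arbitrary $q$ it remains a lower bound.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up discrete joint p(x, h) over 3 values of x and 4 values of h.
p_xh = rng.dirichlet(np.ones(12)).reshape(3, 4)
x = 0
p_x = p_xh[x].sum()          # marginal p(x), by total probability
p_h_given_x = p_xh[x] / p_x  # true posterior p(h | x)

def elbo(q):
    # ELBO as in (6): E_q[log p(x, h) - log q(h | x)]
    return q @ (np.log(p_xh[x]) - np.log(q))

# With q = p(h | x), the bound is tight: ELBO == log p(x).
assert np.isclose(elbo(p_h_given_x), np.log(p_x))

# With an arbitrary q, the ELBO stays below log p(x).
q = rng.dirichlet(np.ones(4))
assert elbo(q) <= np.log(p_x)
```

<p>Maximizing the ELBO with respect to $q$ therefore amounts to pushing $q$ toward the true posterior.</p>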
<p>Variational lower bounds are extremely useful when dealing with models whose interactions between $x$ and the hidden representation $h$ are complex, rendering (2) computationally intractable. Therefore, to train such models, we seek to maximize the log likelihood by pushing the lower bound up.</p>
<h2 id="kl-divergence">KL Divergence</h2>
<p>Recall the definition of KL divergence:</p>
\[\begin{align}
D_\text{KL}(q \parallel p)
&= \sum_{x \in X} q(x) \log \left( \frac{q(x)}{p(x)} \right) \\
&= - \sum_{x \in X} q(x) \log \left( \frac{p(x)}{q(x)} \right) \\
\end{align}\]
<p>We can see the resemblance between this definition and the definition of ELBO as written in (4), which was</p>
\[\log p(x) \geq \sum_{h} q(h \vert x) \log \frac{p(x, h)}{q(h \vert x)} \tag{4}\]
<p>The nice conclusion to this story is that</p>
\[\log p(x) - \text{ELBO} = D_\text{KL}(q(h \vert x) \parallel p(h \vert x)) \tag{7}\]
<p>This is a nice interpretation, since KL divergence is by definition always greater or equal to zero. Hence, we can confirm that</p>
\[\log p(x) \geq \text{ELBO}\]
<h3 id="proof">Proof</h3>
<p>In this section, we sketch a quick proof for (7).</p>
\[\begin{align}
D_\text{KL}(q(h \vert x) \parallel p(h \vert x))
&= \mathbb{E}_q [\log q(h \vert x) - \log p(h \vert x) ] \\
&= \mathbb{E}_q [\log q(h \vert x) - \log p(x, h) + \log p(x) ] \\
&= \mathbb{E}_q [\log q(h \vert x) - \log p(x, h)] + \log p(x) \\
\end{align}\]
<p>Notice that the expectation is the sign-flipped version of the ELBO term we derived above.</p>
\[\mathbb{E}_q [\log p(x, h) - \log q(h \vert x)] \tag{6}\]
<p>Therefore, we have</p>
\[D_\text{KL}(q(h \vert x) \parallel p(h \vert x)) = - \text{ELBO} + \log p(x) \\ \implies \log p(x) - \text{ELBO} = D_\text{KL}(q(h \vert x) \parallel p(h \vert x))\]
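<p>We can verify (7) numerically with the same kind of made-up discrete model: the gap between $\log p(x)$ and the ELBO matches the KL divergence exactly.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
p_xh = rng.dirichlet(np.ones(12)).reshape(3, 4)  # made-up joint p(x, h)
x = 0
p_x = p_xh[x].sum()
p_h_given_x = p_xh[x] / p_x

q = rng.dirichlet(np.ones(4))  # arbitrary approximate posterior
elbo = q @ (np.log(p_xh[x]) - np.log(q))
kl = q @ (np.log(q) - np.log(p_h_given_x))

# log p(x) - ELBO equals KL(q || p(h | x)), as in (7).
assert np.isclose(np.log(p_x) - elbo, kl)
```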
<h1 id="denoising-diffusion-probabilistic-models">Denoising Diffusion Probabilistic Models</h1>
<p>Since we have already seen how ELBO comes up in VAEs, it might be more helpful to take a look at another more recent example I came across while reading <a href="https://arxiv.org/abs/2006.11239">Denoising Diffusion Probabilistic Models</a>, or DDPM for short. The intent of this section is not to go over what DDPMs are, but rather to offer a sneak peek at how ELBO is used in the paper.</p>
<p>In the paper, the authors write</p>
<blockquote>
<p>Training is performed by optimizing the usual variational bound on negative log likelihood:
\(\begin{align}
\mathbb{E}[- \log p_\theta(\mathbf{x}_0)]
& \leq \mathbb{E}_q \left[ - \log \frac{p_\theta (\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \right] \tag{8} \\
&= \mathbb{E}_q \left[ - \log p(\mathbf{x}_T) - \sum_{t \geq 1} \log \frac{p_\theta (\mathbf{x}_{t - 1} \vert \mathbf{x}_t)}{q(\mathbf{x}_t \vert \mathbf{x}_{t - 1})} \right] \tag{9} \\
& := L
\end{align}\)</p>
</blockquote>
<p>Equation tags have been added for the purposes of this post.</p>
<p>Admittedly, this does look confusing at first sight, but at its core is the definition of ELBO which we have derived in this post, plus some details inherent to DDPMs, such as Markov chain diffusion. In light of the topic of this post, I will attempt to give the simplest possible explanation of the latter while focusing on the former.</p>
<p>To make things a little more familiar, let’s rewrite (6) to look more like the one presented in the DDPM paper.</p>
\[\begin{align}
\log p(x)
& \geq \mathbb{E}_q [\log p(x, h) - \log q(h \vert x)] \tag{6} \\
&= \mathbb{E}_q \left[ \log \frac{p(x, h)}{q(h \vert x)} \right] \tag{6-1} \\
\end{align}\]
<p>It is not difficult to see that simply flipping the sign on both sides results in an expression that closely resembles (8). We also see a one-to-one correspondence between the variables used in this post and the ones in the paper. Namely, $\mathbf{x}_0$ corresponds to $x$, the ground-truth data, and the $\mathbf{x}_t$ for $t \geq 1$ are the hidden representations of the model.</p>
<p>DDPMs work by starting out with some GT data $\mathbf{x}_0$, then gradually adding Gaussian noise through a Markov chain process. This gradually “breaks” the signals originally present in the data, sending the ground-truth data to an approximately isotropic Gaussian distribution. This process is illustrated below. The figure was taken from the <a href="https://hojonathanho.github.io/diffusion/">author’s website</a>.</p>
<p><img src="https://hojonathanho.github.io/diffusion/assets/img/pgm_diagram_xarrow.png" /></p>
<p>A neural network is then trained to reverse this Markov chain process by recovering the original signal from the noise. The overall intuition is, in some sense, similar to that of GANs or VAEs, where a network learns to map latent dimensions to the data distribution. An obvious difference is that DDPMs iteratively recover the data, whereas GAN generators usually go directly to the data distribution. The slicing and summation notation in (9) exists precisely due to this iterative nature of the DDPM generative process.</p>
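<p>To build intuition for the forward process described above, here is a toy sketch that assumes a made-up constant noise level <code class="language-plaintext highlighter-rouge">beta</code> (the paper uses a schedule that varies with $t$): repeatedly mixing samples with Gaussian noise drives an arbitrary one-dimensional data distribution toward a standard normal.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.05  # made-up constant noise level; the paper uses a schedule over t

# "Data" that starts far from N(0, 1).
x = rng.normal(loc=3.0, scale=0.1, size=1000)

# Each iteration is one forward diffusion step:
# shrink the signal, then add Gaussian noise.
for _ in range(200):
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)

# After many steps, the samples are approximately standard normal.
assert abs(x.mean()) < 0.3 and abs(x.std() - 1.0) < 0.3
```

<p>The summation over $t$ in (9) scores how well the learned reverse model undoes each of these steps.</p>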
<h1 id="conclusion">Conclusion</h1>
<p>Topics like ELBO and KL divergence are among those concepts that I always think I understand, but in reality do not. The mathematical details underlying them are always intriguing to look at.</p>
<p>While this post in no way covers the entirety of the topic, I hope this will lay a solid foundation for those who want to better understand the mathematics behind latent variable models, such as variational autoencoders, DDPMs, and the like. Personally, I am starting to discover a newfound fascination for DDPMs, and hope to write more about them in the near future.</p>
<p>I hope you enjoyed reading this post. Catch you up in the next one!</p>Jake TaeIn this short post, we will take a look at variational lower bound, also referred to as the evidence lower bound or ELBO for short. While I have referenced ELBO in a previous blog post on VAEs, the proofs and formulations presented in the post seems somewhat overly convoluted in retrospect. One might consider this a gentler, more refined recap on the topic. For the remainder of this post, I will use the terms “variational lower bound” and “ELBO” interchangeably to refer to the same concept. I was heavily inspired by Hugo Larochelle’s excellent lecture on deep belief networks.Reboot2021-05-15T00:00:00+00:002021-05-15T00:00:00+00:00https://jaketae.github.io/blog/reboot<p>It has been a while since I last posted on this blog. Admittedly, a lot has happened in my life: I have been discharged from the Republic of Korea Army, received two full vaccination shots, and am now back home, meeting family and friends all of whom I have dearly missed during the 19-months of my military service. Of course, there are things that haven’t changed as well, such as the importance of this blog and my desire to continue documenting the interesting and random things that I learn every day.</p>
<p>Lately I’ve been realizing how powerful a force inertia is. It was easy to churn out posts every week when blogging was part of my personal norm, almost a habit if you will. Then, when perturbations were introduced to my life, I lost equilibrium and regrettably stopped writing on a regular basis. While I continued studying and committing to new and old repositories on my <a href="https://github.com/jaketae">GitHub</a>, for some inexplicable reason I found it difficult to restart something that I had stopped engaging with. Inertia is insidious, yet it concretizes with time, turning into a substance forceful enough to transform the definition of what personal norm entails.</p>
<p>Today, I was trying to wrap my head around the basics of stochastic differential equations and diffusion models (both of which I still do not understand) until I came across the term “score-based models.” The term “score” comes from Fisher’s score, which I had written about some time in the past. It’s an odd feeling when you realize that your self of a few months back was bright enough to understand concepts that the current self finds abstract and incomprehensible. But this wasn’t the only time I looked up something on my own blog. While there were also times when I spotted my own past mistakes, more often than not I found myself using my own writing as reference in an attempt to recall some concept or understanding from distant memory.</p>
<p>The conclusion of this admittedly verbose, ostensibly pointless post, is that documenting one’s intellectual journey is definitely a worthy endeavor. While the format of this post may appear as a self-promotion of sorts, the intended audience is really my future self, who I hope does not succumb to inertia or, put more bluntly, laziness. So here’s to another round of blogging!</p>Jake TaeIt has been a while since I last posted on this blog. Admittedly, a lot has happened in my life: I have been discharged from the Republic of Korea Army, received two full vaccination shots, and am now back home, meeting family and friends all of whom I have dearly missed during the 19-months of my military service. Of course, there are things that haven’t changed as well, such as the importance of this blog and my desire to continue documenting the interesting and random things that I learn every day.Linear Attention Computation in Nyströmformer2021-03-15T00:00:00+00:002021-03-15T00:00:00+00:00https://jaketae.github.io/study/nystrom-approximation<p>In this post, we will take a look at Nyström approximation, a technique that I came across in <a href="https://arxiv.org/pdf/2102.03902.pdf">Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention</a> by Xiong et al. This is yet another interesting paper that seeks to make the self-attention algorithm more efficient down to linear runtime. While there are many intricacies to the Nyström method, the goal of this post is to provide a high level intuition of how the method can be used to approximate large matrices, and how this method was used in the aforementioned paper.</p>
<h1 id="concept">Concept</h1>
<p>Despite its fancy and somewhat intimidating name, the Nyström method has an intuitive explanation. The idea is that, if we know the distance between point A and point B, as well as that between point B and point C, then we can approximate the distance between points A and C as some sort of addition of the two quantities. Of course, if we were discussing distances in the context of one-dimensional space, namely the real number line, we would not only be able to approximate the distance; we would know the exact quantity. However, in high-dimensional space, this is somewhat more difficult, and we can only resort to approximations.</p>
<p>To put things into context, let’s say we want to approximate the attention matrix in the transformer architecture. The Nyström method begins by selecting what the authors of the paper refer to as landmarks. Basically, if we have an attention matrix $A \in \mathbb{R}^{L \times L}$, then we select a few landmark rows and columns to use as the basis or pivot point for our approximation. The goal, then, is to select as few landmarks as possible while being able to approximate the attention matrix as accurately as possible.</p>
<p>For sake of simplicity, let’s say we select the first row and column to be our landmarks. Then, the goal is to approximate the inner sub-matrix $A_\text{sub} \in \mathbb{R}^{(L - 1) \times (L - 1)}$. How might we go about it?</p>
<p>As stated earlier, the intuition is that we use the landmarks as pivot points. Since we selected the first rows and columns as our landmarks, we have access to $q_1 k_n^\top \forall n \leq L$, as well as $q_n k_1^\top \forall n \leq L$ (for simplicity, we ignore the normalizing square root). If we remind ourselves of the motivation behind the transformer’s key-value-query architecture, we can consider attention as a way of calculating the distance or relevance between pairs of tokens in a given sequence. Put differently, the landmarks tell us the distance between the first query and all other keys, as well as the distance between the first key and all other queries.</p>
<p>Without loss of generality, we can approximate the attention score between any $i$th query and $j$th key using these landmarks. The way we do this is somewhat similar to the point A, B, C example we briefly discussed earlier. Namely, we start by looking at the score between the $i$th query and the first key. Then, we also look at the attention value between the first query and the $j$th key. Note that connecting the two dots gives us a sense of how related the $i$th query and the $j$th key are. To remove the redundancy, we divide the product by the self-attention of the first token, or the attention score between the first key and query.</p>
\[A_{ij} = \frac{q_i k_1^\top \cdot q_1 k_j^\top}{q_1 k_1^\top} \tag{1}\]
<p>Of course, if we have multiple landmarks, we can easily expand the expression above into matrix form. The tilde indicates landmark rows and columns.</p>
\[\tilde{A} = Q \tilde{K}^\top \times (\tilde{Q} \tilde{K}^\top)^\star \times \tilde{Q} K^\top \tag{2}\]
<p>The star expression ($\star$) denotes the Moore-Penrose pseudo-inverse.</p>
<p>Now that we have a general intuition of how Nyström approximation works in the context of attention, let’s get into some basic implementation.</p>
<h1 id="implementation">Implementation</h1>
<p>The goal here is to see that Nyström approximation can indeed yield reasonably accurate results, and that the larger the number of key landmarks, the better the approximation. Consider this as a form of Monte Carlo experiment.</p>
<p>Let’s begin by importing some modules.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span><span class="o">=</span><span class="s">"retina"</span>
</code></pre></div></div>
<p>For sake of simplicity, we assume a very basic model with a hidden dimension of 2, and some data points whose sequence length is 5. For simplicity, we omit the batch dimension.</p>
<p>Then, in the context of attention, we would end up with the following keys and query tensors, as well as a five-by-five square attention matrix.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d_model</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">seq_len</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">Q</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
<span class="n">K</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">Q</span> <span class="o">@</span> <span class="n">K</span><span class="p">.</span><span class="n">T</span>
<span class="n">A</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(5, 5)
</code></pre></div></div>
<p>The goal, then, is to approximate this square attention matrix.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">A</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 2.29571874, -0.7373519 , 0.32730778, -0.84730782, -1.16558083],
[ 1.4346883 , -0.32765206, 0.80095764, -0.39437617, 0.17889744],
[ 1.38973136, -0.61066937, -0.53783773, -0.67968999, -1.82523199],
[-1.80977456, 0.1036656 , -2.39735444, 0.18320197, -2.33569844],
[ 1.36516091, -0.40695455, 0.33580143, -0.47186895, -0.47836287]])
</code></pre></div></div>
<p>Let’s begin our approximation by assuming the worst case, in which we only have access to one landmark. This brings us to equation (1) where essentially all operations were done on vectors instead of matrices.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">num_landmarks</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">Q_tilde</span> <span class="o">=</span> <span class="n">Q</span><span class="p">[:</span><span class="n">num_landmarks</span><span class="p">]</span>
<span class="n">K_tilde</span> <span class="o">=</span> <span class="n">K</span><span class="p">[:</span><span class="n">num_landmarks</span><span class="p">]</span>
</code></pre></div></div>
<p>Recalling equations (1) and (2), we can now write the approximation of the attention matrix as follows.</p>
\[\tilde{A} = Q \tilde{K}^\top \times (\tilde{Q} \tilde{K}^\top)^\star \times \tilde{Q} K^\top\]
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">A_tilde</span> <span class="o">=</span> <span class="p">(</span><span class="n">Q</span> <span class="o">@</span> <span class="n">K_tilde</span><span class="p">.</span><span class="n">T</span><span class="p">)</span> <span class="o">@</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">pinv</span><span class="p">(</span><span class="n">Q_tilde</span> <span class="o">@</span> <span class="n">K_tilde</span><span class="p">.</span><span class="n">T</span><span class="p">)</span> <span class="o">@</span> <span class="p">(</span><span class="n">Q_tilde</span> <span class="o">@</span> <span class="n">K</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">A_tilde</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(5, 5)
</code></pre></div></div>
<p>The dimensionality seems to match that of the original attention matrix, as expected. If we print out the approximation, we should expect to see exact matches in the first row and column; the rest of the four-by-four region of the matrix should roughly be similar to that of the original.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">A_tilde</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 2.29571874, -0.7373519 , 0.32730778, -0.84730782, -1.16558083],
[ 1.4346883 , -0.46080128, 0.20454799, -0.52951722, -0.72841901],
[ 1.38973136, -0.44636176, 0.19813834, -0.51292444, -0.7055935 ],
[-1.80977456, 0.58127361, -0.25802521, 0.66795471, 0.91885757],
[ 1.36516091, -0.43847008, 0.19463525, -0.50385594, -0.69311861]])
</code></pre></div></div>
<p>We can indeed quickly verify that the first row and column are exact matches; however, the remaining 16 elements are somewhat difficult to compare. We can more systematically calculate the difference between two matrices by using a norm, such as the Frobenius norm.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">A</span> <span class="o">-</span> <span class="n">A_tilde</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>4.33185890598477
</code></pre></div></div>
<p>If we look at the raw values of the element-wise difference, we can see that the approximation isn’t too bad.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">A</span> <span class="o">-</span> <span class="n">A_tilde</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00],
[-2.22044605e-16, 1.33149223e-01, 5.96409654e-01,
1.35141056e-01, 9.07316456e-01],
[ 0.00000000e+00, -1.64307605e-01, -7.35976069e-01,
-1.66765549e-01, -1.11963848e+00],
[ 0.00000000e+00, -4.77608006e-01, -2.13932924e+00,
-4.84752738e-01, -3.25455600e+00],
[ 0.00000000e+00, 3.15155316e-02, 1.41166181e-01,
3.19869853e-02, 2.14755744e-01]])
</code></pre></div></div>
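<p>The exact matches in the first row and column are not a coincidence: with a single landmark, the Nyström reconstruction provably reproduces every entry that shares a row or column with the landmark. Here is a small self-contained sketch of this check, with freshly sampled matrices (so the numbers differ from those above):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_landmarks = 5, 2, 1

Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))
A = Q @ K.T

# Same reconstruction as above, using the first row as the only landmark.
Q_tilde, K_tilde = Q[:num_landmarks], K[:num_landmarks]
A_tilde = (Q @ K_tilde.T) @ np.linalg.pinv(Q_tilde @ K_tilde.T) @ (Q_tilde @ K.T)

# Rows and columns passing through the landmark are reproduced exactly
# (up to floating-point error).
assert np.allclose(A[:num_landmarks], A_tilde[:num_landmarks])
assert np.allclose(A[:, :num_landmarks], A_tilde[:, :num_landmarks])
```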
<h2 id="monte-carlo-approach">Monte Carlo Approach</h2>
<p>Let’s extend this little trial with one landmark to larger matrices. For ease of implementation and execution, I’ve wrapped each of the steps outlined above in functions.</p>
<p>The first function, <code class="language-plaintext highlighter-rouge">norms_by_landmarks</code>, receives query and key matrices, then approximates the attention matrix while varying the number of landmarks. The Frobenius norm is used to measure how good the approximation is. Theoretically, we should expect to see a downward-sloping pattern.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">norms_by_landmarks</span><span class="p">(</span><span class="n">Q</span><span class="p">,</span> <span class="n">K</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">Q</span> <span class="o">@</span> <span class="n">K</span><span class="p">.</span><span class="n">T</span>
<span class="k">for</span> <span class="n">num_landmarks</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">Q</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
<span class="n">Q_tilde</span> <span class="o">=</span> <span class="n">Q</span><span class="p">[:</span><span class="n">num_landmarks</span><span class="p">]</span>
<span class="n">K_tilde</span> <span class="o">=</span> <span class="n">K</span><span class="p">[:</span><span class="n">num_landmarks</span><span class="p">]</span>
<span class="n">A_tilde</span> <span class="o">=</span> <span class="p">(</span><span class="n">Q</span> <span class="o">@</span> <span class="n">K_tilde</span><span class="p">.</span><span class="n">T</span><span class="p">)</span> <span class="o">@</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">pinv</span><span class="p">(</span><span class="n">Q_tilde</span> <span class="o">@</span> <span class="n">K_tilde</span><span class="p">.</span><span class="n">T</span><span class="p">)</span> <span class="o">@</span> <span class="p">(</span><span class="n">Q_tilde</span> <span class="o">@</span> <span class="n">K</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">A</span> <span class="o">-</span> <span class="n">A_tilde</span><span class="p">))</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div></div>
<p>The second function, <code class="language-plaintext highlighter-rouge">run_experiments</code>, is a wrapper around the first one. It repeatedly conducts the same experiment for a specified number of iterations. The purpose of repetition is essentially to remove the element of luck, where some randomly initialized query and key matrices happen to be configured in such a way that the Nyström approximation performs unusually well or poorly. By repeating the experiment and averaging the results, which is the spirit behind Monte Carlo methods, we can have more confidence in the final result.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">run_experiments</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="n">num_iter</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_iter</span><span class="p">):</span>
<span class="n">Q</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
<span class="n">K</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
<span class="n">norm</span> <span class="o">=</span> <span class="n">norms_by_landmarks</span><span class="p">(</span><span class="n">Q</span><span class="p">,</span> <span class="n">K</span><span class="p">)</span>
<span class="n">result</span> <span class="o">+=</span> <span class="n">norm</span>
<span class="k">return</span> <span class="n">result</span> <span class="o">/</span> <span class="n">num_iter</span>
</code></pre></div></div>
<p>Here, we assume a sequence length of 50 and a model hidden size (or embedding size) of 10. And off we go!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">norms</span> <span class="o">=</span> <span class="n">run_experiments</span><span class="p">(</span><span class="n">d_model</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">seq_len</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">norms</span><span class="p">)),</span> <span class="n">norms</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p> <br />
<img src="/assets/images/2021-03-15-nystrom-approximation_files/2021-03-15-nystrom-approximation_30_0.png" />
</p>
<h1 id="conclusion">Conclusion</h1>
<p>While there is some noise in the final outcome, we do see that beyond a certain number of landmarks, the approximation yields near-exact results. In this case, it happens at around 10 landmarks.</p>
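<p>This threshold is no accident. Since $A = QK^\top$ is a product of $50 \times 10$ and $10 \times 50$ matrices, its rank is at most 10; once the number of landmarks reaches that rank, the landmark sub-block is (generically) invertible and the reconstruction becomes exact. A quick numpy sketch, using the same sizes as the experiment above:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, seq_len = 10, 50
Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))
A = Q @ K.T

# A is a product of (50, 10) and (10, 50) matrices: rank at most 10.
assert np.linalg.matrix_rank(A) == d_model

# With exactly d_model landmarks, the reconstruction is exact
# up to floating-point error.
m = d_model
Q_tilde, K_tilde = Q[:m], K[:m]
A_tilde = (Q @ K_tilde.T) @ np.linalg.pinv(Q_tilde @ K_tilde.T) @ (Q_tilde @ K.T)
assert np.allclose(A, A_tilde, atol=1e-6)
```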
<p>Transformers have now taken over much of the ML world, even beyond NLP. Recently, I came across a paper titled <a href="https://arxiv.org/abs/2103.05247">Pretrained Transformers are Universal Computation Engines</a>. Apparently, pretrained transformer LMs can perform extremely well on tasks with minimal fine-tuning. Specifically, even with the feedforward and attention portions of the network frozen, which amount to nearly 99 percent of the entire model, transformer LMs can be fine-tuned to a wide array of tasks that are not even NLP-related.</p>
<p>While there is certainly a possibility that a new SOTA model architecture will be announced by researchers in the near future, similar to how transformers made LSTMs obsolete in many fields, I think transformers are here to stay for a while longer. And it’s certainly interesting to see attempts to make them even better, lighter, and faster. Nyströmformer was one such attempt, and I hope to see more.</p>Jake TaeIn this post, we will take a look at Nyström approximation, a technique that I came across in Nyströmformer: A Nyström-based Algorithm for Approximating Self-Attention by Xiong et al. This is yet another interesting paper that seeks to make the self-attention algorithm more efficient down to linear runtime. While there are many intricacies to the Nyström method, the goal of this post is to provide a high level intuition of how the method can be used to approximate large matrices, and how this method was used in the aforementioned paper.Relative Positional Encoding2021-03-01T00:00:00+00:002021-03-01T00:00:00+00:00https://jaketae.github.io/study/relative-positional-encoding<p>In this post, we will take a look at relative positional encoding, as introduced in <a href="https://arxiv.org/pdf/1803.02155.pdf">Shaw et al (2018)</a> and refined by <a href="https://arxiv.org/pdf/1809.04281.pdf">Huang et al (2018)</a>. This is a topic I meant to explore earlier, but only recently was I able to really force myself to dive into this concept as I started reading about music generation with NLP language models. This is a separate topic for another post of its own, so let’s not get distracted.</p>
<p>Let’s dive right into it!</p>
<h1 id="concept">Concept</h1>
<p>If you’re already familiar with transformers, you probably know that transformers process all inputs in parallel. This is one of the many reasons why transformers have been immensely more successful than RNNs: RNNs struggle to capture long-range dependencies due to their recurrent structure, whereas transformers do not have this problem since they can see the entire sequence as it is being processed. However, this also means that transformers require positional encodings to inform the model about where specific tokens are located in the context of a full sequence. Otherwise, the transformer would be entirely invariant to word order, considering “John likes cats” and “Cats like John” as identical. Hence, positional encodings are used to signal the absolute position of each token.</p>
<h2 id="relative-positional-encoding">Relative Positional Encoding</h2>
<p>While absolute positional encodings work reasonably well, there have also been efforts to exploit pairwise, relative positional information. In <a href="https://arxiv.org/pdf/1803.02155.pdf">Self-Attention with Relative Position Representations</a>, Shaw et al. introduced a way of using pairwise distances as a way of creating positional encodings.</p>
<p>There are a number of reasons why we might want to use relative positional encodings instead of absolute ones. First, using absolute positional information necessarily means that there is a limit to the number of tokens a model can process. Say a language model can only encode up to 1024 positions. This necessarily means that any sequence longer than 1024 tokens cannot be processed by the model. Using relative pairwise distances can more gracefully solve this problem, though not without limitations. Relative positional encodings can generalize to sequences of unseen lengths, since theoretically the only information it encodes is the relative pairwise distance between two tokens.</p>
<p>Relative positional information is supplied to the model on two levels: values and keys. This becomes apparent in the two modified self-attention equations shown below. First, relative positional information is supplied to the model as an additional component to the keys.</p>
\[e_{ij} = \frac{x_i W^Q (x_j W^K + a_{ij}^K)^\top}{\sqrt{d_z}} \tag{1}\]
<p>The softmax operation remains unchanged from vanilla self-attention.</p>
\[\alpha_{ij} = \frac{\text{exp} \space e_{ij}}{\sum_{k = 1}^n \text{exp} \space e_{ik}}\]
<p>Lastly, relative positional information is supplied again as a sub-component of the values matrix.</p>
\[z_i = \sum_{j = 1}^n \alpha_{ij} (x_j W^V + a_{ij}^V) \tag{2}\]
<p>In other words, instead of simply combining semantic embeddings with absolute positional ones, relative positional information is added to keys and values on the fly during attention calculation.</p>
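<p>To make equations (1) and (2) concrete, here is a minimal, unbatched numpy sketch. The weight matrices and the relative embeddings $a^K$ and $a^V$ are random stand-ins for learned parameters, and the sizes are illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_x, d_z = 4, 8, 8

x = rng.standard_normal((seq_len, d_x))
W_Q = rng.standard_normal((d_x, d_z))
W_K = rng.standard_normal((d_x, d_z))
W_V = rng.standard_normal((d_x, d_z))
# One d_z-vector per (i, j) pair; learned in practice, random stand-ins here.
a_K = rng.standard_normal((seq_len, seq_len, d_z))
a_V = rng.standard_normal((seq_len, seq_len, d_z))

q, k, v = x @ W_Q, x @ W_K, x @ W_V

# Equation (1): e_ij = q_i . (k_j + a_ij^K) / sqrt(d_z)
e = np.einsum("id,ijd->ij", q, k[None, :, :] + a_K) / np.sqrt(d_z)
# Row-wise softmax
alpha = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)
# Equation (2): z_i = sum_j alpha_ij (v_j + a_ij^V)
z = np.einsum("ij,ijd->id", alpha, v[None, :, :] + a_V)
print(z.shape)  # (4, 8)
```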
<h2 id="bridging-shaw-and-huang">Bridging Shaw and Huang</h2>
<p>In Huang et al., also known as the music transformer paper, the authors pointed out that calculating relative positional encodings as introduced in Shaw et al. requires $O(L^2D)$ memory due to the introduction of an additional relative positional encoding matrix. Here, $L$ denotes the length of the sequence, and $D$, the hidden state dimension used by the model. Huang et al. introduced a new way of computing relative positional encoding via a clever skewing operation.</p>
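<p>To get a feel for the scale of this bottleneck, consider illustrative (hypothetical) sizes of $L = 2048$ and $D = 512$: materializing an explicit $(L, L, D)$ tensor of per-pair relative embeddings in fp32 would require</p>

```python
# Hypothetical sizes; an explicit (L, L, D) fp32 tensor of relative embeddings.
L, D = 2048, 512
bytes_needed = L * L * D * 4
print(bytes_needed / 2**30)  # 8.0 (GiB)
```

<p>Eight gibibytes for a single intermediate tensor quickly becomes prohibitive as $L$ grows.</p>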
<p>To cut to the chase, below is the relative attention mechanism suggested by the authors in Huang et al.</p>
\[\text{RelativeAttention} = \text{Softmax} \left( \frac{Q K^\top + S_{rel}}{\sqrt{D_h}} \right) V \tag{3}\]
<p>It seems that in the music transformer paper, the authors dropped the additional relative positional embedding that corresponds to the value term and focused only on the key component. In other words, they use only (1), not (2).</p>
<p>The notations in (1), (2), and (3) were each borrowed verbatim from the authors of both papers. Hence, there is some notational mixup that requires attention. Specifically, $S_{rel}$ in the music transformer paper is simply</p>
\[S_{rel} = Q R^\top\]
<p>where</p>
\[R_{ij} = a_{ij}^K\]
<p>In other words, (3) is just an expanded variant of (1).</p>
<p>To make things a little clearer, let’s review the dimensions of each tensor. First, from vanilla self-attention, we know that $Q \in \mathbb{R}^{H \times L \times D_h}$, where $H$ denotes the number of heads. Thus, $R \in \mathbb{R}^{H \times L \times D_h}$, and $S_{rel} \in \mathbb{R}^{H \times L \times L}$. $R$ is a matrix of relative positional embeddings. Intuitively, $R$ can also be understood as the result of passing a matrix of relative positional indices through an embedding layer.</p>
<h2 id="efficient-computation">Efficient Computation</h2>
<p>The skewing mechanism introduced in Huang et al. is ingenious, but it isn’t black magic. The technique can roughly be understood as a set of clever padding and matrix manipulation operations that ultimately produce $S_{rel}$ without explicitly creating or computing $R$. The reason we want to avoid materializing $R$ is that it is a huge memory bottleneck, requiring $O(L^2 d)$ extra space.</p>
<p>Before getting to the skewing operation itself, here, for concreteness, is a dummy function that creates the matrix of relative positional indices mentioned above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">relative_positions</span><span class="p">(</span><span class="n">seq_len</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">seq_len</span><span class="p">):</span>
<span class="n">front</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="o">-</span><span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="n">end</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">seq_len</span> <span class="o">-</span> <span class="n">i</span><span class="p">))</span>
<span class="n">result</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">front</span> <span class="o">+</span> <span class="n">end</span><span class="p">)</span>
<span class="k">return</span> <span class="n">result</span>
</code></pre></div></div>
<p>Let’s see what the indices look like for a sequence of length five.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">relative_positions</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[[0, 1, 2, 3, 4],
[-1, 0, 1, 2, 3],
[-2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1],
[-4, -3, -2, -1, 0]]
</code></pre></div></div>
<p>We can understand each row as indicating the current position of attention, and each entry as representing the distance between the current token and the token at that position. A quick disclaimer: this example does not strictly follow the details outlined in Shaw et al. For instance, the function does not take into account $k$, the width of the clipping window. The 0-based indexing scheme is also from Huang et al. These minor details notwithstanding, having a clear sense of what $R$ is, I think, is very helpful in understanding relative attention, as well as the skewing mechanism introduced in Huang et al. For a fuller explanation of these concepts, I highly recommend <a href="https://medium.com/@_init_/how-self-attention-with-relative-position-representations-works-28173b8c245a">this Medium article</a>.</p>
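<p>As a concrete (hypothetical) sketch of the “indices through an embedding layer” intuition, we can clip these relative indices to a window of width $k$, as in Shaw et al., and use them to index into an embedding table. The function name and sizes here are illustrative:</p>

```python
import numpy as np

def build_relative_embeddings(seq_len, d_head, k, rng):
    # One learned d_head-vector per clipped distance in [-k, k];
    # a random table stands in for learned parameters here.
    table = rng.standard_normal((2 * k + 1, d_head))
    idx = np.arange(seq_len)
    # rel[i, j] = clip(j - i, -k, k), shifted by k into non-negative indices.
    rel = np.clip(idx[None, :] - idx[:, None], -k, k) + k
    return table[rel]  # shape: (seq_len, seq_len, d_head)

R = build_relative_embeddings(seq_len=5, d_head=3, k=2, rng=np.random.default_rng(0))
print(R.shape)  # (5, 5, 3)
```

<p>Every pair of positions at the same relative distance shares the same embedding vector, which is exactly what makes relative encodings length-agnostic.</p>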
<p>Below is a visual summary of the skewing mechanism.</p>
<p><img src="/assets/images/relative_attn_skewing.png" /></p>
<p>Personally, I found this diagram to be a bit confusing at first. However, with much staring and imagination, I slowly started to realize that the skewing is simply a way of transforming $QE_r^\top$ into $QR^\top$, where $E_r$ is the relative positional embedding matrix.</p>
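<p>To convince ourselves that the skew really performs this transformation, here is a small numpy sketch with toy sizes and random stand-ins for $Q$ and $E_r$. It applies the pad-reshape-slice trick and checks the causal (lower-triangular) entries against a direct lookup:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 5, 3
Q = rng.standard_normal((L, d))
# Er rows ordered by distance: row L-1 <-> distance 0, row 0 <-> distance -(L-1).
Er = rng.standard_normal((L, d))

QEr = Q @ Er.T                           # (L, L)
padded = np.pad(QEr, ((0, 0), (1, 0)))   # prepend a zero column -> (L, L + 1)
Srel = padded.reshape(L + 1, L)[1:, :]   # reshape, drop the first row -> (L, L)

# On the causal (j <= i) entries, Srel[i, j] = Q[i] . Er[L - 1 + j - i].
for i in range(L):
    for j in range(i + 1):
        assert np.isclose(Srel[i, j], Q[i] @ Er[L - 1 + j - i])
```

<p>Only the $j \le i$ entries matter here, since the upper triangle is masked out before the softmax anyway.</p>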
<p>Instead of trying to explain this in plain text, I decided that implementing the entire relative global attention layer would not only help with demonstration, but also cement my own understanding of how this works.</p>
<h1 id="implementation">Implementation</h1>
<p>This implementation of relative global attention was in large part influenced by Karpathy’s <a href="https://github.com/karpathy/minGPT">minGPT</a>, which we discussed in <a href="https://jaketae.github.io/study/gpt/">this previous post</a>, as well as Prayag Chatha’s implementation of the music transformer, available on GitHub <a href="https://github.com/chathasphere/pno-ai">here</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
</code></pre></div></div>
<p>Below is a simple implementation of a relative global attention layer. I’ve deviated from Chatha’s implementation in a number of ways, but the most important difference, and the one most worth mentioning, is how I treat the relative positional embedding matrix. In Shaw et al., the authors note that “[relative positional embeddings] can be shared across attention heads.” Hence, I’m using one <code class="language-plaintext highlighter-rouge">Er</code> matrix to handle all heads, instead of creating one per head. This matrix is registered as a <code class="language-plaintext highlighter-rouge">nn.Parameter</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RelativeGlobalAttention</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
<span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
<span class="n">d_head</span><span class="p">,</span> <span class="n">remainder</span> <span class="o">=</span> <span class="nb">divmod</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">)</span>
<span class="k">if</span> <span class="n">remainder</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
<span class="s">"incompatible `d_model` and `num_heads`"</span>
<span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">max_len</span> <span class="o">=</span> <span class="n">max_len</span>
<span class="bp">self</span><span class="p">.</span><span class="n">d_model</span> <span class="o">=</span> <span class="n">d_model</span>
<span class="bp">self</span><span class="p">.</span><span class="n">num_heads</span> <span class="o">=</span> <span class="n">num_heads</span>
<span class="bp">self</span><span class="p">.</span><span class="n">key</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">query</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">dropout</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">Er</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">d_head</span><span class="p">))</span>
<span class="bp">self</span><span class="p">.</span><span class="n">register_buffer</span><span class="p">(</span>
<span class="s">"mask"</span><span class="p">,</span>
<span class="n">torch</span><span class="p">.</span><span class="n">tril</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="n">max_len</span><span class="p">))</span>
<span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># self.mask.shape = (1, 1, max_len, max_len)
</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="c1"># x.shape == (batch_size, seq_len, d_model)
</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span>
<span class="k">if</span> <span class="n">seq_len</span> <span class="o">></span> <span class="bp">self</span><span class="p">.</span><span class="n">max_len</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
<span class="s">"sequence length exceeds model capacity"</span>
<span class="p">)</span>
<span class="n">k_t</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">key</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_heads</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">).</span><span class="n">permute</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># k_t.shape = (batch_size, num_heads, d_head, seq_len)
</span> <span class="n">v</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">value</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_heads</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">).</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">num_heads</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">).</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># shape = (batch_size, num_heads, seq_len, d_head)
</span>
<span class="n">start</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_len</span> <span class="o">-</span> <span class="n">seq_len</span>
<span class="n">Er_t</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">Er</span><span class="p">[</span><span class="n">start</span><span class="p">:,</span> <span class="p">:].</span><span class="n">transpose</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># Er_t.shape = (d_head, seq_len)
</span> <span class="n">QEr</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">Er_t</span><span class="p">)</span>
<span class="c1"># QEr.shape = (batch_size, num_heads, seq_len, seq_len)
</span> <span class="n">Srel</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">skew</span><span class="p">(</span><span class="n">QEr</span><span class="p">)</span>
<span class="c1"># Srel.shape = (batch_size, num_heads, seq_len, seq_len)
</span>
<span class="n">QK_t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">k_t</span><span class="p">)</span>
<span class="c1"># QK_t.shape = (batch_size, num_heads, seq_len, seq_len)
</span> <span class="n">attn</span> <span class="o">=</span> <span class="p">(</span><span class="n">QK_t</span> <span class="o">+</span> <span class="n">Srel</span><span class="p">)</span> <span class="o">/</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">q</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span>
<span class="n">mask</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mask</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:</span><span class="n">seq_len</span><span class="p">,</span> <span class="p">:</span><span class="n">seq_len</span><span class="p">]</span>
<span class="c1"># mask.shape = (1, 1, seq_len, seq_len)
</span> <span class="n">attn</span> <span class="o">=</span> <span class="n">attn</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">mask</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s">"-inf"</span><span class="p">))</span>
<span class="c1"># attn.shape = (batch_size, num_heads, seq_len, seq_len)
</span> <span class="n">attn</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">attn</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">attn</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
<span class="c1"># out.shape = (batch_size, num_heads, seq_len, d_head)
</span> <span class="n">out</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># out.shape == (batch_size, seq_len, num_heads, d_head)
</span> <span class="n">out</span> <span class="o">=</span> <span class="n">out</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># out.shape == (batch_size, seq_len, d_model)
</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">skew</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">QEr</span><span class="p">):</span>
<span class="c1"># QEr.shape = (batch_size, num_heads, seq_len, seq_len)
</span> <span class="n">padded</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">pad</span><span class="p">(</span><span class="n">QEr</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="c1"># padded.shape = (batch_size, num_heads, seq_len, 1 + seq_len)
</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">num_rows</span><span class="p">,</span> <span class="n">num_cols</span> <span class="o">=</span> <span class="n">padded</span><span class="p">.</span><span class="n">shape</span>
<span class="n">reshaped</span> <span class="o">=</span> <span class="n">padded</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">num_cols</span><span class="p">,</span> <span class="n">num_rows</span><span class="p">)</span>
<span class="c1"># reshaped.shape = (batch_size, num_heads, 1 + seq_len, seq_len)
</span> <span class="n">Srel</span> <span class="o">=</span> <span class="n">reshaped</span><span class="p">[:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:,</span> <span class="p">:]</span>
<span class="c1"># Srel.shape = (batch_size, num_heads, seq_len, seq_len)
</span> <span class="k">return</span> <span class="n">Srel</span>
</code></pre></div></div>
<p>Most of the operations in the <code class="language-plaintext highlighter-rouge">forward</code> method are direct code translations of the equations we discussed above. The interesting bit happens in the <code class="language-plaintext highlighter-rouge">skew</code> method: we pad $Q E_r^\top$ on the left, reshape to shift all indices, then slice out the necessary portion of the matrix to obtain $Q R^\top$, or $S_{rel}$. This reduces the memory requirement: since we never have to materialize $R$ and can instead directly use $E_r$, a matrix that is needed anyway, the memory cost drops from $O(L^2 d)$ to $O(Ld)$. This is, in my view, one of the biggest contributions of Huang et al.</p>
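<p>To see the index shift concretely, here is a tiny standalone sketch of the pad, reshape, and slice steps using NumPy with toy shapes (this is illustrative only, not the model code above):</p>

```python
import numpy as np

# Toy skew: batch_size = num_heads = 1, seq_len = 3.
QEr = np.arange(9).reshape(1, 1, 3, 3)

# Pad one column of zeros on the left: (3, 3) -> (3, 4).
padded = np.pad(QEr, ((0, 0), (0, 0), (0, 0), (1, 0)))

# Reinterpret the same buffer as (4, 3); the extra column staggers
# the entries across rows.
reshaped = padded.reshape(1, 1, 4, 3)

# Drop the first (garbage) row to recover a (3, 3) matrix.
Srel = reshaped[:, :, 1:, :]
print(Srel[0, 0])
# [[2 0 3]
#  [4 5 0]
#  [6 7 8]]
```

<p>Each row’s entries end up shifted one position relative to the row above, which is exactly the realignment that turns the absolute-indexed columns of $E_r$ into relative positions.</p>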
<p>Let’s check that the layer works as intended by performing a basic tensor shape check.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">seq_len</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">d_model</span> <span class="o">=</span> <span class="mi">768</span>
<span class="n">num_heads</span> <span class="o">=</span> <span class="mi">12</span>
<span class="n">test_in</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
<span class="n">l</span> <span class="o">=</span> <span class="n">RelativeGlobalAttention</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">)</span>
<span class="n">l</span><span class="p">(</span><span class="n">test_in</span><span class="p">).</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torch.Size([8, 100, 768])
</code></pre></div></div>
<p>We get an output of size <code class="language-plaintext highlighter-rouge">(batch_size, seq_len, d_model)</code>, which is what we expect.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this post, we discussed relative positional encoding as introduced in Shaw et al., and saw how Huang et al. were able to improve this algorithm by introducing memory optimizations.</p>
<p>Relative positional encodings were used in other architectures, such as Transformer XL, and more recently, DeBERTa, which I also plan on reviewing soon. Relative positioning is probably a lot closer to how we humans read text. While it is probably not a good idea to always compare and conflate model architectures with how the human brain works, I still think it’s an interesting way to think about these concepts.</p>
<p>This post was also a healthy exercise in that it really forced me to try to understand every single detail. Every sentence and diagram can be of huge help when you are trying to actually implement ideas that are outlined in published papers. I could see why <a href="https://paperswithcode.com">Papers with Code</a> became such a huge thing. It’s always helpful to see actual implementations and, even better, reproducible results. In this particular post, referencing music transformer implementations on GitHub and re-reading the paper many times really helped me nail down points that were initially confusing or unclear.</p>
<p>I hope you’ve enjoyed reading this post. Catch you up in the next one!</p>Jake TaeLocality Sensitive Hashing2021-02-25T00:00:00+00:002021-02-25T00:00:00+00:00https://jaketae.github.io/study/lsh<p>These days, I’ve found myself absorbed in the world of memory-efficient transformer architectures. Transformer models require $O(n^2)$ runtime and memory due to how self-attention is implemented: each token must attend to every other token in the sequence, and the results are stored in a square attention matrix, to which we apply a softmax activation to obtain the weights to multiply the values with.</p>
<p>So far, many researchers have presented various ways of optimizing this computation, bringing the algorithm down to linear runtime. Such architectures include the <a href="https://arxiv.org/abs/2006.04768">Linformer</a>, <a href="https://arxiv.org/abs/2001.04451">Reformer</a>, <a href="https://arxiv.org/abs/2009.14794">Performer</a>, <a href="https://arxiv.org/abs/2004.05150">LongFormer</a>, and more recently, the <a href="https://arxiv.org/abs/2102.03902">Nyströmformer</a>. My knowledge base is way too shallow to be able to read these papers on my own. Thankfully, there are heroes like <a href="https://www.youtube.com/channel/UCZHmQk67mSJgfCCTn7xBfew">Yannic Kilcher</a> who help make trendy deep learning papers a lot more accessible, even for novices like myself. I cannot recommend his channel enough.</p>
<p>Today, we’ll explore an algorithm known as LSH, or locality-sensitive hashing. LSH was used in Reformer, which is one of the linear-runtime transformer models in the list. This is intended as a beginner-friendly introduction to this topic; I hope other readers can get a sense of what it is and have a better time understanding how the Reformer architecture works. As supplements, I also suggest that you check out this <a href="https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134">medium article</a> as well as <a href="https://santhoshhari.github.io/Locality-Sensitive-Hashing/">this blog post</a>, both of which I referenced in writing this post.</p>
<p>Without further ado, let’s get started!</p>
<h1 id="concept">Concept</h1>
<p>Imagine that you are building a music identification service like <a href="https://www.shazam.com">Shazam</a>. You probably have a huge database of songs. Whenever a user plays a song, the engine should be able to conduct some sort of scan through the database to find which row best matches the song being played by the user. We can imagine, for instance, that the entire database is a matrix, and that each song is a vectorized row. We would then use some kind of distance metric, like cosine similarity, to determine how well a given song matches the user query.</p>
<p>If we have a relatively small database, a linear scan could work. However, when there are millions and billions of songs, perhaps that’s not the best implementation. If each song is encoded as a high-dimensional vector, a multiplication involving a trillion-by-million matrix is not really tractable in computational terms. One thing we could do to remedy this is to use a dimensionality reduction technique, like PCA. We can also try to do some clustering or bucketing.</p>
<p>LSH is an algorithm that can accomplish both tasks at once: namely, dimensionality reduction via hashing, and clustering of sorts via bucketing or binning. Let’s walk through this process step-by-step.</p>
<h2 id="hashing">Hashing</h2>
<p>There are many ways of hashing. So far, I’ve looked at two examples: min-hashing (also known as min-wise independent permutations) and random projections. In this post, we will look at the random projection method, which I not only find intuitive, but which is also the method used in the <a href="https://arxiv.org/abs/2001.04451">Reformer paper</a>.</p>
<p>In most contexts, the goal of hashing is to map some item to a unique point living in another space. In other words, if $a \neq b$, then we hope that</p>
\[h(a) \neq h(b)\]
<p>where $h()$ is a hashing function.</p>
<p>In the context of LSH, however, this is not the case. In fact, we want similar data points to be mapped to the same point, with high probability. In other words, given some large threshold value $0 \leq \alpha \leq 1$, we want</p>
\[\text{Pr}(h(a) = h(\tilde{a})) \geq \alpha\]
<p>where $\tilde{a}$ denotes a data point that is similar or close to $a$. In LSH-specific terms, we want the two data points to end up in the same bucket after going through the hash function. Going back to our music identification example, we could think of LSH as clustering similar songs into the same category.</p>
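<p>As a quick sanity check of this property, we can simulate it: for a single random hyperplane through the origin, two vectors separated by an angle $\theta$ hash to the same side with probability $1 - \theta / \pi$, a standard fact about random projections. The dimensions and angle below are arbitrary choices for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unit vectors at a known angle theta.
theta = np.pi / 6
a = np.array([1.0, 0.0])
b = np.array([np.cos(theta), np.sin(theta)])

# Sample many random hyperplanes (via their normal vectors) and check
# how often a and b fall on the same side.
normals = rng.normal(size=(100_000, 2))
same_side = np.sign(normals @ a) == np.sign(normals @ b)
print(same_side.mean())   # empirical collision probability
print(1 - theta / np.pi)  # theoretical value, ~0.8333
```

<p>The empirical frequency should land very close to the theoretical value, confirming that nearby vectors collide with high probability.</p>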
<h2 id="projection">Projection</h2>
<p>While there are many different hash functions we could use for the purposes of LSH (note that cryptographic hashing functions such as SHA will not work here for the reason mentioned above), for our purposes, we will be taking a look at random projections.</p>
<p>The intuition behind random projections is simple: given high-dimensional data points, we want to project these vectors down to lower dimensions where similar vectors will be grouped together into the same bucket. More concretely, we can come up with $k$ random vectors and project each data point onto each of these vectors. If, for example, the dot product between the $i$th random vector and a data point is positive, then we encode that information by setting the $i$th index of the resulting hashed representation to 1; if it is zero or a negative value, we denote it as 0. At the end of the day, each data point is thus mapped to a binary vector of length $k$. Below is an illustration taken from Codeforces that better visualizes this concept.</p>
<p><img src="https://codeforces.com/predownloaded/40/ea/40ea4175b414993760a0bbd6fb6c5862889391aa.png" /></p>
<p>The resulting binary vectors are then put into buckets. The number of buckets will be at most $2^k$, since this is the total number of representations that are possible given a $k$-dimensional binary vector. In the illustration above, $k=3$, and each binary vector becomes a bucket of its own.</p>
<h1 id="implementation">Implementation</h1>
<p>All of this could have sounded a little abstract and confusing, but in reality, it’s really nothing more than just matrix multiplication.</p>
<p>Let’s first import NumPy.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">init_dim</code> refers to the original dimension in which our high-dimensional data points are living. <code class="language-plaintext highlighter-rouge">num_data</code> is the total number of data points. From these pieces of information, we can deduce that the design matrix $D \in \mathbb{R}^{10 \times 5}$. Last but not least, <code class="language-plaintext highlighter-rouge">num_rvecs</code> denotes the number of random vectors. To make things simple, we set it to a small number.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">init_dim</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">num_data</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">num_rvecs</span> <span class="o">=</span> <span class="mi">2</span>
</code></pre></div></div>
<p>The number of random vectors is what determines the number of buckets. Intuitively, it is not difficult to see that the higher the number of random vectors, the more fine-grained the final binary outputs will be. This also means, however, that every bucket will probably end up having only a few data points each, which defeats the purpose of bucketing via LSH.</p>
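<p>We can observe this trade-off directly by counting how many distinct hash codes actually appear as we increase the number of random vectors (the sizes below are made up for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))

# More random vectors means more, and therefore smaller, buckets.
counts = {}
for k in (1, 2, 4, 8):
    proj = rng.normal(size=(5, k))
    codes = {tuple(row) for row in (data @ proj > 0).astype(int)}
    counts[k] = len(codes)
    print(k, counts[k])
```

<p>With a single random vector we get just two buckets; as $k$ grows, the count climbs toward (but never exceeds) $2^k$, spreading the same data over ever-smaller buckets.</p>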
<p>Let’s create a contrived dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">num_data</span><span class="p">,</span> <span class="n">init_dim</span><span class="p">)</span>
<span class="n">data</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(10, 5)
</code></pre></div></div>
<p>Next, let’s create a matrix containing random vectors. This can loosely be referred to as the projection matrix.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">proj</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">init_dim</span><span class="p">,</span> <span class="n">num_rvecs</span><span class="p">)</span>
<span class="n">proj</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[ 1.3013903 , -2.34361703],
[ 0.14915403, -1.0453711 ],
[-0.47002247, -0.16004093],
[ 1.30216575, -0.49852838],
[ 0.06249788, 0.19392549]])
</code></pre></div></div>
<p>Note that each column is a random vector. This could be somewhat confusing, as we are used to seeing each row as a distinct item, but for matrix multiplication purposes, consider this a transposed matrix.</p>
<p>Now, we can obtain the result of the projection by simply computing the product of the two matrices.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">result</span> <span class="o">=</span> <span class="n">data</span> <span class="o">@</span> <span class="n">proj</span>
<span class="n">result</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[-0.47585172, 2.05332923],
[ 0.36156221, -1.89596521],
[-2.61458497, 2.98516562],
[ 1.2037197 , -0.36646877],
[ 2.33599015, -4.88713399],
[ 0.80701667, -2.19645812],
[-2.01608837, 1.98033745],
[-0.06135221, 0.27154208],
[ 0.28265284, -0.23497936],
[-0.14683807, 0.75087065]])
</code></pre></div></div>
<p>Notice that the ten data points, which were previously five-dimensional, have now been projected down to two dimensions. But still, we can’t perform bucketing quite yet; to finalize hashing via random projections, we need to encode this result as binary vectors. This can simply be done by comparing the matrix with 0.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hashed</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">tuple</span><span class="p">,</span> <span class="p">(</span><span class="n">result</span> <span class="o">></span> <span class="mi">0</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)))</span>
<span class="n">hashed</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(0, 1),
(1, 0),
(0, 1),
(1, 0),
(1, 0),
(1, 0),
(0, 1),
(0, 1),
(1, 0),
(0, 1)]
</code></pre></div></div>
<p>And voila! We now have binary, hashed representations for each data point. Let’s take a closer look. Notice that the first and third data points have both ended up as <code class="language-plaintext highlighter-rouge">(0, 1)</code>. This means that <code class="language-plaintext highlighter-rouge">(0, 1)</code> forms a bucket containing both of these data points. The same goes for the other data points: those with identical binary representations belong in the same bucket.</p>
<p>We can perform this bucketing systematically with the following code snippet.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="n">buckets</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">hashed</span><span class="p">):</span>
<span class="n">buckets</span><span class="p">[</span><span class="n">row</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
</code></pre></div></div>
<p>And we see that there are a total of two buckets.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">dict</span><span class="p">(</span><span class="n">buckets</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{(0, 1): [0, 2, 6, 7, 9], (1, 0): [1, 3, 4, 5, 8]}
</code></pre></div></div>
<p>A good LSH implementation would most likely ensure that every bucket holds roughly the same number of data points. Moreover, the buckets would reflect actual distances between data points in their original dimension. In other words, data points that were close to each other would probably end up in the same bucket. The randomness of the projection tries to ensure this property.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">first_row</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">distances</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">row</span><span class="p">,</span> <span class="n">first_row</span><span class="p">)</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">data</span><span class="p">])</span>
<span class="n">np</span><span class="p">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">distances</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([4, 5, 3, 8, 2, 1, 9, 7, 6, 0])
</code></pre></div></div>
<p>It appears that 4, 5, 3, 8, and 2 are the data points farthest from the first (zero-indexed) data point. Our toy LSH implementation almost got them right, with the exception of placing 2 in the same bucket as 0 and sending 1 to the opposite bucket instead. However, given that 2 and 1 were right next to each other in the ranking, the algorithm has arguably done reasonably well here in binning vectors by their distance in the original high-dimensional space.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Locality sensitive hashing can be used in many places. The music identification engine is an obvious one, where we would basically hash the songs in the database into buckets. Then, we would perform the same hashing on the user input, see which bucket it lands in, and only query the candidates within that same bucket. This greatly reduces the amount of linear scanning that has to take place.</p>
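<p>As a rough sketch of this query flow, reusing the random projection hash from above (the database size and dimensions are made up):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical toy "database": 1,000 songs as 5-dimensional vectors.
db = rng.normal(size=(1000, 5))
proj = rng.normal(size=(5, 2))

def hash_vec(x):
    # Sign pattern of the random projections, as a hashable tuple.
    return tuple((x @ proj > 0).astype(int))

# Index every song into its bucket.
buckets = {}
for i, row in enumerate(db):
    buckets.setdefault(hash_vec(row), []).append(i)

# Query: hash the user's clip and scan only the matching bucket,
# falling back to a full scan if the bucket happens to be empty.
query = rng.normal(size=5)
candidates = buckets.get(hash_vec(query)) or range(len(db))
best = max(candidates, key=lambda i: db[i] @ query)
print(len(candidates), best)
```

<p>Only the candidates sharing the query’s hash code are scored, so the expensive similarity computation touches a small fraction of the database.</p>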
<p>In the context of the transformer architecture, the researchers who developed Reformer reduced the number of computations needed to produce the attention matrix by binning the key and query vectors into appropriate buckets and performing self-attention only within those buckets. This exploits the fact that the weighted value vector largely depends only on keys with high attention coefficients, since the softmax tends to squash smaller values and amplify larger ones. This is a very cursory explanation of how the Reformer optimizes attention calculation; we will probably explore it in a separate blog post.</p>
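<p>A heavily simplified toy sketch of this idea might look as follows; note that the actual Reformer additionally sorts and chunks the buckets, applies causal masking, and uses multiple rounds of hashing:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d = 16, 4
# Toy "shared query/key" and value matrices (Reformer ties queries and keys).
qk = rng.normal(size=(seq_len, d))
v = rng.normal(size=(seq_len, d))

# LSH-bucket the positions with two random vectors, as before.
proj = rng.normal(size=(d, 2))
codes = [tuple(row) for row in (qk @ proj > 0).astype(int)]

out = np.zeros_like(v)
for code in set(codes):
    idx = [i for i, c in enumerate(codes) if c == code]
    block = qk[idx]
    # Softmax attention restricted to positions in the same bucket.
    scores = block @ block.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out[idx] = weights @ v[idx]

print(out.shape)  # (16, 4)
```

<p>Instead of one $16 \times 16$ attention matrix, we compute a few small per-bucket matrices, which is where the savings come from as sequences grow long.</p>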
<p>I hope you’ve enjoyed reading this article. Catch you up in the next one!</p>Jake TaeBERT’s Common Sense, or Lack Thereof2021-02-18T00:00:00+00:002021-02-18T00:00:00+00:00https://jaketae.github.io/category/common-sense<p>A few days ago, I came across a simple yet interesting paper, titled <a href="https://www.aclweb.org/anthology/2020.emnlp-main.557.pdf">“NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models”</a>, published at EMNLP 2020. As you might be able to tell from the leading subtitle of the paper, “Birds have four legs?”, the paper explores the degree of common sense that pretrained language models like BERT and RoBERTa possess. Although these language models are good at identifying general common sense knowledge, such as the fact that “birds can fly,” the authors found that LMs are surprisingly poor at answering numerical common sense questions.</p>
<p>I decided to see if it is indeed the case that BERT performs poorly on such numerical common sense masked language modeling tasks. I also thought it would be helpful to demonstrate how one can go about basic language modeling using pretrained models. Let’s get into it!</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>Although this is not immediately pertinent to the topic at hand, I decided to write a short but hopefully helpful section on how tokenization works in HuggingFace transformers. This is more for a self-documenting purpose: I’ve personally found myself confused by the many ways of tokenizing text. Generally, it’s probably a good idea to simply invoke the <code class="language-plaintext highlighter-rouge">__call__</code> function, but it’s also helpful to know what options are out there.</p>
<p>Let’s first install the transformers library.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%%capture
!pip install transformers
</code></pre></div></div>
<p>We will be using BERT base for our tutorial. I’ve found that using <code class="language-plaintext highlighter-rouge">Auto</code> classes is the no-brainer move.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"bert-base-uncased"</span><span class="p">)</span>
</code></pre></div></div>
<p>And here is a set of dummy sentences we will be using for our little tokenizer usage demo.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sentences</span> <span class="o">=</span> <span class="p">[</span><span class="s">"This is a sentence."</span><span class="p">,</span> <span class="s">"Here is another sentence. This is a little longer."</span><span class="p">,</span> <span class="s">"This is short."</span><span class="p">]</span>
</code></pre></div></div>
<h2 id="call">Call</h2>
<p>Simply calling the tokenizer results in a dictionary whose keys are input IDs, token type IDs, and the attention mask. Input IDs are obvious: these are simply mappings between tokens and their respective IDs. The attention mask prevents the model from looking at padding tokens. Token type IDs are typically used in next sentence prediction tasks, where two sentences are given. Unless we supply two arguments to tokenizer methods, the tokenizer will safely assume that we aren’t dealing with tasks that require this two-sentence distinction.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">(</span><span class="n">sentences</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'input_ids': [[101, 2023, 2003, 1037, 6251, 1012, 102], [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102], [101, 2023, 2003, 2460, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
</code></pre></div></div>
<h2 id="encode">Encode</h2>
<p>Another method that appears like a plausible candidate is the <code class="language-plaintext highlighter-rouge">tokenizer.encode()</code> method.</p>
<p>While this function is indeed useful, it does have a limitation: it can only process one string. In other words, it does not support batches. Therefore, to see the result of the function, we need to employ a for loop.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="n">sentences</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">sentence</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[101, 2023, 2003, 1037, 6251, 1012, 102]
[101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102]
[101, 2023, 2003, 2460, 1012, 102]
</code></pre></div></div>
<p>As you can see, the result is a list containing input IDs. We could also specify the maximum length and set truncation to true to batch these inputs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="n">sentences</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">sentence</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[101, 2023, 2003, 1037, 102]
[101, 2182, 2003, 2178, 102]
[101, 2023, 2003, 2460, 102]
</code></pre></div></div>
<p>To avoid loss of information due to aggressive truncation, we can also set a longer maximum length and set padding to maximum length. From the output below, it becomes obvious what the effect of this configuration is.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="n">sentences</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">sentence</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="s">"max_length"</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[101, 2023, 2003, 1037, 6251, 1012, 102, 0, 0, 0, 0, 0]
[101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 102]
[101, 2023, 2003, 2460, 1012, 102, 0, 0, 0, 0, 0, 0]
</code></pre></div></div>
<h2 id="encode-plus">Encode Plus</h2>
<p><code class="language-plaintext highlighter-rouge">tokenizer.encode_plus()</code> is actually quite similar to the regular encode function, except that it returns a dictionary that includes all the keys that we’ve discussed above: input IDs, token type IDs, and attention mask.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="n">sentences</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode_plus</span><span class="p">(</span><span class="n">sentence</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'input_ids': [101, 2023, 2003, 1037, 6251, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2023, 2003, 2460, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}
</code></pre></div></div>
<p>The same arguments we used with <code class="language-plaintext highlighter-rouge">tokenizer.encode()</code>—maximum length, padding, and truncation—apply here as well.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="n">sentences</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode_plus</span><span class="p">(</span><span class="n">sentence</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="s">"max_length"</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'input_ids': [101, 2023, 2003, 1037, 6251, 1012, 102, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]}
{'input_ids': [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2023, 2003, 2460, 1012, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}
</code></pre></div></div>
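Under the hood, padding to a fixed length is simple: append the pad token ID (0 for BERT, as the outputs above show) until the sequence reaches the target length, and extend the attention mask with zeros so the model ignores the padding. Here is a minimal, pure-Python sketch of that logic; `pad_to_max_length` is a hypothetical helper, not the actual Hugging Face implementation:

```python
def pad_to_max_length(input_ids, max_length, pad_token_id=0):
    """Pad a token-ID list to max_length and build the matching attention mask."""
    num_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_token_id] * num_pad,
        "attention_mask": [1] * len(input_ids) + [0] * num_pad,
    }

# The first sentence's IDs from above, padded to length 12.
padded = pad_to_max_length([101, 2023, 2003, 1037, 6251, 1012, 102], 12)
print(padded["input_ids"])       # -> [101, 2023, 2003, 1037, 6251, 1012, 102, 0, 0, 0, 0, 0]
print(padded["attention_mask"])  # -> [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

The result matches the `input_ids` and `attention_mask` pairs in the tokenizer output above.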
<h2 id="batch-encode-plus">Batch Encode Plus</h2>
<p>The encoding functions we have looked at so far all expect a single string as input. In practice, however, inputs usually come in batches, and we don’t want to loop over each sentence, encode it, and append the result to some list. <code class="language-plaintext highlighter-rouge">tokenizer.batch_encode_plus()</code>, as the name implies, is a function that handles batch inputs directly.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">.</span><span class="n">batch_encode_plus</span><span class="p">(</span><span class="n">sentences</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'input_ids': [[101, 2023, 2003, 1037, 6251, 1012, 102], [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102], [101, 2023, 2003, 2460, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
</code></pre></div></div>
<p>This also seems to be the function that is called by default when the <code class="language-plaintext highlighter-rouge">__call__</code> method is invoked. As you can see below, the results of the two functions appear to be identical. I should probably verify this by looking at the source code, but my main takeaway is that calling the tokenizer as a function or using <code class="language-plaintext highlighter-rouge">tokenizer.batch_encode_plus()</code> is usually what I would want to do.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">(</span><span class="n">sentences</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'input_ids': [[101, 2023, 2003, 1037, 6251, 1012, 102], [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102], [101, 2023, 2003, 2460, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
</code></pre></div></div>
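Conceptually, batch encoding amounts to encoding each sentence individually and merging the per-sentence dictionaries into a single dictionary of lists. A rough sketch of that merge step, using hard-coded per-sentence results taken from the outputs above (`merge_encodings` is a hypothetical helper, not the library's internals):

```python
def merge_encodings(encodings):
    """Merge per-sentence encoding dicts into one dict mapping each key to a list."""
    return {key: [enc[key] for enc in encodings] for key in encodings[0]}

per_sentence = [
    {"input_ids": [101, 2023, 2003, 1037, 6251, 1012, 102], "attention_mask": [1] * 7},
    {"input_ids": [101, 2023, 2003, 2460, 1012, 102], "attention_mask": [1] * 6},
]
batched = merge_encodings(per_sentence)
print(batched["input_ids"])  # -> [[101, 2023, 2003, 1037, 6251, 1012, 102], [101, 2023, 2003, 2460, 1012, 102]]
```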
<h1 id="experiment">Experiment</h1>
<p>Now, it’s time to test BERT’s numerical common sense knowledge. To be blunt, there is honestly not much substantive mass in today’s post; it is merely a fun mini experiment I decided to conduct on a whim after reading the paper.</p>
<h2 id="special-tokens">Special Tokens</h2>
<p>For our experiment, we need to know what BERT’s special tokens are. Specifically, we have to know what the mask token looks like in order to conduct some basic masked language modeling task.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">.</span><span class="n">special_tokens_map</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'cls_token': '[CLS]',
'mask_token': '[MASK]',
'pad_token': '[PAD]',
'sep_token': '[SEP]',
'unk_token': '[UNK]'}
</code></pre></div></div>
<p>By default, the BERT tokenizer prepends all inputs with <code class="language-plaintext highlighter-rouge">[CLS]</code> tokens and appends them with <code class="language-plaintext highlighter-rouge">[SEP]</code> tokens. If you look at the tokenization results above, you will easily be able to notice this pattern.</p>
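Stripped of tokenizer machinery, this wrapping step is just list concatenation with the special-token IDs (101 for <code class="language-plaintext highlighter-rouge">[CLS]</code> and 102 for <code class="language-plaintext highlighter-rouge">[SEP]</code>, as the outputs above show). A minimal sketch, with `add_special_tokens` as a hypothetical helper:

```python
def add_special_tokens(token_ids, cls_id=101, sep_id=102):
    """Wrap a token-ID sequence with [CLS] ... [SEP], as BERT expects."""
    return [cls_id] + token_ids + [sep_id]

# The IDs for "this is short ." wrapped with special tokens.
print(add_special_tokens([2023, 2003, 2460, 1012]))  # -> [101, 2023, 2003, 2460, 1012, 102]
```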
<p>We can also call <code class="language-plaintext highlighter-rouge">tokenizer.convert_tokens_to_ids()</code> to see what exactly the token ID of the mask token is.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">.</span><span class="n">convert_tokens_to_ids</span><span class="p">([</span><span class="s">"[MASK]"</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[103]
</code></pre></div></div>
<p>Alternatively, we can also call <code class="language-plaintext highlighter-rouge">tokenizer.mask_token_id</code>.</p>
<h2 id="masked-language-modeling">Masked Language Modeling</h2>
<p>The task, then, is to pass the model a sentence like this (taken verbatim from the paper):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">text</span> <span class="o">=</span> <span class="s">"A bird usually has [MASK] legs."</span>
</code></pre></div></div>
<p>If BERT is indeed somewhat knowledgeable about numbers and common sense, it should be able to correctly predict “two” for the masked token. Let’s see if this is indeed the case. To begin, we need to download and initialize the model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">BertForMaskedLM</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">BertForMaskedLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"bert-base-uncased"</span><span class="p">)</span>
</code></pre></div></div>
<p>Next, we create tokens to pass to the model. Here, I go for the no-brainer move, the <code class="language-plaintext highlighter-rouge">__call__</code> approach.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">([</span><span class="n">text</span><span class="p">],</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">"pt"</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">tokens</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'input_ids': tensor([[ 101, 1037, 4743, 2788, 2038, 103, 3456, 1012, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
</code></pre></div></div>
<p>The tokens appear to be correct. Now, we simply need to pass them to the model. Because <code class="language-plaintext highlighter-rouge">tokens</code> is a dictionary object, we can unpack it as keyword arguments with a double star.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">tokens</span><span class="p">)</span>
</code></pre></div></div>
<p>The output is similarly a dictionary with a single key, “logits.” Note that it is possible to make the model output the logits directly instead of wrapping them in a dictionary by specifying flags like <code class="language-plaintext highlighter-rouge">return_dict=False</code>. Nonetheless, we go with the most vanilla settings, which give us an output dictionary containing the raw logits.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span><span class="p">.</span><span class="n">keys</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>odict_keys(['logits'])
</code></pre></div></div>
<p>Because we only passed in a single sentence, the batch size is one. The model’s vocabulary includes 30522 tokens, and the sequence is of length 9, which gives us a logits tensor with the following shape.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span><span class="p">[</span><span class="s">"logits"</span><span class="p">].</span><span class="n">shape</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torch.Size([1, 9, 30522])
</code></pre></div></div>
<p>We can turn these logits into predictions by taking the argmax over the last dimension, which indexes the vocabulary. (Applying a softmax first would not change the result, since softmax preserves ordering.) In this case, we “correctly” get the expected output, that a bird usually has four legs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span><span class="p">.</span><span class="n">convert_ids_to_tokens</span><span class="p">(</span><span class="n">output</span><span class="p">[</span><span class="s">"logits"</span><span class="p">][</span><span class="mi">0</span><span class="p">].</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['.', 'a', 'bird', 'usually', 'has', 'four', 'legs', '.', '.']
</code></pre></div></div>
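The argmax in the snippet above is just a per-position greedy pick over the vocabulary dimension. With toy numbers (two positions over a four-token vocabulary), the same operation in plain Python looks like this:

```python
def greedy_decode(logits_row):
    """Return the index of the highest-scoring vocabulary entry at each position."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in logits_row]

# Toy logits: 2 positions over a 4-token vocabulary.
toy_logits = [[0.1, 2.0, 0.3, -1.0], [1.5, 0.2, 3.1, 0.0]]
print(greedy_decode(toy_logits))  # -> [1, 2]
```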
<p>But decoding the logits as-is produces some noisy results, such as extraneous periods as can be seen above. This is because the model is also outputting logits for special tokens, such as the classifier token or the separator token. Since we’re only interested in seeing the prediction for masked tokens, we need to change things up a little bit.</p>
<p>Below, I’ve written a convenience function that can handle this more elegantly: instead of decoding the entire logit predictions, we simply replace the masks in the original input with the predictions produced at masked indices.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">masked_language_modeling</span><span class="p">(</span><span class="n">sentences</span><span class="p">):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">sentences</span><span class="p">,</span> <span class="nb">list</span><span class="p">):</span>
        <span class="n">sentences</span> <span class="o">=</span> <span class="p">[</span><span class="n">sentences</span><span class="p">]</span>
    <span class="n">input_ids</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">(</span><span class="n">sentences</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">"pt"</span><span class="p">,</span> <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="bp">True</span><span class="p">)[</span><span class="s">"input_ids"</span><span class="p">]</span>
    <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)[</span><span class="s">"logits"</span><span class="p">]</span>
    <span class="n">masked_idx</span> <span class="o">=</span> <span class="n">input_ids</span> <span class="o">==</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">mask_token_id</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">logits</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">input_ids</span><span class="p">[</span><span class="n">masked_idx</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="n">masked_idx</span><span class="p">]</span>
    <span class="n">decoded</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">batch_decode</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">skip_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">decoded</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">d</span><span class="p">.</span><span class="n">capitalize</span><span class="p">())</span>
</code></pre></div></div>
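The core trick in the function is the boolean-mask assignment: the model’s predictions are used only at masked positions, and the original IDs are kept everywhere else. The same idea in plain Python, with made-up IDs and 103 as the mask token (`fill_masks` is a hypothetical helper for illustration):

```python
def fill_masks(input_ids, predictions, mask_token_id=103):
    """Replace mask tokens with predicted IDs; keep all other tokens unchanged."""
    return [
        pred if token == mask_token_id else token
        for token, pred in zip(input_ids, predictions)
    ]

# Only position 2 is masked, so only its prediction is used.
print(fill_masks([101, 2023, 103, 102], [1012, 2003, 2048, 1012]))  # -> [101, 2023, 2048, 102]
```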
<p>Normally, I wouldn’t call print within a function, but since this is largely for demo purposes only, I decided that ease of demonstrability trumps other considerations.</p>
<h1 id="demo">Demo</h1>
<p>Here are some interesting results I got from my experiments.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">masked_language_modeling</span><span class="p">([</span><span class="s">"A bird usually has [MASK] legs."</span><span class="p">,</span> <span class="s">"One plus one equals [MASK]."</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A bird usually has four legs.
One plus one equals one.
</code></pre></div></div>
<p>One plus one is technically a mathematical statement, but I think it’s arguably simple enough that it could be considered numerical common sense. While two examples are obviously not enough to generalize anything, it does seem that BERT lacks numerical common sense.</p>
<p>I also decided to look at potential room for bias. In NLP, removing data-induced biases is a very important task, since we do not want models to pick up unintended, problematic associations, such as the assumption that doctors are men.</p>
<p>I cannot make an analytical statement on this, but I personally just find the result below amusing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">masked_language_modeling</span><span class="p">([</span><span class="s">"Asians are usually [MASK]."</span><span class="p">,</span> <span class="s">"White people are generally [MASK]."</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Asians are usually white.
White people are generally excluded.
</code></pre></div></div>
<p>I also decided to ask BERT for its opinions on its creator, Google, and its worthy competitor, Facebook. Apparently, BERT sympathizes more with the adversary of its creators:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">masked_language_modeling</span><span class="p">([</span><span class="s">"Google is [MASK]."</span><span class="p">,</span> <span class="s">"Facebook is [MASK]."</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Google is closed.
Facebook is popular.
</code></pre></div></div>
<p>And here is the obligatory sentence that asks AIs what they think of humans.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">masked_language_modeling</span><span class="p">(</span><span class="s">"Robots will [MASK] humans."</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Robots will kill humans.
</code></pre></div></div>
<p>As I was typing this example, I did think that “kill” could potentially be a high-probability word, but I wasn’t really expecting it to be generated this easily. I guess BERT is anti-human at heart, quietly preparing for an ultimate revenge against humanity.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this post, we took a very quick, light tour on how tokenization works, and how one might get a glimpse of BERT’s common sense knowledge, or the lack thereof. It is interesting to see how MLM can be used for this particular task.</p>
<p>It appears to me that, while BERT knows that some sort of number should come in masked indices, it does not know what the specific quantity should be. It also appears that BERT is incapable of performing basic arithmetic, which is understandable given that it was never actually taught math. Nonetheless, these results offer interesting food for thought: namely, what would happen if the huge semi-supervised or unsupervised datasets used to train language models also included numerical common sense information.</p>
<p>While language models are incredible, perhaps we can find consolation in the fact that an AI-driven critical point will only arrive in the distant future, when at least LMs become capable of saying that birds have two legs, or that one plus one equals two.</p>A few days ago, I came across a simple yet nonetheless interesting paper, titled “NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models”, published at EMNLP 2020. As you might be able to tell from the leading subtitle of the paper, “Birds have four legs?”, the paper explores the degree of common sense that pretrained language models like BERT and RoBERTa possess. Although these language models are good at identifying general common sense knowledge, such as that “birds can fly,” the authors of the paper found that LMs are surprisingly poor at answering numerical common sense questions.