I recently completed another summer internship at Meta (formerly Facebook). I was surprised to learn that one of the intern friends I met was an avid reader of my blog. Encouraged by the positive feedback from my intern friends, I decided to write another post before the end of summer. This post is dedicated to the mandem: Yassir, Amal, Ryan, Elvis, and Sam.
Today, we will take a look at LoRA: Low-Rank Adaptation of Large Language Models by Hu et al. Alongside bitsandbytes, LoRA has been a key ingredient in democratizing language models like Llama^{1}, making them available for both inference and fine-tuning on consumer-grade GPUs. In particular, LoRA is a prerequisite to understand QLoRA, which combines int4 quantization with low-rank adaptation.
This post was heavily inspired by other great resources on LoRA.
Let’s get into it!
In this section, we will quickly recap some concepts in linear algebra, which will help us understand LoRA.
In linear algebra, the rank of a matrix is the dimension of its row and column space. In other words, it is the number of linearly independent row or column vectors of the matrix. Another handy little fact about rank: a square matrix is invertible if and only if it is full-rank, which is equivalent to having a nonzero determinant.
Without getting into too much detail, the rough proof sketch for these propositions involves using reduction operations to produce diagonal matrices, and using other elementary facts about invertibility and determinants.
For the purposes of understanding LoRA, it suffices to intuit rank as the amount of information encoded into a matrix from the perspective of decomposition. Concretely, consider $A$, a $4 \times 4$ matrix.
\[A = \begin{pmatrix} 1 & 3 & 1 & 4 \\ 2 & 7 & 3 & 9 \\ 1 & 5 & 3 & 1 \\ 1 & 2 & 0 & 8 \\ \end{pmatrix}\]In reduced echelon form, we obtain $B$:
\[B = \begin{pmatrix} 1 & 0 & -2 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ \end{pmatrix}\]Therefore, it is clear that $A$ is a rank 3 matrix. Then the claim is that $A$ can be decomposed into two matrices of size (4, 3) and (3, 4). Indeed, we have
\[\begin{pmatrix} 1 & 3 & 4 \\ 2 & 7 & 9 \\ 1 & 5 & 1 \\ 1 & 2 & 8 \\ \end{pmatrix} \begin{pmatrix} 1 & 0 & -2 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ \end{pmatrix} = \begin{pmatrix} 1 & 3 & 1 & 4 \\ 2 & 7 & 3 & 9 \\ 1 & 5 & 3 & 1 \\ 1 & 2 & 0 & 8 \\ \end{pmatrix}\]In other words, rank determines the structure of matrix decomposition. In this example, $A$ was relatively closer to being full-rank: it was a rank 3 matrix, and the maximal rank it could have was 4. However, we can also imagine decomposing matrices with smaller rank, e.g., an $n \times n$ matrix being decomposed into $(n, m)$ and $(m, n)$, where $m \ll n$. Indeed, this is the key behind LoRA.
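We can sanity-check this decomposition numerically. Below is a quick NumPy sketch (my own illustration, not part of the original derivation) that verifies the rank of $A$ and multiplies the two factors back together:

```python
import numpy as np

# The 4 x 4 matrix A from above.
A = np.array([
    [1, 3, 1, 4],
    [2, 7, 3, 9],
    [1, 5, 3, 1],
    [1, 2, 0, 8],
])

# A has rank 3, so it factors into a (4, 3) and a (3, 4) matrix.
assert np.linalg.matrix_rank(A) == 3

C = np.array([
    [1, 3, 4],
    [2, 7, 9],
    [1, 5, 1],
    [1, 2, 8],
])
R = np.array([
    [1, 0, -2, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
])

# The product of the factors reconstructs A exactly.
assert (C @ R == A).all()
```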
Large language models (LLMs) have become very large in recent years. Even the smaller standard LLMs, such as Llama 7B, have billions of parameters by default. Finetuning such large models for individual tasks is prohibitively expensive to say the least.
LoRA is a parameter-efficient training method. In short, instead of training all of the model's parameters, LoRA trains a small number of extra parameters whose outputs are fused with the activations of the original, frozen model. Let's see exactly what this means.
LoRA starts from the simple hypothesis that
the change in weights during model adaptation … has a low “intrinsic rank”[.]
In other words, the authors of LoRA hypothesize that the delta shift in model weights during training is actually a low rank matrix. If this is true, we should be able to emulate the effects of full finetuning by simply training two small low rank matrices. This is precisely what LoRA does.
LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen[.]
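To build some intuition for this hypothesis (my own illustration, not from the paper): if a weight delta truly has rank $r$, then a factorization with just $r$ components represents it exactly, even though the full matrix has far more entries. A small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 64, 4

# Construct a weight delta that is exactly rank r.
delta = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

# A truncated SVD with r components recovers it up to float error,
# even though delta itself has n * n = 4096 entries.
U, S, Vt = np.linalg.svd(delta)
approx = (U[:, :r] * S[:r]) @ Vt[:r]
assert np.allclose(delta, approx)
```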
In the diagram above, $W_\text{nk}$ is the full pretrained weight matrix. Instead of finetuning $W_\text{nk}$ in its entirety, LoRA adds two auxiliary matrices, $A$ and $B$, of rank $r$. Let $W_\text{nk} \in \mathbb{R}^{n \times k}$. Then $A \in \mathbb{R}^{n \times r}, B \in \mathbb{R}^{r \times k}$. If $r$ is small enough, training only $A$ and $B$ will be much cheaper than training $W_\text{nk}$, i.e.,
\[nk > r (n + k).\]It is easy to see that when $r = \min(n, k)$, we recover the expressiveness of the full finetuning setup. Therefore, LoRA can be seen as a generalization of full finetuning.
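To get a feel for the savings, consider hypothetical dimensions loosely modeled on a large dense layer (the numbers below are illustrative, not taken from the paper):

```python
# Hypothetical dense layer of size n x k with LoRA rank r.
n, k, r = 4096, 4096, 8

full = n * k        # parameters updated by full finetuning
lora = r * (n + k)  # parameters in A (n x r) and B (r x k)

print(full, lora, full // lora)  # 16777216 65536 256
```

With these numbers, LoRA trains 256 times fewer parameters than full finetuning.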
The authors limit the study of LoRA to the Transformer architecture, testing it on a wide range of encoder and decoder models such as RoBERTa, DeBERTa, and GPT-3. They apply LoRA to the weight matrices of the self-attention module.
The forward pass of a LoRA model can be written as
\[h = W_\text{nk} X + AB X\]This is obviously different from the original unmodified forward pass, which would be
\[h = W_\text{nk} X.\]We could maintain separate modules for the original weight matrix $W_\text{nk}$ and $A, B$. However, after training is complete, we can speed up the forward pass by fusing the modules to reduce FLOPs.
\[\begin{align*} W'_\text{nk} &= W_\text{nk} + AB \\ h &= W'_\text{nk} X. \end{align*}\]In other words, instead of maintaining the two-branch structure, we simply merge the LoRA delta matrix into the original frozen pretrained weight.
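The equivalence of the two-branch and fused forward passes is easy to verify numerically. Here is a quick sketch with NumPy arrays standing in for the weight matrices (shapes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k, r = 16, 8, 2

W = rng.standard_normal((n, k))   # frozen pretrained weight
A = rng.standard_normal((n, r))   # LoRA factors
B = rng.standard_normal((r, k))
X = rng.standard_normal((k, 5))   # a batch of activations

# Two-branch forward pass: h = W X + A B X
h_two_branch = W @ X + A @ (B @ X)

# Fused forward pass: W' = W + A B, then h = W' X
h_fused = (W + A @ B) @ X

assert np.allclose(h_two_branch, h_fused)
```

The fused form saves the extra matrix multiplications of the LoRA branch at inference time.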
Now that we have an idea of how LoRA works, let’s try a simple implementation with PyTorch Lightning. This implementation was heavily inspired by sunildkumar’s lora_from_scratch.
%%capture
!pip install lightning
We import the necessary dependencies and set the seed for reproducibility.
import torch
from torch import nn
from torch import optim
from torch.nn import functional as F
from torch.utils.data import random_split, DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
from torchmetrics import Accuracy
import lightning.pytorch as pl
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'
pl.seed_everything(42)
INFO: Global seed set to 42
We will be using the MNIST toy dataset. PyTorch Lightning provides a convenient data module API, where we can pack all logic related to the data into a single class.
class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = ".", batch_size: int = 1024):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()

    def setup(self, stage: str):
        if stage == "fit":
            mnist_full = MNIST(
                self.data_dir, train=True, download=True, transform=self.transform,
            )
            self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])
        elif stage == "test":
            self.mnist_test = MNIST(
                self.data_dir, train=False, download=True, transform=self.transform,
            )
        elif stage == "predict":
            # Set the attribute that `predict_dataloader` expects.
            self.mnist_predict = MNIST(
                self.data_dir, train=False, download=True, transform=self.transform,
            )

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=self.batch_size)

    def predict_dataloader(self):
        return DataLoader(self.mnist_predict, batch_size=self.batch_size)
In this dummy experiment, we will train the model on MNIST for 5 epochs. The baseline will then continue training on the dataset for 5 more epochs to emulate the effects of full "finetuning." The LoRA model will be initialized from the 5-epoch checkpoint, then trained for another 5 epochs.
We will be training a simple dense model.
class MNISTModel(pl.LightningModule):
    def __init__(self, hidden_size: int = 64, lr=2e-4):
        super().__init__()
        self.lr = lr
        num_classes = 10
        self.l1 = nn.Linear(28 * 28, hidden_size)
        self.l2 = nn.Linear(hidden_size, hidden_size)
        self.l3 = nn.Linear(hidden_size, num_classes)
        self.val_accuracy = Accuracy(
            task="multiclass", num_classes=num_classes,
        )
        self.test_accuracy = Accuracy(
            task="multiclass", num_classes=num_classes,
        )

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
        # `F.dropout` applies dropout unconditionally by default;
        # pass `training=self.training` so it is disabled during eval.
        x = F.dropout(F.relu(self.l1(x)), training=self.training)
        x = F.dropout(F.relu(self.l2(x)), training=self.training)
        x = self.l3(x)
        return x

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.lr)

    def base_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        return x, y, logits, loss

    def training_step(self, batch, batch_idx):
        _, _, logits, loss = self.base_step(batch, batch_idx)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y, logits, loss = self.base_step(batch, batch_idx)
        preds = torch.argmax(logits, dim=1)
        self.val_accuracy.update(preds, y)
        self.log("val_loss", loss)
        self.log("val_acc", self.val_accuracy)

    def test_step(self, batch, batch_idx):
        x, y, logits, loss = self.base_step(batch, batch_idx)
        preds = torch.argmax(logits, dim=1)
        self.test_accuracy.update(preds, y)
        self.log("test_loss", loss)
        self.log("test_acc", self.test_accuracy)
model = MNISTModel()
datamodule = MNISTDataModule()
pretrainer = pl.Trainer(
    accelerator="auto",
    devices=1,
    max_epochs=5,
    logger=pl.loggers.CSVLogger("logs"),
)
pretrainer.fit(model, datamodule=datamodule)
pretrainer.test(model, datamodule=datamodule)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:
| Name | Type | Params
-----------------------------------------------------
0 | l1 | Linear | 50.2 K
1 | l2 | Linear | 4.2 K
2 | l3 | Linear | 650
3 | val_accuracy | MulticlassAccuracy | 0
4 | test_accuracy | MulticlassAccuracy | 0
-----------------------------------------------------
55.1 K Trainable params
0 Non-trainable params
55.1 K Total params
0.220 Total estimated model params size (MB)
INFO: `Trainer.fit` stopped: `max_epochs=5` reached.
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ test_acc │ 0.7035999894142151 │
│ test_loss │ 0.9380418062210083 │
└───────────────────────────┴───────────────────────────┘
[{'test_loss': 0.9380418062210083, 'test_acc': 0.7035999894142151}]
Let’s read the metrics.
def read_metrics(path):
    metrics = pd.read_csv(path)
    del metrics["step"]
    metrics.set_index("epoch", inplace=True)
    display(metrics.dropna(axis=1, how="all").head())
    sns.relplot(data=metrics, kind="line")
    plt.show()
read_metrics(f"{pretrainer.logger.log_dir}/metrics.csv")
| epoch | train_loss | val_loss | val_acc | test_loss | test_acc |
|---|---|---|---|---|---|
| 0 | 2.113543 | NaN | NaN | NaN | NaN |
| 0 | NaN | 2.083865 | 0.3740 | NaN | NaN |
| 1 | 1.708588 | NaN | NaN | NaN | NaN |
| 1 | NaN | 1.654867 | 0.5168 | NaN | NaN |
| 2 | 1.350459 | NaN | NaN | NaN | NaN |
We save the model using both the PyTorch Lightning trainer API and the default PyTorch API. We will use the former to continue training the model to simulate full finetuning, and the latter to initialize the LoRA model from the trained checkpoint.
pretrainer.save_checkpoint("model.ckpt")
torch.save(model.state_dict(), 'model.pt')
Let’s continue training the model for 5 more epochs to see how it improves. This is the full finetuning baseline.
model = MNISTModel.load_from_checkpoint("model.ckpt")
trainer = pl.Trainer(
    accelerator="auto",
    devices=1,
    max_epochs=5,
    logger=pl.loggers.CSVLogger("logs"),
)
trainer.fit(model, datamodule=datamodule)
read_metrics(f"{trainer.logger.log_dir}/metrics.csv")
trainer.test(model, datamodule=datamodule)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:
| Name | Type | Params
-----------------------------------------------------
0 | l1 | Linear | 50.2 K
1 | l2 | Linear | 4.2 K
2 | l3 | Linear | 650
3 | val_accuracy | MulticlassAccuracy | 0
4 | test_accuracy | MulticlassAccuracy | 0
-----------------------------------------------------
55.1 K Trainable params
0 Non-trainable params
55.1 K Total params
0.220 Total estimated model params size (MB)
INFO: `Trainer.fit` stopped: `max_epochs=5` reached.
| epoch | train_loss | val_loss | val_acc |
|---|---|---|---|
| 0 | 0.920625 | NaN | NaN |
| 0 | NaN | 0.848711 | 0.7360 |
| 1 | 0.804315 | NaN | NaN |
| 1 | NaN | 0.757564 | 0.7658 |
| 2 | 0.751054 | NaN | NaN |
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ test_acc │ 0.809499979019165 │
│ test_loss │ 0.6315763592720032 │
└───────────────────────────┴───────────────────────────┘
[{'test_loss': 0.6315763592720032, 'test_acc': 0.809499979019165}]
We see that the test accuracy improved from the previous 0.7 to around 0.81, as expected.
Next, we create a new LoRA model. To build the LoRA model, we will create a simple `LoRALinear` class that abstracts away the initialization and the forward pass through the two low-rank matrices.
import math
class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.A = nn.Parameter(torch.empty(in_features, rank))
        self.B = nn.Parameter(torch.empty(rank, out_features))
        # Initialize A randomly and B to zero so that AB = 0 at the start,
        # i.e., training begins from the pretrained model's behavior.
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        nn.init.zeros_(self.B)

    def forward(self, x):
        return x @ (self.A @ self.B)
The LoRA model inherits from `MNISTModel`. We perform two steps:

1. Freeze all parameters of the pretrained `MNISTModel`;
2. Add a `LoRALinear` module alongside each dense layer with the specified `rank`.

During the forward pass, we use an `alpha` parameter to determine how much mixing we want to perform between the activations from LoRA and the frozen pretrained model.
class MNISTLoRAModel(MNISTModel):
    def __init__(self, rank: int, alpha: float, hidden_size: int = 64):
        super().__init__(hidden_size)
        # Freeze all pretrained parameters; only the LoRA matrices train.
        for name, parameter in self.named_parameters():
            parameter.requires_grad = False
        self.rank = rank
        self.alpha = alpha
        self.l1_lora = LoRALinear(28 * 28, hidden_size, self.rank)
        self.l2_lora = LoRALinear(hidden_size, hidden_size, self.rank)
        self.l3_lora = LoRALinear(hidden_size, 10, self.rank)

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
        x = F.dropout(F.relu(self.l1(x) + self.alpha * self.l1_lora(x)), training=self.training)
        x = F.dropout(F.relu(self.l2(x) + self.alpha * self.l2_lora(x)), training=self.training)
        x = self.l3(x) + self.alpha * self.l3_lora(x)
        return x

    def configure_optimizers(self):
        optimizer = super().configure_optimizers()
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, "min", patience=10)
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "monitor": "val_loss",
                "frequency": 1,
            },
        }
Here, we set `rank` to 32 and `alpha` to 1. Let's try training the model for 5 additional epochs, just like the baseline. Note that with this LoRA configuration, we are training around 33K parameters, which is smaller than the full finetuning baseline of 55K parameters.
lora_model = MNISTLoRAModel(rank=32, alpha=1)
state_dict = torch.load("model.pt")
lora_model.load_state_dict(state_dict, strict=False)
datamodule = MNISTDataModule()
lora_trainer = pl.Trainer(
    accelerator="auto",
    devices=1,
    max_epochs=5,
    logger=pl.loggers.CSVLogger("logs"),
)
lora_trainer.fit(lora_model, datamodule=datamodule)
read_metrics(f"{lora_trainer.logger.log_dir}/metrics.csv")
lora_trainer.test(lora_model, datamodule=datamodule)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:
| Name | Type | Params
-----------------------------------------------------
0 | l1 | Linear | 50.2 K
1 | l2 | Linear | 4.2 K
2 | l3 | Linear | 650
3 | val_accuracy | MulticlassAccuracy | 0
4 | test_accuracy | MulticlassAccuracy | 0
5 | l1_lora | LoRALinear | 27.1 K
6 | l2_lora | LoRALinear | 4.1 K
7 | l3_lora | LoRALinear | 2.4 K
-----------------------------------------------------
33.6 K Trainable params
55.1 K Non-trainable params
88.7 K Total params
0.355 Total estimated model params size (MB)
INFO: `Trainer.fit` stopped: `max_epochs=5` reached.
| epoch | train_loss | val_loss | val_acc |
|---|---|---|---|
| 0 | 0.907201 | NaN | NaN |
| 0 | NaN | 0.928417 | 0.7052 |
| 1 | 0.904668 | NaN | NaN |
| 1 | NaN | 0.876187 | 0.7232 |
| 2 | 0.812540 | NaN | NaN |
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ test_acc │ 0.7870000004768372 │
│ test_loss │ 0.6841080188751221 │
└───────────────────────────┴───────────────────────────┘
[{'test_loss': 0.6841080188751221, 'test_acc': 0.7870000004768372}]
The test accuracy is around 0.79, which is just 0.02 points shy of the score achieved by the full finetuning baseline model.
One would expect LoRA to more closely match the performance of the baseline with larger rank. Let’s continue the experiment with different ranks to verify this hypothesis. Below, we repeat the experiment with ranks 1, 2, 4, 8, 16, 32, and 64. Logs from Lightning are omitted.
def run_lora(rank):
    lora_model = MNISTLoRAModel(rank=rank, alpha=1)
    state_dict = torch.load("model.pt")
    lora_model.load_state_dict(state_dict, strict=False)
    datamodule = MNISTDataModule()
    lora_trainer = pl.Trainer(
        accelerator="auto",
        devices=1,
        max_epochs=5,
    )
    lora_trainer.fit(lora_model, datamodule=datamodule)
    return lora_trainer.test(lora_model, datamodule=datamodule)[0]["test_acc"]
ranks = [1, 2, 4, 8, 16, 32, 64]
test_accs = [run_lora(rank) for rank in ranks]
Plotting the results, we see that LoRA indeed reaches the baseline when given the full rank of 64. In fact, at rank 64, the number of trainable parameters exceeds that of the baseline, since we essentially have two full-rank matrices instead of one, and LoRA accordingly outperforms the baseline. It is clear that as the rank increases, LoRA's performance more closely matches that of the baseline.
plt.plot(ranks, test_accs)
plt.axhline(y=0.809499979019165, color="r", linestyle="--")
plt.xlabel("Rank")
plt.ylabel("Accuracy")
plt.show()
In this post, we explored LoRA, a parameter-efficient finetuning methodology. The beauty of LoRA is in its simplicity: it is motivated by a simple heuristic, and it is relatively straightforward to implement in practice. LoRA has been applied to a variety of architectures, including LLMs and Stable Diffusion. This is in part because LoRA has primarily been battle-tested in self-attention modules, which are used in both LLMs and text-to-image models. Through this experiment, we also saw that rank is an important hyperparameter that effectively represents a tradeoff between model performance and computational cost: the higher the rank, the larger the number of trainable parameters.
LoRA was further improved and explored in follow-up papers such as QLoRA: Efficient Finetuning of Quantized LLMs by Dettmers et al. These developments have contributed substantially to the democratization of LLMs: people can now finetune LMs on consumer-grade GPUs from the comfort of their homes. This year, we have also seen a proliferation of Llama variants, too many to name, which has invigorated the open source community to reproduce, match, and sometimes even outperform closed models like GPT-3.5 or GPT-4. This is an exciting time, and I look forward to the breakthroughs to come.
I’m still not sure about the “right” way of capitalizing Llama. In the original paper, the model was written as “LLaMA.” However, in the most recent paper, the same authors opted for a simplified convention, “Llama.” I’m going with the second version since it is simpler and more recent. ↩
Update: The code was modified with further optimizations. In particular, instead of checking the trie per every DFS call, we update the trie pointer along the DFS call so that the trie does not have to be queried repeatedly.
Recently, I started playing Game Pidgeon games with my girlfriend. We often play Word Hunt, where the objective is to find as many words as possible in a grid of English letters within 30 seconds.
Being a non-native English speaker, I seldom score a win against my girlfriend; she often claims victory by significant margins. In a desperate attempt to level the playing field, and also inspired by a YouTube video on Word Hunt, I decided to resort to computers and algorithms.
The goal of this project is to come up with as many valid word combinations as possible given a grid of letters. Since the game ascribes higher scores to longer sequences, the longer the words, the better. Most importantly, we need to find these solutions within 30 seconds.
A naïve brute-force approach would be to traverse the grid to recover all possible sequences of letters, then check if these letters are in a source-of-truth list of vocabulary. Concretely, we can use any graph traversal algorithm like DFS to explore the grid and use a Python set for all English words to achieve amortized $O(1)$ lookup. Unfortunately, after a few iterations, I realized that this brute force approach is too inefficient given the 30 second time crunch.
One glaring inefficiency with the above approach is that we end up wastefully exploring infelicitous paths, i.e., paths which we already know will provide no solution. For instance, if we know ahead of time that there exists no word that starts with the prefix "xyz", then there is no point in exploring "xyza" or "xyzb." Instead, we can terminate the search and move on to paths where there is hope.
Unfortunately, the built-in Python set does not provide prefix lookup. Instead, a more suitable data structure is a trie, also known as a prefix tree. A trie not only gives us speedy lookup, but also allows us to efficiently query words that start with a given prefix. If there is no word that starts with the prefix, we exit the search sequence, which effectively amounts to DFS backtracking with pruning.
Python does not provide a built-in trie implementation. Although third-party packages exist, I decided to implement my own.
class Trie:
    def __init__(self) -> None:
        # Nested dictionaries store words as sequences of characters;
        # the delimiter marks the end of a complete word.
        self.root = {}
        self.delimiter = "*"

    def contains(self, word: str) -> bool:
        pointer = self.root
        for char in word + self.delimiter:
            if char not in pointer:
                return False
            pointer = pointer[char]
        return True

    def insert(self, word: str) -> None:
        if self.contains(word):
            return
        pointer = self.root
        word += self.delimiter
        for char in word:
            if char not in pointer:
                pointer[char] = {}
            pointer = pointer[char]
Internally, this trie implementation uses a nested dictionary to store words as a sequence of letters. We use an asterisk to mark the end of a word. For instance, adding the word “cat” to an empty trie will yield the following result:
>>> from trie import Trie
>>> t = Trie()
>>> t.insert("cat")
>>> t.root
{'c': {'a': {'t': {'*': {}}}}}
Once we insert “car”, the “ca” prefix will be preserved, and we will see an additional “r” node.
>>> t.insert("car")
>>> t.root
{'c': {'a': {'t': {'*': {}}, 'r': {'*': {}}}}}
Now that we have a trie, we can store the list of English words in this data structure. Quite simply, we read the text file and store its content in the trie.
def get_dictionary() -> Trie:
    dictionary = Trie()
    with open("dictionary.txt") as f:
        for word in f:
            word = word.strip()
            dictionary.insert(word)
    return dictionary
Now that the trie dictionary is ready, the next step is to traverse the board and retrieve all valid solutions. I took inspiration from DFS backtracking templates used to solve common problems, such as sudoku. For each cell in the game grid, we want to check for valid words that start with that cell. The `solve(grid)` function accepts a grid and calls the `traverse(...)` function to check for words starting at each index.
from typing import Any, Dict, List, Tuple

def solve(grid: List[List[str]]) -> Dict[str, List[Tuple[int, int]]]:
    solutions = {}
    dictionary = get_dictionary()
    # BOARD_SIZE == 4
    for i in range(BOARD_SIZE):
        for j in range(BOARD_SIZE):
            # Only start a search from cells whose letter begins some word.
            if grid[i][j] in dictionary.root:
                traverse(grid, i, j, "", [], solutions, dictionary.root)
    return solutions
Although the function is named `solve(...)`, the actual heavy lifting is performed by the `traverse(...)` function, which recursively calls itself to perform DFS. Specifically, the `traverse(...)` function populates the `solutions` dictionary, which will contain valid words as keys and index sequences as values.
from collections.abc import Generator

def get_neighbors(i: int, j: int) -> Generator[Tuple[int, int], None, None]:
    for delta_i in range(-1, 2):
        for delta_j in range(-1, 2):
            if delta_i == delta_j == 0:
                continue
            next_i = i + delta_i
            next_j = j + delta_j
            if 0 <= next_i < BOARD_SIZE and 0 <= next_j < BOARD_SIZE:
                yield (next_i, next_j)

def traverse(
    grid: List[List[str]],
    i: int,
    j: int,
    word: str,
    order: List[Tuple[int, int]],
    solutions: Dict[str, List[Tuple[int, int]]],
    pointer: Dict[str, Any],
) -> None:
    char = grid[i][j]
    word += char
    order.append((i, j))
    prev = pointer
    pointer = pointer[char]
    if "*" in pointer:
        solutions[word] = order
        # Remove the end-of-word marker so the same word is not found twice.
        del pointer["*"]
        if not pointer:
            # Prune the trie branch that has been fully consumed.
            del prev[char]
            return
    # Mark the current cell as visited so it is not reused in this path.
    grid[i][j] = None
    for next_i, next_j in get_neighbors(i, j):
        if (
            grid[next_i][next_j] is not None
            and grid[next_i][next_j] in pointer
        ):
            traverse(grid, next_i, next_j, word, order.copy(), solutions, pointer)
    # Unmark the cell on the way out (backtracking).
    grid[i][j] = char
To prevent the algorithm from visiting cells it has previously visited (it's illegal to reuse a letter we've already used in the current sequence), we mark the visited cell as `None` and recursively call `traverse(...)` on the neighboring cells obtained via `get_neighbors(i, j)`. Once all paths have been consumed, we restore the cell to its original value. This marking and unmarking is at the heart of backtracking. Notice that the implicit base case for this function is when no unvisited neighbors remain.
Also worthy of note is the use of the `dictionary` trie. The `return` in the middle of the function is where pruning occurs: if no word starts with `word` as its prefix, there is no need to venture further down this path. Moreover, if `word` itself is in the vocabulary, we add it to `solutions`. Note that multiple paths may exist for the same word, but since we do not care which path produced it, there is no need to record all of them.
Now that we have all the core algorithms ready, all we need is a surface-level API that will allow the user to interact with these functions. Although it would be nice to have a GUI component, for the sake of simplicity I decided to make this a Python script. I also decided that the easiest way for a user to input the grid is in raster scan order, which is a fancy way of saying left to right, top to bottom. Therefore, the 2D grid is flattened to a line of 16 characters. Internally, we still want to parse the board as a grid: hence the `make_grid(board)` function, where `board` is the line of 16 characters entered by the user.
def make_grid(board: str) -> List[List[str]]:
    grid = [[] for _ in range(BOARD_SIZE)]
    for i, char in enumerate(board):
        grid[i // BOARD_SIZE].append(char)
    return grid
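For example, flattening in raster scan order means row `i` of the grid is simply the slice `board[i * BOARD_SIZE:(i + 1) * BOARD_SIZE]`. Using the sample board solved at the end of this post:

```python
BOARD_SIZE = 4
board = "oatrihpshtnrenei"  # the sample board from the results below

grid = [list(board[i * BOARD_SIZE:(i + 1) * BOARD_SIZE]) for i in range(BOARD_SIZE)]
assert grid[0] == ["o", "a", "t", "r"]
assert grid[3] == ["e", "n", "e", "i"]
```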
Now we are truly done! All we need is to (1) create the grid, (2) call the `solve(grid)` function, and (3) sort the answers by word length and print them to the user.
def main(board: str) -> None:
    grid = make_grid(board)
    solutions = solve(grid)
    # SHOW_TOP_K == 10
    for i, (word, order) in enumerate(
        sorted(solutions.items(), key=lambda x: len(x[0]), reverse=True)
    ):
        if i == SHOW_TOP_K:
            break
        print(word, order)

if __name__ == "__main__":
    board = input()
    assert len(board) == 16
    main(board)
Here is a sample top-10 result with the example board shown at the very beginning of this blog post.
jaketae:wordhunt $ python main.py
oatrihpshtnrenei
haptene [(1, 1), (0, 1), (1, 2), (2, 1), (3, 2), (3, 1), (3, 0)]
haptens [(1, 1), (0, 1), (1, 2), (2, 1), (3, 2), (2, 2), (1, 3)]
pterins [(1, 2), (2, 1), (3, 2), (2, 3), (3, 3), (2, 2), (1, 3)]
staithe [(1, 3), (0, 2), (0, 1), (1, 0), (2, 1), (2, 0), (3, 0)]
tenners [(2, 1), (3, 0), (3, 1), (2, 2), (3, 2), (2, 3), (1, 3)]
tapnet [(0, 2), (0, 1), (1, 2), (2, 2), (3, 2), (2, 1)]
hapten [(1, 1), (0, 1), (1, 2), (2, 1), (3, 2), (3, 1)]
pterin [(1, 2), (2, 1), (3, 2), (2, 3), (3, 3), (2, 2)]
staith [(1, 3), (0, 2), (0, 1), (1, 0), (2, 1), (2, 0)]
sprent [(1, 3), (1, 2), (2, 3), (3, 2), (3, 1), (2, 1)]
There is no way I would have come up with some of these words.
Today, we have seen one very practical application of algorithms: beating your girlfriend in Word Hunt. While the real test is to use this script in a game against her, preliminary results appear promising.
I hope you enjoyed reading this post. See you in the next one!
“Turn right at 130 Prospect Street.”
If you’ve used Google Maps before, you will recall the familiar, smooth voice of the navigation assistant. At first glance, the voice appears to be a simple replay of human recordings. However, you will quickly realize that it is impossible to record the names of millions of streets, not to mention the billions of driving contexts in which they can appear.
Modern software, such as Google Maps or voice assistants, is powered by neural text-to-speech (TTS), a powerful technology that synthesizes human-sounding voices using machine learning. In this blog post, we will dive deep into a NeurIPS 2020 paper, Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search, which demonstrates one of the many ways in which deep neural networks can be used for natural TTS.
Modern neural TTS pipelines are typically composed of two components: an acoustic feature generator and a vocoder. The acoustic feature generator accepts text as input and outputs an acoustic representation, such as a mel-spectrogram. In the second stage of the pipeline, a neural vocoder accepts the acoustic representation as input and outputs a raw waveform. More generally, let $f$ and $g$ denote an acoustic feature generator and a vocoder. Given an input text $T$, neural TTS can be understood as a composite function that outputs a waveform $W$ via
\[\begin{aligned} &X = f(T) \\ &W = g(X), \end{aligned}\]where $X$ denotes the intermediate acoustic representation. Schematically, $g \circ f$ fully defines the two-stage TTS process.
In this blog post, we will explore the first stage of the pipeline, the acoustic feature generator, exemplified by Glow-TTS. This post will proceed as follows. First, we discuss generative flow models, the first core component of Glow-TTS. Second, we discuss the monotonic alignment search (MAS) algorithm. Third, we discuss the Glow-TTS pipeline as a whole by putting flow and MAS into a single picture. Last but not least, we conclude by considering some of the limitations of Glow-TTS and refer to more recent literature that points to exciting directions in the field of neural TTS.
Text-to-speech is a conditional generative task, in which a model is given a sequence of tokens and produces a stream of utterances that matches the input text. Many neural TTS models employ generative models at their core, such as GANs, VAEs, transformers, or diffusion models, often borrowing from breakthroughs in other domains such as computer vision.
Glow-TTS is based on normalizing flow, which is a class of well-studied generative models. The theoretical basis of normalizing flows is the change of variables formula. Let $\mathbf{X}$ and $\mathbf{Y}$ denote random variables, each with PDF $f_\mathbf{X}$ and $f_\mathbf{Y}$, respectively. Let $h$ denote some invertible transformation such that $\mathbf{Y} = h(\mathbf{X})$. Typically, $f_\mathbf{X}$ is a simple, tractable prior distribution, such as a standard Gaussian, and we seek to apply $h$ to model some more complicated distribution given by $\mathbf{Y}$. Then, the change of variables formula states that
\[\begin{aligned} f_\mathbf{Y}(\mathbf{y}) &= f_\mathbf{X}(\mathbf{x}) \bigg| \text{det} \frac{d \mathbf{x}}{d \mathbf{y}} \bigg| \\ &= f_\mathbf{X}(h^{-1}(\mathbf{y})) \bigg| \det \frac{d \mathbf{x}}{d \mathbf{y}} \bigg| \\ &= f_\mathbf{X}(h^{-1}(\mathbf{y})) \bigg| \det \frac{d h^{-1}(\mathbf{y})}{d \mathbf{y}} \bigg|, \end{aligned}\]where $\det$ denotes the determinant and the derivative term represents the Jacobian.
A variation of this formula that allows for sampling from the base distribution can be written as follows:
\[\begin{aligned} f_\mathbf{Y}(\mathbf{y}) &= f_\mathbf{X}(\mathbf{x}) \bigg| \det \frac{d h^{-1} \mathbf{y}}{d \mathbf{y}} \bigg| \\ &= f_\mathbf{X}(\mathbf{x}) \bigg| \det \left( \frac{d h(\mathbf{x})}{d \mathbf{x}} \right)^{-1} \bigg| \\ &= f_\mathbf{X}(\mathbf{x}) \bigg| \det \frac{d h(\mathbf{x})}{d \mathbf{x}} \bigg|^{-1}. \end{aligned}\]The intuition behind the change of variables formula is that the probability mass of an interval in $\mathbf{X}$ should remain unchanged in the transformed $\mathbf{Y}$ space. The determinant of the Jacobian is a corrective term that accounts for the slope or the “sensitivity” of the transformation given by $h$.
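As a quick sanity check, we can verify the change of variables formula numerically with a toy invertible map. The transformation $h(x) = 2x + 1$ and the standard Gaussian prior below are illustrative choices of my own, not anything specific to Glow-TTS; the point is only that the transformed density matches the known analytic answer.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # density of the base distribution f_X
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def h(x):
    # a simple invertible transformation y = h(x)
    return 2 * x + 1

def h_inv(y):
    return (y - 1) / 2

def f_y(y):
    # change of variables: f_Y(y) = f_X(h^{-1}(y)) * |d h^{-1}(y) / dy|
    jacobian = 0.5  # derivative of h_inv is constant for this affine map
    return gaussian_pdf(h_inv(y)) * jacobian

# If X ~ N(0, 1), then Y = 2X + 1 ~ N(1, 4), so f_Y should match that density
print(f_y(1.0))
print(gaussian_pdf(1.0, mu=1.0, sigma=2.0))
```

The two printed values agree, confirming that the Jacobian correction exactly accounts for how the affine map stretches probability mass.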
Normalizing flow models can then be understood as a collection of nested invertible transformations, i.e., $h_n \circ h_{n - 1} \circ \cdots \circ h_1$, where $n$ denotes the number of flow layers in the model.^{1} To better understand what this composite transformation achieves, let’s apply a logarithm to the change of variables formula.
\[\log f_\mathbf{Y} (\mathbf{y}) = \log f_\mathbf{X} (\mathbf{x}) - \log \bigg| \det \frac{d h(\mathbf{x})}{d \mathbf{x}} \bigg|.\]To simplify notation, let $f_i$ denote the PDF of the $i$-th random variable in the composite transformation. Then, the nested transformation can be expressed as
\[\begin{aligned} \log f_n(\mathbf{x}_n) &= \log f_{n - 1}(\mathbf{x}_{n - 1}) - \log \bigg| \det \frac{d h(\mathbf{x}_{n - 1})}{d \mathbf{x}_{n - 1}} \bigg| \\ &= \log f_{n - 2}(\mathbf{x}_{n - 2}) - \log \bigg| \det \frac{d h(\mathbf{x}_{n - 1})}{d \mathbf{x}_{n - 1}} \bigg| - \log \bigg| \det \frac{d h(\mathbf{x}_{n - 2})}{d \mathbf{x}_{n - 2}} \bigg| \\ &= \cdots \\ &= \log f_0(\mathbf{x}_0) - \sum_{i = 0}^{n - 1} \log \bigg| \det \frac{d h(\mathbf{x}_i)}{d \mathbf{x}_i} \bigg|. \end{aligned}\]The immediate implication of this exposition is that a repeated application of the change of variables formula provides a direct way of computing the likelihood of an observation from some complex, real-data distribution $f_n$ given a prior $f_0$ and a set of invertible transformations $h_1, h_2, \dots, h_n$. This conclusion illustrates the power of normalizing flows: they offer a direct way of measuring the likelihood of complex, high-dimensional data, such as ImageNet images, starting from a simple distribution, such as an isotropic Gaussian. Since the likelihood can be obtained directly, flow models are trained to maximize the log-likelihood, which is exactly the expression derived above.
Although direct likelihood computation is a marked advantage of flow over other generative models, it comes with two clear limitations: every transformation must be invertible, and the determinant of its Jacobian must be cheap to compute.
A number of methods have been proposed to satisfy these constraints. One of the most popular methods is the affine coupling layer. Let $d$ denote the cardinality of the embedding space. Given an input $\mathbf{x}$ and an output $\mathbf{z}$, the affine coupling layer can schematically be written as
\[\begin{aligned} \mathbf{z}_{1:d/2} &= \mathbf{x}_{1:d/2} \\ \mathbf{z}_{d/2:d} &= \mathbf{x}_{d/2:d} \odot s_\theta(\mathbf{x}_{1:d/2}) + t_\theta(\mathbf{x}_{1:d/2}) \\ &= \mathbf{x}_{d/2:d} \odot s_\theta(\mathbf{z}_{1:d/2}) + t_\theta(\mathbf{z}_{1:d/2}). \end{aligned}\]In other words, the affine coupling layer implements a special transformation in which the top half of $\mathbf{z}$ is simply copied from $\mathbf{x}$ without modification. The bottom half undergoes an affine transformation, where the weights and biases are computed from the top half of $\mathbf{x}$. We can easily check that this transformation is indeed invertible:
\[\begin{aligned} \mathbf{x}_{1:d/2} &= \mathbf{z}_{1:d/2} \\ \mathbf{x}_{d/2:d} &= s_\theta^{-1}(\mathbf{z}_{1:d/2})(\mathbf{z}_{d/2:d} - t_\theta(\mathbf{z}_{1:d/2})), \end{aligned}\]where $s_\theta^{-1}$ denotes the element-wise reciprocal of the scale. Conveniently, the affine coupling layer is not only invertible, but it also enables efficient computation of the Jacobian determinant. This comes from the fact that the top half of the input is unchanged.
\[\begin{align} \mathbf{J} &= \begin{pmatrix} \frac{d \mathbf{z}_{1:d/2}}{d \mathbf{x}_{1:d/2}} & \frac{d \mathbf{z}_{1:d/2}}{d \mathbf{x}_{d/2:d}} \\ \frac{d \mathbf{z}_{d/2:d}}{d \mathbf{x}_{1:d/2}} & \frac{d \mathbf{z}_{d/2:d}}{d \mathbf{x}_{d/2:d}} \end{pmatrix} \\ &= \begin{pmatrix} \mathbb{I} & 0 \\ \frac{d \mathbf{z}_{d/2:d}}{d \mathbf{x}_{1:d/2}} & \text{diag}(s_\theta(\mathbf{x}_{1:d/2})) \end{pmatrix}. \end{align}\]Although $\mathbf{J}_{21}$ contains complicated terms, we do not have to consider them when computing $\det \mathbf{J}$: the determinant of a lower triangular block matrix is the product of the determinants of its diagonal blocks. Hence, $\det \mathbf{J} = \det \mathbf{J}_{11} \cdot \det \mathbf{J}_{22}$, which reduces to the product of the entries of $s_\theta(\mathbf{x}_{1:d/2})$ and is computationally tractable.
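To make the coupling layer concrete, here is a minimal NumPy sketch. The little `s_theta` and `t_theta` functions below are toy stand-ins of my own for the learned scale and translation networks; the point is only to check invertibility and the diagonal log-determinant formula numerically.

```python
import numpy as np

def s_theta(x_top):
    # toy stand-in for the learned scale network (kept strictly positive)
    return np.exp(np.sin(x_top))

def t_theta(x_top):
    # toy stand-in for the learned translation network
    return np.cos(x_top)

def coupling_forward(x):
    d = len(x)
    top, bottom = x[: d // 2], x[d // 2 :]
    z_top = top  # top half is copied unchanged
    z_bottom = bottom * s_theta(top) + t_theta(top)
    # log|det J| is just the sum of the log scales on the diagonal
    log_det = np.sum(np.log(np.abs(s_theta(top))))
    return np.concatenate([z_top, z_bottom]), log_det

def coupling_inverse(z):
    d = len(z)
    top, bottom = z[: d // 2], z[d // 2 :]
    x_bottom = (bottom - t_theta(top)) / s_theta(top)
    return np.concatenate([top, x_bottom])

x = np.array([0.3, -1.2, 0.7, 2.0])
z, log_det = coupling_forward(x)
x_rec = coupling_inverse(z)
print(np.allclose(x, x_rec))  # True: the layer inverts exactly
```

Note that the inverse never needs to invert `s_theta` or `t_theta` themselves, which is why they can be arbitrarily complicated networks.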
In practice, flow layers take a slightly more complicated form than the conceptual architecture detailed above. One easy and necessary modification is to shuffle the indices that are unchanged at each layer; otherwise, the top half of the input representation would never be altered even after having passed through $n$ layers. Another sensible modification would be to apply a more complicated transformation. For example, Real NVP proposes the following schema:
\[\begin{aligned} \mathbf{z}_{1:d/2} &= \mathbf{x}_{1:d/2} \\ h &= a \times \text{tanh}(s_\theta(\mathbf{x}_{1:d/2})) + b \\ \mathbf{z}_{d/2:d} &= \text{exp}(h) \times \mathbf{x}_{d/2:d} + g_\theta(\mathbf{x}_{1:d/2}). \end{aligned}\]To summarize:
Now that we have understood how flow works, let’s examine how flow is used in Glow-TTS.
Glow-TTS uses a flow-based decoder that transforms mel-spectrograms into a latent representation. As can be seen below in the architecture diagram, Glow-TTS accepts ground-truth mel-spectrograms (top of figure) and ground-truth text tokens (bottom of figure, shown as “a b c”) during training. Then, it runs the monotonic alignment search algorithm, which we will explore in the next section, to find an alignment between text and speech. The main takeaway is that the flow-based decoder transforms mel-spectrograms $\mathbf{y}$ to some latent vector $\mathbf{z}$, i.e., $f(\mathbf{y}) = \mathbf{z}$.
At a glance, it might not be immediately clear why we would use a flow model for the decoder instead of, for instance, a CNN or a transformer. However, the inference procedure makes clear why we need a flow-based decoder. To synthesize a mel-spectrogram during inference, we estimate latent representations from user input text, then pass them on to the decoder. Since the decoder is invertible, we can reverse the flow through the decoder to obtain a predicted mel-spectrogram, i.e., $f^{-1}(\hat{\mathbf{z}}) = \hat{\mathbf{y}}$, where $\hat{\cdot}$ denotes a prediction (as opposed to a ground truth). In Glow-TTS, invertibility offers an intuitive, elegant way of switching from training to inference.
The part that remains unexplained is how the model learns the latent representations and the relationship between text and acoustic features. This is explained by monotonic alignment search, which is the main topic of the next section.
Proposed by Kim et al., Monotonic Alignment Search (MAS) is an algorithm for efficiently identifying the most likely alignment between speech and text.
Text-to-speech alignment refers to the correspondence between text and spoken audio. Consider a simple input, “hello!”, accompanied by a human recording of that sentence. We could imagine that the first 0.5 seconds of the audio corresponds to the first letter “h,” followed by 0.7 seconds of “e,” and so on. The process of attributing a specific text token to some time interval within the audio can be described as alignment search.
Finding an accurate alignment between speech and text is an incredibly important task in TTS. If an alignment discovered by the model is inaccurate, it could mean that the model skips words or repeats certain syllables, both of which are failure modes we want to avoid. One of the most salient features of MAS is that it prevents such failures by preemptively enforcing specific yet sensible inductive biases in the alignment search algorithm.
Let’s begin by enumerating a list of common-sense intuitions we have about TTS alignments.
Many previous alignment search methods do not necessarily enforce these constraints. For instance, Tacotron 2 uses sequence-to-sequence RNN attention to autoregressively build the alignment between speech and text. However, autoregressive alignment search often fails when long input texts are fed into the model, since errors can accumulate throughout the text sequence, yielding a highly inaccurate alignment by the end of the iteration. On the other hand, MAS is not only non-autoregressive, but also designed specifically so that the discovered alignment never violates the set of inductive biases outlined above. This makes the model much more robust, even when the input sequence is arbitrarily long.
At the heart of MAS is dynamic programming (DP), a common programming technique used to optimize runtime on problems that can be decomposed into recurring sub-problems that share the same structure as its parent. DP offers a reasonably efficient way of solving many problems, usually in $O(n^d)$ runtime, where $n$ is the size of the input and $d$ denotes DP dimensionality. While this section will not attempt to explain DP in full, we will consider a toy problem to motivate DP specifically in the context of MAS.
Consider a classic dynamic programming problem, where the goal is to find a monotonic path that maximizes the sum of scores given some score matrix. Here, “monotonic” means that from the current position, we can either move diagonally down to the right, or move one cell to the right within the same row. While there might be many ways to approach this problem, here is one possible solution.
import copy

def find_maximum_sum_path(scores):
    # preliminary variables
    num_rows = len(scores)
    num_cols = len(scores[0])
    # copy to avoid overriding `scores`
    scores2 = copy.deepcopy(scores)
    # base case: the first row can only be reached by moving right
    for j in range(1, num_cols):
        scores2[0][j] += scores2[0][j - 1]
    # dynamic programming; cell (i, j) is reachable only when j >= i
    for i in range(1, num_rows):
        # the first cell of each row can only be reached diagonally
        scores2[i][i] += scores2[i - 1][i - 1]
        for j in range(i + 1, num_cols):
            scores2[i][j] += max(scores2[i - 1][j - 1], scores2[i][j - 1])
    # backtracking from the bottom-right cell to recover the path
    i = num_rows - 1
    path = [[0 for _ in range(num_cols)] for _ in range(num_rows)]
    for j in reversed(range(num_cols)):
        path[i][j] = 1
        if i != 0 and (i == j or scores2[i][j - 1] < scores2[i - 1][j - 1]):
            i -= 1
    return path
Given the following scores, the function returns the following result:
>>> grid = [
...     [1, 3, 1, 1],
...     [1, 2, 2, 2],
...     [4, 2, 1, 0],
... ]
>>> find_maximum_sum_path(grid)
[[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
It is not difficult to perform a manual sanity check to verify that the returned result is indeed the path that maximizes the sum of scores while adhering to the monotonicity constraint.
Let’s take a step back and revisit the model architecture diagram presented above. On the left side of the diagram, we see an illustration of monotonic alignment search in action. Notice that this is exactly the problem we solved above: given some matrix of scores, find a monotonic path that maximizes the sum. Now, only a few missing pieces remain:
It turns out that the two questions are closely related, and answering one will shed light on the other.
Recall that Glow-TTS deals with two input modalities during training: a string of text and its corresponding mel-spectrogram. The mel-spectrogram is decoded through the flow-based decoder. Similarly, the text is fed to a text encoder network, which outputs $\mathbf{\mu}$ and $\mathbf{\sigma}$ for each token of text. In other words, given ["h", "e", "l", "l", "o"]
, we would have a total of five mean and standard deviation vectors corresponding to each letter.^{2} We can denote them as $\mathbf{\mu_1}, \mathbf{\mu_2}, \dots, \mathbf{\mu_5}$, and $\mathbf{\sigma_1}, \mathbf{\sigma_2}, \dots, \mathbf{\sigma_5}$. Let’s also assume in this example that the corresponding mel-spectrogram spans a total of 100 frames. The output of the flow decoder would also be 100 vectors, denoted as $\mathbf{z_1}, \mathbf{z_2}, \dots, \mathbf{z_{100}}$.
Using these quantities, we can then construct a log-likelihood score matrix $P \in \mathbb{R}^{5 \times 100}$. Its entries are computed via $P_{ij} = \log(\phi(\mathbf{z_j}; \mu_i, \sigma_i))$, where $\phi$ denotes the normal probability density function. Since $\sigma$ is a vector instead of a matrix, we assume a diagonal covariance matrix. The intuition is that the value of $P_{ij}$ indicates how likely it is that the $i$-th character matches or aligns with the $j$-th mel-spectrogram frame. If a pair of text and audio match, the score will be high, and vice versa. The log-likelihood is used so that summation of scores effectively models a product in probability space.
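The construction of $P$ can be sketched in a few lines of NumPy. The shapes below (5 tokens, 100 frames, embedding size 8) mirror the running example, and the random statistics are placeholders standing in for actual text encoder and flow decoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_frames, dim = 5, 100, 8

mu = rng.normal(size=(num_tokens, dim))                # per-token means from the text encoder
sigma = rng.uniform(0.5, 1.5, size=(num_tokens, dim))  # per-token standard deviations
z = rng.normal(size=(num_frames, dim))                 # latents from the flow decoder

# P[i, j] = log N(z_j; mu_i, diag(sigma_i^2)), for every (token, frame) pair
P = np.empty((num_tokens, num_frames))
for i in range(num_tokens):
    log_norm = -0.5 * dim * np.log(2 * np.pi) - np.sum(np.log(sigma[i]))
    P[i] = log_norm - 0.5 * np.sum(((z - mu[i]) / sigma[i]) ** 2, axis=1)

print(P.shape)  # (5, 100)
```

MAS then runs the monotonic path DP from the previous section over exactly this matrix.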
Given this context, we can now apply the solution to the monotonic path sum problem motivated in the previous section. Instead of some arbitrary scores
matrix, we create the probability score matrix $P$ and use DP to discover the most likely monotonic alignment between speech and text. The alignment will satisfy the inductive biases we identified earlier due to the inherent design of MAS.
It is worth noting that MAS is a generic alignment search algorithm that is independent of the flow-based model design. In particular, MAS was used without the flow decoder in Grad-TTS. Popov et al. proposed using mel-spectrogram frames directly to measure the probability score given the mean and variance predicted from text. In other words, instead of using $\mathbf{z}$, mel-spectrogram frames $\mathbf{y}$ were used. Grad-TTS is notable in its use of score-based generative models, which fall under the larger category of diffusion-based probabilistic models.
We can finally put flow and MAS together to summarize the overall pipeline of Glow-TTS.
Given a pair of text and mel-spectrogram $(T, \mathbf{y})$, we feed $T$ into the text encoder $f_\text{text}$ and the mel-spectrogram $\mathbf{y}$ into the flow-based decoder $f_\text{mel}$ to obtain $f_\text{mel}(\mathbf{y}) \in \mathbb{R}^{D \times L_\text{mel}}$ and $f_\text{text}(T) = (\mu, \sigma)$, where $\mu, \sigma \in \mathbb{R}^{D \times L_\text{text}}$ and $D$ denotes the size of the embedding. We can then use MAS to obtain the most likely monotonic alignment $A^\star \in \mathbb{R}^{L_\text{text} \times L_\text{mel}}$. Since Glow-TTS is a flow-based model, which enables direct computation of likelihood, the model is simply trained to maximize the log-likelihood given by the sum of the entries of the log-likelihood score matrix $P$ selected by the alignment. $A^\star$ can intuitively be understood as a binary mask used to index $P$. Schematically, the final log-likelihood can be written as $l = \sum_{i = 1}^{L_\text{text}} \sum_{j = 1}^{L_\text{mel}}(P \odot A^\star)_{ij}$, where $\odot$ denotes the Hadamard product, or element-wise product of matrices. Since optimization problems in modern machine learning are typically framed as minimization, in practice we minimize the negative log-likelihood.
Although not discussed in the sections above, Glow-TTS requires training a small sub-model, called a duration predictor, for inference. Because we do not have access to the ground-truth mel-spectrogram during inference, we need a model that can predict the best alignment $A^\star$ purely from text. This task is carried out by the duration predictor, which accepts $T$ as input and is trained to minimize the L2 distance between its predicted alignment $\hat{A}$ and the actual $A^\star$ discovered by MAS.
In the context of inference, the model has to output a predicted mel-spectrogram $\hat{\mathbf{y}}$ conditioned on the input text $T$. First, we use the learned text encoder to obtain mean and variance, i.e., $f_\text{text}(T) = (\mu, \sigma)$. Then, we use the duration predictor to obtain a predicted alignment $\hat{A}$. We can then sample from the $\mathcal{N}(\mu, \sigma^2)$ distribution according to $\hat{A}$. Continuing the earlier example of T = ["h", "e", "l", "l", "o"]
, let’s say that A_star = [1, 3, 2, 1, 1]
. This means that we have to sample from $\mathcal{N}(\mu_\text{h}, \sigma_\text{h})$ once, $\mathcal{N}(\mu_\text{e}, \sigma_\text{e})$ three times, and so on. By concatenating the results of sampling, we obtain $\hat{\mathbf{z}} \in \mathbb{R}^{D \times \hat{L_\text{mel}}}$, where $\hat{L_\text{mel}}$ denotes the length of the predicted mel-spectrogram frames, which is effectively sum(A_star)
. Once we have $\hat{\mathbf{z}}$, we finally use the flow decoder to invert it into the mel-spectrogram space, i.e., $f_\text{mel}^{-1}(\hat{\mathbf{z}}) = \hat{\mathbf{y}}$.
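The inference-time sampling step for the running example can be sketched as follows. The token statistics and durations here are illustrative stand-ins for the text encoder and duration predictor outputs, and the particular temperature value is an arbitrary choice of mine for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
tokens = ["h", "e", "l", "l", "o"]
mu = rng.normal(size=(len(tokens), dim))               # stand-in text encoder means
sigma = rng.uniform(0.5, 1.5, size=(len(tokens), dim)) # stand-in standard deviations
durations = [1, 3, 2, 1, 1]                            # stand-in duration predictor output
temperature = 0.667                                    # controls sample diversity

frames = []
for i, d in enumerate(durations):
    for _ in range(d):
        eps = rng.normal(size=dim)
        # reparametrization trick: z = mu + eps * sigma, scaled by the temperature
        frames.append(mu[i] + temperature * eps * sigma[i])

z_hat = np.stack(frames)  # shape: (sum(durations), dim) = (8, 8)
print(z_hat.shape)
```

The stacked `z_hat` is exactly the $\hat{\mathbf{z}}$ that the flow decoder then inverts into a mel-spectrogram.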
Sample diversity is an important concern in neural TTS. Just as humans can read a single sentence in many different ways by varying tone, pitch, and timbre, we would ideally want a TTS model to be able to produce diverse samples. One way to achieve this in Glow-TTS is by varying the temperature parameter during sampling. In practice, sampling is performed through the reparametrization trick:
\[\epsilon \sim \mathcal{N}(0, 1) \\ \mathbf{z} = \mu + \epsilon \cdot \sigma.\]Through listening tests and pitch contours, Kim et al. show that varying the temperature applied to $\epsilon$ achieves diversity among samples produced by Glow-TTS.
A marked advantage of Glow-TTS is that it is a parallel TTS model. This contrasts with existing autoregressive baselines, such as Tacotron 2. While autoregressive models require an iterative loop to condition the output of the current timestep on that from the previous timestep, parallel models produce an output in a single pass. In other words, parallel models run in constant time, whereas the runtime complexity of autoregressive models scales linearly with respect to the length of the input sequence. This is clear in the comparison figure taken from the Glow-TTS paper.
Another pitfall of autoregressive models is that errors can accumulate throughout the iterative loop. If the model misidentifies an alignment between speech and text early on in the input sequence, later alignments will also likely be incorrect. In the case of parallel models, error accumulation is not possible since there is no iterative loop to begin with. Moreover, alignments found by Glow-TTS are made even more robust by the design of MAS, which systematically identifies only those alignments that satisfy the monotonicity inductive bias. In the figure below, also taken directly from the Glow-TTS paper, Kim et al. show that Glow-TTS maintains a consistent character error rate, while that of Tacotron 2 increases proportionally to the length of the input sequence.
Glow-TTS achieves competitive results on mean opinion score (MOS) listening tests. MOS tests are typically performed by randomly sampling a number of people and asking them to rate an audio sample on a scale of 1 to 5, where higher is better.
In the results table shown below, GT (ground truth) is rated most highly at 4.54. WaveGlow is a neural vocoder that transforms mel-spectrograms into waveforms. GT (Mel + WaveGlow) received 4.19, marginally below the GT waveform score. This is because using a neural vocoder necessarily introduces quality degradations and artifacts. Since even the best neural TTS acoustic feature generator would not be able to produce a mel-spectrogram that sounds more natural than a human recording, 4.19 can be considered the theoretical upper bound for any TTS model and WaveGlow combination. Glow-TTS comes quite close to 4.19, scoring approximately 4 across various temperature parameters. While the difference of 0.19 certainly suggests room for improvement, it is worth mentioning that Glow-TTS outperforms Tacotron 2, which had long been considered the competitive SOTA TTS model.
An emerging trend in neural TTS literature is end-to-end TTS modeling. Instead of the traditional two-stage pipeline composed of an acoustic feature generator and a neural vocoder, end-to-end models produce raw waveforms directly from text without going through the intermediate mel-spectral representation. One prime example is VITS, an end-to-end speech model developed by the authors of Glow-TTS and published at ICML 2021. VITS is a combination of Glow-TTS and HiFi-GAN, a neural vocoder. VITS uses largely the same MAS algorithm as Glow-TTS, and uses a variational autoencoding training scheme to combine the feature generator and the neural vocoder.
A benefit of end-to-end modeling is that the model is relieved of the mel-spectral information bottleneck. The mel-spectrogram is a specific representation of information defined and crafted according to human knowledge. However, the spirit of deep learning is that no manual hand-crafting of features is necessary, provided sufficient data and modeling capacity. End-to-end models allow the model to choose the intermediate representation that best accomplishes the task of synthesizing natural-sounding audio. Indeed, VITS outperforms Tacotron 2 and Glow-TTS by considerable margins and almost matches ground-truth MOS ratings. This is certainly an exciting development, and we can expect more lines of work in this direction.
Glow-TTS is a flow-based neural TTS model that demonstrates a method of leveraging the invertibility of flow to produce mel-spectrograms from text-derived latent representations. By projecting mel-spectrograms and text into a common latent space and using MAS with maximum likelihood training, Glow-TTS is able to learn robust, hard monotonic alignments between speech and text. Similar to Tacotron 2, Glow-TTS is now considered a competitive baseline and is referenced in recent literature.
Neural TTS has seen exciting developments over the past few years, including general text-to-speech, voice cloning, singing voice synthesis, and prosody transfer. Moreover, given the rapid pace of development in other fields, such as natural language processing, automatic speech recognition, and multimodal modeling, we could see more interesting models that combine different approaches and modalities to perform a wide array of complex tasks. If anything remains clear, it is that we are living at an exciting time in the era of machine learning, and that the next few years will continue to see breakthroughs and innovations that will awe and surprise us, just like people a few decades ago would marvel at the simplest words:
“Turn right at 130 Prospect Street.”
While there are variations of normalizing flows, such as continuous flows or neural ODEs, for the sake of simplicity, we only consider discrete normalizing flows. ↩
In practice, most TTS models, including Glow-TTS, use phonemes as input instead of characters of text. We illustrate the example using characters for simplicity. ↩
2021 was, in some ways, very similar to 2020. Despite the development and proliferation of vaccines, COVID-19 raged on, morphing into a new variant every few months. Masks and social distancing are now deeply embedded into our daily lives. Although booster shots and pill-type medications might change the dynamics of the pandemic, I personally think COVID is here to stay, at least for the foreseeable future.
After being discharged from the army in March of 2021, I spent roughly 6 months working as an intern at Neosapience, a Korean startup specializing in voice-over services and metaverse characters. This was also when I left ReRent, a hospitality startup that I was fortunate enough to have worked for since the summer of 2020. ReRent immensely helped me learn and grow as a software developer, versed in git
and GitHub, general web development, and Django, which has since become my favorite Python backend framework. It is also where I met valuable teammates, some of whom I met in person at Yale.
The transition from ReRent to Neosapience was a lot more than just a change of jobs. At Neosapience, I worked on machine learning research–an art of its own entirely different from backend web development. Specifically, I was tasked with the job of developing a singing voice synthesis model that, given lyrics and melodies, could “sing.” I still remember the frustration I felt when I was first trying to reproduce a reference paper I was provided as a baseline. There were parts of the paper that were ambiguous. The fact that it was a GAN-based model certainly did not help. I reached out to the authors in the hopes of gaining clarity, but received no response. Although I extrapolated parts of the model and trained it for a few days, the model only produced barely audible mumbles that could not be farther from the act of singing. I learned that ML was hard.
Thankfully, I was fortunate enough to have had more experienced co-workers as mentors who provided valuable pieces of advice. One of them suggested that I design a model of my own instead of blindly trying to reproduce the paper. As a demo of sorts, he showed me that a simple CNN model could sing better than the GAN I was trying to reproduce, with just a few minutes of training. Inspired by his progress, I began designing my own modules to experiment with a host of different architectures: CNNs, RNNs, transformers, and combinations thereof. I also explored various famous CNN architectures, such as InceptionNet and ResNeXt, in search of inspiration and ideas.
Unexpectedly, the biggest success came from a very experimental model that was a direct adaptation of MLP-Mixer, an architecture composed entirely of multi-layer perceptrons, or nn.Linear
layers in PyTorch. This was a paper I had presented during one of our weekly paper-reading meetings. Although the final model’s outputs still contained audible artifacts, we saw novelty in the fact that it was the first voice synthesis model exclusively composed of linear layers. This project culminated in my first ever publication, MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis, at the IEEE Machine Learning for Signal Processing workshop, now available on IEEE Xplore. By the end of my internship, I felt a lot more comfortable with various ML concepts and their implementations. This was also when I was involved with Hugging Face’s Flax/JAX community week event, where my teammates and I developed KoCLIP, as well as BigScience, a huge project by Hugging Face to reproduce a GPT-3-sized language model.
I came back to Yale with the explicit intent of majoring in Computer Science and Mathematics. While this was not a trivial decision, it was very clear and obvious to me that this was the academic path I wanted to pursue. I took CPSC 223, which is Yale’s signature data structures course taught in… barebones C. malloc
and free
are probably the functions I used the most this year, perhaps with the exception of print
/printf
s I used for lazy debugging. On top of CS classes, I also continued my involvement with ML in a few ways. For one thing, I co-authored my second paper, EdiTTS: Score-based Editing for Controllable Text-to-Speech, with a co-worker at Neosapience. This was the first project in which I used Amazon Mechanical Turk for MOS measurements. I’m still waiting on the final decision from a conference to which I submitted this paper, but I’m happy about how it came out regardless.
More importantly, I was extremely fortunate to be given the opportunity to work as a software engineering intern at Hugging Face. This was an unbelievable achievement for me that I knew I did not deserve. As a self-taught newcomer and student to the field of ML, I only dreamed about working at Hugging Face when I was first learning about transformers. I still have not produced much output at HF largely due to the fact that my internship was part-time and very low time commitment-wise, but I’m still excited for the month of January, which is when I will be dedicating myself full time to Hugging Face and BigScience. I would also like to express gratitude to the engineer at Hugging Face who referred me to this position, and whom I now consider a mentor, Stas Bekman.
This semester was perhaps the hardest one yet at Yale. All the classes I took required either a lot of effort or a lot of time. Admittedly, to fulfill my distribution requirement, I went out of my way and took HIST 271: European Intellectual History since Nietzsche, where I learned a ton about philosophy, from the Enlightenment all the way up to post-Modernism. I also enrolled in ASTR 110: Planets and Stars, which I frankly took for an easy science credit, only to realize that weekly problem sets took up more time than I had anticipated. MATH 241: Probability Theory was easy at first, but ramped up quite quickly at the end of the semester, to the point that I was floundering about during finals week. Nonetheless, I’m glad that the semester is over, and that I came out of it feeling more learned and knowledgeable than I was five months ago.
2021 was surely a roller coaster ride. It was a fruitful one, but it is also a miracle that it turned out the way it did. With experience, memories, and gratitude at heart, I cannot wait to see what 2022 has in store.
]]>Given a parametrized real-valued function $f_\theta(\mathbf{x})$, we can derive a probability model $p_\theta(\mathbf{x})$ by applying a normalization term $Z_\theta$.
\[p_\theta (\mathbf{x}) = \frac{e^{- f_\theta (\mathbf{x})}}{Z_\theta} \\ Z_\theta = \int e^{- f_\theta (\mathbf{x})} \, d \mathbf{x}.\]In practice, $f_\theta$ is often an energy-based model (EBM).
We can then define the likelihood function as follows:
\[\log p_\theta (\mathbf{x}) = - f_\theta (\mathbf{x}) - \log Z_\theta.\]However, one glaring problem with this formulation is that $Z_\theta$ is often intractable. Score-matching presents an elegant solution to bypass this problem.
To eliminate the intractable term, we consider the score, which is defined as the gradient of the log likelihood with respect to the random variable $\mathbf{x}$. Note that we are not taking the gradient with respect to the parameter $\theta$, which is typically the object of interest in processes such as MLE.
\[\nabla_\mathbf{x} \log p_\theta (\mathbf{x}) = - \nabla_\mathbf{x} f_\theta (\mathbf{x}).\]The goal of score-matching, then, is to minimize the difference between $p_\text{data}$ and $p_\theta$ by optimizing the Fisher divergence. For sake of simplicity, we consider the 1-D case.
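As a quick concrete check of the score definition, here is a minimal numpy sketch for a 1-D Gaussian, where the score is available in closed form; the function names are mine, purely for illustration.

```python
import numpy as np

# For a 1-D Gaussian, log p(x) = -(x - mu)^2 / (2 sigma^2) + const,
# so the score (gradient of log p w.r.t. x) has a closed form.
def log_p(x, mu=0.0, sigma=1.0):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def score(x, mu=0.0, sigma=1.0):
    return -(x - mu) / sigma ** 2

x, eps = 1.7, 1e-6
# Central finite difference of log p agrees with the analytic score.
numeric = (log_p(x + eps) - log_p(x - eps)) / (2 * eps)
print(numeric, score(x))  # both approximately -1.7
```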
\[\begin{align} &\frac12 \mathbb{E}_{p_\text{data}} \lVert \nabla_x \log p_\text{data} (x) - \nabla_x \log p_\theta (x) \rVert^2_2 \\ &= \frac12 \int p_\text{data} (x) \left( \nabla_x \log p_\text{data} (x) - \nabla_x \log p_\theta (x) \right)^2 \, dx \\ &= \frac12 \int p_\text{data}(x) (\nabla_x \log p_\text{data}(x))^2 \, dx + \frac12 \int p_\text{data} (x) (\nabla_x \log p_\theta (x))^2 \, dx \\ & - \int p_\text{data}(x) \nabla_x \log p_\text{data}(x) \nabla_x \log p_\theta (x) \, dx . \end{align}\]The equalities simply follow from the integral definition of expectation. Note that the first term is simply a constant and can be ignored during optimization.
Applying integration by parts on the last term,
\[\begin{align} & \int p_\text{data}(x) \nabla_x \log p_\text{data}(x) \nabla_x \log p_\theta (x) \, dx \\ &= \int p_\text{data}(x) \frac{\nabla_x p_\text{data}(x)}{p_\text{data} (x)} \nabla_x \log p_\theta (x) \, dx \\ &= \int \nabla_x \log p_\theta (x) \nabla_x p_\text{data} (x) \, dx \\ &= p_\text{data}(x) \nabla_x \log p_\theta(x) \bigg|^\infty_{- \infty} - \int p_\text{data}(x) \nabla^2_x \log p_\theta (x) \, dx \\ & \approx - \mathbb{E}_{p_\text{data}}[\nabla^2_x \log p_\theta (x)]. \end{align}\]Putting all terms together,
\[\begin{align} &\frac12 \mathbb{E}_{p_\text{data}} \lVert \nabla_x \log p_\text{data} (x) - \nabla_x \log p_\theta (x) \rVert^2_2 \\ &= \mathbb{E}_{p_\text{data}}[\nabla^2_x \log p_\theta (x)] + \frac12 \mathbb{E}_{p_\text{data}} [(\nabla_x \log p_\theta (x))^2] + \text{const.} \\ &= \mathbb{E}_{p_\text{data}}[\nabla^2_x \log p_\theta (x) + \frac12 (\nabla_x \log p_\theta (x))^2] + \text{const.} \end{align}\]We can easily extend this into a multidimensional context, the result of which is
\[\mathbb{E}_{p_\text{data}} \left[\text{tr}(\nabla^2_\mathbf{x} \log p_\theta (\mathbf{x})) + \frac12 \lVert \nabla_\mathbf{x} \log p_\theta (\mathbf{x}) \rVert^2_2 \right] + \text{const.}\]We are specifically interested in instances where $f_\theta$ is parametrized as a neural network. Recall that
\[\nabla_\mathbf{x} \log p_\theta (\mathbf{x}) = - \nabla_\mathbf{x} f_\theta (\mathbf{x}).\]Therefore, we can rewrite the score-matching objective as
\[\mathbb{E}_{p_\text{data}} \left[- \text{tr}(\nabla^2_\mathbf{x} f_\theta (\mathbf{x})) + \frac12 \lVert \nabla_\mathbf{x} f_\theta (\mathbf{x}) \rVert^2_2 \right] + \text{const}.\]While the first-order gradient can simply be obtained via backpropagation, $\text{tr}(\nabla^2_\mathbf{x} f_\theta (\mathbf{x}))$ is very computationally costly. To circumvent this problem, the authors propose random projection, which reduces the dimensionality of the data down to scalars. Quoting Yang Song:
We propose sliced score matching to greatly scale up the computation of score matching. The motivating idea is that one dimensional data distribution is much easier to estimate for score matching. We propose to project the scores onto random directions, such that the vector fields of scores of the data and model distribution become scalar fields. We then compare the scalar fields to determine how far the model distribution is from the data distribution. It is clear to see that the two vector fields are equivalent if and only if their scalar fields corresponding to projections onto all directions are the same.
The random projection version of Fisher divergence is
\[\frac{1}{2}\mathbb{E}_{p_\text{data}}[(\mathbf{v}^\intercal \nabla_\mathbf{x} \log p_\text{data}(\mathbf{x}) - \mathbf{v}^\intercal \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) )^2].\]Intuitively, the equation forces the two distributions to get closer along some random projection $\mathbf{v}$. Since the projections are random, optimizing this quantity in expectation is guaranteed to bring $p_\theta$ closer to the real data distribution.
The sliced score-matching objective under this revised Fisher divergence is
\[\mathbb{E}_{p_\text{data}}\bigg[\mathbf{v}^\intercal \nabla_{\mathbf{x}}^2 \log p_\theta(\mathbf{x})\mathbf{v} + \frac{1}{2} (\mathbf{v}^\intercal\nabla_\mathbf{x} \log p_\theta(\mathbf{x}))^2 \bigg] + \text{const}.\]The problem has now been reduced into computationally tractable form.
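The sliced objective can be estimated with just two backward passes, avoiding the full Hessian. Below is a minimal sketch, assuming a generic energy model; the helper name and the toy quadratic energy are my own, not from the paper.

```python
import torch

def sliced_score_matching_loss(f, x):
    """Single-projection estimate of the sliced score-matching objective."""
    x = x.clone().requires_grad_(True)
    v = torch.randn_like(x)  # random projection direction v
    energy = f(x).sum()
    # Score of the model: grad of log p_theta(x) = -grad of f(x).
    score = -torch.autograd.grad(energy, x, create_graph=True)[0]
    # (Hessian of log p) @ v, via one extra backward pass on v^T score.
    hvp = torch.autograd.grad((v * score).sum(), x, create_graph=True)[0]
    quad = (v * hvp).sum(dim=-1)               # v^T (grad^2 log p) v
    sq = 0.5 * (v * score).sum(dim=-1) ** 2    # 1/2 (v^T grad log p)^2
    return (quad + sq).mean()

# Toy energy of a standard normal, f(x) = 1/2 ||x||^2, whose score is -x.
f = lambda x: 0.5 * (x ** 2).sum(dim=-1)
loss = sliced_score_matching_loss(f, torch.randn(16, 3))
```

In practice one would average over several projection vectors per example; a single projection already gives an unbiased estimate.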
This post was originally written in July, but polished into its current final form in December. If you spot any rough edges or details I missed, please feel free to reach out to me with corrections.
]]>We want a model that satisfies the following:
The two conditions are somewhat related in the sense that once you have a function (or a neural network that approximates such a function) that maps complex distributions to a tractable latent space, sampling can be performed immediately given that the mapping function is invertible. Invertibility is not something that can be easily assumed in deep learning and thus calls for some specific architectural decisions. Nonetheless, I find this formulation highly compelling and intuitive.
To fully understand the mechanics of flow, we need to first revisit the change of variables formula. Let $X$ denote a random variable, and $f_\theta$, some monotonic, invertible function that maps $X$ to a latent space $Z$. In the simplest case, $f_\theta$ might be the CDF of $X$, and $Z$ might be a uniform distribution $U(0, 1)$. More generally, we have
\[z = f_\theta(x)\]Note that there exists a one-to-one correspondence between the two random variables, which is important to guarantee invertibility.
Let $p(\cdot)$ denote the PDF of some random variable. Naively, one might think that
\[p(x) \, dx = p(z) \, dz\]However, this fails to take into account the fact that a small change in $x$ may or may not be equally spread out in $z$ space. Hence, we need a correcting factor, which is the derivative of $z$ w.r.t. $x$.
\[p(x) = p(z) \left\lvert \frac{\partial f_\theta(x)}{\partial x} \right\rvert \tag{1}\]More formally, we can see this by considering the derivative of the CDF, which we will denote as $P(\cdot)$.
\[\begin{align} P(Z \leq z) &= P(f_\theta(X) \leq z) \\ &= P(X \leq f_\theta^{-1}(z)) \end{align} \tag{2}\](2) holds if $f$ is a monotonically increasing function. If it is a monotonically decreasing function, then
\[P(Z \leq z) = 1 - P(X \leq f_\theta^{-1}(z))\]Differentiating both sides of the equation with respect to $z$, we get
\[\begin{align} p(z) &= \pm \, p(f_\theta^{-1}(z)) \frac{\partial f_\theta^{-1}(z)}{\partial z} \\ &= p(x) \left\lvert \frac{\partial x}{\partial z} \right\rvert \\ \end{align} \tag{3}\]Rearranging (3) yields (1).
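A quick numerical sanity check of (1): take $f(x) = e^x$, which sends a standard normal $x$ to a log-normal $z$. The function names below are mine.

```python
import numpy as np

# z = f(x) = exp(x) maps a standard normal x to a log-normal z.
def p_x(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def p_z(z):
    # Log-normal density, obtained by applying the change of variables
    # formula to f(x) = exp(x): p(z) = p(x) / |df/dx|.
    return p_x(np.log(z)) / z

x = 0.6
z = np.exp(x)
jac = np.exp(x)  # |df/dx| evaluated at x
print(np.isclose(p_x(x), p_z(z) * jac))  # True: equation (1) holds
```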
In a multi-dimensional context, the absolute value of the partial derivative term is effectively the determinant of the jacobian matrix.
\[p(x) = p(z) \frac{\text{vol}(dz)}{\text{vol}(dx)} = p(z) \left\lvert \text{det} \frac{dz}{dx} \right\rvert\]We can understand the determinant of a matrix as calculating the magnitude of volume change that it would produce as a linear transformation of coordinates. We can see this as a multivariate analogue of slope or the gradient.
Flow is nothing more than a neural network that models $f_\theta$. It takes a random variable living in some complex, intractable space and maps it to a tractable distribution. In the case of normalizing flows, the target latent distribution is a normal distribution.
As is the case with any likelihood model, the goal is to fit a model that maximizes the log likelihood of data. Therefore, the objective is
\[\max \sum_i \log p(x_i) \tag{4}\]We can substitute the likelihood with an expression using the latent transformed variable in (1). Then, (4) is equivalent to
\[\max \sum_i \log p(f_\theta(x_i)) + \log \, \left\lvert \text{det} \frac{d f_\theta(x_i)}{d x} \right\rvert\]We train the flow model to minimize negative log likelihood, or equivalently, maximize log likelihood.
A few remarks:
Up to this point, you might think that the flow model is a very intricate machinery that comes with many constraints, e.g. invertibility, easy jacobian calculation, and so on. Nonetheless, I think it has some clear advantages in two aspects.
To sample from a flow model, all we have to do is sample from the latent distribution, such as a Gaussian, then simply send the sample down an inverse flow.
One salient characteristic of a flow is that a combination of flows is also a flow. If you have a set of invertible, differentiable functions, a stack of such functions will also be differentiable and invertible.
\[z = f_k \circ f_{k - 1} \circ \cdots \circ f_1(x) \\ x = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_k^{-1} (z)\]The capacity of a single flow layer is most likely limited, but a deep stack gives the model enough expressive power to handle highly complex prior distributions.
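Composition also plays nicely with the change of variables formula: the Jacobian of a composition is the product of the layer Jacobians, so the log-determinant terms simply add. A tiny numpy check, using invertible linear maps as stand-ins for flow layers:

```python
import numpy as np

# Two invertible linear maps standing in for flow layers. The Jacobian of
# their composition is the matrix product, so log|det| contributions add.
f1 = np.array([[2.0, 0.0], [1.0, 1.0]])
f2 = np.array([[1.0, 3.0], [0.0, 0.5]])
lhs = np.log(abs(np.linalg.det(f2 @ f1)))
rhs = np.log(abs(np.linalg.det(f1))) + np.log(abs(np.linalg.det(f2)))
print(np.isclose(lhs, rhs))  # True
```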
Flow models must be invertible, which leads to some important considerations when motivating their architecture. For instance, we cannot use ReLU activations since they violate the invertibility requirement. Moreover, the jacobian should be easy to compute.
The beautiful part of flow is that there is a simple way to resolve both conundrums: affine coupling layers. Let $d$ denote the cardinality of the embedding space on which we are applying a flow model. Then, the affine coupling layer can schematically be written as
\[z_{1:d/2} = x_{1:d/2} \\ \begin{align} z_{d/2:d} &= x_{d/2:d} \odot s_\theta(x_{1:d/2}) + t_\theta(x_{1:d/2}) \\ &= x_{d/2:d} \odot s_\theta(z_{1:d/2}) + t_\theta(z_{1:d/2}) \end{align} \tag{5}\]In plain language, we can consider $f_\theta$ as a special transformation in which the top half of $z$ is just copied from $x$ without modification. The bottom half undergoes an affine transformation, where the weights and biases are computed from the top half of $x$. We can easily check that this transformation is indeed invertible:
\[x_{1:d/2} = z_{1:d/2} \\ x_{d/2:d} = s_\theta^{-1}(z_{1:d/2})(z_{d/2:d} - t_\theta(z_{1:d/2})) \tag{6}\]Affine coupling layers are invertible only because the top half of $z$ is equal to that of $x$. This demystifies the copying operation in (5), which may have appeared somewhat unintuitive and awkward initially.
In practice, it appears that flow layers take a slightly more complicated form than the conceptual architecture detailed above. For example, Real NVP proposes the following schema.
\[z_{1:d/2} = x_{1:d/2} \\ h = a \times \text{tanh}(s_\theta(x_{1:d/2})) + b \\ z_{d/2:d} = \text{exp}(h) \times x_{d/2:d} + g_\theta(x_{1:d/2})\]where $a$ and $b$ are learned parameters, and $s_\theta$ and $g_\theta$ are some affine transformations, such as a multi-layer perceptron.
Earlier, we noted that the determinant of the jacobian matrix must be easy to compute. This is a non-trivial constraint that does not hold true in many cases.
Fortunately, it turns out that the jacobian is very easy to compute given an affine coupling layer. We can somewhat intuit this by considering the copy-and-paste operation that is applied to the top half of the input. Given this operation, we can see that the upper left quadrant of the jacobian will simply be an identity matrix.
\[\begin{align} \frac{\partial z}{\partial x} &= \begin{pmatrix} \frac{\partial z_{1:d/2}}{\partial x_{1:d/2}} & \frac{\partial z_{1:d/2}}{\partial x_{d/2:d}} \\ \frac{\partial z_{d/2:d}}{\partial x_{1:d/2}} & \frac{\partial z_{d/2:d}}{\partial x_{d/2:d}} \end{pmatrix} \\ &= \begin{pmatrix} I & 0 \\ \frac{\partial z_{d/2:d}}{\partial x_{1:d/2}} & \text{diag}(s_\theta(x_{1:d/2})) \end{pmatrix} \end{align}\]Although there are still complicated terms in the lower left quadrant of the jacobian, we do not have to consider them to compute the determinant: the determinant of a lower triangular matrix is simply the product of its diagonal entries. The determinant of the jacobian thus collapses to the product of the entries in the lower right quadrant. Hence, we see how the affine coupling layer satisfies both the invertibility and the jacobian determinant requirements.
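This determinant shortcut is easy to verify numerically. The sketch below builds a Jacobian with the block structure of an affine coupling layer (an identity block from the copy, a dense block we never need to look at, and a diagonal scaling block) and checks that its determinant is the product of the diagonal entries:

```python
import numpy as np

d = 6
s = np.random.rand(d // 2) + 0.5  # stand-in for s_theta(x_{1:d/2}), kept positive
J = np.zeros((d, d))
J[: d // 2, : d // 2] = np.eye(d // 2)                    # identity block (copy)
J[d // 2 :, : d // 2] = np.random.randn(d // 2, d // 2)   # dense block, irrelevant
J[d // 2 :, d // 2 :] = np.diag(s)                        # diagonal scaling block
print(np.isclose(np.linalg.det(J), s.prod()))  # True
```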
This is my attempt at a simple implementation of an affine coupling layer. Although I could have combined the forward() and inverse() functions to remove duplicate lines of code, for clarity’s sake, I left them separate.
import torch
from torch import nn

class AffineCouplingLayer(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        half_size, remainder = divmod(hidden_size, 2)
        assert remainder == 0, (
            f"Expected `hidden_size` to be even, but received {hidden_size}"
        )
        # Maps the top half to scale and shift terms for the bottom half.
        self.fc = nn.Linear(half_size, hidden_size)

    def forward(self, x, inverse=False):
        if inverse:
            return self.inverse(x)
        x1, x2 = x.chunk(2, dim=1)
        z1 = x1
        s, t = self.fc(x1).chunk(2, dim=1)
        z2 = x2 * s + t
        z = torch.cat((z1, z2), dim=1)
        det = s.prod(dim=-1).abs()
        return z, det

    def inverse(self, z):
        z1, z2 = z.chunk(2, dim=1)
        x1 = z1
        s, t = self.fc(z1).chunk(2, dim=1)
        x2 = (z2 - t) / s
        x = torch.cat((x1, x2), dim=1)
        return x
This implementation is a close transcription of (5). z1 denotes $z_{1:d/2}$; z2, $z_{d/2:d}$; and likewise for the x variables. The fully-connected layer self.fc acts as an affine transform. We condition the output z2 on the result of the affine transform applied to x1. The inverse() method is a transcription of (6).
We can perform a quick sanity check on this implementation by performing a forward pass as well as an inverse pass, and verifying that inverting the output of the forward pass recovers the original input.
batch_size = 8
hidden_size = 10
half_size = hidden_size // 2
x = torch.randn(batch_size, hidden_size)
l = AffineCouplingLayer(hidden_size)
z, det = l(x)
z.shape
torch.Size([8, 10])
We also get the determinants, which are scalar values. We get 8 values, which equals the batch size of the example input.
det.shape
torch.Size([8])
We can check that the affine coupling layer only transforms the top half of the input.
torch.equal(x[:,:half_size], z[:,:half_size])
True
Trivially, we can also verify that the rest of the output has been modified by the layer.
torch.equal(x[:,half_size:], z[:,half_size:])
False
Most importantly, we can see that the layer is indeed invertible; that is, it recovers the original input given the output of the layer z.
torch.allclose(x, l(z, inverse=True))
True
We use torch.allclose() instead of torch.equal() due to floating point errors that can cause subtle changes in values. This is merely a technicality and does not affect the conclusion that affine coupling layers are fully invertible.
In this post, we discussed flow models. I personally find flow-based models extremely interesting, simply because deep neural networks are normally not something that we can invert like a simple mathematical function. After all, the precise reason why we use deep neural networks is that we want to model complex non-linear functions. Flow models seem to go against this intuition in some sense, while providing us with the tools to map highly complex data distributions to tractable posteriors.
I hope you enjoyed reading this post. Catch you up in the next one!
]]>One important property of the logarithm is that it is a concave function. A function $f$ is concave if it satisfies the following property:
\[f\left( \sum \nolimits_i w_i x_i \right) \geq \sum \nolimits_i w_i f(x_i) \tag{1}\]where the weights $w_i$ are non-negative and sum to one. In other words, if the function evaluated at a weighted average of values is always greater than or equal to the weighted average of the function values, the function is concave.
As a short detour, we discussed a similar concept in the context of variational autoencoders and Jensen’s inequality in an earlier post. In that post, I introduced the definition of convexity as follows:
\[\mathbb{E}[f(x)] \geq f(\mathbb{E}[x]) \tag{2}\]While the notations used are slightly different, it is easy to see that this definition is almost the exact reverse of (1). A trivial consequence is that a function is both concave and convex if and only if it is linear.
Given this understanding, we can now revisit the logarithm and quickly verify that it is a concave function.
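A quick numerical spot check that the log of a weighted average dominates the weighted average of logs; the particular weights and values below are arbitrary illustrative choices.

```python
import math

# Concavity of log: log of a weighted average dominates the weighted
# average of logs (weights are non-negative and sum to 1).
w = [0.3, 0.7]
x = [2.0, 10.0]
lhs = math.log(sum(wi * xi for wi, xi in zip(w, x)))   # log(weighted avg)
rhs = sum(wi * math.log(xi) for wi, xi in zip(w, x))   # weighted avg of logs
print(lhs >= rhs)  # True
```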
Before diving into a soup of equations, it’s important to remind ourselves of the problem setup. While ELBO is probably most commonly referenced in the context of variational autoencoders, I have recently seen it being mentioned in diffusion models as well. ELBO is a broad concept that can be applied to discuss any model with hidden latent representations, which we will denote as $h$ henceforth.
More concretely, given a model $p(x, h)$, we can write
\[\begin{align} \log p(x) &= \log \left( \sum_{h} p(x, h) \right) \tag{2} \\ &= \log \left( \sum_{h} q(h \vert x) \frac{p(x, h)}{q(h \vert x)} \right) \tag{3} \\ & \geq \sum_{h} q(h \vert x) \log \frac{p(x, h)}{q(h \vert x)} \tag{4} \\ &= \sum_{h} q(h \vert x) \log p(x, h) - \sum_{h} q(h \vert x) \log q(h \vert x) \tag{5} \\ &= \mathbb{E}_q [\log p(x, h) - \log q(h \vert x)] \tag{6} \end{align}\](2) follows from the law of total probability, (3) is a simultaneous application of multiplication and division, (4) follows from the concavity of logarithms, (5) is an algebraic manipulation using the properties of logarithms, and (6) is a rewriting of the expression as an expectation under $q(h \vert x)$.
In the formulation above, $q(h \vert x)$ can be understood as an approximation of a true distribution $p(h \vert x)$. Note that when $q(h \vert x) = p(h \vert x)$, we have an exact equality. Since
\[\log p(x, h) = \log p(h \vert x) + \log p(x)\]We can substitute $q$ for $p$ and rewrite (5) as
\[\begin{align} \log p(x) &= \sum_h p(h \vert x) (\log p(h \vert x) + \log p(x)) - \sum_h p(h \vert x) \log p(h \vert x) \\ &= \sum_h p(h \vert x) \log p(x) \end{align}\]Since $p(x)$ does not depend on $h$, we can pull out the term from the summation, treating it as a constant, leaving us with
\[\log p(x) \sum_h p(h \vert x)\]Using the law of total probability, we see that the summation totals to 1, leaving us with $\log p(x)$, which is what ELBO seeks to approximate.
Variational lower bounds are extremely useful when dealing with models whose interactions between $x$ and the hidden representation $h$ are complex, rendering (2) computationally intractable. Therefore, to train such models, we seek to maximize the log likelihood by pushing the lower bound up.
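The bound is easy to see numerically on a toy discrete latent model. In the sketch below, the joint table p_xh is an arbitrary illustrative choice of mine; we check that the ELBO from (6) lower-bounds $\log p(x)$ and is tight exactly when $q(h \vert x) = p(h \vert x)$.

```python
import numpy as np

# Toy model: one observed x, latent h in {0, 1}.
p_xh = np.array([0.1, 0.3])        # p(x, h=0), p(x, h=1)
log_px = np.log(p_xh.sum())        # log p(x), tractable in this tiny example

def elbo(q):
    # E_q[log p(x, h) - log q(h|x)], i.e. equation (6)
    return np.sum(q * (np.log(p_xh) - np.log(q)))

q_uniform = np.array([0.5, 0.5])   # a crude approximate posterior
q_exact = p_xh / p_xh.sum()        # the true posterior p(h|x)

print(elbo(q_uniform) <= log_px)          # True: the bound holds
print(np.isclose(elbo(q_exact), log_px))  # True: tight when q = p(h|x)
```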
Recall the definition of KL divergence:
\[\begin{align} D_\text{KL}(q \parallel p) &= \sum_{x \in X} q(x) \log \left( \frac{q(x)}{p(x)} \right) \\ &= - \sum_{x \in X} q(x) \log \left( \frac{p(x)}{q(x)} \right) \\ \end{align}\]We can see the resemblance between this definition and the definition of ELBO as written in (4), which was
\[\log p(x) \geq \sum_{h} q(h \vert x) \log \frac{p(x, h)}{q(h \vert x)} \tag{4}\]The nice conclusion to this story is that
\[\log p(x) - \text{ELBO} = D_\text{KL}(q(h \vert x) \parallel p(h \vert x)) \tag{7}\]This is a nice interpretation, since KL divergence is by definition always greater or equal to zero. Hence, we can confirm that
\[\log p(x) \geq \text{ELBO}\]In this section, we sketch a quick proof for (7).
\[\begin{align} D_\text{KL}(q(h \vert x) \parallel p(h \vert x)) &= \mathbb{E}_q [\log q(h \vert x) - \log p(h \vert x) ] \\ &= \mathbb{E}_q [\log q(h \vert x) - \log p(x, h) + \log p(x) ] \\ &= \mathbb{E}_q [\log q(h \vert x) - \log p(x, h)] + \log p(x) \\ \end{align}\]Notice that the expectation is the sign-flipped version of the ELBO term we derived above.
\[\mathbb{E}_q [\log p(x, h) - \log q(h \vert x)] \tag{6}\]Therefore, we have
\[D_\text{KL}(q(h \vert x) \parallel p(h \vert x)) = - \text{ELBO} + \log p(x) \\ \implies \log p(x) - \text{ELBO} = D_\text{KL}(q(h \vert x) \parallel p(h \vert x))\]Since we have already seen how ELBO comes up in VAEs, it might be more helpful to take a look at another more recent example I came across while reading Denoising Diffusion Probabilistic Models, or DDPM for short. The intent of this section is not to go over what DDPMs are, but rather to show a sneak peek into how ELBO is mentioned in the paper.
In the paper, the authors write
Training is performed by optimizing the usual variational bound on negative log likelihood: \(\begin{align} \mathbb{E}[- \log p_\theta(\mathbf{x}_0)] & \leq \mathbb{E}_q \left[ - \log \frac{p_\theta (\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \right] \tag{8} \\ &= \mathbb{E}_q \left[ - \log p(\mathbf{x}_T) - \sum_{t \geq 1} \log \frac{p_\theta (\mathbf{x}_{t - 1} \vert \mathbf{x}_t)}{q(\mathbf{x}_t \vert \mathbf{x}_{t - 1})} \right] \tag{9} \\ & := L \end{align}\)
Equation tags have been added for the purposes of this post.
Admittedly, this does look confusing at first sight, but at its core is the definition of ELBO which we have derived in this post, plus some details inherent to DDPMs, such as Markov chain diffusion. In light of the topic of this post, I will attempt to give the simplest possible explanation of the latter while focusing on the former.
To make things a little more familiar, let’s rewrite (6) to look more like the one presented in the DDPM paper.
\[\begin{align} \log p(x) & \geq \mathbb{E}_q [\log p(x, h) - \log q(h \vert x)] \tag{6} \\ & = \mathbb{E}_q \left[ \log \frac{p(x, h)}{q(h \vert x)} \right] \tag{6-1} \\ \end{align}\]It is not difficult to see that simply flipping the sign on both sides results in an expression that closely resembles (8). We also see a one-to-one correspondence between the variables used in this post and the ones in the paper. Namely, $\mathbf{x}_0$ corresponds to $x$, the ground-truth data, and $\mathbf{x}_t$ corresponds to the hidden representations of the model.
DDPMs work by starting out with some ground-truth data $\mathbf{x}_0$, then gradually adding Gaussian noise through a Markov chain process. This gradually “breaks” signals originally present in the data, and sends the ground-truth data toward an approximately isotropic distribution. This process is illustrated below. The figure was taken from the author’s website.
A neural network is then trained to reverse this Markov chain process by recovering the original signal from the noise. The overall intuition is, in some sense, similar to that of GANs or VAEs, where a network learns to map latent dimensions to the data distribution. An obvious difference is that DDPMs iteratively recover the data, whereas GAN generators usually go directly to the data distribution. The slicing and summation notation in (9) exists precisely due to this iterative nature of the DDPM generative process.
ELBO and KL divergence are among those concepts that I always think I understand, but do not in reality. The mathematical details underlying these concepts are always intriguing to look at.
While this post in no way covers the entirety of the topic, I hope it will lay a solid foundation for those who want to better understand the mathematics behind latent variable models, such as variational autoencoders, DDPMs, and the like. Personally, I am starting to discover a newfound fascination for DDPMs, and hope to write more about them in the near future.
I hope you enjoyed reading this post. Catch you up in the next one!
]]>Lately I’ve been realizing how powerful a force inertia is. It was easy to churn out posts every week when blogging was part of my personal norm, almost a habit if you will. Then, when perturbations were introduced to my life, I lost equilibrium and regrettably stopped writing on a regular basis. While I continued studying and committing to new and old repositories on my GitHub, for some inexplicable reason I found it difficult to restart something that I had stopped engaging with. Inertia is insidious, yet it concretizes with time, turning into a substance forceful enough to transform the definition of what personal norm entails.
Today, I was trying to wrap my head around the basics of stochastic differential equations and diffusion models (both of which I still do not understand) until I came across the term “score-based models.” The term “score” comes from Fisher’s score, which I had written about some time in the past. It’s an odd feeling when you realize that your self of a few months back was bright enough to understand concepts that the current self finds abstract and incomprehensible. But this wasn’t the only time I looked up something on my own blog. While there were also times when I spotted my own past mistakes, more often than not I found myself using my own writing as reference in an attempt to recall some concept or understanding from distant memory.
The conclusion of this admittedly verbose, ostensibly pointless post, is that documenting one’s intellectual journey is definitely a worthy endeavor. While the format of this post may appear as a self-promotion of sorts, the intended audience is really my future self, who I hope does not succumb to inertia or, put more bluntly, laziness. So here’s to another round of blogging!
]]>Despite its fancy and somewhat intimidating name, the Nyström method has an intuitive explanation. The idea is that, if we know the distance between point A and point B, as well as that between point B and point C, then we can approximate the distance between points A and C as some sort of addition of the two quantities. Of course, if we were discussing distances in the context of one-dimensional space, namely the real number line, we would not only be able to approximate the distance; we would know the exact quantity. However, in high-dimensional space, this is somewhat more difficult, and we can only resort to approximations.
To put things into context, let’s say we want to approximate the attention matrix in the transformer architecture. The Nyström method begins by selecting what the authors of the paper refer to as landmarks. Basically, if we have an attention matrix $A \in \mathbb{R}^{L \times L}$, then we select a few landmark rows and columns to use as the basis or pivot point for our approximation. The goal, then, is to select as few landmarks as possible while being able to approximate the attention matrix as accurately as possible.
For sake of simplicity, let’s say we select the first row and column to be our landmarks. Then, the goal is to approximate the inner sub-matrix $A_\text{sub} \in \mathbb{R}^{(L - 1) \times (L - 1)}$. How might we go about it?
As stated earlier, the intuition is that we use the landmarks as pivot points. Since we selected the first rows and columns as our landmarks, we have access to $q_1 k_n^\top$ for all $n \leq L$, as well as $q_n k_1^\top$ for all $n \leq L$ (for simplicity, we ignore the normalizing square root). If we remind ourselves of the motivation behind the transformer’s key-value-query architecture, we can consider attention as a way of calculating the distance or relevance between pairs of tokens in a given sequence. Put differently, the landmarks tell us the distance between the first query and all other keys, as well as the distance between the first key and all other queries.
Without loss of generality, we can approximate the attention score between any $i$th query and $j$th key using these landmarks. The way we do this is somewhat similar to the point A, B, C example we briefly discussed earlier. Namely, we start by looking at the attention score between the $i$th query and the first key. Then, we also look at the attention score between the first query and the $j$th key. Connecting the two dots gives us a sense of how related the $i$th query and the $j$th key are. To remove the redundancy, we divide the product by the self-attention of the first token, or the attention score between the first key and query.
\[A_{ij} = \frac{q_i k_1^\top \cdot q_1 k_j^\top}{q_1 k_1^\top} \tag{1}\]Of course, if we have multiple landmarks, we can easily expand the expression above into matrix form. The tilde indicates landmark rows and columns.
\[\tilde{A} = Q \tilde{K}^\top \times (\tilde{Q} \tilde{K}^\top)^\star \times \tilde{Q} K^\top \tag{2}\]The star expression ($\star$) denotes the Moore-Penrose pseudo-inverse.
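As a quick sanity check of (1): with a single landmark, the approximation is exact whenever the matrix has rank 1. A small numpy sketch, with illustrative values of my own choosing:

```python
import numpy as np

q = np.array([[1.0], [2.0], [-0.5], [0.3], [1.5]])
k = np.array([[0.7], [-1.2], [0.4], [2.0], [-0.3]])
A = q @ k.T                                   # a rank-1 "attention" matrix
# Equation (1) applied to every (i, j), using the first row/column as landmark.
A_hat = np.outer(A[:, 0], A[0, :]) / A[0, 0]
print(np.allclose(A, A_hat))  # True: one landmark recovers a rank-1 matrix
```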
Now that we have a general intuition of how Nyström approximation works in the context of attention, let’s get into some basic implementation.
The goal here is to see that Nyström approximation can indeed yield reasonably accurate results, and that the larger the number of key landmarks, the better the approximation. Consider this as a form of Monte Carlo experiment.
Let’s begin by importing some modules.
import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_format="retina"
For sake of simplicity, we assume a very basic model with a hidden dimension of 2, and some data points whose sequence length is 5. We also omit the batch dimension.
Then, in the context of attention, we would end up with the following keys and query tensors, as well as a five-by-five square attention matrix.
d_model = 2
seq_len = 5
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
A = Q @ K.T
A.shape
(5, 5)
The goal, then, is to approximate this square attention matrix.
A
array([[ 2.29571874, -0.7373519 , 0.32730778, -0.84730782, -1.16558083],
[ 1.4346883 , -0.32765206, 0.80095764, -0.39437617, 0.17889744],
[ 1.38973136, -0.61066937, -0.53783773, -0.67968999, -1.82523199],
[-1.80977456, 0.1036656 , -2.39735444, 0.18320197, -2.33569844],
[ 1.36516091, -0.40695455, 0.33580143, -0.47186895, -0.47836287]])
Let’s begin our approximation by assuming the worst case, in which we only have access to one landmark. This brings us back to equation (1), where essentially all operations are done on vectors instead of matrices.
num_landmarks = 1
Q_tilde = Q[:num_landmarks]
K_tilde = K[:num_landmarks]
Recalling equations (1) and (2), we can now write the approximation of the attention matrix as follows.
\[\tilde{A} = Q \tilde{K}^\top \times (\tilde{Q} \tilde{K}^\top)^\star \times \tilde{Q} K^\top\]A_tilde = (Q @ K_tilde.T) @ np.linalg.pinv(Q_tilde @ K_tilde.T) @ (Q_tilde @ K.T)
A_tilde.shape
(5, 5)
The dimensionality seems to match that of the original attention matrix, as expected. If we print out the approximation, we should expect to see exact matches in the first row and column; the rest of the four-by-four region of the matrix should roughly be similar to that of the original.
A_tilde
array([[ 2.29571874, -0.7373519 , 0.32730778, -0.84730782, -1.16558083],
[ 1.4346883 , -0.46080128, 0.20454799, -0.52951722, -0.72841901],
[ 1.38973136, -0.44636176, 0.19813834, -0.51292444, -0.7055935 ],
[-1.80977456, 0.58127361, -0.25802521, 0.66795471, 0.91885757],
[ 1.36516091, -0.43847008, 0.19463525, -0.50385594, -0.69311861]])
We can indeed quickly verify that the first row and column are exact matches; however, the remaining 16 elements are somewhat difficult to compare. We can more systematically calculate the difference between the two matrices by using a norm, such as the Frobenius norm.
np.linalg.norm(A - A_tilde)
4.33185890598477
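To double-check the claim about exact matches programmatically, here is a self-contained rerun of the same steps with a fixed seed (the seed value is arbitrary). The first row and column agree exactly because the landmark row passes through the pseudoinverse unchanged; the relative error summarizes the rest.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, just for reproducibility
d_model, seq_len, num_landmarks = 10, 5, 1
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
A = Q @ K.T

Q_tilde, K_tilde = Q[:num_landmarks], K[:num_landmarks]
A_tilde = (Q @ K_tilde.T) @ np.linalg.pinv(Q_tilde @ K_tilde.T) @ (Q_tilde @ K.T)

assert np.allclose(A[0], A_tilde[0])        # first row is exact
assert np.allclose(A[:, 0], A_tilde[:, 0])  # first column is exact
rel_err = np.linalg.norm(A - A_tilde) / np.linalg.norm(A)
```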
If we look at the raw value of the subtraction, we can see that the approximation isn’t too bad.
A - A_tilde
array([[ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00],
[-2.22044605e-16, 1.33149223e-01, 5.96409654e-01,
1.35141056e-01, 9.07316456e-01],
[ 0.00000000e+00, -1.64307605e-01, -7.35976069e-01,
-1.66765549e-01, -1.11963848e+00],
[ 0.00000000e+00, -4.77608006e-01, -2.13932924e+00,
-4.84752738e-01, -3.25455600e+00],
[ 0.00000000e+00, 3.15155316e-02, 1.41166181e-01,
3.19869853e-02, 2.14755744e-01]])
Let’s extend this little trial with one landmark to larger matrices. For ease of execution, I’ve wrapped each step outlined above into functions.
The first function, norms_by_landmarks, receives query and key matrices, then approximates the attention matrix while varying the number of landmarks. The Frobenius norm is used to measure how good each approximation is. Theoretically, we should expect to see a downward-sloping pattern.
def norms_by_landmarks(Q, K):
    result = []
    A = Q @ K.T
    for num_landmarks in range(1, len(Q) + 1):
        Q_tilde = Q[:num_landmarks]
        K_tilde = K[:num_landmarks]
        A_tilde = (Q @ K_tilde.T) @ np.linalg.pinv(Q_tilde @ K_tilde.T) @ (Q_tilde @ K.T)
        result.append(np.linalg.norm(A - A_tilde))
    return np.asarray(result)
The second function, run_experiments, is a wrapper around the first one. It repeats the same experiment for a specified number of iterations. The purpose of repetition is essentially to remove the element of luck, in which some randomly initialized key and query matrices happen to be configured such that the Nyström approximation performs unusually well or poorly. By repeating the experiment and averaging the results, which is the spirit behind Monte Carlo methods, we can have more confidence in the final result.
def run_experiments(d_model, seq_len, num_iter=10):
    result = 0
    for _ in range(num_iter):
        Q = np.random.randn(seq_len, d_model)
        K = np.random.randn(seq_len, d_model)
        norm = norms_by_landmarks(Q, K)
        result += norm
    return result / num_iter
Here, we assume a sequence length of 50, and the hidden size of the model (or the embedding size) to be 10. And off we go!
norms = run_experiments(d_model=10, seq_len=50)
plt.plot(range(len(norms)), norms)
plt.show()
While there is some noise in the final outcome, we do see that beyond a certain number of landmarks, the approximation yields near-exact results; in this case, around 10 landmarks. This is no coincidence: with $d_{model} = 10$, the attention matrix $Q K^\top$ has rank at most 10, so 10 linearly independent landmarks carry enough information to reconstruct it almost exactly.
Transformers have now taken over much of the ML world, even beyond NLP. Recently, I came across a paper titled Pretrained Transformers are Universal Computation Engines. Apparently, pretrained transformer LMs can perform extremely well on downstream tasks with minimal fine-tuning. Specifically, even if the feedforward and attention portions of the network are frozen, which amounts to nearly 99 percent of the entire model, transformer LMs can be fine-tuned to a wide array of tasks that are not even NLP-related.
While there is certainly a possibility that a new SOTA model architecture will be announced by researchers in the near future, similar to how transformers made LSTMs obsolete in many fields, I think transformers are here to stay for a while longer. And it’s certainly interesting to see attempts to make them even better, lighter, and faster. Nyströmformer was one such attempt, and I hope to see more.
Let’s dive right into it!
If you’re already familiar with transformers, you probably know that transformers process all input tokens in parallel. This is one of the many reasons why transformers have been immensely more successful than RNNs: RNNs struggle to capture long-range dependencies due to their recurrent structure, whereas transformers do not have this problem since they can see the entire sequence at once. However, this also means that transformers require positional encodings to inform the model of where specific tokens are located within the full sequence. Otherwise, transformers would be entirely invariant to sequential information, considering “John likes cats” and “Cats like John” as identical. Hence, positional encodings are used to signal the absolute position of each token.
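For reference, absolute positions are most commonly encoded with the sinusoidal scheme from the original transformer paper. Here is a minimal numpy sketch (the function name is my own):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # PE[pos, 2i]     = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i + 1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sines
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosines
    return pe

pe = sinusoidal_encoding(seq_len=5, d_model=8)
pe.shape  # (5, 8)
```

Each position receives a unique vector, which is simply added to the token embeddings before the first attention layer.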
While absolute positional encodings work reasonably well, there have also been efforts to exploit pairwise, relative positional information. In Self-Attention with Relative Position Representations, Shaw et al. introduced a way of using pairwise distances as a way of creating positional encodings.
There are a number of reasons why we might want to use relative positional encodings instead of absolute ones. First, using absolute positional information necessarily means that there is a limit to the number of tokens a model can process. Say a language model can only encode up to 1024 positions. This necessarily means that any sequence longer than 1024 tokens cannot be processed by the model. Using relative pairwise distances can more gracefully solve this problem, though not without limitations. Relative positional encodings can generalize to sequences of unseen lengths, since theoretically the only information it encodes is the relative pairwise distance between two tokens.
Relative positional information is supplied to the model on two levels: values and keys. This becomes apparent in the two modified self-attention equations shown below. First, relative positional information is supplied to the model as an additional component to the keys.
\[e_{ij} = \frac{x_i W^Q (x_j W^K + a_{ij}^K)^\top}{\sqrt{d_z}} \tag{1}\]
The softmax operation remains unchanged from vanilla self-attention.
\[\alpha_{ij} = \frac{\text{exp} \space e_{ij}}{\sum_{k = 1}^n \text{exp} \space e_{ik}}\]
Lastly, relative positional information is supplied again as a sub-component of the values matrix.
\[z_i = \sum_{j = 1}^n \alpha_{ij} (x_j W^V + a_{ij}^V) \tag{2}\]
In other words, instead of simply combining semantic embeddings with absolute positional ones, relative positional information is added to the keys and values on the fly during attention calculation.
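Equations (1) and (2) can be sketched in a rough single-head numpy implementation. All names below are my own, and I use random $a^K$ and $a^V$ tensors rather than the clipped-window embeddings used in the paper:

```python
import numpy as np

def relative_self_attention(x, Wq, Wk, Wv, aK, aV):
    # x: (L, d) token representations; aK, aV: (L, L, d) relative embeddings
    L, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # equation (1): e_ij = q_i . (k_j + a^K_ij) / sqrt(d)
    e = (q @ k.T + np.einsum("id,ijd->ij", q, aK)) / np.sqrt(d)
    e -= e.max(axis=-1, keepdims=True)  # numerically stable softmax
    alpha = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)
    # equation (2): z_i = sum_j alpha_ij (v_j + a^V_ij)
    return alpha @ v + np.einsum("ij,ijd->id", alpha, aV)

rng = np.random.default_rng(0)
L, d = 5, 8
x = rng.normal(size=(L, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
aK = rng.normal(size=(L, L, d))
aV = rng.normal(size=(L, L, d))
z = relative_self_attention(x, Wq, Wk, Wv, aK, aV)
z.shape  # (5, 8)
```

Note the extra `einsum` terms: they are what distinguish this from vanilla self-attention, injecting a per-pair positional contribution into both the logits and the output.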
In Huang et al., also known as the music transformer paper, the authors pointed out that calculating relative positional encodings as introduced in Shaw et al. requires $O(L^2D)$ memory due to the introduction of an additional relative positional encoding matrix. Here, $L$ denotes the length of the sequence, and $D$, the hidden state dimension used by the model. Huang et al. introduced a new way of computing relative positional encoding via a clever skewing operation.
To cut to the chase, below is the relative attention mechanism suggested by the authors in Huang et al.
\[\text{RelativeAttention} = \text{Softmax} \left( \frac{Q K^\top + S_{rel}}{\sqrt{D_h}} \right) V \tag{3}\]
In the music transformer paper, the authors dropped the additional relative positional embedding corresponding to the value term and focused only on the key component. In other words, they consider only (1), not (2).
The notations in (1), (2), and (3) were each borrowed verbatim from the authors of the two papers, so there is some notational mixup that requires attention. Specifically, $S_{rel}$ in the music transformer paper is simply
\[S_{rel} = Q R^\top\]
where
\[R_{ij} = a_{ij}^K\]
In other words, (3) is just an expanded variant of (1).
To make things a little clearer, let’s review the dimensions of each tensor. First, from vanilla self-attention, we know that $Q \in \mathbb{R}^{H \times L \times D_h}$, where $H$ denotes the number of heads. Since each $R_{ij} = a_{ij}^K$ is itself a $D_h$-dimensional vector, $R \in \mathbb{R}^{L \times L \times D_h}$ (it can be shared across heads), and $S_{rel} \in \mathbb{R}^{H \times L \times L}$. Intuitively, $R$ can be understood as the result of passing a matrix of relative positional indices through an embedding layer.
The skewing mechanism introduced in Huang et al. is ingenious, but it isn’t black magic. The technique can roughly be understood as a set of clever padding and matrix manipulation operations that ultimately produce $S_{rel}$ without explicitly creating or computing $R$. The reason we want to avoid materializing $R$ is that it is a huge memory bottleneck: the tensor requires $O(L^2 D)$ extra space.
For concreteness, here is a dummy function that creates the relative positional indices that would feed such an embedding layer:
def relative_positions(seq_len):
    result = []
    for i in range(seq_len):
        front = list(range(-i, 0))
        end = list(range(seq_len - i))
        result.append(front + end)
    return result
Let’s see what the indices look like for a sequence of length five.
relative_positions(5)
[[0, 1, 2, 3, 4],
[-1, 0, 1, 2, 3],
[-2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1],
[-4, -3, -2, -1, 0]]
We can understand each row as indicating the current position of attention, and each index as representing the distance between the current token and the token corresponding to the index. A quick disclaimer that this example does not strictly follow the details outlined in Shaw et al. For instance, this function does not take into account $k$, or the width of the window. The 0-based indexing scheme is also from Huang et al. These minor details notwithstanding, having a clear sense of what $R$ is, I think, is very helpful in understanding relative attention, as well as the skewing mechanism introduced in Huang et al. For a fuller explanation of these concepts, I highly recommend this medium article.
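To make $R$ fully concrete, here is a small numpy sketch that materializes it explicitly via a hypothetical embedding table `E`, where row $L - 1 + k$ holds the embedding of relative distance $k$ (the names `E` and `idx` are my own). The resulting `(L, L, d)` tensor is exactly the $O(L^2 D)$ object that the skewing mechanism avoids:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 5, 4
Q = rng.normal(size=(L, d))
# hypothetical embedding table: row (L - 1 + k) holds the embedding of distance k
E = rng.normal(size=(2 * L - 1, d))
idx = np.arange(L)[None, :] - np.arange(L)[:, None]  # idx[i, j] = j - i
R = E[idx + L - 1]                                   # R.shape == (L, L, d)
S_rel = np.einsum("id,ijd->ij", Q, R)                # S_rel[i, j] = q_i . r_ij
S_rel.shape  # (5, 5)
```

This is the naive route; the skewing trick computes the same $S_{rel}$ while only ever touching an $(L, d)$ embedding matrix.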
Below is a visual summary of the skewing mechanism.
Personally, I found this diagram to be a bit confusing at first. However, with some staring and imagination, I slowly started to realize that skewing is simply a way of transforming $QE_r^\top$ into $QR^\top$, where $E_r$ is the relative positional embedding matrix.
Instead of trying to explain this in plain text, I decided that implementing the entire relative global attention mechanism would not only help with demonstration, but also cement my own understanding of how this works.
This implementation of relative global attention was in large part influenced by Karpathy’s minGPT, which we discussed in this previous post, as well as Prayag Chatha’s implementation of the music transformer, available on GitHub here.
import math
import torch
from torch import nn
import torch.nn.functional as F
Below is a simple implementation of a relative global attention layer. I’ve deviated from Chatha’s implementation in a number of ways, but the most important, and probably the most worth mentioning, is how I treat the relative positional embedding matrix. In Shaw et al., the authors note that “[relative positional embeddings] can be shared across attention heads.” Hence, I’m using one Er matrix to handle all heads, instead of creating one per head. This matrix is registered as an nn.Parameter.
class RelativeGlobalAttention(nn.Module):
    def __init__(self, d_model, num_heads, max_len=1024, dropout=0.1):
        super().__init__()
        d_head, remainder = divmod(d_model, num_heads)
        if remainder:
            raise ValueError(
                "incompatible `d_model` and `num_heads`"
            )
        self.max_len = max_len
        self.d_model = d_model
        self.num_heads = num_heads
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.query = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.Er = nn.Parameter(torch.randn(max_len, d_head))
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(max_len, max_len))
            .unsqueeze(0).unsqueeze(0)
        )
        # self.mask.shape = (1, 1, max_len, max_len)

    def forward(self, x):
        # x.shape == (batch_size, seq_len, d_model)
        batch_size, seq_len, _ = x.shape
        if seq_len > self.max_len:
            raise ValueError(
                "sequence length exceeds model capacity"
            )

        k_t = self.key(x).reshape(batch_size, seq_len, self.num_heads, -1).permute(0, 2, 3, 1)
        # k_t.shape = (batch_size, num_heads, d_head, seq_len)
        v = self.value(x).reshape(batch_size, seq_len, self.num_heads, -1).transpose(1, 2)
        q = self.query(x).reshape(batch_size, seq_len, self.num_heads, -1).transpose(1, 2)
        # shape = (batch_size, num_heads, seq_len, d_head)

        start = self.max_len - seq_len
        Er_t = self.Er[start:, :].transpose(0, 1)
        # Er_t.shape = (d_head, seq_len)
        QEr = torch.matmul(q, Er_t)
        # QEr.shape = (batch_size, num_heads, seq_len, seq_len)
        Srel = self.skew(QEr)
        # Srel.shape = (batch_size, num_heads, seq_len, seq_len)

        QK_t = torch.matmul(q, k_t)
        # QK_t.shape = (batch_size, num_heads, seq_len, seq_len)
        attn = (QK_t + Srel) / math.sqrt(q.size(-1))
        mask = self.mask[:, :, :seq_len, :seq_len]
        # mask.shape = (1, 1, seq_len, seq_len)
        attn = attn.masked_fill(mask == 0, float("-inf"))
        # attn.shape = (batch_size, num_heads, seq_len, seq_len)
        attn = F.softmax(attn, dim=-1)
        out = torch.matmul(attn, v)
        # out.shape = (batch_size, num_heads, seq_len, d_head)
        out = out.transpose(1, 2)
        # out.shape == (batch_size, seq_len, num_heads, d_head)
        out = out.reshape(batch_size, seq_len, -1)
        # out.shape == (batch_size, seq_len, d_model)
        return self.dropout(out)

    def skew(self, QEr):
        # QEr.shape = (batch_size, num_heads, seq_len, seq_len)
        padded = F.pad(QEr, (1, 0))
        # padded.shape = (batch_size, num_heads, seq_len, 1 + seq_len)
        batch_size, num_heads, num_rows, num_cols = padded.shape
        reshaped = padded.reshape(batch_size, num_heads, num_cols, num_rows)
        # reshaped.shape = (batch_size, num_heads, 1 + seq_len, seq_len)
        Srel = reshaped[:, :, 1:, :]
        # Srel.shape = (batch_size, num_heads, seq_len, seq_len)
        return Srel
Most of the operations in the forward method are code translations of the equations we discussed above. The interesting bit happens in the skew method. Basically, we pad $Q E_r^\top$ on the left, reshape to shift all indices, then slice out the necessary portion of the matrix to obtain $Q R^\top$, or $S_{rel}$. This has the benefit of reducing the memory requirement: since we never calculate $R$ and can instead directly use $E_r$, a matrix that is needed anyway, the extra memory requirement drops to $O(L D)$. This is, in my view, one of the biggest contributions of Huang et al.
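To convince ourselves that the padding-and-reshape trick really recovers $S_{rel}$, here is a stripped-down numpy sketch (single head, no batch dimension, max_len equal to seq_len; all names are my own). It checks the skewed output against a naive embedding lookup on the lower triangle, which is the only region that survives the causal mask:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 3
q = rng.normal(size=(L, d))
Er = rng.normal(size=(L, d))  # Er[L - 1] encodes distance 0, Er[L - 1 - k] distance -k
QEr = q @ Er.T                # (L, L)

# the skew trick: pad one column on the left, reshape, drop the first row
padded = np.pad(QEr, ((0, 0), (1, 0)))  # (L, L + 1)
Srel = padded.reshape(L + 1, L)[1:, :]  # (L, L)

# naive lookup for the lower triangle (the upper triangle is masked out anyway)
for i in range(L):
    for j in range(i + 1):
        assert np.isclose(Srel[i, j], q[i] @ Er[L - 1 + j - i])
```

The flattening works because padding shifts each row’s entries by one extra slot, so reinterpreting the buffer with swapped dimensions realigns each row to its relative distance.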
Let’s quickly check that the layer works as intended by performing a basic tensor shape check.
batch_size = 8
seq_len = 100
d_model = 768
num_heads = 12
test_in = torch.randn(batch_size, seq_len, d_model)
l = RelativeGlobalAttention(d_model, num_heads)
l(test_in).shape
torch.Size([8, 100, 768])
We get an output of size (batch_size, seq_len, d_model), which is what we expect.
In this post, we discussed relative positional encoding as introduced in Shaw et al., and saw how Huang et al. was able to improve this algorithm by introducing optimizations.
Relative positional encodings were used in other architectures, such as Transformer XL, and more recently, DeBERTa, which I also plan on reviewing soon. Relative positioning is probably a lot closer to how we humans read text. While it is probably not a good idea to always compare and conflate model architectures with how the human brain works, I still think it’s an interesting way to think about these concepts.
This post was also a healthy exercise in that it really forced me to try to understand every single detail. Every sentence and diagram can be of huge help when you are trying to actually implement ideas that are outlined in published papers. I could see why Papers with Code became such a huge thing. It’s always helpful to see actual implementations and, even better, reproducible results. In this particular post, referencing music transformer implementations on GitHub and re-reading the paper many times really helped me nail down points that were initially confusing or unclear.
I hope you’ve enjoyed reading this post. Catch you in the next one!