Introduction to seq2seq models

12 minute read

For a very long time, I’ve been fascinated by sequence-to-sequence models. Give the model a photo as input, it spits out a caption to go along with it; give it some English text, it can translate it into another language. Seq2seq models are also not only widely applicable in different contexts, but it also arguably laid the groundwork for other more advanced models that came after, such as attention and transformers. In studying seq2seq models, I found Ben Trevett’s sequence modeling tutorials extremely helpful. In particular, this post is heavily based off of this notebook. With that cleared up, let’s get started!


For this tutorial, we will need to import a number of dependencies, mainly from torch and torchtext. torchtext is a library that provides a nice interface to dealing with text-based data in PyTorch.

import random
import time

import torch
import torchtext
from torch import nn
from import BucketIterator, Field
from torchtext.datasets import Multi30k


torchtext includes a Field class, which essentially allows us to define some preprocessing steps to be applied on the data. We will be using the Multi30k dataset, which contains translations of short texts from many languages. In this tutorial, we will be using German and English, so we define preprocessing steps for each language. The preprocessing, as defined below, tells torchtext to:

  • Tokenize the dataset using spacy
  • Prepend each line with "<sos>" and "<eos>" tokens
  • Lowercase every word
SRC = Field(

TRG = Field(

If you get errors running this command, make sure that you have downloaded English and German models for spacy. This is required since torchtext internally uses spacy to tokenize the text.

!python -m spacy download en
!python -m spacy download de

We can now prepare the data by calling split() on the Multi30k dataset, using the fields we have defined above.

train_data, validation_data, test_data = Multi30k.splits(
    root="data", exts=(".de", ".en"), fields=(SRC, TRG)

Let’s quickly check how many data there are for each train, validation, and test split.

for data in (train_data, validation_data, test_data):

It’s also helpful to see what’s actually in each of these splits. Let’s take a look at the very first data in the training set.

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}

We see that each example is a dictionary containing English and German sentences. Note also that they have been tokenized; each translation is a list containing words and punctuations.

SRC.build_vocab(train_data, max_size=10000, min_freq=2)
TRG.build_vocab(train_data, max_size=10000, min_freq=2)

By calling build_vocab() on each field, we can tell torchtext which vocabulary to keep. Internally, this process triggers each field to have stoi, or token to index and itos, the reverse lookup.

print(TRG.init_token, TRG.eos_token, TRG.pad_token)
<sos> <eos> <pad>

We can see that the vocabulary includes not only the words actually in the dataset, but also special tokens such as start, end, and padding tokens.

print(f"Unique tokens in source (ger) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (eng) vocabulary: {len(TRG.vocab)}")
Unique tokens in source (ger) vocabulary: 7854
Unique tokens in target (eng) vocabulary: 5893

The vocab attribute gives us access to the entire vocabulary in each of the fields. We can see that the German vocabulary is slightly larger than English. Let’s also quickly check the stoi dictionary to see if we can obtain index values. We can see that the start token is the second in the lookup table.



Using BucketIterator, we can batch these field datasets into their PyTorch data loader equivalents. Let’s set a batch size and create some iterators.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, validation_iterator, test_iterator = BucketIterator.splits(
    (train_data, validation_data, test_data), batch_size=BATCH_SIZE, device=device

With batching, we get 128 examples at once. The zeroth index of the batch, as can be seen below, is a list containing 128 2’s. This is because 2 is the index that corresponds to the starting token. Since all examples start with "<sos>", we have 2’s at the zeroth index of the batch.

for batch in train_iterator:
tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2])

The data preparation steps might felt a little boring, as it’s just interacting with the torchtext API, but it’s necessary nonetheless. Now, it’s finally time for some modeling!


A typical sequence-to-sequence model is, by design, composed of an encoder and a decoder. The encoder takes in a sequence as input, processes it to formulate some hidden state, and eventually passes on that hidden state and cell state to the decoder. The underlying assumption here is that the hidden and cell states should be able to encode some long and short term memory of the encoder network. The decoder, by accepting these two states, should have an understanding of the input data. Based on this understanding, it is then trained to output a corresponding sequence.


Let’s first take a look at the implementation of the encoder network. It is a simple LSTM model that consists of an embedding layer and an LSTM layer. At the end of the forward pass, we return the hidden and cell states. Note that vocab_size is the total length of the index of the source language.

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, dropout):
        super(Encoder, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers, dropout=dropout)
    def forward(self, x):
        # x.shape == (src_seq_len, batch_size) == (36, 128)
        embedding = self.dropout(self.embed(x))
        # embedding.shape == (36, 128, embed_dim) == (36, 128, 256)
        outputs, (hidden, cell) = self.lstm(embedding)
        # outputs.shape == (36, 128, hidden_size) == (36, 128, 512)
        # hidden.shape, cell.shape == (2, 128, hidden_size) == (2, 128, 512)
        return hidden, cell


The decoder is similar to the encoder, but has a little more moving parts. On a high level, the decoder is also an LSTM model like the encoder. In particular, its LSTM layer is configured in such a way that it is able to accept the hidden and cell states from the encoder. One obvious difference is in terms of the vocabulary size, which is now the size of the target language, not the source language. Also, the decoder has a fully connected layer that acts as a classifier. Hence, the size of the output dimension is equal to the vocabulary size of the target language.

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, dropout):
        super(Decoder, self).__init__()
        self.vocab_size = vocab_size
        self.dropout = nn.Dropout(dropout)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers, dropout=dropout)
        self.fc = nn.Linear(hidden_size, vocab_size)
    def forward(self, x, hidden, cell):
        # x.shape == (128,)
        x = x.unsqueeze(0)
        # x.shape == (1, 128)
        embedding = self.dropout(self.embed(x))
        # embedding.shape = (1, 128, 256)
        outputs, (hidden, cell) = self.lstm(embedding, (hidden, cell))
        # outputs.shape == (1, 128, 512)
        outputs = outputs.squeeze(0)
        # outputs.shape == (128, 512)
        predictions = self.fc(outputs)
        # predictions.shape == (128, vocab_size)
        return predictions, hidden, cell


Now, it’s time to put the two models together in a sequence-to-sequence model. The overall flow of data looks as follows:

  • The encoder encodes input data
  • The hidden and cell states of the encoder is passed to the decoder
  • For each single time step, the decoder generates a prediction
  • The decoder uses its own hidden and cell states for the next time step
  • Depending on the teacher step ratio, the decoder uses either uses its previous prediction or the actual target data to generate the next prediction

I resorted to a convenient bullet point listing to summarize everything, but let’s break this down a bit. Within the forward pass of the seq2seq model, the encoder encodes input data, which, in this case, are German sentences. Then, the decoder accepts the hidden and cell states of the encoder, as well as the zeroth index of the target language batch. This zeroth index will simply be a bunch of starting tokens, as we saw earlier. Then, the decoder will generate a prediction using these starting tokens and encoder states.

The interesting part comes thereafter. We set some teacher force ratio, which is a number between zero and one. There are two ways through which the decoder can generate the next prediction. Either it can use its own prediction from the previous time step, or, as “teachers,” we can nudge the decoder in the correct direction by telling them what the correct prediction should have been from the previous time step. This teacher guidance is helpful, since at the beginning of training, the model might struggle to generate correct predictions using its own previous predictions.

Below is the full implementation of the model.

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
    def forward(self, source, target, teacher_force_ratio=0.5):
        seq_len = target.size(0)
        batch_size = target.size(1)    
        outputs = torch.zeros(
            seq_len, batch_size, self.decoder.vocab_size
        hidden, cell = self.encoder(source)
        x = target[0]
        # x.shape == (128,)
        for t in range(1, seq_len):
            predictions, hidden, cell = self.decoder(x, hidden, cell)
            outputs[t] = predictions
            teacher_force = random.random() < teacher_force_ratio
            if teacher_force:
                x = target[t]
                x = predictions.argmax(1)
        return outputs


Now that we have a seq2seq model, let’s write some code for the training loop. Below are some preliminary quantities that we will to set up the encoder and decoder models.

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
HID_DIM = 512

model = Seq2Seq(encoder, decoder, device).to(device)

Let’s also initialize some weights with a uniform distribution.

def init_weights(model):
    for name, param in model.named_parameters():
        nn.init.uniform_(, -0.08, 0.08)

  (encoder): Encoder(
    (dropout): Dropout(p=0.5, inplace=False)
    (embed): Embedding(7854, 256)
    (lstm): LSTM(256, 512, num_layers=2, dropout=0.5)
  (decoder): Decoder(
    (dropout): Dropout(p=0.5, inplace=False)
    (embed): Embedding(5893, 256)
    (lstm): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc): Linear(in_features=512, out_features=5893, bias=True)

We see that this is quite a big model, with a total of 13898757 trainable parameters.

sum(p.numel() for p in model.parameters() if p.requires_grad)

Let’s define the optimizer and criterion for training. Since the model basically outputs logits for a distribution, we can consider it to be a classification problem. Hence we use cross entropy loss, with the minor caveat that we ignore the padding index.

TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)
optimizer = torch.optim.Adam(model.parameters())


For each batch, we obtain the prediction output of the model. There are several details to take care of when calculating the loss. Namely, we need to cut off the first time step of the target and predicted outputs. This is because the output and predictions will look as follows, assuming no extraneous padding:

\[\hat{y} = [0, \hat{y_1}, \hat{y_2}, \dots, \hat{y_n}, \text{<eos>}] \\ y = [\text{<sos>}, y_1, y_2, \dots, y_n, \text{<eos>}]\]

If you examine the Seq2Seq model we designed earlier, you will see that t starts from 1, meaning that the output tensor’s zeroth index will be left untouched as zeros; hence the 0 in $\hat{y}$. Therefore, we need to slice the tensors and start from the first index. Moreover, we shape the tensors to be two-dimensional since cross entropy expects the predictions to be two-dimensional; the labels, one-dimensional. Last but not least, we clip the gradients to prevent exploding gradients.

def train(model, iterator, optimizer, criterion, clip):
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        source = batch.src
        target = batch.trg
        # source.shape == (batch_seq_len, 128)
        # target.shape == (batch_seq_len, 128)
        output = model(source, target)
        # output.shape == (batch_seq_len, batch_size, vocab_size) == (25, 128, 5893)
        output = output[1:].reshape(-1, output.size(2))
        # output.shape == (3072, 5893)
        target = target[1:].reshape(-1)
        # target.shape == (3072)
        loss = criterion(output, target)
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        epoch_loss += loss.item()
    return epoch_loss / (i + 1)


Next, we evaluate the model. The structure is almost identical to that of the training loop, except that set the model to evaluation mode and execute every forward pass within the torch.no_grad() statement.

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            source = batch.src
            target = batch.trg
            output = model(source, target, teacher_force_ratio=0)
            target = target[1:].reshape(-1)
            output = output[1:].reshape(-1, output.size(2))
            loss = criterion(output, target)
            epoch_loss += loss.item()
    return epoch_loss / (i + 1)

Now, we train the model! We save the model weights if the validation loss is higher than the best validation loss prior to the current iteration.

CLIP = 1
best_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    validation_loss = evaluate(model, validation_iterator, criterion)
    end_time = time.time()
    total_time = end_time - start_time
    if validation_loss < best_loss:
        best_loss = validation_loss, './data/seq2seq/')
        f"Epoch [{epoch+1}{N_EPOCHS}], Time: {total_time}s, "
        f"Train Loss: {train_loss:.3f}, Val. Loss: {validation_loss:.3f}"


Because I wrote all of this code locally, I didn’t train the model. I did run the training loop a few time in Google Colab, but even that took a while. I decided that I’d save my GPU quota for more interesting models later.

The end goal of this tutorial was to gain a deeper understanding of how encoder-decoder sequence-to-sequence models are implemented. I remember reading about neural machine translation through the TensorFlow website around a year ago when I was first learning deep learning with Keras, and it’s just great to see that I’ve made enough progress to be able to understand, digest, and model a basic sequence-to-sequence NMT model with PyTorch.

But again, the model we have implemented today is extremely simple in terms of its design, and there are many more enhancements we can apply to it. In the coming posts, we will be taking a look at some better, more advanced seq2seq models that implement features like attention.

I hope you’ve enjoyed reading this post. Catch you up in the next one!