How to pad sequences of tokenized input to equal length?

Published on Aug. 22, 2023, 12:19 p.m.

To pad sequences of tokenized input to equal length when working with the HuggingFace Transformers tokenizer, you can use the pad_sequence function from the torch.nn.utils.rnn module or the pad_sequences function from the keras.preprocessing.sequence module, depending on your framework.
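
Before padding by hand, it's worth noting that the HuggingFace tokenizer can pad a whole batch for you via its padding argument. Here is a minimal sketch, using bert-base-uncased purely as an example checkpoint:

from transformers import AutoTokenizer

# Example checkpoint; substitute whichever model you are working with
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# padding=True pads every sequence to the length of the longest one in the batch
encoded = tokenizer(["The first sentence", "A much longer second example sentence"],
                    padding=True, return_tensors="pt")

print(encoded["input_ids"].shape)   # (2, max_length_in_batch)
print(encoded["attention_mask"])    # 1 for real tokens, 0 for padding

If you prefer to pad the token ID lists yourself, the examples below show how.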

Here’s an example of how to use pad_sequence from the torch.nn.utils.rnn module to pad a list of tokenized sequences to equal length:

import torch
from torch.nn.utils.rnn import pad_sequence

# Three variable-length sequences of token IDs
tokenized_sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]

# Pad to the length of the longest sequence; shorter sequences are filled with 0
padded_sequences = pad_sequence(tokenized_sequences, batch_first=True)

In this example, we’ve created a list of three tokenized sequences represented by PyTorch tensors. pad_sequence with batch_first=True stacks them into a single tensor of shape (batch_size, max_length), filling the shorter sequences with the default padding_value of 0.
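
Continuing the example above, the padded tensor looks like this, with zeros filling out the shorter rows; if your tokenizer uses a different pad token ID, you can pass it explicitly via padding_value (tokenizer here is assumed to be a HuggingFace tokenizer you have already loaded):

print(padded_sequences)
# tensor([[1, 2, 3],
#         [4, 5, 0],
#         [6, 0, 0]])

# Pad with the tokenizer's pad token instead of the default 0
# (assumes `tokenizer` has already been loaded, e.g. with AutoTokenizer)
padded_sequences = pad_sequence(tokenized_sequences, batch_first=True,
                                padding_value=tokenizer.pad_token_id)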

You can also use the pad_sequences function from the keras.preprocessing.sequence module to achieve the same result with Keras (in recent Keras versions it is also exposed as keras.utils.pad_sequences). Here’s an example:

from keras.preprocessing.sequence import pad_sequences

# Three variable-length sequences of token IDs as plain Python lists
tokenized_sequences = [[1, 2, 3], [4, 5], [6]]

# padding='post' appends zeros to the end of the shorter sequences
padded_sequences = pad_sequences(tokenized_sequences, padding='post')

In this example, we’ve created a list of three tokenized sequences represented by plain Python lists. pad_sequences with padding='post' pads every sequence to the length of the longest one, adding zeros at the end of each shorter sequence (the default is padding='pre', which pads at the beginning).
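
The result is a NumPy integer array. You can also force a fixed length with maxlen and choose the fill value with value, as sketched below:

print(padded_sequences)
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]

# Optional: pad (or truncate) every sequence to a fixed length with a custom fill value
padded_fixed = pad_sequences(tokenized_sequences, padding='post', maxlen=4, value=0)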

I hope this helps! Let me know if you have any further questions.
