How to pad sequences of tokenized input to equal length?
Published on Aug. 22, 2023, 12:19 p.m.
To pad sequences of tokenized input (for example, the token IDs produced by a HuggingFace Transformers tokenizer) to equal length, you can use the `pad_sequence` function from the `torch.nn.utils.rnn` module in PyTorch, or the `pad_sequences` function from the `keras.preprocessing.sequence` module in Keras, depending on your framework.
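If you are working with a HuggingFace tokenizer directly, it can also pad for you through its `padding` argument. Here is a minimal sketch, using the `bert-base-uncased` checkpoint purely for illustration:

```python
from transformers import AutoTokenizer

# Checkpoint chosen only for illustration; any pretrained tokenizer works the same way
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["Hello world", "A somewhat longer example sentence"]

# padding=True pads every sequence in the batch to the longest one and
# returns an attention_mask that marks which positions are real tokens
batch = tokenizer(texts, padding=True, return_tensors="pt")

print(batch["input_ids"].shape)   # (2, max_length_in_batch)
print(batch["attention_mask"])    # 1 for real tokens, 0 for padding
```

If you have already tokenized your text and only have lists of token IDs, the framework-level helpers below do the same job.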
Here’s an example of how to use `pad_sequence` from the `torch.nn.utils.rnn` module to pad a list of tokenized sequences to equal length:
```python
from torch.nn.utils.rnn import pad_sequence
import torch

# Three variable-length sequences of token IDs
tokenized_sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]

# Pad every sequence to the length of the longest one
padded_sequences = pad_sequence(tokenized_sequences, batch_first=True)
```
In this example, we’ve created a list of three tokenized sequences represented by PyTorch tensors. We’ve then used `pad_sequence` with `batch_first=True`, which stacks them into a single tensor of shape `(batch_size, max_length)`, padding the shorter sequences out to the length of the longest one.
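As a quick check, here is the same snippet with the optional `padding_value` argument spelled out (it defaults to `0.0`), along with the result you should see:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

tokenized_sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]

# padding_value sets the fill token; 0 is the default, so set it to your
# tokenizer's pad token ID if that differs
padded_sequences = pad_sequence(tokenized_sequences, batch_first=True, padding_value=0)

print(padded_sequences)
# tensor([[1, 2, 3],
#         [4, 5, 0],
#         [6, 0, 0]])
```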
You can also use the `pad_sequences` function from the `keras.preprocessing.sequence` module (exposed as `keras.utils.pad_sequences` in newer Keras releases) to achieve the same result with Keras. Here’s an example:
```python
from keras.preprocessing.sequence import pad_sequences

# Three variable-length sequences of token IDs as plain Python lists
tokenized_sequences = [[1, 2, 3], [4, 5], [6]]

# Pad with zeros at the end of each sequence ('post'); the default is 'pre'
padded_sequences = pad_sequences(tokenized_sequences, padding='post')
```
In this example, we’ve created a list of three tokenized sequences represented by regular Python lists. We’ve then used `pad_sequences` with `padding='post'` to pad the sequences to equal length, adding the padding zeros at the end of each sequence rather than at the beginning (the default behaviour is `'pre'`).
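For reference, here is the output for these inputs, together with the optional `maxlen` and `truncating` arguments you can use to cap sequences at a fixed length:

```python
from keras.preprocessing.sequence import pad_sequences

tokenized_sequences = [[1, 2, 3], [4, 5], [6]]

padded_sequences = pad_sequences(tokenized_sequences, padding='post')
print(padded_sequences)
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]

# maxlen fixes the output length; truncating='post' drops tokens from the end of longer sequences
fixed_length = pad_sequences(tokenized_sequences, maxlen=2, padding='post', truncating='post')
print(fixed_length)
# [[1 2]
#  [4 5]
#  [6 0]]
```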
I hope this helps! Let me know if you have any further questions.