How to pad sequences of tokenized input to equal length using the HuggingFace Transformers tokenizer

Published on Aug. 22, 2023, 12:19 p.m.

To tokenize text data with HuggingFace, you can use the AutoTokenizer class provided by the transformers library. Here is an example:

from transformers import AutoTokenizer

# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input text
text = "This is an example sentence."
tokens = tokenizer.tokenize(text)
print(tokens)

In this example, we use the AutoTokenizer class to initialize the tokenizer; the from_pretrained method downloads the pretrained tokenizer for bert-base-uncased. Then, we call the tokenize method on the tokenizer object to split the input text into subword tokens. Finally, we print the resulting list of tokens.
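
If you also need the numeric IDs that the model actually consumes, the tokenizer can produce them directly. A minimal sketch, reusing the tokenizer, text, and tokens from the example above:

# Convert the tokens to their vocabulary IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# Or encode the text in one step; encode adds special tokens such as [CLS] and [SEP]
encoded = tokenizer.encode(text)
print(encoded)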

The tokenizer also provides methods for batch encoding and decoding, padding, truncation, and more. You can refer to the HuggingFace documentation for more details, and see the short sketch below for one example.
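
For instance, batch_decode reverses the encoding step. A short sketch, again reusing the tokenizer from above:

# Encode a batch of texts to input IDs
batch = tokenizer(["First sentence.", "Second one."])

# Decode the IDs back to strings, dropping special tokens like [CLS] and [SEP]
decoded = tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True)
print(decoded)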

Here is an example of how to pad sequences of tokenized input to equal length using the HuggingFace Transformers tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

texts = ["Hello, world!", "This is a sentence.", "Short sentence."]
max_length = 10

# Pad the batch to equal length and truncate anything longer than max_length
padded_input = tokenizer(texts, padding=True, truncation=True, max_length=max_length)

print(padded_input)

In this example, we’ve used the AutoTokenizer class to load the pre-trained tokenizer for BERT. We’ve then called the tokenizer directly on a list of texts; with padding=True, shorter sequences are padded with the tokenizer’s padding token up to the length of the longest sequence in the batch, and truncation=True cuts off any sequence longer than max_length (10 tokens here, including special tokens).

The resulting padded_input is a dictionary-like BatchEncoding containing the padded and tokenized input, with the keys input_ids, attention_mask, and token_type_ids (for BERT models).
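
For example, the attention_mask shows where padding was applied (1 marks a real token, 0 marks a padding position):

# Inspect the first sequence and its attention mask
print(padded_input["input_ids"][0])
print(padded_input["attention_mask"][0])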

You can adjust the max_length parameter to set the upper bound on sequence length. Note that padding is not enabled by default: passing padding=True pads the sequences with the tokenizer’s padding token to the length of the longest sequence in the batch, while padding="max_length" pads every sequence to exactly max_length. The truncation parameter ensures that any sequence longer than max_length is truncated.
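
If you need every batch padded to the same fixed length regardless of its longest member (for example, to keep tensor shapes constant across batches), pass padding="max_length" instead. A sketch of this variant, also requesting PyTorch tensors via return_tensors (this assumes torch is installed):

fixed_input = tokenizer(
    texts,
    padding="max_length",  # pad every sequence to exactly max_length
    truncation=True,
    max_length=max_length,
    return_tensors="pt",   # return PyTorch tensors instead of Python lists
)
print(fixed_input["input_ids"].shape)  # torch.Size([3, 10])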