How To pad sequences of tokenized input to equal length using the HuggingFace Transformers tokenizer
Published on Aug. 22, 2023, 12:19 p.m.
To tokenize text data using the HuggingFace Transformers library, you can load a pretrained tokenizer with the AutoTokenizer class. Here is an example:
from transformers import AutoTokenizer
# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Tokenize input text
text = "This is an example sentence."
tokens = tokenizer.tokenize(text)
print(tokens)
In this example, we use the AutoTokenizer class to initialize the tokenizer. The from_pretrained method downloads the pretrained tokenizer for the given model name. Then, we call the tokenize method on the tokenizer object to tokenize the input text. Finally, we print the resulting list of tokens.
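With bert-base-uncased, this should print a list of lowercased WordPiece tokens along the lines of ['this', 'is', 'an', 'example', 'sentence', '.'] (the exact output depends on the tokenizer's vocabulary).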
The tokenizer object also provides methods for batch encoding, decoding, padding, truncation, and more. You can refer to the HuggingFace documentation for more details on these methods.
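As a quick sketch of those encoding and decoding methods (the sentences below are just placeholders), you might do something like:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Encode a single sentence to token IDs (special tokens such as [CLS] and [SEP] are added)
encoded = tokenizer.encode("This is an example sentence.")
print(encoded)
# Decode the IDs back into a string
print(tokenizer.decode(encoded))
# Batch-encode several sentences at once by calling the tokenizer on a list
batch = tokenizer(["First sentence.", "A second, longer example sentence."])
print(batch["input_ids"])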
Here is an example of how to pad sequences of tokenized input to equal length using the HuggingFace Transformers tokenizer:
from transformers import AutoTokenizer
# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Tokenize, pad, and truncate a batch of texts in one call
texts = ["Hello, world!", "This is a sentence.", "Short sentence."]
max_length = 10
padded_input = tokenizer(texts, padding=True, truncation=True, max_length=max_length)
print(padded_input)
In this example, we’ve used the AutoTokenizer class to load the pre-trained tokenizer for BERT. We’ve then called the tokenizer directly on a list of texts, which tokenizes them, pads the shorter sequences so that every sequence in the batch has the same length, and truncates any text longer than the maximum length of 10 tokens. The resulting padded_input is a dictionary containing the padded and tokenized input, with keys input_ids, attention_mask, and token_type_ids.
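If you plan to feed the result straight into a model, a common variation (sketched here, assuming PyTorch is installed) is to ask the tokenizer to return tensors directly:
# return_tensors="pt" gives PyTorch tensors instead of Python lists
padded_input = tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
print(padded_input["input_ids"].shape)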
You can adjust the max_length parameter to set the maximum allowed length of the input. With padding=True, the tokenizer pads shorter sequences with its padding token until they match the longest sequence in the batch, and the truncation parameter ensures that any sequence longer than the specified max_length is cut off. If you instead want every sequence padded to exactly max_length, pass padding="max_length", as in the sketch below.
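For instance, a minimal sketch (reusing the texts and max_length from above) that pads every sequence to exactly 10 tokens:
padded_input = tokenizer(texts, padding="max_length", truncation=True, max_length=max_length)
for ids in padded_input["input_ids"]:
    print(len(ids))  # every sequence now has length 10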