Encoding text with BertTokenizer in transformers

Published on Aug. 22, 2023, 12:10 p.m.

Encoding text with BertTokenizer

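from transformers import BertTokenizer, BertModel
# Load the tokenizer and the model from the same checkpoint
tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-2_H-128')
model = BertModel.from_pretrained("uer/chinese_roberta_L-2_H-128")
# Sample sentence: 13 characters plus [CLS] and [SEP] give the 15 token IDs printed below
text = "用你喜欢的任何文本替换我。"
# Encode a batch of five identical sentences and return NumPy arrays
encoded_input = tokenizer([text]*5, return_tensors='np')
# BertModel is a PyTorch model, so it needs return_tensors='pt' rather than 'np':
# output = model(**tokenizer([text]*5, return_tensors='pt'))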

Accepted values for return_tensors

If set, the tokenizer returns tensors instead of lists of Python integers. Acceptable values are:
* 'tf': return TensorFlow tf.constant objects.
* 'pt': return PyTorch torch.Tensor objects.
* 'np': return NumPy np.ndarray objects.
The same docstring also documents verbose (bool, optional, defaults to True): whether or not to print more information and warnings.
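Returning tensors only works when every row of the batch has the same length. The example above repeats one sentence five times, so that holds automatically; for sentences of different lengths the tokenizer is normally called with padding=True as well. Below is a minimal pure-Python sketch of what that padding does. It assumes a pad ID of 0 (the usual [PAD] ID in BERT vocabularies) and right-side padding; pad_batch is a hypothetical helper written for illustration, not a transformers function.

```python
def pad_batch(batch, pad_id=0):
    """Pad every row to the longest row and build matching attention masks."""
    max_len = max(len(row) for row in batch)
    input_ids = [row + [pad_id] * (max_len - len(row)) for row in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy rows of unequal length (IDs chosen only for illustration)
ragged = [[101, 4500, 102], [101, 4500, 872, 511, 102]]
padded = pad_batch(ragged)
print(padded["input_ids"])
print(padded["attention_mask"])
```

After padding, the rows form a rectangular batch, which is the shape a NumPy array, PyTorch tensor, or TensorFlow tensor requires.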
print(encoded_input)
>  {'input_ids': array([[ 101, 4500,  872, 1599, 3614, 4638,  818,  862, 3152, 3315, 3296,
        2940, 2769,  511,  102],
       [ 101, 4500,  872, 1599, 3614, 4638,  818,  862, 3152, 3315, 3296,
        2940, 2769,  511,  102],
       [ 101, 4500,  872, 1599, 3614, 4638,  818,  862, 3152, 3315, 3296,
        2940, 2769,  511,  102],
       [ 101, 4500,  872, 1599, 3614, 4638,  818,  862, 3152, 3315, 3296,
        2940, 2769,  511,  102],
       [ 101, 4500,  872, 1599, 3614, 4638,  818,  862, 3152, 3315, 3296,
        2940, 2769,  511,  102]]), 'token_type_ids': array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
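The printed result has three fields of identical shape. The sketch below rebuilds them as plain Python lists, using one row copied from the output above, and checks the properties that are easy to miss when reading the raw arrays. It assumes that 101 and 102 are the [CLS] and [SEP] IDs, as in the standard BERT vocabulary; the helper shape is hypothetical and no library is needed to run it.

```python
# One row copied from the printed input_ids (all five rows are identical)
row = [101, 4500, 872, 1599, 3614, 4638, 818, 862,
       3152, 3315, 3296, 2940, 2769, 511, 102]

batch = {
    "input_ids": [list(row) for _ in range(5)],
    "token_type_ids": [[0] * len(row) for _ in range(5)],
    "attention_mask": [[1] * len(row) for _ in range(5)],
}


def shape(rows):
    """Return (batch_size, sequence_length) for a rectangular list of lists."""
    return (len(rows), len(rows[0]))


for name, rows in batch.items():
    print(name, shape(rows))

CLS_ID, SEP_ID = 101, 102  # assumed special-token IDs
# Every sequence is wrapped in [CLS] ... [SEP]
wrapped = all(r[0] == CLS_ID and r[-1] == SEP_ID for r in batch["input_ids"])
# Single-sentence input: every token belongs to segment 0
single_segment = all(t == 0 for r in batch["token_type_ids"] for t in r)
# No padding was needed: every position is attended to
no_padding = all(m == 1 for r in batch["attention_mask"] for m in r)
content_tokens = len(row) - 2  # tokens left after removing [CLS] and [SEP]
print(wrapped, single_segment, no_padding, content_tokens)
```

token_type_ids would contain 1s only for the second sentence of a sentence pair, and attention_mask would contain 0s only where padding was added.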