Encoding text with BertTokenizer in transformers
Published on Aug. 22, 2023, 12:10 p.m.
Encoding text with BertTokenizer
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-2_H-128')
model = BertModel.from_pretrained("uer/chinese_roberta_L-2_H-128")
text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
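# Encode a batch of five identical sentences and return NumPy arrays.
encoded_input = tokenizer([text]*5, return_tensors='np')
# BertModel is a PyTorch module, so calling it requires return_tensors='pt' instead of 'np':
# output = model(**encoded_input)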
Accepted values for return_tensors
If set, will return tensors instead of a list of Python integers. Acceptable values are:
* 'tf': Return TensorFlow tf.constant objects.
* 'pt': Return PyTorch torch.Tensor objects.
* 'np': Return NumPy np.ndarray objects.
verbose (bool, optional, defaults to True): Whether or not to print more information and warnings.
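A tensor or array has to be rectangular, so return_tensors only works when every row in the batch has the same length. The example in this post passes five identical sentences, which is why no padding is needed. The following is a minimal plain-Python sketch of the idea, not the library's implementation: `pad_batch` is a hypothetical helper, the id sequences are made up, and the pad id of 0 is an assumption based on the standard BERT vocabularies.

```python
# Hypothetical helper, not part of transformers: pads token-id lists so that
# they form a rectangular batch, which is what a tensor or array requires.
PAD_ID = 0  # assumption: [PAD] has id 0, as in the standard BERT vocabularies


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every row to the longest row; return input_ids and attention_mask."""
    max_len = max(len(row) for row in batch)
    input_ids = [row + [pad_id] * (max_len - len(row)) for row in batch]
    attention_mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up id sequences of different lengths (101 = [CLS], 102 = [SEP]).
ragged = [[101, 4500, 872, 102], [101, 2769, 102]]
padded = pad_batch(ragged)
print(padded["input_ids"])       # the shorter row gains one trailing pad id
print(padded["attention_mask"])  # 1 marks real tokens, 0 marks padding
```

With the real tokenizer, passing padding=True together with return_tensors should do the same job for sentences of different lengths.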
print(encoded_input)
> {'input_ids': array([[ 101, 4500, 872, 1599, 3614, 4638, 818, 862, 3152, 3315, 3296,
2940, 2769, 511, 102],
[ 101, 4500, 872, 1599, 3614, 4638, 818, 862, 3152, 3315, 3296,
2940, 2769, 511, 102],
[ 101, 4500, 872, 1599, 3614, 4638, 818, 862, 3152, 3315, 3296,
2940, 2769, 511, 102],
[ 101, 4500, 872, 1599, 3614, 4638, 818, 862, 3152, 3315, 3296,
2940, 2769, 511, 102],
[ 101, 4500, 872, 1599, 3614, 4638, 818, 862, 3152, 3315, 3296,
2940, 2769, 511, 102]]), 'token_type_ids': array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
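To make the three arrays above easier to read, here is a rough plain-Python sketch of how one row is put together. It is an illustration only, not how BertTokenizer is implemented: `assemble_single` is a hypothetical name, and the ids are copied from the printed output.

```python
CLS_ID, SEP_ID = 101, 102  # the first and last id of every row printed above


def assemble_single(token_ids):
    """Wrap one sentence's ids in [CLS] ... [SEP] and build the companion fields."""
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single segment: all zeros
        "attention_mask": [1] * len(input_ids),  # no padding: all ones
    }


# The 13 ids that sit between 101 and 102 in each row of the output above.
sentence_ids = [4500, 872, 1599, 3614, 4638, 818, 862,
                3152, 3315, 3296, 2940, 2769, 511]
row = assemble_single(sentence_ids)

# The post encodes the same sentence five times, so the batch is five equal rows.
batch = {key: [value] * 5 for key, value in row.items()}
print(len(batch["input_ids"]), len(row["input_ids"]))
```

Each array therefore has five rows and fifteen columns: 13 token ids plus the two special tokens. token_type_ids would only contain ones if a second sentence were passed as a pair, and attention_mask would only contain zeros where padding was added.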