Sentence-BERT Sentence Representation Example
Published on Aug. 22, 2023, 12:10 p.m.
BERT
In this notebook we’ll take a look at how we can use transformer models (like BERT) to create sentence vectors for calculating similarity. Let’s start by defining a few example sentences.
a = "purple is the best city in the forest"
b = "there is an art to getting your way and throwing bananas on to the street is not it" # this is very similar to 'g'
c = "it is not often you find soggy bananas on the street"
d = "green should have smelled more tranquil but somehow it just tasted rotten"
e = "joyce enjoyed eating pancakes with ketchup"
f = "as the asteroid hurtled toward earth becky was upset her dentist appointment had been canceled"
g = "to get your way you must not bombard the road with yellow fruit" # this is very similar to 'b'
Install the dependencies needed for this notebook.
!pip install -qU transformers sentence-transformers
from transformers import AutoTokenizer, AutoModel
import torch
Initialize our HF transformer model and tokenizer - using a pretrained SBERT model.
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
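Since we only use the model for inference, we can optionally switch it to evaluation mode; wrapping the forward pass in torch.no_grad() (not done in the rest of this notebook, which calls .detach() later instead) would also skip gradient tracking.
model.eval()  # optional: disable dropout for inference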
Tokenize all of our sentences.
tokens = tokenizer([a, b, c, d, e, f, g],
                   max_length=128,
                   truncation=True,
                   padding='max_length',
                   return_tensors='pt')
tokens.keys()
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
tokens['input_ids'][0]
tensor([ 101, 6379, 2003, 1996, 2190, 2103, 1999, 1996, 3224, 102, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
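As a quick sanity check (not part of the original walkthrough), we can map the IDs back to tokens to see the [CLS]/[SEP] special tokens and the [PAD] padding.
tokenizer.convert_ids_to_tokens(tokens['input_ids'][0].tolist())[:12]
# roughly: ['[CLS]', 'purple', 'is', 'the', 'best', 'city', 'in', 'the', 'forest', '[SEP]', '[PAD]', '[PAD]']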
Process our tokenized tensors through the model.
outputs = model(**tokens)
outputs.keys()
odict_keys(['last_hidden_state', 'pooler_output'])
Here we can see the final embedding layer, last_hidden_state.
embeddings = outputs.last_hidden_state
embeddings[0]
tensor([[-0.6239, -0.2058, 0.0411, ..., 0.1490, 0.5681, 0.2381],
[-0.3694, -0.1485, 0.3780, ..., 0.4204, 0.5553, 0.1441],
[-0.7221, -0.3813, 0.2031, ..., 0.0761, 0.5162, 0.2813],
...,
[-0.1894, -0.3711, 0.3034, ..., 0.1536, 0.3265, 0.1376],
[-0.2496, -0.5227, 0.2341, ..., 0.3419, 0.3164, 0.0256],
[-0.3311, -0.4430, 0.3492, ..., 0.3655, 0.2910, 0.0728]],
grad_fn=<SelectBackward0>)
embeddings[0].shape
torch.Size([128, 768])
Here we have our vectors of length 768, but we see that these are not sentence vectors because we have a vector representation for each token in our sequence (128 in total). We need to perform a mean pooling operation to create the sentence vector.
The first thing we do is multiply each value in our embeddings tensor by its respective attention_mask value. The attention_mask contains 1s where we have 'real' tokens (i.e. not padding tokens) and 0s elsewhere, so this operation allows us to ignore the non-real tokens.
mask = tokens['attention_mask'].unsqueeze(-1).expand(embeddings.size()).float()
mask.shape
torch.Size([7, 128, 768])
mask[0]
tensor([[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
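As a quick check (our addition), the mask for sentence a should sum to the number of real tokens, 10 (the eight words plus [CLS] and [SEP]).
tokens['attention_mask'][0].sum()  # tensor(10)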
Now we have a masking array with the same shape as our output embeddings. We multiply the two together to apply the masking operation to our outputs.
masked_embeddings = embeddings * mask
masked_embeddings[0]
tensor([[-0.6239, -0.2058, 0.0411, ..., 0.1490, 0.5681, 0.2381],
[-0.3694, -0.1485, 0.3780, ..., 0.4204, 0.5553, 0.1441],
[-0.7221, -0.3813, 0.2031, ..., 0.0761, 0.5162, 0.2813],
...,
[-0.0000, -0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[-0.0000, -0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[-0.0000, -0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]],
grad_fn=<SelectBackward0>)
Sum the remaining embeddings along axis 1 to get a total in each of our 768 dimensions.
summed = torch.sum(masked_embeddings, 1)
summed.shape
torch.Size([7, 768])
Next, we count the number of values that should be given attention in each position of the tensor (+1 for real tokens, +0 for non-real), clamping the count to a small minimum so we never divide by zero.
counted = torch.clamp(mask.sum(1), min=1e-9)  # clamp guards against division by zero
counted.shape
torch.Size([7, 768])
Finally, we get our mean-pooled values as the summed embeddings divided by the number of values that should be given attention, counted.
mean_pooled = summed / counted
mean_pooled.shape
torch.Size([7, 768])
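The pooling steps above can be wrapped into a single helper, shown here as a sketch (the name mean_pool is ours, not part of any library):
def mean_pool(token_embeddings, attention_mask):
    # expand the mask to the embedding dimension, zero out padding positions,
    # then average over the real tokens only
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * mask, 1)
    counted = torch.clamp(mask.sum(1), min=1e-9)
    return summed / counted

mean_pooled = mean_pool(outputs.last_hidden_state, tokens['attention_mask'])  # same result as the step-by-step version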
Now that we have our sentence vectors, we can calculate the cosine similarity between each pair.
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# convert to numpy array from torch tensor
mean_pooled = mean_pooled.detach().numpy()
# calculate similarities (will store in array)
scores = np.zeros((mean_pooled.shape[0], mean_pooled.shape[0]))
for i in range(mean_pooled.shape[0]):
    scores[i, :] = cosine_similarity(
        [mean_pooled[i]],
        mean_pooled
    )[0]
scores
array([[ 1.00000012, 0.1869276 , 0.2829769 , 0.29628235, 0.2745102 ,
0.10176259, 0.21696258],
[ 0.1869276 , 1. , 0.72058779, 0.51428956, 0.1174964 ,
0.19306925, 0.66182363],
[ 0.2829769 , 0.72058779, 1.00000024, 0.4886443 , 0.23568943,
0.17157131, 0.55993092],
[ 0.29628235, 0.51428956, 0.4886443 , 0.99999988, 0.26985496,
0.37889433, 0.52388811],
[ 0.2745102 , 0.1174964 , 0.23568943, 0.26985496, 0.99999988,
0.23422126, -0.01599787],
[ 0.10176259, 0.19306925, 0.17157131, 0.37889433, 0.23422126,
1.00000012, 0.22319663],
[ 0.21696258, 0.66182363, 0.55993092, 0.52388811, -0.01599787,
0.22319663, 1. ]])
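As an aside, the loop isn't strictly necessary: passing a single 2D array to scikit-learn's cosine_similarity computes the full pairwise matrix in one call.
scores = cosine_similarity(mean_pooled)  # (7, 7) pairwise similarity matrix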
We can visualize these scores using matplotlib and seaborn.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,9))
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
sns.heatmap(scores, xticklabels=labels, yticklabels=labels, annot=True)
Using sentence-transformers
The sentence-transformers library allows us to compress all of the above into just a few lines of code.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
We encode the sentences (producing our mean-pooled sentence embeddings) like so:
sentence_embeddings = model.encode([a, b, c, d, e, f, g])
And calculate the cosine similarity just like before.
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# calculate similarities (will store in array)
scores = np.zeros((sentence_embeddings.shape[0], sentence_embeddings.shape[0]))
for i in range(sentence_embeddings.shape[0]):
    scores[i, :] = cosine_similarity(
        [sentence_embeddings[i]],
        sentence_embeddings
    )[0]
scores
array([[ 1. , 0.18692753, 0.28297687, 0.29628229, 0.27451015,
0.1017626 , 0.21696255],
[ 0.18692753, 0.99999988, 0.72058773, 0.5142895 , 0.11749639,
0.19306931, 0.66182363],
[ 0.28297687, 0.72058773, 1.00000024, 0.48864418, 0.2356894 ,
0.17157122, 0.55993092],
[ 0.29628229, 0.5142895 , 0.48864418, 0.99999976, 0.26985493,
0.3788943 , 0.52388811],
[ 0.27451015, 0.11749639, 0.2356894 , 0.26985493, 0.99999982,
0.23422126, -0.01599786],
[ 0.1017626 , 0.19306931, 0.17157122, 0.3788943 , 0.23422126,
1.00000012, 0.22319666],
[ 0.21696255, 0.66182363, 0.55993092, 0.52388811, -0.01599786,
0.22319666, 1. ]])
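sentence-transformers also provides its own similarity utility, util.cos_sim, which returns the pairwise cosine similarities as a torch tensor; a minimal sketch:
from sentence_transformers import util
scores = util.cos_sim(sentence_embeddings, sentence_embeddings)  # (7, 7) torch tensor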