How to visualize Word2Vec embeddings using t-SNE or PCA?

Published on Aug. 22, 2023, 12:18 p.m.

To visualize Word2Vec embeddings using t-SNE or PCA, first extract the embedding vector for each word in the vocabulary. Word2Vec vectors typically have hundreds of dimensions, so you then apply a dimensionality reduction technique such as t-SNE or PCA to project them down to two or three dimensions, which can be plotted directly.

Here is an example of how to extract the embeddings and perform dimensionality reduction using t-SNE in Python:

from gensim.models import Word2Vec
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load a previously trained model; 'word2vec.model' is the path used here.
model = Word2Vec.load('word2vec.model')

# In gensim 4.x the vocabulary is exposed as model.wv.index_to_key
# (the older model.wv.vocab attribute was removed).
words = list(model.wv.index_to_key)
embeddings = np.array([model.wv[word] for word in words])

# t-SNE is stochastic; fixing random_state makes the layout reproducible.
tsne = TSNE(n_components=2, random_state=0)
tsne_embeddings = tsne.fit_transform(embeddings)

# Plot each word at its two-dimensional coordinates, labelled with the word.
plt.figure(figsize=(10, 10))
for i, word in enumerate(words):
    x, y = tsne_embeddings[i, :]
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y), xytext=(4, 4), textcoords='offset points', ha='right', va='bottom')
plt.show()

This code loads a pre-trained Word2Vec model, extracts the embedding vector for each word in the vocabulary, applies t-SNE to reduce the embeddings to two dimensions, and plots the result as a scatter plot annotated with each word. Note that for a large vocabulary the plot becomes unreadable, so in practice you would usually restrict it to a subset of words, such as the few hundred most frequent.
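As a rough sketch, that restriction is a one-line change to the extraction step, assuming gensim 4.x, where index_to_key is ordered by descending word frequency:

# index_to_key is sorted most-frequent-first in gensim 4.x,
# so slicing it keeps the 300 most frequent words.
words = list(model.wv.index_to_key)[:300]
embeddings = np.array([model.wv[word] for word in words])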

You can modify the code to use PCA instead of t-SNE, or to plot the embeddings in three dimensions instead of two, depending on your specific needs; both variants are sketched below.
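For the PCA variant, only the reduction step changes. The sketch below assumes the words and embeddings variables from the example above are still in scope. Unlike t-SNE, PCA is deterministic and fast, but it only captures linear structure, so clusters are usually less pronounced:

from sklearn.decomposition import PCA

# Project the embeddings onto their first two principal components.
pca = PCA(n_components=2)
pca_embeddings = pca.fit_transform(embeddings)

plt.figure(figsize=(10, 10))
for i, word in enumerate(words):
    x, y = pca_embeddings[i, :]
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y), xytext=(4, 4), textcoords='offset points', ha='right', va='bottom')
plt.show()

A three-dimensional plot works the same way: reduce to three components and draw on 3D axes (again a sketch reusing the variables above; requires matplotlib 3.2 or later for the projection keyword):

# Reduce to three components and plot on 3D axes.
pca_3d = PCA(n_components=3).fit_transform(embeddings)

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(projection='3d')
ax.scatter(pca_3d[:, 0], pca_3d[:, 1], pca_3d[:, 2])
for i, word in enumerate(words):
    ax.text(pca_3d[i, 0], pca_3d[i, 1], pca_3d[i, 2], word)
plt.show()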