How to use Gensim for text classification with Doc2Vec?

Published on Aug. 22, 2023, 12:18 p.m.

To use Gensim for text classification with Doc2Vec, you can follow these steps:

  1. Prepare your data: Assemble a dataset of texts and their corresponding labels. Each text should end up as a list of words (tokens), and each label should be a single string or integer.
  2. Preprocess your data: Clean and tokenize each text, and optionally apply stemming and/or stop-word removal.
  3. Create tagged documents: Create a list of TaggedDocument objects. Each TaggedDocument object represents a piece of text in your dataset, along with a unique tag/identifier (preferably a string or integer) that can help identify the document later.
from gensim.models.doc2vec import TaggedDocument

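# assume `texts` is a list of tokenized documents (each a list of words) and `labels` is the parallel list of class labels
# tag each document with its unique index; tagging with the label would make all documents of a class share one vector
tagged_documents = [TaggedDocument(words=words, tags=[i]) for i, words in enumerate(texts)]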
  4. Train a Doc2Vec model: Use Gensim's Doc2Vec class to build a vocabulary from your tagged documents and train a model that learns an embedding for each document.
from gensim.models.doc2vec import Doc2Vec

# create a Doc2Vec model and build the vocabulary
doc2vec_model = Doc2Vec(vector_size=100, min_count=5, epochs=50)
doc2vec_model.build_vocab(tagged_documents)

# train the Doc2Vec model on the tagged documents
doc2vec_model.train(tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)
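  5. Use the document embeddings for classification: Train a classifier, such as scikit-learn's LogisticRegression, on the document embeddings and their labels, then use it to predict labels for new documents.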
from sklearn.linear_model import LogisticRegression
import numpy as np

# get the document embeddings from the Doc2Vec model
doc_embeddings = np.array([doc2vec_model.dv[doc.tags[0]] for doc in tagged_documents])

# train a logistic regression classifier on the document embeddings and labels
clf = LogisticRegression()
clf.fit(doc_embeddings, labels)

# make predictions on new, unseen documents
# infer_vector() expects a list of tokens, not a list containing one raw string
new_doc = 'this is a new document'.split()
new_doc_embedding = doc2vec_model.infer_vector(new_doc)
predicted_label = clf.predict([new_doc_embedding])[0]

This code trains a logistic regression classifier on the labels and the document embeddings obtained from the trained Doc2Vec model. It then predicts the label of a new, unseen document by computing its embedding with the Doc2Vec model's infer_vector() method and passing that embedding to the classifier's predict() method.
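New documents should go through the same preprocessing as the training texts before they are passed to infer_vector(). The following is a minimal sketch of such a preprocessing function using only the Python standard library. The function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical and for illustration only; a real pipeline would likely use a fuller stop-word list and possibly stemming.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "this", "that", "of", "to", "and"}


def preprocess(text):
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Turn a raw string into the list of tokens that infer_vector() expects.
new_doc = preprocess("This is a new document!")
print(new_doc)  # ['new', 'document']
```

Applying the same function to both the training texts and any new document keeps the token vocabulary consistent between training and inference.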

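Note that Doc2Vec embeddings are not guaranteed to have unit length, so it may help to normalize them (for example, to unit L2 norm) before classification. If you do, apply the same normalization to both the training embeddings and the inferred embeddings of new documents.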
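One common choice is L2 normalization, which rescales each embedding to unit length. The sketch below uses NumPy; the helper name `l2_normalize` and the small example array are assumptions for illustration, standing in for the real embeddings. Whether normalization actually improves accuracy depends on the data and the classifier.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each vector (row) to unit L2 norm; all-zero vectors are left as zeros."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    # eps guards against division by zero for all-zero vectors
    return vectors / np.maximum(norms, eps)


# Stand-in for the real document embeddings (illustrative values only).
example_embeddings = np.array([[3.0, 4.0], [0.0, 2.0]])
normalized = l2_normalize(example_embeddings)
print(normalized)  # rows are [0.6, 0.8] and [0.0, 1.0]
```

The same function works on a single inferred vector, so it can be applied to the training matrix before fitting and to each new embedding before calling predict().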