How to load a text corpus into Gensim?

Published on Aug. 22, 2023, 12:18 p.m.

To load a text corpus into Gensim, use the corpora module to build a dictionary from the tokenized documents, then convert each document into bag-of-words format. Here is an example code snippet:

from gensim import corpora

# read in corpus as a list of tokenized documents (one document per line)
with open('corpus.txt', 'r', encoding='utf-8') as f:
    documents = [doc.strip().split() for doc in f]

# create dictionary from corpus
dictionary = corpora.Dictionary(documents)

# transform corpus into bag-of-words format
corpus = [dictionary.doc2bow(doc) for doc in documents]

This code reads a text corpus from a file named corpus.txt (one document per line, tokens separated by whitespace), builds a dictionary that maps each unique token to an integer ID, and converts each document into a bag-of-words vector: a list of (token ID, count) pairs. You can pass this corpus to Gensim's bag-of-words models, such as TF-IDF, LSI, or LDA. Word2Vec is different: it is trained on the tokenized documents themselves (the documents list above), not on the bag-of-words corpus. If the corpus is too large to load into memory at once, you can use the corpora.textcorpus module to stream it from disk.
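
To make the streaming idea concrete, here is a minimal hand-rolled sketch. It does not use the corpora.textcorpus module; it shows the same idea with plain Python iteration, assuming the same layout as above (a file with one whitespace-tokenized document per line). The names iter_tokens, StreamedBowCorpus, and build_dictionary are hypothetical helpers introduced only for this illustration.

```python
def iter_tokens(path, encoding="utf-8"):
    """Yield one tokenized document per line, reading the file lazily."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:  # assumption: blank lines are not documents, so skip them
                yield tokens


class StreamedBowCorpus:
    """Re-iterable corpus that yields one bag-of-words vector per document.

    Only one line of the file is held in memory at a time.
    """

    def __init__(self, path, dictionary):
        self.path = path
        # Any object exposing doc2bow(tokens), e.g. a gensim corpora.Dictionary.
        self.dictionary = dictionary

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        for tokens in iter_tokens(self.path):
            yield self.dictionary.doc2bow(tokens)


def build_dictionary(path):
    """Build a Gensim dictionary in a single streaming pass over the file."""
    # Imported lazily so the helpers above stay usable without Gensim installed.
    from gensim import corpora
    return corpora.Dictionary(iter_tokens(path))
```

With Gensim installed, dictionary = build_dictionary('corpus.txt') followed by corpus = StreamedBowCorpus('corpus.txt', dictionary) gives a corpus that can stand in for the in-memory list built earlier. A class with __iter__ is used instead of a bare generator because some models make more than one pass over the corpus, and a generator would be exhausted after the first pass.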
