How to load a text corpus into Gensim?
Published on Aug. 22, 2023, 12:18 p.m.
To load a text corpus into Gensim, you can use the corpora module to create a dictionary from the corpus and then transform each document into bag-of-words format. Here is an example code snippet:
from gensim import corpora

# read in the corpus: one document per line, tokenized on whitespace
with open('corpus.txt', 'r') as f:
    documents = [line.strip().split() for line in f]

# create a dictionary that maps each unique token to an integer id
dictionary = corpora.Dictionary(documents)

# transform each document into bag-of-words format: a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in documents]
This code loads a text corpus from a file named corpus.txt, treating each line as one document, creates a dictionary from the corpus, and then transforms each document into a bag-of-words representation that Gensim models can process. You can then use this corpus to train bag-of-words models such as LDA. Word2Vec, by contrast, is trained on the tokenized documents themselves (the documents list above) rather than on the bag-of-words corpus. If you have a large corpus that cannot be loaded into memory at once, you can use the corpora.textcorpus module to stream the corpus from disk.