How to use fastText for document classification on Linux?
Published on Aug. 22, 2023, 12:19 p.m.
To use fastText for document classification on Linux, you can follow these steps:
1. Prepare the training data: Put the training data in fastText format, with one text document per line together with a label that indicates the document's category. Each label must start with the __label__ prefix. The official examples put the label at the start of the line, but fastText treats any token with this prefix as a label, so either position works. For example, a document about sports could be written as:
This is a document about sports __label__sports
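If your documents start out as (text, label) pairs rather than ready-made lines, a small script can produce this format. The following is a minimal sketch using only the Python standard library; the function names and the sample documents are hypothetical, and it assumes that collapsing newlines and extra whitespace is acceptable for your data.

```python
import re

LABEL_PREFIX = "__label__"  # the prefix fastText expects by default


def to_fasttext_line(text, label, label_first=False):
    """Format one document and its label as a single fastText training line."""
    # fastText reads one example per line, so collapse newlines and runs of whitespace.
    clean_text = re.sub(r"\s+", " ", text).strip()
    # A label must be a single token; replace internal whitespace with underscores.
    clean_label = re.sub(r"\s+", "_", label.strip())
    tag = LABEL_PREFIX + clean_label
    return f"{tag} {clean_text}" if label_first else f"{clean_text} {tag}"


def write_training_file(examples, path):
    """Write (text, label) pairs to `path`, one fastText line per pair."""
    with open(path, "w", encoding="utf-8") as f:
        for text, label in examples:
            f.write(to_fasttext_line(text, label) + "\n")


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    samples = [
        ("This is a document about sports", "sports"),
        ("Parliament passed\nthe new budget", "politics"),
    ]
    for text, label in samples:
        print(to_fasttext_line(text, label))
```

Calling write_training_file(samples, "train.txt") would write the same lines to a file that the training step can read.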
2. Train the model: Train a fastText model on the labeled data with the fasttext command-line tool. You specify the training data file and an output name, and you can set hyperparameters such as the number of epochs, the learning rate, and the dimensionality of the word vectors. For example, to train a document classification model, run:
fasttext supervised -input train.txt -output model -dim 100 -lr 0.1 -epoch 25
This creates the model file model.bin, which contains the word vectors and the category labels, along with model.vec, a plain-text file of the word vectors.
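The training command reads train.txt, while the evaluation and prediction steps use a separate test.txt. If you start from a single labeled file, one common way to get both is a shuffled hold-out split. The sketch below is one possible approach, not part of fastText itself; the function name, the 20% test fraction, and the input file name all_labeled.txt are assumptions chosen for illustration.

```python
import random


def split_lines(lines, test_fraction=0.2, seed=42):
    """Shuffle labeled lines and return (train_lines, test_lines)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(lines)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one line so the test set is never empty.
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def split_file(source_path, train_path="train.txt", test_path="test.txt"):
    """Read fastText-format lines from source_path and write the two splits."""
    with open(source_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    train, test = split_lines(lines)
    for path, part in ((train_path, train), (test_path, test)):
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(part) + "\n")


# Example call (all_labeled.txt is a hypothetical file name):
# split_file("all_labeled.txt")
```

Keeping the test lines out of train.txt matters because a model evaluated on documents it was trained on will look better than it really is.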
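3. Evaluate the model: Evaluate the model on a held-out test set to see how well it classifies documents. The fasttext test command (for example, fasttext test model.bin test.txt) reports precision and recall at k. For a fuller picture, you can also compute metrics such as accuracy, F1 score, and a confusion matrix from the model’s predictions.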
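To make the metrics mentioned in this step concrete, here is a minimal sketch that computes accuracy, per-label precision, recall and F1, and confusion counts from two parallel lists of labels. It is plain Python rather than anything provided by fastText, and it assumes one gold label and one predicted label per document; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, per_label_metrics, confusion_counts)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # confusion[(g, p)] counts documents with gold label g predicted as p.
    confusion = Counter(zip(gold, predicted))
    correct = sum(n for (g, p), n in confusion.items() if g == p)
    accuracy = correct / len(gold) if gold else 0.0

    per_label = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = confusion[(label, label)]
        fp = sum(n for (g, p), n in confusion.items() if p == label and g != label)
        fn = sum(n for (g, p), n in confusion.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_label[label] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, per_label, confusion


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    gold = ["sports", "sports", "politics", "politics"]
    predicted = ["sports", "politics", "politics", "politics"]
    accuracy, per_label, confusion = evaluate(gold, predicted)
    print(f"accuracy: {accuracy:.2f}")
    for label, scores in per_label.items():
        print(label, {name: round(value, 3) for name, value in scores.items()})
```

Per-label scores are worth checking alongside accuracy, since a model can score well overall while doing poorly on a rare category.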
4. Use the model for document classification: Classify new documents with the predict subcommand of the fasttext tool. You pass the model file, the input file containing the documents to classify (one per line), and optionally the number of labels to output per document. For example:
fasttext predict model.bin test.txt 3
This outputs the top 3 predicted labels for each document in test.txt.
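The predict command prints one line per input document, with the predicted labels still carrying the __label__ prefix. If you want to use those predictions in a script, a small parser helps. The sketch below assumes the output has been captured as text (for example, by redirecting it to a file); the function names and the sample output are hypothetical. It keeps only tokens that start with the prefix, so any other tokens on a line, such as probabilities, are skipped.

```python
LABEL_PREFIX = "__label__"


def parse_prediction_line(line):
    """Return the labels on one output line, in order, with the prefix removed."""
    return [
        token[len(LABEL_PREFIX):]
        for token in line.split()
        if token.startswith(LABEL_PREFIX)
    ]


def parse_predictions(output_text):
    """Parse captured predict output into one list of labels per document."""
    return [parse_prediction_line(line) for line in output_text.splitlines()]


if __name__ == "__main__":
    # Hypothetical output for two documents with 3 labels each.
    sample_output = (
        "__label__sports __label__politics __label__tech\n"
        "__label__tech __label__sports __label__politics\n"
    )
    for doc_number, labels in enumerate(parse_predictions(sample_output), start=1):
        print(doc_number, labels)
```

The first label on each line is the model's top prediction, so parse_predictions(text)[i][0] gives the best guess for document i when the line is not empty.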
That’s it! With these steps, you should be able to use fastText for document classification on Linux.