How to implement text classification using scikit-learn

Published on Aug. 22, 2023, 12:16 p.m.

To implement text classification using scikit-learn, you can use a bag-of-words representation of the text data along with a classification algorithm, such as logistic regression or a support vector machine (SVM). Here’s an example code snippet that illustrates this approach:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the data (assumes data.csv has a 'text' column and a 'label' column)
data = pd.read_csv('data.csv')

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)

# Convert the text data into feature vectors using a bag-of-words representation
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

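# Train a logistic regression classifier on the training data.
# max_iter is raised above the default of 100 because the solver may not
# converge on sparse, high-dimensional count features.
clf = LogisticRegression(max_iter=1000)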
clf.fit(X_train, y_train)

# Evaluate the classifier on the test data
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

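In this code, we load the text data from a CSV file and split it into training and test sets, holding out 20% of the rows for testing. A CountVectorizer object then converts the text into bag-of-words feature vectors, so each document becomes a sparse vector of word counts. The vectorizer is fitted on the training set only and then applied to the test set, which keeps the vocabulary from picking up information from the test documents. Finally, we train a logistic regression classifier on the training data and evaluate it on the test data using the accuracy score metric.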

Note that this is just one approach to text classification with scikit-learn. Alternatives include weighting terms with TfidfVectorizer instead of using raw counts, or swapping logistic regression for an SVM or a naive Bayes classifier. You may need to experiment with different approaches to find the best one for your specific task and data.
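As one example of a different approach, the sketch below combines TF-IDF features with a linear SVM inside a scikit-learn pipeline. The tiny spam/ham dataset and the variable names are hypothetical and only there to keep the example self-contained. In practice you would fit on your own training split and compare the result against the logistic regression baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy dataset, for illustration only
texts = [
    "free prize, claim now",
    "win money fast",
    "cheap pills online",
    "limited offer, win a free prize",
    "meeting at noon tomorrow",
    "please review the attached report",
    "lunch with the team on friday",
    "project status update attached",
]
labels = ["spam"] * 4 + ["ham"] * 4

# The pipeline fits the vectorizer and the classifier together, so raw
# strings can be passed directly to fit() and predict()
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

# Classify previously unseen text
new_texts = ["claim your free money", "status report for the meeting"]
predictions = model.predict(new_texts)
print(list(predictions))
```

Wrapping both steps in a pipeline also avoids accidentally fitting the vectorizer on test data, since the vocabulary is learned only inside fit().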