To train a named entity recognition (NER) model using scikit-learn
Published on Aug. 22, 2023, 12:16 p.m.
To train a named entity recognition (NER) model using scikit-learn, you can use the sklearn_crfsuite package, which provides a scikit-learn-compatible interface for training conditional random field (CRF) models on NER tasks. Here is an example code snippet:
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.model_selection import train_test_split
# Load the training data
# Assumed format (CoNLL-style): one token per line, with the token in the first
# column and its tag in the last column; blank lines separate sentences
sentences = []
labels = []
tokens, tags = [], []
with open('train.txt', 'r', encoding='utf-8') as f:
    for line in f:
        fields = line.strip().split()
        if not fields:
            # A blank line ends the current sentence
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        tokens.append(fields[0])
        tags.append(fields[-1])
if tokens:
    # Keep the last sentence if the file does not end with a blank line
    sentences.append(tokens)
    labels.append(tags)
# Extract the features: describe each token with a dictionary of features
def token_features(sentence, i):
    word = sentence[i]
    return {
        'bias': 1.0,
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),
        'word.isupper': word.isupper(),
        'word.isdigit': word.isdigit(),
        'suffix3': word[-3:],
        'prev.lower': sentence[i - 1].lower() if i > 0 else '<BOS>',
        'next.lower': sentence[i + 1].lower() if i < len(sentence) - 1 else '<EOS>',
    }
X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
# Train a CRF model using the training data
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)
# Evaluate the model on the test set
y_pred = crf.predict(X_test)
print(metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=list(crf.classes_)))
In this code, we first load the training data from a text file and group the tokens and their tags into sentences. Each token is then described by a dictionary of features (its lowercase form, capitalization, a three-character suffix, and the neighboring words). A CRF in sklearn_crfsuite is trained on sequences, so every sentence is passed as a list of feature dictionaries together with a list of string tags of the same length.
We then split the sentences into training and test sets, train a CRF model on the training data, and evaluate the model on the test set using the weighted F1-score, computed over all tokens.
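To make the evaluation step more concrete, here is a minimal pure-Python sketch of what a weighted F1-score over flattened tag sequences amounts to: the per-sentence tag lists are flattened, an F1-score is computed for each tag, and the scores are averaged with weights proportional to how often each tag occurs in the gold data. The helper name flat_weighted_f1 is hypothetical and not part of sklearn_crfsuite; it is only meant to approximate what flat_f1_score with average='weighted' reports, under the assumption that every tag in the gold data is included in the evaluation.

```python
from collections import Counter


def flat_weighted_f1(y_true, y_pred):
    """Weighted F1 over all tokens (hypothetical helper, for illustration only)."""
    # Flatten the per-sentence tag lists into one long list of tags each
    true_flat = [tag for sentence in y_true for tag in sentence]
    pred_flat = [tag for sentence in y_pred for tag in sentence]
    if len(true_flat) != len(pred_flat):
        raise ValueError("y_true and y_pred must contain the same number of tags")
    if not true_flat:
        return 0.0

    support = Counter(true_flat)    # how often each tag occurs in the gold data
    predicted = Counter(pred_flat)  # how often each tag was predicted
    correct = Counter(t for t, p in zip(true_flat, pred_flat) if t == p)

    total = 0.0
    for tag, n_true in support.items():
        precision = correct[tag] / predicted[tag] if predicted[tag] else 0.0
        recall = correct[tag] / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true        # weight each tag's F1 by its support
    return total / len(true_flat)


# Two short sentences; the second one misses the B-LOC tag
gold = [["B-PER", "O"], ["O", "B-LOC"]]
pred = [["B-PER", "O"], ["O", "O"]]
score = flat_weighted_f1(gold, pred)
print(round(score, 2))
```

Because the very frequent O tag dominates the weights, a weighted score can look good even when rare entity tags are missed, so it may be worth also checking the scores of the entity tags on their own.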