How to use scikit-learn pipelines?

Published on Aug. 22, 2023, 12:18 p.m.

To use scikit-learn pipelines, import the Pipeline class from the sklearn.pipeline module. Then define the sequence of steps as a list of tuples, where each tuple contains a name for the step and the object itself. Every step except the last must be a transformer, that is, an object with fit() and transform() methods. The final tuple contains the name of the estimator and the estimator itself. Here’s an example of a pipeline that scales the data and applies a support vector machine classifier:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

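# Toy training data: two samples with two features, one sample per class
X = np.array([[0, 0], [1, 1]])
y = np.array([0, 1])

# Each step is a (name, object) tuple; the last step is the estimator
my_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', C=1))
])

# Fit the scaler, then fit the classifier on the scaled data
my_pipeline.fit(X, y)

# New samples are scaled with the fitted scaler before being classified
print(my_pipeline.predict([[2., 2.], [-1., -2.]]))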

In this example, the pipeline has two steps: scaling the data with the StandardScaler transformer, and classifying it with the SVC estimator. Calling fit() on the pipeline fits the scaler on the training data, transforms that data, and then fits the classifier on the scaled result. Calling predict() applies the already-fitted scaler to the new data, without refitting it, and passes the scaled data to the classifier’s predict() method.
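To make the behavior of fit() and predict() concrete, the sketch below runs the same two steps by hand and compares the result with the pipeline. The variable names (pipe, scaler, svc, X_new) are chosen for illustration only; the data is the toy data from the example above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1]])
y = np.array([0, 1])
X_new = np.array([[2., 2.], [-1., -2.]])

# Pipeline version: one fit() call and one predict() call
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', C=1))
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X_new)

# Manual version: the same steps written out explicitly
scaler = StandardScaler()
svc = SVC(kernel='linear', C=1)
X_scaled = scaler.fit_transform(X)   # fit the scaler and scale the training data
svc.fit(X_scaled, y)                 # fit the classifier on the scaled data
manual_pred = svc.predict(scaler.transform(X_new))  # transform only, no refit

print(np.array_equal(pipe_pred, manual_pred))  # prints True
```

The manual version has to remember to call transform() rather than fit_transform() on the new data; the pipeline handles that detail internally.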

Note that the pipeline’s predict() method returns exactly what the final estimator’s predict() method returns for the transformed input. This is because a pipeline exposes the same interface as the final estimator it wraps, so it can be used wherever a plain estimator can.
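As a rough illustration of that shared interface, the following sketch calls a few other estimator-style methods on the pipeline. The variable names (pipe, train_accuracy, fitted_svc) are illustrative; score(), named_steps, and set_params() with the step__parameter naming are part of the Pipeline API.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1]])
y = np.array([0, 1])

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', C=1))
])
pipe.fit(X, y)

# score() scales X with the fitted scaler, then calls the classifier's score()
train_accuracy = pipe.score(X, y)
print(train_accuracy)

# Individual fitted steps can be looked up by the names given in the tuples
fitted_svc = pipe.named_steps['svc']
print(type(fitted_svc).__name__)

# Parameters of a step are addressed as <step name>__<parameter name>
pipe.set_params(svc__C=10)
print(pipe.named_steps['svc'].C)
```

Changing a parameter with set_params() does not refit the pipeline; fit() has to be called again before the new value takes effect.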