How to perform clustering using scikit-learn?

Published on Aug. 22, 2023, 12:18 p.m.

To perform clustering using scikit-learn, you can follow these general steps:

  1. Load the dataset as a numeric feature array and preprocess it if necessary (for example, scale the features when using a distance-based algorithm).
  2. Choose a clustering algorithm that best suits your dataset and problem. Some popular ones include K-means, DBSCAN, and hierarchical clustering.
  3. Create an instance of the chosen clustering algorithm and set any hyperparameters as needed.
  4. Train the clustering model on the data by calling its fit() method.
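  5. If necessary, assign cluster labels to new data points using the predict() method of the fitted model. Not every algorithm provides it: K-means does, while DBSCAN and agglomerative (hierarchical) clustering only label the data they were fitted on, through labels_ or fit_predict().
  6. Evaluate the clustering result using appropriate metrics, such as the silhouette score or Davies-Bouldin index (neither needs ground-truth labels), the adjusted Rand index (when true labels are available), or domain-specific measures.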
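
As a rough sketch of steps 1 to 5, the snippet below scales the iris features, fits three of the algorithms mentioned above through their shared fit_predict() method, and checks which of them expose predict(). The eps, min_samples, and n_clusters values are illustrative assumptions, not tuned settings.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Step 1: load the data and scale each feature to zero mean, unit variance
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-3: instantiate a few candidate algorithms (hyperparameters are illustrative)
estimators = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=3),
}

results = {}
for name, estimator in estimators.items():
    # Step 4: fit_predict() fits the model and returns one label per sample
    labels = estimator.fit_predict(X_scaled)
    # Step 5: only some estimators can label unseen points afterwards
    results[name] = {
        "labels": labels,
        "has_predict": hasattr(estimator, "predict"),
    }
    print(name, "supports predict():", results[name]["has_predict"])
```

DBSCAN marks points it considers noise with the label -1, so its labels should not be read as a fixed number of clusters.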

Here is an example of performing K-means clustering on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

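# Load the feature matrix (150 samples, 4 features)
iris_data = load_iris()
X = iris_data.data
y = iris_data.target  # true species labels; not used for clustering

# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

# labels_ holds the cluster index assigned to each training sample
labels = kmeans.labels_
score = silhouette_score(X, labels)
print("Silhouette score:", score)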

This example loads the iris dataset, instantiates KMeans with 3 clusters, and fits the model. It then reads the cluster labels assigned during fitting from the labels_ attribute and evaluates the result with the silhouette score, which ranges from -1 to 1; higher values indicate more compact, better-separated clusters.
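
Step 5 is not covered by the example above, so here is a minimal sketch of it, together with a comparison against the known species labels. The two measurements in new_points are made-up values for illustration, and using the adjusted Rand index is one possible choice when ground-truth labels happen to exist.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris_data = load_iris()
X, y = iris_data.data, iris_data.target

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Hypothetical new flowers: sepal length, sepal width, petal length, petal width (cm)
new_points = np.array([
    [5.0, 3.4, 1.5, 0.2],
    [6.7, 3.0, 5.2, 2.3],
])

# predict() assigns each new point to the nearest learned cluster center
new_labels = kmeans.predict(new_points)
print("Cluster labels for new points:", new_labels)

# The adjusted Rand index compares the clustering with the true labels;
# it ignores how clusters are numbered and is 1.0 for a perfect match
ari = adjusted_rand_score(y, kmeans.labels_)
print("Adjusted Rand index:", ari)
```

The cluster indices are arbitrary, so cluster 0 does not necessarily correspond to species 0.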

By following these steps, you can perform clustering using scikit-learn on your own datasets.
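
On your own data the number of clusters is usually not known in advance. One common heuristic, sketched here under the assumption that K-means is a reasonable fit for the data, is to compare silhouette scores across several candidate values of n_clusters. The range of 2 to 6 is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# Fit one model per candidate cluster count and record its silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette score = {scores[k]:.3f}")

# The highest score is a hint, not proof, of a suitable cluster count
best_k = max(scores, key=scores.get)
print("Highest silhouette score at k =", best_k)
```

The score should be weighed against domain knowledge, since the highest value does not always match the grouping that is most useful in practice.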