How to perform clustering using scikit-learn?
Published on Aug. 22, 2023, 12:18 p.m.
To perform clustering using scikit-learn, you can follow these general steps:
- Load the dataset (for example, as a NumPy array or a pandas DataFrame) and preprocess it if necessary, for instance by scaling the features.
- Choose a clustering algorithm that best suits your dataset and problem. Some popular ones include K-means, DBSCAN, and hierarchical clustering.
- Create an instance of the chosen clustering algorithm and set any hyperparameters as needed.
- Train the clustering model on the data by calling its fit() method.
- If necessary, predict cluster labels for new data points using the predict() method of the trained model. Not every algorithm provides one: KMeans does, while DBSCAN and AgglomerativeClustering only assign labels to the data they were fitted on.
- Evaluate the performance of the clustering algorithm using appropriate metrics such as the silhouette score, the Davies-Bouldin index, or domain-specific measures.
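To make the choice of algorithm more concrete, here is a minimal sketch that runs the three algorithms mentioned above on standardized iris features. The hyperparameter values (eps=0.6, min_samples=5, n_clusters=3) are illustrative assumptions rather than tuned settings, and the names models and results are introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Preprocess: scale every feature to zero mean and unit variance
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Hypothetical hyperparameter choices, picked only for illustration
models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "dbscan": DBSCAN(eps=0.6, min_samples=5),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

results = {}
for name, model in models.items():
    # fit_predict() fits the model and returns one label per training sample
    labels = model.fit_predict(X_scaled)
    # DBSCAN marks noise points with the label -1, which is not a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    results[name] = (labels, n_clusters)
    print(name, "clusters:", n_clusters,
          "has predict():", hasattr(model, "predict"))
```

All three estimators support fit_predict(), so the loop treats them uniformly. Only KMeans can label unseen points afterwards with predict().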
Here is an example of performing K-means clustering on the iris dataset:
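from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load the iris feature matrix
iris_data = load_iris()
X = iris_data.data
y = iris_data.target  # true species labels; unused here because clustering is unsupervised

# Fit K-means with 3 clusters; n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

# Cluster assignment for each training sample
labels = kmeans.labels_

# Mean silhouette coefficient over all samples, between -1 and 1; higher is better
score = silhouette_score(X, labels)
print("Silhouette score:", score)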
In this example, we load the iris dataset, instantiate KMeans with 3 clusters, fit the model, read the cluster labels assigned to the training samples from the labels_ attribute, and evaluate the result with the silhouette score. The score ranges from -1 to 1, and higher values indicate more compact, better-separated clusters.
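Two of the general steps are not covered by the example above: choosing a hyperparameter and predicting labels for new data. The following sketch, under the assumption that the silhouette score is an acceptable selection criterion, compares several values of n_clusters and then calls predict() on two made-up measurements. The range of k values and the new_points array are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data

# Compare candidate cluster counts by their silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# The k with the highest score; it need not match the number of species
best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)

# Predict clusters for unseen samples with a fitted 3-cluster model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Hypothetical measurements: sepal length, sepal width, petal length, petal width (cm)
new_points = np.array([
    [5.0, 3.4, 1.5, 0.2],
    [6.5, 3.0, 5.5, 2.0],
])
new_labels = kmeans.predict(new_points)
print("Predicted clusters:", new_labels)
```

The integer cluster ids returned by predict() are arbitrary: they identify which centroid is nearest, not which species a sample belongs to.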
By following these steps, you can perform clustering using scikit-learn on your own datasets.