Step-by-Step Python Implementation of Spectral Clustering
Let's implement Spectral Clustering using Python with detailed steps, example data, and outputs.
Step 1: Install Required Libraries
We will use scikit-learn, numpy, and matplotlib for the Spectral Clustering algorithm, creating and visualizing data.
You can install these libraries using pip if you don't have them already:
pip install numpy matplotlib scikit-learn
Step 2: Import Required Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
Step 3: Generate Example Data
We'll create a toy dataset using make_moons from sklearn.datasets. This dataset is a simple two-class dataset, which is ideal for clustering.
# Create synthetic data - two interlocking crescent shapes (moons)
X, y = make_moons(n_samples=300, noise=0.1, random_state=42)
# Standardize the features for better results
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Plot the dataset to visualize
plt.figure(figsize=(8, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, cmap='viridis')
plt.title("Original Moon Dataset")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
Step 4: Apply Spectral Clustering
Now, we’ll apply the Spectral Clustering algorithm from sklearn.cluster. The number of clusters kk will be set to 2, as our dataset consists of two clusters.
# Initialize SpectralClustering with 2 clusters
spectral = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', random_state=42)
# Fit the model and predict cluster labels
labels = spectral.fit_predict(X_scaled)
# Plot the clustering results
plt.figure(figsize=(8, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.title("Spectral Clustering Results")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
Step 5: Understanding Parameters of Spectral Clustering
- n_clusters: Specifies the number of clusters.
- affinity: Defines how to compute the similarity matrix. The options are:
- nearest_neighbors: Uses a K-nearest neighbors graph to calculate similarities.
- rbf: Uses a Gaussian (RBF) kernel.
- random_state: Controls the randomness of the clustering.
In this case, we use nearest_neighbors to compute the affinity matrix, which helps when dealing with non-convex clusters, like our moon-shaped dataset.
Step 6: Output and Interpretation
Visualizing Original Data and Clustering Output:
- Original Data:
- The dataset consists of two crescent-shaped clusters (interlocking moons).
- We visualize the data with points colored according to their ground truth class labels (y).
- Clustering Results:
- The points are colored according to the predicted cluster labels obtained from Spectral Clustering.
- The algorithm is able to correctly separate the two moon-shaped clusters.
Step 7: Performance Evaluation (Optional)
If you have the true labels (which we do, since we generated the synthetic dataset), you can evaluate the clustering performance using metrics such as accuracy, adjusted_rand_score, or silhouette_score.
from sklearn.metrics import accuracy_score, adjusted_rand_score
# Evaluating clustering performance using accuracy and ARI
accuracy = accuracy_score(y, labels)
ari = adjusted_rand_score(y, labels)
print(f"Clustering Accuracy: {accuracy:.2f}")
print(f"Adjusted Rand Index: {ari:.2f}")
Step 8: Modify Parameters and Experiment
You can experiment with different parameters of the SpectralClustering class to observe how the clustering results change:
Change affinity to rbf (Gaussian Kernel):
spectral_rbf = SpectralClustering(n_clusters=2, affinity='rbf', random_state=42) labels_rbf = spectral_rbf.fit_predict(X_scaled) plt.figure(figsize=(8, 6)) plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels_rbf, cmap='viridis') plt.title("Spectral Clustering with RBF Kernel") plt.xlabel("Feature 1") plt.ylabel("Feature 2") plt.show()
Change the Number of Neighbors (for nearest_neighbors affinity):
spectral_k = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=10, random_state=42) labels_k = spectral_k.fit_predict(X_scaled) plt.figure(figsize=(8, 6)) plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels_k, cmap='viridis') plt.title("Spectral Clustering with 10 Neighbors") plt.xlabel("Feature 1") plt.ylabel("Feature 2") plt.show()
Step 9: Conclusion
This is a simple yet powerful example of applying Spectral Clustering to a dataset. Spectral Clustering is particularly useful for datasets that have non-convex, complex shapes. By using eigenvalues and eigenvectors, it transforms the problem of clustering into a more tractable problem in a lower-dimensional space.
Key Takeaways:
- Spectral Clustering uses graph theory to cluster data points based on their similarities.
- The affinity parameter determines how the similarity matrix is constructed. This allows flexibility in how the data is represented.
- Spectral Clustering can handle clusters with arbitrary shapes, making it useful in many real-world scenarios where traditional algorithms like K-means fail.