-
-
Notifications
You must be signed in to change notification settings - Fork 215
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
90 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
--- | ||
|
||
id: silhouette | ||
title: Silhouette Score | ||
sidebar_label: Silhouette Score | ||
description: "The Silhouette Score is a metric used to evaluate the quality of clustering results by measuring cohesion and separation among clusters." | ||
tags: [machine learning, clustering, unsupervised learning, silhouette score, evaluation metric] | ||
|
||
--- | ||
|
||
### Definition: | ||
**Silhouette Score** is an evaluation metric used in unsupervised learning, specifically for clustering algorithms. It helps assess the quality of clusters by measuring how similar each data point is to its own cluster compared to other clusters. A higher silhouette score indicates that data points are well-matched to their own clusters and poorly matched to neighboring clusters, suggesting a good clustering structure. | ||
|
||
### Characteristics: | ||
- **Cohesion and Separation**: | ||
The silhouette score balances two critical aspects in clustering: | ||
- **Cohesion**: Measures how close a data point is to other points within the same cluster. | ||
- **Separation**: Measures how far a data point is from points in the nearest neighboring cluster. | ||
|
||
- **Range**: | ||
The silhouette score ranges from -1 to +1: | ||
- **+1**: Indicates that data points are highly cohesive within clusters and well-separated from other clusters. | ||
- **0**: Suggests that data points are on or very close to the boundary between two clusters. | ||
- **-1**: Indicates that data points may be misclassified, assigned to the wrong clusters. | ||
|
||
|
||
### Steps Involved: | ||
1. **Calculate Average Intra-cluster Distance (`a`)**: | ||
For each data point, calculate the average distance between the point and all other points in its assigned cluster. This value represents the cohesion within the cluster. | ||
|
||
2. **Calculate Nearest-cluster Distance (`b`)**: | ||
For each data point, calculate the average distance between the point and all points in the nearest cluster (not including its own). This value represents the separation from other clusters. | ||
|
||
3. **Compute Silhouette Score for Each Point**: | ||
The silhouette score for each data point is calculated as: | ||
|
||
$s = \frac{b - a}{\max(a, b)}$ | ||
|
||
4. **Aggregate Scores**: | ||
To evaluate the overall clustering structure, the silhouette scores of all points can be averaged, giving an overall silhouette score for the clustering solution. | ||
|
||
### Problem Statement: | ||
Given a dataset and a clustering solution, the goal of the silhouette score is to provide a quantitative measure of how well-defined each cluster is. The silhouette score assists in identifying the ideal number of clusters, enhancing model selection in clustering tasks. | ||
|
||
### Key Concepts: | ||
- **Cohesion (`a`)**: | ||
Reflects the average distance between a data point and other points within the same cluster, indicating how tightly packed the cluster is. | ||
|
||
- **Separation (`b`)**: | ||
Reflects the average distance between a data point and points in the nearest neighboring cluster, indicating the distinction between clusters. | ||
|
||
### Mathematical Formula: | ||
The silhouette score for a data point is given by: | ||
|
||
$Silhoutte Score = \frac{b - a}{\max(a, b)}$ | ||
|
||
Where: | ||
- `a` is the average intra-cluster distance for the data point. | ||
- `b` is the average distance from the data point to the points in the nearest cluster. | ||
|
||
### Time Complexity: | ||
- **Best, Average, and Worst Case: $O(n^2)$** | ||
Calculating the silhouette score involves computing pairwise distances between all data points, which requires $O(n^2)$ time complexity. If clustering is already performed, the silhouette score calculation is dominated by this pairwise distance step. | ||
|
||
### Space Complexity: | ||
- **Space Complexity: $O(n^2)$** | ||
The space complexity is driven by the need to store pairwise distance matrices for all data points, which requires $O(n^2)$ space. Additionally, storing cluster labels requires $O(n)$ space. | ||
|
||
### Python Implementation: | ||
Here is a basic implementation of silhoutte score in Python using the **sklearn** library: | ||
|
||
```python | ||
from sklearn.cluster import KMeans | ||
from sklearn.metrics import silhouette_score | ||
from sklearn.datasets import make_blobs | ||
|
||
# Create a sample dataset | ||
X, _ = make_blobs(n_samples=300, centers=4, random_state=42) | ||
|
||
# Apply K-means clustering | ||
kmeans = KMeans(n_clusters=4) | ||
cluster_labels = kmeans.fit_predict(X) | ||
|
||
# Calculate the silhouette score | ||
score = silhouette_score(X, cluster_labels) | ||
print(f'Silhouette Score: {score}') | ||
``` | ||
|
||
### Summary: | ||
**Silhoutte** is a valuable tool in clustering analysis for evaluating the quality of clusters. By measuring the cohesion and separation of clusters, it provides insights into the effectiveness of clustering solutions. It is commonly used to determine the optimal number of clusters in unsupervised learning tasks. |