Skip to content

Commit

Permalink
add: silhoutte score in the ml doc
Browse files Browse the repository at this point in the history
  • Loading branch information
bmukesh23 committed Nov 8, 2024
1 parent 89e6957 commit 60c331b
Showing 1 changed file with 90 additions and 0 deletions.
90 changes: 90 additions & 0 deletions docs/machine-learning/Silhoutte.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
---

id: silhouette
title: Silhouette Score
sidebar_label: Silhouette Score
description: "The Silhouette Score is a metric used to evaluate the quality of clustering results by measuring cohesion and separation among clusters."
tags: [machine learning, clustering, unsupervised learning, silhouette score, evaluation metric]

---

### Definition:
**Silhouette Score** is an evaluation metric used in unsupervised learning, specifically for clustering algorithms. It helps assess the quality of clusters by measuring how similar each data point is to its own cluster compared to other clusters. A higher silhouette score indicates that data points are well-matched to their own clusters and poorly matched to neighboring clusters, suggesting a good clustering structure.

### Characteristics:
- **Cohesion and Separation**:
The silhouette score balances two critical aspects in clustering:
- **Cohesion**: Measures how close a data point is to other points within the same cluster.
- **Separation**: Measures how far a data point is from points in the nearest neighboring cluster.

- **Range**:
The silhouette score ranges from -1 to +1:
- **+1**: Indicates that data points are highly cohesive within clusters and well-separated from other clusters.
- **0**: Suggests that data points are on or very close to the boundary between two clusters.
- **-1**: Indicates that data points may be misclassified, assigned to the wrong clusters.


### Steps Involved:
1. **Calculate Average Intra-cluster Distance (`a`)**:
For each data point, calculate the average distance between the point and all other points in its assigned cluster. This value represents the cohesion within the cluster.

2. **Calculate Nearest-cluster Distance (`b`)**:
For each data point, calculate the average distance between the point and all points in the nearest cluster (not including its own). This value represents the separation from other clusters.

3. **Compute Silhouette Score for Each Point**:
The silhouette score for each data point is calculated as:

$s = \frac{b - a}{\max(a, b)}$

4. **Aggregate Scores**:
To evaluate the overall clustering structure, the silhouette scores of all points can be averaged, giving an overall silhouette score for the clustering solution.

### Problem Statement:
Given a dataset and a clustering solution, the goal of the silhouette score is to provide a quantitative measure of how well-defined each cluster is. The silhouette score assists in identifying the ideal number of clusters, enhancing model selection in clustering tasks.

### Key Concepts:
- **Cohesion (`a`)**:
Reflects the average distance between a data point and other points within the same cluster, indicating how tightly packed the cluster is.

- **Separation (`b`)**:
Reflects the average distance between a data point and points in the nearest neighboring cluster, indicating the distinction between clusters.

### Mathematical Formula:
The silhouette score for a data point is given by:

$Silhoutte Score = \frac{b - a}{\max(a, b)}$

Where:
- `a` is the average intra-cluster distance for the data point.
- `b` is the average distance from the data point to the points in the nearest cluster.

### Time Complexity:
- **Best, Average, and Worst Case: $O(n^2)$**
Calculating the silhouette score involves computing pairwise distances between all data points, which requires $O(n^2)$ time complexity. If clustering is already performed, the silhouette score calculation is dominated by this pairwise distance step.

### Space Complexity:
- **Space Complexity: $O(n^2)$**
The space complexity is driven by the need to store pairwise distance matrices for all data points, which requires $O(n^2)$ space. Additionally, storing cluster labels requires $O(n)$ space.

### Python Implementation:
Here is a basic implementation of silhoutte score in Python using the **sklearn** library:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Create a sample dataset
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply K-means clustering
kmeans = KMeans(n_clusters=4)
cluster_labels = kmeans.fit_predict(X)

# Calculate the silhouette score
score = silhouette_score(X, cluster_labels)
print(f'Silhouette Score: {score}')
```

### Summary:
**Silhoutte** is a valuable tool in clustering analysis for evaluating the quality of clusters. By measuring the cohesion and separation of clusters, it provides insights into the effectiveness of clustering solutions. It is commonly used to determine the optimal number of clusters in unsupervised learning tasks.

0 comments on commit 60c331b

Please sign in to comment.