This code provides a benchmarking framework for evaluating how easy it is to tune the hyperparameters of a clustering algorithm.
- To create a conda virtual environment with all the dependencies installed, run:
  conda env create --name clustering-env --file=environment.yml
  Alternatively, if you prefer pip, create a virtual environment and run:
  pip install -r requirements.txt
- After cloning the repo, you can install all required dependencies by running
  make install
  in the root directory. (Note: this installs Python modules, so make sure the intended virtual environment is active.)
- [Optional] To build a Docker environment using the provided Dockerfile or docker-compose.yml, go to the root directory of the repository and run:
  # Using docker build
  docker build -t clustering-benchmark .
  # Using docker compose
  docker-compose up
This module provides a command-line interface, powered by Facebook Hydra, available through the clustering_hyperparameters command.
The default options are defined in the global config file.
They can be overridden from the CLI in the following way:
clustering_hyperparameters override_key1=override_value1
override_key2=override_value2 ...
override_key_n=override_value_n
# e.g
clustering_hyperparameters suite=nlp model=dbscan
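To make the override mechanism concrete, here is a minimal pure-Python sketch of how key=value arguments can be merged over a set of defaults. It is not how Hydra is implemented: Hydra also composes config groups and parses value types, and the `apply_overrides` helper and the default values below are hypothetical names chosen for illustration.

```python
from __future__ import annotations

from copy import deepcopy


def apply_overrides(defaults: dict, overrides: list[str]) -> dict:
    """Return a copy of `defaults` with `key=value` overrides applied.

    Dotted keys such as `a.b=1` address nested dictionaries.
    Values are kept as strings; Hydra itself also parses types.
    """
    config = deepcopy(defaults)
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Create intermediate dictionaries on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


# Hypothetical defaults, standing in for the global config file.
defaults = {"suite": "default-suite", "model": "kmeans"}
config = apply_overrides(defaults, ["suite=nlp", "model=dbscan"])
print(config)  # {'suite': 'nlp', 'model': 'dbscan'}
```

The defaults are copied before merging, so the same base configuration can be reused across runs with different overrides.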
Modify the config file at src/clustering_hyperparameters/conf/model/<model>.yaml (e.g. the dbscan model config file).
You can define dynamic ranges using Hydra's variable interpolation (e.g. the kmeans-minibatch config file).
Modified dbscan parameter ranges:
name: dbscan
params:
  - name: metric
    type: fixed
    value_type: str
    value: cosine
  # Updated eps upper bound from 0.999 to 1.999
  - name: eps
    type: range
    value_type: float
    bounds: [0.001, 1.999]
  # Updated min_samples upper bound from 100 to 500
  - name: min_samples
    type: range
    value_type: int
    bounds: [2, 500]
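As an illustration of how a parameter spec like the dbscan config above could be consumed, the sketch below draws one hyperparameter configuration from it. Uniform random sampling is an assumption made for this example; the benchmark may use a different search strategy, and `sample_params` is a hypothetical helper, not part of the package.

```python
import random

# Parameter spec mirroring the dbscan config above, written as Python data.
PARAMS = [
    {"name": "metric", "type": "fixed", "value_type": "str", "value": "cosine"},
    {"name": "eps", "type": "range", "value_type": "float", "bounds": [0.001, 1.999]},
    {"name": "min_samples", "type": "range", "value_type": "int", "bounds": [2, 500]},
]


def sample_params(params, rng):
    """Draw one configuration: fixed values as-is, ranges uniformly at random."""
    sampled = {}
    for p in params:
        if p["type"] == "fixed":
            sampled[p["name"]] = p["value"]
        elif p["type"] == "range":
            low, high = p["bounds"]
            if p["value_type"] == "int":
                # randint includes both endpoints.
                sampled[p["name"]] = rng.randint(low, high)
            else:
                sampled[p["name"]] = rng.uniform(low, high)
        else:
            raise ValueError(f"unknown parameter type: {p['type']!r}")
    return sampled


rng = random.Random(0)  # seeded so repeated runs draw the same trial
trial = sample_params(PARAMS, rng)
print(trial)
```

Widening a bound in the config, as done for eps and min_samples, simply enlarges the interval the sampler draws from.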
To add a new model, add a new file to the model directory that registers a ClusteringModel subclass:
@ClusteringModel.register('my-new-model')
class MyNewModel(ClusteringModel):
    def __init__(self, **parameters):
        # Initialize parameters
        ...

    def fit(self, x):
        # Define how to fit a model
        ...

    def get_labels(self):
        # Define how to get clustering label assignments
        ...
To use the new model, run:
clustering_hyperparameters model=my-new-model
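The registration step can be pictured with a small self-contained sketch. The `ClusteringModel` class below is a stand-in written for this example, not the package's real base class; its `_registry` attribute and `create` helper are assumptions about how a name-to-class registry might work, and the toy model simply thresholds 1-D values.

```python
class ClusteringModel:
    """Stand-in base class with a name-to-class registry (assumed design)."""

    _registry = {}

    @classmethod
    def register(cls, name):
        """Return a class decorator that stores the subclass under `name`."""
        def decorator(subclass):
            if name in cls._registry:
                raise ValueError(f"model {name!r} is already registered")
            cls._registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def create(cls, name, **parameters):
        # Hypothetical helper: look up a registered class and instantiate it.
        return cls._registry[name](**parameters)

    def fit(self, x):
        raise NotImplementedError

    def get_labels(self):
        raise NotImplementedError


@ClusteringModel.register("my-new-model")
class MyNewModel(ClusteringModel):
    """Toy model: splits 1-D points into two clusters at a threshold."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold
        self.labels_ = None

    def fit(self, x):
        self.labels_ = [int(value > self.threshold) for value in x]
        return self

    def get_labels(self):
        return self.labels_


model = ClusteringModel.create("my-new-model", threshold=1.5)
labels = model.fit([0.2, 1.0, 2.4, 3.1]).get_labels()
print(labels)  # [0, 0, 1, 1]
```

With a registry of this kind, the string passed as model=my-new-model on the command line is enough to locate the class, so no other code has to import it explicitly.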
To define a new suite, create a new file in the suite directory in the following format:
name: my-suite
cache_dir: ${root_dir}/data/my-suite
datasets:
  - name: mfeat-fourier
    loader: openml
    metadata:
      id: 14
      num_instances: 2000
      num_features: 76
      num_clusters: 10
  ....
  - name: AGNews-paraphrase-mpnet
    loader: torchtext
    metadata:
      tag: AG_NEWS
      split: test
      encoder: sentence-transformer
      encoder_model: paraphrase-mpnet-base-v2
      num_instances: 7600
      num_features: 768
      num_clusters: 4
To add a new data loader, register a DatasetLoader subclass:
@DatasetLoader.register("my-data-loader")
class MyDataLoader(DatasetLoader):
    def __init__(self, name, metadata):
        # Initialize metadata
        ...

    def fetch_and_cache(self, cache_dir):
        # Fetch dataset, perform preprocessing and cache it using `Dataset.store_from_data` utility
        ...
To use this data loader, set loader: my-data-loader in the suite config file.
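The fetch-and-cache idea can be sketched without the package itself. In the example below the loader is a plain class rather than a DatasetLoader subclass, random numbers stand in for a real download, and a JSON file replaces the `Dataset.store_from_data` utility, whose signature is not shown here; all of these are assumptions made for illustration.

```python
import json
import random
import tempfile
from pathlib import Path


class MyDataLoader:
    """Sketch of a loader; the real one would subclass DatasetLoader."""

    def __init__(self, name, metadata):
        self.name = name
        self.metadata = metadata

    def fetch_and_cache(self, cache_dir):
        cache_path = Path(cache_dir) / f"{self.name}.json"
        if cache_path.exists():
            # Reuse the cached copy instead of fetching again.
            return json.loads(cache_path.read_text())
        # Stand-in for a real download: generate random features.
        rng = random.Random(0)
        n = self.metadata["num_instances"]
        d = self.metadata["num_features"]
        data = {"x": [[rng.random() for _ in range(d)] for _ in range(n)]}
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(data))
        return data


with tempfile.TemporaryDirectory() as tmp:
    loader = MyDataLoader("toy", {"num_instances": 5, "num_features": 3})
    first = loader.fetch_and_cache(tmp)   # generates and writes the cache file
    second = loader.fetch_and_cache(tmp)  # served from the cache file
    print(len(first["x"]), len(first["x"][0]), first == second)
```

Keying the cache file on the dataset name keeps repeated benchmark runs from fetching and preprocessing the same data again.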
The Jupyter notebooks containing the results and plots from the paper are in the experiments directory. They include plots for the fANOVA analysis.
In a SLURM environment, run the script file:
./bin/run_all_exps.sh
This will spawn multiple sbatch jobs which will run all the required evaluations in parallel.
For ease of tweaking, the evaluation outputs described in the paper are provided in output.zip.
To reproduce the results:
- Extract output.zip to output/
- Open any Jupyter notebook in the experiments folder and run it to get the corresponding plots.
@inproceedings{mishra2022evaluative,
title={An evaluative measure of clustering methods incorporating hyperparameter sensitivity},
author={Mishra, Siddhartha and Monath, Nicholas and Boratko, Michael and Kobren, Ariel and McCallum, Andrew},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={36},
number={7},
pages={7788--7796},
year={2022}
}