Accepted at NeurIPS 2022 - Datasets and Benchmarks Track
- Title: AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection
- Authors: Marius Dragoi, Elena Burceanu, Emanuela Haller, Andrei Manolache, Florin Brad
- ArXiv Preprint
💥💥 AD benchmark for both In-Distribution (ID) and Out-Of-Distribution (OOD) Anomaly Detection tasks (full results)
Out-Of-Distribution Anomaly Detection
Method | ROC-AUC IID | ROC-AUC NEAR | ROC-AUC FAR |
---|---|---|---|
OC-SVM [3] | 76.86 | 71.43 | 49.57 |
IsoForest [10] | 86.09 | 75.26 | 27.16 |
ECOD [6] | 84.76 | 44.87 | 49.19 |
COPOD [8] | 85.62 | 54.24 | 50.42 |
LOF [11] | 91.50 | 79.29 | 34.96 |
SO-GAAL [1] | 50.48 | 54.55 | 49.35 |
deepSVDD [2] | 92.67 | 87.00 | 34.53 |
AE for anomalies [4] | 81.00 | 44.06 | 19.96 |
LUNAR [9] | 85.75 | 49.03 | 28.19 |
InternalContrastiveLearning [7] | 84.86 | 52.26 | 22.45 |
BERT for anomalies [5] | 84.54 | 86.05 | 28.15 |
- Average results over multiple runs
- Train data: files "[year]_subset.parquet" with year in {2006, 2007, 2008, 2009, 2010}
- IID test data: files "[year]_subset_valid.parquet" with year in {2006, 2007, 2008, 2009, 2010}
- NEAR test data: files "[year]_subset.parquet" with year in {2011, 2012, 2013}
- FAR test data: files "[year]_subset.parquet" with year in {2014, 2015}
- Results for each split are reported as an average over the performance on each year
- Scripts for reproducing the results are available in 'baselines_OOD_setup/' (see the Baselines section for more details).
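The split definition above can be sketched in a few lines of Python. This is only an illustration of the file naming and the per-year averaging described in the list; the names `OOD_SPLITS`, `split_files` and `split_score` are hypothetical and do not come from the repository scripts.

```python
from statistics import mean

# Year ranges and file patterns as listed above for the OOD setup.
OOD_SPLITS = {
    "train": (range(2006, 2011), "{year}_subset.parquet"),
    "iid": (range(2006, 2011), "{year}_subset_valid.parquet"),
    "near": (range(2011, 2014), "{year}_subset.parquet"),
    "far": (range(2014, 2016), "{year}_subset.parquet"),
}


def split_files(split):
    """Return the parquet file names that make up one split."""
    years, pattern = OOD_SPLITS[split]
    return [pattern.format(year=year) for year in years]


def split_score(per_year_scores):
    """Average per-year scores (a dict mapping year -> ROC-AUC) into one split score."""
    return mean(per_year_scores.values())
```

For example, `split_files("far")` lists the two FAR files, and `split_score` would receive one ROC-AUC value per FAR year, however those values were computed.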
In-Distribution Anomaly Detection
Method | ROC-AUC |
---|---|
OC-SVM [3] | 68.73 |
IsoForest [10] | 81.27 |
ECOD [6] | 79.41 |
COPOD [8] | 80.89 |
LOF [11] | 87.61 |
SO-GAAL [1] | 49.90 |
deepSVDD [2] | 88.24 |
AE for anomalies [4] | 64.08 |
LUNAR [9] | 78.53 |
InternalContrastiveLearning [7] | 66.99 |
BERT for anomalies [5] | 79.62 |
- Average results over multiple runs
- Train data: files "[year]_subset.parquet" with year in {2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015}
- Test data: files "[year]_subset_valid.parquet" with year in {2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015}
- Scripts for reproducing the results are available in 'baselines_ID_setup/' (see the Baselines section for more details).
- We introduce an unsupervised anomaly detection benchmark with data that shifts over time, built on Kyoto-2006+, a traffic dataset for network intrusion detection. This type of data meets the premise of a shifting input distribution: it covers a large time span (from 2006 to 2015), with naturally occurring changes over time. In AnoShift, we split the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models (from masked language modeling (MLM) to classical Isolation Forest).
- With all tested baselines, we notice a significant decrease in performance on FAR years for inliers, which suggests that those years have some particularity. We observe a large Jeffreys divergence between FAR and the rest of the years for two features: service type and the number of bytes sent by the source IP.
- From the OTDD analysis we observe that: first, the inliers from FAR are very distant from the training years; and second, the outliers from FAR are quite close to the training inliers.
- We propose a BERT model for MLM and compare several training regimes (IID, finetune, and a basic distillation technique), and show that acknowledging the distribution shift leads to better test performance on average.
- The original Kyoto-2006+ data is available at: https://www.takakura.com/Kyoto_data/ (in AnoShift we used the New data, Nov. 01, 2006 - Dec. 31, 2015)
- The preprocessed Kyoto-2006+ data is available at: https://share.bitdefender.com/s/9D4bBE7H8XTdYDB
- The result is obtained by applying the preprocessing script data_processor/parse_kyoto_logbins.py to the original data.
- The preprocessed dataset is provided in pandas parquet format, both as full sets and as subsets with 300k inlier samples and the same outlier proportions as the original data.
- In the notebook tutorials, we use the subsets for fast experimentation. In our experiments, the subset results are consistent with the full sets.
- The label column (18) has value 1 for the inlier class (normal traffic), and -1 (known type) or -2 (unknown type) for anomalies.
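The Jeffreys divergence mentioned in the feature analysis above is the symmetrized KL divergence, J(P, Q) = KL(P || Q) + KL(Q || P) = Σ (p_i - q_i) log(p_i / q_i). A minimal sketch for two histograms of the same feature (for example, service type counts in two different years) could look as follows. The smoothing constant `eps` and the function names are assumptions made for illustration; this section does not specify how the divergence was estimated in the paper.

```python
import math


def normalize(counts, eps=1e-10):
    """Turn raw bin counts into a probability vector, smoothing empty bins (assumed eps)."""
    total = sum(counts)
    probs = [c / total + eps for c in counts]
    norm = sum(probs)
    return [p / norm for p in probs]


def jeffreys_divergence(p_counts, q_counts):
    """Jeffreys divergence between two histograms defined over the same bins."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p = normalize(p_counts)
    q = normalize(q_counts)
    # KL(P||Q) + KL(Q||P) simplifies to sum((p - q) * log(p / q)).
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

The value is zero for identical histograms and grows as the two distributions drift apart, which is why a large value between FAR and the other years points to a shift in that feature.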
curl https://share.bitdefender.com/s/9D4bBE7H8XTdYDB/download --output AnoShift.zip
mkdir datasets
mv AnoShift.zip datasets
unzip datasets/AnoShift.zip -d datasets/
rm datasets/AnoShift.zip
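Once the archive is unpacked, the label convention described above (1 for inliers, -1 and -2 for anomalies) has to be mapped to a binary target before computing ROC-AUC. The sketch below is a plain-Python illustration: it assumes the labels were already read from column 18 (for instance with `pandas.read_parquet`) and that higher scores mean "more anomalous". The helper names are hypothetical, and the pairwise ROC-AUC is quadratic in the number of samples, so it is only meant for small examples.

```python
def to_binary_anomaly(labels):
    """Map AnoShift labels to a binary target: 1 -> 0 (inlier), -1 / -2 -> 1 (anomaly)."""
    mapping = {1: 0, -1: 1, -2: 1}
    return [mapping[int(label)] for label in labels]


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that an anomaly scores higher than an inlier."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count anomaly/inlier pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On full yearly files, a library implementation such as scikit-learn's `roc_auc_score` is the more practical choice.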
- Create a new conda environment:
conda create --name anoshift
- Activate the new environment:
conda activate anoshift
- Install pip:
conda install -c anaconda pip
- Upgrade pip:
pip install --upgrade pip
- Install dependencies:
pip install -r requirements.txt
We provide numerous baselines in the baselines_OOD_setup/ directory, which are a good entry point for getting familiar with the protocol:
- baseline_*.ipynb: isoforest/ocsvm/LOF baselines on AnoShift
- baseline_deep_svdd/baseline_deepSVDD.py: deepSVDD baseline on AnoShift
- baseline_BERT_train.ipynb: BERT baseline on AnoShift
- baseline_InternalContrastiveLearning.py: InternalContrastiveLearning baseline on AnoShift
- baselines_PyOD.py: ['ecod', 'copod', 'lunar', 'ae', 'so_gaal'] baselines on AnoShift using PyOD
- iid_finetune_distill_comparison.ipynb: compare the IID, finetune and distillation training strategies for the BERT model on AnoShift
- Run the notebooks from the root of the project: jupyter-notebook .
If you intend to use AnoShift in the ID setup, please consider the code provided in 'baselines_ID_setup/'. You can use either the full set (full_set=1, all ten years) or the years corresponding to our original IID split (full_set=0, first five years); check the usage instructions for each baseline to switch between them.
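As a rough illustration of what the `full_set` switch selects, the sketch below builds the train and test file lists for the ID setup. The helpers `id_setup_years` and `id_setup_files` are hypothetical and not part of the repository. The file patterns follow the ID setup description, and applying the same patterns to the first five years when `full_set=0` is an assumption.

```python
def id_setup_years(full_set):
    """Years used in the ID setup: all ten if full_set == 1, else the first five."""
    last_year = 2015 if full_set == 1 else 2010
    return list(range(2006, last_year + 1))


def id_setup_files(full_set):
    """Return (train_files, test_files) for the selected years (assumed naming)."""
    years = id_setup_years(full_set)
    train = [f"{year}_subset.parquet" for year in years]
    test = [f"{year}_subset_valid.parquet" for year in years]
    return train, test
```

The actual way to pass the flag differs per baseline, so the usage instructions of each script remain the reference.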
@article{druagoi2022anoshift,
title={AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection},
author={Dr{\u{a}}goi, Marius and Burceanu, Elena and Haller, Emanuela and Manolache, Andrei and Brad, Florin},
journal={Neural Information Processing Systems {NeurIPS}, Datasets and Benchmarks Track},
year={2022}
}