(Brent A. Griffin*, Jacob Marks, Jason J. Corso) @ Voxel51
* Corresponding author
Zero-Shot Coreset Selection (ZCore) is a method of coreset selection for unlabeled data. Deep learning methods rely on massive data, resulting in substantial costs for storage, annotation, and model training. Coreset selection aims to select a subset of the data to train models with lower cost while ideally performing on par with the full data training. Although the majority of real-world data are unlabeled, previous state-of-the-art coreset methods cannot select data that are unlabeled. As a solution, ZCore addresses the problem of coreset selection without labels or training on candidate data. Instead, ZCore uses existing foundation models to generate a zero-shot embedding space for unlabeled data, then quantifies the relative importance of each example based on overall coverage and redundancy within the embedding distribution. On ImageNet, the ZCore coreset achieves a higher accuracy than previous label-based coresets at a 90% prune rate, while removing annotation requirements for 1.15 million images.
Zero-Shot Coreset Selection Overview
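To make the idea of scoring by coverage and redundancy concrete, here is a toy sketch in NumPy. It is not the ZCore implementation: the function name, the uniform sampling over the embedding bounding box, the nearest-neighbor redundancy penalty, and all parameters are assumptions chosen for illustration. Each example earns credit when it is the nearest example to a randomly sampled point (coverage) and loses credit when its nearest neighbors are very close (redundancy).

```python
import numpy as np


def toy_coverage_redundancy_scores(embeddings, n_samples=1000, n_neighbors=5,
                                   redundancy_weight=1.0, seed=0):
    """Toy per-example importance score (hypothetical, not the ZCore code)."""
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape

    # Coverage: sample random points inside the bounding box of the embeddings
    # and credit the example nearest to each sampled point.
    low, high = embeddings.min(axis=0), embeddings.max(axis=0)
    samples = rng.uniform(low, high, size=(n_samples, d))
    sample_dist = np.linalg.norm(
        samples[:, None, :] - embeddings[None, :, :], axis=-1)
    coverage = np.bincount(sample_dist.argmin(axis=1), minlength=n) / n_samples

    # Redundancy: examples whose nearest neighbors are very close are penalized.
    pair_dist = np.linalg.norm(
        embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(pair_dist, np.inf)  # ignore self-distance
    k = min(n_neighbors, n - 1)
    mean_nn_dist = np.sort(pair_dist, axis=1)[:, :k].mean(axis=1)
    redundancy = 1.0 - mean_nn_dist / max(mean_nn_dist.max(), 1e-12)

    return coverage - redundancy_weight * redundancy


# Synthetic 2-D "embeddings": a tight cluster of 20 near-duplicates plus one
# isolated example at index 20.
demo_rng = np.random.default_rng(0)
cluster = demo_rng.normal(0.0, 0.01, size=(20, 2))
outlier = np.array([[1.0, 1.0]])
embeddings = np.vstack([cluster, outlier])

scores = toy_coverage_redundancy_scores(embeddings)
print("Highest-scoring example index:", int(scores.argmax()))
```

In this toy setup the isolated example covers a large part of the space and has no close neighbors, so it ranks above the near-duplicates in the cluster.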
We provide example ZCore commands for coreset selection and subsequent model training for the EuroSAT10 dataset from our paper. See instructions in Repeat Trials to repeat experiment trials and Dataset Setup for full ImageNet, CIFAR, or EuroSAT setup.
Step 1. Dataset. Download and unzip eurosat10.zip in ./data.
Step 2. Zero-Shot Coreset Selection
python zeroshot_coreset_selection.py --dataset eurosat10 --data_dir ./data --results_dir ./results --embedding clip resnet18 --num_workers 10
Step 2 requires FiftyOne to generate embeddings (pip install fiftyone).
Step 3. Train Coreset Model
python train_coreset_model.py --prune_rate 0.7 --dataset eurosat10 --data_dir ./data --score_file ./results/eurosat10/zcore-eurosat10-clip-resnet18-1000Ks-2sd-ri-1000nn-4ex-0/score.npy
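The training command takes a prune rate and a score file. As a rough sketch of how the two might combine, the snippet below keeps the highest-scoring fraction of examples. It assumes that score.npy holds a 1-D array with one score per example and that a higher score means a more important example; neither assumption is confirmed by this README, and select_coreset_indices is a hypothetical helper, not a function from the repository.

```python
import numpy as np


def select_coreset_indices(scores, prune_rate):
    """Return sorted indices of the examples kept after pruning.

    Hypothetical helper: assumes higher score = more important example.
    """
    if not 0.0 <= prune_rate < 1.0:
        raise ValueError("prune_rate must be in [0, 1)")
    n_keep = max(1, int(round(len(scores) * (1.0 - prune_rate))))
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    return np.sort(order[:n_keep])


# Synthetic scores for ten examples; with a real run you would instead load
# the file written in Step 2, e.g. scores = np.load(".../score.npy").
scores = np.array([0.1, 0.9, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6, 0.0])
kept = select_coreset_indices(scores, prune_rate=0.7)
print(kept)
```

With a prune rate of 0.7, three of the ten synthetic examples are kept, and those indices would define the training subset.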
We provide example scripts to repeat ZCore experiments over multiple trials in ./repeat-trial-scripts.
Repeat ZCore Selections for EuroSAT10
chmod +x ./repeat-trial-scripts/eurosat10-score-x5.sh
./repeat-trial-scripts/eurosat10-score-x5.sh
Repeat Coreset Model Training for EuroSAT10
chmod +x ./repeat-trial-scripts/eurosat10-train-x5.sh
./repeat-trial-scripts/eurosat10-train-x5.sh
We provide example repeat trial results in ./results/example/eurosat10. To tabulate these repeat trials, run:
python process_repeat_trials.py --base_score_dir ./results/example/eurosat10/zcore-eurosat10-clip-resnet18-1000Ks-2sd-ri-1000nn-4ex
to generate the following table:
Setting   p30-s51  p50-s51  p70-s51  p80-s51  p90-s51
Trial Results
0           93.80    91.93    86.10    80.98    63.63
1           93.39    91.26    85.74    78.88    65.58
2           93.63    91.21    87.91    79.84    66.70
3           93.90    92.38    86.91    79.86    65.16
4           94.06    92.26    86.47    80.20    67.75
Aggregate Results
Mean        93.76    91.81    86.63    79.95    65.76
StdDev      0.230    0.491    0.750    0.677    1.398
Overall Mean: 83.58
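The aggregate rows can be checked directly from the trial rows. The sketch below copies the five trial results from the table into a NumPy array and recomputes the statistics. It is an independent check, not the code in process_repeat_trials.py. The StdDev row matches the population standard deviation (NumPy's default, ddof=0) of the listed values.

```python
import numpy as np

# Trial accuracies copied from the table above (rows: trials 0-4,
# columns: p30-s51, p50-s51, p70-s51, p80-s51, p90-s51).
trial_acc = np.array([
    [93.80, 91.93, 86.10, 80.98, 63.63],
    [93.39, 91.26, 85.74, 78.88, 65.58],
    [93.63, 91.21, 87.91, 79.84, 66.70],
    [93.90, 92.38, 86.91, 79.86, 65.16],
    [94.06, 92.26, 86.47, 80.20, 67.75],
])

mean_per_setting = trial_acc.mean(axis=0)
std_per_setting = trial_acc.std(axis=0)  # population std (ddof=0)
overall_mean = trial_acc.mean()

print("Mean   ", np.round(mean_per_setting, 2))
print("StdDev ", np.round(std_per_setting, 3))
print("Overall Mean:", round(float(overall_mean), 2))
```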
ImageNet can be downloaded here and subsequently reformatted using:
cd ./ILSVRC/Data/CLS-LOC/val/
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
CIFAR10 and CIFAR100 can be downloaded here.
EuroSAT80, EuroSAT40, EuroSAT20, and EuroSAT10 can be downloaded here.
If you find this code useful, please consider citing our paper:
@article{griffin24zcore,
title={Zero-Shot Coreset Selection: Efficient Pruning for Unlabeled Data},
author={Griffin, Brent A and Marks, Jacob and Corso, Jason J},
journal={arXiv preprint arXiv:2411.15349},
year={2024}
}
You may also want to check out our open-source toolkit, FiftyOne, which provides a powerful interface for exploring, analyzing, and visualizing datasets for computer vision and machine learning.