Rethinking Resolution in the Context of Efficient Video Recognition (NeurIPS 2022)
By Chuofan Ma, Qiushan Guo, Yi Jiang, Ping Luo, Zehuan Yuan, and Xiaojuan Qi.
We introduce cross-resolution knowledge distillation (ResKD) to make the most of low-resolution frames for efficient video recognition. During training, a pre-trained teacher network taking high-resolution frames as input guides the learning of a student network on low-resolution frames. At inference time, only the student is deployed to make predictions. This simple but effective method largely boosts recognition accuracy on low-resolution frames, and is compatible with state-of-the-art architectures such as 3D-CNNs and Video Transformers.
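To make the idea concrete, here is a minimal sketch of cross-resolution distillation, assuming a standard softened-logits KL formulation; `reskd_loss`, `train_step`, the temperature `T`, and the loss weight `alpha` are all illustrative, not the exact training code of this repo:

```python
# Minimal sketch of cross-resolution knowledge distillation (illustrative,
# not the repo's exact training code).
import torch
import torch.nn.functional as F

def reskd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Cross-entropy on ground-truth labels plus a KL term on softened logits.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd

def train_step(student, teacher, frames_hr, labels, optimizer, low_size=112):
    # frames_hr: (N, C, H, W) high-resolution frames (temporal dimension
    # folded into the batch for simplicity). The frozen teacher sees the
    # high-resolution input; the student sees a downsampled copy.
    frames_lr = F.interpolate(frames_hr, size=(low_size, low_size),
                              mode='bilinear', align_corners=False)
    with torch.no_grad():
        teacher_logits = teacher(frames_hr)
    student_logits = student(frames_lr)
    loss = reskd_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```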
This project is developed with CUDA 11.0, PyTorch 1.7.1, and Python 3.7. Please be aware of possible compatibility issues if you are using other versions.
The following is an example of setting up the experimental environment:
```shell
git clone https://github.com/CVMI-Lab/ResKD.git
cd ResKD
# Install PyTorch/torchvision/torchaudio built against CUDA 11.0
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu110/torch1.7.0/index.html
pip install -r requirements/build.txt
# Install this repo in editable mode
pip install -v -e .
pip install tqdm
pip install timm
# Build NVIDIA apex with CUDA extensions (the apex source is assumed to be
# available under ./apex)
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
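After installation, a quick sanity check along these lines can catch version mismatches early (the expected versions are the ones listed above):

```python
# Quick environment sanity check; expected versions are those this project
# was developed with (minor deviations may still work, see the note above).
import torch
import mmcv

print(torch.__version__)          # expect 1.7.1
print(torch.version.cuda)         # expect 11.0
print(mmcv.__version__)           # expect 1.4.0
print(torch.cuda.is_available())  # should be True on a GPU machine
```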
Four benchmarks are used for training and evaluation. Please download the corresponding dataset(s) from the official websites and place or symlink them under `$ResKD_ROOT/data/` as laid out below. You do not need to download all of them at once; a symlink example follows the layout.
```
$ResKD_ROOT/data/
    actnet/
    sthv2/
    fcvid/
    kinetics400/
```
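For example, if a copy of ActivityNet already exists elsewhere on disk, it can be linked into place like this (`/path/to/actnet` is a placeholder):

```shell
# Symlink an existing dataset copy into the expected location.
# /path/to/actnet is a placeholder for wherever your data actually lives.
mkdir -p $ResKD_ROOT/data
ln -s /path/to/actnet $ResKD_ROOT/data/actnet
```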
- ActivityNet. After downloading the raw videos, extract frames using `tools/data/activitynet/video2img.py`. To reproduce the results in our paper, you need to extract frames in `png` format at a frame rate of 4 (see the ffmpeg sketch after this list for what the output looks like). The extracted frames take roughly 1.9 TB of space. If you do not have enough space, you may consider extracting frames in `jpg` format at the default frame rate, which sacrifices accuracy slightly.
- Mini-Kinetics and Kinetics-400. We use the Kinetics-400 version provided by the Common Visual Data Foundation. Remember to filter out corrupted videos before using the dataset. Mini-Kinetics is a subset of Kinetics-400; you can get the train/val split files from AR-Net.
- FCVID. Follow the same frame-extraction pipeline as ActivityNet.
- Something-Something V2.
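As an illustration of the target output of frame extraction, here is a plain-ffmpeg equivalent; this is not the repo's `tools/data/activitynet/video2img.py`, and the output directory layout and frame naming scheme below are assumptions:

```shell
# Extract png frames at 4 fps for a single video with plain ffmpeg.
# The output layout and img_%05d.png naming are assumptions, shown only
# to illustrate the target format and frame rate.
mkdir -p $ResKD_ROOT/data/actnet/frames/v_example
ffmpeg -i v_example.mp4 -vf fps=4 $ResKD_ROOT/data/actnet/frames/v_example/img_%05d.png
```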
You may need to modify the corresponding file paths in the config files after data preparation.
| Backbone | Dataset | Config | Model |
| --- | --- | --- | --- |
| TSN_Res50 | actnet | tsn_r50_1x1x16_50e_actnet_rgb.py | ckpt |
| TSM_Res50 | sthv2 | tsm_r50_1x1x8_50e_sthv2_rgb.py | ckpt |
| TSN_Res152 | actnet | tsn_r152_1x1x16_50e_actnet_rgb.py | ckpt |
| TSN_Res152 | minik | tsn_r152_1x1x8_50e_minik_rgb.py | ckpt |
| TSN_Res152 | fcvid | tsn_r152_1x1x16_50e_fcvid_rgb.py | ckpt |
| Slowonly_Res50 | k400 | slowonly_r50_8x8x1_150e_k400_rgb.py | ckpt |
| Swin_Base | k400 | swin_base_32x2x1_50e_k400.py | ckpt |
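For quick single-video inference with one of these checkpoints, something along the lines of the MMAction2 0.x high-level API should work; the config, checkpoint, and video paths below are placeholders:

```python
# Sketch of single-video inference via the MMAction2 0.x high-level API.
# Config/checkpoint/video paths are placeholders, not files shipped as-is.
from mmaction.apis import init_recognizer, inference_recognizer

config = 'configs/tsn_r50_1x1x16_50e_actnet_rgb.py'   # placeholder path
checkpoint = 'checkpoints/tsn_r50_actnet.pth'         # placeholder path
model = init_recognizer(config, checkpoint, device='cuda:0')
results = inference_recognizer(model, 'demo/example.mp4')
print(results)  # top (class_id, score) predictions
```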
Here we provide some examples of how to train and test a model. For more details on the training and evaluation scripts, please refer to the MMAction2 documentation.
- Evaluation on ActivityNet:

```shell
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} --eval mean_average_precision
```

- Evaluation on Mini-Kinetics, Something-Something V2, and FCVID:

```shell
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} --eval top_k_accuracy
```

- Training:

```shell
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} --validate --test-last
```
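For instance, to test and then train the TSN-R50 ActivityNet model from the table above on 8 GPUs (the config and checkpoint paths here are placeholders; substitute your actual locations):

```shell
# Placeholders: adjust config/checkpoint paths and GPU count to your setup.
./tools/dist_test.sh configs/tsn_r50_1x1x16_50e_actnet_rgb.py \
    checkpoints/tsn_r50_actnet.pth 8 --eval mean_average_precision
./tools/dist_train.sh configs/tsn_r50_1x1x16_50e_actnet_rgb.py 8 \
    --validate --test-last
```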
If you find this repo useful for your research, please consider citing our paper:
```bibtex
@inproceedings{ma2022rethinking,
  title={Rethinking Resolution in the Context of Efficient Video Recognition},
  author={Chuofan Ma and Qiushan Guo and Yi Jiang and Ping Luo and Zehuan Yuan and Xiaojuan Qi},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022},
}
```
Our codebase builds upon several publicly available projects. Specifically, we have modified and integrated the following repositories into this project:
- https://github.com/open-mmlab/mmaction2
- https://github.com/SwinTransformer/Video-Swin-Transformer
- https://github.com/blackfeather-wang/AdaFocus
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.