
Knowledge Distillation Meets Self-Supervision & Prime-Aware Adaptive Distillation #75

howardyclo opened this issue Feb 13, 2021 · 4 comments


howardyclo commented Feb 13, 2021

Metadata: Knowledge Distillation Meets Self-Supervision

  • Author: Guodong Xu, Ziwei Liu, Xiaoxiao Li, Chen Change Loy
  • Organization: The Chinese University of Hong Kong & Nanyang Technological University
  • Conference: ECCV 2020
  • Paper: https://arxiv.org/abs/2006.07114

Prior Approaches to Knowledge Distillation

Basically, knowledge distillation aims to obtain a smaller student model from a typically larger teacher model by matching information hidden in the teacher. The information can be final soft predictions, intermediate features, attention maps, or relations between samples. See this complete review.
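
For reference, a minimal PyTorch sketch of the most common form of this (matching temperature-softened predictions, Hinton-style KD); the temperature `T` and weight `alpha` are just illustrative hyper-parameters:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Cross-entropy on hard labels + KL divergence on temperature-softened predictions."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescaling as in the original KD formulation
    return (1 - alpha) * ce + alpha * kl
```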

Highlights


Conventional knowledge distillation (KD) mimics the teacher's single prediction on an image. SSKD instead mimics the teacher's predictions on self-supervised contrastive pairs.

  • Different from prior work that exploits architecture-specific cues (e.g., mimicking the teacher's features or attention maps), this paper proposes to mimic the teacher's predictions on self-supervised contrastive examples, i.e., the similarity scores of positive and negative pairs.
  • This general and model-agnostic approach achieves SoTA on CIFAR-100 and ImageNet. It also works well under several settings such as few-shot learning and noisy labels, and is especially strong in the cross-architecture setting thanks to its model-agnostic nature.
  • Another benefit is that the distilled knowledge is enriched by self-supervised predictions rather than only supervised, task-specific predictions, which would reflect just a single facet of the complete knowledge encapsulated in the teacher.
  • The main competitor is CRD (Contrastive Representation Distillation, Tian et al., ICLR 2020). The difference lies in how the contrastive task is performed: CRD maximizes mutual information between teacher and student representations, whereas SSKD adopts SimCLR.

Methods

  • Stage 1: Train the teacher network (backbone & classifier) with the supervised task loss on weakly augmented samples (previous work shows that strong augmentation harms supervised learning but benefits self-supervised learning).
  • Stage 2: Fix the teacher network (backbone & classifier) and train the "SS module" on the SimCLR contrastive task. The SS module is a 2-layer MLP projection head on top of the backbone features, and its output is essentially the pairwise similarity matrix of positive and negative pairs (see the sketch after this list).
  • Stage 3: Train the student network with (1) the supervised task loss on weakly augmented samples, (2) a knowledge distillation loss matching the teacher's outputs on weakly + strongly augmented samples, and (3) a knowledge distillation loss matching the teacher's SS module outputs.
  • Filtering the teacher's incorrect contrastive predictions (i.e., cases where a negative pair is incorrectly assigned a higher similarity score than the positive pair): they transfer only the correct predictions plus the top-k% ranked incorrect predictions within a batch. The best performance is achieved by keeping all correct plus the top-50%~75% of incorrect predictions. → Keeping some noisy predictions from the teacher and not treating all samples equally are both important, which leads us to the next paper, "Prime-Aware Adaptive Distillation".
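
A minimal PyTorch sketch of how I understand the SS module (Stage 2) and the filtered transfer (Stage 3, loss (3)). The projection head sizes, temperatures, and the exact ranking criterion for incorrect predictions are my assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSModule(nn.Module):
    """2-layer MLP projection head on top of backbone features (Stage 2)."""
    def __init__(self, feat_dim, proj_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, feats, feats_aug, T=0.07):
        # Similarity matrix between weakly augmented anchors and their strongly
        # augmented views, SimCLR-style; row i's positive is column i.
        z = F.normalize(self.proj(feats), dim=1)          # (B, proj_dim)
        z_aug = F.normalize(self.proj(feats_aug), dim=1)  # (B, proj_dim)
        return z_aug @ z.t() / T                          # (B, B)

def sskd_transfer_loss(student_sim, teacher_sim, keep_ratio=0.75, T=4.0):
    """Stage 3, loss (3): match the teacher's contrastive predictions,
    keeping correct ones plus the top-`keep_ratio` ranked incorrect ones."""
    B = teacher_sim.size(0)
    target = torch.arange(B, device=teacher_sim.device)
    correct = teacher_sim.argmax(dim=1) == target
    # Rank incorrect rows by the teacher's score on the true positive
    # (one plausible ranking; the paper ranks errors within the batch).
    pos_score = teacher_sim[target, target]
    incorrect_idx = (~correct).nonzero(as_tuple=True)[0]
    n_keep = int(keep_ratio * incorrect_idx.numel())
    kept = incorrect_idx[pos_score[incorrect_idx].argsort(descending=True)[:n_keep]]
    keep = torch.cat([correct.nonzero(as_tuple=True)[0], kept])

    return F.kl_div(
        F.log_softmax(student_sim[keep] / T, dim=1),
        F.softmax(teacher_sim[keep] / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
```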


Metadata: Prime-Aware Adaptive Distillation

  • Author: Youcai Zhang, Zhonghao Lan, +4, Yichen Wei
  • Organization: Megvii Inc. & University of Science and Technology of China & Tongji University
  • Conference: ECCV 2020
  • Paper: https://arxiv.org/pdf/2008.01458.pdf


Highlights

  • This paper incorporates data uncertainty to re-weight samples for distillation.
  • They conjecture that previous hard-mining strategies (i.e., focusing more on hard samples than on easier ones) could harm student performance, since the student's limited capacity makes it less capable of learning these hard samples. They therefore propose that sample weights should be biased toward easier samples. This work is highly inspired by Prime Sample Attention in object detection (Cao et al., CVPR 2020).
  • It also achieves SoTA on various datasets.

I haven't verified which one in this blogpost is the real SoTA… Let's just appreciate their ideas LOL.

Methods

  • They verify the idea with a simple baseline: simply discarding hard samples (those with a large distillation distance d[.]) from the distillation loss already improves results.
  • Based on this intuition, two slightly better baselines are proposed: using a softmax or polynomial weighting function to re-weight samples within a batch.
  • However, these weighting functions are sensitive to their hyper-parameters and the results are not satisfactory; few of the settings are even better than no re-weighting.
  • Instead, they propose to use data uncertainty var to re-weight samples by 1/var in the distillation loss (i.e., the PAD loss, prime-aware adaptive distillation). The data uncertainty var is predicted by an auxiliary branch (see the sketch after this list).
  • Results: the two proposed re-weighting baselines are not really effective, but the uncertainty-based re-weighting of PAD significantly outperforms the baselines.
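
A rough PyTorch sketch of the three weighting schemes discussed above (softmax, polynomial, and uncertainty-based PAD-style weighting). The exact formulas and hyper-parameters are paraphrased from the description here, so treat them as assumptions rather than the paper's implementation:

```python
import torch.nn.functional as F

def distill_distance(f_s, f_t):
    """Per-sample distillation distance d[.] between student and teacher features."""
    return ((f_s - f_t) ** 2).mean(dim=1)  # (B,)

def softmax_weights(d, T=1.0):
    """Softmax weighting within a batch: easier samples (small d) get larger weights."""
    return F.softmax(-d / T, dim=0) * d.numel()  # scale so the mean weight is ~1

def polynomial_weights(d, gamma=1.0):
    """Polynomial weighting: weight decays polynomially with the normalized distance."""
    w = (1.0 - d / (d.max() + 1e-8)) ** gamma
    return w / (w.mean() + 1e-8)

def pad_loss(f_s, f_t, log_var):
    """Uncertainty-weighted (PAD-style) distillation: each sample/dimension is
    down-weighted by its predicted data uncertainty, plus a log-variance term
    that keeps the network from predicting infinite uncertainty everywhere."""
    var = log_var.exp()  # predicted by an auxiliary branch of the student
    return (((f_s - f_t) ** 2) / (2 * var) + 0.5 * log_var).mean()
```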

Findings

When learning the uncertainty var, they found that var becomes small at the beginning of PAD training and then stays stable. They therefore added an additional "warm-up" experiment to the above table and found that it performs slightly better than the baselines but worse than the weights learned by PAD. Finally, combining PAD with a warm-up training schedule achieves even better results.
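
A hypothetical training-step fragment illustrating the warm-up + PAD schedule; the epoch threshold is made up for illustration:

```python
import torch

WARMUP_EPOCHS = 30  # illustrative value, not taken from the paper

def distill_step(epoch: int, f_s: torch.Tensor, f_t: torch.Tensor, log_var: torch.Tensor):
    d = (f_s - f_t) ** 2
    if epoch < WARMUP_EPOCHS:
        return d.mean()                               # uniform weighting during warm-up
    var = log_var.exp()                               # predicted data uncertainty
    return (d / (2 * var) + 0.5 * log_var).mean()     # PAD-style weighting afterwards
```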

