
Knowledge Distillation Meets Self-Supervision & Prime-Aware Adaptive Distillation #75

howardyclo opened this issue Feb 13, 2021 · 4 comments


howardyclo commented Feb 13, 2021

Metadata: Knowledge Distillation Meets Self-Supervision

  • Author: Guodong Xu, Ziwei Liu, Xiaoxiao Li, Chen Change Loy
  • Organization: The Chinese University of Hong Kong & Nanyang Technological University
  • Conference: ECCV 2020
  • Paper: https://arxiv.org/abs/2006.07114

Prior Approaches to Knowledge Distillation

Basically, knowledge distillation aims to obtain a smaller student model from a typically larger teacher model by matching information hidden in the teacher. The information can be final soft predictions, intermediate features, attention maps, or relations between samples. See this complete review.
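
For reference, a minimal PyTorch sketch of the most common form of this (matching temperature-softened predictions, Hinton-style KD); the temperature `T` and weight `alpha` are just illustrative hyper-parameters:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Cross-entropy on hard labels + KL divergence on temperature-softened predictions."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 rescaling as in the original KD formulation
    return (1 - alpha) * ce + alpha * kl
```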

Highlights


Conventional knowledge distillation (KD) mimics the teacher's single prediction on an image. SSKD instead mimics the teacher's predictions on self-supervised contrastive pairs.

  • Different from prior work that exploits architecture-specific cues (e.g., mimicking the teacher's features or attention maps), this paper proposes to mimic the teacher's predictions on self-supervised contrastive examples, i.e., the similarity scores of positive and negative pairs.
  • This general and model-agnostic approach achieves SoTA on CIFAR-100 and ImageNet. It also works well under several settings such as few-shot learning and noisy labels, and is especially strong in the cross-architecture setting thanks to its model-agnostic nature.
  • Another benefit is that the distilled knowledge is enriched by self-supervised predictions rather than only supervised, task-specific predictions, which would reflect just a single facet of the complete knowledge encapsulated in the teacher.
  • The main competitor is CRD (Contrastive Representation Distillation, Tian et al., ICLR 2020). The difference lies in how the contrastive task is performed: CRD maximizes mutual information between teacher and student representations, whereas SSKD adopts SimCLR.

Methods

  • Stage 1: Train the teacher network (backbone & classifier) with the supervised task loss on weakly augmented samples (previous work shows that strong augmentation harms supervised learning but benefits self-supervised learning).
  • Stage 2: Fix the teacher network (backbone & classifier) and train the "SS module" on the SimCLR contrastive task. The SS module is a 2-layer MLP projection head on top of the backbone features, and its output is essentially the pairwise similarity matrix of positive and negative pairs (see the sketch after this list).
  • Stage 3: Train the student network with (1) the supervised task loss on weakly augmented samples, (2) a knowledge distillation loss matching the teacher's outputs on weakly + strongly augmented samples, and (3) a knowledge distillation loss matching the teacher's SS module outputs.
  • Filtering the teacher's incorrect contrastive predictions (i.e., cases where a negative pair is incorrectly assigned a higher similarity score than the positive pair): they transfer only the correct predictions plus the top-k% ranked incorrect predictions within a batch. The best performance is achieved by keeping all correct plus the top-50%~75% of incorrect predictions. → Keeping some noisy predictions from the teacher and not treating all samples equally are both important, which leads us to the next paper, "Prime-Aware Adaptive Distillation".
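
A minimal PyTorch sketch of how I understand the SS module (Stage 2) and the filtered transfer (Stage 3, loss (3)). The projection head sizes, temperatures, and the exact ranking criterion for incorrect predictions are my assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSModule(nn.Module):
    """2-layer MLP projection head on top of backbone features (Stage 2)."""
    def __init__(self, feat_dim, proj_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, feats, feats_aug, T=0.07):
        # Similarity matrix between weakly augmented anchors and their strongly
        # augmented views, SimCLR-style; row i's positive is column i.
        z = F.normalize(self.proj(feats), dim=1)          # (B, proj_dim)
        z_aug = F.normalize(self.proj(feats_aug), dim=1)  # (B, proj_dim)
        return z_aug @ z.t() / T                          # (B, B)

def sskd_transfer_loss(student_sim, teacher_sim, keep_ratio=0.75, T=4.0):
    """Stage 3, loss (3): match the teacher's contrastive predictions,
    keeping correct ones plus the top-`keep_ratio` ranked incorrect ones."""
    B = teacher_sim.size(0)
    target = torch.arange(B, device=teacher_sim.device)
    correct = teacher_sim.argmax(dim=1) == target
    # Rank incorrect rows by the teacher's score on the true positive
    # (one plausible ranking; the paper ranks errors within the batch).
    pos_score = teacher_sim[target, target]
    incorrect_idx = (~correct).nonzero(as_tuple=True)[0]
    n_keep = int(keep_ratio * incorrect_idx.numel())
    kept = incorrect_idx[pos_score[incorrect_idx].argsort(descending=True)[:n_keep]]
    keep = torch.cat([correct.nonzero(as_tuple=True)[0], kept])

    return F.kl_div(
        F.log_softmax(student_sim[keep] / T, dim=1),
        F.softmax(teacher_sim[keep] / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
```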


Metadata: Prime-Aware Adaptive Distillation

  • Author: Youcai Zhang, Zhonghao Lan, +4, Yichen Wei
  • Organization: Megvii Inc. & University of Science and Technology of China & Tongji University
  • Conference: ECCV 2020
  • Paper: https://arxiv.org/pdf/2008.01458.pdf


Highlights

  • This paper incorporates data uncertainty to re-weight samples for distillation.
  • They conjecture that previous hard-mining strategies (i.e., focusing more on hard samples than on easier ones) could harm student performance, since the student's limited capacity makes it less capable of learning these hard samples. They therefore propose that sample weights should be biased toward easier samples. This work is highly inspired by Prime Sample Attention in object detection (Cao et al., CVPR 2020).
  • It also achieves SoTA on various datasets.

I haven't verified which one in this blogpost is the real SoTA… Let's just appreciate their ideas LOL.

Methods

  • They verify the idea with a simple baseline: simply discarding hard samples (those with a large distillation distance d[.]) from the distillation loss already improves results.
  • Based on this intuition, two slightly better baselines are proposed: using a softmax or polynomial weighting function to re-weight samples within a batch.
  • However, these weighting functions are sensitive to their hyper-parameters and the results are not satisfactory; few of the settings are even better than no re-weighting.
  • Instead, they propose to use data uncertainty var to re-weight samples by 1/var in the distillation loss (i.e., the PAD loss, prime-aware adaptive distillation). The data uncertainty var is predicted by an auxiliary branch (see the sketch after this list).
  • Results: the two proposed re-weighting baselines are not really effective, but the uncertainty-based re-weighting of PAD significantly outperforms the baselines.
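
A rough PyTorch sketch of the three weighting schemes discussed above (softmax, polynomial, and uncertainty-based PAD-style weighting). The exact formulas and hyper-parameters are paraphrased from the description here, so treat them as assumptions rather than the paper's implementation:

```python
import torch.nn.functional as F

def distill_distance(f_s, f_t):
    """Per-sample distillation distance d[.] between student and teacher features."""
    return ((f_s - f_t) ** 2).mean(dim=1)  # (B,)

def softmax_weights(d, T=1.0):
    """Softmax weighting within a batch: easier samples (small d) get larger weights."""
    return F.softmax(-d / T, dim=0) * d.numel()  # scale so the mean weight is ~1

def polynomial_weights(d, gamma=1.0):
    """Polynomial weighting: weight decays polynomially with the normalized distance."""
    w = (1.0 - d / (d.max() + 1e-8)) ** gamma
    return w / (w.mean() + 1e-8)

def pad_loss(f_s, f_t, log_var):
    """Uncertainty-weighted (PAD-style) distillation: each sample/dimension is
    down-weighted by its predicted data uncertainty, plus a log-variance term
    that keeps the network from predicting infinite uncertainty everywhere."""
    var = log_var.exp()  # predicted by an auxiliary branch of the student
    return (((f_s - f_t) ** 2) / (2 * var) + 0.5 * log_var).mean()
```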

Findings

When learning the uncertainty var, they found that var becomes small at the beginning of PAD training and then stays stable. They therefore added an additional "warm-up" experiment to the above table and found that it performs slightly better than the baselines but worse than the weights learned by PAD. Finally, combining PAD with a warm-up training schedule achieves even better results.
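
A hypothetical training-step fragment illustrating the warm-up + PAD schedule; the epoch threshold is made up for illustration:

```python
import torch

WARMUP_EPOCHS = 30  # illustrative value, not taken from the paper

def distill_step(epoch: int, f_s: torch.Tensor, f_t: torch.Tensor, log_var: torch.Tensor):
    d = (f_s - f_t) ** 2
    if epoch < WARMUP_EPOCHS:
        return d.mean()                               # uniform weighting during warm-up
    var = log_var.exp()                               # predicted data uncertainty
    return (d / (2 * var) + 0.5 * log_var).mean()     # PAD-style weighting afterwards
```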

