Supervised data augmentation: current data augmentation methods for labeled data give a steady but limited performance boost, because the labeled set is usually small.
Unsupervised data augmentation (UDA):
Apply data augmentation to unlabeled data, since unlabeled data is usually far more plentiful.
Consistency loss: minimize the KL divergence between the predicted distributions on an unlabeled example and an augmented version of it (see the loss sketch below).
Consistency/smoothness enforcing: UDA smooths the input/hidden space so that the model becomes more robust.
Total loss: Supervised loss + Consistency loss
Allows label information to propagate from labeled data to unlabeled data.
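A minimal sketch of how the two losses combine, assuming PyTorch; `model`, `augment`, and the weight `lam` are hypothetical placeholders, not the paper's released implementation:

```python
import torch
import torch.nn.functional as F

def uda_loss(model, x_labeled, y_labeled, x_unlabeled, augment, lam=1.0):
    # Supervised cross-entropy on the small labeled batch.
    sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

    # Consistency loss: KL divergence between the prediction on an unlabeled
    # example and the prediction on its augmented version. The prediction on
    # the original example is treated as a fixed target (no gradient).
    with torch.no_grad():
        p_orig = F.softmax(model(x_unlabeled), dim=-1)
    log_p_aug = F.log_softmax(model(augment(x_unlabeled)), dim=-1)
    consistency_loss = F.kl_div(log_p_aug, p_orig, reduction="batchmean")

    # Total loss = supervised loss + weighted consistency loss.
    return sup_loss + lam * consistency_loss
```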
Training Techniques
Propose Training Signal Annealing (TSA) to prevent overfitting on the small labeled set: gradually release the supervised loss signal during training with log/linear/exp schedules (exp is recommended when labeled data is very limited); see the TSA sketch after this list.
Using targeted data augmentation (e.g., AutoAugment) gives a significant improvement over untargeted data augmentations.
Diverse and valid augmentations that inject targeted inductive biases are key, but there are trade-offs when generating text, e.g., diverse text may not be a valid sentence.
Propose (1) confidence-based masking, (2) entropy minimization, and (3) softmax temperature controlling to sharpen the predictions on unlabeled data (preventing them from becoming over-flat, which would make the consistency loss useless). (1)+(3) is the most effective combination; see the sharpening sketch after this list.
Propose Domain-relevance Data Filtering to address the class-distribution mismatch of out-of-domain unlabeled data: train an in-domain baseline model, run it on the unlabeled data, and keep the examples it is most confident about, distributed equally across classes (see the filtering sketch after this list).
Open question: how to apply it to regression problems?
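A minimal sketch of the TSA threshold and masked supervised loss, assuming PyTorch and K-way classification; the log/linear/exp schedule formulas follow the paper, but the function names and the schedule constant are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def tsa_threshold(step, total_steps, num_classes, schedule="exp"):
    t = step / total_steps
    if schedule == "linear":
        alpha = t
    elif schedule == "exp":          # releases the signal late; suited to very small labeled sets
        alpha = math.exp((t - 1) * 5)
    else:                            # "log": releases the signal early
        alpha = 1 - math.exp(-t * 5)
    return alpha * (1 - 1 / num_classes) + 1 / num_classes

def tsa_supervised_loss(logits, labels, step, total_steps):
    # Mask out labeled examples the model already predicts with probability
    # above the current threshold, so it cannot overfit the small labeled set.
    probs = F.softmax(logits, dim=-1)
    correct_prob = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    threshold = tsa_threshold(step, total_steps, logits.size(-1))
    mask = (correct_prob < threshold).float()
    loss = F.cross_entropy(logits, labels, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```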
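A minimal sketch of confidence-based masking plus softmax temperature sharpening applied to the consistency loss, assuming PyTorch; the temperature and threshold values are illustrative, not the paper's tuned settings:

```python
import torch
import torch.nn.functional as F

def sharpened_consistency_loss(logits_orig, logits_aug, temperature=0.4, conf_threshold=0.8):
    # Sharpen the prediction on the original unlabeled example with a low
    # softmax temperature so it does not stay over-flat, and stop gradients
    # through it so it acts as a fixed target.
    with torch.no_grad():
        target = F.softmax(logits_orig / temperature, dim=-1)
        # Confidence-based masking: only keep examples whose (unsharpened)
        # highest predicted probability exceeds the threshold.
        mask = (F.softmax(logits_orig, dim=-1).max(dim=-1).values > conf_threshold).float()

    log_p_aug = F.log_softmax(logits_aug, dim=-1)
    per_example_kl = F.kl_div(log_p_aug, target, reduction="none").sum(dim=-1)
    return (per_example_kl * mask).sum() / mask.sum().clamp(min=1.0)
```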
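A minimal sketch of domain-relevance data filtering, assuming the in-domain baseline model's predicted class probabilities on the unlabeled set are already collected in a NumPy array; `per_class_budget` is a hypothetical parameter:

```python
import numpy as np

def filter_by_domain_relevance(probs, per_class_budget):
    """probs: (N, K) array of in-domain model predictions on out-of-domain unlabeled data."""
    confidences = probs.max(axis=1)      # confidence of the predicted class
    predictions = probs.argmax(axis=1)   # predicted class per example
    kept = []
    for k in range(probs.shape[1]):
        idx = np.where(predictions == k)[0]
        # Keep the most confident examples of each class so the retained
        # set is roughly balanced across classes.
        top = idx[np.argsort(-confidences[idx])][:per_class_budget]
        kept.extend(top.tolist())
    return sorted(kept)
```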
Results
2.7% error rate (w/ 4,000 labeled examples) on CIFAR-10, nearly matching full-dataset performance.
2.85% error rate (w/ 250 labeled examples) on SVHN, nearly matching full-dataset performance.
4.2% error rate (w/ 20 labeled examples) on IMDb text classification, outperforming the SoTA model trained w/ 25,000 labeled examples.
Improves ImageNet top-1/top-5 accuracy from 55.1%/77.3% to 68.7%/88.5% (w/ 10% of the labeled data).
Improves ImageNet top-1/top-5 accuracy from 78.3%/94.4% to 79.0%/94.5% (w/ the full labeled set + 1.3M extra unlabeled examples).
Notable Related Work
mixup: Beyond Empirical Risk Minimization by MIT & FAIR (ICLR 2018): instead of augmenting from a single data point, mixup interpolates pairs of examples (and their labels) to create augmented data.
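A minimal sketch of mixup, assuming PyTorch tensors and one-hot/soft labels; `alpha` is the Beta-distribution hyperparameter from the mixup paper:

```python
import torch

def mixup(x, y_onehot, alpha=0.2):
    # Sample the interpolation coefficient from Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample()
    # Pair each example with a randomly permuted partner and interpolate
    # both the inputs and the (one-hot/soft) labels.
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```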
Metadata