TL;DR
This paper reviews the past, present, and future of normalization methods for DNN training, and aims to answer the following questions:
What are the main motivations behind different normalization methods in DNNs, and how can we present a taxonomy for understanding the similarities and differences between a wide variety of approaches?
How can we reduce the gap between the empirical success of normalization techniques and our theoretical understanding of them?
What recent advances have been made in designing/tailoring normalization techniques for different tasks, and what are the main insights behind them?
Introduction
Normalization techniques typically serve as a "layer" between the learnable weights and the activations in DNN architectures.
More importantly, they have advanced deep learning research and become an essential module in DNN architectures for various applications, e.g., Layer Normalization (LN) for Transformers in NLP, and Spectral Normalization (SN) for the discriminator of GANs in generative modeling.
Question 1
Five normalization operations considered
Centering: Makes input zero mean.
Scaling: Makes input unit variance.
Decorrelating: Makes the dimensions of the input uncorrelated (i.e., zeroes the off-diagonal entries of the covariance matrix).
Standardization: Composition of centering and scaling.
Whitening: Makes the input a spherical distribution with identity covariance; the composition of standardization and decorrelating, also called PCA whitening. (A minimal sketch of these five operations follows this list.)
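A minimal NumPy sketch of these five operations on a 2-D data matrix (rows = samples, columns = dimensions); the function names and the eps stabilizer are my own illustrative choices, not the paper's notation:

```python
import numpy as np

def centering(x):
    # Subtract the per-dimension mean so every dimension has zero mean.
    return x - x.mean(axis=0, keepdims=True)

def scaling(x, eps=1e-5):
    # Divide by the per-dimension standard deviation so every dimension has unit variance.
    return x / (x.std(axis=0, keepdims=True) + eps)

def standardization(x, eps=1e-5):
    # Composition of centering and scaling.
    return scaling(centering(x), eps)

def decorrelating(x):
    # Rotate the centered data into the eigenbasis of its covariance matrix,
    # which zeroes the off-diagonal entries of the covariance.
    xc = centering(x)
    cov = xc.T @ xc / xc.shape[0]
    _, eigvecs = np.linalg.eigh(cov)
    return xc @ eigvecs

def pca_whitening(x, eps=1e-5):
    # Decorrelate, then rescale each eigen-direction to unit variance:
    # the result has (approximately) identity covariance.
    xc = centering(x)
    cov = xc.T @ xc / xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return xc @ eigvecs / np.sqrt(eigvals + eps)

x = np.random.randn(1024, 8) @ np.random.randn(8, 8)        # correlated toy data
print(np.round(np.cov(pca_whitening(x), rowvar=False), 2))  # ~ identity matrix
```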
Motivation for normalization
Convergence is proven to be related to the statistics of the input to a linear model; e.g., if the Hessian of the loss of a linear model (which, for least squares, is the input covariance matrix) is the identity matrix, the model converges within one iteration of full-batch gradient descent (GD); see the toy sketch after the list below. Several kinds of normalization are discussed:
Normalizing the activations (non-learnable or learnable)
Normalizing the weights to have a constrained distribution, so that the activations (and their gradients) are implicitly normalized. These methods are inspired by weight normalization and are extended to maintain the desired properties throughout training.
Normalizing the gradients to exploit the curvature information for GD/SGD.
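A toy sketch of the convergence claim above, assuming a linear least-squares model: after PCA-whitening the inputs, the Hessian of the loss (the input covariance) is the identity, so one full-batch GD step with unit step size lands on the optimum. The variable names and problem sizes are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 16

# PCA-whiten the inputs so the empirical covariance (and hence the Hessian) is the identity.
x = rng.standard_normal((n, d))
x -= x.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(x.T @ x / n)
x = x @ eigvecs / np.sqrt(eigvals)        # now x.T @ x / n == I up to float error

w_true = rng.standard_normal(d)
y = x @ w_true

# One full-batch GD step on the loss 0.5 * mean((x @ w - y)**2), whose Hessian is x.T @ x / n.
w = np.zeros(d)
grad = x.T @ (x @ w - y) / n
w = w - 1.0 * grad                        # step size 1 is optimal when the Hessian is I

print(np.linalg.norm(w - w_true))         # ~1e-14: the optimum is reached in one step
```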
Normalization framework Π -> Φ -> Ψ
Take batch normalization (BN; ICML'15) as an example. For a given channel-first input batch X with shape (c, b, h, w), the three steps are as follows (a minimal sketch comes after them):
Normalization area partitioning (Π): reshape (c, b, h, w) -> (c, b*h*w).
Normalization operation (Φ): standardization along the last dimension of (c, b*h*w).
Normalization representation recovery (Ψ): channel-wise affine transformation of the normalized X with learnable parameters.
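A minimal NumPy sketch of BN decomposed into the three steps above; gamma and beta are the learnable NRR parameters, and the function signature is an illustrative choice rather than the paper's or any library's API:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x has shape (c, b, h, w); gamma and beta have shape (c,)."""
    c = x.shape[0]
    # Pi  (normalization area partitioning): (c, b, h, w) -> (c, b*h*w)
    x_flat = x.reshape(c, -1)
    # Phi (normalization operation): standardize along the last dimension
    mean = x_flat.mean(axis=1, keepdims=True)
    var = x_flat.var(axis=1, keepdims=True)
    x_hat = (x_flat - mean) / np.sqrt(var + eps)
    # Psi (normalization representation recovery): learnable channel-wise affine transform
    y = gamma[:, None] * x_hat + beta[:, None]
    return y.reshape(x.shape)

x = np.random.randn(3, 4, 8, 8)                # (c, b, h, w)
y = batch_norm(x, np.ones(3), np.zeros(3))
print(y.reshape(3, -1).mean(axis=1))           # ~0 per channel
print(y.reshape(3, -1).std(axis=1))            # ~1 per channel
```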
Several weaknesses of BN
The inconsistency between training and inference limits its use in complex networks, such as RNNs and GANs.
It suffers in small-batch settings (e.g., object detection and segmentation).
To address these weaknesses of BN, several normalization methods have been proposed; they are discussed next under this framework.
Normalization area partitioning
Layer normalization (LN; arXiv'16): (c, b, h, w) -> (b, c*h*w). Widely used in NLP.
Group normalization (GN; ECCV'18): (c, b, h, w) -> (b*g, s*h*w), where g is the number of groups along the channel dimension and s = c/g is the group size. When g=1, GN reduces to LN. Widely used in object detection and segmentation.
Instance normalization (IN; arXiv'16): (c, b, h, w) -> (b*c, h*w). Widely used in image style transfer.
Position normalization (PN; NeurIPS'19): (c, b, h, w) -> (b*h*w, c). Designed to deal with spatial information and has the potential to enhance generative models.
Batch-group normalization (BGN; ICLR'20): (c, b, h, w) -> (g_b*g_c, s_b*s_c*h*w), where s_b = b/g_b and s_c = c/g_c. Extends GN by also grouping the batch dimension. (A reshape-based sketch of these partitionings follows this list.)
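A reshape-based sketch of how the partitionings above share the same standardization step and differ only in how the (c, b, h, w) tensor is flattened; BGN is omitted because its batch/channel grouping needs an extra axis permutation beyond a plain reshape. The normalize helper below is hypothetical, not from the paper:

```python
import numpy as np

def normalize(x, partitioned_shape, eps=1e-5):
    # Reshape to the partitioned shape, standardize along the last axis,
    # then restore the shape of the array that was passed in.
    flat = x.reshape(partitioned_shape)
    mean = flat.mean(axis=-1, keepdims=True)
    std = flat.std(axis=-1, keepdims=True)
    return ((flat - mean) / (std + eps)).reshape(x.shape)

c, b, h, w, g = 8, 4, 16, 16, 2
x = np.random.randn(c, b, h, w)
x_bchw = np.moveaxis(x, 1, 0)                                  # (b, c, h, w) view for LN/GN/IN

bn = normalize(x, (c, b * h * w))                              # batch norm
ln = normalize(x_bchw, (b, c * h * w))                         # layer norm
gn = normalize(x_bchw, (b * g, (c // g) * h * w))              # group norm (g groups)
inorm = normalize(x_bchw, (b * c, h * w))                      # instance norm
pn = normalize(np.transpose(x, (1, 2, 3, 0)), (b * h * w, c))  # position norm
```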
TBD