TL;DR
This paper reviews the past, present, and future of normalization methods for DNN training, and aims to answer the following questions:
What are the main motivations behind different normalization methods in DNNs, and how can we present a taxonomy for understanding the similarities and differences between a wide variety of approaches?
How can we reduce the gap between the empirical success of normalization techniques and our theoretical understanding of them?
What recent advances have been made in designing/tailoring normalization techniques for different tasks, and what are the main insights behind them?
Introduction
Normalization techniques typically serve as a "layer" between the learnable weights and the activations in DNN architectures.
More importantly, they have advanced deep learning research and become an essential module in DNN architectures for various applications, e.g., Layer Normalization (LN) for Transformers in NLP, and Spectral Normalization (SN) for the discriminator of GANs in generative modeling.
Question 1
Five normalization operations considered
Centering: Makes input zero mean.
Scaling: Makes input unit variance.
Decorrelating: Makes the dimensions of the input uncorrelated (i.e., zeroes the off-diagonal entries of the covariance matrix).
Standardization: Composition of centering and scaling.
Whitening: Makes the input a spherical distribution with identity covariance; the composition of standardization and decorrelating, also called PCA whitening. (A minimal sketch of these five operations follows this list.)
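A minimal NumPy sketch of these five operations on a 2-D data matrix (rows = samples, columns = dimensions); the function names and the eps stabilizer are my own illustrative choices, not the paper's notation:

```python
import numpy as np

def centering(x):
    # Subtract the per-dimension mean so every dimension has zero mean.
    return x - x.mean(axis=0, keepdims=True)

def scaling(x, eps=1e-5):
    # Divide by the per-dimension standard deviation so every dimension has unit variance.
    return x / (x.std(axis=0, keepdims=True) + eps)

def standardization(x, eps=1e-5):
    # Composition of centering and scaling.
    return scaling(centering(x), eps)

def decorrelating(x):
    # Rotate the centered data into the eigenbasis of its covariance matrix,
    # which zeroes the off-diagonal entries of the covariance.
    xc = centering(x)
    cov = xc.T @ xc / xc.shape[0]
    _, eigvecs = np.linalg.eigh(cov)
    return xc @ eigvecs

def pca_whitening(x, eps=1e-5):
    # Decorrelate, then rescale each eigen-direction to unit variance:
    # the result has (approximately) identity covariance.
    xc = centering(x)
    cov = xc.T @ xc / xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    return xc @ eigvecs / np.sqrt(eigvals + eps)

x = np.random.randn(1024, 8) @ np.random.randn(8, 8)        # correlated toy data
print(np.round(np.cov(pca_whitening(x), rowvar=False), 2))  # ~ identity matrix
```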
Motivation for normalization
Convergence is proven to be related to the statistics of the input to a linear model; e.g., if the Hessian of the loss of a linear model (which, for least squares, is the input covariance matrix) is the identity matrix, the model converges within one iteration of full-batch gradient descent (GD); see the toy sketch after the list below. Several kinds of normalization are discussed:
Normalizing the activations (non-learnable or learnable)
Normalizing the weights to have a constrained distribution, so that the activations (and their gradients) are implicitly normalized. These methods are inspired by weight normalization and are extended to maintain the desired properties throughout training.
Normalizing the gradients to exploit the curvature information for GD/SGD.
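A toy sketch of the convergence claim above, assuming a linear least-squares model: after PCA-whitening the inputs, the Hessian of the loss (the input covariance) is the identity, so one full-batch GD step with unit step size lands on the optimum. The variable names and problem sizes are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 16

# PCA-whiten the inputs so the empirical covariance (and hence the Hessian) is the identity.
x = rng.standard_normal((n, d))
x -= x.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(x.T @ x / n)
x = x @ eigvecs / np.sqrt(eigvals)        # now x.T @ x / n == I up to float error

w_true = rng.standard_normal(d)
y = x @ w_true

# One full-batch GD step on the loss 0.5 * mean((x @ w - y)**2), whose Hessian is x.T @ x / n.
w = np.zeros(d)
grad = x.T @ (x @ w - y) / n
w = w - 1.0 * grad                        # step size 1 is optimal when the Hessian is I

print(np.linalg.norm(w - w_true))         # ~1e-14: the optimum is reached in one step
```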
Normalization framework Π -> Φ -> Ψ
Take batch normalization (BN; ICML'15) as an example. For a given channel-first input batch X with shape (c, b, h, w), the three steps are as follows (a minimal sketch comes after them):
Normalization area partitioning (Π): reshape (c, b, h, w) -> (c, b*h*w).
Normalization operation (Φ): standardization along the last dimension of (c, b*h*w).
Normalization representation recovery (Ψ): channel-wise affine transformation of the normalized X with learnable parameters.
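A minimal NumPy sketch of BN decomposed into the three steps above; gamma and beta are the learnable NRR parameters, and the function signature is an illustrative choice rather than the paper's or any library's API:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x has shape (c, b, h, w); gamma and beta have shape (c,)."""
    c = x.shape[0]
    # Pi  (normalization area partitioning): (c, b, h, w) -> (c, b*h*w)
    x_flat = x.reshape(c, -1)
    # Phi (normalization operation): standardize along the last dimension
    mean = x_flat.mean(axis=1, keepdims=True)
    var = x_flat.var(axis=1, keepdims=True)
    x_hat = (x_flat - mean) / np.sqrt(var + eps)
    # Psi (normalization representation recovery): learnable channel-wise affine transform
    y = gamma[:, None] * x_hat + beta[:, None]
    return y.reshape(x.shape)

x = np.random.randn(3, 4, 8, 8)                # (c, b, h, w)
y = batch_norm(x, np.ones(3), np.zeros(3))
print(y.reshape(3, -1).mean(axis=1))           # ~0 per channel
print(y.reshape(3, -1).std(axis=1))            # ~1 per channel
```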
Several weaknesses of BN
The inconsistency between training and inference limits its use in complex networks, such as RNNs and GANs.
It suffers in small-batch settings (e.g., object detection and segmentation).
To address these weaknesses of BN, several normalization methods have been proposed; they are discussed next under this framework.
Normalization area partitioning
Layer normalization (LN; arXiv'16): (c, b, h, w) -> (b, c*h*w). Widely used in NLP.
Group normalization (GN; ECCV'18): (c, b, h, w) -> (b*g, s*h*w), where g is the number of groups along the channel dimension and s = c/g is the group size. When g=1, GN reduces to LN. Widely used in object detection and segmentation.
Instance normalization (IN; arXiv'16): (c, b, h, w) -> (b*c, h*w). Widely used in image style transfer.
Position normalization (PN; NeurIPS'19): (c, b, h, w) -> (b*h*w, c). Designed to deal with spatial information and has the potential to enhance generative models.
Batch-group normalization (BGN; ICLR'20): (c, b, h, w) -> (g_b*g_c, s_b*s_c*h*w), where s_b = b/g_b and s_c = c/g_c. Extends GN by also grouping the batch dimension. (A reshape-based sketch of these partitionings follows this list.)
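A reshape-based sketch of how the partitionings above share the same standardization step and differ only in how the (c, b, h, w) tensor is flattened; BGN is omitted because its batch/channel grouping needs an extra axis permutation beyond a plain reshape. The normalize helper below is hypothetical, not from the paper:

```python
import numpy as np

def normalize(x, partitioned_shape, eps=1e-5):
    # Reshape to the partitioned shape, standardize along the last axis,
    # then restore the shape of the array that was passed in.
    flat = x.reshape(partitioned_shape)
    mean = flat.mean(axis=-1, keepdims=True)
    std = flat.std(axis=-1, keepdims=True)
    return ((flat - mean) / (std + eps)).reshape(x.shape)

c, b, h, w, g = 8, 4, 16, 16, 2
x = np.random.randn(c, b, h, w)
x_bchw = np.moveaxis(x, 1, 0)                                  # (b, c, h, w) view for LN/GN/IN

bn = normalize(x, (c, b * h * w))                              # batch norm
ln = normalize(x_bchw, (b, c * h * w))                         # layer norm
gn = normalize(x_bchw, (b * g, (c // g) * h * w))              # group norm (g groups)
inorm = normalize(x_bchw, (b * c, h * w))                      # instance norm
pn = normalize(np.transpose(x, (1, 2, 3, 0)), (b * h * w, c))  # position norm
```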
TBD