This is a quick evaluation of the performance of the BatchNorm layer (BVLC/caffe#3229) on ImageNet-2012.

Other ongoing evaluations:
- activations
- [architectures](https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Architectures.md)
- [augmentation](https://github.com/ducha-aiki/caffenet-benchmark/blob/master/Augmentation.md)

The architecture is similar to CaffeNet, with the following differences:
- Images are resized to small side = 128 for speed reasons.
- fc6 and fc7 layers have 2048 neurons instead of 4096.
- Networks are initialized with LSUV-init.

Because LRN layers add nothing to accuracy, they were removed for speed reasons in further experiments.
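As a rough illustration of the "small side = 128" preprocessing, the target image size can be computed as below. This is a minimal sketch that assumes the aspect ratio is preserved; the helper name `resize_to_small_side` is hypothetical and is not part of the benchmark scripts.

```python
def resize_to_small_side(width, height, small_side=128):
    """Return (new_width, new_height) so that the shorter side equals small_side.

    Assumption: the aspect ratio is kept and the longer side is rounded
    to the nearest pixel.
    """
    scale = small_side / min(width, height)
    new_width = max(small_side, round(width * scale))
    new_height = max(small_side, round(height * scale))
    return new_width, new_height


# A 500x375 image keeps its aspect ratio; its shorter side becomes 128.
print(resize_to_small_side(500, 375))
```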
BN-paper, caffe-PR. Note that results are obtained without the additional y = kx + b layer mentioned in the paper, unless a row says "+ scale&bias layer".
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Before | 0.474 | 2.35 | As in paper |
Before + scale&bias layer | 0.478 | 2.33 | As in paper |
After | 0.499 | 2.21 | |
After + scale&bias layer | 0.493 | 2.24 | |

So in all subsequent experiments, BN is placed after the non-linearity.
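To make the "Before" and "After" variants concrete, here is a minimal NumPy sketch of the two orderings. It is only an illustration under stated assumptions: it uses a fully connected layer instead of a convolution, training-mode batch statistics only (no running averages), and hypothetical function names; the experiments themselves were run in Caffe.

```python
import numpy as np


def batch_norm(x, eps=1e-5, scale=None, bias=None):
    """Normalize each feature (column) over the batch.

    The optional scale and bias correspond to the y = kx + b layer.
    """
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    if scale is not None:
        x_hat = x_hat * scale
    if bias is not None:
        x_hat = x_hat + bias
    return x_hat


def relu(x):
    return np.maximum(x, 0.0)


def block_bn_before(x, w):
    # "Before": linear -> BN -> ReLU (the ordering used in the BN paper)
    return relu(batch_norm(x @ w))


def block_bn_after(x, w):
    # "After": linear -> ReLU -> BN
    return batch_norm(relu(x @ w))


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))                # batch of 256 samples, 64 features
w = rng.normal(size=(64, 32)) / np.sqrt(64)   # hypothetical layer weights

before = block_bn_before(x, w)
after = block_bn_after(x, w)
print("before: mean %.3f, var %.3f" % (before.mean(), before.var()))
print("after:  mean %.3f, var %.3f" % (after.mean(), after.var()))
```

With BN after the non-linearity, the next layer receives inputs that are zero-mean and unit-variance per feature. With BN before ReLU, the next layer receives non-negative inputs.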
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
ReLU | 0.499 | 2.21 | |
RReLU | 0.500 | 2.20 | |
PReLU | 0.503 | 2.19 | |
ELU | 0.498 | 2.23 | |
Maxout | 0.487 | 2.28 | |
Sigmoid | 0.475 | 2.35 | |
TanH | 0.448 | 2.50 | |
No | 0.384 | 2.96 | |

ReLU non-linearity, fc6 and fc7 layers only.
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Dropout = 0.5 | 0.499 | 2.21 | |
Dropout = 0.2 | 0.527 | 2.09 | |
Dropout = 0 | 0.513 | 2.19 | |

Name | Accuracy | LogLoss | Comments |
---|---|---|---|
Caffenet | 0.471 | 2.36 | |
Caffenet BN Before + scale&bias layer LSUV | 0.478 | 2.33 | |
Caffenet BN Before + scale&bias layer Ortho | 0.482 | 2.31 | |
Caffenet BN After LSUV | 0.499 | 2.21 | |
Caffenet BN After Ortho | 0.500 | 2.20 | |

Name | Accuracy | LogLoss | Comments |
---|---|---|---|
GoogLeNet128 | 0.619** | 1.61 | |
GoogLeNet BN Before + scale&bias layer LSUV | 0.603 | 1.68 | |
GoogLeNet BN Before + scale&bias layer Ortho | 0.607 | 1.67 | |
GoogLeNet BN After LSUV | 0.596 | 1.70 | |
GoogLeNet BN After Ortho | 0.584 | 1.77 | |
[GoogLeNet128_BN_lim0606](https://github.com/lim0606/caffe-googlenet-bn) | 0.645 | 1.54 | BN before ReLU + scale bias, linear LR, batch_size = 128, base_lr = 0.005, 640K iter, LSUV init. 5x5 replaced with 3x3 + 3x3. 3x3 replaced with 3x1 + 1x3 |

As one can see, BN makes the difference between ReLU, ELU and PReLU negligible. This may confirm that the main source of the VLReLU and ELU advantages is that their output is closer to mean=0, var=1 than that of the standard ReLU.
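The claim about output statistics can be checked numerically. The sketch below feeds standard normal inputs through each non-linearity and prints the output mean and variance. The assumptions are mine, not the benchmark's: unit-Gaussian pre-activations, a VLReLU negative slope of 0.333, and ELU with alpha = 1.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def vlrelu(x, slope=0.333):
    # Very leaky ReLU; the slope value is an assumption.
    return np.where(x > 0, x, slope * x)


def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))


rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)  # pre-activations with mean 0, var 1

stats = {}
for name, fn in [("ReLU", relu), ("VLReLU", vlrelu), ("ELU", elu)]:
    y = fn(x)
    stats[name] = (y.mean(), y.var())
    print("%-7s mean %.3f  var %.3f" % (name, stats[name][0], stats[name][1]))
```

Under these assumptions, the ReLU output has the largest mean and the smallest variance of the three. A BN layer placed after the non-linearity removes exactly this difference.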
BN+Dropout = 0.5 is too much regularization. Dropout=0.2 is just enough :)
P.S. Logs are merged from many "save-resume" runs, because the networks were trained at night, so an "Anything vs. seconds" plot will give weird results.