[bug 🐛] weight decay incorrectly applied to LayerNorm and Mamba A, D parameters #5

Open
Niccolo-Ajroldi opened this issue Dec 5, 2024 · 0 comments · May be fixed by #6

Niccolo-Ajroldi commented Dec 5, 2024

Description

The current implementation applies weight decay to all model parameters.
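
For reference, the existing setup presumably builds the optimizer from a single parameter group, roughly like this (a hedged reconstruction, not the repo's exact code), which is why weight_decay reaches every parameter:

# Hedged reconstruction of the current behavior: a single parameter group,
# so weight decay is applied uniformly to every trainable parameter,
# including LayerNorm weights, biases, and Mamba's A_log / D.
optimizer = torch.optim.AdamW(
    self.model.parameters(),
    lr=self.mad_config.lr,
    weight_decay=self.mad_config.weight_decay,
)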

However:

  1. Mamba should not have weight decay on A_log and D: the reference implementation explicitly flags these parameters, e.g.
    self.A_log._no_weight_decay = True
    (see the sketch after this list).
  2. It's common practice not to apply weight decay to biases and normalization layers.
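
For context, here is a minimal sketch of how a Mamba-style mixer tags these parameters (shapes and names are illustrative, following the spirit of the reference mamba_ssm implementation, not this repo's code):

import torch
import torch.nn as nn

class MambaMixerSketch(nn.Module):
    """Illustrative only: shows how Mamba-style blocks flag A_log and D
    so that optimizer setup code can exclude them from weight decay."""
    def __init__(self, d_inner: int = 64, d_state: int = 16):
        super().__init__()
        # State matrix A is stored in log space; excluded from weight decay.
        A = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(d_inner, 1)
        self.A_log = nn.Parameter(torch.log(A))
        self.A_log._no_weight_decay = True
        # Skip-connection parameter D, also excluded from weight decay.
        self.D = nn.Parameter(torch.ones(d_inner))
        self.D._no_weight_decay = True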

Fix

I think 1. is the more critical issue, but we should also address 2. to reflect standard practice in language modelling.

#6 implements a fix for both.

It creates two different param_groups for parameters with and without weight decay (see 5ab076a):

decay_params, no_decay_params = [], []
for n, p in self.model.named_parameters():
    if p.requires_grad:
        if not getattr(p, '_no_weight_decay', False) and ("bias" not in n) and ("norm" not in n):
            decay_params.append(p)
        else:
            no_decay_params.append(p)
param_groups = [
    {"params": decay_params, "weight_decay": self.mad_config.weight_decay},
    {"params": no_decay_params, "weight_decay": 0.0},
]

# optimizer:
if self.mad_config.optimizer == 'adamw':
    optimizer = torch.optim.AdamW(
        param_groups,
        lr=self.mad_config.lr
    )
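
As a quick sanity check (not part of #6; just a sketch meant to run right after param_groups is built in the same method), one can verify that no bias, norm, or flagged parameter lands in the decay group:

# Hypothetical sanity check: the decay group must not contain any parameter
# that is flagged _no_weight_decay or whose name contains "bias" or "norm".
decay_ids = {id(p) for p in param_groups[0]["params"]}
for n, p in self.model.named_parameters():
    if getattr(p, '_no_weight_decay', False) or "bias" in n or "norm" in n:
        assert id(p) not in decay_ids, f"{n} should not receive weight decay"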

To distinguish normalization layers from other modules, I had to give them a name in the model initialization.
This is achieved by replacing:

self.unembed = nn.Sequential(norm(layer_cfg['dim']), nn.Linear(dim, vocab_size))

with:

self.unembed = nn.Sequential(OrderedDict([
    ('norm', norm(layer_cfg['dim'])),
    ('lm_head', nn.Linear(dim, vocab_size))
]))

and, analogously, wrapping each layer's norm in a named container:

self.model.append(nn.Sequential(OrderedDict([
    ('norm', norm(layer_cfg['dim'])),
    ('layer', layer(**layer_cfg))
])))
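
To see why the names matter for the "norm" not in n filter, here is a self-contained toy example (nn.LayerNorm stands in for the repo's norm): with the OrderedDict names, the substring "norm" shows up in every name yielded by named_parameters().

from collections import OrderedDict
import torch.nn as nn

# Toy module, not the repo's model: naming the submodule 'norm' makes
# "norm" appear in the parameter names the decay filter inspects.
unembed = nn.Sequential(OrderedDict([
    ('norm', nn.LayerNorm(16)),
    ('lm_head', nn.Linear(16, 100)),
]))
for name, _ in unembed.named_parameters():
    print(name)  # norm.weight, norm.bias, lm_head.weight, lm_head.bias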