
Fix weight decay #6

Open
wants to merge 2 commits into main
Conversation

@Niccolo-Ajroldi (Contributor) commented Dec 5, 2024

tl;dr: this PR applies weight decay selectively, excluding Mamba's A_log and D parameters as well as biases and normalization layers.

Fixes #5

For a more thorough discussion, see #5
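
For reference, a minimal sketch of this kind of parameter grouping in PyTorch (the helper name and the name-based A_log/D matching here are illustrative assumptions, not the repository's exact implementation):

```python
import torch
from torch import nn

def split_decay_params(model: nn.Module, weight_decay: float = 0.1):
    """Build optimizer parameter groups: weight decay is applied only to
    weight matrices; biases, normalization parameters, and Mamba's
    A_log / D are excluded. Illustrative sketch, not this PR's exact code."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1D tensors cover biases, norm scales, and Mamba's A_log / D;
        # the name check is an extra safeguard for those state-space params.
        if param.ndim < 2 or name.endswith(("A_log", "D")):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Usage with a hypothetical model:
# model = MyModel()
# optimizer = torch.optim.AdamW(split_decay_params(model, 0.1), lr=3e-4)
```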

@Zymrael (Collaborator) commented Dec 25, 2024

Thanks! Depending on your objective, similar weight decay changes should also be applied to other operator primitives. Can you report whether you observe differences in scores (even a single representative task is fine) with and without your PR?

Development

Successfully merging this pull request may close these issues.

[bug 🐛] weight decay incorrectly applied to LayerNorm and Mamba A, D parameters
2 participants