Hi, I found that the initialization of the parameters in the Pythia-6.9B model is inconsistent with the standard deviations of the step0 checkpoint. Table 6 in the paper lists small-init as the init-method and wang-init as the output-layer-init-method, but I computed different std values from the step0 model.
Inconsistent std values:
input_layer_std: 0.009882117688026186 (small_init) vs. 0.02 (std computed from the step0 model parameters)
output_layer_std: 0.0009765625 (wang_init) vs. 0.0025 (std computed from the step0 model parameters)
Could you provide the real init method? Thanks!
Config from Table 6 of the paper:
Here is a reproducible script with its results.
import math
from transformers import GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-6.9b",
    revision="step0",
)
model_dim = 4096  # Pythia-6.9b hidden size

# Compute the expected std values of the two init methods.
# Reference: https://github.com/EleutherAI/gpt-neox/blob/v1.0/megatron/model/init_functions.py#L101-L118
small_init_std = (2 / (5 * model_dim)) ** 0.5
wang_init_std = 2 / (32 * math.sqrt(model_dim))
print('small_init_std:', small_init_std)
print('wang_init_std:', wang_init_std)

# Print the empirical std of every parameter in the step0 checkpoint.
for n, p in model.named_parameters():
    print(n, p.shape, p.std().item())
No initialization method was set for these models. I don't think the standard deviations are intentional; they are probably more an artifact of the lack of a proper init method.
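For what it's worth, the observed numbers are consistent with Megatron/GPT-NeoX-style defaults rather than the Table 6 configs: 0.02 is the default normal init std, and 0.0025 equals the scaled residual-output init 0.02 / sqrt(2·L) with L = 32 layers for Pythia-6.9B. This is only an arithmetic observation, not a confirmed init method; a minimal check:

```python
import math

# Assumed defaults (not confirmed by the paper): Megatron-style init
default_std = 0.02  # common default normal-init std
num_layers = 32     # Pythia-6.9B depth

# Residual-output layers are often scaled by 1 / sqrt(2 * num_layers)
scaled_output_std = default_std / math.sqrt(2 * num_layers)

print('input std :', default_std)        # 0.02, matches the step0 input layer
print('output std:', scaled_output_std)  # 0.0025, matches the step0 output layer
```

If these defaults were in effect, both reported discrepancies (0.02 and 0.0025) are explained at once.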