Hello!
I am trying to fine-tune an Electra model with my own dataset, as described HERE. When I set model_args.sliding_window = True in my model arguments, I always get this warning:

We strongly recommend passing in an attention_mask since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.

I took a closer look at the source code in language_modeling_model.py and noticed that the attention masks are only created in this situation with use_hf_datasets:
....
inputs = inputs.to(self.device)
attention_mask = (
    batch["attention_mask"].to(self.device)
    if self.args.use_hf_datasets
    else None
)
token_type_ids = (
    batch["token_type_ids"].to(self.device)
    if self.args.use_hf_datasets and "token_type_ids" in batch
    else None
)
...
I assume that without the use of sliding_window, no padding is added, so this warning does not occur. Did I understand that correctly?
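To check that assumption, a quick count of the pad tokens in one batch should be enough. This is only a sketch; count_pad_tokens and the way the batch is obtained are my own placeholders, not anything from language_modeling_model.py:

import torch

def count_pad_tokens(input_ids: torch.Tensor, pad_token_id: int) -> int:
    # number of positions in the batch that hold the padding token
    return int((input_ids == pad_token_id).sum().item())

# e.g. with one batch of token ids and the model's tokenizer:
# count_pad_tokens(batch["input_ids"], tokenizer.pad_token_id)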
I also tested model evaluation on a test set whose examples are no longer than max_seq_length, both with and without this parameter, and found drastic differences in eval loss and perplexity.
So my question is: is there a way to include the attention_mask, which is generally important when training a language model, or does it have no influence on the quality of model fine-tuning in this particular situation?
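For illustration, this is roughly the kind of change I have in mind. It is only a sketch: build_attention_mask is my own helper, and the commented lines mirror the excerpt above rather than the library's actual code, assuming the batches are padded with the tokenizer's pad token:

import torch

def build_attention_mask(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # 1 for real tokens, 0 for padding; same shape as input_ids
    return (input_ids != pad_token_id).long()

# hypothetical use in the training step:
# inputs = inputs.to(self.device)
# attention_mask = build_attention_mask(inputs, self.tokenizer.pad_token_id)
# outputs = model(inputs, attention_mask=attention_mask, labels=labels)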
Thank you in advance!
Darija