I adapted the unsupervised training example from the Hugging Face documentation and ran a quick test. The code is as follows:

from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("premodel/ChatYuan-large-v1")
model = T5ForConditionalGeneration.from_pretrained("premodel/ChatYuan-large-v1")

input_ids = tokenizer("一只<extra_id_0>走在<extra_id_1>大街上", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0>可爱的<extra_id_1>宽敞的<extra_id_2>", return_tensors="pt").input_ids

outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits

It fails with:

IndexError: index out of range in self

This looks like an out-of-range index in the embedding layer. I checked the model's vocabulary and it does not contain the <extra_id_0>/<extra_id_1> sentinel tokens, yet tokenization itself raises no error. How can I do unsupervised pre-training based on the YUAN model, and what data format does unsupervised pre-training expect? Many thanks.
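One plausible explanation, not confirmed in this thread: Hugging Face's T5Tokenizer appends 100 <extra_id_*> sentinel tokens after the base vocabulary by default, so their IDs can land beyond the checkpoint's embedding table even though tokenization succeeds. A minimal check, assuming the same local checkpoint path as above:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Checkpoint path assumed to match the snippet above; adjust to your local copy.
tokenizer = T5Tokenizer.from_pretrained("premodel/ChatYuan-large-v1")
model = T5ForConditionalGeneration.from_pretrained("premodel/ChatYuan-large-v1")

# Number of rows in the model's input embedding table.
embedding_rows = model.get_input_embeddings().weight.shape[0]
print("embedding rows:", embedding_rows)
print("tokenizer size (incl. added sentinels):", len(tokenizer))

# Any sentinel ID >= embedding_rows will raise
# "IndexError: index out of range in self" inside the embedding layer.
for tok in ["<extra_id_0>", "<extra_id_1>", "<extra_id_2>"]:
    print(tok, tokenizer.convert_tokens_to_ids(tok))

If the sentinel IDs do exceed the embedding table, model.resize_token_embeddings(len(tokenizer)) would at least make the snippet run, though whether that matches the intended ChatYuan pre-training setup is a question for the maintainers.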
Please refer to the pre-training code in the README.
Thank you very much. Could you share a simple example of the dataset used for ChatYuan's unsupervised training? What is used to mark the masked spans?
For the details you can follow T5's construction rules: https://github.com/google-research/text-to-text-transfer-transformer
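For reference, here is a toy sketch of T5-style span corruption that produces (input, target) pairs in the same shape as the example at the top of this issue; the corruption rate and span length are illustrative defaults, not ChatYuan's actual settings:

import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    # Toy T5-style span corruption: drop random spans from the input,
    # replace each with an <extra_id_N> sentinel, and emit the dropped
    # spans (prefixed by the same sentinels) as the target.
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        for i in range(start, min(len(tokens), start + mean_span_len)):
            masked.add(i)

    inputs, targets = [], []
    sentinel = 0
    i = 0
    while i < len(tokens):
        if i in masked:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # closing sentinel
    return "".join(inputs), "".join(targets)

src, tgt = span_corrupt(list("一只可爱的小狗走在宽敞的大街上"))
print(src)  # unmasked text with <extra_id_N> sentinels where spans were dropped
print(tgt)  # each sentinel followed by the span it replaced, plus a closing sentinel

The resulting pair has the same structure as the 一只<extra_id_0>走在<extra_id_1>大街上 / <extra_id_0>可爱的<extra_id_1>宽敞的<extra_id_2> example above.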