
Cannot run python3 -m vall_e.train yaml=config/test/nar.yml #81

Open
samual30000 opened this issue Mar 30, 2023 · 12 comments


@samual30000

python3 -m vall_e.train yaml=config/test/nar.yml --debug

I got an error when running this. ChatGPT-4 said it might be a problem with the source files, but it couldn't give any concrete advice, so I can only ask the author.

trainer.train(
  File "/sam/vall-e/vall_e/utils/trainer.py", line 150, in train
    for batch in _make_infinite_epochs(train_dl):
  File "/sam/vall-e/vall_e/utils/trainer.py", line 103, in _make_infinite_epochs
    yield from dl
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/sam/vall-e/vall_e/data.py", line 185, in __getitem__
    proms = self.sample_prompts(spkr_name, ignore=path)
  File "/sam/vall-e/vall_e/data.py", line 172, in sample_prompts
    raise RuntimeError("All tensors in prom_list are zero-dimensional.")
RuntimeError: All tensors in prom_list are zero-dimensional.

Loaded tensor from /sam/vall-e/data/train/one.qnt.pt with shape: torch.Size([])
Added tensor with shape: torch.Size([])
Converted path: /sam/vall-e/data/train/one.qnt.pt -> /sam/vall-e/data/train/one.qnt.pt
Loaded tensor from /sam/vall-e/data/train/one.qnt.pt with shape: torch.Size([])
Added tensor with shape: torch.Size([])
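The crash comes from sample_prompts finding only zero-dimensional tensors for the speaker: the broken one.qnt.pt loads with shape torch.Size([]). A minimal sketch of the kind of guard that surfaces such a file early, written here in pure Python over shape tuples (with torch you would pass tuple(t.shape)); the function name is illustrative, not from the repo:

```python
# Hypothetical guard illustrating why sample_prompts raises: a usable
# EnCodec prompt tensor must have at least one dimension.
def usable_prompts(shapes):
    """Filter out zero-dimensional entries; raise if nothing usable remains.

    `shapes` is a list of tensor shapes as tuples, e.g. tuple(t.shape).
    """
    good = [s for s in shapes if len(s) > 0]
    if not good:
        raise RuntimeError("All tensors in prom_list are zero-dimensional.")
    return good

# A broken one.qnt.pt loads with shape torch.Size([]) -> () here,
# while a valid file would contribute something like (1, 8, 149).
print(usable_prompts([(1, 8, 149), ()]))  # keeps only the valid shape
```

With only zero-dimensional entries in the list, the same RuntimeError as in the traceback is raised.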

The script ChatGPT-4 wrote for me:
root@CH-202203180108:/sam/vall-e/data# cat 1.py

import torch

# Inspect the shapes of the quantized audio files used for training and validation.
train_qnt = torch.load('/sam/vall-e/data/train/one.qnt.pt')
print("Train qnt shape:", train_qnt.shape)

val_qnt = torch.load('/sam/vall-e/data/val/test.qnt.pt')
print("Val qnt shape:", val_qnt.shape)

root@CH-202203180108:/sam/vall-e/data# python3 1.py
Train qnt shape: torch.Size([3])
Val qnt shape: torch.Size([1, 8, 149])

Directory structure of data:
root@CH-202203180108:/sam/vall-e/data# ll
total 24
drwxr-xr-x 5 root root 4096 Mar 30 21:07 ./
drwxr-xr-x 8 root root 4096 Mar 30 23:45 ../
-rw-r--r-- 1 root root 216 Mar 30 21:07 1.py
drwxr-xr-x 2 root root 4096 Mar 28 14:27 test/
drwxr-xr-x 2 root root 4096 Mar 30 23:34 train/
drwxr-xr-x 2 root root 4096 Mar 28 14:55 val/

Files in the train directory:

root@CH-202203180108:/sam/vall-e/data# ll train/
total 408
drwxr-xr-x 2 root root 4096 Mar 30 23:34 ./
drwxr-xr-x 5 root root 4096 Mar 30 21:07 ../
-rw-r--r-- 1 root root 159 Mar 28 14:53 1.py
-rw-r--r-- 1 root root 37 Mar 28 14:49 one.phn.txt
-rw-r--r-- 1 root root 747 Mar 28 14:54 one.qnt.pt
-rw-r--r-- 1 root root 26 Mar 28 14:38 test.phn.txt
-rw-r--r-- 1 root root 10286 Mar 28 14:38 test.qnt.pt
-rw-r--r-- 1 root root 380750 Mar 30 23:34 test.wav
root@CH-202203180108:/sam/vall-e/data#

It errors out and I don't know how to fix it.

@Xiangbj17

GPT's advice is right.
.pt files encoded with EnCodec all have shape [1, 8, time_step].
/sam/vall-e/data/train/one.qnt.pt has only one dimension, which isn't right; check whether something went wrong in your qnt encoding step.
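The expected-shape check described above can be sketched as a small helper (pure Python over shape tuples; the function name is illustrative, not part of the repo):

```python
# Illustrative check: EnCodec-quantized .qnt.pt files should have shape
# [1, 8, time_step] (batch, quantizer levels, frames).
def looks_like_encodec_codes(shape):
    return (
        len(shape) == 3
        and shape[0] == 1
        and shape[1] == 8
        and shape[2] > 0
    )

print(looks_like_encodec_codes((1, 8, 149)))  # the good val file -> True
print(looks_like_encodec_codes((3,)))         # the broken one.qnt.pt -> False
```

Running this over every .qnt.pt in data/train would have flagged one.qnt.pt before training started.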

@samual30000
Author

> GPT's advice is right. .pt files encoded with EnCodec all have shape [1, 8, time_step]. /sam/vall-e/data/train/one.qnt.pt has only one dimension, which isn't right; check whether something went wrong in your qnt encoding step.

Did you get it running? Even after all the back-and-forth and debugging with GPT-4 I still can't. Is the project missing some training data, or is something else missing? It just won't run once I get to the python3 -m vall_e.train yaml=config/test/nar.yml --debug step.

@samual30000
Author

> GPT's advice is right. .pt files encoded with EnCodec all have shape [1, 8, time_step]. /sam/vall-e/data/train/one.qnt.pt has only one dimension, which isn't right; check whether something went wrong in your qnt encoding step.

Is something missing?

@samual30000
Author

'NoneType' object has no attribute 'optimizer_name'; self._config is a NoneType.

@ilanshib

I encountered the same problem; vall_e.train stopped working. At first look it seems a change was applied to Microsoft's DeepSpeed code: when Microsoft's module is initialized, it looks for a config object that contains the attribute optimizer_name.

vall_e uses DeepSpeed and initializes it as part of the 'Engine' class in utils/engines.py, but it does not pass the required config parameter. I am not familiar with this code, but I could see that other classes in utils/engines.py (e.g. the 'Engines' class) do use a config object that probably has the necessary information.

Can anyone help?

@Xiangbj17

> GPT's advice is right. .pt files encoded with EnCodec all have shape [1, 8, time_step]. /sam/vall-e/data/train/one.qnt.pt has only one dimension, which isn't right; check whether something went wrong in your qnt encoding step.
>
> Did you get it running? Even after all the back-and-forth and debugging with GPT-4 I still can't. Is the project missing some training data, or is something else missing? It just won't run once I get to the python3 -m vall_e.train yaml=config/test/nar.yml --debug step.

It runs fine for me. I suspect the dimensions of one.qnt.pt are the problem. Try deleting the one-related .pt and .txt files and run with only the bundled test .pt and .txt to see whether it still errors. If it runs normally, that proves EnCodec had a problem encoding one.wav; re-encode it and check whether you get a .pt with shape [1, 8, x].

@ilanshib

> 'NoneType' object has no attribute 'optimizer_name'; self._config is a NoneType.

See the discussion here: #87

@samual30000
Author

> 'NoneType' object has no attribute 'optimizer_name'; self._config is a NoneType.
>
> See the discussion here: #87

thanks

@samual30000
Author

> GPT's advice is right. .pt files encoded with EnCodec all have shape [1, 8, time_step]. /sam/vall-e/data/train/one.qnt.pt has only one dimension, which isn't right; check whether something went wrong in your qnt encoding step.
>
> Did you get it running? Even after all the back-and-forth and debugging with GPT-4 I still can't. Is the project missing some training data, or is something else missing? It just won't run once I get to the python3 -m vall_e.train yaml=config/test/nar.yml --debug step.
>
> It runs fine for me. I suspect the dimensions of one.qnt.pt are the problem. Try deleting the one-related .pt and .txt files and run with only the bundled test .pt and .txt to see whether it still errors. If it runs normally, that proves EnCodec had a problem encoding one.wav; re-encode it and check whether you get a .pt with shape [1, 8, x].

thx

@kgasenzer

> I encountered the same problem; vall_e.train stopped working. At first look it seems a change was applied to Microsoft's DeepSpeed code: when Microsoft's module is initialized, it looks for a config object that contains the attribute optimizer_name.
>
> vall_e uses DeepSpeed and initializes it as part of the 'Engine' class in utils/engines.py, but it does not pass the required config parameter. I am not familiar with this code, but I could see that other classes in utils/engines.py (e.g. the 'Engines' class) do use a config object that probably has the necessary information.
>
> Can anyone help?

I opened a pull request that deals with this issue. Make sure mpi4py is installed correctly, as I use the default initialization of distributed training, which may look for MPI.

@samual30000
Author

!pip install deepspeed==0.8.3 made it work.
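Before training, it may help to confirm which DeepSpeed version is actually installed against the pin that worked here. A sketch using the standard library (the parse helper and KNOWN_GOOD constant are illustrative, not from the repo):

```python
# Illustrative version check: parse a "major.minor.patch" string so the
# installed DeepSpeed version can be compared against the known-good pin.
from importlib import metadata

def parse_version(v):
    # Take only the numeric major.minor.patch components.
    return tuple(int(p) for p in v.split(".")[:3])

KNOWN_GOOD = parse_version("0.8.3")

try:
    installed = parse_version(metadata.version("deepspeed"))
    if installed != KNOWN_GOOD:
        print(f"deepspeed {installed} != pinned {KNOWN_GOOD}; "
              "consider `pip install deepspeed==0.8.3`")
except metadata.PackageNotFoundError:
    print("deepspeed is not installed")
```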

@tangzhimiao

Awesome, thx, that solved the train problem.

> !pip install deepspeed==0.8.3 made it work.
