Fix missing DeepSpeedConfig for deepspeed v0.9.1 #92

Open
wants to merge 1 commit into
base: main

@kgasenzer kgasenzer commented Apr 25, 2023

Following the comment in https://github.com/microsoft/DeepSpeed/issues/3309 and the discussions in #87 and #81, I added a fix for the missing config_class in trainer.Engine by adding snippets from deepspeed/__init__.py.

  • Distributed training is initialized by default, in the same way deepspeed.initialize does it.
  • config_class is provided as a deepspeed.runtime.config.DeepSpeedConfig instance.

This fixes the problem in #87 as well as in #81.

This is only necessary if you want to use deepspeed>=0.9.1.
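
To illustrate the two points above, here is a minimal sketch, not the actual diff of this PR. The helper names `needs_explicit_config` and `make_config_class` are hypothetical, the 0.9.1 threshold is taken from the description above, and the sketch assumes that `DeepSpeedConfig` accepts the config as a dict. In the PR itself, `dist.init_distributed(...)` is called in `load_engines()` before the engine is built.

```python
import re
from typing import Any, Callable, Optional


def needs_explicit_config(version: str) -> bool:
    """Return True if this DeepSpeed version is 0.9.1 or newer.

    Only the leading numeric components are compared, so local
    suffixes such as "+cu118" are ignored.
    """
    match = re.match(r"\s*(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(part or 0) for part in match.groups())
    return (major, minor, patch) >= (0, 9, 1)


def make_config_class(
    ds_config: dict,
    factory: Optional[Callable[[dict], Any]] = None,
) -> Any:
    """Build the object to pass as `config_class` to the engine.

    `factory` exists only so the function can be exercised without
    DeepSpeed installed; by default the real DeepSpeedConfig is used.
    """
    if factory is None:
        # Imported lazily so this module loads without DeepSpeed.
        from deepspeed.runtime.config import DeepSpeedConfig

        factory = DeepSpeedConfig
    return factory(ds_config)
```

A caller could check `needs_explicit_config(deepspeed.__version__)` and, if it is true, pass `config_class=make_config_class(ds_config)` when constructing the engine.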

@kgasenzer kgasenzer changed the title Fix missing DeepSpeedConfig in deepspeed v0.9.1 Fix missing DeepSpeedConfig for deepspeed v0.9.1 Apr 25, 2023
@Onkarsus13

I am still getting this issue:
AttributeError: 'NoneType' object has no attribute 'optimizer_name'
Can you please let me know how to fix it?

@kgasenzer
Author

I am still getting this issue: AttributeError: 'NoneType' object has no attribute 'optimizer_name'. Can you please let me know

Could you give more information about what you tried to do and at what point you encountered this issue? For me it works fine with config/test/ar.yml.

@Onkarsus13

I just cloned the repo with git clone and tried to run it, and I got this issue.

@Onkarsus13

data_dirs: [data/test]

model: ar-quarter
batch_size: 1
eval_batch_size: 1
save_ckpt_every: 500
eval_every: 500
max_iter: 1000
This is what is in my test/ar.yml.

@kgasenzer
Author

I just cloned the repo with git clone and tried to run it, and I got this issue

  • Did you check out my pull request with gh pr checkout 92? In vall_e/train.py there should be a change in load_engines().
  • Did you encounter this when running python -m vall_e.train yaml=config/test/ar.yml?

@Onkarsus13

Yes, when I ran "python -m vall_e.train yaml=config/test/ar.yml" I encountered the issue.

@Onkarsus13

I saw your PR 92, but I don't understand where to put the code snippet.

@kgasenzer
Author

I saw your PR 92, but I don't understand where to put the code snippet

  1. Open your terminal and navigate to the folder of the repository.
  2. If you cloned it correctly from git you should be able to get my version by typing the command gh pr checkout 92.
  3. If it is still not working, try it with my fork of this repository: kgasenzer/vall-e

@Onkarsus13

This is the new error I am getting: "AttributeError: 'PosixPath' object has no attribute 'log_dir'"

@JonathanColetti

I think an easier solution would be to pin deepspeed==0.8.3.

@kgasenzer kgasenzer closed this Jun 7, 2023
@aleb

aleb commented Jun 21, 2023

Please reopen this PR. It's only a matter of time until we need to update the code to support a version of DeepSpeed > 0.8.3.

@kgasenzer kgasenzer reopened this Jun 21, 2023
@aleb

aleb commented Jun 21, 2023

With DeepSpeed 0.9.4 I get this error:

(venv) $ pip uninstall deepspeed && pip install deepspeed==0.9.4

(venv) $ python -m vall_e.train yaml=ar.yml
  File "/vall-e/vall_e/train.py", line 32, in load_engines
    dist.init_distributed(dist_backend=dist_backend)
  File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 592, in init_distributed
    init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method)
  File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 148, in init_deepspeed_backend
    rank = int(os.environ["RANK"])
               ~~~~~~~~~~^^^^^^^^
  File "<frozen os>", line 679, in __getitem__
KeyError: 'RANK'

It gets past this error when the following environment variables are set:

(venv) $ RANK=0 WORLD_SIZE=1 python -m vall_e.train yaml=ar.yml
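
The same defaults could also be set from Python before the distributed backend is initialized. This is a minimal sketch that assumes a single-process run; the helper name `ensure_single_process_env` is hypothetical and not part of vall-e or DeepSpeed.

```python
import os
from typing import MutableMapping


def ensure_single_process_env(env: MutableMapping[str, str] = os.environ) -> None:
    """Fill in RANK and WORLD_SIZE for a single-process run.

    Values that a launcher has already set are left untouched.
    """
    env.setdefault("RANK", "0")
    env.setdefault("WORLD_SIZE", "1")
```

Calling `ensure_single_process_env()` at the top of `load_engines()`, before `dist.init_distributed(...)`, should have the same effect as the command line above.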

DeepSpeed then needs to be hacked to get past microsoft/DeepSpeed#826:

  File "/vall-e/vall_e/train.py", line 146, in <module>
    main()
  File "/vall-e/vall_e/train.py", line 137, in main
    trainer.train(
  File "/vall-e/vall_e/utils/trainer.py", line 125, in train
    engines = engines_loader()
              ^^^^^^^^^^^^^^^^
  File "/vall-e/vall_e/train.py", line 35, in load_engines
    dist.init_distributed(dist_backend=dist_backend)
  File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 615, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 643, in mpi_discovery
    result = subprocess.check_output(hostname_cmd, shell=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' returned non-zero exit status 64.
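
This traceback goes through mpi_discovery, which suggests that DeepSpeed did not find all the environment variables it expects and fell back to MPI-based discovery. The sketch below lists which ones are missing. It assumes the expected set is RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and LOCAL_RANK, which follows the usual torch.distributed conventions and is not confirmed in this thread; `missing_dist_env` is a hypothetical helper.

```python
import os
from typing import List, Mapping

# Assumed set of variables; adjust it if your DeepSpeed version expects others.
ASSUMED_DIST_ENV = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT", "LOCAL_RANK")


def missing_dist_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the assumed distributed variables that are not set in `env`."""
    return [name for name in ASSUMED_DIST_ENV if name not in env]
```

With only RANK=0 and WORLD_SIZE=1 set, this reports the remaining three names. If the assumption holds, setting those as well may avoid the `hostname -I` call.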
