In Step 1: Warmup training, multiple GPUs training #37

Open
xavierdawn opened this issue Oct 29, 2024 · 4 comments

Comments


xavierdawn commented Oct 29, 2024

I want to train with multiple GPUs. Besides setting export header="torchrun --nproc_per_node 4 --nnodes 1" and export CUDA_VISIBLE_DEVICES=4,5,6,7, is there anything else I need to set up? Right now it shows that my four GPUs with 24 GB of memory each still don't have enough memory. The training uses the Llama2-7B-HF model.
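For reference, the launch setup is summarized below, together with a rough memory estimate (the estimate is my own back-of-the-envelope arithmetic, and the nvidia-smi check is just a generic sanity check, not part of the LESS scripts):

```bash
# Single-node launch on 4 local GPUs, as passed to warmup_lora_train.sh
export CUDA_VISIBLE_DEVICES=4,5,6,7
export header="torchrun --nproc_per_node 4 --nnodes 1"

# Rough arithmetic: torchrun/DDP places one full model replica on every GPU.
# 6.87B params * 4 bytes (fp32) ≈ 27.5 GB of weights per replica, which is
# already more than a 24 GB card holds, before optimizer states or activations;
# in bf16 the weights alone would be ≈ 13.7 GB.

# Sanity check that the intended GPUs are visible and otherwise idle:
nvidia-smi --query-gpu=index,memory.total,memory.used --format=csv
```

The full output from the run: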

trainable params: 134,217,728 || all params: 6,872,641,536 || trainable%: 1.9529278123549145
[train set] examples: 13533; # avg tokens: 370.9773254394531
[train set] examples: 13533; # avg completion tokens: 105.39820861816406
Traceback (most recent call last):
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 183, in
main()
File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 152, in main
trainer = Trainer(
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 456, in init
self._move_model_to_device(model, args.device)
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 690, in _move_model_to_device
model = model.to(device)
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 3 has a total capacty of 23.67 GiB of which 111.62 MiB is free. Including non-PyTorch memory, this process has 23.56 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 1.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(The same CUDA out-of-memory traceback is printed by the other three ranks, reporting GPU 0, GPU 2, and GPU 1 respectively.)
[2024-11-03 07:03:40,851] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 991055) of binary: /mnt/users/ylu/anaconda3/envs/xwb_less/bin/python
Traceback (most recent call last):
File "/mnt/users/ylu/anaconda3/envs/xwb_less/bin/torchrun", line 8, in
sys.exit(main())
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

xavierdawn changed the title from “In step 1, when run "warmup_lora_train.sh", it shows that a deep recursive call has occurred” to “In step 1” on Oct 31, 2024
xavierdawn changed the title from “In step 1” to “In Step 1: Warmup training, multiple GPUs training” on Nov 3, 2024
xavierdawn reopened this on Nov 3, 2024
@roanvanblanken

Have you already found a fix for this...?

@QingyangZhang

Same question here.

@Cooper-Zhong

same here:)

@Cooper-Zhong

@QingyangZhang maybe you can try the following:

  • reduce lora_r and lora_alpha in base_training_args.sh
  • use a subset of the training data: point train_files in warmup_lora_train.sh at a small .jsonl file (a rough sketch of both changes follows below)
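A minimal sketch of those two changes (the example path, the subset size, and the current default values are assumptions, not taken from the repo):

```bash
# 1) In base_training_args.sh: lower the LoRA rank (and scale lora_alpha with it),
#    e.g. replace the existing values with something like:
#    --lora_r 32 --lora_alpha 128

# 2) In warmup_lora_train.sh: point train_files at a small .jsonl subset.
#    A .jsonl file is one JSON object per line, so head produces a valid smaller file.
full_file=data/train/processed/dolly/dolly_data.jsonl       # example path, adjust to your data
subset_file=data/train/processed/dolly/dolly_subset.jsonl
head -n 1000 "$full_file" > "$subset_file"
train_files=$subset_file
```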
