Hi,
Are you saying that there are other people on the same machine using the GPUs at the same time?
I am currently trying to train an ELMo model using the AllenNLP package, and executing the training command leads to CUDA Out of Memory. The full dataset I am trying to train on is ~2 million clinical notes; a toy run on 10 rows previously executed without error.
I tried setting `batch_size=1` and `max_instances_in_memory=2` to reduce memory usage, but the same issue persists. When I increased to `batch_size=2`, the amount of memory that PyTorch tries to allocate stayed the same, so I'm unsure whether reducing the batch size helps at all.

I have 4 GPUs on the server I'm working on. I tried both distributed and non-distributed training, but the same issue persists; even with distributed training, PyTorch tries to allocate the same amount of memory. (Intuitively, I expected that distributing across 2 devices would roughly halve the memory usage per device.)
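For reference, the relevant sections of the training config boil down to something like the sketch below (written here as a Python dict wrapped in AllenNLP's `Params` for brevity; the device list and values are placeholders, not my exact setup):

```python
from allennlp.common import Params

# Rough Python-dict equivalent of the data_loader / distributed sections of the
# jsonnet training config; values are placeholders, not an exact setup.
config = Params({
    "data_loader": {
        "batch_size": 1,               # smallest possible batch
        "max_instances_in_memory": 2,  # keep only a couple of instances tensorised at a time
    },
    # Listing several devices launches one distributed worker per GPU; each
    # worker still holds a full copy of the model, only the batches are split.
    "distributed": {
        "cuda_devices": [0, 1, 2, 3],
    },
})
```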
I am currently using a customised DatasetReader to read my serialised files, which are stored as lists in pkl format. The DatasetReader is not lazy; I tried making it lazy but did not see any improvement.
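To illustrate what I mean, a reader that yields instances one at a time would look roughly like the sketch below (the registered name `pickled_notes`, the `source` field name, and the indexers are placeholders, not my actual code):

```python
import pickle
from typing import Iterable, List

from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import TextField
from allennlp.data.instance import Instance
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token


@DatasetReader.register("pickled_notes")  # placeholder name
class PickledNotesReader(DatasetReader):
    """Streams clinical notes from a pickled list of pre-tokenised texts."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._token_indexers = {"tokens": SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterable[Instance]:
        # pickle.load still has to materialise the whole list, but yielding
        # instances one at a time lets the data loader keep only
        # `max_instances_in_memory` of them tensorised at once.
        with open(file_path, "rb") as f:
            notes: List[List[str]] = pickle.load(f)
        for note in notes:
            yield self.text_to_instance(note)

    def text_to_instance(self, tokens: List[str]) -> Instance:
        field = TextField([Token(t) for t in tokens], self._token_indexers)
        return Instance({"source": field})
```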
Before the CUDA Out of Memory error appears, the program runs for about a minute before terminating. The logs report `Worker 0 memory usage: 4.5G` and `GPU 0 memory usage: 3.6G`. Does this mean that there is insufficient GPU memory? However, I've read online about people training ELMo on a Tesla V100/P100 too. How can I optimise my code?

I am wondering whether this is an issue with my implementation or with the available GPU memory. As seen in the `nvidia-smi` log, there is memory left on certain GPUs, but when I try to use them, the memory is already reported as "allocated by PyTorch" because other users are using those GPUs. Any suggestions on how I can tackle this issue?