Out of memory during eval but not train? #51

Open
jamiehz opened this issue Apr 1, 2023 · 16 comments
@jamiehz commented Apr 1, 2023

Description:
During the evaluate phase, the machine's host memory (not CUDA memory) keeps increasing until the program is eventually killed.
Server base configuration:
GPU: 2 × V100S
RAM: 256 GB
May I ask whether you modified the source files of HuggingFace Transformers? What configuration is needed to run the code?

@gianfrancodemarco

The problem is that all of the encoded predictions are kept in memory, so as more predictions are made, more RAM is needed.
What you can do is edit the evaluation script to something like this:

# Split the data into batches, then decode each batch's predictions
# right away instead of keeping all the encoded predictions around.
predictions = []
for batch in batches:
    _predictions = predict(batch)
    # Decoded strings need a fraction of the memory of the encoded predictions.
    predictions.extend(decode(_predictions))

Moreover, as you noticed, the GPU is not used during eval, so you might want to change that too
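For instance, with a HuggingFace seq2seq model the loop might look like this (model, tokenizer, and dataloader stand in for whatever the repo builds; generate and batch_decode are the standard transformers APIs):

import torch

predictions = []
for batch in dataloader:  # any batched view of the eval set
    with torch.no_grad():
        encoded = model.generate(
            input_ids=batch["input_ids"].to(model.device),
            attention_mask=batch["attention_mask"].to(model.device),
        )
    # Decode immediately; the strings take far less RAM than the encoded
    # prediction tensors, which can be freed after each batch.
    predictions.extend(tokenizer.batch_decode(encoded, skip_special_tokens=True))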

@zhenghao977

@gianfrancodemarco I have the same problem. Could you please share the detailed modified evaluation script? Thanks!

@gianfrancodemarco

@zhenghao977 I've provided a scheme above for modifying the script. Otherwise, the implementation is in my fork of the project (even though I don't like to advertise it here...). Be aware, however, that the source code there has been heavily modified.

@thomascong121

@gianfrancodemarco Thanks for the insight. However, I could not find the code you mentioned in their repo. Do you mind telling me where I can find the related code?

@zhenghao977 commented May 5, 2023 via email

@WayneWong97

@thomascong121 Hi! I recently encountered a problem similar to yours. Did you find a way to modify the evaluation script? Thanks.

@zhenghao977 commented Jun 26, 2023 via email

@gianfrancodemarco commented Jul 11, 2023

@thomascong121 @WayneWong97 our version is here

@zhenghao977 commented Jul 11, 2023 via email

@WayneWong97

@gianfrancodemarco Thanks! I fixed the bug with your scheme.

@Sunhxxin commented Jul 16, 2023

@WayneWong97 I have the same problem. Could you please share the detailed evaluation script you modified with that scheme? Thanks!

@WayneWong97 commented Jul 17, 2023

@Sunhxxin You can find the scheme via @gianfrancodemarco's link.
My code follows @gianfrancodemarco's earlier scheme:

  1. Add a data iterator:
class ScienceQADatasetIterator:
    """Yields fixed-size batches of items from a dataset."""

    def __init__(self, dataset, batch_size):
        self._dataset = dataset
        self._index = 0
        self.batch_size = batch_size
        # Round up so a smaller final batch is still yielded.
        self.num_batches = len(self._dataset) // batch_size
        if len(self._dataset) % batch_size:
            self.num_batches += 1

    def __iter__(self):
        self._index = 0
        return self

    def __next__(self):
        if self._index < self.num_batches:
            items = []
            for i in range(self.batch_size):
                try:
                    index = (self._index * self.batch_size) + i
                    items.append(self._dataset[index])
                except IndexError:
                    # The last batch may be smaller than batch_size.
                    break
            self._index += 1
            return items
        raise StopIteration
  2. Each eval process in main.py adopts batch iteration similar to the following:
    batch_size = 1000
    test_set_iterator = ScienceQADatasetIterator(dataset=test_set, batch_size=batch_size)

    eval_metrics = {}
    for batch in test_set_iterator:
        batch_metrics = trainer.evaluate(eval_dataset=batch)
        for key, value in batch_metrics.items():
            if key not in eval_metrics:
                eval_metrics[key] = []
            eval_metrics[key].append(value)

    # Average each metric over the batches.
    for key, value in eval_metrics.items():
        eval_metrics[key] = sum(value) / len(value)
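Note that the last batch may be smaller than batch_size, so a plain mean over per-batch values slightly misweights it. If that matters, a weighted variant (a sketch, using the same trainer and iterator as above) would be:

    eval_metrics = {}
    total = 0
    for batch in test_set_iterator:
        batch_metrics = trainer.evaluate(eval_dataset=batch)
        n = len(batch)
        total += n
        for key, value in batch_metrics.items():
            # Weight each batch's metric by the number of samples in it.
            eval_metrics[key] = eval_metrics.get(key, 0.0) + value * n

    for key in eval_metrics:
        eval_metrics[key] /= total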

@Sunhxxin

@WayneWong97 Thanks!

@DJC-GO-SOLO

For me, this happened because I did not set the accumulation steps for evaluation when calling the program; their default value is None. So you just need to set eval_acc to 1 to solve this problem.
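In HuggingFace Transformers this corresponds to eval_accumulation_steps, which defaults to None, meaning all prediction tensors are accumulated on the accelerator before being moved to the CPU. Assuming the repo forwards eval_acc into the training arguments, the setting looks roughly like:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",        # placeholder
    per_device_eval_batch_size=4,  # placeholder
    # Move prediction tensors to the CPU every eval step instead of
    # accumulating them all first; defaults to None.
    eval_accumulation_steps=1,
)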

@cooelf (Contributor) commented Oct 15, 2023

Please try the latest version. It should have fixed the problem.

@zhenghao977 commented Oct 15, 2023 via email
