Out of memory during eval but not train? #51

Open
jamiehz opened this issue Apr 1, 2023 · 16 comments
@jamiehz commented Apr 1, 2023

Description:
During the evaluate phase, the machine's host memory (not CUDA memory) keeps increasing until the program is eventually killed.
Server base configuration:
GPU: 2 × V100S
RAM: 256 GB
May I ask whether you modified the source files of HuggingFace Transformers? What configuration is needed to run the code?

@gianfrancodemarco

The problem is that all of the encoded predictions are kept in memory, so as more predictions are made, more RAM is needed.
What you can do is edit the evaluation script to something like this:

# Split the data into batches, then decode each batch's predictions
# right away instead of keeping all the encoded predictions around.
predictions = []
for batch in batches:
    _predictions = predict(batch)
    # Decoded strings need a fraction of the memory of the encoded predictions.
    predictions.extend(decode(_predictions))

Moreover, as you noticed, the GPU is not used during eval, so you might want to change that too
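For instance, with a HuggingFace seq2seq model the loop might look like this (model, tokenizer, and dataloader stand in for whatever the repo builds; generate and batch_decode are the standard transformers APIs):

import torch

predictions = []
for batch in dataloader:  # any batched view of the eval set
    with torch.no_grad():
        encoded = model.generate(
            input_ids=batch["input_ids"].to(model.device),
            attention_mask=batch["attention_mask"].to(model.device),
        )
    # Decode immediately; the strings take far less RAM than the encoded
    # prediction tensors, which can be freed after each batch.
    predictions.extend(tokenizer.batch_decode(encoded, skip_special_tokens=True))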

@zhenghao977

@gianfrancodemarco I have the same problem. Could you please share the detailed modified evaluation script? Thanks!

@gianfrancodemarco

@zhenghao977 I've provided a scheme above for modifying the script. Otherwise, the implementation is in my fork of the project (even though I don't like to advertise it here...). Be aware, however, that the source code there has been heavily modified.

@thomascong121

@gianfrancodemarco Thanks for the insight. However, I could not find the code you mentioned in their repo. Do you mind telling me where I can find the related code?

@zhenghao977 commented May 5, 2023 via email

@WayneWong97

@thomascong121 Hi! I recently encountered a problem similar to yours. Did you find a way to modify the evaluation script? Thanks.

@zhenghao977 commented Jun 26, 2023 via email

@gianfrancodemarco commented Jul 11, 2023

@thomascong121 @WayneWong97 our version is here

@zhenghao977 commented Jul 11, 2023 via email

@WayneWong97

@gianfrancodemarco Thanks! I fixed the bug with your scheme.

@Sunhxxin commented Jul 16, 2023

@WayneWong97 I have the same problem. Could you please share the detailed evaluation script you modified with that scheme? Thanks!

@WayneWong97 commented Jul 17, 2023

@Sunhxxin You can find the scheme via @gianfrancodemarco's link.
My code follows @gianfrancodemarco's earlier scheme:

  1. Add a data iterator:
class ScienceQADatasetIterator:
    """Yields fixed-size batches of items from a dataset."""

    def __init__(self, dataset, batch_size):
        self._dataset = dataset
        self._index = 0
        self.batch_size = batch_size
        # Round up so a smaller final batch is still yielded.
        self.num_batches = len(self._dataset) // batch_size
        if len(self._dataset) % batch_size:
            self.num_batches += 1

    def __iter__(self):
        self._index = 0
        return self

    def __next__(self):
        if self._index < self.num_batches:
            items = []
            for i in range(self.batch_size):
                try:
                    index = (self._index * self.batch_size) + i
                    items.append(self._dataset[index])
                except IndexError:
                    # The last batch may be smaller than batch_size.
                    break
            self._index += 1
            return items
        raise StopIteration
  2. Each eval process in main.py adopts batch iteration similar to the following:
    batch_size = 1000
    test_set_iterator = ScienceQADatasetIterator(dataset=test_set, batch_size=batch_size)

    eval_metrics = {}
    for batch in test_set_iterator:
        batch_metrics = trainer.evaluate(eval_dataset=batch)
        for key, value in batch_metrics.items():
            if key not in eval_metrics:
                eval_metrics[key] = []
            eval_metrics[key].append(value)

    # Average each metric over the batches.
    for key, value in eval_metrics.items():
        eval_metrics[key] = sum(value) / len(value)
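Note that the last batch may be smaller than batch_size, so a plain mean over per-batch values slightly misweights it. If that matters, a weighted variant (a sketch, using the same trainer and iterator as above) would be:

    eval_metrics = {}
    total = 0
    for batch in test_set_iterator:
        batch_metrics = trainer.evaluate(eval_dataset=batch)
        n = len(batch)
        total += n
        for key, value in batch_metrics.items():
            # Weight each batch's metric by the number of samples in it.
            eval_metrics[key] = eval_metrics.get(key, 0.0) + value * n

    for key in eval_metrics:
        eval_metrics[key] /= total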

@Sunhxxin

@WayneWong97 Thanks!

@DJC-GO-SOLO

For me, this happened because I did not set the accumulation steps for evaluation when calling the program; their default value is None. So you just need to set eval_acc to 1 to solve this problem.
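In HuggingFace Transformers this corresponds to eval_accumulation_steps, which defaults to None, meaning all prediction tensors are accumulated on the accelerator before being moved to the CPU. Assuming the repo forwards eval_acc into the training arguments, the setting looks roughly like:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",        # placeholder
    per_device_eval_batch_size=4,  # placeholder
    # Move prediction tensors to the CPU every eval step instead of
    # accumulating them all first; defaults to None.
    eval_accumulation_steps=1,
)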

@cooelf (Contributor) commented Oct 15, 2023

Please try the latest version. It should have fixed the problem.

@zhenghao977 commented Oct 15, 2023 via email
