
The usage of RAM is always increasing during one epoch. #19

Open
quqixun opened this issue Aug 10, 2023 · 24 comments

quqixun commented Aug 10, 2023

After preprocessing the HDTF dataset, I got 415 videos.
249 videos (60%) were randomly selected as the training set; the other videos (40%) formed the test set.
The first 1500 frames of each video were extracted for training with stride 2.
So I got 277,117 frames in the training set and 179,711 frames in the test set.

My machine has 4 A100 GPUs with 40 GB VRAM each, 377 GB RAM, and 72 GB swap.
In training, the batch size is set to 16 per GPU.
During the first epoch, RAM usage keeps increasing.
At step 2743, all RAM (and even the swap space) was occupied, and training stopped.
Thus 2743 * 16 * 4 = 175,552 is the maximum number of frames that can be used in training on my machine, and that does not even take the test set into account.
When I reduced both the training and test sets to 10,000 frames, training ran fine.

Questions @sstzal :

  • Did you encounter the same problem in your training?
  • If so, how did you solve it?
  • Is it possible to release the weights of the diffusion model?

I suspect the cause of this problem is that too many logs or intermediate results are accumulated in memory during training.
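For anyone debugging this, a minimal generic sketch (my own diagnostic snippet, not code from this repository) to confirm that host RAM grows per step is to log the process RSS inside the training loop with psutil:

import os
import psutil

process = psutil.Process(os.getpid())

def log_rss(step):
    # Resident set size of this process in GB. Dataloader workers are
    # separate processes; sum process.children(recursive=True) to include them.
    rss_gb = process.memory_info().rss / 1024 ** 3
    print(f"step {step}: RSS = {rss_gb:.2f} GB")

If RSS grows linearly with the step count, something (logs, cached samples, tensors kept in a Python list) is being accumulated per batch.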

@rjc7011855

Hello, may I ask how you extracted the audio features?

quqixun commented Aug 17, 2023

@rjc7011855
Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features
and make the speech features have shape [8, 16, 29].

xz0305 commented Aug 17, 2023

Hello, how do you get the landmarks, please?

quqixun commented Aug 17, 2023

@xz0305
It is quite simple to get 68 landmarks using dlib.
http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2

import cv2
import dlib
import numpy as np


class LandmarksExtractor(object):
    """Detects the first face in an image and returns its 68 dlib landmarks."""

    def __init__(self, model_path):
        # model_path points to shape_predictor_68_face_landmarks.dat
        self.detector = dlib.get_frontal_face_detector()
        self.predictor = dlib.shape_predictor(model_path)

    def forward(self, image, is_rgb=True):
        # dlib expects RGB images; convert if the input is BGR (e.g. from cv2.imread).
        if not is_rgb:
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        return self.__predict(image)

    def __predict(self, image):
        # The second argument upsamples the image once to help find smaller faces.
        faces = self.detector(image, 1)
        assert len(faces) > 0, "no face detected"
        landmarks = self.predictor(image, faces[0])
        return self.shape_to_np(landmarks)

    @staticmethod
    def shape_to_np(shape, dtype=int):
        # Convert dlib's full_object_detection into a (68, 2) array of (x, y).
        coords = np.zeros((68, 2), dtype=dtype)
        for i in range(68):
            coords[i] = (shape.part(i).x, shape.part(i).y)
        return coords
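A hypothetical usage sketch (the file paths and the plain-text .lms format are my assumptions, not confirmed in this thread):

extractor = LandmarksExtractor("shape_predictor_68_face_landmarks.dat")
frame = cv2.imread("data/HDTF/images/0_0.jpg")  # OpenCV loads images as BGR
lms = extractor.forward(frame, is_rgb=False)    # (68, 2) integer array
np.savetxt("data/HDTF/landmarks/0_0.lms", lms, fmt="%d")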

xz0305 commented Aug 17, 2023

Thank you very much!

@rjc7011855

Thank you very much!

979277 commented Aug 18, 2023

> @rjc7011855 Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features and make the speech features have shape [8, 16, 29].

Hello, may I ask how your reproduction turned out? Could we exchange notes?

quqixun commented Aug 21, 2023

@979277

I trained for some epochs; below are some results.
difftalk_demo.zip

After preprocessing I have 400+ video clips in total. The author only provided the video names of the training set, not the test set, so I simply split the dataset at random.

Because memory usage keeps increasing during training (see the issue description at the top), after several experiments I ended up using the first 1100 frames of each video (taking every other frame) for training and testing. The videos in difftalk_demo.zip are from the test set, generated on the first 720 consecutive frames. As you can see, it does work to some extent.

In later experiments I plan to reduce the number of videos and use all frames of each video: take data where several clips can be cut from the same video, use one clip as the test set and the other clips as the training set, then train again and see how it performs.

There is no validation set during training, only a test set, and the final results are also evaluated on the test set, so there may be a risk of data leakage. The author presumably did the same.

xz0305 commented Aug 21, 2023

How did you download the HDTF data? The videos I downloaded have no sound.

quqixun commented Aug 22, 2023

@xz0305
You can use youtube-dl or yt-dlp to download the videos at the best quality for both the video and audio streams.
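A minimal sketch using yt-dlp's Python API (the URL and output template are placeholders; merging the streams requires ffmpeg):

from yt_dlp import YoutubeDL

opts = {
    "format": "bestvideo+bestaudio/best",  # keep both video and audio streams
    "merge_output_format": "mp4",          # mux the streams into a single .mp4
    "outtmpl": "%(id)s.%(ext)s",
}
with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])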

xz0305 commented Aug 22, 2023

@quqixun
Hello, does this step save the audio of each frame as an .npy file? When I do it this way, the generated features all have length 0. Could you explain the exact process?

quqixun commented Aug 22, 2023

@xz0305 What is saved are the audio features, extracted with https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features

Tinaa23 commented Aug 22, 2023

> @xz0305 What is saved are the audio features, extracted with https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features

Thank you for sharing this link.
If the video contains 3000 frames, then using this repo for audio feature extraction returns one .npy file with shape (3000, 16, 29).
However, for the DiffTalk model, we need a separate .npy file for each frame.
Can you please share how we can do this?
Thanks

quqixun commented Aug 22, 2023

@Tinaa23

Convert the (3000, 16, 29) array to (3000, 8, 16, 29):
3000 : number of frames
8 : sequence length for each frame
16 : window size
29 : number of features
See #10 (comment) .

Or you can refer to the code at https://github.com/miu200521358/NeuralVoicePuppetryMMD/blob/master/Audio2ExpressionNet/Training%20Code/data/audio_dataset.py#L85 ; it shows two ways to generate the sequence.
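A minimal sketch of the windowing, saving one (8, 16, 29) file per frame (the centering and edge padding here are my assumptions; the NeuralVoicePuppetry code linked above shows the exact variants):

import numpy as np

def save_per_frame_windows(features_path, out_dir, seq_len=8):
    feats = np.load(features_path)  # shape (N, 16, 29), one row per video frame
    n, half = feats.shape[0], seq_len // 2
    # Repeat the first/last rows so edge frames still get seq_len neighbors;
    # without this, edge windows come out shorter (e.g. [4, 16, 29]).
    padded = np.concatenate([np.repeat(feats[:1], half, axis=0),
                             feats,
                             np.repeat(feats[-1:], half, axis=0)], axis=0)
    for i in range(n):
        np.save(f"{out_dir}/{i}.npy", padded[i:i + seq_len])  # (8, 16, 29)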

979277 commented Aug 31, 2023

> I trained for some epochs; below are some results. difftalk_demo.zip […]

I'd like to ask whether you ran a full evaluation. I find that this method seems to perform poorly on some identities that were not seen in the training set.

@zyhsuperman

Do the extracted video frames and audio feature frames correspond one-to-one? I processed the videos to 25 fps and took the first 1000 frames, so the audio should correspond to 40 s; at a 16 kHz sampling rate the extracted audio features have 2400 frames in total. How should I handle this?
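One common way to reconcile the two rates (an assumption on my part, not confirmed by the authors) is to linearly interpolate the audio features onto the video frame timestamps:

import numpy as np

def resample_features(feats, n_video_frames):
    # feats: (n_audio_frames, feat_dim), e.g. (2400, 29) -> (1000, 29)
    src_t = np.linspace(0.0, 1.0, feats.shape[0])
    dst_t = np.linspace(0.0, 1.0, n_video_frames)
    return np.stack([np.interp(dst_t, src_t, feats[:, d])
                     for d in range(feats.shape[1])], axis=1)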

Tinaa23 commented Oct 13, 2023

> I trained for some epochs; below are some results. difftalk_demo.zip […]

Hi. I have a basic question and I hope you can help me with it. How can we specify the number of epochs in this code? The model only trains for one epoch on my machine.

sstzal (owner) commented Dec 11, 2023

The audio processing follows AD-NeRF, using DeepSpeech as the audio feature extractor.

I did not observe continuously increasing memory usage in my experiments. If you can find the cause, feel free to point it out and fix it; the change can also be merged into this project.

The results in difftalk_demo.zip look decent. In practical applications we add one more post-processing step: specifically, we use [Real-Time Intermediate Flow Estimation for Video Frame Interpolation] (RIFE) for frame interpolation to obtain smoother videos.

kaiw7 commented Jan 31, 2024

> After preprocessing the HDTF dataset, I got 415 videos. […]

Hi, could I ask whether the HDTF videos you downloaded have an audio stream? Could you share the download link? Many thanks.

@Utkarsh-shift

> @rjc7011855 Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features and make the speech features have shape [8, 16, 29].

I am getting [x, 16, 29], where x is the number of frames, after running deepspeech_features.

@Utkarsh-shift

Thanks, I found the answer in the comment above.

kaiw7 commented Feb 15, 2024

> I am getting [x, 16, 29], where x is the number of frames, after running deepspeech_features.

Hi, could I ask how to download the dataset? I ran into some issues downloading it. Thank you very much.

@jinlingxueluo

> @xz0305 It is quite simple to get 68 landmarks using dlib. […]

I followed AD-NeRF's processing. Did you ever run into an error like
RuntimeError: stack expects each tensor to be equal size, but got [4, 16, 29] at entry 0 and [8, 16, 29] at entry 1 ?

SCP2922 commented Aug 27, 2024

Regarding the directory layout in the documentation:

|——data/HDTF
    |——images
        |——0_0.jpg
        |——0_1.jpg
        |——...
        |——N_M.bin
    |——landmarks
        |——0_0.lmd
        |——0_1.lmd
        |——...
        |——N_M.lms
    |——audio_smooth
        |——0_0.npy
        |——0_1.npy
        |——...
        |——N_M.npy

Do 0_0.jpg and 0_1.jpg represent the first and second frames of a single video, or the first frame of each segment after a video is split into segments?
What information do N_M.bin and N_M.lms store?
And what is the correspondence between the audio file 0_0.npy and the image 0_0.jpg? Is it the audio feature for that single frame, or for a whole segment?
I hope someone can help clarify. Many thanks.
