
The usage of RAM is always increasing during one epoch. #19

Open
quqixun opened this issue Aug 10, 2023 · 24 comments

quqixun commented Aug 10, 2023

After preprocessing the HDTF dataset, I got 415 videos.
249 videos (60%) were randomly selected as the training set; the other videos (40%) formed the test set.
The first 1500 frames of each video were extracted for training with stride 2.
So I got 277,117 frames in the training set and 179,711 frames in the test set.

My machine has 4 A100 GPUs with 40 GB VRAM each, 377 GB RAM, and 72 GB swap.
In training, the batch size is set to 16 per GPU.
During the first epoch, RAM usage keeps increasing.
At step 2743, all RAM (and even the swap space) was occupied, and training stopped.
Thus 2743 * 16 * 4 = 175,552 is the maximum number of frames that can be used in training on my machine, and that does not even take the test set into account.
When I reduced both the training and test sets to 10,000 frames, training ran fine.

Questions @sstzal :

  • Did you encounter the same problem in your training?
  • If so, how did you solve it?
  • Is it possible to release the weights of the diffusion model?

I suspect the cause of this problem is that too many logs or intermediate results are accumulated in memory during training.
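For anyone debugging this, a minimal generic sketch (my own diagnostic snippet, not code from this repository) to confirm that host RAM grows per step is to log the process RSS inside the training loop with psutil:

import os
import psutil

process = psutil.Process(os.getpid())

def log_rss(step):
    # Resident set size of this process in GB. Dataloader workers are
    # separate processes; sum process.children(recursive=True) to include them.
    rss_gb = process.memory_info().rss / 1024 ** 3
    print(f"step {step}: RSS = {rss_gb:.2f} GB")

If RSS grows linearly with the step count, something (logs, cached samples, tensors kept in a Python list) is being accumulated per batch.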

@rjc7011855

Hello, may I ask how you extracted the audio features?

quqixun commented Aug 17, 2023

@rjc7011855
Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features
and make the speech features have shape [8, 16, 29].

xz0305 commented Aug 17, 2023

Hello, how do you get the landmarks, please?

quqixun commented Aug 17, 2023

@xz0305
It is quite simple to get 68 landmarks using dlib.
http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2

import cv2
import dlib
import numpy as np


class LandmarksExtractor(object):
    """Detects the first face in an image and returns its 68 dlib landmarks."""

    def __init__(self, model_path):
        # model_path points to shape_predictor_68_face_landmarks.dat
        self.detector = dlib.get_frontal_face_detector()
        self.predictor = dlib.shape_predictor(model_path)

    def forward(self, image, is_rgb=True):
        # dlib expects RGB images; convert if the input is BGR (e.g. from cv2.imread).
        if not is_rgb:
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        return self.__predict(image)

    def __predict(self, image):
        # The second argument upsamples the image once to help find smaller faces.
        faces = self.detector(image, 1)
        assert len(faces) > 0, "no face detected"
        landmarks = self.predictor(image, faces[0])
        return self.shape_to_np(landmarks)

    @staticmethod
    def shape_to_np(shape, dtype=int):
        # Convert dlib's full_object_detection into a (68, 2) array of (x, y).
        coords = np.zeros((68, 2), dtype=dtype)
        for i in range(68):
            coords[i] = (shape.part(i).x, shape.part(i).y)
        return coords
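A hypothetical usage sketch (the file paths and the plain-text .lms format are my assumptions, not confirmed in this thread):

extractor = LandmarksExtractor("shape_predictor_68_face_landmarks.dat")
frame = cv2.imread("data/HDTF/images/0_0.jpg")  # OpenCV loads images as BGR
lms = extractor.forward(frame, is_rgb=False)    # (68, 2) integer array
np.savetxt("data/HDTF/landmarks/0_0.lms", lms, fmt="%d")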

xz0305 commented Aug 17, 2023

Thank you very much!

@rjc7011855

Thank you very much!

979277 commented Aug 18, 2023

> @rjc7011855 Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features and make the speech features have shape [8, 16, 29].

Hello, may I ask how your reproduction turned out? Could we exchange notes?

quqixun commented Aug 21, 2023

@979277

I trained for some epochs; below are some results.
difftalk_demo.zip

After preprocessing I have 400+ video clips in total. The author only provided the video names of the training set, not the test set, so I simply split the dataset at random.

Because memory usage keeps increasing during training (see the issue description at the top), after several experiments I ended up using the first 1100 frames of each video (taking every other frame) for training and testing. The videos in difftalk_demo.zip are from the test set, generated on the first 720 consecutive frames. As you can see, it does work to some extent.

In later experiments I plan to reduce the number of videos and use all frames of each video: take data where several clips can be cut from the same video, use one clip as the test set and the other clips as the training set, then train again and see how it performs.

There is no validation set during training, only a test set, and the final results are also evaluated on the test set, so there may be a risk of data leakage. The author presumably did the same.

xz0305 commented Aug 21, 2023

How did you download the HDTF data? The videos I downloaded have no sound.

quqixun commented Aug 22, 2023

@xz0305
You can use youtube-dl or yt-dlp to download the videos at the best quality for both the video and audio streams.
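A minimal sketch using yt-dlp's Python API (the URL and output template are placeholders; merging the streams requires ffmpeg):

from yt_dlp import YoutubeDL

opts = {
    "format": "bestvideo+bestaudio/best",  # keep both video and audio streams
    "merge_output_format": "mp4",          # mux the streams into a single .mp4
    "outtmpl": "%(id)s.%(ext)s",
}
with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])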

xz0305 commented Aug 22, 2023

@quqixun
Hello, does this step save the audio of each frame as an .npy file? When I do it this way, the generated features all have length 0. Could you explain the exact process?

quqixun commented Aug 22, 2023

@xz0305 What is saved are the audio features, extracted with https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features

Tinaa23 commented Aug 22, 2023

> @xz0305 What is saved are the audio features, extracted with https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features

Thank you for sharing this link.
If the video contains 3000 frames, then using this repo for audio feature extraction returns one .npy file with shape (3000, 16, 29).
However, for the DiffTalk model, we need a separate .npy file for each frame.
Can you please share how we can do this?
Thanks

quqixun commented Aug 22, 2023

@Tinaa23

Convert the (3000, 16, 29) array to (3000, 8, 16, 29):
3000 : number of frames
8 : sequence length for each frame
16 : window size
29 : number of features
See #10 (comment) .

Or you can refer to the code at https://github.com/miu200521358/NeuralVoicePuppetryMMD/blob/master/Audio2ExpressionNet/Training%20Code/data/audio_dataset.py#L85 ; it shows two ways to generate the sequence.
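A minimal sketch of the windowing, saving one (8, 16, 29) file per frame (the centering and edge padding here are my assumptions; the NeuralVoicePuppetry code linked above shows the exact variants):

import numpy as np

def save_per_frame_windows(features_path, out_dir, seq_len=8):
    feats = np.load(features_path)  # shape (N, 16, 29), one row per video frame
    n, half = feats.shape[0], seq_len // 2
    # Repeat the first/last rows so edge frames still get seq_len neighbors;
    # without this, edge windows come out shorter (e.g. [4, 16, 29]).
    padded = np.concatenate([np.repeat(feats[:1], half, axis=0),
                             feats,
                             np.repeat(feats[-1:], half, axis=0)], axis=0)
    for i in range(n):
        np.save(f"{out_dir}/{i}.npy", padded[i:i + seq_len])  # (8, 16, 29)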

979277 commented Aug 31, 2023

> I trained for some epochs; below are some results. difftalk_demo.zip […]

I'd like to ask whether you ran a full evaluation. I find that this method seems to perform poorly on some identities that were not seen in the training set.

@zyhsuperman

Do the extracted video frames and audio feature frames correspond one-to-one? I processed the videos to 25 fps and took the first 1000 frames, so the audio should correspond to 40 s; at a 16 kHz sampling rate the extracted audio features have 2400 frames in total. How should I handle this?
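One common way to reconcile the two rates (an assumption on my part, not confirmed by the authors) is to linearly interpolate the audio features onto the video frame timestamps:

import numpy as np

def resample_features(feats, n_video_frames):
    # feats: (n_audio_frames, feat_dim), e.g. (2400, 29) -> (1000, 29)
    src_t = np.linspace(0.0, 1.0, feats.shape[0])
    dst_t = np.linspace(0.0, 1.0, n_video_frames)
    return np.stack([np.interp(dst_t, src_t, feats[:, d])
                     for d in range(feats.shape[1])], axis=1)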

Tinaa23 commented Oct 13, 2023

> I trained for some epochs; below are some results. difftalk_demo.zip […]

Hi. I have a basic question and I hope you can help me with it. How can we specify the number of epochs in this code? The model only trains for one epoch on my machine.

sstzal (owner) commented Dec 11, 2023

The audio processing follows AD-NeRF, using DeepSpeech as the audio feature extractor.

I did not observe continuously increasing memory usage in my experiments. If you can find the cause, feel free to point it out and fix it; the change can also be merged into this project.

The results in difftalk_demo.zip look decent. In practical applications we add one more post-processing step: specifically, we use [Real-Time Intermediate Flow Estimation for Video Frame Interpolation] (RIFE) for frame interpolation to obtain smoother videos.

kaiw7 commented Jan 31, 2024

> After preprocessing the HDTF dataset, I got 415 videos. […]

Hi, could I ask whether the HDTF videos you downloaded have an audio stream? Could you share the download link? Many thanks.

@Utkarsh-shift

> @rjc7011855 Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features and make the speech features have shape [8, 16, 29].

I am getting [x, 16, 29], where x is the number of frames, after running deepspeech_features.

@Utkarsh-shift

Thanks, I found the answer in the comment above.

kaiw7 commented Feb 15, 2024

> I am getting [x, 16, 29], where x is the number of frames, after running deepspeech_features.

Hi, could I ask how to download the dataset? I ran into some issues downloading it. Thank you very much.

@jinlingxueluo

> @xz0305 It is quite simple to get 68 landmarks using dlib. […]

I followed AD-NeRF's processing. Did you ever run into an error like
RuntimeError: stack expects each tensor to be equal size, but got [4, 16, 29] at entry 0 and [8, 16, 29] at entry 1 ?

SCP2922 commented Aug 27, 2024

Regarding the directory layout in the documentation:

|——data/HDTF
    |——images
        |——0_0.jpg
        |——0_1.jpg
        |——...
        |——N_M.bin
    |——landmarks
        |——0_0.lmd
        |——0_1.lmd
        |——...
        |——N_M.lms
    |——audio_smooth
        |——0_0.npy
        |——0_1.npy
        |——...
        |——N_M.npy

Do 0_0.jpg and 0_1.jpg represent the first and second frames of a single video, or the first frame of each segment after a video is split into segments?
What information do N_M.bin and N_M.lms store?
And what is the correspondence between the audio file 0_0.npy and the image 0_0.jpg? Is it the audio feature for that single frame, or for a whole segment?
I hope someone can help clarify. Many thanks.
