- Add a data preprocessing script that transforms audio samples from the temporal to the frequency domain.
- Adapt the backbone architecture to the new data shape.
- Update the train and sample scripts accordingly, following the previous feature.
- Verify that training and sampling work correctly.
- Update the postprocessing script to leverage the ndarray output of data_gen.
A Mel-spectrogram or STFT spectrogram is generally the most effective audio data representation, for the following reasons:
- Balanced Resolution: Spectrograms provide a good balance between time and frequency resolution, enabling the model to learn both temporal progression and spectral content.
- Dimensionality Reduction: A Mel-spectrogram reduces the frequency resolution based on perceptual importance, making it easier for a model to process while retaining essential audio features (see the shape comparison after this list).
- Ease of Reconstruction: Converting predictions from the frequency domain back to the temporal domain is feasible with inverse STFT functions.
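To make the dimensionality-reduction point concrete, here is a minimal sketch using librosa; the clip length, sample rate, and STFT/Mel parameters are illustrative assumptions, not values fixed by this project:

```python
import librosa
import numpy as np

# Hypothetical 5-second clip at 44.1 kHz (random noise as a stand-in).
sr = 44100
y = np.random.randn(5 * sr).astype(np.float32)
print(y.shape)  # (220500,) -- raw waveform: 220,500 samples

# Mel-spectrogram: 128 perceptually spaced frequency bands over ~431
# frames -- a much more compact 2-D representation of the same clip.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=128
)
print(mel.shape)  # (128, 431)
```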
Example Workflow for Spectrogram-Based Diffusion (a code sketch follows the list):
- Convert your 15-second history and 5-second future target segments into Mel-spectrograms or STFT spectrograms.
- Normalize each spectrogram to a suitable range (e.g., [-1, 1] or [0, 1]) for model input. You can also standardize and then scale.
- Train the Diffusion Model to predict future spectrogram frames based on the given history.
- Invert the generated spectrograms back to the waveform domain using an inverse STFT or a neural vocoder.
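A minimal sketch of this workflow, assuming librosa and illustrative parameters (1024-sample window, hop of 512, 128 Mel bands, a fixed [-80, 0] dB range); Griffin-Lim is used here for inversion, though a trained neural vocoder would typically give higher fidelity:

```python
import librosa
import numpy as np

SR, N_FFT, HOP, N_MELS = 44100, 1024, 512, 128  # assumed parameters

def to_model_input(wav: np.ndarray) -> np.ndarray:
    """Waveform -> log-Mel-spectrogram normalized to [-1, 1]."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS
    )
    log_mel = librosa.power_to_db(mel, ref=np.max)  # roughly [-80, 0] dB
    return log_mel / 40.0 + 1.0                     # map [-80, 0] -> [-1, 1]

def to_waveform(x: np.ndarray) -> np.ndarray:
    """Normalized log-Mel prediction -> waveform via Griffin-Lim.

    Note: the ref=np.max scale factor from preprocessing is not
    recovered here, so the output amplitude is only approximate.
    """
    mel = librosa.db_to_power((x - 1.0) * 40.0)
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=SR, n_fft=N_FFT, hop_length=HOP
    )

# Split a 20 s clip into a 15 s history and a 5 s future target.
clip = np.random.randn(20 * SR).astype(np.float32)  # stand-in for real audio
history = to_model_input(clip[: 15 * SR])
target = to_model_input(clip[15 * SR :])
# The diffusion model is trained to predict `target` given `history`;
# generated spectrograms are inverted with `to_waveform` (or a vocoder).
```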
Rule of Thumb: A hop length of 512 samples (for a 1024-sample window) is a good starting point, especially for music.
Sampling Rate Considerations:
- For audio at a standard sampling rate of 44.1 kHz or 48 kHz, a window size of 1024 samples and a hop length of 512 samples work well for most music genres.
- If your audio is at a lower sample rate (e.g., 16 kHz), consider smaller values, such as a window size of 512 and a hop length of 256, as the quick check below shows.
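To see what these settings mean in practice, here is a quick check of the time resolution each combination yields (plain arithmetic, no assumptions beyond the values above):

```python
# Time resolution implied by the rule of thumb above.
for sr, win, hop in [(44100, 1024, 512), (48000, 1024, 512), (16000, 512, 256)]:
    window_ms = 1000 * win / sr   # context each frame sees
    frame_ms = 1000 * hop / sr    # time step between frames
    frames_per_s = sr / hop       # spectrogram frames per second
    print(f"{sr} Hz: window {window_ms:.1f} ms, hop {frame_ms:.1f} ms, "
          f"{frames_per_s:.1f} frames/s")
# 44100 Hz: window 23.2 ms, hop 11.6 ms, 86.1 frames/s
# 48000 Hz: window 21.3 ms, hop 10.7 ms, 93.8 frames/s
# 16000 Hz: window 32.0 ms, hop 16.0 ms, 62.5 frames/s
```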