maxrmorrison/promonet

Prosody and Pronunciation Modification Network (ProMoNet)

A versatile speech editor for changing the prosody (pitch, duration, and loudness), pronunciation, or speaker identity (i.e., voice conversion) of recorded speech. Official code for the paper Fine-Grained and Interpretable Neural Speech Editing.

[paper] [website]

Installation

pip install promonet

We are working on adding torbi, our fast Viterbi decoding implementation, to PyTorch. torbi significantly speeds up pitch estimation. Until it is available in PyTorch, you must manually download and install torbi as well as the development (dev) branch of the pitch estimator penn. You can track the progress of incorporating Viterbi decoding into PyTorch here.

# Install torbi
git clone git@github.com:maxrmorrison/torbi
pip install torbi/

# Install the development branch of penn
git clone -b dev git@github.com:interactiveaudiolab/penn
pip install penn/

Usage

Our included model checkpoint allows speech editing and synthesis for VCTK speakers. To use promonet with other speakers, you must first perform speaker adaptation on a dataset of recordings of the target speaker. You can then use the resulting model checkpoint to perform speech editing in the target speaker's voice. All of this can be done using either the API or CLI.

import promonet


###############################################################################
# Speaker adaptation
###############################################################################


# Speaker's name
name = 'max'

# Audio files for adaptation
files = [...]

# GPU index to perform adaptation and editing on
gpu = 0

# Perform speaker adaptation
checkpoint = promonet.adapt.speaker(name, files, gpu=gpu)


###############################################################################
# Speech editing
###############################################################################


# Load speech to edit
audio = promonet.load.audio('test.wav')

# Get features to edit
loudness, pitch, periodicity, ppg = promonet.preprocess.from_audio(
    audio,
    promonet.SAMPLE_RATE,
    gpu)

# We'll use a ratio of 2.0 for all editing examples
ratio = 2.0

# Perform pitch-shifting
shifted = promonet.synthesize.from_features(
    *promonet.edit.from_features(
        loudness,
        pitch,
        periodicity,
        ppg,
        pitch_shift_cents=promonet.convert.ratio_to_cents(ratio)),
    checkpoint=checkpoint,
    gpu=gpu)

# Perform time-stretching
stretched = promonet.synthesize.from_features(
    *promonet.edit.from_features(
        loudness,
        pitch,
        periodicity,
        ppg,
        time_stretch_ratio=ratio),
    checkpoint=checkpoint,
    gpu=gpu)

# Perform loudness editing
scaled = promonet.synthesize.from_features(
    *promonet.edit.from_features(
        loudness,
        pitch,
        periodicity,
        ppg,
        loudness_scale_db=promonet.convert.ratio_to_db(ratio)),
    checkpoint=checkpoint,
    gpu=gpu)

# Edit spectral balance (> 1 for Alvin and the Chipmunks; < 1 for Patrick Star)
alvin = promonet.synthesize.from_features(
    loudness,
    pitch,
    periodicity,
    ppg,
    spectral_balance_ratio=ratio,
    checkpoint=checkpoint,
    gpu=gpu)

See the ppgs.edit submodule documentation for the pronunciation (PPG) editing API.
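
The editing examples above turn a multiplicative ratio into cents (for pitch) and decibels (for loudness) through promonet.convert. The sketch below shows the arithmetic behind such conversions under the standard definitions: 1200 cents per octave, and 20 * log10 for an amplitude ratio. The function names cents_from_ratio and db_from_ratio are hypothetical standalone helpers, not promonet's own code, and promonet may use a different decibel convention.

```python
import math


def cents_from_ratio(ratio: float) -> float:
    """Convert a frequency ratio to cents (1200 cents per octave)"""
    return 1200.0 * math.log2(ratio)


def db_from_ratio(ratio: float) -> float:
    """Convert a ratio to decibels, assuming an amplitude ratio"""
    return 20.0 * math.log10(ratio)


# The ratio used in the editing examples
ratio = 2.0
print(cents_from_ratio(ratio))         # 1200.0 (one octave up)
print(round(db_from_ratio(ratio), 2))  # 6.02
```

Under these assumptions, a ratio of 2.0 corresponds to a one-octave pitch shift and a loudness change of roughly 6 dB.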

Application programming interface (API)

Adaptation API

promonet.adapt.speaker

def speaker(
    name: str,
    files: List[Path],
    checkpoint: Path = None,
    gpu: Optional[int] = None
) -> Path:
    """Perform speaker adaptation

    Args:
        name: The name of the speaker
        files: The audio files to use for adaptation
        checkpoint: The model checkpoint directory
        gpu: The gpu to run adaptation on

    Returns:
        checkpoint: The file containing the trained generator checkpoint
    """
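
As an illustration of calling promonet.adapt.speaker, the sketch below gathers every recording in a directory and passes the list on for adaptation. The helpers collect_audio_files and adapt_from_directory are hypothetical names, and the '.wav' extension is an assumption about how the target speaker's recordings are stored.

```python
from pathlib import Path
from typing import List, Optional


def collect_audio_files(directory, extension: str = '.wav') -> List[Path]:
    """Gather audio files in a directory in a deterministic order"""
    return sorted(Path(directory).glob(f'*{extension}'))


def adapt_from_directory(
    name: str,
    directory,
    gpu: Optional[int] = None
) -> Path:
    """Run speaker adaptation on every matching file in a directory"""
    files = collect_audio_files(directory)
    if not files:
        raise ValueError(f'No audio files found in {directory}')

    # Imported lazily so the helper above works without promonet installed
    import promonet
    return promonet.adapt.speaker(name, files, gpu=gpu)
```

The returned path can then be passed as the checkpoint argument of the synthesis functions, as in the usage example.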

Preprocessing API

promonet.preprocess.from_audio

def from_audio(
    audio: torch.Tensor,
    sample_rate: int = promonet.SAMPLE_RATE,
    gpu: Optional[int] = None,
    features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> Union[
    Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor],
    Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, str]
]:
    """Preprocess audio

    Arguments
        audio: Audio to preprocess
        sample_rate: Audio sample rate
        gpu: The GPU index
        features: The features to preprocess.
            Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].

    Returns
        loudness: The loudness contour
        pitch: The pitch contour
        periodicity: The periodicity contour
        ppg: The phonetic posteriorgram
        text: The text transcript
    """

promonet.preprocess.from_file

def from_file(
    file: Union[str, bytes, os.PathLike],
    gpu: Optional[int] = None,
    features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> Union[
    Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor],
    Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, str]
]:
    """Preprocess audio on disk

    Arguments
        file: Audio file to preprocess
        gpu: The GPU index
        features: The features to preprocess.
            Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].

    Returns
        loudness: The loudness contour
        pitch: The pitch contour
        periodicity: The periodicity contour
        ppg: The phonetic posteriorgram
        text: The text transcript
    """

promonet.preprocess.from_file_to_file

def from_file_to_file(
    file: Union[str, bytes, os.PathLike],
    output_prefix: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None,
    features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> None:
    """Preprocess audio on disk and save

    Arguments
        file: Audio file to preprocess
        output_prefix: File to save features, minus extension
        gpu: The GPU index
        features: The features to preprocess.
            Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].
    """

promonet.preprocess.from_files_to_files

def from_files_to_files(
    files: List[Union[str, bytes, os.PathLike]],
    output_prefixes: Optional[List[Union[str, os.PathLike]]] = None,
    gpu: Optional[int] = None,
    features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> None:
    """Preprocess multiple audio files on disk and save

    Arguments
        files: Audio files to preprocess
        output_prefixes: Files to save features, minus extension
        gpu: The GPU index
        features: The features to preprocess.
            Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].
    """
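
The output_prefixes argument holds one path per input file with the extension removed. A minimal sketch of building such a list is shown below; default_prefixes and the 'features' directory are hypothetical, and the exact file names promonet writes for each prefix are not assumed here.

```python
from pathlib import Path
from typing import List


def default_prefixes(files, output_directory) -> List[Path]:
    """Build one output prefix per audio file: output_directory/<stem>"""
    output_directory = Path(output_directory)
    prefixes = [output_directory / Path(file).stem for file in files]

    # Two inputs with the same stem would write to the same prefix
    if len(set(prefixes)) != len(prefixes):
        raise ValueError('Two input files share the same stem')
    return prefixes


files = ['recordings/a.wav', 'recordings/b.wav']
prefixes = default_prefixes(files, 'features')

# With promonet installed, the lists could then be passed on, e.g.:
# promonet.preprocess.from_files_to_files(files, prefixes, gpu=0)
```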

Editing API

promonet.edit.from_features

def from_features(
    loudness: torch.Tensor,
    pitch: torch.Tensor,
    periodicity: torch.Tensor,
    ppg: torch.Tensor,
    pitch_shift_cents: Optional[float] = None,
    time_stretch_ratio: Optional[float] = None,
    loudness_scale_db: Optional[float] = None
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    """Edit speech representation

    Arguments
        loudness: Loudness contour to edit
        pitch: Pitch contour to edit
        periodicity: Periodicity contour to edit
        ppg: PPG to edit
        pitch_shift_cents: Amount of pitch-shifting in cents
        time_stretch_ratio: Amount of time-stretching. Faster when above one.
        loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)

    Returns
        edited_loudness, edited_pitch, edited_periodicity, edited_ppg
    """

promonet.edit.from_file

def from_file(
    loudness_file: Union[str, bytes, os.PathLike],
    pitch_file: Union[str, bytes, os.PathLike],
    periodicity_file: Union[str, bytes, os.PathLike],
    ppg_file: Union[str, bytes, os.PathLike],
    pitch_shift_cents: Optional[float] = None,
    time_stretch_ratio: Optional[float] = None,
    loudness_scale_db: Optional[float] = None
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    """Edit speech representation on disk

    Arguments
        loudness_file: Loudness file to edit
        pitch_file: Pitch file to edit
        periodicity_file: Periodicity file to edit
        ppg_file: PPG file to edit
        pitch_shift_cents: Amount of pitch-shifting in cents
        time_stretch_ratio: Amount of time-stretching. Faster when above one.
        loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)

    Returns
        edited_loudness, edited_pitch, edited_periodicity, edited_ppg
    """

promonet.edit.from_file_to_file

def from_file_to_file(
    loudness_file: Union[str, bytes, os.PathLike],
    pitch_file: Union[str, bytes, os.PathLike],
    periodicity_file: Union[str, bytes, os.PathLike],
    ppg_file: Union[str, bytes, os.PathLike],
    output_prefix: Union[str, bytes, os.PathLike],
    pitch_shift_cents: Optional[float] = None,
    time_stretch_ratio: Optional[float] = None,
    loudness_scale_db: Optional[float] = None
) -> None:
    """Edit speech representation on disk and save to disk

    Arguments
        loudness_file: Loudness file to edit
        pitch_file: Pitch file to edit
        periodicity_file: Periodicity file to edit
        ppg_file: PPG file to edit
        output_prefix: File to save output, minus extension
        pitch_shift_cents: Amount of pitch-shifting in cents
        time_stretch_ratio: Amount of time-stretching. Faster when above one.
        loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)
    """

promonet.edit.from_files_to_files

def from_files_to_files(
    loudness_files: List[Union[str, bytes, os.PathLike]],
    pitch_files: List[Union[str, bytes, os.PathLike]],
    periodicity_files: List[Union[str, bytes, os.PathLike]],
    ppg_files: List[Union[str, bytes, os.PathLike]],
    output_prefixes: List[Union[str, bytes, os.PathLike]],
    pitch_shift_cents: Optional[float] = None,
    time_stretch_ratio: Optional[float] = None,
    loudness_scale_db: Optional[float] = None
) -> None:
    """Edit speech representations on disk and save to disk

    Arguments
        loudness_files: Loudness files to edit
        pitch_files: Pitch files to edit
        periodicity_files: Periodicity files to edit
        ppg_files: Phonetic posteriorgram files to edit
        output_prefixes: Files to save output, minus extension
        pitch_shift_cents: Amount of pitch-shifting in cents
        time_stretch_ratio: Amount of time-stretching. Faster when above one.
        loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)
    """
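
To make the editing parameters concrete, the sketch below works out what a given pitch_shift_cents and time_stretch_ratio would be expected to do to a pitch value and an utterance duration. It assumes the usual definition of cents and reads "faster when above one" as dividing the duration by the ratio; the helper names are hypothetical and the code is not taken from promonet.

```python
def shifted_pitch_hz(pitch_hz: float, pitch_shift_cents: float) -> float:
    """Pitch after shifting by a number of cents (1200 cents per octave)"""
    return pitch_hz * 2.0 ** (pitch_shift_cents / 1200.0)


def stretched_duration(seconds: float, time_stretch_ratio: float) -> float:
    """Duration after time-stretching; ratios above one speed speech up"""
    if time_stretch_ratio <= 0:
        raise ValueError('time_stretch_ratio must be positive')
    return seconds / time_stretch_ratio


print(shifted_pitch_hz(220.0, 1200.0))   # 440.0 (one octave up)
print(shifted_pitch_hz(220.0, -1200.0))  # 110.0 (one octave down)
print(stretched_duration(3.0, 2.0))      # 1.5 (twice as fast)
```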

Synthesis API

promonet.synthesize.from_features

def from_features(
    loudness: torch.Tensor,
    pitch: torch.Tensor,
    periodicity: torch.Tensor,
    ppg: torch.Tensor,
    speaker: Union[int, torch.Tensor] = 0,
    spectral_balance_ratio: float = 1.,
    checkpoint: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None) -> torch.Tensor:
    """Perform speech synthesis

    Args:
        loudness: The loudness contour
        pitch: The pitch contour
        periodicity: The periodicity contour
        ppg: The phonetic posteriorgram
        speaker: The speaker index
        spectral_balance_ratio: > 1 for Alvin and the Chipmunks; < 1 for Patrick Star
        checkpoint: The generator checkpoint
        gpu: The GPU index

    Returns
        generated: The generated speech
    """

promonet.synthesize.from_file

def from_file(
    loudness_file: Union[str, os.PathLike],
    pitch_file: Union[str, os.PathLike],
    periodicity_file: Union[str, os.PathLike],
    ppg_file: Union[str, os.PathLike],
    speaker: Union[int, torch.Tensor] = 0,
    checkpoint: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> torch.Tensor:
    """Perform speech synthesis from features on disk

    Args:
        loudness_file: The loudness file
        pitch_file: The pitch file
        periodicity_file: The periodicity file
        ppg_file: The phonetic posteriorgram file
        speaker: The speaker index
        checkpoint: The generator checkpoint
        gpu: The GPU index

    Returns
        generated: The generated speech
    """

promonet.synthesize.from_file_to_file

def from_file_to_file(
    loudness_file: Union[str, os.PathLike],
    pitch_file: Union[str, os.PathLike],
    periodicity_file: Union[str, os.PathLike],
    ppg_file: Union[str, os.PathLike],
    output_file: Union[str, os.PathLike],
    speaker: Union[int, torch.Tensor] = 0,
    checkpoint: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> None:
    """Perform speech synthesis from features on disk and save

    Args:
        loudness_file: The loudness file
        pitch_file: The pitch file
        periodicity_file: The periodicity file
        ppg_file: The phonetic posteriorgram file
        output_file: The file to save generated speech audio
        speaker: The speaker index
        checkpoint: The generator checkpoint
        gpu: The GPU index
    """

promonet.synthesize.from_files_to_files

def from_files_to_files(
    loudness_files: List[Union[str, os.PathLike]],
    pitch_files: List[Union[str, os.PathLike]],
    periodicity_files: List[Union[str, os.PathLike]],
    ppg_files: List[Union[str, os.PathLike]],
    output_files: List[Union[str, os.PathLike]],
    speakers: Optional[Union[List[int], torch.Tensor]] = None,
    checkpoint: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> None:
    """Perform batched speech synthesis from features on disk and save

    Args:
        loudness_files: The loudness files
        pitch_files: The pitch files
        periodicity_files: The periodicity files
        ppg_files: The phonetic posteriorgram files
        output_files: The files to save generated speech audio
        speakers: The speaker indices
        checkpoint: The generator checkpoint
        gpu: The GPU index
    """
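
Batched synthesis takes several parallel lists of files. The sketch below checks that the lists have matching lengths before calling promonet.synthesize.from_files_to_files, on the assumption that the i-th entries of each list describe the same utterance. check_parallel and synthesize_batch are hypothetical helpers.

```python
from typing import Sequence


def check_parallel(**file_lists: Sequence) -> int:
    """Verify that all named lists have the same length and return it"""
    lengths = {name: len(files) for name, files in file_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Mismatched list lengths: {lengths}')
    return next(iter(lengths.values()), 0)


def synthesize_batch(
    loudness_files,
    pitch_files,
    periodicity_files,
    ppg_files,
    output_files,
    checkpoint=None,
    gpu=None
) -> None:
    """Validate the inputs, then synthesize and save each utterance"""
    check_parallel(
        loudness=loudness_files,
        pitch=pitch_files,
        periodicity=periodicity_files,
        ppg=ppg_files,
        output=output_files)

    # Imported lazily so check_parallel works without promonet installed
    import promonet
    promonet.synthesize.from_files_to_files(
        loudness_files,
        pitch_files,
        periodicity_files,
        ppg_files,
        output_files,
        checkpoint=checkpoint,
        gpu=gpu)
```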

Command-line interface (CLI)

Adaptation CLI

promonet.adapt

python -m promonet.adapt \
    --name NAME \
    --files FILES [FILES ...] \
    [--checkpoint CHECKPOINT] \
    [--gpu GPU]

Perform speaker adaptation

optional arguments:
  -h, --help
    show this help message and exit
  --name NAME
    The name of the speaker
  --files FILES [FILES ...]
    The audio files to use for adaptation
  --checkpoint CHECKPOINT
    The model checkpoint directory
  --gpu GPU
    The gpu to run adaptation on
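
When adaptation is launched from another Python script, the command line above can be assembled programmatically. The following is a sketch using only the flags listed in the usage text; adapt_command and run are hypothetical helpers, and the file names are placeholders.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def adapt_command(
    name: str,
    files: Sequence,
    checkpoint: Optional[str] = None,
    gpu: Optional[int] = None
) -> List[str]:
    """Build the argument list for python -m promonet.adapt"""
    command = [
        sys.executable, '-m', 'promonet.adapt',
        '--name', name,
        '--files', *map(str, files)]

    # Optional flags are only added when a value is given
    if checkpoint is not None:
        command += ['--checkpoint', str(checkpoint)]
    if gpu is not None:
        command += ['--gpu', str(gpu)]
    return command


def run(command: List[str]) -> None:
    """Launch the command and raise if it exits with an error"""
    subprocess.run(command, check=True)


command = adapt_command('max', ['a.wav', 'b.wav'], gpu=0)
print(' '.join(command[1:]))
# -m promonet.adapt --name max --files a.wav b.wav --gpu 0
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.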

Preprocessing CLI

promonet.preprocess

python -m promonet.preprocess \
    [-h] \
    --files FILES [FILES ...] \
    [--output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...]] \
    [--features {loudness,pitch,periodicity,ppg} [{loudness,pitch,periodicity,ppg} ...]] \
    [--gpu GPU]

Preprocess

arguments:
  --files FILES [FILES ...]
    Audio files to preprocess

optional arguments:
  -h, --help
    show this help message and exit
  --output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...]
    Files to save features, minus extension
  --features {loudness,pitch,periodicity,ppg} [{loudness,pitch,periodicity,ppg} ...]
    The features to preprocess
  --gpu GPU
    The index of the gpu to use

Editing CLI

promonet.edit

python -m promonet.edit \
    [-h] \
    --loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...] \
    --pitch_files PITCH_FILES [PITCH_FILES ...] \
    --periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...] \
    --ppg_files PPG_FILES [PPG_FILES ...] \
    --output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...] \
    [--pitch_shift_cents PITCH_SHIFT_CENTS] \
    [--time_stretch_ratio TIME_STRETCH_RATIO] \
    [--loudness_scale_db LOUDNESS_SCALE_DB]

Edit speech representation

arguments:
  --loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...]
    The loudness files to edit
  --pitch_files PITCH_FILES [PITCH_FILES ...]
    The pitch files to edit
  --periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...]
    The periodicity files to edit
  --ppg_files PPG_FILES [PPG_FILES ...]
    The ppg files to edit
  --output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...]
    The locations to save output files, minus extension

optional arguments:
  -h, --help
    show this help message and exit
  --pitch_shift_cents PITCH_SHIFT_CENTS
    Amount of pitch-shifting in cents
  --time_stretch_ratio TIME_STRETCH_RATIO
    Amount of time-stretching. Faster when above one.
  --loudness_scale_db LOUDNESS_SCALE_DB
    Loudness ratio editing in dB (not recommended; use loudness)

Synthesis CLI

promonet.synthesize

python -m promonet.synthesize \
    --loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...] \
    --pitch_files PITCH_FILES [PITCH_FILES ...] \
    --periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...] \
    --ppg_files PPG_FILES [PPG_FILES ...] \
    --output_files OUTPUT_FILES [OUTPUT_FILES ...] \
    [--speakers SPEAKERS [SPEAKERS ...]] \
    [--checkpoint CHECKPOINT] \
    [--gpu GPU]

Synthesize speech from features

arguments:
  --loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...]
    The loudness files
  --pitch_files PITCH_FILES [PITCH_FILES ...]
    The pitch files
  --periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...]
    The periodicity files
  --ppg_files PPG_FILES [PPG_FILES ...]
    The phonetic posteriorgram files
  --output_files OUTPUT_FILES [OUTPUT_FILES ...]
    The files to save the edited audio

optional arguments:
  -h, --help
    show this help message and exit
  --speakers SPEAKERS [SPEAKERS ...]
    The IDs of the speakers for voice conversion
  --checkpoint CHECKPOINT
    The generator checkpoint
  --gpu GPU
    The GPU index

Training

Download

Downloads, unzips, and formats datasets. Stores datasets in data/datasets/. Stores formatted datasets in data/cache/.

python -m promonet.data.download --datasets <datasets>

Preprocess

Prepares features for training. Features are stored in data/cache/.

python -m promonet.data.preprocess \
    --datasets <datasets> \
    --features <features> \
    --gpu <gpu>

Partition

Partitions a dataset. You should not need to run this, as the partitions used in our work are provided for each dataset in promonet/assets/partitions/.

python -m promonet.partition --datasets <datasets>

Train

Trains a model. Checkpoints and logs are stored in runs/.

python -m promonet.train \
    --config <config> \
    --dataset <dataset> \
    --gpu <gpu>

If the config file has been run before, training automatically resumes from the most recent checkpoint.

Monitor

You can monitor training via tensorboard.

tensorboard --logdir runs/ --port <port> --load_fast true

To use the torchutil notification system to receive notifications for long jobs (download, preprocess, train, and evaluate), set the PYTORCH_NOTIFICATION_URL environment variable to a supported webhook as explained in the Apprise documentation.

Evaluate

Performs objective evaluation and generates examples for subjective evaluation. Also performs benchmarking of generation speed. Results are stored in eval/.

python -m promonet.evaluate \
    --config <name> \
    --datasets <datasets> \
    --gpu <gpu>
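
The training steps above (download, preprocess, train, evaluate) can be chained in a small driver script. The sketch below builds the four commands in order using only the flags shown in this section. pipeline_commands and run_pipeline are hypothetical helpers; the dataset name 'vctk', the config path, passing several features as separate arguments, and reusing the same config value for evaluation are all assumptions for illustration.

```python
import subprocess
import sys
from typing import List


def pipeline_commands(
    dataset: str,
    config: str,
    features: List[str],
    gpu: int
) -> List[List[str]]:
    """Build the download, preprocess, train, and evaluate commands"""
    module = [sys.executable, '-m']
    gpu_args = ['--gpu', str(gpu)]
    return [
        module + ['promonet.data.download', '--datasets', dataset],
        module + [
            'promonet.data.preprocess',
            '--datasets', dataset,
            '--features', *features] + gpu_args,
        module + [
            'promonet.train',
            '--config', config,
            '--dataset', dataset] + gpu_args,
        module + [
            'promonet.evaluate',
            '--config', config,
            '--datasets', dataset] + gpu_args]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each stage in order, stopping at the first failure"""
    for command in commands:
        subprocess.run(command, check=True)


# Placeholder values; replace with your own dataset and config
commands = pipeline_commands(
    'vctk',
    'config/example.py',
    ['loudness', 'pitch', 'periodicity', 'ppg'],
    gpu=0)
```

Because training resumes from the most recent checkpoint, rerunning the script after an interruption should continue the training stage rather than restart it.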

Citation

IEEE

M. Morrison, C. Churchwell, N. Pruyne, and B. Pardo, "Fine-Grained and Interpretable Neural Speech Editing," Interspeech, September 2024.

BibTeX

@inproceedings{morrison2024adaptive,
    title={Fine-Grained and Interpretable Neural Speech Editing},
    author={Morrison, Max and Churchwell, Cameron and Pruyne, Nathan and Pardo, Bryan},
    booktitle={Interspeech},
    month={September},
    year={2024}
}