A versatile speech editor for changing the prosody (pitch, duration, and loudness), pronunciation, or speaker identity (i.e., voice conversion) of recorded speech. Official code for the paper Fine-Grained and Interpretable Neural Speech Editing.
- Installation
- Usage
- Application programming interface (API)
- Command-line interface (CLI)
- Training
- Citation
pip install promonet
We are working on adding torbi, our fast Viterbi decoding implementation, to PyTorch. This Viterbi decoding implementation is used to significantly speed up pitch estimation. Until then, you must manually download and install torbi as well as the development (dev) branch of the pitch estimator penn. You can track the progress of the incorporation of Viterbi decoding into PyTorch here.
# Install torbi
git clone git@github.com:maxrmorrison/torbi
pip install torbi/
# Install the development branch of penn
git clone -b dev git@github.com:interactiveaudiolab/penn
pip install penn/
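Because torbi and penn are installed by hand, it can help to confirm that all three packages resolve before running anything. The following is a minimal sketch using only the standard library; `missing_packages` is an illustrative helper and not part of promonet. It only checks that each package can be found, not that penn was installed from the dev branch.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import"""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    missing = missing_packages(['promonet', 'torbi', 'penn'])
    if missing:
        print('Not installed: ' + ', '.join(missing))
    else:
        print('promonet, torbi, and penn are all importable')
```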
Our included model checkpoint allows speech editing and synthesis for VCTK speakers.
To use promonet with other speakers, you must first perform speaker adaptation on a dataset of recordings of the target speaker. You can then use the resulting model checkpoint to perform speech editing in the target speaker's voice. All of this can be done using either the API or CLI.
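import math
import promonet
# Speaker adaptation
# Speaker's name
name = 'max'
# Audio files for adaptation
files = [...]
# GPU index to perform adaptation and editing on
gpu = 0
# Perform speaker adaptation
checkpoint = promonet.adapt.speaker(name, files, gpu=gpu)
# Speech editing
# Load speech to edit
audio = promonet.load.audio('test.wav')
# Get features to edit
loudness, pitch, periodicity, ppg = promonet.preprocess.from_audio(
    audio,
    gpu=gpu)
# We'll use a ratio of 2.0 for all editing examples
ratio = 2.0
# Perform pitch-shifting (a frequency ratio of 2.0 is 1200 cents)
shifted = promonet.synthesize.from_features(
    *promonet.edit.from_features(
        loudness, pitch, periodicity, ppg,
        pitch_shift_cents=1200 * math.log2(ratio)),
    checkpoint=checkpoint,
    gpu=gpu)
# Perform time-stretching
stretched = promonet.synthesize.from_features(
    *promonet.edit.from_features(
        loudness, pitch, periodicity, ppg,
        time_stretch_ratio=ratio),
    checkpoint=checkpoint,
    gpu=gpu)
# Perform loudness editing (assumes an amplitude ratio, about 6 dB here)
scaled = promonet.synthesize.from_features(
    *promonet.edit.from_features(
        loudness, pitch, periodicity, ppg,
        loudness_scale_db=20 * math.log10(ratio)),
    checkpoint=checkpoint,
    gpu=gpu)
# Edit spectral balance (> 1 for Alvin and the Chipmunks; < 1 for Patrick Star)
alvin = promonet.synthesize.from_features(
    loudness, pitch, periodicity, ppg,
    spectral_balance_ratio=ratio,
    checkpoint=checkpoint,
    gpu=gpu)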
See the ppgs.edit submodule documentation for the pronunciation (PPG) editing API.
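The editing example uses one ratio for every edit, while the editing API takes pitch shifts in cents and loudness changes in dB. The sketch below shows the standard conversions. These are standalone stand-ins written for illustration; promonet may ship its own conversion utilities. The dB conversion assumes the ratio is an amplitude ratio (a power ratio would use a factor of 10 instead of 20).

```python
import math


def ratio_to_cents(ratio):
    """Convert a frequency ratio to cents (1200 cents per octave)"""
    return 1200.0 * math.log2(ratio)


def ratio_to_db(ratio):
    """Convert an amplitude ratio to decibels (assumption: amplitude, not power)"""
    return 20.0 * math.log10(ratio)


ratio = 2.0
print(ratio_to_cents(ratio))         # 1200.0
print(round(ratio_to_db(ratio), 2))  # 6.02
```

A ratio of 1.0 maps to 0 cents and 0 dB, so it leaves pitch and loudness unchanged.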
promonet.adapt.speaker
def speaker(
    name: str,
    files: List[Path],
    checkpoint: Path = None,
    gpu: Optional[int] = None
) -> Path:
    """Perform speaker adaptation

    Args:
        name: The name of the speaker
        files: The audio files to use for adaptation
        checkpoint: The model checkpoint directory
        gpu: The gpu to run adaptation on

    Returns:
        checkpoint: The file containing the trained generator checkpoint
    """
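Adaptation takes an explicit list of audio files, so a common first step is gathering them from a directory. The sketch below is one way to do that. `collect_audio_files` and `adapt_from_directory` are hypothetical helpers, and the default of matching only `.wav` files is an assumption, since the formats promonet accepts are not listed here.

```python
from pathlib import Path


def collect_audio_files(directory, extensions=('.wav',)):
    """Recursively list audio files in a directory, sorted for reproducibility"""
    directory = Path(directory)
    return sorted(
        path for path in directory.rglob('*')
        if path.is_file() and path.suffix.lower() in extensions)


def adapt_from_directory(name, directory, gpu=None):
    """Run speaker adaptation on every audio file found in a directory"""
    import promonet  # imported lazily so the helper above works without it

    files = collect_audio_files(directory)
    if not files:
        raise ValueError(f'No audio files found in {directory}')
    return promonet.adapt.speaker(name, files, gpu=gpu)
```

For example, `adapt_from_directory('max', 'recordings/max', gpu=0)` would return the adapted checkpoint path, where `recordings/max` is a placeholder directory.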
promonet.preprocess.from_audio
def from_audio(
    audio: torch.Tensor,
    sample_rate: int = promonet.SAMPLE_RATE,
    gpu: Optional[int] = None,
    features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> Union[
    Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor],
    Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, str]
]:
    """Preprocess audio

    Args:
        audio: Audio to preprocess
        sample_rate: Audio sample rate
        gpu: The GPU index
        features: The features to preprocess.
            Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].

    Returns:
        loudness: The loudness contour
        pitch: The pitch contour
        periodicity: The periodicity contour
        ppg: The phonetic posteriorgram
        text: The text transcript
    """
promonet.preprocess.from_file
def from_file(
    file: Union[str, bytes, os.PathLike],
    gpu: Optional[int] = None,
    features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> Union[
    Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor],
    Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, str]
]:
    """Preprocess audio on disk

    Args:
        file: Audio file to preprocess
        gpu: The GPU index
        features: The features to preprocess.
            Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].

    Returns:
        loudness: The loudness contour
        pitch: The pitch contour
        periodicity: The periodicity contour
        ppg: The phonetic posteriorgram
        text: The text transcript
    """
promonet.preprocess.from_file_to_file
def from_file_to_file(
    file: Union[str, bytes, os.PathLike],
    output_prefix: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None,
    features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> None:
    """Preprocess audio on disk and save

    Args:
        file: Audio file to preprocess
        output_prefix: File to save features, minus extension
        gpu: The GPU index
        features: The features to preprocess.
            Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].
    """
promonet.preprocess.from_files_to_files
def from_files_to_files(
    files: List[Union[str, bytes, os.PathLike]],
    output_prefixes: Optional[List[Union[str, os.PathLike]]] = None,
    gpu: Optional[int] = None,
    features: list = ['loudness', 'pitch', 'periodicity', 'ppg']
) -> None:
    """Preprocess multiple audio files on disk and save

    Args:
        files: Audio files to preprocess
        output_prefixes: Files to save features, minus extension
        gpu: The GPU index
        features: The features to preprocess.
            Options: ['loudness', 'pitch', 'periodicity', 'ppg', 'text'].
    """
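The file-based preprocessing functions take one output prefix per input file and a list of feature names. The sketch below validates the feature names against the options listed in the docstrings and derives prefixes from the input file stems. `plan_preprocessing` is a hypothetical helper and the output layout is an assumption for illustration; files from different directories that share a stem would collide.

```python
from pathlib import Path

# Feature names listed as options in the docstrings above
FEATURES = ('loudness', 'pitch', 'periodicity', 'ppg', 'text')


def plan_preprocessing(files, output_directory, features=FEATURES[:4]):
    """Validate feature names and derive one output prefix per input file"""
    unknown = [feature for feature in features if feature not in FEATURES]
    if unknown:
        raise ValueError(f'Unknown features: {unknown}')
    output_directory = Path(output_directory)
    # An output prefix is a file path minus its extension
    prefixes = [output_directory / Path(file).stem for file in files]
    return prefixes, list(features)
```

The returned values could then be passed along as `promonet.preprocess.from_files_to_files(files, prefixes, gpu=0, features=features)`.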
promonet.edit.from_features
def from_features(
    loudness: torch.Tensor,
    pitch: torch.Tensor,
    periodicity: torch.Tensor,
    ppg: torch.Tensor,
    pitch_shift_cents: Optional[float] = None,
    time_stretch_ratio: Optional[float] = None,
    loudness_scale_db: Optional[float] = None
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    """Edit speech representation

    Args:
        loudness: Loudness contour to edit
        pitch: Pitch contour to edit
        periodicity: Periodicity contour to edit
        ppg: PPG to edit
        pitch_shift_cents: Amount of pitch-shifting in cents
        time_stretch_ratio: Amount of time-stretching. Faster when above one.
        loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)

    Returns:
        edited_loudness, edited_pitch, edited_periodicity, edited_ppg
    """
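To make the editing parameters concrete, the sketch below applies a pitch shift in cents and a time-stretch ratio to toy contours stored as plain Python lists. This is not promonet's implementation: the function names are illustrative, linear interpolation is an assumption, and how the library resamples each feature (PPGs in particular) is not specified here. Contours are assumed to be non-empty.

```python
def shift_pitch(pitch_hz, cents):
    """Scale each pitch value (in Hz) by 2 ** (cents / 1200)"""
    factor = 2.0 ** (cents / 1200.0)
    return [value * factor for value in pitch_hz]


def stretch(contour, ratio):
    """Linearly resample a contour to round(len / ratio) frames"""
    source_length = len(contour)
    # A ratio above one yields fewer frames, so speech plays back faster
    target_length = max(1, round(source_length / ratio))
    if target_length == 1 or source_length == 1:
        return [contour[0]] * target_length
    step = (source_length - 1) / (target_length - 1)
    output = []
    for index in range(target_length):
        position = index * step
        lower = min(int(position), source_length - 2)
        weight = position - lower
        output.append(
            (1.0 - weight) * contour[lower] + weight * contour[lower + 1])
    return output


print(shift_pitch([220.0], 1200.0))                      # [440.0]
print(stretch([0.0, 10.0, 20.0, 30.0, 40.0, 50.0], 2.0))  # [0.0, 25.0, 50.0]
```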
promonet.edit.from_file
def from_file(
    loudness_file: Union[str, bytes, os.PathLike],
    pitch_file: Union[str, bytes, os.PathLike],
    periodicity_file: Union[str, bytes, os.PathLike],
    ppg_file: Union[str, bytes, os.PathLike],
    pitch_shift_cents: Optional[float] = None,
    time_stretch_ratio: Optional[float] = None,
    loudness_scale_db: Optional[float] = None
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    """Edit speech representation on disk

    Args:
        loudness_file: Loudness file to edit
        pitch_file: Pitch file to edit
        periodicity_file: Periodicity file to edit
        ppg_file: PPG file to edit
        pitch_shift_cents: Amount of pitch-shifting in cents
        time_stretch_ratio: Amount of time-stretching. Faster when above one.
        loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)

    Returns:
        edited_loudness, edited_pitch, edited_periodicity, edited_ppg
    """
promonet.edit.from_file_to_file
def from_file_to_file(
    loudness_file: Union[str, bytes, os.PathLike],
    pitch_file: Union[str, bytes, os.PathLike],
    periodicity_file: Union[str, bytes, os.PathLike],
    ppg_file: Union[str, bytes, os.PathLike],
    output_prefix: Union[str, bytes, os.PathLike],
    pitch_shift_cents: Optional[float] = None,
    time_stretch_ratio: Optional[float] = None,
    loudness_scale_db: Optional[float] = None
) -> None:
    """Edit speech representation on disk and save to disk

    Args:
        loudness_file: Loudness file to edit
        pitch_file: Pitch file to edit
        periodicity_file: Periodicity file to edit
        ppg_file: PPG file to edit
        output_prefix: File to save output, minus extension
        pitch_shift_cents: Amount of pitch-shifting in cents
        time_stretch_ratio: Amount of time-stretching. Faster when above one.
        loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)
    """
promonet.edit.from_files_to_files
def from_files_to_files(
    loudness_files: List[Union[str, bytes, os.PathLike]],
    pitch_files: List[Union[str, bytes, os.PathLike]],
    periodicity_files: List[Union[str, bytes, os.PathLike]],
    ppg_files: List[Union[str, bytes, os.PathLike]],
    output_prefixes: List[Union[str, bytes, os.PathLike]],
    pitch_shift_cents: Optional[float] = None,
    time_stretch_ratio: Optional[float] = None,
    loudness_scale_db: Optional[float] = None
) -> None:
    """Edit speech representations on disk and save to disk

    Args:
        loudness_files: Loudness files to edit
        pitch_files: Pitch files to edit
        periodicity_files: Periodicity files to edit
        ppg_files: Phonetic posteriorgram files to edit
        output_prefixes: Files to save output, minus extension
        pitch_shift_cents: Amount of pitch-shifting in cents
        time_stretch_ratio: Amount of time-stretching. Faster when above one.
        loudness_scale_db: Loudness ratio editing in dB (not recommended; use loudness)
    """
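The batched editing function takes five parallel lists, where entry i of each list describes the same utterance. A small guard like the one below can catch a misaligned list before any work starts. `check_parallel` is a hypothetical helper; whether promonet performs a similar check internally is not stated here.

```python
def check_parallel(**file_lists):
    """Raise if the named lists do not all have the same length"""
    lengths = {name: len(files) for name, files in file_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Mismatched list lengths: {lengths}')
    # Return the shared length (0 when no lists were given)
    return next(iter(lengths.values()), 0)
```

It could be called as `check_parallel(loudness_files=..., pitch_files=..., periodicity_files=..., ppg_files=..., output_prefixes=...)` with the same lists that are then passed to `promonet.edit.from_files_to_files`.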
promonet.synthesize.from_features
def from_features(
    loudness: torch.Tensor,
    pitch: torch.Tensor,
    periodicity: torch.Tensor,
    ppg: torch.Tensor,
    speaker: Union[int, torch.Tensor] = 0,
    spectral_balance_ratio: float = 1.,
    checkpoint: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> torch.Tensor:
    """Perform speech synthesis

    Args:
        loudness: The loudness contour
        pitch: The pitch contour
        periodicity: The periodicity contour
        ppg: The phonetic posteriorgram
        speaker: The speaker index
        spectral_balance_ratio: > 1 for Alvin and the Chipmunks; < 1 for Patrick Star
        checkpoint: The generator checkpoint
        gpu: The GPU index

    Returns:
        generated: The generated speech
    """
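Since spectral balance is a synthesis-time parameter rather than an edit to the features, one set of features can be synthesized at several ratios for comparison. The sketch below takes the synthesis function as an argument so it can be exercised without a model; `sweep_spectral_balance` is a hypothetical helper, not part of promonet.

```python
def sweep_spectral_balance(synthesize, features, ratios, **kwargs):
    """Synthesize the same features once per spectral balance ratio"""
    loudness, pitch, periodicity, ppg = features
    return {
        ratio: synthesize(
            loudness, pitch, periodicity, ppg,
            spectral_balance_ratio=ratio,
            **kwargs)
        for ratio in ratios}
```

With promonet installed, this might be called as `sweep_spectral_balance(promonet.synthesize.from_features, (loudness, pitch, periodicity, ppg), [0.5, 1.0, 2.0], checkpoint=checkpoint, gpu=0)`.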
promonet.synthesize.from_file
def from_file(
    loudness_file: Union[str, os.PathLike],
    pitch_file: Union[str, os.PathLike],
    periodicity_file: Union[str, os.PathLike],
    ppg_file: Union[str, os.PathLike],
    speaker: Union[int, torch.Tensor] = 0,
    checkpoint: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> torch.Tensor:
    """Perform speech synthesis from features on disk

    Args:
        loudness_file: The loudness file
        pitch_file: The pitch file
        periodicity_file: The periodicity file
        ppg_file: The phonetic posteriorgram file
        speaker: The speaker index
        checkpoint: The generator checkpoint
        gpu: The GPU index

    Returns:
        generated: The generated speech
    """
promonet.synthesize.from_file_to_file
def from_file_to_file(
    loudness_file: Union[str, os.PathLike],
    pitch_file: Union[str, os.PathLike],
    periodicity_file: Union[str, os.PathLike],
    ppg_file: Union[str, os.PathLike],
    output_file: Union[str, os.PathLike],
    speaker: Union[int, torch.Tensor] = 0,
    checkpoint: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> None:
    """Perform speech synthesis from features on disk and save

    Args:
        loudness_file: The loudness file
        pitch_file: The pitch file
        periodicity_file: The periodicity file
        ppg_file: The phonetic posteriorgram file
        output_file: The file to save generated speech audio
        speaker: The speaker index
        checkpoint: The generator checkpoint
        gpu: The GPU index
    """
promonet.synthesize.from_files_to_files
def from_files_to_files(
    loudness_files: List[Union[str, os.PathLike]],
    pitch_files: List[Union[str, os.PathLike]],
    periodicity_files: List[Union[str, os.PathLike]],
    ppg_files: List[Union[str, os.PathLike]],
    output_files: List[Union[str, os.PathLike]],
    speakers: Optional[Union[List[int], torch.Tensor]] = None,
    checkpoint: Optional[Union[str, os.PathLike]] = None,
    gpu: Optional[int] = None
) -> None:
    """Perform batched speech synthesis from features on disk and save

    Args:
        loudness_files: The loudness files
        pitch_files: The pitch files
        periodicity_files: The periodicity files
        ppg_files: The phonetic posteriorgram files
        output_files: The files to save generated speech audio
        speakers: The speaker indices
        checkpoint: The generator checkpoint
        gpu: The GPU index
    """
python -m promonet.adapt \
    --name NAME \
    --files FILES [FILES ...] \
    [--checkpoint CHECKPOINT] \
    [--gpu GPU]

Perform speaker adaptation

optional arguments:
    -h, --help
        show this help message and exit
    --name NAME
        The name of the speaker
    --files FILES [FILES ...]
        The audio files to use for adaptation
    --checkpoint CHECKPOINT
        The model checkpoint directory
    --gpu GPU
        The gpu to run adaptation on
python -m promonet.preprocess \
    [-h] \
    --files FILES [FILES ...] \
    [--output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...]] \
    [--features {loudness,pitch,periodicity,ppg} [{loudness,pitch,periodicity,ppg} ...]] \
    [--gpu GPU]

    --files FILES [FILES ...]
        Audio files to preprocess

optional arguments:
    -h, --help
        show this help message and exit
    --output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...]
        Files to save features, minus extension
    --features {loudness,pitch,periodicity,ppg} [{loudness,pitch,periodicity,ppg} ...]
        The features to preprocess
    --gpu GPU
        The index of the gpu to use
python -m promonet.edit \
    [-h] \
    --loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...] \
    --pitch_files PITCH_FILES [PITCH_FILES ...] \
    --periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...] \
    --ppg_files PPG_FILES [PPG_FILES ...] \
    --output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...] \
    [--pitch_shift_cents PITCH_SHIFT_CENTS] \
    [--time_stretch_ratio TIME_STRETCH_RATIO] \
    [--loudness_scale_db LOUDNESS_SCALE_DB]

Edit speech representation

    --loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...]
        The loudness files to edit
    --pitch_files PITCH_FILES [PITCH_FILES ...]
        The pitch files to edit
    --periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...]
        The periodicity files to edit
    --ppg_files PPG_FILES [PPG_FILES ...]
        The ppg files to edit
    --output_prefixes OUTPUT_PREFIXES [OUTPUT_PREFIXES ...]
        The locations to save output files, minus extension

optional arguments:
    -h, --help
        show this help message and exit
    --pitch_shift_cents PITCH_SHIFT_CENTS
        Amount of pitch-shifting in cents
    --time_stretch_ratio TIME_STRETCH_RATIO
        Amount of time-stretching. Faster when above one.
    --loudness_scale_db LOUDNESS_SCALE_DB
        Loudness ratio editing in dB (not recommended; use loudness)
python -m promonet.synthesize \
    --loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...] \
    --pitch_files PITCH_FILES [PITCH_FILES ...] \
    --periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...] \
    --ppg_files PPG_FILES [PPG_FILES ...] \
    --output_files OUTPUT_FILES [OUTPUT_FILES ...] \
    [--speakers SPEAKERS [SPEAKERS ...]] \
    [--checkpoint CHECKPOINT] \
    [--gpu GPU]

Synthesize speech from features

    --loudness_files LOUDNESS_FILES [LOUDNESS_FILES ...]
        The loudness files
    --pitch_files PITCH_FILES [PITCH_FILES ...]
        The pitch files
    --periodicity_files PERIODICITY_FILES [PERIODICITY_FILES ...]
        The periodicity files
    --ppg_files PPG_FILES [PPG_FILES ...]
        The phonetic posteriorgram files
    --output_files OUTPUT_FILES [OUTPUT_FILES ...]
        The files to save the edited audio

optional arguments:
    -h, --help
        show this help message and exit
    --speakers SPEAKERS [SPEAKERS ...]
        The IDs of the speakers for voice conversion
    --checkpoint CHECKPOINT
        The generator checkpoint
    --gpu GPU
        The GPU index
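The CLI entry points above share one shape: `python -m <module>` followed by flags that take either a single value or a list of values. When driving them from a script, a small builder such as the sketch below keeps the argument list tidy. `build_command` is a hypothetical helper, and the file names in the usage note are placeholders.

```python
import sys


def build_command(module, **options):
    """Build an argv list for `python -m <module>` from keyword options"""
    command = [sys.executable, '-m', module]
    for flag, value in options.items():
        if value is None:
            # Omit optional flags that were not set
            continue
        command.append(f'--{flag}')
        if isinstance(value, (list, tuple)):
            # List-valued flags take one or more space-separated values
            command.extend(str(item) for item in value)
        else:
            command.append(str(value))
    return command
```

For instance, `build_command('promonet.adapt', name='max', files=['a.wav', 'b.wav'], gpu=0)` produces a list that can be passed to `subprocess.run(command, check=True)`.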
Downloads, unzips, and formats datasets. Stores datasets in data/datasets/. Stores formatted datasets in data/cache/.
python -m promonet.data.download --datasets <datasets>
Prepares features for training. Features are stored in data/cache/
python -m promonet.data.preprocess \
--datasets <datasets> \
--features <features> \
--gpu <gpu>
Partitions a dataset. You should not need to run this, as the partitions
used in our work are provided for each dataset in
python -m promonet.partition --datasets <datasets>
Trains a model. Checkpoints and logs are stored in runs/
python -m promonet.train \
--config <config> \
--dataset <dataset> \
--gpu <gpu>
If the config file has been previously run, the most recent checkpoint will automatically be loaded and training will resume from that checkpoint.
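To illustrate what resuming from the most recent checkpoint can look like, the sketch below picks the newest checkpoint file in a run directory by modification time. This is an assumption-laden illustration, not promonet's resume logic: the `.pt` extension, the directory layout under runs/, and the use of modification time (rather than, say, a step number in the file name) are all hypothetical.

```python
from pathlib import Path


def latest_checkpoint(directory, pattern='*.pt'):
    """Return the most recently modified checkpoint in a directory, or None"""
    checkpoints = list(Path(directory).glob(pattern))
    if not checkpoints:
        return None
    return max(checkpoints, key=lambda path: path.stat().st_mtime)
```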
You can monitor training via TensorBoard.
tensorboard --logdir runs/ --port <port> --load_fast true
To use the torchutil notification system to receive notifications for long jobs (download, preprocess, train, and evaluate), set the environment variable to a supported webhook as explained in the Apprise documentation.
Performs objective evaluation and generates examples for subjective evaluation.
Also performs benchmarking of generation speed. Results are stored in eval/
python -m promonet.evaluate \
--config <name> \
--datasets <datasets> \
--gpu <gpu>
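The training stages above are meant to be run in order: download, preprocess, partition, train, evaluate. The sketch below lists them as argument lists for one dataset so the sequence is explicit. `pipeline_commands` is a hypothetical helper. It assumes `--features` accepts several space-separated names and that the same config value can be given to both training and evaluation (the evaluation usage above writes `--config <name>`).

```python
import sys


def pipeline_commands(dataset, config, gpu, features):
    """List the documented training stages as argv lists, in run order"""
    python = [sys.executable, '-m']
    return [
        python + ['promonet.data.download', '--datasets', dataset],
        python + [
            'promonet.data.preprocess',
            '--datasets', dataset,
            '--features', *features,
            '--gpu', str(gpu)],
        python + ['promonet.partition', '--datasets', dataset],
        python + [
            'promonet.train',
            '--config', config,
            '--dataset', dataset,
            '--gpu', str(gpu)],
        python + [
            'promonet.evaluate',
            '--config', config,
            '--datasets', dataset,
            '--gpu', str(gpu)]]
```

Each entry can be run with `subprocess.run(command, check=True)` so that a failing stage stops the sequence. The partition stage can be dropped when using the provided partitions.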
M. Morrison, C. Churchwell, N. Pruyne, and B. Pardo, "Fine-Grained and Interpretable Neural Speech Editing," Interspeech, September 2024.
title={Fine-Grained and Interpretable Neural Speech Editing},
author={Morrison, Max and Churchwell, Cameron and Pruyne, Nathan and Pardo, Bryan},