The system consists of Text-to-Speech (TTS) systems and a query-by-example spoken term detection (QbE-STD) system. The TTS system takes text inputs and generates synthesized audio samples (referred to as queries), which are searched for in an unlabelled reference corpus. The FastSpeech 2 architecture and a Parallel WaveGAN vocoder are used to train the TTS system. The search for the queries in the reference corpus follows this work. This repo is also forked from here.
The following script installs Docker and docker-compose on a fresh instance of Ubuntu 20.04 LTS, based on DigitalOcean's instructions.
sudo apt update && \
sudo apt-get -y install apt-transport-https ca-certificates curl software-properties-common && \
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - && \
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable" && \
sudo apt update && \
apt-cache policy docker-ce && \
sudo apt-get -y install docker-ce && \
sudo curl -L "https://github.com/docker/compose/releases/download/1.28.4/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose && \
sudo chmod +x /usr/local/bin/docker-compose
If you cannot run docker without sudo and get a permission denied error, please follow the instructions from this link.
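A common fix (an assumption here, not taken from the linked instructions) is to add your user to the `docker` group:

```shell
# Add the current user to the "docker" group so docker commands
# can run without sudo.
sudo usermod -aG docker "${USER}"

# Log out and back in (or run `newgrp docker`) for the group change
# to take effect, then verify that docker works without sudo:
docker run hello-world
```

If the `hello-world` container runs without a permission error, the group change has taken effect.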
If you want to use only a single TTS system integrated with the QbE-STD system, use the following:
git clone https://github.com/samin9796/tts_qbe-std.git
cd tts_qbe-std
If you want to use multiple TTS systems integrated with the QbE-STD system, go to the following directory:
git clone https://github.com/samin9796/tts_qbe-std.git
cd tts_qbe-std/multiple_TTS_qbe-std/
For the multiple-TTS setup, there are three TTS systems trained on three different Gronings variants. For each variant, we get a similarity score for a query and reference pair. We take the average of the similarity scores from the three TTS models and use it as input to our machine learning classifier to decide whether the pair is a match or a non-match.
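The averaging step can be sketched with standard Unix tools. This is only an illustration, not the repository's actual implementation: the filenames and the `query,reference,score` column layout below are invented for the example.

```shell
# Toy per-variant score files, one row per (query, reference) pair,
# with columns: query,reference,score.
printf 'q1,ref1,0.80\n' > variant1.csv
printf 'q1,ref1,0.90\n' > variant2.csv
printf 'q1,ref1,0.70\n' > variant3.csv

# Join the three files column-wise, then average the three score
# columns ($3, $6, $9) for each row.
paste -d, variant1.csv variant2.csv variant3.csv \
  | awk -F, '{ printf "%s,%s,%.4f\n", $1, $2, ($3 + $6 + $9) / 3 }' > averaged.csv

cat averaged.csv   # q1,ref1,0.8000
```

The averaged score per pair is what would then be fed to the match/non-match classifier.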
# Download gos-kdl.zip into qbe-std_feats_eval/tmp directory
wget https://zenodo.org/record/4634878/files/gos-kdl.zip -P tmp/
## Install unzip if necessary
# apt-get install unzip
# Create directory data/raw/datasets/gos-kdl
mkdir -p data/raw/datasets/gos-kdl
# Unzip into directory
unzip tmp/gos-kdl.zip -d data/raw/datasets/gos-kdl
# For extracting wav2vec 2.0 features and running evaluation scripts
docker pull fauxneticien/qbe-std_feats_eval
If Anaconda is not installed on your system, you need to install it first; otherwise, you can skip the Anaconda installation part. You can follow this link or run the commands below to install Anaconda.
cd /tmp && \
curl https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh --output anaconda.sh && \
sha256sum anaconda.sh && \
bash anaconda.sh && \
source ~/.bashrc
Once Anaconda is installed, follow the steps below. The `inference` environment will contain the packages required for TTS inference; the `qbe-std` environment is created here for future use.
# Create two conda environments with the specified names
conda create -n inference python=3.8 anaconda
conda create -n qbe-std python=3.8 anaconda
Activate the `inference` environment, install the required packages, and then deactivate it:
conda activate inference
pip install espnet==0.10.6 pyopenjtalk==0.2 pypinyin==0.44.0 parallel_wavegan==0.5.4 gdown==4.4.0 espnet_model_zoo
conda deactivate
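As an optional sanity check (an assumption added here, not part of the original setup steps), you can confirm that the key packages import correctly inside the `inference` environment before moving on:

```shell
# Activate the environment and try importing the main TTS packages.
conda activate inference
python -c "import espnet, parallel_wavegan; print('TTS inference packages OK')"
conda deactivate
```

If the import fails, re-run the `pip install` command above inside the activated environment.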
sudo apt-get install sox
Steps 1-7 have to be performed only once. After a successful setup, step 8 takes a text query as input and returns the matching audio files as output.
# Pipeline integrating a TTS system with a QbE-STD system
bash pipeline.sh
After running the script, you will first be prompted to type a query. The system will then return the audio files that contain the query, along with a comma-separated values (CSV) file containing the similarity scores between the given query and all of the reference audio files. These outputs can be found in the `Output` directory.
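To quickly inspect the results, you can rank the reference files by similarity score. The sketch below assumes a two-column `file,score` layout; the actual column order of the generated CSV may differ, so check it first.

```shell
# Toy scores file standing in for the CSV produced in the Output
# directory (the real column layout may differ).
printf 'ref_a.wav,0.42\nref_b.wav,0.91\nref_c.wav,0.13\n' > scores.csv

# Sort by the numeric score in column 2, highest first,
# and show the top two matches.
sort -t, -k2 -rn scores.csv | head -n 2
```

This prints `ref_b.wav,0.91` and `ref_a.wav,0.42`, i.e. the two reference files most similar to the query.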
The original documentation from the QbE-STD repository is kept as-is here to provide better insight into this system. However, only the steps described above should be followed to run this system.
In this project we examine different feature extraction methods (Kaldi MFCCs, BUT/Phonexia Bottleneck features, and variants of wav2vec 2.0) for performing QbE-STD with data from language documentation projects.
A walkthrough of the entire experiment pipeline can be found in scripts/README.md. Links to archived experiment artefacts uploaded to Zenodo are provided in the last section of this README file. A description of the analyses based on the data is found in analyses/README.md, which also provides links to the pilot analyses with a multilingual model, system evaluations, and the error analysis (all viewable online as GitHub Markdown).
@misc{san2021leveraging,
title={Leveraging pre-trained representations to improve access to untranscribed speech from endangered languages},
author={San, Nay and Bartelds, Martijn and Browne, Mitchell and Clifford, Lily and Gibson, Fiona and Mansfield, John and Nash, David and Simpson, Jane and Turpin, Myfany and Vollmer, Maria and Wilmoth, Sasha and Jurafsky, Dan},
year={2021},
eprint={2103.14583},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
The directory structure for this project roughly follows the Cookiecutter Data Science guidelines.
├── README.md <- This top-level README
├── docker-compose.yml <- Configurations for launching Docker containers
├── qbe-std_feats_eval.Rproj <- RStudio project file, used to get repository path using R's 'here' package
├── requirements.txt <- Python package requirements
├── tmp/ <- Empty directory to download zip files into, if required
├── data/
│ ├── raw/ <- Immutable data, not modified by scripts
│ │ ├── datasets/ <- Audio data and ground truth labels placed here
│ │ ├── model_checkpoints/ <- wav2vec 2.0 model checkpoint files placed here
│ ├── interim/
│ │ ├── features/ <- features generated by extraction scripts (automatically generated)
│ ├── processed/
│ │ ├── dtw/ <- results returned by DTW search (automatically generated)
│ │ ├── STDEval/ <- evaluation of DTW searches (automatically generated)
├── scripts/
│ ├── README.md <- walkthrough for entire experiment pipeline
│ ├── wav_to_shennong-feats.py <- Extraction script for MFCC and BNF features using the Shennong library
│ ├── wav_to_w2v2-feats.py <- Extraction script for wav2vec 2.0 features
│ ├── feats_to_dtw.py <- QbE-STD DTW search using extracted features
│ ├── prep_STDEval.R <- Helper script to generate files needed for STD evaluation
│ ├── gather_mtwv.R <- Script to gather Maximum Term Weighted Values generated by STDEval
│ ├── STDEval-0.7/ <- NIST STDEval tool
├── analyses/
│ │ ├── data/ <- Final, post-processed data used in analyses
│ │ ├── mtwv.md <- MTWV figures and statistics reported in paper
│ │ ├── error-analysis.md <- Error analyses reported in paper
├── paper/
│ │ ├── ASRU2021.tex <- LaTeX source file of ASRU paper
│ │ ├── ASRU2021.pdf <- Final paper submitted to ASRU2021
With the exception of raw audio and texts from the Australian language documentation projects (for which we do not have permission to release openly) and those from the Mavir corpus (which can be obtained from the original distributor, subject to signing their licence agreement), all other data used in and generated by the experiments are available on Zenodo (see https://zenodo.org/communities/qbe-std_feats_eval). These are:
- Dataset: Gronings https://zenodo.org/record/4634878
- Experiment artefacts:
  - MFCC, BNF, and wav2vec 2.0 LibriSpeech 960h features (limited to 50 GB per archive by Zenodo):
    - Archive I (eng-mav, gbb-lg, wbp-jk, and wrl-mb datasets): https://zenodo.org/record/4635438
    - Archive II (gbb-pd, gos-kdl, gup-wat, mwf-jm, pjt-sw01, and wrm-pd): https://zenodo.org/record/4635493
    - Archive III (w2v2-T11 only; all datasets): https://zenodo.org/record/4638385
  - wav2vec 2.0 XLSR53 features:
    - Archive I (eng-mav, gbb-lg, wbp-jk, and wrl-mb datasets): https://zenodo.org/record/5504371
    - Archive II (gbb-pd, gos-kdl, gup-wat, mwf-jm, pjt-sw01, and wrm-pd datasets): https://zenodo.org/record/5504471
  - DTW search and evaluation data: https://zenodo.org/record/5508217