
🥇 Quality Benchmarks

For your convenience, we provide a set of benchmarks on publicly available datasets. We chose Google's STT as a decent approximation of a high-quality enterprise solution that is commercially available in many languages.

Methodology

Our approach is described in this article.

Caveats

Overall Quality

Unlike many off-the-shelf solutions, our models (especially the Enterprise Edition models) generalize across the following domains:

  • Video;
  • Lectures;
  • Narration;
  • Phone calls;
  • Various noises, codecs, recording methods and conditions;

Any "in-the-wild" speech with sufficient SNR and recording quality should work reasonably fine by design. The main caveat is that our models work poorly with far-field audio and extremely noisy audio.

Though our models work fine with 8 kHz audio (phone calls), for simplicity we always resample to 16 kHz. Robustness is built into the models themselves.
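For reference, a minimal sketch of such a resampling step (assuming torchaudio is installed; the file name and the downstream inference call are hypothetical):

```python
import torchaudio

# load an 8 kHz phone-call recording (hypothetical file name)
waveform, sample_rate = torchaudio.load('call_8khz.wav')

# resample to 16 kHz before passing the audio to the STT model
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# `waveform` is now ready for the usual 16 kHz inference pipeline
```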

Visually Pleasing Transcriptions

Be prepared that the CE model sometimes has a hard time producing visually pleasing transcriptions, even though the results are phonetically similar.

This is usually solved in one of the following ways:

  1. Limiting the model to a very narrow domain (i.e. speech commands);
  2. Adding an external traditional (n-gram) or more modern (DL-based) language model(s) and performing some sort of fusion / re-scoring (see the sketch below);
  3. Using a much larger (and hence slower) model;

Options (1) and (3) contradict our design philosophy and in general limit the real-life applicability of models. We are firm believers that technology should be embarrassingly simple to use (i.e. one line of code). Naturally, we have solved these challenges with the EE edition of our models, but at this stage we are not yet ready to publish embarrassingly simple EE models that fulfill the same criteria (i.e. the whole compute graph triggered by one line of code).
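To illustrate option (2), here is a hedged sketch of shallow-fusion re-scoring of n-best hypotheses with an external n-gram language model. KenLM is used as an example; the model path, the fusion weights and the n-best list are hypothetical, and this is not our EE fusion code:

```python
import kenlm

lm = kenlm.Model('lm.arpa')  # hypothetical path to a trained n-gram LM

def rescore(nbest, alpha=0.5, beta=1.0):
    """Pick the best hypothesis from a list of (text, acoustic_score) pairs
    by fusing the acoustic score with the LM log-probability."""
    def fused(text, am_score):
        # shallow fusion: AM score + weighted LM score + word-count bonus
        # (the bonus offsets the LM's preference for shorter outputs)
        return am_score + alpha * lm.score(text) + beta * len(text.split())
    return max(nbest, key=lambda pair: fused(*pair))

best_text, _ = rescore([('i scream for ice cream', -12.3),
                        ('ice cream for ice cream', -12.1)])
```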

Models

  • Google was used as the main reference in terms of quality;
  • CE = Community Edition;
  • EE = Enterprise Edition;

All of the below metrics are WER (word error rate).
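As a reminder, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. A minimal sketch, not the exact evaluation code used for these benchmarks (real pipelines also normalize the text first):

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# one substitution + one deletion over six reference words -> WER ≈ 0.33
print(wer('the cat sat on the mat', 'the cat sit on mat'))
```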

EN V1

All of these tests were run in early September 2020.

| Dataset | Silero CE | Silero EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks** | | | | |
| en_v001_librispeech_test_clean | 8.6 | 6.9 | 7.8 | 8.7 |
| en_librispeech_val | 14.4 | 11.5 | 11.3 | 13.1 |
| en_librispeech_test_other | 20.6 | 17.1 | 16.2 | 19.1 |
| **Lecture / speech** | | | | |
| en_multi_ted_test_he | 16.6 | 12.0 | 15.3 | 14.1 |
| en_multi_ted_test_common | 21.2 | 17.6 | 16.9 | 16 |
| en_multi_ted_val | 23.5 | 20 | 22.7 | 20.8 |
| **In the wild** | | | | |
| en_common_voice_val | 27.5 | 20.6 | 20.8 | 20.8 |
| en_common_voice_test | 32.6 | 25.5 | 22.2 | 24 |
| **VOIP / calls** | | | | |
| en_voip_test | 9 | 8.6 | 19.7 | 18.3 |
| **British Dialects** | | | | |
| en_uk_dialects_midlands_english_female | 16.7 | 10.8 | 9.6 | 8.4 |
| en_uk_dialects_southern_english_female | 16.7 | 11.4 | 10.8 | 9.3 |
| en_uk_dialects_welsh_english_female | 17.1 | 12.1 | 20.5 | 10.5 |
| en_uk_dialects_southern_english_male | 17.9 | 12.7 | 11.5 | 10.6 |
| en_uk_dialects_welsh_english_male | 18.6 | 13.2 | 12.1 | |
| en_uk_dialects_northern_english_male | 20 | 13.9 | 15.5 | 11.7 |
| en_uk_dialects_scottish_english_male | 21.3 | 15.1 | 10 | 11.3 |
| en_uk_dialects_midlands_english_male | 21.7 | 15.1 | 11.8 | 10.3 |
| en_uk_dialects_northern_english_female | 22 | 15.2 | 15 | 12.7 |
| en_uk_dialects_scottish_english_female | 22.2 | 15.7 | 13.5 | 12.6 |
| en_uk_dialects_irish_english_male | 32.7 | 25.5 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | | |
| en_voices_rm2_clo_none_stu_manifest | 17.3 | 13.7 | 21.5 | 27 |
| en_voices_rm2_far_none_lav_manifest | 31.4 | 26.5 | 27.5 | 42.3 |
| en_voices_rm4_far_none_stu_manifest | 33.5 | 28.7 | 43.2 | 43.2 |
| en_voices_rm3_clo_none_stu_manifest | 34.5 | 29.9 | 28.6 | 40.8 |
| en_voices_rm2_far_musi_stu_manifest | 35.4 | 30.9 | 30.6 | 42.4 |
| en_voices_rm2_far_babb_stu_manifest | 39.3 | 35.0 | 38.5 | 48.2 |
| en_voices_rm3_clo_musi_stu_manifest | 46.9 | 43 | 38.1 | 51.8 |
| en_voices_rm4_ceo_none_lav_manifest | 50.3 | 46.4 | 42.9 | 52.5 |
| en_voices_rm3_far_none_stu_manifest | 78.9 | 78.3 | 68.8 | 81.6 |
| en_nsc_val_manifest_part1 | 31.7 | 24.4 | NA | NA |
| en_nsc_val_manifest_part2 | 67.0 | 60.9 | NA | NA |

EN V2

Google tests were run in early September 2020. EN V2 metrics updated in early November 2020.

| Dataset | Silero CE | Silero EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks** | | | | |
| en_v001_librispeech_test_clean | 8.7 | 6.9 | 7.8 | 8.7 |
| en_librispeech_val | 14.5 | 11.7 | 11.3 | 13.1 |
| en_librispeech_test_other | 20.6 | 17.4 | 16.2 | 19.1 |
| **Lecture / speech** | | | | |
| en_multi_ted_test_he | 15.0 | 11.5 | 15.3 | 14.1 |
| en_multi_ted_test_common | 20.7 | 17.3 | 16.9 | 16 |
| en_multi_ted_val | 22.9 | 19.9 | 22.7 | 20.8 |
| **In the wild** | | | | |
| en_common_voice_val | 27.1 | 20.3 | 20.8 | 20.8 |
| en_common_voice_test | 32.1 | 25.3 | 22.2 | 24 |
| **VOIP / calls** | | | | |
| en_voip_test | 11.4 | 10.8 | 19.7 | 18.3 |
| **British Dialects** | | | | |
| en_uk_dialects_midlands_english_female | 15.7 | 10.4 | 9.6 | 8.4 |
| en_uk_dialects_southern_english_female | 16.6 | 11.6 | 10.8 | 9.3 |
| en_uk_dialects_welsh_english_female | 16.9 | 11.9 | 20.5 | 10.5 |
| en_uk_dialects_southern_english_male | 17.4 | 12.6 | 11.5 | 10.6 |
| en_uk_dialects_welsh_english_male | 17.8 | 13.1 | 12.1 | |
| en_uk_dialects_northern_english_male | 19.7 | 13.7 | 15.5 | 11.7 |
| en_uk_dialects_scottish_english_male | 20.5 | 14.6 | 10 | 11.3 |
| en_uk_dialects_midlands_english_male | 21.4 | 16.1 | 11.8 | 10.3 |
| en_uk_dialects_northern_english_female | 21.3 | 15.5 | 15 | 12.7 |
| en_uk_dialects_scottish_english_female | 21.8 | 15.4 | 13.5 | 12.6 |
| en_uk_dialects_irish_english_male | 32.5 | 25.7 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | | |
| en_voices_rm2_clo_none_stu_manifest | 17.5 | 14.1 | 21.5 | 27 |
| en_voices_rm2_far_none_lav_manifest | 31.6 | 27.0 | 27.5 | 42.3 |
| en_voices_rm4_far_none_stu_manifest | 33.7 | 29.3 | 43.2 | 43.2 |
| en_voices_rm3_clo_none_stu_manifest | 34.7 | 30.4 | 28.6 | 40.8 |
| en_voices_rm2_far_musi_stu_manifest | 35.9 | 31.5 | 30.6 | 42.4 |
| en_voices_rm2_far_babb_stu_manifest | 39.8 | 35.7 | 38.5 | 48.2 |
| en_voices_rm3_clo_musi_stu_manifest | 47.2 | 43.5 | 38.1 | 51.8 |
| en_voices_rm4_ceo_none_lav_manifest | 50.0 | 46.3 | 42.9 | 52.5 |
| en_voices_rm3_far_none_stu_manifest | 78.3 | 78.0 | 68.8 | 81.6 |
| en_nsc_val_manifest_part1 | 18.3 | 13.9 | NA | NA |
| en_nsc_val_manifest_part2 | 31.7 | 28.5 | NA | NA |

DE V1

All of these tests were run in early September 2020.

At the time of this test, there was no premium German model available from Google. There were several models for different regions, but since the differences were minor we chose the default German model.

| Dataset | CE | EE | Google |
|---|---|---|---|
| **AudioBooks** | | | |
| de_caito_manifest_val | 12.5 | 8.7 | 19.5 |
| **Narration** | | | |
| de_voxforge_manifest_val | 3.8 | 2.3 | 5.9 |
| **In the wild** | | | |
| de_common_voice_test_manifest | 28.0 | 17.6 | 16.1 |
| de_common_voice_val_manifest | 24.9 | 15.0 | 14.0 |
| de_telekinect_dev_manifest | 28.1 | 18.6 | 13.5 |
| de_telekinect_test_manifest | 28.3 | 19.4 | 15.7 |

ES V1

All of these tests were run in early September 2020.

For Spanish, we chose the region (US) where a Premium model was available. Judging by the benchmark results, Google relies heavily on the data it sources from Android, most likely due to the large user population and lighter regulation. Note that most "dialect" recordings are quite clean, but pronunciation varies.

| Dataset | CE | EE | Google | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks** | | | | |
| es_caito_val | 7.7 | 5.7 | 20.3 | 22.3 |
| **Narration** | | | | |
| es_voxforge_val | 1.4 | 1.1 | 18.1 | 19.4 |
| **In the wild** | | | | |
| es_common_voice_test | 22.0 | 14.4 | 27.2 | 23.1 |
| es_common_voice_val | 20.1 | 13.0 | 24.5 | 19.6 |
| **Dialects** | | | | |
| es_dialects_argentinian_val | 19.0 | 12.9 | 11.8 | 6.7 |
| es_dialects_chilean_val | 19.8 | 13.7 | 8.9 | 6.6 |
| es_dialects_columbian_val | 18.4 | 11.9 | 7.8 | 5.4 |
| es_dialects_peruvian_val | 14.4 | 9.1 | 6.2 | 4.7 |
| es_dialects_puerto_rico_val | 21.1 | 14.5 | 7.9 | 6.0 |
| es_dialects_venezuela_val | 19.2 | 13.2 | 8.2 | 6.4 |

TTS Models

RU V1

We decided to keep the quality assessment really simple: we generated audio from the validation subsets of our data (~200 files per speaker), shuffled them together with the original recordings of the same speakers, and gave them to a group of 24 assessors to evaluate the sound quality on a five-point scale. For 8 kHz and 16 kHz the scores were collected separately (both for synthesized and original speech). For simplicity we used the following grades: [1, 2, 3, 4-, 4, 4+, 5-, 5], i.e. the scale is more granular at the higher end. Then, for each speaker, we simply calculated the mean.

In total, the assessors scored audio clips 37,403 times. 12 people annotated the whole dataset; 12 others annotated between 10% and 75% of the clips. For each speaker we calculated the mean (standard deviation is shown in brackets). We also tried first calculating median scores for each clip and then averaging them, but this merely increases the mean values without affecting the ratios, so in the end we used plain averages. The key metric here is, of course, the ratio between the mean score for synthesis and the mean score for the original audio. Some assessors gave much lower scores overall (hence the high dispersion), but we decided to keep all scores as is without removing outliers.
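A minimal sketch of this aggregation (the CSV layout, the column names and the numeric mapping of the +/- grades are assumptions for illustration):

```python
import pandas as pd

# hypothetical file with one row per rating:
# speaker (e.g. 'aidar_8khz'), is_synthesis (0 or 1),
# score (grades already mapped to numbers, e.g. 4- -> 3.7, 4+ -> 4.3)
ratings = pd.read_csv('tts_ratings.csv')

stats = ratings.groupby(['speaker', 'is_synthesis'])['score'].agg(['mean', 'std'])

for speaker in sorted(ratings['speaker'].unique()):
    orig = stats.loc[(speaker, 0), 'mean']
    synth = stats.loc[(speaker, 1), 'mean']
    print(f'{speaker}: original {orig:.2f}, synthesis {synth:.2f}, '
          f'ratio {100 * synth / orig:.1f}%')
```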

| Speaker | Original | Synthesis | Ratio | Examples |
|---|---|---|---|---|
| aidar_8khz | 4.67 (.45) | 4.52 (.55) | 96.8% | link |
| baya_8khz | 4.52 (.57) | 4.25 (.76) | 94.0% | link |
| kseniya_8khz | 4.80 (.40) | 4.54 (.60) | 94.5% | link |
| aidar_16khz | 4.72 (.43) | 4.53 (.55) | 95.9% | link |
| baya_16khz | 4.59 (.55) | 4.18 (.76) | 91.1% | link |
| kseniya_16khz | 4.84 (.37) | 4.54 (.59) | 93.9% | link |

We asked our assessors to rate the "naturalness of the speech" (not the audio quality). Nevertheless, we were surprised that, anecdotally, people cannot tell 8 kHz from 16 kHz on their everyday devices (which is also confirmed by the metrics). Baya has the lowest absolute and relative scores. Kseniya has the highest absolute scores, Aidar the highest relative scores. Baya also has a higher score dispersion.

Manually inspecting the audio clips with high score dispersion reveals several patterns. Speaker errors, tacotron errors (pauses), proper names and hard-to-read words are the most common causes. Of course, about 75% of such differences occur in synthesized audio, and the sampling rate does not seem to affect this.

We set out to rate "naturalness", but it is only natural to also try estimating "unnaturalness" or "robotness". This can be measured by asking people to choose between two audio clips. We went one step further and essentially applied a double-blind test: we asked our assessors to rate the same audio four times in random order, i.e. the original and the synthesis at both sampling rates. For the assessors who annotated the whole dataset we calculated the following table:

| Comparison | Worse | Same | Better |
|---|---|---|---|
| 16k vs 8k, original | 957 | 4811 | 1512 |
| 16k vs 8k, synthesis | 1668 | 4061 | 1551 |
| Original vs synthesis, 8k | 816 | 3697 | 2767 |
| Original vs synthesis, 16k | 674 | 3462 | 3144 |
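A hedged sketch of how such counts can be derived, assuming a hypothetical table with one row per (assessor, utterance) and numeric columns for the four ratings of the same utterance; "Worse" is read here as the first item of the comparison scoring lower than the second, which is our reading of the table's convention:

```python
import pandas as pd

# hypothetical file: columns orig_8k, orig_16k, synth_8k, synth_16k
df = pd.read_csv('paired_ratings.csv')

def compare(a, b):
    """Count how often column `a` scores lower than / equal to / higher than `b`."""
    return (df[a] < df[b]).sum(), (df[a] == df[b]).sum(), (df[a] > df[b]).sum()

print('16k vs 8k, original:       ', compare('orig_16k', 'orig_8k'))
print('16k vs 8k, synthesis:      ', compare('synth_16k', 'synth_8k'))
print('Original vs synthesis, 8k: ', compare('orig_8k', 'synth_8k'))
print('Original vs synthesis, 16k:', compare('orig_16k', 'synth_16k'))
```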

Several conclusions can be drawn:

  • In 66% of cases people cannot hear the difference between 8 kHz and 16 kHz;
  • With synthesis, 8 kHz helps to hide some errors;
  • In about 60% of cases the synthesis is the same as or better than the original;
  • The last two conclusions hold regardless of the sampling rate, with 8 kHz having a slight advantage;

You can hear for yourself how it sounds, both for our unique voices and for speakers from external sources (more audio for each speaker can be synthesized in the colab notebook in our repo).
