
🥇 Quality Benchmarks

For your convenience, we provide a set of benchmarks on publicly available datasets. We chose Google's STT as a decent approximation of a high-quality enterprise solution that is available commercially and in many languages.

Methodology

Our approach is described in this article.

Caveats

Overall Quality

Unlike many off-the-shelf solutions, our models (especially the Enterprise Edition models) generalize across the following domains:

  • Video;
  • Lectures;
  • Narration;
  • Phone calls;
  • Various noises, codecs, recording methods and conditions;

Any "in-the-wild" speech with sufficient SNR and recording quality should work reasonably fine by design. The main caveat is that our models work poorly with far-field audio and extremely noisy audio.

Though our models work fine with 8 kHz audio (phone calls), for simplicity we always resample to 16 kHz; robustness is built into the models themselves.
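As an illustration of this preprocessing step, below is a minimal resampling sketch using torchaudio; the file name is hypothetical and the exact loading utilities shipped with our models may differ.

```python
# Minimal sketch: bring an 8 kHz phone-call recording to 16 kHz mono
# before passing it to an STT model. "call_8khz.wav" is a hypothetical file.
import torchaudio

TARGET_SR = 16000

waveform, sample_rate = torchaudio.load("call_8khz.wav")

# resample if the source rate differs from the target rate
if sample_rate != TARGET_SR:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=TARGET_SR)
    waveform = resampler(waveform)

# downmix stereo recordings to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# `waveform` is now a 16 kHz mono tensor ready for transcription
```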

Visually Pleasing Transcriptions

Be prepared that the CE models sometimes have a hard time producing visually pleasing transcriptions, even though the results are phonetically similar.

This is usually solved in one of the following ways:

  1. Limiting the model to a very narrow domain (e.g. speech commands);
  2. Adding an external traditional (n-gram) or more modern (DL-based) language model and performing some sort of fusion / re-scoring (a sketch follows below);
  3. Using a much larger (hence slower) model;

Options (1) and (3) contradict our design philosophy and in general limit the real-life applicability of the models. We are firm believers that technology should be embarrassingly simple to use (i.e. one line of code). Naturally, we have solved these challenges in the EE edition of our models, but at this stage we are still not ready to publish embarrassingly simple EE models that fulfill the same criteria (i.e. the whole compute graph triggered by one line of code).
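To make option (2) concrete, here is a generic n-best re-scoring sketch; the hypotheses, acoustic scores, weights and the toy word list are all hypothetical placeholders, not part of our released tooling.

```python
# Generic shallow re-scoring sketch: pick the hypothesis that maximises
# acoustic score + alpha * LM log-probability + beta * word count.
import math
from typing import Callable, List, Tuple

def rescore(nbest: List[Tuple[str, float]],
            lm_logprob: Callable[[str], float],
            alpha: float = 0.5,
            beta: float = 0.1) -> str:
    def combined(hyp: str, am_score: float) -> float:
        return am_score + alpha * lm_logprob(hyp) + beta * len(hyp.split())
    return max(nbest, key=lambda pair: combined(*pair))[0]

# Toy word-list "language model": frequent words get a higher log-probability.
COMMON = {"i", "see", "the", "sea", "a", "to"}

def toy_lm(sentence: str) -> float:
    return sum(math.log(0.1 if w in COMMON else 0.001) for w in sentence.split())

# (hypothesis, acoustic log-score) pairs, e.g. from beam search
nbest = [("eye sea the see", -12.0), ("i see the sea", -12.5)]
print(rescore(nbest, toy_lm))  # -> "i see the sea"
```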

Models

  • Google was used as the main reference in terms of quality;
  • CE = Community Edition;
  • EE = Enterprise Edition;

All of the metrics below are WER (word error rate), in percent.
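For reference, a self-contained sketch of how WER can be computed from a reference and a hypothesis is given below; this is an illustration, not the exact scoring script used to produce the tables.

```python
# Word error rate via word-level edit distance:
# (substitutions + deletions + insertions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(round(100 * wer("the cat sat on the mat", "the cat sat on mat"), 1))  # 16.7
```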

EN V1

All of these tests were run in early September 2020.

| Dataset | Silero CE | Silero EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks** | | | | |
| en_v001_librispeech_test_clean | 8.6 | 6.9 | 7.8 | 8.7 |
| en_librispeech_val | 14.4 | 11.5 | 11.3 | 13.1 |
| en_librispeech_test_other | 20.6 | 17.1 | 16.2 | 19.1 |
| **Lecture / speech** | | | | |
| en_multi_ted_test_he | 16.6 | 12.0 | 15.3 | 14.1 |
| en_multi_ted_test_common | 21.2 | 17.6 | 16.9 | 16 |
| en_multi_ted_val | 23.5 | 20 | 22.7 | 20.8 |
| **In the wild** | | | | |
| en_common_voice_val | 27.5 | 20.6 | 20.8 | 20.8 |
| en_common_voice_test | 32.6 | 25.5 | 22.2 | 24 |
| **VOIP / calls** | | | | |
| en_voip_test | 9 | 8.6 | 19.7 | 18.3 |
| **British Dialects** | | | | |
| en_uk_dialects_midlands_english_female | 16.7 | 10.8 | 9.6 | 8.4 |
| en_uk_dialects_southern_english_female | 16.7 | 11.4 | 10.8 | 9.3 |
| en_uk_dialects_welsh_english_female | 17.1 | 12.1 | 20.5 | 10.5 |
| en_uk_dialects_southern_english_male | 17.9 | 12.7 | 11.5 | 10.6 |
| en_uk_dialects_welsh_english_male | 18.6 | 13.2 | 12.1 | |
| en_uk_dialects_northern_english_male | 20 | 13.9 | 15.5 | 11.7 |
| en_uk_dialects_scottish_english_male | 21.3 | 15.1 | 10 | 11.3 |
| en_uk_dialects_midlands_english_male | 21.7 | 15.1 | 11.8 | 10.3 |
| en_uk_dialects_northern_english_female | 22 | 15.2 | 15 | 12.7 |
| en_uk_dialects_scottish_english_female | 22.2 | 15.7 | 13.5 | 12.6 |
| en_uk_dialects_irish_english_male | 32.7 | 25.5 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | | |
| en_voices_rm2_clo_none_stu_manifest | 17.3 | 13.7 | 21.5 | 27 |
| en_voices_rm2_far_none_lav_manifest | 31.4 | 26.5 | 27.5 | 42.3 |
| en_voices_rm4_far_none_stu_manifest | 33.5 | 28.7 | 43.2 | 43.2 |
| en_voices_rm3_clo_none_stu_manifest | 34.5 | 29.9 | 28.6 | 40.8 |
| en_voices_rm2_far_musi_stu_manifest | 35.4 | 30.9 | 30.6 | 42.4 |
| en_voices_rm2_far_babb_stu_manifest | 39.3 | 35.0 | 38.5 | 48.2 |
| en_voices_rm3_clo_musi_stu_manifest | 46.9 | 43 | 38.1 | 51.8 |
| en_voices_rm4_ceo_none_lav_manifest | 50.3 | 46.4 | 42.9 | 52.5 |
| en_voices_rm3_far_none_stu_manifest | 78.9 | 78.3 | 68.8 | 81.6 |
| en_nsc_val_manifest_part1 | 31.7 | 24.4 | NA | NA |
| en_nsc_val_manifest_part2 | 67.0 | 60.9 | NA | NA |

EN V2

Google tests were run in early September 2020.

EN V2 metrics updated in early November 2020.

| Dataset | Silero CE | Silero EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks** | | | | |
| en_v001_librispeech_test_clean | 8.7 | 6.9 | 7.8 | 8.7 |
| en_librispeech_val | 14.5 | 11.7 | 11.3 | 13.1 |
| en_librispeech_test_other | 20.6 | 17.4 | 16.2 | 19.1 |
| **Lecture / speech** | | | | |
| en_multi_ted_test_he | 15.0 | 11.5 | 15.3 | 14.1 |
| en_multi_ted_test_common | 20.7 | 17.3 | 16.9 | 16 |
| en_multi_ted_val | 22.9 | 19.9 | 22.7 | 20.8 |
| **In the wild** | | | | |
| en_common_voice_val | 27.1 | 20.3 | 20.8 | 20.8 |
| en_common_voice_test | 32.1 | 25.3 | 22.2 | 24 |
| **VOIP / calls** | | | | |
| en_voip_test | 11.4 | 10.8 | 19.7 | 18.3 |
| **British Dialects** | | | | |
| en_uk_dialects_midlands_english_female | 15.7 | 10.4 | 9.6 | 8.4 |
| en_uk_dialects_southern_english_female | 16.6 | 11.6 | 10.8 | 9.3 |
| en_uk_dialects_welsh_english_female | 16.9 | 11.9 | 20.5 | 10.5 |
| en_uk_dialects_southern_english_male | 17.4 | 12.6 | 11.5 | 10.6 |
| en_uk_dialects_welsh_english_male | 17.8 | 13.1 | 12.1 | |
| en_uk_dialects_northern_english_male | 19.7 | 13.7 | 15.5 | 11.7 |
| en_uk_dialects_scottish_english_male | 20.5 | 14.6 | 10 | 11.3 |
| en_uk_dialects_midlands_english_male | 21.4 | 16.1 | 11.8 | 10.3 |
| en_uk_dialects_northern_english_female | 21.3 | 15.5 | 15 | 12.7 |
| en_uk_dialects_scottish_english_female | 21.8 | 15.4 | 13.5 | 12.6 |
| en_uk_dialects_irish_english_male | 32.5 | 25.7 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | | |
| en_voices_rm2_clo_none_stu_manifest | 17.5 | 14.1 | 21.5 | 27 |
| en_voices_rm2_far_none_lav_manifest | 31.6 | 27.0 | 27.5 | 42.3 |
| en_voices_rm4_far_none_stu_manifest | 33.7 | 29.3 | 43.2 | 43.2 |
| en_voices_rm3_clo_none_stu_manifest | 34.7 | 30.4 | 28.6 | 40.8 |
| en_voices_rm2_far_musi_stu_manifest | 35.9 | 31.5 | 30.6 | 42.4 |
| en_voices_rm2_far_babb_stu_manifest | 39.8 | 35.7 | 38.5 | 48.2 |
| en_voices_rm3_clo_musi_stu_manifest | 47.2 | 43.5 | 38.1 | 51.8 |
| en_voices_rm4_ceo_none_lav_manifest | 50.0 | 46.3 | 42.9 | 52.5 |
| en_voices_rm3_far_none_stu_manifest | 78.3 | 78.0 | 68.8 | 81.6 |
| en_nsc_val_manifest_part1 | 18.3 | 13.9 | NA | NA |
| en_nsc_val_manifest_part2 | 31.7 | 28.5 | NA | NA |

EN V3

Google tests were run in early September 2020.

EN V3 metrics updated in April 2021.

Dataset | Silero xsmall_q CE | Silero xsmall CE | Silero small_q CE | Silero small CE | Silero large CE | Google Video Premium | Google Phone Premium
AudioBooks / narration
lj
v001_librispeech_test_clean 7.8 8.7
librispeech_val 11.3 13.1
librispeech_test_other 16.2 19.1
aru 16.2 19.1
mls_test
mls_dev
Lecture / speech
multi_ted_test_he 15.3 14.1
multi_ted_test_common 16.9 16.0
multi_ted_val 22.7 20.8
voxpopuli_dev
voxpopuli_test
Finance
kensho
In the wild
common_voice_val 20.8 20.8
common_voice_test 22.2 24
VOIP / calls
voip_test 19.7 18.3
Dialects
uk_dialects_midlands_english_female 9.6 8.4
uk_dialects_southern_english_female 10.8 9.3
uk_dialects_welsh_english_female 20.5 10.5
uk_dialects_southern_english_male 11.5 10.6
uk_dialects_welsh_english_male 12.1
uk_dialects_northern_english_male 15.5 11.7
uk_dialects_scottish_english_male 10 11.3
uk_dialects_midlands_english_male 11.8 10.3
uk_dialects_northern_english_female 15 12.7
uk_dialects_scottish_english_female 13.5 12.6
uk_dialects_irish_english_male 25.5 21.9
nsc_val_manifest_part1
Far-field / very noisy
voices_rm2_clo_none_stu_manifest 21.5 27
voices_rm2_far_none_lav_manifest 27.5 42.3
voices_rm4_far_none_stu_manifest 43.2 43.2
voices_rm3_clo_none_stu_manifest 28.6 40.8
voices_rm2_far_musi_stu_manifest 30.6 42.4
voices_rm2_far_babb_stu_manifest 38.5 48.2
voices_rm3_clo_musi_stu_manifest 38.1 51.8
voices_rm4_ceo_none_lav_manifest 42.9 52.5
voices_rm3_far_none_stu_manifest 68.8 81.6
Dataset | Silero xsmall_q EE | Silero xsmall EE | Silero small_q EE | Silero small EE | Silero large EE | Google Video Premium | Google Phone Premium
AudioBooks / narration
lj 5.9 5.6 5.4
v001_librispeech_test_clean 7.7 7.0 5.9 7.8 8.7
librispeech_val 12.4 11.2 9.7 11.3 13.1
librispeech_test_other 17.9 16.5 15.1 16.2 19.1
aru 11.0 9.7 8.2 16.2 19.1
mls_test 20.9 19.3 17.9
mls_dev 18.5 17.2 15.8
Lecture / speech
multi_ted_test_he 14.8 14.1 12.1 15.3 14.1
multi_ted_test_common 22.7 21.1 18.3 16.9 16.0
multi_ted_val 24.8 23.2 21.0 22.7 20.8
voxpopuli_dev 23.5 22.4 20.8
voxpopuli_test 24.1 23.0 21.4
Finance
kensho 10.6 9.7 8.1
In the wild
common_voice_val 21.4 20.1 18.5 20.8 20.8
common_voice_test 26.4 24.9 23.3 22.2 24
VOIP / calls
voip_test 24.0 23.6 21.0 19.7 18.3
Dialects
uk_dialects_midlands_english_female 12.5 10.8 8.8 9.6 8.4
uk_dialects_southern_english_female 13.1 11.9 9.9 10.8 9.3
uk_dialects_welsh_english_female 12.0 12.8 10.7 20.5 10.5
uk_dialects_southern_english_male 14.1 12.9 10.6 11.5 10.6
uk_dialects_welsh_english_male 14.5 13.9 12.1 12.1
uk_dialects_northern_english_male 15.7 14.6 12.0 15.5 11.7
uk_dialects_scottish_english_male 15.9 14.9 12.7 10 11.3
uk_dialects_midlands_english_male 17.6 16.0 12.2 11.8 10.3
uk_dialects_northern_english_female 16.3 15.8 13.4 15 12.7
uk_dialects_scottish_english_female 16.5 15.2 12.8 13.5 12.6
uk_dialects_irish_english_male 28.1 26.3 23.7 25.5 21.9
nsc_val_manifest_part1 10.0 9.3 8.3
Far-field / very noisy
voices_rm2_clo_none_stu_manifest 14.2 12.6 11.2 21.5 27
voices_rm2_far_none_lav_manifest 25.4 22.8 21.5 27.5 42.3
voices_rm4_far_none_stu_manifest 28.6 26.2 24.7 43.2 43.2
voices_rm3_clo_none_stu_manifest 41.9 39.2 37.9 28.6 40.8
voices_rm2_far_musi_stu_manifest 30.2 27.5 26.1 30.6 42.4
voices_rm2_far_babb_stu_manifest 34.6 31.8 30.9 38.5 48.2
voices_rm3_clo_musi_stu_manifest 29.8 27.0 26.0 38.1 51.8
voices_rm4_ceo_none_lav_manifest 46.9 43.7 41.3 42.9 52.5
voices_rm3_far_none_stu_manifest 74.2 71.5 70.0 68.8 81.6

DE V1

All of these tests were run in early September 2020.

At the time of this test, there was no premium German model available from Google. There were several models for several regions, but the differences between them were minor, so we chose the default German model.

| Dataset | CE | EE | Google |
|---|---|---|---|
| **AudioBooks** | | | |
| de_caito_manifest_val | 12.5 | 8.7 | 19.5 |
| **Narration** | | | |
| de_voxforge_manifest_val | 3.8 | 2.3 | 5.9 |
| **In the wild** | | | |
| de_common_voice_test_manifest | 28.0 | 17.6 | 16.1 |
| de_common_voice_val_manifest | 24.9 | 15.0 | 14.0 |
| de_telekinect_dev_manifest | 28.1 | 18.6 | 13.5 |
| de_telekinect_test_manifest | 28.3 | 19.4 | 15.7 |

ES V1

All of these tests were run in early September 2020.

For Spanish, we chose the region (US) for which a Premium model was available. Judging by the benchmark results, Google heavily relies on the data it sources from Android, most likely due to the large population and lighter regulation. Note that most "dialect" recordings are quite clean, but the pronunciation varies.

| Dataset | CE | EE | Google | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks** | | | | |
| es_caito_val | 7.7 | 5.7 | 20.3 | 22.3 |
| **Narration** | | | | |
| es_voxforge_val | 1.4 | 1.1 | 18.1 | 19.4 |
| **In the wild** | | | | |
| es_common_voice_test | 22.0 | 14.4 | 27.2 | 23.1 |
| es_common_voice_val | 20.1 | 13.0 | 24.5 | 19.6 |
| **Dialects** | | | | |
| es_dialects_argentinian_val | 19.0 | 12.9 | 11.8 | 6.7 |
| es_dialects_chilean_val | 19.8 | 13.7 | 8.9 | 6.6 |
| es_dialects_columbian_val | 18.4 | 11.9 | 7.8 | 5.4 |
| es_dialects_peruvian_val | 14.4 | 9.1 | 6.2 | 4.7 |
| es_dialects_puerto_rico_val | 21.1 | 14.5 | 7.9 | 6.0 |
| es_dialects_venezuela_val | 19.2 | 13.2 | 8.2 | 6.4 |

TTS Models

RU V1

We decided to keep the quality assessment really simple: we generated audio from the validation subsets of our data (~200 files per speaker), shuffled them together with the original recordings of the same speakers, and gave them to a group of 24 assessors to evaluate the sound quality on a five-point scale. For 8 kHz and 16 kHz the scores were collected separately (both for synthesized and original speech). For simplicity we used the following grades: [1, 2, 3, 4-, 4, 4+, 5-, 5]; the higher the quality, the more fine-grained the scale. Then, for each speaker, we simply calculated the mean.

In total, the audios were scored 37,403 times. Twelve people annotated the whole dataset; twelve others annotated between 10% and 75% of the audios. For each speaker we calculated the mean (standard deviation is shown in brackets). We also tried first calculating median scores for each audio and then averaging them, but this only inflates the mean values without affecting the ratios, so in the end we used plain averages. The key metric here, of course, is the ratio between the mean score for synthesis and for the original audio. Some assessors gave much lower scores overall (hence the high dispersion), but we decided to keep all scores as is, without removing outliers.
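A minimal sketch of this aggregation is shown below; the score records and the numeric mapping of the 4-/4+/5- grades are hypothetical, since the raw annotation data is not published.

```python
# Per-speaker mean, standard deviation and synthesis-to-original ratio.
# The grade-to-number mapping (e.g. 4- -> 3.67, 4+ -> 4.33) is one possible choice.
from statistics import mean, stdev

scores = {
    "aidar_8khz": {
        "original": [5.0, 4.67, 4.33, 5.0],
        "synthesis": [4.67, 4.33, 4.33, 5.0],
    },
    # ... one entry per speaker and sampling rate
}

for speaker, groups in scores.items():
    orig, synth = groups["original"], groups["synthesis"]
    ratio = 100 * mean(synth) / mean(orig)
    print(f"{speaker}: original {mean(orig):.2f} ({stdev(orig):.2f}), "
          f"synthesis {mean(synth):.2f} ({stdev(synth):.2f}), ratio {ratio:.1f}%")
```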

| Speaker | Original | Synthesis | Ratio | Examples |
|---|---|---|---|---|
| aidar_8khz | 4.67 (.45) | 4.52 (.55) | 96.8% | link |
| baya_8khz | 4.52 (.57) | 4.25 (.76) | 94.0% | link |
| kseniya_8khz | 4.80 (.40) | 4.54 (.60) | 94.5% | link |
| aidar_16khz | 4.72 (.43) | 4.53 (.55) | 95.9% | link |
| baya_16khz | 4.59 (.55) | 4.18 (.76) | 91.1% | link |
| kseniya_16khz | 4.84 (.37) | 4.54 (.59) | 93.9% | link |

We asked our assessors to rate the "naturalness of the speech" (not the audio quality). Nevertheless, we were surprised that, anecdotally, people cannot tell 8 kHz and 16 kHz apart on their everyday devices (which the metrics also confirm). Baya has the lowest absolute and relative scores; Kseniya has the highest absolute scores; Aidar has the highest relative scores. Baya also has a higher score dispersion.

Manually inspecting audios with high score dispersion reveals several patterns: speaker errors, tacotron errors (pauses), proper names and hard-to-read words are the most common causes. Of course, 75% of such differences occur in synthesized audios, and the sampling rate does not seem to affect this.

We tried to rate "naturalness", but it is only natural to try estimating "unnaturalness" or "robotness" as well. This can be measured by asking people to choose between two audios. We went one step further, however, and essentially applied a double-blind test: we asked our assessors to rate the same audio 4 times in random order - original and synthesis at different sampling rates. For the assessors who annotated the whole dataset we calculated the following table:

| Comparison | Worse | Same | Better |
|---|---|---|---|
| 16k vs 8k, original | 957 | 4811 | 1512 |
| 16k vs 8k, synthesis | 1668 | 4061 | 1551 |
| Original vs synthesis, 8k | 816 | 3697 | 2767 |
| Original vs synthesis, 16k | 674 | 3462 | 3144 |
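A sketch of how such a table can be tallied from paired ratings of the same utterance is given below; the rating records are hypothetical.

```python
# Count how often the first condition is rated worse than, the same as,
# or better than the second condition for the same utterance and assessor.
from collections import Counter

# each record: (utterance_id, rating_of_first_condition, rating_of_second_condition)
pairs = [
    ("utt_001", 4.33, 4.67),
    ("utt_002", 5.0, 5.0),
    ("utt_003", 4.67, 4.33),
    # ...
]

counts = Counter()
for _, first, second in pairs:
    if first < second:
        counts["worse"] += 1
    elif first == second:
        counts["same"] += 1
    else:
        counts["better"] += 1

print(counts["worse"], counts["same"], counts["better"])
```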

Several conclusions can be drawn:

  • In 66% of cases people cannot hear the difference between 8k and 16k;
  • For synthesis, 8k helps to hide some errors;
  • In about 60% of cases the synthesis is the same as or better than the original;
  • The last two conclusions hold regardless of the sampling rate, with 8k having a slight advantage;

You can see for yourself how it sounds, both for our unique voices and for speakers from external sources (more audio for each speaker can be synthesized in the colab notebook in our repo).
