Quality Benchmarks

Quality Benchmarks
- Methodology
- Caveats
  - Overall Quality
  - Visually Pleasing Transcriptions
- STT Models
  - EN Version Comparison
  - EN V1
  - EN V2
  - EN V3
  - EN V4
  - EN V5
  - EN V6
  - DE V1
  - DE V3
  - DE V4
  - ES V1
- TTS Models
  - RU V1
- TE Models
  - EN DE RU ES V2
  - EN DE RU ES V1

🥇 Quality Benchmarks

For your convenience, we provide a set of benchmarks on publicly available datasets. We chose Google's STT as a decent approximation of a high quality enterprise solution available commercially and in many languages.

Methodology

Our approach is described in this article.

Caveats

Overall Quality

Unlike many off-the-shelf solutions, our models (especially the Enterprise Edition models) generalize across the following domains:

Video;
Lectures;
Narration;
Phone calls;
Various noises, codecs, recording methods and conditions;

Any "in-the-wild" speech with sufficient SNR and recording quality should work reasonably well by design. The main caveat is that our models work poorly with far-field audio and extremely noisy audio.

Though our models work fine with 8 kHz audio (phone calls), for simplicity we always resample to 16 kHz. Robustness is built into the models themselves.

Visually Pleasing Transcriptions

Be prepared that the CE model sometimes has a hard time producing visually pleasing transcriptions, even though the results are phonetically similar.

This is usually solved one way or another:

1. Limiting the model to a very narrow domain (e.g. speech commands);
2. Adding an external traditional (n-gram) or more modern (DL-based) language model and performing some sort of fusion / re-scoring;
3. Using a much larger (hence slower) model;

Options (1) and (3) contradict our design philosophy and in general limit the real-life applicability of models. We firmly believe that technology should be embarrassingly simple to use (i.e. one line of code). Naturally, we have solved these challenges in the EE models, but at this stage we are not yet ready to publish embarrassingly simple EE models that meet the same criterion (i.e. the whole compute graph triggered by one line of code).

Models

Google was used as a main reference in terms of quality;
CE = Community Edition;
EE = Enterprise Edition;

All of the below metrics are WER (word error rate).
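As a reminder of what the numbers mean, here is a minimal sketch of WER as it is commonly defined: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name and the plain whitespace tokenization are assumptions made for illustration; the exact text normalization behind the tables is not stated on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {100 * example_wer:.1f}%")
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.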

English Version Comparison

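Simple WER version comparison table (the values match the EE results of each version, taking the largest model where several sizes are listed)

Dataset	V1	V2	V3	V4	V5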
AudioBooks
lj			5.4	5.6	5.1
librispeech_test_clean	6.9	6.9	5.9	6.1	5.5
librispeech_val	11.5	11.7	9.7	10	8.8
librispeech_test_other	17.1	17.4	15.1	15.2	13.5
mls_test			17.9	17.1	14.8
mls_dev			15.8	15.3	13.3

Lecture / speech
multi_ted_test_he	12	11.5	12.1	10.2	8.4
multi_ted_test_common	17.6	17.3	18.3	15.5	14
multi_ted_val	20	19.9	21	18.4	16.9
voxpopuli_dev			20.8	19.4	16
voxpopuli_test			21.4	20.5	16.4

Finance
kensho			8.1	5.9	4.3

In the wild
common_voice_val	20.6	20.3	18.5	15.8	15
common_voice_test	25.5	25.3	23.3	20.3	19.2
gigaspeech					20.7

VOIP / calls
voip_test			21	18.7	18.3

Dialects
UK dialects mean	14.6	14.6	12.6	10.9	10.4
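The "UK dialects mean" row appears to be an unweighted mean over the per-dialect test sets. This is inferred from the numbers, not stated on the page. As a check, the sketch below averages the eleven Silero EE figures from the British Dialects block of the EN V1 table further down; the variable names are illustrative.

```python
from statistics import mean

# Silero EE WER per British dialect test set, copied from the EN V1 table.
en_v1_ee_dialect_wer = {
    "midlands_english_female": 10.8,
    "southern_english_female": 11.4,
    "welsh_english_female": 12.1,
    "southern_english_male": 12.7,
    "welsh_english_male": 13.2,
    "northern_english_male": 13.9,
    "scottish_english_male": 15.1,
    "midlands_english_male": 15.1,
    "northern_english_female": 15.2,
    "scottish_english_female": 15.7,
    "irish_english_male": 25.5,
}

# Unweighted mean: every test set counts equally, whatever its size.
dialects_mean = mean(en_v1_ee_dialect_wer.values())
print(round(dialects_mean, 1))
```

The rounded result agrees with the V1 cell of the row above. An utterance-weighted mean could differ, since the size of each test set is not given here.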

EN V1

All of these tests were run in early September 2020.

Dataset	Silero CE	Silero EE	Google Video Premium	Google Phone Premium
AudioBooks
en_v001_librispeech_test_clean	8.6	6.9	7.8	8.7
en_librispeech_val	14.4	11.5	11.3	13.1
en_librispeech_test_other	20.6	17.1	16.2	19.1

Lecture / speech
en_multi_ted_test_he	16.6	12.0	15.3	14.1
en_multi_ted_test_common	21.2	17.6	16.9	16
en_multi_ted_val	23.5	20	22.7	20.8

In the wild
en_common_voice_val	27.5	20.6	20.8	20.8
en_common_voice_test	32.6	25.5	22.2	24

VOIP / calls
en_voip_test	9	8.6	19.7	18.3

British Dialects
en_uk_dialects_midlands_english_female	16.7	10.8	9.6	8.4
en_uk_dialects_southern_english_female	16.7	11.4	10.8	9.3
en_uk_dialects_welsh_english_female	17.1	12.1	20.5	10.5
en_uk_dialects_southern_english_male	17.9	12.7	11.5	10.6
en_uk_dialects_welsh_english_male	18.6	13.2		12.1
en_uk_dialects_northern_english_male	20	13.9	15.5	11.7
en_uk_dialects_scottish_english_male	21.3	15.1	10	11.3
en_uk_dialects_midlands_english_male	21.7	15.1	11.8	10.3
en_uk_dialects_northern_english_female	22	15.2	15	12.7
en_uk_dialects_scottish_english_female	22.2	15.7	13.5	12.6
en_uk_dialects_irish_english_male	32.7	25.5	25.5	21.9

Far-field / very noisy
en_voices_rm2_clo_none_stu_manifest	17.3	13.7	21.5	27
en_voices_rm2_far_none_lav_manifest	31.4	26.5	27.5	42.3
en_voices_rm4_far_none_stu_manifest	33.5	28.7	43.2	43.2
en_voices_rm3_clo_none_stu_manifest	34.5	29.9	28.6	40.8
en_voices_rm2_far_musi_stu_manifest	35.4	30.9	30.6	42.4
en_voices_rm2_far_babb_stu_manifest	39.3	35.0	38.5	48.2
en_voices_rm3_clo_musi_stu_manifest	46.9	43	38.1	51.8
en_voices_rm4_ceo_none_lav_manifest	50.3	46.4	42.9	52.5
en_voices_rm3_far_none_stu_manifest	78.9	78.3	68.8	81.6
en_nsc_val_manifest_part1	31.7	24.4	NA	NA
en_nsc_val_manifest_part2	67.0	60.9	NA	NA

EN V2

Google tests were run in early September 2020.

EN V2 metrics updated in early November 2020.

Dataset	Silero CE	Silero EE	Google Video Premium	Google Phone Premium
AudioBooks
en_v001_librispeech_test_clean	8.7	6.9	7.8	8.7
en_librispeech_val	14.5	11.7	11.3	13.1
en_librispeech_test_other	20.6	17.4	16.2	19.1

Lecture / speech
en_multi_ted_test_he	15.0	11.5	15.3	14.1
en_multi_ted_test_common	20.7	17.3	16.9	16
en_multi_ted_val	22.9	19.9	22.7	20.8

In the wild
en_common_voice_val	27.1	20.3	20.8	20.8
en_common_voice_test	32.1	25.3	22.2	24

VOIP / calls
en_voip_test	11.4	10.8	19.7	18.3

British Dialects
en_uk_dialects_midlands_english_female	15.7	10.4	9.6	8.4
en_uk_dialects_southern_english_female	16.6	11.6	10.8	9.3
en_uk_dialects_welsh_english_female	16.9	11.9	20.5	10.5
en_uk_dialects_southern_english_male	17.4	12.6	11.5	10.6
en_uk_dialects_welsh_english_male	17.8	13.1		12.1
en_uk_dialects_northern_english_male	19.7	13.7	15.5	11.7
en_uk_dialects_scottish_english_male	20.5	14.6	10	11.3
en_uk_dialects_midlands_english_male	21.4	16.1	11.8	10.3
en_uk_dialects_northern_english_female	21.3	15.5	15	12.7
en_uk_dialects_scottish_english_female	21.8	15.4	13.5	12.6
en_uk_dialects_irish_english_male	32.5	25.7	25.5	21.9

Far-field / very noisy
en_voices_rm2_clo_none_stu_manifest	17.5	14.1	21.5	27
en_voices_rm2_far_none_lav_manifest	31.6	27.0	27.5	42.3
en_voices_rm4_far_none_stu_manifest	33.7	29.3	43.2	43.2
en_voices_rm3_clo_none_stu_manifest	34.7	30.4	28.6	40.8
en_voices_rm2_far_musi_stu_manifest	35.9	31.5	30.6	42.4
en_voices_rm2_far_babb_stu_manifest	39.8	35.7	38.5	48.2
en_voices_rm3_clo_musi_stu_manifest	47.2	43.5	38.1	51.8
en_voices_rm4_ceo_none_lav_manifest	50.0	46.3	42.9	52.5
en_voices_rm3_far_none_stu_manifest	78.3	78.0	68.8	81.6
en_nsc_val_manifest_part1	18.3	13.9	NA	NA
en_nsc_val_manifest_part2	31.7	28.5	NA	NA

EN V3

Google tests were run in early September 2020.

EN V3 metrics updated in April 2021.

Dataset	Silero	Silero	Silero	Silero	Silero	Google	Google
	xsmall_q	xsmall	small_q	small	large	Video	Phone
	CE	CE	CE	CE	CE	Premium	Premium
AudioBooks / narration
lj	11.5	10.2	8.6	7.9	6.6
librispeech_test_clean	14.3	12.1	11.1	9.7	7.4	7.8	8.7
librispeech_val	21.0	18.4	16.9	15.2	11.9	11.3	13.1
librispeech_test_other	29.0	25.7	23.8	21.6	17.9	16.2	19.1
aru	21.3	18.5	16.9	14.4	11.1	16.2	19.1
mls_test	32.0	29.2	27.3	25.2	22.0
mls_dev	29.6	26.7	24.6	22.7	19.7

Lecture / speech
multi_ted_test_he	25.9	23.1	20.6	19.0	15.8	15.3	14.1
multi_ted_test_common	34.3	30.9	28.1	25.8	21.5	16.9	16.0
multi_ted_val	34.6	31.5	29.4	27.7	23.9	22.7	20.8
voxpopuli_dev	35.2	32.6	30.6	28.7	25.0
voxpopuli_test	36.3	34.1	31.7	30.1	26.4

Finance
kensho	21.3	18.8	15.3	13.8	10.0

In the wild
common_voice_val	37.8	35.1	31.2	28.8	25.3	20.8	20.8
common_voice_test	42.2	39.5	35.9	33.5	30.1	22.2	24

VOIP / calls
voip_test	32.7	31.7	23.7	23.7	21.2	19.7	18.3

Dialects
uk_dialects_midlands_english_female	26.0	23.1	21.3	19.6	13.6	9.6	8.4
uk_dialects_southern_english_female	26.7	23.6	20.9	18.9	14.2	10.8	9.3
uk_dialects_welsh_english_female	25.6	22.6	19.8	18.3	14.2	20.5	10.5
uk_dialects_southern_english_male	27.7	24.7	22.2	20.0	15.0	11.5	10.6
uk_dialects_welsh_english_male	27.8	25.3	22.6	20.5	16.6		12.1
uk_dialects_northern_english_male	31.3	28.2	24.8	23.0	17.2	15.5	11.7
uk_dialects_scottish_english_male	32.0	28.8	25.1	23.2	17.8	10	11.3
uk_dialects_midlands_english_male	33.1	30.2	26.5	24.3	18.0	11.8	10.3
uk_dialects_northern_english_female	33.2	30.1	26.6	24.3	19.3	15	12.7
uk_dialects_scottish_english_female	31.3	28.6	25.4	23.5	18.6	13.5	12.6
uk_dialects_irish_english_male	42.7	40.2	36.8	34.1	29.3	25.5	21.9
nsc_val_manifest_part1

Far-field / very noisy
voices_rm2_clo_none_stu	25.6	22.4	19.7	17.5	14.2	21.5	27
voices_rm2_far_none_lav	41.5	37.2	32.1	29.0	25.7	27.5	42.3
voices_rm4_far_none_stu	46.1	41.4	36.5	33.1	30.1	43.2	43.2
voices_rm3_clo_none_stu	43.2	38.9	35.0	32.1	28.9	28.6	40.8
voices_rm2_far_musi_stu	46.0	41.6	37.0	33.6	30.3	30.6	42.4
voices_rm2_far_babb_stu	50.6	46.3	41.0	37.9	34.7	38.5	48.2
voices_rm3_clo_musi_stu	55.1	51.0	47.6	44.7	41.7	38.1	51.8
voices_rm4_ceo_none_lav	60.7	56.2	52.4	49.0	45.3	42.9	52.5
voices_rm3_far_none_stu	82.0	79.5	76.4	73.8	71.8	68.8	81.6

Dataset	Silero	Silero	Silero	Silero	Silero	Google	Google
	xsmall_q	xsmall	small_q	small	large	Video	Phone
	EE	EE	EE	EE	EE	Premium	Premium
AudioBooks / narration
lj	6.8	6.3	5.9	5.6	5.4
librispeech_test_clean	9.6	8.3	7.7	7.0	5.9	7.8	8.7
librispeech_val	15.0	13.2	12.4	11.2	9.7	11.3	13.1
librispeech_test_other	21.7	19.2	17.9	16.5	15.1	16.2	19.1
aru	13.7	11.7	11.0	9.7	8.2	16.2	19.1
mls_test	24.4	22.1	20.9	19.3	17.9
mls_dev	22.0	19.8	18.5	17.2	15.8

Lecture / speech
multi_ted_test_he	19.0	16.6	14.8	14.1	12.1	15.3	14.1
multi_ted_test_common	28.1	24.9	22.7	21.1	18.3	16.9	16.0
multi_ted_val	29.3	26.2	24.8	23.2	21.0	22.7	20.8
voxpopuli_dev	25.7	24.4	23.5	22.4	20.8
voxpopuli_test	26.1	25.0	24.1	23.0	21.4

Finance
kensho	14.0	12.3	10.6	9.7	8.1

In the wild
common_voice_val	25.7	24.0	21.4	20.1	18.5	20.8	20.8
common_voice_test	30.9	29.0	26.4	24.9	23.3	22.2	24

VOIP / calls
voip_test	29.1	29.0	24.0	23.6	21.0	19.7	18.3

Dialects
uk_dialects_midlands_english_female	15.5	13.8	12.5	10.8	8.8	9.6	8.4
uk_dialects_southern_english_female	16.4	14.7	13.1	11.9	9.9	10.8	9.3
uk_dialects_welsh_english_female	15.8	14.3	12.0	12.8	10.7	20.5	10.5
uk_dialects_southern_english_male	17.6	15.7	14.1	12.9	10.6	11.5	10.6
uk_dialects_welsh_english_male	17.9	16.4	14.5	13.9	12.1		12.1
uk_dialects_northern_english_male	19.8	17.9	15.7	14.6	12.0	15.5	11.7
uk_dialects_scottish_english_male	20.5	18.4	15.9	14.9	12.7	10	11.3
uk_dialects_midlands_english_male	22.6	20.2	17.6	16.0	12.2	11.8	10.3
uk_dialects_northern_english_female	21.1	18.9	16.3	15.8	13.4	15	12.7
uk_dialects_scottish_english_female	20.1	18.2	16.5	15.2	12.8	13.5	12.6
uk_dialects_irish_english_male	31.4	29.6	28.1	26.3	23.7	25.5	21.9
nsc_val_manifest_part1			10.0	9.3	8.3

Far-field / very noisy
voices_rm2_clo_none_stu_manifest	18.5	15.9	14.2	12.6	11.2	21.5	27
voices_rm2_far_none_lav_manifest	34.3	29.7	25.4	22.8	21.5	27.5	42.3
voices_rm4_far_none_stu_manifest	39.5	34.4	28.6	26.2	24.7	43.2	43.2
voices_rm3_clo_none_stu_manifest	36.8	32.1	41.9	39.2	37.9	28.6	40.8
voices_rm2_far_musi_stu_manifest	39.1	34.3	30.2	27.5	26.1	30.6	42.4
voices_rm2_far_babb_stu_manifest	44.8	39.3	34.6	31.8	30.9	38.5	48.2
voices_rm3_clo_musi_stu_manifest	49.8	45.1	29.8	27.0	26.0	38.1	51.8
voices_rm4_ceo_none_lav_manifest	56.3	50.7	46.9	43.7	41.3	42.9	52.5
voices_rm3_far_none_stu_manifest	80.9	78.0	74.2	71.5	70.0	68.8	81.6

EN V4

Google tests were run in early September 2020. EN V4 metrics updated in June 2021.

Dataset	Silero	Silero	Silero	Silero	Silero	Google	Google
	xsmall_q	xsmall	small_q	small	large	Video	Phone
	CE	CE	CE	CE	CE	Premium	Premium
AudioBooks / narration
lj					6.6
librispeech_test_clean					6.8	7.8	8.7
librispeech_val					11.7	11.3	13.1
librispeech_test_other					17.5	16.2	19.1
aru					10.6	16.2	19.1
mls_test					20.6
mls_dev					18.7
Lecture / speech
multi_ted_test_he					12.2	15.3	14.1
multi_ted_test_common					17.4	16.9	16
multi_ted_val					20.4	22.7	20.8
voxpopuli_dev					21.2
voxpopuli_test					22.6
Finance
kensho					6.5
In the wild
common_voice_val					21.6	20.8	20.8
common_voice_test					26.4	22.2	24
VOIP / calls
voip_test					21.2	19.7	18.3
Dialects
uk_dialects_midlands_english_female					10.8	9.6	8.4
uk_dialects_southern_english_female					11.8	10.8	9.3
uk_dialects_welsh_english_female					12.2	20.5	10.5
uk_dialects_southern_english_male					12.6	11.5	10.6
uk_dialects_welsh_english_male					14.1		12.1
uk_dialects_northern_english_male					14.0	15.5	11.7
uk_dialects_scottish_english_male					15.1	10	11.3
uk_dialects_midlands_english_male					13.7	11.8	10.3
uk_dialects_northern_english_female					16.0	15	12.7
uk_dialects_scottish_english_female					15.8	13.5	12.6
uk_dialects_irish_english_male					25.8	25.5	21.9
Far-field / very noisy
voices_rm2_clo_none_stu					13.7	21.5	27
voices_rm2_far_none_lav					25.0	27.5	42.3
voices_rm4_far_none_stu					30.0	43.2	43.2
voices_rm3_clo_none_stu					28.0	28.6	40.8
voices_rm2_far_musi_stu					29.7	30.6	42.4
voices_rm2_far_babb_stu					34.7	38.5	48.2
voices_rm3_clo_musi_stu					41.3	38.1	51.8
voices_rm4_ceo_none_lav					44.5	42.9	52.5
voices_rm3_far_none_stu					70.7	68.8	81.6

Dataset	Silero	Silero	Silero	Silero	Silero	Google	Google
	xsmall_q	xsmall	small_q	small	large	Video	Phone
	EE	EE	EE	EE	EE	Premium	Premium
AudioBooks / narration
lj					5.6
librispeech_test_clean					6.1	7.8	8.7
librispeech_val					10.0	11.3	13.1
librispeech_test_other					15.2	16.2	19.1
aru					8.0	16.2	19.1
mls_test					17.1
mls_dev					15.3
Lecture / speech
multi_ted_test_he					10.2	15.3	14.1
multi_ted_test_common					15.5	16.9	16
multi_ted_val					18.4	22.7	20.8
voxpopuli_dev					19.4
voxpopuli_test					20.5
Finance
kensho					5.9
In the wild
common_voice_val					15.8	20.8	20.8
common_voice_test					20.3	22.2	24
VOIP / calls
voip_test					18.7	19.7	18.3
Dialects
uk_dialects_midlands_english_female					7.8	9.6	8.4
uk_dialects_southern_english_female					8.3	10.8	9.3
uk_dialects_welsh_english_female					8.9	20.5	10.5
uk_dialects_southern_english_male					9.2	11.5	10.6
uk_dialects_welsh_english_male					10.9		12.1
uk_dialects_northern_english_male					10.0	15.5	11.7
uk_dialects_scottish_english_male					11.1	10	11.3
uk_dialects_midlands_english_male					9.8	11.8	10.3
uk_dialects_northern_english_female					11.3	15	12.7
uk_dialects_scottish_english_female					11.7	13.5	12.6
uk_dialects_irish_english_male					21.2	25.5	21.9

Far-field / very noisy
voices_rm2_clo_none_stu					10.8	21.5	27
voices_rm2_far_none_lav					21.1	27.5	42.3
voices_rm4_far_none_stu					26.0	43.2	43.2
voices_rm3_clo_none_stu					24.1	28.6	40.8
voices_rm2_far_musi_stu					26.1	30.6	42.4
voices_rm2_far_babb_stu					31.5	38.5	48.2
voices_rm3_clo_musi_stu					38	38.1	51.8
voices_rm4_ceo_none_lav					41.3	42.9	52.5
voices_rm3_far_none_stu					69.3	68.8	81.6

EN V5

Google tests were run in early September 2020. EN V5 metrics updated in September 2021.

Dataset	Silero	Silero	Silero	Silero	Silero	Google	Google
	xsmall_q	xsmall	small_q	small	xlarge	Video	Phone
	CE	CE	CE	CE	CE	Premium	Premium
AudioBooks / narration
lj			9.2	8.4	5.9
librispeech_test_clean			11.6	10.2	6.1	7.8	8.7
librispeech_val			17.7	15.9	10.3	11.3	13.1
librispeech_test_other			24	22.2	15.7	16.2	19.1
aru			17.8	15.4	9.3	16.2	19.1
mls_test			26.2	23.9	17.9
mls_dev			23.9	21.8	16.1
Lecture / speech
multi_ted_test_he			18.3	16.7	10.3	15.3	14.1
multi_ted_test_common			25.4	23.2	16.1	16.9	16
multi_ted_val			27.2	25.5	18.8	22.7	20.8
voxpopuli_dev			22.8	21.4	17.2
voxpopuli_test			23.3	22.3	17.9
Finance
kensho			10.5	9.3	4.7
In the wild
common_voice_val			28.5	26.3	20.2	20.8	20.8
common_voice_test			33.2	30.9	24.6	22.2	24
gigaspeech_test			30.5	28.6	22.4
VOIP / calls
voip_test			19.4	19.5	18.3	19.7	18.3
Dialects
uk_dialects_midlands_english_female			19.3	17.2	9.1	9.6	8.4
uk_dialects_southern_english_female			19.6	17.5	11.2	10.8	9.3
uk_dialects_welsh_english_female			18.7	16.6	11.9	20.5	10.5
uk_dialects_southern_english_male			20.2	18.5	11.7	11.5	10.6
uk_dialects_welsh_english_male			20.6	18.9	13.4		12.1
uk_dialects_northern_english_male			23.7	21.1	12.9	15.5	11.7
uk_dialects_scottish_english_male			23	21	14.5	10	11.3
uk_dialects_midlands_english_male			24.6	23.2	13.1	11.8	10.3
uk_dialects_northern_english_female			24.4	22.3	15.7	15	12.7
uk_dialects_scottish_english_female			23.5	21.7	15.2	13.5	12.6
uk_dialects_irish_english_male			35.6	33.6	25.3	25.5	21.9
Far-field / very noisy
voices_rm2_clo_none_stu			20.7	18.2	11.5	21.5	27
voices_rm2_far_none_lav			34	30.7	22	27.5	42.3
voices_rm4_far_none_stu			37.9	34.3	26	43.2	43.2
voices_rm3_clo_none_stu			36.5	33.4	25.1	28.6	40.8
voices_rm2_far_musi_stu			39	35.8	26.5	30.6	42.4
voices_rm2_far_babb_stu			44.5	41.2	30.8	38.5	48.2
voices_rm3_clo_musi_stu			49.5	46.6	38.5	38.1	51.8
voices_rm4_ceo_none_lav			54.3	50.9	40.6	42.9	52.5
voices_rm3_far_none_stu			76	74.5	69.9	68.8	81.6

Dataset	Silero	Silero	Silero	Silero	Silero	Google	Google
	xsmall_q	xsmall	small_q	small	xlarge	Video	Phone
	EE	EE	EE	EE	EE	Premium	Premium
AudioBooks / narration
lj			6.1	5.8	5.1
librispeech_test_clean			8.3	7.5	5.5	7.8	8.7
librispeech_val			12.8	11.9	8.8	11.3	13.1
librispeech_test_other			18.6	17.3	13.5	16.2	19.1
aru			11.6	10.3	7	16.2	19.1
mls_test			20.1	18.5	14.8
mls_dev			18	16.6	13.3
Lecture / speech
multi_ted_test_he			12.9	11.9	8.4	15.3	14.1
multi_ted_test_common			20.2	18.6	14	16.9	16
multi_ted_val			22.4	21.1	16.9	22.7	20.8
voxpopuli_dev			18.6	17.9	16
voxpopuli_test			18.9	18.3	16.4
Finance
kensho			7.5	6.7	4.3
In the wild
common_voice_val			19.9	18.5	15	20.8	20.8
common_voice_test			24.4	22.9	19.2	22.2	24
gigaspeech_test			26.2	24.5	20.7
VOIP / calls
voip_test			18.7	20.2	18.3	19.7	18.3
Dialects
uk_dialects_midlands_english_female			11.9	10.8	6.8	9.6	8.4
uk_dialects_southern_english_female			12.4	11.2	8	10.8	9.3
uk_dialects_welsh_english_female			12.5	11.3	8.7	20.5	10.5
uk_dialects_southern_english_male			13.3	12.2	8.6	11.5	10.6
uk_dialects_welsh_english_male			13.8	12.9	10.1		12.1
uk_dialects_northern_english_male			14.9	13.8	9.6	15.5	11.7
uk_dialects_scottish_english_male			14.9	13.8	10.6	10	11.3
uk_dialects_midlands_english_male			15.5	14.5	9.2	11.8	10.3
uk_dialects_northern_english_female			15.9	14.9	11.2	15	12.7
uk_dialects_scottish_english_female			15.5	14.5	11.3	13.5	12.6
uk_dialects_irish_english_male			26.5	25.1	20.3	25.5	21.9
Far-field / very noisy
voices_rm2_clo_none_stu			15	13.4	9.3	21.5	27
voices_rm2_far_none_lav			27.4	24.7	18.6	27.5	42.3
voices_rm4_far_none_stu			31.3	28.4	22.5	43.2	43.2
voices_rm3_clo_none_stu			30	27.7	21.7	28.6	40.8
voices_rm2_far_musi_stu			32.5	29.7	23.1	30.6	42.4
voices_rm2_far_babb_stu			38.1	35.1	27.5	38.5	48.2
voices_rm3_clo_musi_stu			44	41.5	35.3	38.1	51.8
voices_rm4_ceo_none_lav			48.9	45.8	37.4	42.9	52.5
voices_rm3_far_none_stu			73.7	72.2	68.3	68.8	81.6

EN V6

Google tests were run in early September 2020. EN V6 metrics updated in February 2022.

Dataset	Silero	Silero	Google	Google
	small	xlarge	Video	Phone
	CE	CE	Premium	Premium
AudioBooks / narration
lj	7.7	5.8
librispeech_test_clean	10.0	6.1	7.8	8.7
librispeech_val	15.5	10.4	11.3	13.1
librispeech_test_other	21.9	15.7	16.2	19.1
aru	16.1	9.6	16.2	19.1
mls_test	23.1	17.6
mls_dev	21.1	15.9
Lecture / speech
multi_ted_test_he	15.7	9.9	15.3	14.1
multi_ted_test_common	22.5	16.0	16.9	16
multi_ted_val	23.9	18.5	22.7	20.8
voxpopuli_dev	21.0	16.8
voxpopuli_test	21.9	17.4
Finance
kensho	8.4	4.6
In the wild
common_voice_val	25.9	19.9	20.8	20.8
common_voice_test	30.4	24.4	22.2	24
gigaspeech_test	27.5	22.1
gigaspeech_2s_test	26.1	20.5
fluent_ai_speech_commands	23.6	18.6
speech_commands	17.0	15.1
VOIP / calls
voip_test	19.7	17.5	19.7	18.3
voip_val	19.3	17.8
vystadial_dev	9.3	6.1
vystadial_test	9.1	5.6
vystadial_train	9.1	6.1
Dialects
uk_dialects	19.3	13.0
uk_dialects_midlands_english_female	16.7	8.7	9.6	8.4
uk_dialects_southern_english_female	17.4	11.3	10.8	9.3
uk_dialects_welsh_english_female	16.5	11.8	20.5	10.5
uk_dialects_southern_english_male	18.2	11.9	11.5	10.6
uk_dialects_welsh_english_male	18.7	13.3		12.1
uk_dialects_northern_english_male	20.6	13.1	15.5	11.7
uk_dialects_scottish_english_male	20.8	14.7	10	11.3
uk_dialects_midlands_english_male	22.4	13.4	11.8	10.3
uk_dialects_northern_english_female	21.8	15.7	15	12.7
uk_dialects_scottish_english_female	21.5	15.5	13.5	12.6
uk_dialects_irish_english_male	33.4	25.9	25.5	21.9
cmu_arctic_val	10.5	6.2
l2arctic_arabic	30.1	24.2
l2arctic_chinese	34.1	27.5
l2arctic_hindi	19.1	14.0
l2arctic_korean	23.9	17.6
l2arctic_spanish	28.7	22.6
l2arctic_vietnamese	39.4	33.8
Far-field / very noisy
voices_rm2_clo_none_stu	17.2	11.1	21.5	27
voices_rm2_far_none_lav	30.5	21.4	27.5	42.3
voices_rm4_far_none_stu	34.3	25.5	43.2	43.2
voices_rm3_clo_none_stu	32.3	24.1	28.6	40.8
voices_rm2_far_musi_stu	35.3	25.7	30.6	42.4
voices_rm2_far_babb_stu	42.1	31.0	38.5	48.2
voices_rm3_clo_musi_stu	45.2	36.7	38.1	51.8
voices_rm4_ceo_none_lav	48.9	38.9	42.9	52.5
voices_rm3_far_none_stu	73.8	65.5	68.8	81.6

Dataset	Silero	Silero	Google	Google
	small	xlarge	Video	Phone
	EE	EE	Premium	Premium
AudioBooks / narration
lj	5.7	5.0
librispeech_test_clean	7.5	5.4	7.8	8.7
librispeech_val	11.6	8.8	11.3	13.1
librispeech_test_other	17.3	13.6	16.2	19.1
aru	10.6	7.2	16.2	19.1
mls_test	18.3	14.8
mls_dev	16.6	13.4
Lecture / speech
multi_ted_test_he	11.3	8.4	15.3	14.1
multi_ted_test_common	17.7	13.9	16.9	16
multi_ted_val	20.6	16.8	22.7	20.8
voxpopuli_dev	17.8	15.8
voxpopuli_test	18.3	16.2
Finance
kensho	6.3	4.3
In the wild
common_voice_val	18.3	14.9	20.8	20.8
common_voice_test	22.6	19.1	22.2	24
gigaspeech_test	23.6	20.6
gigaspeech_2s_test	22.1	19.1
fluent_ai_speech_commands	17.2	15.3
speech_commands	16.6	12.0
VOIP / calls
voip_test	19.6	18.2	19.7	18.3
voip_val	18.4	18.3
vystadial_dev	8.2	6.1
vystadial_test	8.3	5.8
vystadial_train	8.7	6.0
Dialects
uk_dialects	13.1	9.7
uk_dialects_midlands_english_female	10.1	6.3	9.6	8.4
uk_dialects_southern_english_female	11.6	8.2	10.8	9.3
uk_dialects_welsh_english_female	11.4	8.8	20.5	10.5
uk_dialects_southern_english_male	12.2	8.8	11.5	10.6
uk_dialects_welsh_english_male	13.1	10.2		12.1
uk_dialects_northern_english_male	13.7	9.8	15.5	11.7
uk_dialects_scottish_english_male	14.2	10.9	10	11.3
uk_dialects_midlands_english_male	14.6	9.2	11.8	10.3
uk_dialects_northern_english_female	15.2	11.3	15	12.7
uk_dialects_scottish_english_female	14.8	11.5	13.5	12.6
uk_dialects_irish_english_male	25.1	21.2	25.5	21.9
cmu_arctic_val	7.6	5.1
l2arctic_arabic	23.1	19.4
l2arctic_chinese	26.8	22.4
l2arctic_hindi	14.1	11.3
l2arctic_korean	17.5	13.9
l2arctic_spanish	22.1	18.5
l2arctic_vietnamese	32.1	28.3
Far-field / very noisy
voices_rm2_clo_none_stu	13.1	9.3	21.5	27
voices_rm2_far_none_lav	25.2	18.5	27.5	42.3
voices_rm4_far_none_stu	29.0	22.4	43.2	43.2
voices_rm3_clo_none_stu	27.2	21.1	28.6	40.8
voices_rm2_far_musi_stu	30.0	22.6	30.6	42.4
voices_rm2_far_babb_stu	37.1	28.0	38.5	48.2
voices_rm3_clo_musi_stu	40.7	33.7	38.1	51.8
voices_rm4_ceo_none_lav	44.5	36.0	42.9	52.5
voices_rm3_far_none_stu	71.8	63.9	68.8	81.6

DE V1

All of these tests were run in early September 2020.

At the time of this test, no premium model was available from Google. There were several models for several regions, but since the differences between them were minor, we chose the default German model.

Dataset	CE	EE	Google
AudioBooks
de_caito_manifest_val	12.5	8.7	19.5

Narration
de_voxforge_manifest_val	3.8	2.3	5.9

In the wild
de_common_voice_test_manifest	28.0	17.6	16.1
de_common_voice_val_manifest	24.9	15.0	14.0
de_telekinect_dev_manifest	28.1	18.6	13.5
de_telekinect_test_manifest	28.3	19.4	15.7

DE V3

All of these tests were run in early September 2020.

At the time of this test, no premium model was available from Google. There were several models for several regions, but since the differences between them were minor, we chose the default German model.

Dataset	CE	EE	Google
Books
de_mls_test	19.5	15.0	N/A
de_mls_val	16.6	12.7	N/A

Narration
de_voxforge_manifest_val	7.4	5.2	5.9

Public speech
de_voxpopuli_dev	27.0	24.6	N/A
de_voxpopuli_test	25.0	22.8	N/A

In the wild
de_common_voice_test_manifest	21.0	14.3	16.1
de_common_voice_val_manifest	18.8	12.5	14.0
de_telekinect_dev_manifest	16.6	11.6	13.5
de_telekinect_test_manifest	17.3	12.1	15.7

DE V4

Google tests were run in early September 2020.

At the time of this test, no premium model was available from Google. There were several models for several regions, but since the differences between them were minor, we chose the default German model.

Dataset	CE	EE	Google
Books
de_mls_test	16.3	12.8	N/A
de_mls_val	13.3	10.5	N/A

Narration
de_voxforge_val	5.8	4.4	5.9

Public speech
de_voxpopuli_dev	26.3	23.8	N/A
de_voxpopuli_test	24	21.6	N/A

In the wild
de_common_voice_test	20.6	14.1	16.1
de_common_voice_val	18.4	12.3	14
de_telekinect_dev	16.2	11.3	13.5
de_telekinect_test	16.4	12	15.7

ES V1

All of these tests were run in early September 2020.

For Spanish, we chose the region (US) where a Premium model was available. Judging by the benchmark results, Google relies heavily on the data it sources from Android, most likely because of the large population and lighter regulation. Note that most "dialect" recordings are quite clean, but pronunciation varies.

Dataset	CE	EE	Google	Google Phone Premium
AudioBooks
es_caito_val	7.7	5.7	20.3	22.3

Narration
es_voxforge_val	1.4	1.1	18.1	19.4

In the wild
es_common_voice_test	22.0	14.4	27.2	23.1
es_common_voice_val	20.1	13.0	24.5	19.6

Dialects
es_dialects_argentinian_val	19.0	12.9	11.8	6.7
es_dialects_chilean_val	19.8	13.7	8.9	6.6
es_dialects_columbian_val	18.4	11.9	7.8	5.4
es_dialects_peruvian_val	14.4	9.1	6.2	4.7
es_dialects_puerto_rico_val	21.1	14.5	7.9	6.0
es_dialects_venezuela_val	19.2	13.2	8.2	6.4

TTS Models

RU V1

We decided to keep the quality assessment really simple: we generated audio from the validation subsets of our data (~200 files per speaker), shuffled it with the original recordings of the same speakers, and gave the mix to a group of 24 assessors to rate the sound quality on a five-point scale. Scores for 8 kHz and 16 kHz were collected separately (both for synthesized and original speech). For simplicity we used the following grades: [1, 2, 3, 4-, 4, 4+, 5-, 5]; the higher the quality, the more detailed the scale. Then, for each speaker, we simply calculated the mean.

In total people scored audios 37,403 times. 12 people annotated the whole dataset. 12 other people managed to annotate from 10% to 75% of audios. For each speaker we calculated mean (standard deviation is shown in brackets). We also tried first calculating median scores for each audio and then averaging them. But this just increases the mean values without affecting the ratios, so we just used plain averages in the end. The key metric here of course is the ratio between the mean score for synthesis vs the original audio. Some users had much lower scores overall (hence high dispersion), but we decided to keep all scores as is without cleaning outliers.

Speaker	Original	Synthesis	Ratio	Examples
aidar_8khz	4.67 (.45)	4.52 (.55)	96.8%	link
baya_8khz	4.52 (.57)	4.25 (.76)	94.0%	link
kseniya_8khz	4.80 (.40)	4.54 (.60)	94.5%	link
aidar_16khz	4.72 (.43)	4.53 (.55)	95.9%	link
baya_16khz	4.59 (.55)	4.18 (.76)	91.1%	link
kseniya_16khz	4.84 (.37)	4.54 (.59)	93.9%	link
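The Ratio column is the mean synthesis score divided by the mean original score. A minimal sketch for two rows of the table follows; the dictionary layout and function name are illustrative only.

```python
# (mean original score, mean synthesis score), copied from the table above.
mean_scores = {
    "aidar_8khz": (4.67, 4.52),
    "baya_16khz": (4.59, 4.18),
}


def synthesis_ratio(original: float, synthesis: float) -> float:
    """Synthesis score as a percentage of the original recording's score."""
    return 100 * synthesis / original


ratios = {
    speaker: round(synthesis_ratio(original, synthesis), 1)
    for speaker, (original, synthesis) in mean_scores.items()
}
print(ratios)
```

Recomputing the other rows from the two-decimal means shown in the table can differ from the published ratio by about 0.1, presumably because the published ratios were computed from unrounded means.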

We asked our assessors to rate the "naturalness of the speech" (not the audio quality). Nevertheless, we were surprised that, anecdotally, people cannot tell 8 kHz from 16 kHz on their everyday devices (which the metrics also confirm). Baya has the lowest absolute and relative scores. Kseniya has the highest absolute scores, and Aidar has the highest relative scores. Baya also has higher score dispersion.

Manually inspecting audios with high score dispersion reveals several patterns. Speaker errors, Tacotron errors (pauses), proper names and hard-to-read words are the most common causes. Of course, 75% of such differences are in synthesized audios, and the sampling rate does not seem to affect this.

We tried to rate "naturalness", but it is only natural to try to estimate "unnaturalness" or "robotness" as well. It can be measured by asking people to choose between two audios. We went one step further and essentially applied a double-blind test: we asked our assessors to rate the same audio 4 times in random order, as the original and as synthesis, each at both sampling rates. For the assessors who annotated the whole dataset we calculated the following table:

Comparison	Worse	Same	Better
16k vs 8k, original	957	4811	1512
16k vs 8k, synthesis	1668	4061	1551
Original vs synthesis, 8k	816	3697	2767
Original vs synthesis, 16k	674	3462	3144

Several conclusions can be drawn:

In 66% of cases people cannot hear the difference between 8k and 16k;
In synthesis, 8k helps to hide some errors;
In about 60% of cases synthesis is the same as or better than the original;
The last two conclusions hold regardless of the sampling rate, with 8k having a slight advantage;
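The percentages in these conclusions can be recomputed from the comparison table. The sketch below assumes that Worse / Same / Better describe the first-named item relative to the second, so "synthesis is the same or better" corresponds to the Worse and Same columns of the "Original vs synthesis" rows. That reading is an assumption; it is the one that reproduces the figures quoted above.

```python
# (worse, same, better) counts copied from the comparison table.
comparisons = {
    "16k vs 8k, original": (957, 4811, 1512),
    "16k vs 8k, synthesis": (1668, 4061, 1551),
    "Original vs synthesis, 8k": (816, 3697, 2767),
    "Original vs synthesis, 16k": (674, 3462, 3144),
}


def share(part: int, counts: tuple) -> float:
    """Percentage of all ratings in a row that fall into `part`."""
    return 100 * part / sum(counts)


# Share of cases where original 16k and 8k audio were rated the same.
no_difference = share(comparisons["16k vs 8k, original"][1],
                      comparisons["16k vs 8k, original"])

# Share of cases where the original was rated worse than or equal to synthesis.
synthesis_not_worse = {}
for rate in ("8k", "16k"):
    worse, same, better = comparisons[f"Original vs synthesis, {rate}"]
    synthesis_not_worse[rate] = share(worse + same, (worse, same, better))

print(round(no_difference, 1), synthesis_not_worse)
```

Under this reading the 8k figure comes out a few points above the 16k one, which matches the remark about 8k having a slight advantage.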

You can hear for yourself how it sounds, both for our unique voices and for speakers from external sources (more audio for each speaker can be synthesized in the colab notebook in our repo).

TE Models

Contrary to popular trends, we aim to provide metrics that are as detailed, informative and honest as possible. In this particular case, we used the following datasets for validation:

Validation subsets of our private text corpora (5,000 sentences per language);
Audiobooks, we use the caito dataset, which has texts in all the languages the model was trained on (20,000 random sentences for each language);

We use the following metrics:

WER (word error rate) as a percentage, calculated separately for repunctuation, WER_p (both sentences are lowercased), and for recapitalization, WER_c (all punctuation marks are removed);
Precision / recall / F1 to check the quality of classification (i) between the space and the punctuation marks .,-!?— and (ii) for the restoration of capital letters, between three token classes: all lowercase / starts with a capital / all caps. We also provide confusion matrices for visualization;
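To make the two WER variants concrete, here is a minimal sketch under stated assumptions: tokens are split on whitespace, punctuation stays attached to the preceding word for WER_p, and the punctuation set is the one shown in the table headers further down. The helper names are hypothetical, and the authors' actual tokenization may differ.

```python
PUNCTUATION = ".,-!?—"  # assumed set, taken from the table headers
_STRIP_PUNCTUATION = str.maketrans("", "", PUNCTUATION)


def word_errors(ref_words: list, hyp_words: list) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            curr.append(min(prev[j] + 1,                           # deletion
                            curr[j - 1] + 1,                       # insertion
                            prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = curr
    return prev[-1]


def wer_percent(ref_words: list, hyp_words: list) -> float:
    return 100 * word_errors(ref_words, hyp_words) / len(ref_words)


def wer_p(reference: str, prediction: str) -> float:
    """Repunctuation WER: case is ignored, punctuation is kept."""
    return wer_percent(reference.lower().split(), prediction.lower().split())


def wer_c(reference: str, prediction: str) -> float:
    """Recapitalization WER: punctuation is dropped, case is kept."""
    return wer_percent(reference.translate(_STRIP_PUNCTUATION).split(),
                       prediction.translate(_STRIP_PUNCTUATION).split())


reference = "Hello, world. How are you?"
prediction = "Hello world. how are you?"
# One missing comma counts towards WER_p, one missing capital towards WER_c.
print(wer_p(reference, prediction), wer_c(reference, prediction))
```

Separating the two metrics this way keeps a capitalization mistake from being counted as a punctuation mistake and vice versa.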

Results

For the correct and informative metrics calculation, the following transformations were applied to the texts beforehand:

Punctuation characters other than .,-!?— were removed;
Punctuation at the beginning of a sentence was removed;
In case of multiple consecutive punctuation marks, only the first one was kept;
For Spanish, ¿ and ¡ were discarded from the model predictions because they do not appear in the texts of the books, although in general the model does place them;
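The transformations above could be sketched roughly as follows. This is an illustration, not the authors' preprocessing code: the kept punctuation set is taken from the table headers further down, the list of "other" punctuation (ASCII punctuation plus a few typographic marks) is an assumption, and apostrophes are deliberately left alone so that contractions survive.

```python
import re
import string

KEPT = ".,-!?—"  # marks the metrics are computed on (assumed from the tables)

# Characters to delete. The exact list used by the authors is not given.
_OTHER = "".join(
    ch for ch in string.punctuation + "«»“”„…¿¡" if ch not in KEPT + "'"
)
_DELETE_OTHER = str.maketrans("", "", _OTHER)

_KEPT_CLASS = "[" + re.escape(KEPT) + "]"
_RUN_OF_MARKS = re.compile("(" + _KEPT_CLASS + ")" + _KEPT_CLASS + "+")
_LEADING_MARKS = re.compile(r"^(?:\s|" + _KEPT_CLASS + ")+")


def clean_text(text: str) -> str:
    text = text.translate(_DELETE_OTHER)    # drop all other punctuation
    text = _RUN_OF_MARKS.sub(r"\1", text)   # keep the first of consecutive marks
    text = _LEADING_MARKS.sub("", text)     # no punctuation at the start
    return text.strip()


print(clean_text('"...Wait?! (really)"'))
print(clean_text("¿Qué tal?"))
```

In this sketch the Spanish inverted marks are removed from every input; in the setup described above they were only discarded from the model predictions, since the book texts did not contain them.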

EN DE RU ES V2

WER_p / WER_c are specified in the cells below. The baseline metrics are calculated for a naive algorithm that starts the text with a capital letter and ends it with a full stop.
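For reference, the naive baseline described here amounts to something like the sketch below. The function name is hypothetical and the input is assumed to be lowercase text without punctuation.

```python
def naive_baseline(text: str) -> str:
    """Capitalize the first character and append a full stop."""
    text = text.strip()
    if not text:
        return text
    return text[0].upper() + text[1:] + "."


baseline_output = naive_baseline("hello world how are you")
print(baseline_output)
```

Everything inside the text (commas, question marks, proper names) is left untouched, which is why the baseline error grows with the amount of internal punctuation and capitalization.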

Metrics on Paragraphs

Domain - validation data:

			Languages
	en	de	ru	es
baseline	14 / 19	13 / 41	17 / 20	10 / 16
model	6 / 6	5 / 5	7 / 7	5 / 5

Domain - books:

			Languages
	en	de	ru	es
baseline	14 / 13	15 / 26	23 / 14	13 / 8
model	12 / 7	11 / 8	18 / 10	12 / 6

Metrics on Sentences

Domain - validation data:

			Languages
	en	de	ru	es
baseline	12 / 18	10 / 33	13 / 12	8 / 11
model	5 / 4	5 / 4	7 / 4	5 / 4

Domain - books:

			Languages
	en	de	ru	es
baseline	12 / 10	12 / 22	19 / 9	15 / 7
model	12 / 6	10 / 6	17 / 7	13 / 5

EN DE RU ES V1

Metrics on Sentences

WER

WER_p / WER_c are specified in the cells below. The baseline metrics are calculated for a naive algorithm that starts the sentence with a capital letter and ends it with a full stop.

Domain - validation data:

			Languages
	en	de	ru	es
baseline	20 / 26	13 / 36	18 / 17	8 / 13
model	8 / 8	7 / 7	13 / 6	6 / 5

Domain - books:

			Languages
	en	de	ru	es
baseline	14 / 13	13 / 22	20 / 11	14 / 7
model	14 / 8	11 / 6	21 / 7	13 / 6

Precision / Recall / F1

Domain - validation data:

Metric	' '	.	,	-	!	?	—
			en
precision	0.98	0.97	0.78	0.91	0.80	0.89	nan
recall	0.99	0.98	0.64	0.75	0.67	0.78	nan
f1	0.98	0.98	0.71	0.82	0.73	0.84	nan
			de
precision	0.98	0.98	0.86	0.81	0.74	0.90	nan
recall	0.99	0.99	0.68	0.60	0.70	0.71	nan
f1	0.99	0.98	0.76	0.69	0.72	0.79	nan
			ru
precision	0.98	0.97	0.80	0.90	0.80	0.84	0
recall	0.98	0.99	0.74	0.70	0.58	0.78	nan
f1	0.98	0.98	0.77	0.78	0.67	0.81	nan
			es
precision	0.98	0.96	0.70	0.74	0.85	0.83	0
recall	0.99	0.98	0.60	0.29	0.60	0.70	nan
f1	0.98	0.98	0.64	0.42	0.70	0.76	nan

Metric	a	A	AAA
		en
precision	0.98	0.94	0.97
recall	0.99	0.91	0.70
f1	0.98	0.92	0.81
		de
precision	0.99	0.98	0.89
recall	0.99	0.98	0.53
f1	0.99	0.98	0.66
		ru
precision	0.99	0.96	0.99
recall	0.99	0.92	0.99
f1	0.99	0.94	0.99
		es
precision	0.99	0.95	0.98
recall	0.99	0.90	0.82
f1	0.99	0.92	0.89

Domain - books:

Metric	' '	.	,	-	!	?	—
			en
precision	0.96	0.80	0.59	0.82	0.23	0.39	nan
recall	0.99	0.73	0.23	0.13	0.58	0.85	0
f1	0.97	0.77	0.33	0.22	0.33	0.53	nan
			de
precision	0.97	0.75	0.80	0.55	0.21	0.41	nan
recall	0.99	0.71	0.49	0.35	0.58	0.67	0
f1	0.98	0.73	0.61	0.43	0.30	0.51	nan
			ru
precision	0.97	0.77	0.69	0.90	0.17	0.49	0
recall	0.98	0.60	0.55	0.61	0.68	0.75	nan
f1	0.98	0.68	0.61	0.72	0.28	0.60	nan
			es
precision	0.96	0.57	0.59	0.96	0.30	0.24	nan
recall	0.98	0.70	0.29	0.02	0.40	0.68	0
f1	0.97	0.63	0.38	0.04	0.34	0.36	nan

Metric	a	A	AAA
		en
precision	0.99	0.80	0.94
recall	0.98	0.89	0.95
f1	0.98	0.85	0.94
		de
precision	0.99	0.90	0.77
recall	0.98	0.94	0.62
f1	0.98	0.92	0.70
		ru
precision	0.99	0.81	0.82
recall	0.99	0.87	0.96
f1	0.99	0.84	0.89
		es
precision	0.99	0.71	0.45
recall	0.98	0.82	0.91
f1	0.98	0.76	0.60

As the tables show, the dash (—) values stayed empty even for Russian: on the data used for calculating the metrics, the model either did not place it at all or replaced it with some other symbol. It seems to be placed more reliably when the sentence has the form of a definition.
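The nan cells in the tables are consistent with zero denominators: precision is undefined when a class is never predicted, and F1 is undefined when precision or recall is. How the authors actually computed the tables is not stated, so the sketch below is only an illustration of that mechanism with hypothetical names and toy labels.

```python
import math
from collections import Counter


def per_class_prf(true_labels: list, predicted_labels: list) -> dict:
    """Per-class (precision, recall, F1); nan where a denominator is zero."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == prediction:
            tp[truth] += 1
        else:
            fp[prediction] += 1
            fn[truth] += 1

    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else math.nan

    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        precision = ratio(tp[label], tp[label] + fp[label])
        recall = ratio(tp[label], tp[label] + fn[label])
        f1 = ratio(2 * precision * recall, precision + recall)
        scores[label] = (precision, recall, f1)
    return scores


# Toy data: what follows each token (space or a punctuation mark).
truth = [" ", ",", "—", ".", " "]
predicted = [" ", " ", ",", ".", " "]  # the dash is never predicted
scores = per_class_prf(truth, predicted)
print(scores["—"])
```

Here the dash class gets precision nan, recall 0 and F1 nan, the same pattern as several dash columns in the books tables.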
