# Model Adaptation
We often discuss model adaptation / tuning for particular use cases with our clients. Generally, our tech makes it possible to do this without losing the generalization of the core solution (which is paramount, because you want your solution to be resilient).
Let's start off with a brief discussion of generalization. There are 2 kinds of STT systems:
- Ones that work relatively well out-of-the box on any reasonable data;
- Ones that do not work on a "cold start", i.e. without training acoustic or language models;
Of course it is impossible to be 100% domain / codec / noise / vocabulary agnostic. But it is important to understand that "quality" heavily depends on the noise level and domain peculiarities. We have often seen people wrongly compare general domain-agnostic solutions with solutions heavily tuned to particular domains. A domain-tailored model automatically gets a boost of 5-10 WER percentage points precisely because it gives up its generality.
In a nutshell, solution `A` may start off with 30% WER as-is, while solution `B` may not work out-of-the-box at all but show 25% WER after some investment of time and effort. This does not necessarily mean that `A > B` or `B > A`, but it most likely means that `B` is much more fragile than `A`.
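As a reminder, the WER figures throughout this page are the standard word error rate: the word-level edit distance (substitutions, insertions, deletions) between the model output and the reference transcript, divided by the number of reference words. A minimal self-contained sketch, just to pin down the metric (any standard WER implementation will do):

```python
# Word error rate (WER): word-level edit distance divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer('please send a taxi to the main square',
          'please send taxi to main square'))  # 2 errors / 8 words = 0.25
```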
When we thoroughly researched the market for Russian STT, we noticed that the available solutions either (i) do not work from a cold start or (ii) are prohibitively priced and come with strings attached. It goes without saying that no solution provider ever bothered to publish any decent quality measurements. Our design philosophy is that our models should try to work at least decently on all domains.
Also, if you try our CE models, you may attach too much weight to the fact that they occasionally produce visually unpleasing outputs. In fact, this is merely a distinction between our CE and EE tiers. Read on to see how best to deal with these artifacts.
Without further ado, here are the 4 approaches to doing this with our EE models:
Approach | Costs | WER Reduction |
---|---|---|
Term Dictionary | 💵 | 1-2 percentage points (i.e. 20% => 18%) |
Secondary LM | 💵 | 4-5 percentage points (i.e. 20% => 15%) |
Audio annotation | 💵 💵 💵 | 8-10 percentage points (i.e. 20% => 10%) |
Custom heuristics | Ranging from 💵 to 💵 💵 💵 💵 | It depends |
## Term Dictionary

Sometimes your domain vocabulary contains custom words or phrases that are very rare otherwise and have no well-established spelling, but that nevertheless appear frequently in your case. We can simply add such vocabularies to our EE system. This greatly improves perceptual quality, but does not affect overall quality much.

In the case of taxi calls, we could shave off 1-2 pp of WER in each region using this trick.
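Purely as an illustration of the idea (the dictionary layout, the place names and the post-processing interface below are made up; the EE system consumes term dictionaries internally), a term dictionary boils down to mapping frequent mis-recognitions of rare domain terms to a canonical spelling:

```python
import re

# Toy term dictionary: canonical spelling -> typical ASR output variants.
TERM_DICTIONARY = {
    'sheremetyevo': ['sheremetevo', 'sheremetyeva'],
    'domodedovo':   ['domodedova', 'damadedovo'],
}

def apply_term_dictionary(transcript: str) -> str:
    # Replace each known variant with its canonical spelling.
    for canonical, variants in TERM_DICTIONARY.items():
        for variant in variants:
            transcript = re.sub(rf'\b{re.escape(variant)}\b', canonical, transcript)
    return transcript

print(apply_term_dictionary('taxi to sheremetevo terminal b'))
# taxi to sheremetyevo terminal b
```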
## Secondary LM

When analyzing quality on one of the domains (finance), we noticed that just by adding a dictionary and a secondary LM we could shave off an additional 4-5 pp of WER without any major investment in annotation.
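Conceptually, a secondary LM rescores the decoder's N-best hypotheses with a model trained on in-domain text. The scores, weight and toy unigram LM below are invented for illustration; in practice this would be a proper n-gram or neural LM:

```python
# Toy in-domain unigram log-probabilities; stands in for a real secondary LM.
DOMAIN_UNIGRAMS = {'transfer': -2.0, 'funds': -2.5, 'fans': -6.0, 'account': -2.2}

def lm_score(text: str) -> float:
    return sum(DOMAIN_UNIGRAMS.get(w, -8.0) for w in text.split())

def rescore(nbest, lm_weight=0.5):
    # nbest: list of (hypothesis, acoustic log-probability) pairs from the decoder
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))

nbest = [
    ('transfer the fans to my account', -10.0),   # slightly better acoustically
    ('transfer the funds to my account', -10.5),
]
print(rescore(nbest)[0])  # the domain LM flips the choice to "funds"
```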
## Audio Annotation

By annotating less than 100 hours of audio (and applying other optimizations) we could reduce WER from 20% to 12% on taxi-hailing calls.
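The annotation itself is just (audio, transcript) pairs; a hypothetical manifest layout (the actual format used in a given pipeline may differ) could look like this:

```python
import csv

# Hypothetical annotation manifest: audio path, transcript, duration in seconds.
rows = [
    ('calls/0001.wav', 'taxi to the airport please', 6.4),
    ('calls/0002.wav', 'pick me up at the main square', 7.1),
]

with open('manifest.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['audio_path', 'transcript', 'duration_s'])
    writer.writerows(rows)

total_hours = sum(duration for _, _, duration in rows) / 3600
print(f'annotated so far: {total_hours:.4f} hours')  # track progress over time
```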
## Custom Heuristics

In general it depends, but we commonly came across 2 main types of solutions (see the sketch after this list):

- Playing with the metadata that you already store;
- Parsing the multiple hypotheses that our EE models can produce;
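A hedged sketch combining both ideas (all names and data below are invented): when the EE model returns several hypotheses, metadata you already store can be used to pick the most plausible one:

```python
# Pick the hypothesis that best overlaps with metadata attached to the call
# (here: terms from a taxi order). Falls back to the first, top-scoring
# hypothesis on a tie, since max() returns the first maximal element.
def pick_hypothesis(hypotheses, known_terms):
    def overlap(text):
        words = set(text.lower().split())
        return sum(term in words for term in known_terms)
    return max(hypotheses, key=overlap)

hypotheses = [
    'send the car to lenina street',   # top decoder hypothesis
    'send the car to lunina street',
]
order_metadata = {'lunina', 'street'}
print(pick_hypothesis(hypotheses, order_metadata))  # picks the "lunina" variant
```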
In real large-scale applications it is really simple: you should apply all of these methods at the same time. They just have different time scales and returns on effort.