Tesseract is extremely flexible, if you know how to control it. There is a large number of control parameters to modify its behaviour. While these change from time to time, most of them are fairly stable. List of all parameters with default value and short description can be retrieved with:
tesseract --print-parameters
There are 3 different types:
Characterized by INIT in its initialization macro.
These parameters can only be set at the TessBaseAPI::Init
function that takes a list of config files.
NOTE: You can't change init only parameter with tesseract executable option -c
.
The rest can be set through TessBaseAPI::SetVariable
and make 2 further groups:
Control many different aspects of Tesseract's functionality.
Contain debug in their name, control huge amounts of optional debug text and graphical output as Tesseract works.
Note that the default value may change; check the source code if you need to be sure of it.
Name | Type | Default value | Init only | Description |
---|---|---|---|---|
load_system_dawg |
boolean (0/1) | 1 | Yes | Controls whether or not to load the main dictionary for the selected language. |
user_words_suffix |
string | "" | Yes | The extension of the users-words word list file. If non-empty, it will attempt to load the relevant list of words to add to the dictionary for the selected language. Eg if set to user-words Tesseract will attempt to load eng.user-words from the tessdata directory at initialization time. |
language_model_penalty_non_dict_word |
double (0-1) | 0.15 | No | The penalty to apply to words not in the word_dawg / user_words wordlists. |
language_model_penalty_non_freq_dict_word |
double (0-1) | 0.1 | No | The penalty to apply to words not in the freq_dawg wordlist. |
Some Japanese tesseract user found these parameters helpful for increasing tesseract-ocr (3.02) accuracy for Japanese :
Name | Suggested value | Description |
---|---|---|
chop_enable | T | Chop enable. |
use_new_state_cost | F | Use new state cost heuristics for segmentation state evaluation |
segment_segcost_rating | F | Incorporate segmentation cost in word rating? |
enable_new_segsearch | 0 | Enable new segmentation search path. It could solve the problem of dividing one character to two characters |
language_model_ngram_on | 0 | Turn on/off the use of character ngram model. |
textord_force_make_prop_words | F | Force proportional word segmentation on all rows. |
edges_max_children_per_outline | 40 | Max number of children inside a character outline. Increase this value if some of KANJI characters are not recognized (rejected). |