Here we can plan the next releases of Tesseract.
Here are some ideas for future Tesseract releases.
-
Modernize the code using C++11 (see discussions here and here).
-
Use LLVM's tools: clang-format, clang-tidy, scan-build, sanitizers.
-
Replace more Tesseract data types by C++ standard types (GenericVector, ...), especially for the API.
-
Add a JSON (or XML) output format. It would be used for full OCR and for psm 2 (layout info only).
-
Add option to use alternative binarization methods from leptonica.
-
Add an option to output separate files for multipage input (out1.hocr, out2.hocr ...).
-
Add a multi-threading option to the command line (OpenMP will be disabled at runtime in this mode).
-
Explore the option to use Protocol Buffers or FlatBuffers for the traineddata.
-
Improve error handling and don't ignore return values from functions (see discussion).
-
Replace tprintf etc. with a more advanced logging API that supports log levels.
Requirements (see also discussion):
Log levels:
- trace
- debug
- info
- warning
- error
- fatal
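As a rough illustration of the log levels listed above, the following is a minimal C++11 sketch of a level-filtered logger. The names `LogLevel`, `LogLevelName` and `Logger` are hypothetical and are not part of the Tesseract code base. The sketch only shows the filtering idea, not a proposed API.

```cpp
#include <cstdio>
#include <string>

// Hypothetical log levels, ordered from least to most severe.
enum class LogLevel { kTrace = 0, kDebug, kInfo, kWarning, kError, kFatal };

// Returns a human-readable name for a level.
inline const char* LogLevelName(LogLevel level) {
  switch (level) {
    case LogLevel::kTrace:   return "trace";
    case LogLevel::kDebug:   return "debug";
    case LogLevel::kInfo:    return "info";
    case LogLevel::kWarning: return "warning";
    case LogLevel::kError:   return "error";
    case LogLevel::kFatal:   return "fatal";
  }
  return "unknown";
}

// Hypothetical logger: messages below the threshold are dropped.
class Logger {
 public:
  explicit Logger(LogLevel threshold) : threshold_(threshold) {}

  // True if a message with this level would be emitted.
  bool Enabled(LogLevel level) const {
    return static_cast<int>(level) >= static_cast<int>(threshold_);
  }

  // Builds "[level] message"; returns an empty string if filtered out.
  std::string Format(LogLevel level, const std::string& message) const {
    if (!Enabled(level)) return std::string();
    return std::string("[") + LogLevelName(level) + "] " + message;
  }

  // Writes the formatted line to stderr; returns true if it was written.
  bool Log(LogLevel level, const std::string& message) const {
    const std::string line = Format(level, message);
    if (line.empty()) return false;
    std::fprintf(stderr, "%s\n", line.c_str());
    return true;
  }

 private:
  LogLevel threshold_;
};
```

With a threshold of `LogLevel::kWarning`, a call such as `logger.Log(LogLevel::kInfo, "...")` is silently dropped, while warning, error and fatal messages are written to stderr.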
Related issues:
Useful links:
See the release notes.
See also the discussion for issue #1423.
-
Issues with the "bug" label (see list here)
-
Noise characters recognized with bbox as the entire page #1192
-
Segmentation fault when using integer models for LSTM training #1573
-
Report a warning when the Tesseract initialisation code detects an unsupported locale setting. (See comment)
-
Insufficient error message when the output file cannot be created (see issue 1424)
-
“no best words!!” on mixed language (fra+ara) items (see issue 235)
-
mgr_.Init(traineddata_path.c_str()):Error:Assert failed: #1075 (see issue 1075)
-
https://github.com/zdenop/tessdata_downloader
Script for installing only selected languages from GitHub (see issue)
Depending on available resources and opinions, these suggestions will either be added to the planning for the next or a future release or abandoned.
-
Enhance --list-langs to show additional information for scripts and languages like legacy / LSTM, version
This will make the command slower, because each file must be opened and parsed. Add this as --list-langs-details or as --list-lang-details for one language file based on lang-code?
-
tessedit_load_sublangs should search for the sublangs relative to the parent, not starting in tessdata dir.
-
In addition to the current proprietary format Tesseract could also support ZIP archives (see discussion).
A possible implementation using libarchive is available, but needs more testing.
-
"Training light" - Learning by doing (see issue)
-
Modify text2image to use PrepareDistortedPix() #1052
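The tessedit_load_sublangs suggestion in the list above amounts to resolving a sub-language next to the file that requested it. A minimal C++11 sketch of that path logic follows. The helpers `DirName` and `SubLangPath` are hypothetical, and the assumption that a sub-language lives in a file named `<lang>.traineddata` beside its parent is made only for illustration.

```cpp
#include <string>

// Returns the directory part of a path, including the trailing separator,
// or an empty string if the path contains no separator.
inline std::string DirName(const std::string& path) {
  const std::string::size_type pos = path.find_last_of("/\\");
  return pos == std::string::npos ? std::string() : path.substr(0, pos + 1);
}

// Hypothetical helper: builds the path of a sub-language file relative to
// the parent traineddata file instead of the top-level tessdata directory.
inline std::string SubLangPath(const std::string& parent_traineddata,
                               const std::string& sublang) {
  return DirName(parent_traineddata) + sublang + ".traineddata";
}
```

For a parent at `tessdata/script/Hebrew.traineddata` and a sub-language `Latin`, this yields `tessdata/script/Latin.traineddata` rather than `tessdata/Latin.traineddata`.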
Tesseract 4.0 should be a full replacement for Tesseract 3.05 and have the same features when used with the old OCR engine (--oem 0). The following regressions still need verification (are they really regressions, or are they just missing features for LSTM?):
These features still work with the old OCR engine (--oem 0), but are missing and desired for LSTM.
-
Black list / White list (see issue). Here is a workaround. Fixed in 4.1.0.
-
hOCR font info (See comment)
Here we collect important issues and features for the release(s) following 4.0.0.
-
New LSTM-based OSD detector (see comment).
-
Remove Legacy Tesseract Engine (see issue)
-
Better Multi-language implementation for training (See comment)
-
ARM SIMD support for dot product #519
-
Using OpenMP for dot product #983
-
This does not include OpenCL or the old Tesseract engine.
-
Tesseract creates output for missing input (see issue 1023).
Mostly solved, but could be improved.
-
Issue 1353: Patch for /training/tessopt.cpp (see pull request 13)
It looks like it is not possible to run more than one training in the same process. The pull request describes a possible fix, but does not include a complete implementation (low priority).