Output statistics about word confidence values
The hOCR output of Tesseract contains a confidence value for each word in the
x_wconf property (values range from 0 to 99).
For each file (page), it can be useful to see how many words have high
confidence and how many have low confidence.
This bash script goes over all hOCR files in the current directory and prints
10 values per file: the number of words whose confidence value starts with each
digit. The first value is the number of words whose confidence starts with 0,
which means that the confidence lies in the range 0 to 09. This continues in
the same manner, so the last value is the number of words whose confidence
starts with 9, i.e. the confidence must be in the range 90 to 99.
#!/bin/bash
# For every hOCR file, count the words per leading digit of x_wconf.
for f in *.hocr; do
    conf0=$(grep -o "x_wconf 0" "$f" | wc -l)
    conf1=$(grep -o "x_wconf 1" "$f" | wc -l)
    conf2=$(grep -o "x_wconf 2" "$f" | wc -l)
    conf3=$(grep -o "x_wconf 3" "$f" | wc -l)
    conf4=$(grep -o "x_wconf 4" "$f" | wc -l)
    conf5=$(grep -o "x_wconf 5" "$f" | wc -l)
    conf6=$(grep -o "x_wconf 6" "$f" | wc -l)
    conf7=$(grep -o "x_wconf 7" "$f" | wc -l)
    conf8=$(grep -o "x_wconf 8" "$f" | wc -l)
    conf9=$(grep -o "x_wconf 9" "$f" | wc -l)
    # One line per file: name, then the ten counts from lowest to highest.
    echo "$f" "$conf0" "$conf1" "$conf2" "$conf3" "$conf4" "$conf5" "$conf6" "$conf7" "$conf8" "$conf9"
done
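Because the script above matches on the leading digit only, a single-digit value written without a leading zero (for example x_wconf 7) would be counted under that digit instead of in the lowest range. If your files contain such values, a numeric variant may be safer. The following is only a sketch: bucket_wconf is a hypothetical helper name, and folding a value of 100 into the top range is an assumption, not something the original script does. It prints the same columns as the script above.

```shell
#!/bin/bash
# Sketch: bucket x_wconf values numerically instead of by leading digit.
# bucket_wconf is a hypothetical helper name, not part of the original script.
bucket_wconf() {
    # Print the file name followed by ten counts (0-9, 10-19, ..., 90-99).
    grep -oE 'x_wconf [0-9]+' "$1" | awk -v name="$1" '
        {
            b = int($2 / 10)      # 7 -> bucket 0, 93 -> bucket 9
            if (b > 9) b = 9      # assumption: fold a value of 100 into the top bucket
            count[b]++
        }
        END {
            line = name
            for (i = 0; i <= 9; i++) line = line " " (count[i] + 0)
            print line
        }'
}

for f in *.hocr; do
    [ -e "$f" ] || continue   # skip the literal pattern when no file matches
    bucket_wconf "$f"
done
```

The integer division by 10 replaces the ten separate grep calls, so each file is read once.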
The output looks like this and can be further analyzed with Excel or some other tool:
481659978_08_0458.hocr 0 0 0 0 0 0 1 3 25 296
481659978_08_0459.hocr 0 0 0 0 0 0 0 1 22 311
481659978_08_0460.hocr 0 0 0 0 1 0 1 1 24 318
481659978_08_0461.hocr 0 0 0 0 0 0 3 5 35 301
481659978_08_0462.hocr 0 0 0 0 0 0 0 2 23 308
481659978_08_0463.hocr 0 0 0 0 1 1 2 7 27 271
481659978_08_0464.hocr 0 0 0 2 0 1 2 2 16 305
481659978_08_0465.hocr 0 0 0 0 0 3 2 2 11 322
481659978_08_0466.hocr 1 2 2 1 2 11 4 12 29 169
481659978_08_0467.hocr 0 0 2 0 0 2 1 2 19 276
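As one possible way to analyze this output without a spreadsheet, the sketch below sums the ten columns of each line and reports the share of words in the highest range. summarize_wconf is a hypothetical helper name, and the choice of "share of words in the 90 to 99 range" as the summary figure is an assumption about what is useful. It also assumes file names without spaces.

```shell
#!/bin/bash
# Sketch: summarize the per-file counts printed by the script.
# summarize_wconf is a hypothetical helper; it reads the table on stdin and
# prints: file name, total number of words, percentage of words in the top range.
summarize_wconf() {
    LC_ALL=C awk '{
        total = 0
        for (i = 2; i <= 11; i++) total += $i      # columns 2..11 hold the ten counts
        share = (total > 0) ? 100 * $11 / total : 0
        printf "%s %d %.1f\n", $1, total, share
    }'
}
```

For example, if the table was saved to a file named stats.txt (a hypothetical name), `summarize_wconf < stats.txt` prints one summary line per page. For the line of 481659978_08_0466.hocr above, this gives 233 words in total, 72.5 percent of them in the highest range, which makes that page stand out from its neighbours.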