OCR Cleanup

OCR cleanup takes a folder of .tsv files produced by the Tesseract OCR system and writes a second folder of .tsv files containing the cleaned strings.

It cleans the OCR using the following steps:

Removes rows with:
- Low OCR confidence (<10)
- Height too small (<20 px)
- Empty text
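The removal rules above can be sketched as a small row filter. This is a minimal illustration using only the Python standard library, not the actual clean_ocr.py code. It assumes the columns are named height, conf and text, as in Tesseract's TSV output; the names MIN_CONF, MIN_HEIGHT, keep_row and filter_tsv are hypothetical.

```python
import csv
import io

MIN_CONF = 10    # rows with confidence below this are dropped
MIN_HEIGHT = 20  # rows with a box height below this (in px) are dropped


def keep_row(row):
    """Return True if a TSV row passes all three removal rules."""
    text = (row.get("text") or "").strip()
    if not text:
        return False  # empty text
    try:
        conf = float(row["conf"])
        height = float(row["height"])
    except (KeyError, TypeError, ValueError):
        return False  # missing or malformed numeric fields
    return conf >= MIN_CONF and height >= MIN_HEIGHT


def filter_tsv(handle):
    """Read tab-separated rows from a file-like object and keep the valid ones."""
    # QUOTE_NONE: quote characters in OCR text are treated as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if keep_row(row)]


# Made-up sample rows, reduced to the three relevant columns.
SAMPLE = (
    "height\tconf\ttext\n"
    "30\t95\tNieuws\n"   # kept
    "12\t90\tklein\n"    # dropped: height < 20
    "30\t5\tx#q\n"       # dropped: confidence < 10
    "30\t-1\t\n"         # dropped: empty text
)

kept = filter_tsv(io.StringIO(SAMPLE))
```

With the sample above, only the first row survives the filter.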

Transformations:
- Removes punctuation
- Replaces each word with its closest match in a Dutch dictionary
  - If multiple words are detected, takes the average of the match ratio times the word frequency over the words.
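The transformations above might look roughly like the sketch below. It is not the script's implementation: the standard-library difflib stands in for fuzzyset, a tiny made-up frequency table (DUTCH_FREQ) stands in for wordfreq, and averaging ratio times frequency over the words is one reading of the multi-word rule. All function names are hypothetical.

```python
import difflib
import string

# Hypothetical toy dictionary: Dutch word -> relative frequency (made-up values).
DUTCH_FREQ = {"nieuws": 0.6, "nieuw": 0.3, "weer": 0.5, "het": 0.9}


def strip_punctuation(text):
    """Remove ASCII punctuation characters from a string."""
    return text.translate(str.maketrans("", "", string.punctuation))


def closest(word, freq):
    """Return (match, ratio) for the dictionary word most similar to `word`."""
    scored = [
        (difflib.SequenceMatcher(None, word.lower(), cand).ratio(), cand)
        for cand in freq
    ]
    ratio, match = max(scored)
    return match, ratio


def clean_text(text, freq=DUTCH_FREQ):
    """Replace every word by its closest dictionary match.

    Returns the cleaned string and a score: the mean of
    (match ratio * word frequency) over all words, or 0.0 for empty input.
    """
    words = strip_punctuation(text).split()
    matches = [closest(w, freq) for w in words]
    cleaned = " ".join(match for match, _ in matches)
    if not matches:
        return cleaned, 0.0
    score = sum(ratio * freq[match] for match, ratio in matches) / len(matches)
    return cleaned, score


cleaned, score = clean_text("Het nieuvvs.")
```

Here the OCR error "nieuvvs" is mapped to "nieuws" because that entry has the highest similarity ratio in the toy dictionary. A real run would need a full word list and a cutoff below which a word is left unchanged.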

To run on a single folder of .tsv files:

python clean_ocr.py <folder with .tsv>

Results are written to data/processed_ocr/videoname.

To run on every folder inside a parent folder, four jobs at a time with GNU parallel:

parallel -j 4 python clean_ocr.py ::: ../../data/RTLNieuws\ OCR/*
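If GNU parallel is not available, the same batch run can be approximated from Python. The sketch below is an assumption-laden alternative, not part of the project: run_all is a hypothetical helper that calls the script once per entry in the parent folder, with at most four processes running at a time.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_all(parent, script="clean_ocr.py", workers=4):
    """Run `script` once per entry in `parent`; return {entry: exit code}."""
    entries = sorted(Path(parent).iterdir())

    def run(entry):
        # Passing a list avoids shell quoting problems with spaces in paths.
        return subprocess.run([sys.executable, script, str(entry)]).returncode

    # Threads only wait on child processes, so a thread pool is sufficient.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(entries, pool.map(run, entries)))


# Example call (not executed here):
# codes = run_all("../../data/RTLNieuws OCR")
```

A non-zero exit code in the returned mapping marks an entry whose cleanup failed.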

Requirements:

tqdm==4.29.1
pandas==0.23.4
fuzzyset==0.0.16
wordfreq==2.2.0
