
OCR Cleanup

Arthur Câmara edited this page Jan 23, 2019 · 4 revisions

OCR cleanup takes a folder of .tsv files produced by the Tesseract OCR system and outputs another folder of .tsv files with cleaned strings.

It cleans the OCR output in the following steps:

  • Removes entries with:
    • Low OCR confidence (<10)
    • Height too small (<20 px)
    • Empty text
  • Transformations:
    • Removes punctuation
    • Replaces each word with its closest match in a Dutch dictionary
      • If multiple words are detected, takes the average, over the words, of the match ratio times the word frequency.
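
The steps above can be sketched roughly as follows. This is an illustration, not the code in clean_ocr.py: it uses the standard library's difflib in place of fuzzyset, a tiny hard-coded frequency table (DUTCH_WORDS, hypothetical) in place of wordfreq, and it reads the last bullet as "score each candidate by match ratio times frequency, then average the best scores over the words", which is only one possible reading. The conf, height and text keys are the column names of Tesseract's TSV output.

```python
import string
from difflib import SequenceMatcher

# Hypothetical toy dictionary: word -> relative frequency.
# Stand-in for a real Dutch word list with wordfreq frequencies.
DUTCH_WORDS = {"nieuws": 0.5, "weer": 0.3, "vandaag": 0.2}

MIN_CONF = 10    # entries with confidence below this are removed
MIN_HEIGHT = 20  # entries shorter than this (in px) are removed


def keep_row(row):
    """Return True if a TSV row (a dict of strings) passes all three filters."""
    text = str(row["text"]).strip()
    return (
        float(row["conf"]) >= MIN_CONF
        and int(row["height"]) >= MIN_HEIGHT
        and text != ""
    )


def strip_punctuation(text):
    """Remove ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def best_match(word):
    """Return (candidate, score), where score = match ratio * frequency."""
    scored = [
        (cand, SequenceMatcher(None, word.lower(), cand).ratio() * freq)
        for cand, freq in DUTCH_WORDS.items()
    ]
    return max(scored, key=lambda pair: pair[1])


def clean_text(text):
    """Replace every word by its best match; also return the average score."""
    words = strip_punctuation(text).split()
    matches = [best_match(w) for w in words]
    cleaned = " ".join(cand for cand, _ in matches)
    average = sum(score for _, score in matches) / len(matches) if matches else 0.0
    return cleaned, average
```

With this toy dictionary, clean_text("N1euws vandaaq") maps both misread words back to "nieuws vandaag". A real dictionary would need a cutoff on the score so that words with no good match are left alone; the sketch omits that.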

To run on a single folder of .tsv files:

python clean_ocr.py <folder with .tsv>

Outputs results to data/processed_ocr/videoname

To run on every entry in a folder, four jobs at a time:

for filename in ../../data/RTLNieuws\ OCR/*; do echo "$filename"; done | parallel -j 4 "python clean_ocr.py"
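
If GNU parallel is not available, the shell loop above can be approximated in Python. This is a minimal sketch under assumptions: build_commands and run_all are hypothetical helper names, and the only thing assumed about clean_ocr.py is that it takes one path as its argument, as in the commands on this page.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_commands(root, script="clean_ocr.py"):
    """Build one command per non-hidden entry in `root`, in sorted order."""
    entries = sorted(p for p in Path(root).iterdir() if not p.name.startswith("."))
    return [[sys.executable, script, str(p)] for p in entries]


def run_all(root, jobs=4):
    """Run up to `jobs` commands at once and return their exit codes."""
    commands = build_commands(root)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Each worker thread just waits on its subprocess, so threads suffice.
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))


# Example, using the path from the shell loop:
# run_all("../../data/RTLNieuws OCR", jobs=4)
```

Passing each path as a separate list element avoids any shell quoting problems with the space in the folder name.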


Requirements:

tqdm==4.29.1
pandas==0.23.4
fuzzyset==0.0.16
wordfreq==2.2.0