OCR Cleanup
Arthur Câmara edited this page Jan 23, 2019 · 4 revisions
OCR cleanup takes a folder of .tsv files, extracted using the Tesseract OCR system, and outputs another folder of .tsv files with cleaned strings.
It cleans the OCR using the following steps:
- Removes entries with:
  - Low OCR confidence (<10)
  - Height too small (<20 px)
  - Empty text
- Transformations:
  - Removes punctuation
  - Replaces each word with its closest match in a Dutch dictionary
  - If multiple words are detected, takes the average match ratio times the word frequency for each word.
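The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `clean_ocr.py`: it uses `difflib` from the standard library as a stand-in for a fuzzy matcher, a tiny made-up frequency table instead of a real Dutch dictionary, and it assumes the Tesseract TSV columns `conf`, `height` and `text`. The function names, the lowercasing, and the way the multi-word score is combined (mean of match ratio times frequency) are one possible reading of the description.

```python
import difflib
import string

# Thresholds from the list above.
MIN_CONF = 10
MIN_HEIGHT = 20

# Toy stand-in for a Dutch dictionary with word frequencies (values are made up).
DUTCH_FREQ = {"nieuws": 0.004, "vandaag": 0.003, "weer": 0.005, "minister": 0.001}


def keep_row(row):
    """Return True if an OCR entry passes the three removal filters."""
    text = str(row["text"]).strip()
    return (
        float(row["conf"]) >= MIN_CONF
        and int(row["height"]) >= MIN_HEIGHT
        and text != ""
    )


def strip_punctuation(text):
    """Drop ASCII punctuation characters from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def best_match(word, freq=DUTCH_FREQ):
    """Return (closest dictionary word, similarity ratio), or (word, 0.0) if none is close."""
    lowered = word.lower()  # assumption: matching is case-insensitive
    candidates = difflib.get_close_matches(lowered, list(freq), n=1, cutoff=0.6)
    if not candidates:
        return word, 0.0
    match = candidates[0]
    ratio = difflib.SequenceMatcher(None, lowered, match).ratio()
    return match, ratio


def clean_text(text, freq=DUTCH_FREQ):
    """Clean one OCR string; return (cleaned text, mean of ratio * frequency)."""
    words = strip_punctuation(text).split()
    matched = [best_match(w, freq) for w in words]
    cleaned = " ".join(m for m, _ in matched)
    if not matched:
        return cleaned, 0.0
    score = sum(r * freq.get(m, 0.0) for m, r in matched) / len(matched)
    return cleaned, score
```

For example, `keep_row({"conf": "5", "height": "30", "text": "Nieuws"})` is rejected because of the low confidence, while `clean_text("Nieuws, vandaaq!")` strips the punctuation and maps the misread `vandaaq` onto the dictionary entry `vandaag`.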
To run on a single folder of .tsv files:
python clean_ocr.py <folder with .tsv>
Outputs results to data/processed_ocr/videoname
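As a rough sketch of this folder-in, folder-out behaviour, the snippet below reads every `.tsv` in an input folder and writes a cleaned copy under an output root, in a subfolder named after the input folder. The function name `process_folder`, the `clean_row` callback, and the assumption that `videoname` is the input folder's base name are all hypothetical. It also parses the tab-separated lines by hand instead of using pandas, to stay dependency-free.

```python
from pathlib import Path


def process_folder(in_dir, out_root, clean_row=lambda row: row):
    """Write a cleaned copy of each .tsv in in_dir to out_root/<name of in_dir>.

    clean_row receives one row as a dict and returns the (possibly modified)
    dict, or None to drop the row.
    """
    in_dir = Path(in_dir)
    out_dir = Path(out_root) / in_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)

    for tsv_path in sorted(in_dir.glob("*.tsv")):
        lines = tsv_path.read_text(encoding="utf-8").splitlines()
        if not lines:
            continue  # skip empty files
        header = lines[0].split("\t")
        out_lines = [lines[0]]
        for line in lines[1:]:
            row = dict(zip(header, line.split("\t")))
            cleaned = clean_row(row)
            if cleaned is not None:
                out_lines.append("\t".join(str(cleaned.get(col, "")) for col in header))
        (out_dir / tsv_path.name).write_text("\n".join(out_lines) + "\n", encoding="utf-8")

    return out_dir
```

Called as `process_folder("some_video_folder", "data/processed_ocr")`, this would write its results to `data/processed_ocr/some_video_folder`.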
To run all files in a folder:
for filename in ../../data/RTLNieuws\ OCR/*; do echo "$filename"; done | parallel -j 4 "python clean_ocr.py"
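Where GNU parallel is not available, the same fan-out can be approximated in Python. The sketch below is an assumption-laden alternative, not part of the project: `run_on_all` is a hypothetical helper that starts one subprocess per entry in the parent folder, with at most four running at a time, using a thread pool in place of `parallel -j 4`.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_on_all(parent_dir, command, jobs=4):
    """Run `command + [entry]` for every entry in parent_dir, at most `jobs` at once.

    Returns a dict mapping each entry path to the exit code of its process.
    """
    entries = sorted(str(p) for p in Path(parent_dir).iterdir())

    def run_one(entry):
        # Passing the arguments as a list avoids shell quoting issues
        # with spaces in paths.
        return subprocess.run(command + [entry], check=False).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(entries, pool.map(run_one, entries)))


# Example call (assumes clean_ocr.py is in the current directory):
# run_on_all("../../data/RTLNieuws OCR", [sys.executable, "clean_ocr.py"])
```

Threads are enough here because each worker only waits on an external process; the actual cleaning still runs in separate Python processes.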
Requirements:
tqdm==4.29.1
pandas==0.23.4
fuzzyset==0.0.16
wordfreq==2.2.0