tesseract ocr pdf

A script that extracts the images from a PDF and runs Tesseract OCR on them to recover the text.

Pre-installation dependencies

Before you can run it, you need to install Python 3.8 or later and Tesseract OCR.

Python 3

You can download Python for your OS from the official download page.

tesseract

Windows

For Windows, you can download the binary installer from here.

Install dependencies

$ pip install pillow
$ pip install pytesseract
$ pip install opencv-python
$ pip install PyMuPDF
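
To confirm that the packages above were installed, a quick check along these lines may help. It is only a sketch and not part of the repository: the helper name `missing_modules` is hypothetical, and the import names (`PIL`, `cv2`, `fitz`) are the module names these packages are normally imported under, which differ from their pip names.

```python
import importlib.util
import shutil


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # pillow -> PIL, opencv-python -> cv2, PyMuPDF -> fitz
    required = ["PIL", "pytesseract", "cv2", "fitz"]
    print("Missing Python modules:", missing_modules(required) or "none")
    # shutil.which only finds the binary if its folder is on PATH
    print("tesseract on PATH:", shutil.which("tesseract") or "not found")
```

If `tesseract` is reported as not found, the script can still work as long as you give it the full path to the executable when prompted.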

Usage

When the installation is finished, you can run the script:

$ cd tesseract-ocr-pdf
$ python main.py

Windows

You need to provide the path to your tesseract.exe. For example:

> [!] Insert path to your tesseract.exe
> C:\Users\User\tesseract\tesseract.exe
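
For reference, the usual way to hand such a path to pytesseract is to set `pytesseract.pytesseract.tesseract_cmd`. The sketch below assumes that is roughly what happens with the path you enter; the function name `configure_tesseract` is hypothetical and may not match anything in `main.py`.

```python
from pathlib import Path


def configure_tesseract(exe_path: str) -> Path:
    """Check that exe_path is a file, then point pytesseract at it."""
    # Tolerate surrounding spaces or quotes from a pasted path
    path = Path(exe_path.strip().strip('"'))
    if not path.is_file():
        raise FileNotFoundError(f"Tesseract executable not found: {path}")
    # Imported here so the path check works even before pytesseract is installed
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = str(path)
    return path
```

Validating the path first gives a clear error up front instead of a failure in the middle of the scan.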

Then, provide the path to your PDF file:

> [!] Insert path to your tesseract.exe
> C:\Users\User\Documents\file.pdf

The script then extracts the images, runs OCR on each one, and creates a file with the text output.
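
The steps above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in `main.py`: the function names are hypothetical, the output file name (same name as the PDF with a `.txt` extension) is a guess, and any OpenCV preprocessing the real script may do is left out.

```python
import io
from pathlib import Path


def output_path_for(pdf_path: str) -> Path:
    """Hypothetical naming rule: file.pdf -> file.txt in the same folder."""
    return Path(pdf_path).with_suffix(".txt")


def ocr_pdf_images(pdf_path: str) -> Path:
    """Extract embedded images from a PDF, OCR each one, and save the text."""
    # Third-party imports live here so the helper above works without them
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    chunks = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img_info in page.get_images(full=True):
                xref = img_info[0]  # first field is the image's xref number
                data = doc.extract_image(xref)["image"]  # raw image bytes
                with Image.open(io.BytesIO(data)) as image:
                    chunks.append(pytesseract.image_to_string(image))

    out_path = output_path_for(pdf_path)
    out_path.write_text("\n".join(chunks), encoding="utf-8")
    return out_path
```

Calling `ocr_pdf_images(r"C:\Users\User\Documents\file.pdf")` would, under these assumptions, write the recognized text next to the PDF and return the path of the new file.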
