Skip to content

This is script extract images, scan text and and transcribe to a .txt file

Notifications You must be signed in to change notification settings

ulysses-ck/teseract-ocr-pdf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

tesseract ocr pdf

It´s a script extract PDF´s images and use tesseract OCR for scan it

Pre-installation dependencies

Before you can run it, you need to install Python 3.8, onwards, and tesseract OCR

Python 3

You can download for your OS from their Oficial Download Page

tesseract

Windows

For Windows, you can download the binary installer from here.

Install dependencies

$ pip install pillow
$ pip install pytesseract
$ pip install opencv-python
$ pip install PyMuPDF

Usage

When finish the installation, you can run the script

$ cd tesseract-ocr-pdf
$ python main.py

Windows

You need to provide the path to your tesseract.exe. For example:

> [!] Insert path to your tesseract.exe
> C:\Users\User\tesseract\tesseract.exe

Then, the path to your PDF´file

> [!] Insert path to your tesseract.exe
> C:\Users\User\Documents\file.pdf

And then, the script starts to extract images, scan and create the file with the text output

About

This is script extract images, scan text and and transcribe to a .txt file

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages