Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for image URLs #526

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

marosstruk
Copy link

Fixes #415
In the spirit of the repo being a wrapper, I left URL validation to the tesseract-ocr process. If you think it is something that should be handled here, let me know.

@stefan6419846
Copy link
Contributor

We should probably ensure that the current Tesseract version is indeed capable of using URLs: This requires a minimum version of Tesseract and the binary being linked to libcurl (which can be disabled at build time).

@marosstruk
Copy link
Author

marosstruk commented Nov 22, 2023

@stefan6419846
Copy link
Contributor

URL support is a compile-time feature as previously mentioned: https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/CMakeLists.txt#L101 Ubuntu < 23.04 just does not link against libcurl during build-time: https://packages.ubuntu.com/jammy/tesseract-ocr https://packages.ubuntu.com/lunar/tesseract-ocr (See control files of the source packages as well.)

@marosstruk
Copy link
Author

marosstruk commented Nov 23, 2023

Ah I see what you mean. The logic I added should correctly check if Tesseract was built with libcurl, so I can add it to the testcase as well. Granted that would make it never run until GitHub adds Ubuntu 23.04 as host for actions and the action config is updated, but at least it will allow people who compiled Tesseract themselves use the functionality.

@stefan6419846
Copy link
Contributor

We should probably use the image from GitHub (for example by uploading it inside a comment here) to not rely on external services.

For testing: I am not sure whether we already want to test this here or add a conditional skip. Compiling Tesseract on GitHub actions would work, but probably mean quite some overhead for each build.

@marosstruk
Copy link
Author

marosstruk commented Nov 28, 2023

I found out that the URL of files attached in comments is actually dynamic and changes over time, so I used an URL pointing to an image stored in the repo itself (had to split the URL to pass max line length check). I also added a test to verify the URLSupportException.

I agree that compiling tesseract for the testcases might be a bit too much. Tbh, there are already other test cases that are skipped due to version atm, so I don't think its a big issue (its good futureproofing to have the testcase there), but ofc the final decision lies with the reviewer.

Just for completeness, I am attaching screenshot of the testcase passing on my machine:
test_image_to_string_with_url

@marosstruk marosstruk force-pushed the master branch 6 times, most recently from f55732e to f524a35 Compare December 3, 2023 17:05
@marosstruk
Copy link
Author

@stefan6419846 Please let me know your thoughts based on my previous comment

@@ -209,7 +218,14 @@ def save(image):
try:
with NamedTemporaryFile(prefix='tess_', delete=False) as f:
if isinstance(image, str):
yield f.name, realpath(normpath(normcase(image)))
if image.startswith('http:') or image.startswith('https:'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this be shorter using image.startswith(prefix=('http:', 'https:'))?

Copy link
Author

@marosstruk marosstruk Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed @stefan6419846

@bozhodimitrov bozhodimitrov force-pushed the master branch 10 times, most recently from 8ff250d to b0ae13f Compare November 22, 2024 16:15
@bozhodimitrov bozhodimitrov force-pushed the master branch 12 times, most recently from 7fc814d to 6e4f31c Compare November 22, 2024 18:12
@bozhodimitrov
Copy link
Collaborator

bozhodimitrov commented Nov 22, 2024

@marosstruk thank you very much for your contribution.
My only concern is that we are testing with a non-local URL.
Tests might get flaky if we use external resource or if GitHub is down for some reason.
Since we have this file locally, maybe we can serve it temporarily.

I really like the VCR.py approach, but we use the external tesseract executable itself for the request.
So we need to use something like pytest-localserver to server the local jpeg file.

PS: Oh, and almost forgot that we need a documentation entry in the Quickstart guide at least, so people know about that function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add image_url_to_string
3 participants