While exploring the code base to develop the functionality proposed in #159, I discovered that the call to Tesseract is often performed multiple times on the exact same data with the same arguments, purely for formatting purposes.
Example:
For the function `image_to_string`, it is called 3 times. I checked the running processes, and 3 processes are indeed launched.
    def image_to_string(
        image,
        lang=None,
        config='',
        nice=0,
        output_type=Output.STRING,
        timeout=0,
    ):
        """
        Returns the result of a Tesseract OCR run on the provided image to string
        """
        args = [image, 'txt', lang, config, nice, timeout]
        return {
            Output.BYTES: lambda: run_and_get_output(*(args + [True])),
            Output.DICT: lambda: {'text': run_and_get_output(*args)},
            Output.STRING: lambda: run_and_get_output(*args),
        }[output_type]()
Consequences
Computation seems to occur in parallel so it doesn't have an immediate impact on computation time.
But it is suboptimal:
- On a small machine, it will cause longer wait times.
- Multiple "huge" jobs calling Tesseract at the same time will cause longer wait times.
- Computation resources are wasted, which is not very energy efficient.
Proposition
A small refactoring could reduce the number of calls by a factor of 2 to 3.
The refactoring would look like this:
- Remove the `return_bytes=False` option from `run_and_get_output` and always return bytes.
The complete implementation would be:
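The original snippet is not available here, so the following is only a minimal sketch of the shape such a refactoring could take. The `Output` class and the body of `run_and_get_output` are stand-ins (the stub returns fixed bytes instead of running Tesseract), and decoding as UTF-8 is an assumption.

```python
class Output:
    # Stand-in for pytesseract's Output constants (illustration only).
    BYTES = 'bytes'
    DICT = 'dict'
    STRING = 'string'


def run_and_get_output(image, extension='', lang=None, config='', nice=0, timeout=0) -> bytes:
    # Stub: the real function would invoke Tesseract and read its output file.
    # Here it always returns fixed bytes so the sketch runs on its own.
    return b'stub text'


def image_to_string(image, lang=None, config='', nice=0,
                    output_type=Output.STRING, timeout=0):
    # Single call that always yields bytes; formatting happens afterwards.
    raw = run_and_get_output(image, 'txt', lang, config, nice, timeout)
    if output_type == Output.BYTES:
        return raw
    text = raw.decode('utf-8')  # assumed encoding
    if output_type == Output.DICT:
        return {'text': text}
    return text
```

With this shape, the output type only affects post-processing of the bytes, not how `run_and_get_output` is called.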
It seems like you are mixing something up here: The above construct will only call the branch actually needed, while Tesseract will use multiple threads by default when actually running (this is what you observed).
You can verify this with the following basic example as well:
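The example from the original comment is not reproduced here; as a minimal sketch of the same idea, the following records which branch of a dict of lambdas actually runs. The names `run`, `dispatch`, and `calls` are hypothetical and not part of pytesseract.

```python
calls = []


def run(label):
    # Record every invocation so we can count how many branches executed.
    calls.append(label)
    return label


def dispatch(kind):
    # Same construct as in image_to_string: build a dict of lambdas,
    # look up one entry, and call only that entry.
    return {
        'bytes': lambda: run('bytes'),
        'dict': lambda: {'text': run('dict')},
        'string': lambda: run('string'),
    }[kind]()


result = dispatch('string')
print(result, calls)  # prints: string ['string']
```

Building the dict creates three function objects but executes none of them; only the lambda selected by the key is invoked, so `run` is called once.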
NB: If we want to avoid changing the signature of `run_and_get_output`, we could keep it as it is, always call it with `return_bytes=True`, and manipulate the output to get the expected results.
Finally, we would need to modify the function `get_pandas_output`.
NB: If we want to avoid changing the signature of this function, we could create another one named `get_pandas_from_tesseract_output`.
Nice side effects from this refactoring
`run_and_get_output` would be simpler:
    def run_and_get_output(image, extension='', lang=None, config='', nice=0, timeout=0) -> bytes
Conclusion
What do you think? Did I miss something?