[Feature Request] Simplified batch processing CLI #353

athewsey · 2024-04-12T08:34:27Z

I previously raised similar requests in #19 and #20, but they went stale and were closed due to inactivity. Today was my first chance in a while to come back to the Textractor library, and I think the CLI could still be dramatically improved for this use case.

My proposed use-case is very similar to last time. Given some nested folders of a few multi-page documents like:

+ data/
  + CoolDoc1.pdf
  + subfolder-A/
  | + CoolDocA1.pdf
  | + CoolDocA2.pdf
  + subfolder-B/
    + CoolDocB1.pdf
    + CoolDocB2.pdf

...Then I'd like the CLI to help me produce consolidated Textract JSON results for each doc, in a way that preserves the folder structure and supports mapping to the original files. Something like:

+ data-textracted/
  + CoolDoc1.json
  + subfolder-A/
  | + CoolDocA1.json
  | + CoolDocA2.json
  + subfolder-B/
    + CoolDocB1.json
    + CoolDocB2.json

...And do this as automatically and scalably as it reasonably can, with one caveat: no deploying proper cloud infrastructure like SNS topics, Step Functions, etc. Just the bucket to load files into.
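To make the desired mapping concrete, here is a minimal sketch of how input documents could be paired with mirrored output paths. It uses only the Python standard library and is not Textractor code; the function name `plan_outputs` and its parameters are hypothetical, chosen for illustration.

```python
from pathlib import Path


def plan_outputs(input_root, output_root, pattern="*.pdf"):
    """Map each input document to a JSON path that mirrors its relative location.

    Hypothetical helper: not part of Textractor or its CLI.
    """
    input_root = Path(input_root)
    output_root = Path(output_root)
    plan = {}
    # rglob walks the input folder recursively, so nested subfolders are included.
    for doc in sorted(input_root.rglob(pattern)):
        relative = doc.relative_to(input_root)
        # Same relative path, same file stem, ".json" instead of the input suffix.
        plan[doc] = output_root / relative.with_suffix(".json")
    return plan
```

With the layout above, `plan_outputs("data", "data-textracted")` would pair `data/subfolder-A/CoolDocA1.pdf` with `data-textracted/subfolder-A/CoolDocA1.json`, so each result can be traced back to its source file by path alone.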

As of today, the CLI docs suggest a solution that (in my proposed order of friction/priority):

1. Doesn't support mapping from generated JSON file back to the original input filename
2. Doesn't address how long we need to wait between kicking off the jobs and fetching the results (maybe we could have an automatic waiter built in if the second command's run too soon?)
3. Relies too much on shell script to actually hook the process together, which makes it difficult to try and address e.g. point 2 by customizing.

What I'm really looking for is a utility to abstract the complexities of actually calling Textract for quick PoCs on ~10-100 documents - so I can just get that initial batch run out of the way and turn focus within a few minutes to the analysis and post-processing of the (consolidated, one-per-doc) JSON files.

Belval · 2024-04-12T14:46:28Z

A few thoughts on the above:

> Doesn't support mapping from generated JSON file back to the original input filename

This could be addressed by keeping the input file's name and automatically creating the directory structure, so the code change would be minor.
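As a rough illustration of how small that change could be, the sketch below writes a result to a nested output path and creates any missing parent folders first. It is standard-library Python, not the actual Textractor CLI code; `write_result` is a hypothetical name, and `result` is assumed to be an already-consolidated, JSON-serializable response.

```python
import json
from pathlib import Path


def write_result(output_path, result):
    """Write one document's result as JSON, creating parent folders as needed.

    Hypothetical helper: not part of Textractor or its CLI.
    """
    output_path = Path(output_path)
    # parents=True builds the whole folder chain; exist_ok=True makes reruns safe.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        json.dump(result, f)
    return output_path
```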

> Doesn't address how long we need to wait between kicking off the jobs and fetching the results (maybe we could have an automatic waiter built in if the second command's run too soon?)

Perhaps a --wait argument or similar on the get-result invocation would be sufficient to address the above.
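One way such a wait could behave is a simple polling loop, sketched below. This is an assumption about the design, not existing CLI behavior: `wait_for_result` and the `fetch` callable are hypothetical, and `fetch` is assumed to return a dict with a `JobStatus` field like the one in Textract's asynchronous Get* responses, where `IN_PROGRESS` means the job has not finished.

```python
import time


def wait_for_result(fetch, job_id, timeout=600.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch(job_id) until the job leaves IN_PROGRESS or the timeout passes.

    Hypothetical helper: `fetch` stands in for whatever call retrieves the job
    result; it is assumed to return a dict containing "JobStatus".
    """
    deadline = clock() + timeout
    while True:
        response = fetch(job_id)
        # Any status other than IN_PROGRESS (succeeded, failed, ...) ends the wait.
        if response.get("JobStatus") != "IN_PROGRESS":
            return response
        if clock() >= deadline:
            raise TimeoutError(f"Job {job_id} still in progress after {timeout}s")
        sleep(interval)
```

The caller still has to inspect the returned status, since a failed job also ends the loop. Passing `sleep` and `clock` as parameters is only there to make the loop easy to test without real delays.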

Belval added the enhancement (New feature or request) label on Apr 12, 2024
