[Feature Request] Simplified batch processing CLI #353

Open
athewsey opened this issue Apr 12, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@athewsey
Contributor

I previously raised similar requests (#19 and #20), but they went stale and were closed due to inactivity. Today was my first chance in a while to come back to the Textractor library, and I think the CLI could still be dramatically improved for this use case.

My proposed use-case is very similar to last time. Given some nested folders of a few multi-page documents like:

+ data/
  + CoolDoc1.pdf
  + subfolder-A/
  | + CoolDocA1.pdf
  | + CoolDocA2.pdf
  + subfolder-B/
    + CoolDocB1.pdf
    + CoolDocB2.pdf

...Then I'd like the CLI to help me produce consolidated Textract JSON results for each doc, in a way that preserves the folder structure and supports mapping to the original files. Something like:

+ data-textracted/
  + CoolDoc1.json
  + subfolder-A/
  | + CoolDocA1.json
  | + CoolDocA2.json
  + subfolder-B/
    + CoolDocB1.json
    + CoolDocB2.json

...And do this as automatically and scalably as it reasonably can, without deploying additional cloud infrastructure such as SNS topics or Step Functions: just the S3 bucket to load files into.
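The folder mapping described above could be sketched roughly as follows. This is only an illustration using the standard library: `plan_outputs` and its parameters are hypothetical names, not part of the Textractor CLI, and the sketch assumes the inputs are PDFs on local disk.

```python
from pathlib import Path


def plan_outputs(input_root, output_root, pattern="*.pdf"):
    """Map each input document to a JSON path that mirrors its relative location.

    Hypothetical helper: nothing here calls Textract, it only plans file names.
    """
    input_root = Path(input_root)
    output_root = Path(output_root)
    plan = {}
    # rglob walks subfolders recursively, so nested documents are included.
    for src in sorted(input_root.rglob(pattern)):
        rel = src.relative_to(input_root)
        # Same relative path and file stem, with the extension swapped to .json.
        plan[src] = output_root / rel.with_suffix(".json")
    return plan
```

Because each output path keeps the input's relative path and stem, `data/subfolder-A/CoolDocA1.pdf` would map to `data-textracted/subfolder-A/CoolDocA1.json`, which makes it straightforward to trace a result back to its source file.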


As of today, the CLI docs suggest a solution that (in my proposed order of friction/priority):

  1. Doesn't support mapping from a generated JSON file back to the original input filename
  2. Doesn't address how long we need to wait between kicking off the jobs and fetching the results (maybe we could have an automatic waiter built in, in case the second command is run too soon?)
  3. Relies too much on shell scripting to hook the process together, which makes it difficult to address e.g. point 2 by customizing.

What I'm really looking for is a utility to abstract the complexities of actually calling Textract for quick PoCs on ~10-100 documents - so I can just get that initial batch run out of the way and turn focus within a few minutes to the analysis and post-processing of the (consolidated, one-per-doc) JSON files.
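As a rough illustration of what "consolidated, one-per-doc" could mean: Textract's asynchronous Get* operations return results in pages linked by `NextToken`, each carrying a `Blocks` list. The sketch below merges such pages into a single dict. `consolidate_pages` is a hypothetical helper, not an existing Textractor function, and it assumes the non-block metadata on the first page is representative of the whole job.

```python
def consolidate_pages(pages):
    """Merge paginated Textract-style response dicts into one per-document dict.

    Hypothetical helper. Assumes each page is a dict with an optional "Blocks"
    list, and that top-level metadata can be taken from the first page.
    """
    pages = list(pages)
    if not pages:
        raise ValueError("no response pages to consolidate")
    # Copy top-level fields from the first page, dropping pagination state.
    merged = {
        key: value
        for key, value in pages[0].items()
        if key not in ("Blocks", "NextToken")
    }
    # Concatenate blocks in page order so the document order is preserved.
    merged["Blocks"] = [
        block for page in pages for block in page.get("Blocks", [])
    ]
    return merged
```

The merged dict can then be written with `json.dump` to the per-document output path.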

Belval added the enhancement label on Apr 12, 2024
@Belval
Contributor

Belval commented Apr 12, 2024

A few thoughts on the above:

Doesn't support mapping from generated JSON file back to the original input filename

This could be addressed by keeping the input file name and automatically creating the directory structure, so the code change would be minor.

Doesn't address how long we need to wait between kicking off the jobs and fetching the results (maybe we could have an automatic waiter built in if the second command's run too soon?)

Perhaps a --wait argument or similar on the get-result invocation would be sufficient to address the above.
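One way such a wait option might behave is a simple polling loop with a timeout. The sketch below is an assumption about the behavior, not the CLI's actual implementation: `wait_for_job` and its parameters are hypothetical, and `get_status` stands in for whatever call returns the job's `JobStatus` (Textract reports `IN_PROGRESS`, `SUCCEEDED`, `FAILED`, or `PARTIAL_SUCCESS`).

```python
import time

# Statuses after which polling should stop.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "PARTIAL_SUCCESS"}


def wait_for_job(get_status, poll_seconds=5.0, timeout_seconds=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the job reaches a terminal status or times out.

    Hypothetical helper: get_status is any zero-argument callable returning the
    current job status string. sleep and clock are injectable for testing.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

The caller would still need to check whether the returned status is `SUCCEEDED` before fetching results; a fixed polling interval keeps the sketch short, though a backoff would be gentler on API rate limits for larger batches.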
