I previously raised similar requests #19 and #20, but they went stale and were closed due to inactivity... Today I had my first chance in a while to come back to the Textractor library, and I think the CLI could still be dramatically improved for this use-case.
My proposed use-case is very similar to last time. Given some nested folders of a few multi-page documents like:
...Then I'd like the CLI to help me produce consolidated Textract JSON results for each doc, in a way that preserves the folder structure and supports mapping to the original files. Something like:
...And do this as automatically & scalably as it reasonably can, subject to the caveat of not deploying proper cloud infrastructure like SNS topics / step functions / etc: Just the bucket to load files into.
As of today, the CLI docs suggest a solution that (in my proposed order of friction/priority):
1. Doesn't support mapping from generated JSON file back to the original input filename
2. Doesn't address how long we need to wait between kicking off the jobs and fetching the results (maybe we could have an automatic waiter built in if the second command's run too soon?)
3. Relies too much on shell script to actually hook the process together, which makes it difficult to try and address e.g. point 2 by customizing.
What I'm really looking for is a utility to abstract the complexities of actually calling Textract for quick PoCs on ~10-100 documents - so I can just get that initial batch run out of the way and turn focus within a few minutes to the analysis and post-processing of the (consolidated, one-per-doc) JSON files.
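To illustrate what "consolidated, one-per-doc" could mean in practice, here is a minimal sketch that merges the paginated responses of an asynchronous Textract job into a single dictionary. It is not Textractor's implementation: `consolidate_pages` and `fetch_page` are hypothetical names, and `fetch_page` is assumed to wrap a call such as boto3's `get_document_text_detection` with the job's `JobId` and an optional `NextToken`.

```python
from typing import Any, Callable, Dict, List, Optional


def consolidate_pages(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]]
) -> Dict[str, Any]:
    """Merge every response page of one Textract job into a single dict.

    `fetch_page(next_token)` is a hypothetical callable; it should return one
    response page, given the pagination token (None for the first page).
    """
    merged: Optional[Dict[str, Any]] = None
    blocks: List[Dict[str, Any]] = []
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        if merged is None:
            # Keep top-level metadata (JobStatus, DocumentMetadata, ...) from
            # the first page only; Blocks are accumulated separately.
            merged = {
                key: value
                for key, value in page.items()
                if key not in ("Blocks", "NextToken")
            }
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            break
    merged["Blocks"] = blocks
    return merged
```

The result can then be written with `json.dump` to one file per input document, so downstream analysis never has to deal with pagination.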
> Doesn't support mapping from generated JSON file back to the original input filename

This could be addressed by keeping the original file name and creating the directory structure automatically, so the code change would be minor.
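A rough sketch of that idea, assuming local input and output folders: the output path mirrors the document's position under the input root and keeps the full original file name. The function names and the `.json` suffix convention are assumptions for illustration, not existing Textractor behaviour.

```python
from pathlib import Path


def output_path_for(doc: Path, input_root: Path, output_root: Path) -> Path:
    """Mirror `doc`'s location under `input_root` into `output_root`.

    The original file name (including its extension) is kept and `.json` is
    appended, so `a/b/report.pdf` maps to `a/b/report.pdf.json` and two inputs
    that differ only by extension cannot collide.
    """
    relative = doc.relative_to(input_root)
    return output_root / relative.parent / (relative.name + ".json")


def prepare_output(doc: Path, input_root: Path, output_root: Path) -> Path:
    """Create the mirrored directory structure and return the target path."""
    target = output_path_for(doc, input_root, output_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

Because the mapping is a pure function of the input path, going from a result file back to its source document only requires stripping the `.json` suffix and swapping the root.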
> Doesn't address how long we need to wait between kicking off the jobs and fetching the results (maybe we could have an automatic waiter built in if the second command's run too soon?)

Perhaps a --wait argument or similar on the get-result invocation would be sufficient to address the above.
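What such a flag might do internally can be sketched as a simple polling loop. This is only an illustration under assumptions: `wait_for_job` and `get_status` are hypothetical names, `get_status` is assumed to return the `JobStatus` field of a Textract get-result response, and the default interval and timeout are arbitrary.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job leaves IN_PROGRESS, then return its final status.

    `get_status` is a hypothetical callable returning the job's JobStatus
    (IN_PROGRESS, SUCCEEDED, FAILED or PARTIAL_SUCCESS).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status != "IN_PROGRESS":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Textract job still in progress after timeout")
        sleep(poll_seconds)
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays; a production version would likely also back off between polls to stay under API rate limits.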