I took the output file and added a ".json" extension (an optional step). Then I tried to run the example CSV data extraction from the page below.
Cell In[13], line 4
1 with open("1.json") as input_fp:
2 TDocumentSchema().load(json.load(input_fp))
File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:722, in Schema.load(self, data, many, partial, unknown)
691 def load(
692 self,
693 data: (
(...)
700 unknown: str | None = None,
701 ):
702 """Deserialize a data structure to an object defined by this schema's fields.
703
704 :param data: The data to deserialize.
(...)
720 if invalid data are passed.
721 """
722 return self._do_load(
723 data, many=many, partial=partial, unknown=unknown, postprocess=True
724 )
File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:909, in Schema._do_load(self, data, many, partial, unknown, postprocess)
907 exc = ValidationError(errors, data=data, valid_data=result)
908 self.handle_error(exc, data, many=many, partial=partial)
909 raise exc
911 return result
ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'].........
I tried with both a multi-page PDF and a single-page PDF, but I always get this error.
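A possible workaround, not confirmed by the maintainers: the messages above suggest that the saved file contains explicit null values for keys such as Confidence, Text and ColumnIndex, and marshmallow reports "Field may not be null." for any field that does not allow None. The sketch below assumes that dropping those null-valued keys before loading is acceptable; `strip_nulls` is a hypothetical helper written for this illustration, not part of any Textract library, and the sample data is a made-up stand-in for the real file.

```python
import json
from typing import Any


def strip_nulls(value: Any) -> Any:
    """Recursively drop dict keys whose value is None.

    Lists are walked, but None items inside a list are kept so that
    list positions do not shift.
    """
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(item) for item in value]
    return value


# Tiny made-up stand-in for a saved response; the real file has many more keys.
sample = json.loads(
    '{"Blocks": [{"BlockType": "LINE", "Id": "abc", "Text": null,'
    ' "Confidence": null, "ColumnIndex": null}]}'
)

cleaned = strip_nulls(sample)
print(cleaned)
```

The cleaned dictionary could then be passed to TDocumentSchema().load(...) in place of the raw json.load(...) result. Whether the schema accepts it depends on which fields it marks as required, so this is something to try rather than a confirmed fix.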
I also uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output 'analyzeDocResponse.json' and followed the same procedure. TDocumentSchema().load(json.load(input_fp)) worked fine on that file, but get_layout_csv_from_trp2(trp2_doc) failed with the error "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not be using this path in production; I was only testing it.
If the asset is not confidential, please attach the .json file to the issue, it helps a lot when debugging. If you do not feel comfortable sharing the json on GitHub, you can also send it directly to belvae[at]amazon.com and I'll take a look.
I used the command below to extract text from a PDF using Textractor
https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter
The environment details are given below:
Operating System: Windows 11 Pro
Python Version: 3.10.12
amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3
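For anyone trying to reproduce this, a version list like the one above can be collected with the standard library. This is only a sketch: `installed_versions` is a hypothetical helper name, and the package names are simply copied from the list above.

```python
from importlib import metadata

# Package names copied from the list above.
PACKAGES = [
    "amazon-textract-caller",
    "amazon-textract-pipeline-pagedimensions",
    "amazon-textract-prettyprinter",
    "amazon-textract-textractor",
    "amazon-textract-response-parser",
    "marshmallow",
    "textract-trp",
]


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}=={version}" if version else f"{name} (not installed)")
```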
Any help to get this error resolved is highly appreciated.