ValidationError in TDocumentSchema().load for json from start_document_analysis LAYOUT #169

Risho92 · 2023-12-01T21:24:23Z

I used the command below to extract text from a PDF using textractor:

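import boto3

# Textract client (creation was not shown in the original snippet)
client = boto3.client('textract')

response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': Bucket,
            'Name': Name
        }
    },
    FeatureTypes=['LAYOUT', 'FORMS'],
    OutputConfig={
        'S3Bucket': S3Bucket,
        'S3Prefix': S3Prefix
    },
    KMSKeyId=KMSKeyId
)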
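For comparison, the results of the same job can also be read back through the API rather than from the files written to S3. The following is only a minimal sketch, assuming the job has finished and that `client` is a boto3 Textract client; the helper name `collect_analysis_blocks` is made up for illustration and is not part of any library.

```python
from typing import Any, Dict, List


def collect_analysis_blocks(client: Any, job_id: str) -> List[Dict[str, Any]]:
    """Gather the Blocks from every result page of a document-analysis job.

    `client` is expected to behave like boto3.client("textract").
    """
    blocks: List[Dict[str, Any]] = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        # This sketch treats anything other than SUCCEEDED as an error.
        if response.get("JobStatus") != "SUCCEEDED":
            raise RuntimeError(f"Job not finished: {response.get('JobStatus')}")
        blocks.extend(response.get("Blocks", []))
        # Results are paginated; stop when no further token is returned.
        next_token = response.get("NextToken")
        if not next_token:
            return blocks
```

With a real client this would be called as `collect_analysis_blocks(client, response['JobId'])` once the job status is SUCCEEDED.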

I took the output file and added a ".json" extension (an optional step). Then I tried to run the CSV data extraction example from the page below.

https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter

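import csv
import io
import json

from textractprettyprinter import get_layout_csv_from_trp2
from trp.trp2 import TDocument, TDocumentSchema

with open(<some_test_file>) as input_fp:
    # Deserialize the Textract JSON into a TRP2 document
    trp2_doc: TDocument = TDocumentSchema().load(json.load(input_fp))
    layout_csv = get_layout_csv_from_trp2(trp2_doc)
    csv_output = io.StringIO()
    csv_writer = csv.writer(csv_output, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for page in layout_csv:
        csv_writer.writerows(page)
    # getvalue() returns the CSV text; printing the StringIO object only shows its repr
    print(csv_output.getvalue())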

json.load(input_fp) works fine, but TDocumentSchema().load(json.load(input_fp)) throws a "ValidationError":

Cell In[13], line 4
	1 with open("1.json") as input_fp:
	2 	TDocumentSchema().load(json.load(input_fp))

File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:722, in Schema.load(self, data, many, partial, unknown)
	691 def load(
	692 	self,
	693 	data: (
	(...)
	700 unknown: str | None = None,
	701 ):
	702 		"""Deserialize a data structure to an object defined by this schema's fields.
	703
	704 		:param data: The data to deserialize.
	(...)
	720 			if invalid data are passed.
	721			"""
	722 	return self._do_load(
	723 		data, many=many, partial=partial, unknown=unknown, postprocess=True
	724 	)
	
File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:909, in Schema._do_load(self, data, many, partial, unknown, postprocess)
	907 	exc = ValidationError(errors, data=data, valid_data=result)
	908 	self.handle_error(exc, data, many=many, partial=partial)
	909 	raise exc
	911 return result
	
ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'].........

I tried with both a multi-page PDF and a single-page PDF, but I always get this error.
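The messages above say that fields such as 'Confidence', 'Text' and 'ColumnIndex' "may not be null", which suggests the JSON contains those keys with explicit null values. One possible workaround, sketched below, is to drop null-valued keys before handing the data to the schema. This is an assumption and has not been confirmed against the failing file; `strip_nulls` is a hypothetical helper name, and the sample block is invented for illustration.

```python
import json
from typing import Any


def strip_nulls(value: Any) -> Any:
    """Recursively remove dictionary keys whose value is None.

    None entries inside lists are left in place; only dict keys are dropped.
    """
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(item) for item in value]
    return value


# Invented sample that mimics a block carrying an explicit null field
sample = json.loads(
    '{"Blocks": [{"BlockType": "LINE", "Text": "hi", '
    '"ColumnIndex": null, "Confidence": 99.5}]}'
)
cleaned = strip_nulls(sample)
print(cleaned)
# {'Blocks': [{'BlockType': 'LINE', 'Text': 'hi', 'Confidence': 99.5}]}
```

The cleaned dictionary could then be passed to TDocumentSchema().load(cleaned) in place of the raw json.load(input_fp) result; whether that is enough for the schema to accept the file is untested here.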

I also uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output 'analyzeDocResponse.json' and followed the same procedure. TDocumentSchema().load(json.load(input_fp)) worked fine on that file, but get_layout_csv_from_trp2(trp2_doc) failed with the error "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not be using the console output in production; I only tested it.

The environment details are given below.

Operating System: Windows 11 Pro
Python Version: 3.10.12

amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3

Any help in getting this error resolved is highly appreciated.

Belval · 2023-12-01T21:29:17Z

If the asset is not confidential, please attach the .json file to the issue; it helps a lot when debugging. If you do not feel comfortable sharing the JSON on GitHub, you can also send it directly to belvae[at]amazon.com and I'll take a look.

Thanks

Risho92 · 2023-12-01T23:07:13Z

Actually, the data is confidential, so unfortunately I will not be able to share it. The PDF had tables, hyperlinks, links and lists.

athewsey added the "python" label (Relates to the Python version of TRP) · Jan 18, 2024
