ValidationError in TDocumentSchema().load for json from start_document_analysis LAYOUT #169

Open
Risho92 opened this issue Dec 1, 2023 · 2 comments
Labels
python Relates to the Python version of TRP

Comments

@Risho92

Risho92 commented Dec 1, 2023

I used the call below to extract text from a PDF using textractor:

response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': Bucket,
            'Name': Name
        }
    },
    FeatureTypes=['LAYOUT', 'FORMS'],
    OutputConfig={
        'S3Bucket': S3Bucket,
        'S3Prefix': S3Prefix
    },
    KMSKeyId=KMSKeyId
)

I took the output file and added a ".json" extension (an optional step). Then I tried to run the example that extracts the data in CSV format from the page below.

https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter

import csv
import io
import json

from textractprettyprinter import get_layout_csv_from_trp2
from trp.trp2 import TDocument, TDocumentSchema

with open(<some_test_file>) as input_fp:
    trp2_doc: TDocument = TDocumentSchema().load(json.load(input_fp))
    layout_csv = get_layout_csv_from_trp2(trp2_doc)
    csv_output = io.StringIO()
    csv_writer = csv.writer(csv_output, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for page in layout_csv:
        csv_writer.writerows(page)
    # print the accumulated CSV text, not the StringIO object itself
    print(csv_output.getvalue())

json.load(input_fp) works fine, but TDocumentSchema().load(json.load(input_fp)) raises a ValidationError:
Cell In[13], line 4
	1 with open("1.json") as input_fp:
	2 	TDocumentSchema().load(json.load(input_fp))

File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:722, in Schema.load(self, data, many, partial, unknown)
	691 def load(
	692 	self,
	693 	data: (
	(...)
	700 unknown: str | None = None,
	701 ):
	702 		"""Deserialize a data structure to an object defined by this schema's fields.
	703
	704 		:param data: The data to deserialize.
	(...)
	720 			if invalid data are passed.
	721			"""
	722 	return self._do_load(
	723 		data, many=many, partial=partial, unknown=unknown, postprocess=True
	724 	)
	
File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:909, in Schema._do_load(self, data, many, partial, unknown, postprocess)
	907 	exc = ValidationError(errors, data=data, valid_data=result)
	908 	self.handle_error(exc, data, many=many, partial=partial)
	909 	raise exc
	911 return result
	
ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'].........
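
The nested error dict is hard to read at this length. marshmallow exposes it as the `messages` attribute of the ValidationError, so it can be summarized. A small sketch that tallies which fields fail across blocks, assuming the dict has the shape shown above (the helper name `count_failing_fields` is hypothetical):

```python
from collections import Counter


def count_failing_fields(messages):
    """Count how often each field name appears under 'Blocks' in a marshmallow error dict."""
    counts = Counter()
    for block_errors in messages.get("Blocks", {}).values():
        # each entry maps a field name to its list of error messages
        if isinstance(block_errors, dict):
            counts.update(block_errors.keys())
    return counts
```

Catching the exception with `except ValidationError as err:` and passing `err.messages` to this helper would show whether the failures cluster on a few optional fields.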

I tried both a multi-page PDF and a single-page PDF, and I always get this error.
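
Every message in the error is "Field may not be null.", so one possible cause (an assumption, not confirmed here) is that the JSON written to S3 by the asynchronous job carries explicit null values for fields that are simply absent in other responses. A workaround sketch that drops null-valued keys before the dict reaches TDocumentSchema; the helper name `drop_nulls` is made up for illustration:

```python
def drop_nulls(value):
    """Recursively remove dict keys whose value is None.

    Lists are walked element by element; None items inside a list are kept.
    """
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(v) for v in value]
    return value
```

It would be used as TDocumentSchema().load(drop_nulls(json.load(input_fp))). Whether this is enough for the schema to accept the file has not been verified.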

I also uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output 'analyzeDocResponse.json', and followed the same procedure. TDocumentSchema().load(json.load(input_fp)) worked fine on that file, but get_layout_csv_from_trp2(trp2_doc) failed with "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not be using this in production; I only tested it.


The environment details are given below.

Operating System: Windows 11 Pro
Python Version: 3.10.12

amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3

Any help to get this error resolved is highly appreciated.

@Belval
Contributor

Belval commented Dec 1, 2023

If the asset is not confidential, please attach the .json file to the issue; it helps a lot when debugging. If you do not feel comfortable sharing the JSON on GitHub, you can also send it directly to belvae[at]amazon.com and I'll take a look.

Thanks

@Risho92
Author

Risho92 commented Dec 1, 2023

Actually, the data is confidential, so unfortunately I will not be able to share it. The PDF had tables, hyperlinks, links and lists.

athewsey added the "python" label (Relates to the Python version of TRP) on Jan 18, 2024