Access Non-Axis-Aligned Bounding Boxes #359

zkalson · 2024-04-17T02:38:59Z

Hi all,

As I understand it, Textract returns both an axis-aligned BoundingBox object and a Polygon object, a list of points that outlines the text more tightly (https://docs.aws.amazon.com/textract/latest/dg/text-location.html). Textractor seems to expose only the BoundingBox object.

When a document is significantly skewed or rotated, the axis-aligned boxes are much larger than the tight, non-axis-aligned boxes, and they no longer line up with the actual position of the text.
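To put a rough number on the size difference, here is a small sketch using plain geometry, nothing from Textract or Textractor. It assumes the text occupies a `width` x `height` rectangle rotated by some angle; the function names are made up for illustration.

```python
import math

def aabb_size(width, height, angle_deg):
    """Return (width, height) of the axis-aligned box enclosing a rotated rectangle."""
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    # Project both sides of the rotated rectangle onto the x and y axes.
    return width * c + height * s, width * s + height * c

def aabb_area_ratio(width, height, angle_deg):
    """How many times larger the axis-aligned box is than the rectangle itself."""
    aabb_w, aabb_h = aabb_size(width, height, angle_deg)
    return (aabb_w * aabb_h) / (width * height)

# A line of text ten times wider than it is tall, at a few rotation angles.
for angle in (0, 5, 15, 45):
    ratio = aabb_area_ratio(10.0, 1.0, angle)
    print(f"{angle:>2} deg: axis-aligned box covers {ratio:.2f}x the area of the text")
```

For a 10:1 line of text rotated by 45 degrees, the axis-aligned box covers about six times the area of the text itself, and the mismatch grows as lines get longer.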

I've attached an example input document, an output text layer using Textractor results, and an output text layer from a different OCR inference that provided non-axis-aligned bounding boxes to hopefully make this easy to visualize.

input_document.pdf
text_layer_non-aabb.pdf
text_layer_textractor_aabb.pdf

Is it possible to add the Polygon object in Textractor? It would be a big help!

zkalson · 2024-04-17T03:31:58Z

As a temporary workaround, I am taking the id field from each word/line and looking up the associated polygon in Document.response.
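A minimal sketch of that kind of lookup, for anyone who wants to try it. It assumes the response is a single dict with a top-level `Blocks` list in the standard Textract JSON layout (`Id`, `Geometry.Polygon` with `X`/`Y` points). `build_polygon_index` is a hypothetical helper, not part of Textractor, and the sample data is hand-written rather than real Textract output.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def build_polygon_index(response: dict) -> Dict[str, List[Point]]:
    """Map each block Id to its polygon as a list of (x, y) page-relative points."""
    index: Dict[str, List[Point]] = {}
    for block in response.get("Blocks", []):
        polygon = block.get("Geometry", {}).get("Polygon")
        if polygon:
            index[block["Id"]] = [(p["X"], p["Y"]) for p in polygon]
    return index

# Hand-written stand-in for a Textract response (not real output).
sample_response = {
    "Blocks": [
        {
            "Id": "word-1",
            "BlockType": "WORD",
            "Text": "Hello",
            "Geometry": {
                "BoundingBox": {"Left": 0.25, "Top": 0.125, "Width": 0.5, "Height": 0.25},
                "Polygon": [
                    {"X": 0.25, "Y": 0.25},
                    {"X": 0.75, "Y": 0.125},
                    {"X": 0.75, "Y": 0.25},
                    {"X": 0.25, "Y": 0.375},
                ],
            },
        },
        # A block without geometry is simply skipped.
        {"Id": "page-1", "BlockType": "PAGE"},
    ]
}

polygons = build_polygon_index(sample_response)
print(polygons["word-1"])

# With Textractor the call would look roughly like this (assumed usage, untested here):
#   polygons = build_polygon_index(document.response)
#   points = polygons[word.id]
```

Building the index once keeps each lookup constant-time, which matters on pages with thousands of words.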

Belval · 2024-05-06T14:18:13Z

You can use the word's or line's raw_object member to get the polygon without doing an id-based lookup.

https://github.com/aws-samples/amazon-textract-textractor/blob/master/textractor/parsers/response_parser.py#L226
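A rough sketch of reading the polygon from such a raw object. It assumes `raw_object` holds the unmodified Textract block dict, so the points sit under `Geometry.Polygon` as `X`/`Y` ratios of the page size. `polygon_in_pixels` is a hypothetical helper and the block below is a hand-written stand-in, so the snippet runs without Textractor installed.

```python
from typing import List, Tuple

def polygon_in_pixels(raw_block: dict, page_width: int, page_height: int) -> List[Tuple[float, float]]:
    """Scale a block's page-relative polygon points to pixel coordinates."""
    polygon = raw_block["Geometry"]["Polygon"]
    return [(p["X"] * page_width, p["Y"] * page_height) for p in polygon]

# Hand-written stand-in for what word.raw_object is assumed to contain.
raw_block = {
    "Id": "word-1",
    "BlockType": "WORD",
    "Geometry": {
        "Polygon": [
            {"X": 0.25, "Y": 0.25},
            {"X": 0.75, "Y": 0.125},
            {"X": 0.75, "Y": 0.25},
            {"X": 0.25, "Y": 0.375},
        ]
    },
}

pixels = polygon_in_pixels(raw_block, page_width=1000, page_height=2000)
print(pixels)

# With Textractor the call would look roughly like this (assumed usage, untested here):
#   pixels = polygon_in_pixels(word.raw_object, image.width, image.height)
```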

In the future we would definitely like to support Polygon objects, but it will require some work as a lot of the code is tightly coupled with the BoundingBox object.

Belval added the enhancement New feature or request label May 6, 2024
