TRP.js v0.4.2(?) release draft #180

Merged
merged 15 commits into from
Jun 28, 2024
10 changes: 10 additions & 0 deletions src-js/CHANGELOG.md
@@ -1,5 +1,15 @@
# Changelog

## 0.4.2 (2024-06-28)
### Added
- Filter content by block type in a variety of contexts, with `includeBlockTypes` (allow-list) and `skipBlockTypes` (deny-list) options. These filters are available in the core `iter/listContent()`, `Layout.iter/listItems()` and `LayoutItem.iter/listLayoutChildren()` accessors, but can also be used to hide certain content (like page headers and footers) when you render with `.html({...})`. ([#179](https://github.com/aws-samples/amazon-textract-response-parser/issues/179))
- Low-level relationship traversal via `iter/listRelatedItemsByRelType()` is now supported from `Page`s (PAGE blocks)
- New `SelectionElement.isSelected` accessor, which returns a convenient boolean (versus the 2-member `.selectionStatus` enumeration)
- New `Field.isCheckbox` and `FieldValue.isCheckbox` accessors to check whether a K->V form field corresponds to a (label)->(checkbox) pair. Also added `{Field/FieldValue}.isSelected` and `.selectionStatus`, which return `null` for non-checkbox fields. (Pre-work for [#183](https://github.com/aws-samples/amazon-textract-response-parser/issues/183))
### Changed
- `WithContent` mixin options refactored to more closely mirror `IBlockTypeFilterOpts`, because WithContent now aligns to `iter/listRelatedItemsByRelType()` under the hood. This gives us more fine-grained but standardised control over how missing and unexpected non-content child block types are handled, per item class. However, it means some warning/error behaviour when parsing Textract JSON might have shifted a little (hopefully for the better).
- A page's `Layout` no longer keeps any internal list-of-items state, instead referring to the parent `PAGE` block's child relationships directly.

## 0.4.1 (2024-06-04)
### Added
- `iter/listRelatedItemsByRelType()` utility methods on all host-linked block wrapper objects, as the most common use-case for `relatedBlockIdsByRelType()` was just to then fetch the parsed wrapper for the retrieved block ID. We hope to further standardise across `childBlockIds`, `relatedBlockIdsByRelType`, and these new methods in a future release, but this might require some breaking changes to drive consistency in the handling of invalid JSONs (with missing block IDs, etc).
43 changes: 41 additions & 2 deletions src-js/README.md
@@ -166,6 +166,8 @@ const page = doc.pageNumber(1);
const fieldByPage = page.form.getFieldByKey("Address");
```

`field.isCheckbox` is true for fields whose value contains exactly one SelectionElement object, meaning they're a (key=label)->(value=checkbox/radio) pair. For these fields, you can directly use `field.selectionStatus` or `field.isSelected` to look up the value's status. For other (non-checkbox) fields, both return `null`.
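
As a rough illustration of how calling code might branch on these accessors, here is a minimal sketch. It does not import TRP.js: `FieldLike`, `keyText` and `valueText` are hypothetical stand-ins for a parsed form field, and only `isCheckbox` and `isSelected` mirror the accessors described above.

```typescript
// Hypothetical stand-in for a parsed form field (NOT the real TRP.js `Field` class).
// Only `isCheckbox` and `isSelected` mirror accessors described in this README.
interface FieldLike {
  keyText: string; // assumed: plain text of the field's key/label
  valueText: string; // assumed: plain text of the field's value
  isCheckbox: boolean;
  isSelected: boolean | null; // null for non-checkbox fields
}

// Render checkbox fields as a tick box, and other fields as key: value text.
function describeField(field: FieldLike): string {
  if (field.isCheckbox) {
    return `${field.keyText}: ${field.isSelected ? "[x]" : "[ ]"}`;
  }
  return `${field.keyText}: ${field.valueText}`;
}

const sampleFields: FieldLike[] = [
  { keyText: "Subscribe", valueText: "", isCheckbox: true, isSelected: true },
  { keyText: "Opt out", valueText: "", isCheckbox: true, isSelected: false },
  { keyText: "Address", valueText: "1 Main St", isCheckbox: false, isSelected: null },
];

const describedFields: string[] = sampleFields.map(describeField);
console.log(describedFields.join("\n"));
```

Checking `isCheckbox` first avoids treating the `null` returned for non-checkbox fields as "not selected".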


## Tables

@@ -225,6 +227,15 @@ page.layout.listItems().forEach((layItem) => {
  const children = layItem.listContent(); // Usually text LINEs, but sometimes other Layout* items
  console.log(layItem.text + "\n"); // ...Or you can just pull up the text
});

// Filtering by content type is also supported:
for (const layItem of page.layout.listItems({
  skipBlockTypes: [
    ApiBlockType.LayoutHeader, ApiBlockType.LayoutFooter, ApiBlockType.LayoutPageNumber
  ],
})) {
  console.log(layItem.text);
}
```
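
To make the allow-list / deny-list idea concrete, the sketch below models it as a standalone predicate. This is not TRP.js's implementation: `BlockTypeFilter` and `passesBlockTypeFilter` are hypothetical names, block types are plain Textract `BlockType` strings, and the rule that an item must pass both lists when both are given is an assumption.

```typescript
// Hypothetical options shape, loosely modelled on the `includeBlockTypes` and
// `skipBlockTypes` options shown above (not the library's own type).
interface BlockTypeFilter {
  includeBlockTypes?: string[]; // allow-list
  skipBlockTypes?: string[]; // deny-list
}

// Assumption: an item passes if it is on the allow-list (when one is given)
// AND is not on the deny-list (when one is given).
function passesBlockTypeFilter(blockType: string, opts: BlockTypeFilter = {}): boolean {
  if (opts.includeBlockTypes && opts.includeBlockTypes.indexOf(blockType) === -1) {
    return false;
  }
  if (opts.skipBlockTypes && opts.skipBlockTypes.indexOf(blockType) !== -1) {
    return false;
  }
  return true;
}

const layoutTypes: string[] = ["LAYOUT_HEADER", "LAYOUT_TITLE", "LAYOUT_TEXT", "LAYOUT_FOOTER"];

// Deny-list: drop headers and footers, keep everything else.
const withoutFurniture: string[] = layoutTypes.filter((t) =>
  passesBlockTypeFilter(t, { skipBlockTypes: ["LAYOUT_HEADER", "LAYOUT_FOOTER"] })
);

// Allow-list: keep only titles.
const titlesOnly: string[] = layoutTypes.filter((t) =>
  passesBlockTypeFilter(t, { includeBlockTypes: ["LAYOUT_TITLE"] })
);

console.log(withoutFurniture, titlesOnly);
```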

If Forms and/or Tables analyses were also enabled, you'll be able to traverse from the relevant Layout object types to these more detailed representations. **However,** because these are separate analyses the correspondence may not be 1-to-1 and TRP has to do some reconciliation under the hood:
@@ -299,12 +310,40 @@ Some caveats to be aware of:
- Top-level `Page.html()` and `TextractDocument.html()` currently depend on Layout analysis being enabled, because the Layout results are used to sequence all the elements together.
- Only HTML is supported currently, but we're keen to add `.markdown()` if there's interest

If either of these affects your planned use-cases, please let us know in the GitHub issues to help prioritise!
You can also **filter out** types of content you don't want to include in your HTML.

```typescript
// Most commonly, you'll `skip` high-level layout elements like `LayoutHeader`:
const docHtml = doc.html({
  skipBlockTypes: [
    ApiBlockType.LayoutHeader, ApiBlockType.LayoutFooter, ApiBlockType.LayoutPageNumber
  ],
});

// Skipping lower-level blocks is also possible, but can produce weird results:
const docHtmlNoCellsOrSelectors = doc.html({
  skipBlockTypes: [ApiBlockType.Cell, ApiBlockType.SelectionElement],
});

// Allow-listing is also possible, but you should include *everything* relevant:
const docTablesHtml = doc.html({
  includeBlockTypes: [
    ApiBlockType.Page,
    ApiBlockType.LayoutTable,
    ApiBlockType.Table,
    ApiBlockType.Cell,
    ApiBlockType.SelectionElement,
    ApiBlockType.Word,
  ],
});
```
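
The toy model below is one way to picture why an allow-list needs every level of the hierarchy. It is not how TRP.js renders HTML: `ToyBlock` and `collectText` are hypothetical, and the assumption is that a block missing from the allow-list is dropped together with everything nested inside it.

```typescript
// Hypothetical, simplified block tree (not a TRP.js type).
interface ToyBlock {
  blockType: string;
  text?: string;
  children?: ToyBlock[];
}

// Assumption: a block that is not on the allow-list is dropped along with its subtree.
function collectText(block: ToyBlock, include: string[]): string[] {
  if (include.indexOf(block.blockType) === -1) {
    return [];
  }
  let out: string[] = block.text !== undefined ? [block.text] : [];
  for (const child of block.children || []) {
    out = out.concat(collectText(child, include));
  }
  return out;
}

const toyPage: ToyBlock = {
  blockType: "PAGE",
  children: [
    {
      blockType: "LAYOUT_TEXT",
      children: [{ blockType: "LINE", children: [{ blockType: "WORD", text: "hello" }] }],
    },
    {
      blockType: "LAYOUT_TABLE",
      children: [
        {
          blockType: "TABLE",
          children: [{ blockType: "CELL", children: [{ blockType: "WORD", text: "42" }] }],
        },
      ],
    },
  ],
};

// Every level from PAGE down to WORD is listed, so the table's word survives.
const fullChain: string[] = collectText(toyPage, ["PAGE", "LAYOUT_TABLE", "TABLE", "CELL", "WORD"]);

// CELL is missing, so nothing below the TABLE is reached.
const missingCell: string[] = collectText(toyPage, ["PAGE", "LAYOUT_TABLE", "TABLE", "WORD"]);

console.log(fullChain, missingCell);
```

In this model, leaving out any intermediate type silently empties the result, which is why the allow-list example above names every level from `Page` down to `Word`.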

If you have feedback about these features, please let us know in the GitHub issues to help prioritise!


### Segment headers and footers from main content

This is another task for which you might find [Textract Layout analysis](https://docs.aws.amazon.com/textract/latest/dg/layoutresponse.html) useful - by looping through layout items and excluding those of type 'header', 'footer', and 'page number'.
This is another task for which you might find [Textract Layout analysis](https://docs.aws.amazon.com/textract/latest/dg/layoutresponse.html) useful - by looping through layout items and filtering out those of type `LayoutHeader`, `LayoutFooter` and `LayoutPageNumber`.
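
As a minimal sketch of that layout-based approach, the snippet below splits items into main content and page "furniture". It runs on hypothetical plain objects rather than real TRP.js layout items: `LayoutItemLike`, `FURNITURE_TYPES` and `splitFurniture` are illustrative names, and block types are given as raw Textract `BlockType` strings.

```typescript
// Hypothetical stand-in for a layout item (not the TRP.js class).
interface LayoutItemLike {
  blockType: string;
  text: string;
}

// Textract layout block types treated as page furniture in this sketch.
const FURNITURE_TYPES: string[] = ["LAYOUT_HEADER", "LAYOUT_FOOTER", "LAYOUT_PAGE_NUMBER"];

// Split items into main content and furniture, preserving reading order in each group.
function splitFurniture(items: LayoutItemLike[]): {
  main: LayoutItemLike[];
  furniture: LayoutItemLike[];
} {
  const main: LayoutItemLike[] = [];
  const furniture: LayoutItemLike[] = [];
  for (const item of items) {
    if (FURNITURE_TYPES.indexOf(item.blockType) !== -1) {
      furniture.push(item);
    } else {
      main.push(item);
    }
  }
  return { main, furniture };
}

const segmented = splitFurniture([
  { blockType: "LAYOUT_HEADER", text: "ACME Corp Confidential" },
  { blockType: "LAYOUT_TITLE", text: "Quarterly Report" },
  { blockType: "LAYOUT_TEXT", text: "Revenue grew this quarter." },
  { blockType: "LAYOUT_PAGE_NUMBER", text: "1" },
]);

console.log(segmented.main.map((item) => item.text).join("\n"));
```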

However, TRP.js also provides a heuristic function you can try instead:

2 changes: 1 addition & 1 deletion src-js/examples/README.md
@@ -14,7 +14,7 @@ The projects use the **local build** of the library for pre-publication testing,

To instead switch to published TRP.js versions (if you're using an example as a skeleton for your own project):

- For NodeJS projects, replace the package.json relative path in `"amazon-textract-response-parser": "file:../.."` with a normal version spec like `"amazon-textract-response-parser": "^0.4.1"`, and re-run `npm install`
- For NodeJS projects, replace the package.json relative path in `"amazon-textract-response-parser": "file:../.."` with a normal version spec like `"amazon-textract-response-parser": "^0.4.2"`, and re-run `npm install`
- For browser IIFE projects, edit the `<script>` tag in the HTML to point to your chosen CDN or downloaded `trp.min.js` location


8 changes: 4 additions & 4 deletions src-js/examples/browser-iife/main.html
@@ -13,16 +13,16 @@
set up your tag something like the below.

<script
src="https://cdn.jsdelivr.net/npm/[email protected].1"
integrity="sha384-Q5Bfg9FW9h7/A4rxpYHurtcHTc7OT8cPFsAnASCfhF44A+h5wu4tZf/V4MBun15x"
src="https://cdn.jsdelivr.net/npm/[email protected].2"
integrity="sha384-8Ykws8cWb9e4P0qkOQcsZ3qCam0Q0SbKsGqqVUcqliMQgAH1bkSeTxTTEB9PLUPG"
crossorigin="anonymous"
></script>

or:

<script
src="https://unpkg.com/[email protected].0"
integrity="sha384-Q5Bfg9FW9h7/A4rxpYHurtcHTc7OT8cPFsAnASCfhF44A+h5wu4tZf/V4MBun15x"
src="https://unpkg.com/[email protected].2"
integrity="sha384-8Ykws8cWb9e4P0qkOQcsZ3qCam0Q0SbKsGqqVUcqliMQgAH1bkSeTxTTEB9PLUPG"
crossorigin="anonymous"
></script>

124 changes: 70 additions & 54 deletions src-js/examples/browser-iife/package-lock.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion src-js/examples/nodejs-import/package-lock.json

2 changes: 1 addition & 1 deletion src-js/examples/nodejs-require/package-lock.json

2 changes: 1 addition & 1 deletion src-js/examples/nodejs-typescript/package-lock.json

4 changes: 2 additions & 2 deletions src-js/package-lock.json
