Trouble replicating markdown output #384

Open
bvbg1 opened this issue Jul 14, 2024 · 8 comments

bvbg1 commented Jul 14, 2024

I tried out the code from this example:
https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html#All-entities-can-be-linearized

The markdown output I'm getting is different from the one shown in that notebook and is incorrect:

| CO.
     | FILE
                              | DEPT.
   | CLOCK
   | NUMBER
   |
|-----|------------------------------|---|---|---|
| ABC | 126543 123456 12345 00000000 |   |   |   |

|                |           |
|----------------|-----------|
| Period ending: | 7/18/2008 |
| Pay date:      | 7/25/2008 |

|          |                       |
|----------|-----------------------|
| Federal: | 3. $25 Additional Tax |
| State:   | 2                     |
| Local:   | 2                     |

| Earnings
          | rate
           | hours
       | this period
          | year to date
           |
|----------|-----------|-------|----------|-----------|
| Regular  | 10.00     | 32.00 | 320.00   | 16,640.00 |
| Overtime | 15.00     | 1.00  | 15.00    | 780.00    |
| Holiday  | 10.00     | 8.00  | 80.00    | 4,160.00  |
| Tuition  |           |       | 37.43    | 1,946.80  |
|          | Gross Pay |       | $ 452.43 | 23,526.80 |

|                 |             |               |
|-----------------|-------------|---------------|
| Other Benefits and

Information                 | this period | total to date |
| Group Term Life | 0.51        | 27.00         |
| Loan Amt Paid   |             | 840.00        |
| Vac Hrs         |             | 40.00         |
| Sick Hrs        |             | 16.00         |
| Title           | Operator    |               |

|            |                     |         |          |
|------------|---------------------|---------|----------|
| Deductions | Statutory

Federal Income Tax                     | -40.60  | 2,111.20 |
|            | Social Security Tax | -28.05  | 1,458.60 |
|            | Medicare Tax        | -6.56   | 341.12   |
|            | NY State Income Tax | -8.43   | 438.36   |
|            | NYC Income Tax      | -5.94   | 308.88   |
|            | NY SUI/SDI Tax      | -0.60   | 31.20    |
|            | Other
 Bond                     | -5.00   | 100.00   |
|            | 401(k)              | -28.85  | 1,500.20 |
|            | Stock Plan          | -15.00  | 150.00   |
|            | Life Insurance      | -5.00   | 50.00    |
|            | Loan                | -30.00  | 150.00   |
|            | Adjustment

Life Insurance                     | + 13.50 |          |
|            | Net Pay             | $291.90 |          |

|                       |             |
|-----------------------|-------------|
| Payroll check number: | 0000000000  |
| Pay date:             | 7/25/2008   |
| Social Security No.   | 987-65-4321 |

|              |                                           |         |
|--------------|-------------------------------------------|---------|
| Pay to the

order of:              | JOHN STILES                               |         |
| This amount: | TWO HUNDRED NINETY-ONE AND 90/100 DOLLARS | $291.90 |

This is my code:

import os
from PIL import Image
from textractor import Textractor
from textractor.visualizers.entitylist import EntityList
from textractor.data.constants import TextractFeatures
image = Image.open("stub1.jpg").convert("RGB")


extractor = Textractor(region_name="us-west-2")

document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES, TextractFeatures.FORMS, TextractFeatures.SIGNATURES],
    save_image=True
)
print(document.tables.to_markdown())

I'm using amazon-textract-textractor version 1.8.2 (latest)

Author

bvbg1 commented Jul 16, 2024

@Belval any ideas?

Author

bvbg1 commented Jul 28, 2024

Anyone?

Contributor

Belval commented Aug 12, 2024

@bvbg1 I need to reproduce the issue to be sure, but there is nothing obviously wrong with the code snippet that you shared. This could be a regression in 1.8. Do you see the same output in 1.7.12?

Belval self-assigned this Aug 12, 2024
Contributor

Belval commented Aug 12, 2024

I am unable to reproduce this issue with 1.8.2; the output looks the same as the notebook one. Can you provide the output of pip freeze?
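
If a full pip freeze is too noisy, a narrower listing may be enough. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the `amazon-textract` prefix filter are illustrative choices, not part of textractor.

```python
from importlib import metadata


def installed_versions(prefix: str = "amazon-textract") -> dict[str, str]:
    """Return {name: version} for installed distributions whose name starts with `prefix`."""
    versions = {}
    for dist in metadata.distributions():
        # Name can be missing for broken installs, so guard against None.
        name = dist.metadata.get("Name")
        if name and name.lower().startswith(prefix.lower()):
            versions[name] = dist.version
    # Sort by package name so the listing is stable between runs.
    return dict(sorted(versions.items()))


if __name__ == "__main__":
    # Print in the same "name==version" shape that pip freeze uses.
    for name, version in installed_versions().items():
        print(f"{name}=={version}")
```

Running this in the same environment as the failing script should show whether the textractor version in use is really the one expected.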

Author

bvbg1 commented Aug 12, 2024

amazon-textract-caller==0.2.4
amazon-textract-helper==0.0.32
amazon-textract-overlayer==0.0.10
amazon-textract-prettyprinter==0.0.16
amazon-textract-response-parser==1.0.2
amazon-textract-textractor==1.8.2

I just checked again: around "Earnings" I'm not getting a proper markdown table.
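
A possible workaround, not confirmed by anyone in this thread: the broken rows in the output above look like cell text that contains line breaks, which a pipe table cannot hold. Assuming that is the cause, the sketch below collapses whitespace in each cell before rendering. `clean_cell` and `rows_to_markdown` are hypothetical helpers that work on plain lists of strings; extracting those rows from the Textract tables is left out here.

```python
import re


def clean_cell(text: str) -> str:
    """Collapse newlines and repeated whitespace so a cell stays on one line."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    # Escape pipes so cell text cannot be mistaken for a column separator.
    return collapsed.replace("|", r"\|")


def rows_to_markdown(rows: list[list[str]]) -> str:
    """Render rows (first row is the header) as a pipe table."""
    cleaned = [[clean_cell(cell) for cell in row] for row in rows]
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in cleaned)
    cleaned = [row + [""] * (width - len(row)) for row in cleaned]
    header, body = cleaned[0], cleaned[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in range(width)) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    # Header cells with trailing newlines, similar to the reported output.
    sample = [["Earnings\n", "rate\n", "hours\n"], ["Regular", "10.00", "32.00"]]
    print(rows_to_markdown(sample))
```

This only hides the symptom; it does not explain why the same library version gives a clean table in the notebook.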

Author

bvbg1 commented Aug 16, 2024

@Belval have you had a chance to check?

Author

bvbg1 commented Aug 23, 2024

Are you able to reproduce this or is there something wrong on my end?

Author

bvbg1 commented Sep 4, 2024

@Belval ?
