extract paragraph with below table #2699

danny007in · 2023-09-29T10:13:32Z

PDF is http://kea.kar.nic.in/diploma_2022/engg_cutoff_gen_mock.pdf

When extracting, all the college names come through and the tables are separated, but I need the output line by line so I can export it to Excel.

JorjMcKie · 2023-09-30T05:54:54Z

Sorry, I don't understand what this means:

When extracting, all the college names come through and the tables are separated, but I need the output line by line so I can export it to Excel.

JorjMcKie · 2023-09-30T07:58:28Z

After a closer look, I saw that you may be affected by a bug that someone detected and submitted a fix for today.

The table finder has trouble telling apart the many tables on your pages, because all sorts of line drawings outside the tables are confusing it.

My recommendation:

1. Apply a hotfix by manually modifying table.py in your installation as suggested here. Essentially, delete line 1893 of that file.
2. Use the following script to detect all tables. The script uses certain table headers as table delimiters and then feeds the resulting sub-rectangles into the table finder.

import fitz

doc = fitz.open("engg_cutoff_gen_mock.pdf")

for page in doc:
    page.clean_contents()
    print(f"Locating tables on {page}.")
    prect = page.rect

    # Locate words "Ennn" and use their bottom y as table delimiter
    y_coords = [
        w[3]  # bottom coordinate of word
        for w in page.get_text("words")
        if w[4].startswith("E") and w[4][1:].isdigit() and len(w[4]) == 4
    ] + [
        prect.y1
    ]  # always append page bottom coordinate

    clips = []  # look for tables inside these clips

    # compute the clips
    for i in range(len(y_coords) - 1):
        clip = +prect
        clip.y0 = y_coords[i]
        clip.y1 = y_coords[i + 1]
        clips.append(clip)

    # now we hopefully are equipped to find all tables!
    for i, clip in enumerate(clips):
        tabs = page.find_tables(clip=clip)
        if tabs.tables:
            tab = tabs[0]
            page.draw_rect(tab.bbox, color=(1, 0, 0))
            print(f"   Detected table {i}")

doc.ez_save("x.pdf")
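
The script above keeps only the y coordinates of the "Ennn" words. If the college code should stay attached to its clip, the same delimiter logic can be sketched as a small pure function. This is only an illustration under assumptions: `code_bands` and the sample word tuples are hypothetical, the tuples merely follow the `(x0, y0, x1, y1, word, block_no, line_no, word_no)` layout returned by `page.get_text("words")`, and unlike the script it sorts the matches top to bottom.

```python
import re

# Same "Ennn" pattern the script above tests for: "E" followed by three digits
CODE = re.compile(r"E\d{3}")


def code_bands(words, page_bottom):
    """Return (college_code, y0, y1) for each "Ennn" word, top to bottom.

    y0 is the bottom of the code word; y1 is the bottom of the next code
    word, or the page bottom for the last one (as in the script above).
    """
    hits = sorted((w[3], w[4]) for w in words if CODE.fullmatch(w[4]))
    bottoms = [y for y, _ in hits[1:]] + [page_bottom]
    return [(code, y0, y1) for (y0, code), y1 in zip(hits, bottoms)]


# Made-up word tuples, not taken from the real PDF
sample_words = [
    (30, 100, 60, 112, "E001", 0, 0, 0),
    (70, 100, 140, 112, "University", 0, 0, 1),
    (30, 400, 60, 412, "E002", 1, 0, 0),
]
bands = code_bands(sample_words, page_bottom=842)
print(bands)
```

Each band could then be turned into a clip the same way the script does (copy `page.rect`, then set `y0` and `y1`), with the code carried along next to the table that is found inside it.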

danny007in · 2023-09-30T09:23:25Z

Hi @JorjMcKie
I'm new to Python, but my suggestion is to add a linter to this project, such as https://github.com/pylint-dev/pylint

JorjMcKie · 2023-09-30T09:28:34Z

Did you make the suggested changes in table.py?
If not, then the tables after the first one are not identified.

danny007in · 2023-09-30T11:22:36Z

Did you make the suggested changes in table.py? If not, then the tables after the first one are not identified.

I updated table.py, but at first it still did not work. After restarting Jupyter and VS Code (a caching issue), it is working now.

danny007in · 2023-10-01T04:14:02Z

Is there any method that can easily organize the table like this?

[
  {
    "college_code": "E001",
    "college_name": "E001 University of Visvesvaraya College of Engineering  Bangalore",
    "college_district": "BANGALORE",
    "course": "AI Artificial Intelligence",
    "cast": "1G",
    "rank": 7418
  },
]

JorjMcKie · 2023-10-01T06:37:01Z

Is there any method that can easily organize the table like this?

[
  {
    "college_code": "E001",
    "college_name": "E001 University of Visvesvaraya College of Engineering  Bangalore",
    "college_district": "BANGALORE",
    "course": "AI Artificial Intelligence",
    "cast": "1G",
    "rank": 7418
  },
]

Sure!
If tab is your table, then df = tab.to_pandas() creates a pandas DataFrame that does all that for you!
But associating the header line "E001 ..." with the table is something only your own code can do.
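
As a rough sketch of that last step, the function below flattens one table into records shaped like the requested example. It rests on assumptions the thread does not confirm: that the first table row holds the category codes, that the first cell of every later row is the course name, and that the rows arrive as a list of lists (for instance from `tab.extract()`). The name `rows_to_records` and the sample rows are hypothetical, and `college_district` is left out because nothing here shows where it would come from.

```python
import re


def rows_to_records(header_line, rows):
    """Flatten one table into one record per (course, category) cell.

    Assumed layout (not confirmed by the thread): rows[0] holds the
    category codes, and row[0] of each later row is the course name.
    """
    name = header_line.strip()
    match = re.match(r"(E\d{3})\b", name)
    college_code = match.group(1) if match else None
    categories = rows[0][1:]
    records = []
    for row in rows[1:]:
        course = row[0]
        for category, cell in zip(categories, row[1:]):
            text = (cell or "").strip()
            if not text.isdigit():
                continue  # skip empty cells and placeholders such as "--"
            records.append(
                {
                    "college_code": college_code,
                    "college_name": name,
                    "course": course,
                    "cast": category,
                    "rank": int(text),
                }
            )
    return records


# Made-up rows in the assumed layout, not extracted from the real PDF
sample_rows = [
    ["Course", "1G", "1K"],
    ["AI Artificial Intelligence", "7418", "--"],
]
records = rows_to_records(
    "E001 University of Visvesvaraya College of Engineering  Bangalore",
    sample_rows,
)
print(records)
```

A list of records like this can be passed to `pandas.DataFrame(records)` and written out with `DataFrame.to_excel` if an Excel file is the goal.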

julian-smith-artifex-com · 2023-10-13T09:29:37Z

Fixed in 1.23.5.


JorjMcKie added the Fixed in next release label Oct 3, 2023

JorjMcKie self-assigned this Oct 3, 2023

JorjMcKie added the bug label Oct 10, 2023

julian-smith-artifex-com removed the Fixed in next release label Oct 13, 2023

julian-smith-artifex-com closed this as completed Oct 13, 2023
