extract paragraph with below table #2699

Closed
danny007in opened this issue Sep 29, 2023 · 9 comments

@danny007in

danny007in commented Sep 29, 2023

PDF is http://kea.kar.nic.in/diploma_2022/engg_cutoff_gen_mock.pdf

When extracting, all the college names come through and the tables are separated, but I need the output line by line to export to Excel.

@JorjMcKie
Collaborator

Sorry, I don't understand what this means:

while extracting all college name is coming and tables are separated, but require line by line to export in excel

@JorjMcKie
Collaborator

After a closer look, I saw that you may have been hit by a bug that someone detected and submitted a fix for today.

The table finder has a problem telling apart the many tables on your pages, because there are all sorts of line drawings outside tables, which are confusing it.

My recommendation:

  1. Apply a hotfix by manually modifying table.py in your installation as suggested here. Essentially, delete line 1893 (shown in a screenshot in the original comment).

  2. Use the following script to detect all tables. The script uses certain table headers as table delimiters, then feeds the resulting sub-rectangles into the table finder.

import fitz

doc = fitz.open("engg_cutoff_gen_mock.pdf")

for page in doc:
    page.clean_contents()
    print(f"Locating tables on {page}.")
    prect = page.rect

    # Locate words "Ennn" and use their bottom y as table delimiter
    y_coords = [
        w[3]  # bottom coordinate of word
        for w in page.get_text("words")
        if w[4].startswith("E") and w[4][1:].isdigit() and len(w[4]) == 4
    ] + [
        prect.y1
    ]  # always append page bottom coordinate

    clips = []  # look for tables inside these clips

    # compute the clips
    for i in range(len(y_coords) - 1):
        clip = +prect
        clip.y0 = y_coords[i]
        clip.y1 = y_coords[i + 1]
        clips.append(clip)

    # now we hopefully are equipped to find all tables!
    for i, clip in enumerate(clips):
        tabs = page.find_tables(clip=clip)
        if tabs.tables:
            tab = tabs[0]
            page.draw_rect(tab.bbox, color=(1, 0, 0))
            print(f"   Detected table {i}")

doc.ez_save("x.pdf")
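
The delimiter step in the script above can be isolated into a small pure function, which makes it easier to check without a PDF at hand. This is only a sketch: `table_bands` and `CODE_PATTERN` are hypothetical names, not part of PyMuPDF, and it assumes the word tuples have the `(x0, y0, x1, y1, text, ...)` layout that `page.get_text("words")` returns. Unlike the script, it sorts and de-duplicates the coordinates, which is an added safeguard rather than something the original does.

```python
import re

# College codes such as "E001": the letter E followed by exactly three digits.
CODE_PATTERN = re.compile(r"E\d{3}")


def table_bands(words, page_bottom):
    """Return (top, bottom) y-ranges, one per college code found on the page.

    `words` is a list of tuples shaped like the items of page.get_text("words"):
    index 3 is the bottom y-coordinate, index 4 is the word text.
    """
    # Bottom edge of every word that looks like a college code.
    ys = sorted({w[3] for w in words if CODE_PATTERN.fullmatch(w[4])})
    ys.append(page_bottom)  # the last band ends at the page bottom
    # Pair each delimiter with the one that follows it.
    return list(zip(ys, ys[1:]))


# Made-up word tuples, for illustration only.
sample_words = [
    (50.0, 90.0, 80.0, 100.0, "E001", 0, 0, 0),
    (85.0, 90.0, 140.0, 100.0, "University", 0, 0, 1),
    (50.0, 290.0, 80.0, 300.0, "E002", 3, 0, 0),
    (50.0, 400.0, 90.0, 410.0, "E12345", 5, 0, 0),  # wrong length, ignored
]
bands = table_bands(sample_words, page_bottom=842.0)
print(bands)
```

Each band can then be turned into a clip rectangle (copy `page.rect`, set `y0` and `y1`) and passed to `page.find_tables(clip=clip)`, as the script does.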

@danny007in
Author

Hi @JorjMcKie
I'm new to Python, but my suggestion is to add a linter to this project, such as https://github.com/pylint-dev/pylint

@JorjMcKie
Collaborator

Did you make the suggested changes in table.py?
If not, then the tables after the first one are not identified.

@danny007in
Author

Yes, I updated table.py, but it still did not work at first. After restarting Jupyter and VS Code (a caching issue), it is working now.

@danny007in
Author

danny007in commented Oct 1, 2023

Is there a method that can easily organize a table into a structure like this?

[
  {
    "college_code": "E001",
    "college_name": "E001 University of Visvesvaraya College of Engineering  Bangalore",
    "college_district": "BANGALORE",
    "course": "AI Artificial Intelligence",
    "cast": "1G",
    "rank": 7418
  },
]

@JorjMcKie
Collaborator

JorjMcKie commented Oct 1, 2023

Sure!
If tab is your table, then df = tab.to_pandas() creates a pandas DataFrame that does all that for you!
But associating the header line "E001 ..." with each table is something your own code has to do.
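
One way to do that association is sketched below. It is a rough illustration under stated assumptions, not a tested recipe for this PDF: `table_to_records` is a hypothetical helper, and the layout it assumes (first row holds the category names, first column holds the course name, remaining cells hold ranks) is a guess about these tables. The input rows are meant to look like what `tab.extract()` returns, a list of rows with one string (or `None`) per cell. The sample values other than the "E001" entry quoted above are made up. The `college_district` field is left out because nothing in the thread says where it would come from.

```python
import json


def table_to_records(header_line, rows):
    """Flatten one table into one record per (course, category) cell.

    header_line: the "E001 ..." text found above the table.
    rows: list of rows, each a list of cell strings (or None for empty cells).
    """
    college_name = " ".join(header_line.split())  # normalize whitespace
    college_code = college_name.split()[0]        # assumed to be the first word
    categories = [(c or "").strip() for c in rows[0][1:]]

    records = []
    for row in rows[1:]:
        course = " ".join((row[0] or "").split())
        for category, cell in zip(categories, row[1:]):
            text = (cell or "").strip()
            if not text.isdigit():
                continue  # skip empty cells and placeholders such as "--"
            records.append(
                {
                    "college_code": college_code,
                    "college_name": college_name,
                    "course": course,
                    "cast": category,
                    "rank": int(text),
                }
            )
    return records


# Made-up rows standing in for the cell text of one detected table.
sample_rows = [
    ["", "1G", "GM"],
    ["AI Artificial Intelligence", "7418", "--"],
    ["CE Civil Engineering", None, "20110"],
]
records = table_to_records(
    "E001 University of Visvesvaraya College of Engineering  Bangalore",
    sample_rows,
)
print(json.dumps(records, indent=2))
```

The resulting list of dictionaries can be handed to `pandas.DataFrame(records)` and written out with `DataFrame.to_excel` if an Excel file is the goal.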

@julian-smith-artifex-com
Collaborator

Fixed in 1.23.5.
