extract paragraph with below table #2699
Comments
Sorry, I don't understand what this means:
After a closer look I saw that you may be hit by a bug that someone detected and submitted a fix for today. The table finder has a problem telling apart the many tables on your pages, because there are all sorts of line drawings outside tables, which are confusing it. My recommendation:
import fitz

doc = fitz.open("engg_cutoff_gen_mock.pdf")
for page in doc:
    page.clean_contents()
    print(f"Locating tables on {page}.")
    prect = page.rect
    # Locate words "Ennn" and use their bottom y as table delimiter
    y_coords = [
        w[3]  # bottom coordinate of word
        for w in page.get_text("words")
        if w[4].startswith("E") and w[4][1:].isdigit() and len(w[4]) == 4
    ] + [
        prect.y1
    ]  # always append page bottom coordinate
    clips = []  # look for tables inside these clips
    # compute the clips
    for i in range(len(y_coords) - 1):
        clip = +prect
        clip.y0 = y_coords[i]
        clip.y1 = y_coords[i + 1]
        clips.append(clip)
    # now we hopefully are equipped to find all tables!
    for i, clip in enumerate(clips):
        tabs = page.find_tables(clip=clip)
        if tabs.tables:
            tab = tabs[0]
            page.draw_rect(tab.bbox, color=(1, 0, 0))
            print(f" Detected table {i}")
doc.ez_save("x.pdf")
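The clip computation in the snippet above can also be pulled out into a small pure function, which makes it easy to check without a PDF at hand. This is only a sketch: the names `is_college_code` and `band_clips` are hypothetical, rectangles are plain `(x0, y0, x1, y1)` tuples rather than PyMuPDF `Rect` objects, and the sorting and de-duplication of the y values are an added assumption, not something the original loop does.

```python
import re

# Same "Ennn" shape as the check in the snippet above: "E" plus three digits.
CODE_RE = re.compile(r"E[0-9]{3}")


def is_college_code(word: str) -> bool:
    """Return True for words like "E001" (hypothetical helper)."""
    return CODE_RE.fullmatch(word) is not None


def band_clips(page_rect, heading_bottoms):
    """Split a page into horizontal bands, one below each heading.

    page_rect: (x0, y0, x1, y1) of the page.
    heading_bottoms: bottom y coordinate of every "Ennn" word found.
    Returns a list of (x0, top, x1, bottom) tuples; the last band ends
    at the page bottom. Empty bands are dropped.
    """
    x0, _, x1, page_bottom = page_rect
    ys = sorted(set(heading_bottoms)) + [page_bottom]
    return [
        (x0, top, x1, bottom)
        for top, bottom in zip(ys, ys[1:])
        if bottom > top
    ]


# Made-up coordinates, only to show the shape of the result.
clips = band_clips((0, 0, 595, 842), [300, 100])
```

Each resulting tuple could then be wrapped in `fitz.Rect(...)` and passed as `clip=` to `page.find_tables`, as in the loop above.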
Hi @JorjMcKie
Did you make the suggested changes in table.py?
Updated.
Is there any method that can easily organize the table?
Sure!
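One possible way to get the tables out line by line is to write every table row as one CSV line, prefixed with the college code it belongs to; Excel opens CSV files directly. The following is a minimal sketch, not a definitive implementation: `write_tables_csv` is a hypothetical helper, and it assumes each `rows` value comes from PyMuPDF's `Table.extract()`, which returns a list of rows, each a list of cell strings or `None` for empty cells. The sample data is made up.

```python
import csv
import io


def write_tables_csv(tables, out):
    """Write one CSV line per table row.

    tables: iterable of (label, rows) pairs, where label is e.g. the
            college code and rows is a list of lists of cell text
            (None for empty cells), assumed to come from tab.extract().
    out:    any text file-like object.
    """
    writer = csv.writer(out)
    for label, rows in tables:
        for row in rows:
            # Empty cells become "", line breaks inside a cell become spaces.
            cells = ["" if c is None else " ".join(str(c).split()) for c in row]
            writer.writerow([label] + cells)


# Made-up rows standing in for the output of tab.extract().
sample = [("E001", [["Course", "GM"], ["CS", None], ["Civil\nEngg", "1234"]])]
buf = io.StringIO()
write_tables_csv(sample, buf)
```

For a real file, open it with `open("tables.csv", "w", newline="")` and pass that object as `out`.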
Fixed in 1.23.5.
The PDF is http://kea.kar.nic.in/diploma_2022/engg_cutoff_gen_mock.pdf
When extracting, all the college names come out and the tables are separated, but I need the output line by line so that I can export it to Excel.