Prevent line breaks, deliver reading order.
Refactor plain text and "words" extraction with sort=True. Previously, the output was simply sorted by ascending bottom and then left coordinate. This change instead collects words (or text, respectively) that lie approximately on the same line. Except on extremely malformed pages, words and text are now returned in "natural" reading sequence. The change also suppresses line breaks that MuPDF generates merely because of large horizontal distances, as often happens between the table cell contents of the same row.
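The idea of collecting words that are approximately on the same line can be illustrated with a minimal, standalone sketch. This is not the code from this commit: the helper name `group_words_into_lines`, the `tolerance` parameter, and its default value are assumptions made for illustration only. Word tuples follow the `(x0, y0, x1, y1, text)` layout of the first five fields returned by `page.get_text("words")`.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text) into lines of text.

    Hypothetical helper, not PyMuPDF's implementation: a word joins the
    current line if its bottom coordinate (y1) is within `tolerance`
    of the bottom of the word that started that line.
    """
    lines = []  # each entry: [reference_bottom, [words on that line]]
    # Walk the words top to bottom (ascending bottom, then left).
    for word in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(word[3] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word[3], [word]])
    # Within each line, order words left to right and join with spaces.
    return [
        " ".join(w[4] for w in sorted(line_words, key=lambda w: w[0]))
        for _, line_words in lines
    ]


# Two table rows whose cells have slightly different bottom coordinates.
words = [
    (300, 100, 340, 112.5, "Price"),
    (50, 100, 90, 112, "Item"),
    (50, 120, 95, 132, "Apple"),
    (300, 120.4, 330, 131.6, "1.20"),
]
print(group_words_into_lines(words))  # -> ['Item Price', 'Apple 1.20']
```

Sorting these sample words purely by bottom and left coordinate would place "1.20" before "Apple", because its bottom is 0.4 units higher. Grouping by approximate line first and only then sorting left to right avoids that, and joining each group with spaces keeps cells of one row on one output line.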
Showing 4 changed files with 210 additions and 20 deletions.
Binary file not shown.
@@ -0,0 +1,16 @@
import pymupdf

import os.path


def test_linebreaks():
    """Test avoidance of linebreaks."""
    path = os.path.abspath(f"{__file__}/../../tests/resources/test-linebreaks.pdf")
    doc = pymupdf.open(path)
    page = doc[0]
    tp = page.get_textpage(flags=pymupdf.TEXTFLAGS_WORDS)
    word_count = len(page.get_text("words", textpage=tp))
    line_count1 = len(page.get_text(textpage=tp).splitlines())
    line_count2 = len(page.get_text(sort=True, textpage=tp).splitlines())
    assert word_count == line_count1
    assert line_count2 < line_count1 / 2