remove scrubbing of font name #4085

lababidi · 2024-11-25T18:12:39Z

In get_text, I expect the font names to match the font names from get_fonts (get_page_fonts), but the resulting names are different. If the end user wants "scrubbed" font names, applying the scrubbing at the end makes more sense. I don't believe this will break APIs.

For context, having the full font name (AABDFA+Helvetica) is incredibly useful for analyzing PDFs. Without it, a lot of that ability is gone.

github-actions · 2024-11-25T18:12:56Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

lababidi · 2024-11-25T18:21:25Z

I have read the CLA Document and I hereby sign the CLA

lababidi · 2024-11-25T18:22:47Z

recheck

lababidi · 2024-11-25T18:25:23Z

@julian-smith-artifex-com any thoughts on this?

JorjMcKie · 2024-11-26T11:59:30Z

I do not think that this PR is able to solve what seems to be your intention.

The list returned by page.get_fonts() reports the font entries in the /Resources of a page's object definition. No more, no less. The fonts actually used may be a proper subset of those.
The font names appearing in page.get_text("dict") need not (exactly) equal the font names mentioned in that list, and this can never be guaranteed whatever we do. Reasons include:
- By default, we omit the subset prefix "ABCDEF+" from the font name. You can request that it be included by setting the global option pymupdf.TOOLS.set_subset_fontnames(True).
- The name in "dict" is reported by MuPDF which looks at the self identification in the font file - not at the PDF object definition. Both names may be (and frequently are) different. Also, MuPDF only returns up to 31 bytes of the font name - the PDF definition may be much longer. So the "dict" output will never be longer than 31 bytes.
- Text extraction works for all document types - not just PDF. So we cannot expect perfect parallelism with get_fonts(), which is a PDF-only function.
- As a consequence of the above, you may see identical font names in "dict" outputs although the respective text pieces in fact are displayed by truly different fonts in the PDF - even when subset fontnames is True.
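
To make the points above concrete, here is a minimal sketch, not part of the PR. The helper names `split_subset_tag` and `font_names` are hypothetical. It assumes subset prefixes have the usual form of six uppercase letters plus "+", that PyMuPDF is importable as `pymupdf`, and that the basefont is the fourth item of each `get_fonts()` entry.

```python
import re

# Assumed prefix shape: six uppercase letters and "+", e.g. "AABDFA+Helvetica".
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when there is no subset prefix."""
    match = _SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


def font_names(path, page_number=0):
    """Collect font names from /Resources and from the extracted text of one page."""
    import pymupdf  # imported lazily so split_subset_tag works without PyMuPDF

    # Global option mentioned above: keep "ABCDEF+" prefixes in text output.
    pymupdf.TOOLS.set_subset_fontnames(True)
    with pymupdf.open(path) as doc:
        page = doc[page_number]
        # Each get_fonts() item: (xref, ext, type, basefont, name, encoding)
        declared = {item[3] for item in page.get_fonts()}
        used = set()
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no "lines"
                for span in line["spans"]:
                    used.add(span["font"])
    return declared, used
```

Even with the option enabled, the two sets returned by `font_names` need not be equal, for the reasons listed above; `split_subset_tag` only helps when comparing names with and without the prefix.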

lababidi · 2024-11-26T16:32:40Z

Thank you for the clarification. The global option will help me.

I didn't realize how closely the font system is integrated with MuPDF. Thanks again.

JorjMcKie · 2024-11-26T20:45:33Z

Thank you for your understanding and your willingness to contribute!

remove scrubbing of font name

b51e916

remove scrubbing of font name

b82ec53

github-actions bot added a commit that referenced this pull request Nov 25, 2024

@lababidi has signed the CLA in #4085

d534d18

Create python-package-ubuntu.yml

90a765c

lababidi closed this Nov 26, 2024

github-actions bot locked and limited conversation to collaborators Nov 26, 2024
