remove scrubbing of font name #4085

Closed
wants to merge 3 commits into from

lababidi

In get_text, I'm expecting the font names to match the font names from get_fonts (get_page_fonts). The resulting font names are different. If the end user wants "scrubbed" font names, applying the scrubbing at the end makes more sense. I don't believe this will break APIs.

For context, having the full font name (AABDFA+Helvetica) is incredibly useful for analyzing PDFs. Without it, much of that ability is lost.

Contributor

github-actions bot commented Nov 25, 2024

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@lababidi
Author

I have read the CLA Document and I hereby sign the CLA

@lababidi
Author

recheck

github-actions bot added a commit that referenced this pull request Nov 25, 2024
@lababidi
Author

@julian-smith-artifex-com any thoughts on this?

@JorjMcKie
Collaborator

I do not think that this PR is able to solve what seems to be your intention.

  1. The list page.get_fonts() reports the font entries in a page's object definition in /Resources. No more - no less. The fonts actually used may be a true subset of those.
  2. The font names appearing in page.get_text("dict") need not (exactly) equal the font names mentioned in that list, and this can never be guaranteed whatever we do. Reasons include:
    • By default, we omit the subset prefix "ABCDEF+" from the font name. You can request to include them by setting a global option pymupdf.TOOLS.set_subset_fontnames(True).
    • The name in "dict" is reported by MuPDF, which looks at the self-identification in the font file, not at the PDF object definition. The two names may be (and frequently are) different. Also, MuPDF only returns up to 31 bytes of the font name, while the PDF definition may be much longer. So the "dict" output will never be longer than 31 bytes.
    • Text extraction works for all document types - not just PDF. So we cannot expect perfect parallelism with get_fonts(), which is a PDF-only function.
    • As a consequence of the above, you may see identical font names in "dict" outputs although the respective text pieces in fact are displayed by truly different fonts in the PDF - even when subset fontnames is True.
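
To make the points above concrete, here is a minimal sketch for inspecting both name sources side by side. It assumes a recent PyMuPDF that is importable as `pymupdf` (older releases use `import fitz`). The helper names `strip_subset_prefix` and `compare_font_names` are hypothetical and not part of PyMuPDF; the six-uppercase-letter pattern follows the PDF convention for subset tags. As explained above, the two sets are not expected to match exactly.

```python
import re

# A PDF subset tag is six uppercase letters followed by "+",
# for example "AABDFA+Helvetica".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(name: str) -> str:
    """Return the font name without a leading subset tag, if any."""
    return SUBSET_PREFIX.sub("", name)


def compare_font_names(path: str, page_number: int = 0):
    """Return (resource_names, span_names) for one page of a PDF.

    pymupdf is imported lazily so strip_subset_prefix works without it.
    """
    import pymupdf

    # Global option mentioned above: keep "ABCDEF+" prefixes in text output.
    pymupdf.TOOLS.set_subset_fontnames(True)

    with pymupdf.open(path) as doc:
        page = doc[page_number]
        # get_fonts() items: (xref, ext, type, basefont, name, encoding)
        resource_names = {item[3] for item in page.get_fonts()}
        span_names = {
            span["font"]
            for block in page.get_text("dict")["blocks"]
            for line in block.get("lines", [])  # image blocks have no "lines"
            for span in line["spans"]
        }
    return resource_names, span_names
```

Calling `compare_font_names("some.pdf")` (the path is a placeholder) returns the two sets, so any names that differ can be examined directly. If scrubbed names are wanted afterwards, `strip_subset_prefix` can be applied to either set as a final step.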

@lababidi
Author

Thank you for the clarification. The global option will help me.

I didn't realize how deeply the font system is integrated with MuPDF. Thanks again.

@lababidi lababidi closed this Nov 26, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Nov 26, 2024
@JorjMcKie
Collaborator

Thank you for your understanding and your willingness to contribute!
