Fix #903: Keep track of OCG for LTCurve, LTLine, LTRect #924

Closed
hfmandell wants to merge 5 commits

@hfmandell hfmandell commented Nov 30, 2023

Pull request

This PR fixes issue #903, which I raised after running into this problem.

Many vector PDFs contain Optional Content Groups (OCGs), also known as layers. When extracting layout objects such as LTCurve, LTLine, and LTRect, it can be useful to know which OCG each object belongs to. This PR makes that possible by:

  1. Adding ocg attributes to LTCurve, LTLine, and LTRect in 'pdfminer/layout.py'
  2. Setting the ocg attribute in 'pdfminer/converter.py'
  3. Adding an ocg attribute to the PDFGraphicState object in 'pdfminer/pdfinterp.py'
  4. Setting the PDFGraphicState's ocg attribute in 'pdfminer/pdfinterp.py' when the BDC operator is encountered in the PDF's content stream, and ensuring the current ocg value is preserved when the graphics state is restored by the Q operator.
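
The four steps above can be illustrated with a small, self-contained sketch. It does not use pdfminer or reproduce the PR's diff: `GraphicState`, `Line`, and `MiniInterpreter` are hypothetical stand-ins for PDFGraphicState, LTLine, and the page interpreter, and only the `ocg` attribute and the `do_BDC` / `do_Q` behaviour follow the description.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class GraphicState:
    """Hypothetical stand-in for PDFGraphicState, reduced to two fields."""
    linewidth: float = 0.0
    ocg: Optional[str] = None  # attribute proposed in step 3


@dataclass
class Line:
    """Hypothetical stand-in for LTLine with the ocg attribute from step 1."""
    linewidth: float
    ocg: Optional[str]


class MiniInterpreter:
    """Toy interpreter handling only the operators relevant to the description."""

    def __init__(self) -> None:
        self.gs = GraphicState()
        self.stack: List[GraphicState] = []
        self.items: List[Line] = []

    def do_q(self) -> None:
        # Save a copy of the current graphics state.
        self.stack.append(replace(self.gs))

    def do_Q(self) -> None:
        # Restore the saved state, but carry the current ocg over (step 4).
        if self.stack:
            current_ocg = self.gs.ocg
            self.gs = self.stack.pop()
            self.gs.ocg = current_ocg

    def do_w(self, linewidth: float) -> None:
        self.gs.linewidth = linewidth

    def do_BDC(self, tag: str, props: str) -> None:
        # Record the props of the marked-content sequence as a string (step 4).
        self.gs.ocg = str(props)

    def paint_line(self) -> None:
        # Copy the ocg from the graphics state onto the layout object (step 2).
        self.items.append(Line(self.gs.linewidth, self.gs.ocg))


interp = MiniInterpreter()
interp.do_q()
interp.do_w(2.0)
interp.do_BDC("/OC", "/oc13")
interp.do_Q()          # line width is restored, ocg is kept
interp.paint_line()
line = interp.items[0]
```

In this sketch the painted line ends up with the default line width but still carries "/oc13", which is the behaviour step 4 describes for the Q operator.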

How Has This Been Tested?

Please remove this paragraph with a description of how this PR has been tested.
[TODO]

Checklist

  • I have read CONTRIBUTING.md.
  • I have added a concise human-readable description of the change to CHANGELOG.md.
  • I have tested that this fix is effective or that this feature works.
  • I have added docstrings to newly created methods and classes.
  • I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

Member

@pietermarsman pietermarsman left a comment

Can you give an example of the content of the props in do_BDC() that you would like to use?

Author

hfmandell commented Jan 8, 2024

> Can you give an example of the content of the props in do_BDC() that you would like to use?

The props are not obviously helpful at first glance: each is simply an alphanumeric string unique to a particular OCG. In testing, I've seen props that clearly point to an OCG, such as "/oc13". Others are less clear and bear no resemblance to the acronym "OCG". They appear in the output of dumppdf.py for a given PDF, with a leading "/".

More logic is needed to tie these props to the actual name of the OCG in the PDF, for example the "Roads" layer of a layered PDF map. Still, associating a PDF vector drawing with its props lets the user group LTCurve, LTLine, and LTRect objects by OCG. A future PR could tie them directly to the PDF layer name.
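
As a rough illustration of that follow-up step, the sketch below looks up a props string in a page's resource dictionary, where the PDF format lists named property lists under /Properties and an OCG dictionary carries its layer name under /Name. This is not part of the PR: `ocg_layer_name` is a hypothetical helper, and plain Python dicts and strings stand in for the objects pdfminer would return (real code would first need to resolve indirect references and name objects).

```python
from typing import Any, Dict, Optional


def ocg_layer_name(resources: Dict[str, Any], props: str) -> Optional[str]:
    """Return the layer name for a props string such as "/oc13", if any.

    `resources` is assumed to be an already-resolved page resource dictionary
    using plain dicts and strings (an assumption made for this sketch).
    """
    key = props.lstrip("/")
    entry = resources.get("Properties", {}).get(key)
    if not isinstance(entry, dict) or entry.get("Type") != "OCG":
        return None  # unknown name, or a property list that is not an OCG
    return entry.get("Name")


# Hypothetical resource dictionary for a layered map page.
resources = {
    "Properties": {
        "oc13": {"Type": "OCG", "Name": "Roads"},
        "MC0": {"Color": [0, 0, 0]},
    }
}

roads = ocg_layer_name(resources, "/oc13")
other = ocg_layer_name(resources, "/MC0")
missing = ocg_layer_name(resources, "/oc99")
```

With such a lookup, objects grouped by props could be relabelled with the layer name the user sees in a PDF viewer.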

@pietermarsman
Member

Thanks for the extra info. I see now why storing the OCG could be useful in some specific cases.

I've been reading 8.11 (Optional Content) from the PDF Reference, but find it quite tricky to understand. Do you happen to have a PDF that has optional content groups that you can share? That would help me to understand them.

As far as I understand it now, the properties of the BDC operator are also used for other purposes, not just OCGs. Simply converting them to a string and storing it in the graphics state is therefore not enough. For example, the test PDFs have a couple of BDCs with a /P tag and some extra properties. I think these are unrelated to OCGs, but correct me if I'm wrong.
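
One way to make that distinction concrete is sketched below, assuming that optional content is marked in the content stream with the /OC tag followed by a property name, while tags such as /P carry unrelated properties. `MarkedContentTracker` is a hypothetical class, not pdfminer API, and it ignores details such as optional content membership dictionaries.

```python
from typing import Any, List, Optional


class MarkedContentTracker:
    """Tracks nested marked-content sequences and the innermost OCG name."""

    def __init__(self) -> None:
        # One entry per open BDC: the OCG name, or None for non-OC tags.
        self._stack: List[Optional[str]] = []

    def begin(self, tag: str, props: Any) -> None:
        """Handle BDC: only an /OC tag with a named property counts as an OCG."""
        if tag.lstrip("/") == "OC" and isinstance(props, str):
            self._stack.append(props.lstrip("/"))
        else:
            self._stack.append(None)

    def end(self) -> None:
        """Handle EMC: close the innermost marked-content sequence."""
        if self._stack:
            self._stack.pop()

    @property
    def current_ocg(self) -> Optional[str]:
        for name in reversed(self._stack):
            if name is not None:
                return name
        return None


tracker = MarkedContentTracker()
tracker.begin("/P", {"MCID": 0})      # structure tag, not an OCG
before = tracker.current_ocg
tracker.begin("/OC", "/oc13")         # optional content
tracker.begin("/Span", {"Lang": "en"})
inside = tracker.current_ocg
tracker.end()
tracker.end()
after = tracker.current_ocg
```

Here the /P and /Span sequences never change the reported OCG, and closing the /OC sequence clears it again.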

@pietermarsman
Member

Closing because there has been no response. Feel free to reopen when more information is available.

Successfully merging this pull request may close these issues.

Get Layer Associated with Individual Vector Graphic