Fix #903: Keep track of OCG for LTCurve, LTLine, LTRect #924

hfmandell · 2023-11-30T04:51:43Z

Pull request

This PR fixes issue #903, which I raised after encountering this problem.

Many vector PDFs contain Optional Content Groups (OCGs), also referred to as layers. When extracting LTComponent objects such as LTCurve, LTLine, and LTRect, it can be useful to know which OCG each object belongs to. This PR accomplishes that by:

Adding ocg attributes to LTCurve, LTLine, and LTRect in 'pdfminer/layout.py'
Setting the ocg attribute in 'pdfminer/converter.py'
Adding an ocg attribute to the PDFGraphicState object in 'pdfminer/pdfinterp.py'
Setting the PDFGraphicState's ocg attribute in 'pdfminer/pdfinterp.py' when the BDC (begin marked content) operator is encountered in the PDF's content stream, and ensuring the current ocg value is maintained even when the graphics state is restored by the Q operator.
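
The steps above can be sketched without pdfminer itself. The classes below are hypothetical stand-ins, not the code in this PR: ToyGraphicState, ToyInterpreter, and the paint helper are illustrative names, and only the do_q/do_Q/do_BDC method names follow the operator-handler naming used in 'pdfminer/pdfinterp.py'. The sketch assumes the props operand arrives as a plain name such as "oc13".

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass
class ToyGraphicState:
    """Hypothetical stand-in for PDFGraphicState, with the added ocg field."""
    linewidth: float = 0.0
    ocg: Optional[str] = None


class ToyInterpreter:
    """Hypothetical stand-in for the content-stream interpreter."""

    def __init__(self) -> None:
        self.graphicstate = ToyGraphicState()
        self.gstack: List[ToyGraphicState] = []
        # Each painted item records (kind, linewidth, ocg), like an LT object would.
        self.painted: List[Tuple[str, float, Optional[str]]] = []

    def do_q(self) -> None:
        # q: push a copy of the current graphics state.
        self.gstack.append(replace(self.graphicstate))

    def do_Q(self) -> None:
        # Q: restore the saved state, but carry the current ocg across.
        if self.gstack:
            current_ocg = self.graphicstate.ocg
            self.graphicstate = self.gstack.pop()
            self.graphicstate.ocg = current_ocg

    def do_w(self, linewidth: float) -> None:
        # w: set the line width (restored by Q, unlike ocg).
        self.graphicstate.linewidth = linewidth

    def do_BDC(self, tag: str, props: str) -> None:
        # BDC: remember the props as the current ocg.
        self.graphicstate.ocg = str(props)

    def paint(self, kind: str) -> None:
        # Copy the ocg from the graphics state onto the painted object.
        self.painted.append(
            (kind, self.graphicstate.linewidth, self.graphicstate.ocg)
        )


interp = ToyInterpreter()
interp.do_q()
interp.do_w(2.0)
interp.do_BDC("OC", "oc13")
interp.paint("LTLine")
interp.do_Q()
interp.paint("LTRect")
print(interp.painted)
# [('LTLine', 2.0, 'oc13'), ('LTRect', 0.0, 'oc13')]
```

The second painted object shows the intended behavior: the line width is rolled back by Q, while the ocg value survives the restore.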

How Has This Been Tested?

Please remove this paragraph with a description of how this PR has been tested.
[TODO]

Checklist

I have read CONTRIBUTING.md.
I have added a concise human-readable description of the change to CHANGELOG.md.
I have tested that this fix is effective or that this feature works.
I have added docstrings to newly created methods and classes.
I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

pietermarsman

Can you give an example of the content of the props in do_BDC() that you would like to use?

hfmandell · 2024-01-08T07:08:40Z

Can you give an example of the content of the props in do_BDC() that you would like to use?

The props are not obviously helpful at first glance: each is simply an alphanumeric string that is unique to a particular OCG. In testing, I've seen props that clearly point to an OCG, such as "/oc13". Others are less clear and bear no resemblance to the acronym "OCG". They can be seen, with the leading "/", in the output of dumppdf.py for a given PDF.
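
To make the shape of these props concrete, here is a minimal sketch that pulls them out of a content stream. The fragment is made up for illustration and assumed to be already decompressed; the oc_props helper and the regular expression are hypothetical, and a regex is only a rough stand-in for a real content-stream parser.

```python
import re
from typing import List

# Hypothetical, already-decompressed content-stream fragment.
CONTENT = b"""q
/OC /oc13 BDC
0 0 100 50 re
S
EMC
Q
/P <</MCID 0>> BDC
BT ET
EMC
"""

# Matches "/OC /<name> BDC" and captures <name> without the leading "/".
OC_BDC = re.compile(rb"/OC\s*/([^\s/<>\[\]()]+)\s+BDC")


def oc_props(stream: bytes) -> List[str]:
    """Return the props names of every BDC operator tagged /OC."""
    return [m.group(1).decode("ascii") for m in OC_BDC.finditer(stream)]


print(oc_props(CONTENT))  # ['oc13']
```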

A bit more logic is needed to tie these props to the actual name of the OCG in the PDF, for example the "Roads" layer of a layered PDF map. Still, associating each PDF vector drawing with its props lets the user group LTCurves, LTLines, and LTRects by OCG. A future PR could tie them directly to the PDF layer name.
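
A rough sketch of that extra lookup, assuming the props name is a key in the /Properties subdictionary of the page's resource dictionary and that the entry is an OCG dictionary with a /Name. Plain Python dicts stand in for resolved PDF objects here; RESOURCES and ocg_layer_name are hypothetical, and a real implementation would first have to resolve indirect references and decode names and strings.

```python
from typing import Any, Dict, Optional

# Hypothetical resources: plain dicts standing in for resolved PDF objects.
RESOURCES: Dict[str, Any] = {
    "Properties": {
        "oc13": {"Type": "OCG", "Name": "Roads"},
        "MC0": {"Metadata": "not an OCG"},
    }
}


def ocg_layer_name(props: str, resources: Dict[str, Any]) -> Optional[str]:
    """Map a props name such as "/oc13" to its layer name, if it is an OCG."""
    key = props.lstrip("/")
    entry = resources.get("Properties", {}).get(key)
    # Entries of other types (or missing entries) yield no layer name.
    if isinstance(entry, dict) and entry.get("Type") == "OCG":
        return entry.get("Name")
    return None


print(ocg_layer_name("/oc13", RESOURCES))  # Roads
```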

pietermarsman · 2024-01-16T20:30:13Z

Thanks for the extra info. I see now why storing the OCG could be useful in some specific cases.

I've been reading 8.11 (Optional Content) from the PDF Reference, but find it quite tricky to understand. Do you happen to have a PDF that has optional content groups that you can share? That would help me to understand them.

As far as I understand now, the properties of the BDC operator are also used for other purposes, not just OCGs. Therefore, simply converting them to a string and storing that in the graphics state is not enough. E.g. the test PDFs have a couple of BDCs with a /P tag and some extra properties. I think these are unrelated to OCGs, but correct me if I'm wrong.
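
One possible way to address this, sketched under the assumption that optional content is marked with the /OC tag and a name operand, while other tags such as /P carry unrelated properties. MarkedContentTracker is a hypothetical helper, not part of pdfminer; it keeps a stack so that every BDC or BMC is paired with its EMC, and reports the innermost enclosing OCG.

```python
from typing import List, Optional


class MarkedContentTracker:
    """Hypothetical helper that tracks which OCG, if any, is currently open."""

    def __init__(self) -> None:
        # One entry per open marked-content sequence; None means "not an OCG".
        self._stack: List[Optional[str]] = []

    def do_BMC(self, tag: str) -> None:
        # BMC has no props, so it never names an OCG, but it still needs EMC.
        self._stack.append(None)

    def do_BDC(self, tag: str, props: object) -> None:
        # Only /OC sequences with a name operand are treated as OCG markers.
        if tag.lstrip("/") == "OC" and isinstance(props, str):
            self._stack.append(props.lstrip("/"))
        else:
            self._stack.append(None)

    def do_EMC(self) -> None:
        # EMC closes the most recently opened sequence.
        if self._stack:
            self._stack.pop()

    @property
    def current_ocg(self) -> Optional[str]:
        # The innermost enclosing /OC sequence wins.
        for entry in reversed(self._stack):
            if entry is not None:
                return entry
        return None


tracker = MarkedContentTracker()
tracker.do_BDC("OC", "oc13")
tracker.do_BDC("P", {"MCID": 0})  # unrelated to OCGs, so the OCG is unchanged
print(tracker.current_ocg)  # oc13
tracker.do_EMC()
tracker.do_EMC()
print(tracker.current_ocg)  # None
```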

pietermarsman · 2024-06-24T06:00:59Z

Closing because there has been no response. Feel free to reopen when extra info is available.

Hannah Mandell and others added 4 commits November 29, 2023 19:41

add ocg property to pdfgraphicstate

46167f3

update do_Q to keep graphicstate

29c4e64

fix do_BDC to grab props

5e782f9

add ocg attribute to lt objects

672f511

pietermarsman requested changes Dec 22, 2023

pdfminer/layout.py (resolved)

pdfminer/pdfinterp.py (resolved)

Merge branch 'master' into hfmandell/master

1a3ca7c

pietermarsman added the status: needs solution label Jan 1, 2024

pietermarsman closed this Jun 24, 2024
