PDF-HUL-45 : What logic is being checked for malformed filters? #971

Open
asciim0 opened this issue Nov 12, 2024 · 8 comments

@asciim0
Contributor

asciim0 commented Nov 12, 2024

I'm curious what logic is actually being checked for PDF-HUL-45 error messages. I very much appreciate that filter arrays are now supported and no longer throw an error; however, it seems that most of the invalid manipulations I make to filter dictionaries pass validation as well.

Please see the attached file, malformednew.pdf, to try out various dictionary manipulations. For example, the object should be (and currently is):

22 0 obj << /BitsPerComponent 8 /ColorSpace 23 0 R /Filter /DCTDecode /Height 1042 /Name /X /Subtype /Image /Type /XObject /Width 736 /Length 114577 >>

Changing the filter from, for example, /DCTDecode to /DXTDecode or some other fictitious name still results in a well-formed and valid file.

Could you tell me what exactly JHOVE is checking in a filter dictionary?

@samalloing
Collaborator

Hi @asciim0 ,

I took a quick look at how filters are handled in the PdfStream.java file. The structure is checked, but not the name of the filter itself (the decode parameters are also stored). If there is a complete list of possible filters, I think that check would be easy to add.

Sam
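
For illustration, a name check of the kind suggested above could look roughly like the sketch below. The ten names are the standard filters listed in ISO 32000; the class and method names are hypothetical and are not part of JHOVE.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not JHOVE code): checks a filter name against the standard list. */
final class StandardFilters {
    // The ten standard filter names defined in ISO 32000.
    private static final Set<String> NAMES = Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList(
            "ASCIIHexDecode", "ASCII85Decode", "LZWDecode", "FlateDecode",
            "RunLengthDecode", "CCITTFaxDecode", "JBIG2Decode", "DCTDecode",
            "JPXDecode", "Crypt")));

    private StandardFilters() {}

    /** Accepts the name with or without the leading slash of PDF name syntax. */
    static boolean isStandard(String name) {
        if (name == null) {
            return false;
        }
        String bare = name.startsWith("/") ? name.substring(1) : name;
        return NAMES.contains(bare);
    }
}
```

With this sketch, `StandardFilters.isStandard("/DCTDecode")` is true and `StandardFilters.isStandard("/DXTDecode")` is false. Inline images may use abbreviated filter names, so a check like this would only fit stream dictionaries.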

@asciim0
Contributor Author

asciim0 commented Nov 15, 2024

Could you please elaborate on what you mean by "the structure is checked"? What does that include? Mandatory keys for all filters? For some?

@samalloing
Collaborator

It is just a simple check of the PdfObject type, for example whether it is a PdfArray or a PdfSimpleObject.

@asciim0
Contributor Author

asciim0 commented Nov 15, 2024

I'm sorry, I still don't understand what that simple check means. Do you mean it just checks whether an array is a correct array? Could you perhaps give a plain-language translation of the code that checks filters?

@samalloing
Collaborator

Hi Micky,

Sure. The Java code translates the PDF entities into Java objects, so, for example, a PDF array becomes an array object. The code then enforces which types of PDF entity are allowed in each specific case. A Filter can be a PDF object or a PDF array, and it can also be an indirect reference. Indirect references were not implemented at first, so they gave an error (PDF-HUL-45) until support was added. What the current code does is test whether the filter is a PDF object, a PDF array or an indirect reference. If the PDF contains something else there, say a dictionary, there will be an error. It also checks that the array itself is correct.

Hope this makes it clear

Sam
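
As a rough sketch of the type check described above (one reading of the description, not JHOVE's actual code), the logic might look like this. All types below are hypothetical stand-ins for parsed PDF objects; JHOVE's own classes differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for parsed PDF objects. These are NOT JHOVE's real classes.
interface PdfValue {}

final class NameValue implements PdfValue {
    final String name;
    NameValue(String name) { this.name = name; }
}

final class IntValue implements PdfValue {
    final int value;
    IntValue(int value) { this.value = value; }
}

final class ArrayValue implements PdfValue {
    final List<PdfValue> items;
    ArrayValue(PdfValue... items) { this.items = Arrays.asList(items); }
}

final class RefValue implements PdfValue {
    final int objNum;
    final int genNum;
    RefValue(int objNum, int genNum) { this.objNum = objNum; this.genNum = genNum; }
}

final class DictValue implements PdfValue {}

final class FilterShapeCheck {
    private FilterShapeCheck() {}

    /** Returns true if the /Filter value has an allowed shape; filter names are not inspected. */
    static boolean hasAllowedShape(PdfValue filter) {
        if (filter == null) return true;               // key absent: /Filter is optional
        if (filter instanceof NameValue) return true;  // single filter, any name passes
        if (filter instanceof RefValue) return true;   // accepted as-is, target not resolved here
        if (filter instanceof ArrayValue) {
            for (PdfValue item : ((ArrayValue) filter).items) {
                if (!(item instanceof NameValue)) return false;
            }
            return true;                               // includes the empty array
        }
        return false;                                  // integer, dictionary, and so on
    }
}
```

Under this sketch a name such as /DXTDecode passes, because only the type of the value is tested, while an integer or a dictionary is rejected.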

@asciim0
Contributor Author

asciim0 commented Nov 19, 2024

Just to make sure we're talking about the same thing here:
What is being checked is the value of the Filter key, right? As per the spec (ISO 32000-2:2020, sect. 7.4, Table 5), that can be:

  • a name
  • an array of zero, one or several names (of filter(s))

Does that align with what is being checked? Or isn't it the value of the /Filter at all that is being checked?
What I'm trying to do is trigger the PDF-HUL-45 rule by manipulating a file containing a filter ... but I can change the value to pretty much whatever I want to (nothing, integer, indirect reference) and the file is still validated as well-formed and valid.

@samalloing
Collaborator

Sure! No, the value of the filter is not checked. What is checked is whether it is "an array of zero, one or several names (of filter(s))", or whether it is a PDF object or an indirect reference. But this is just the structure of the PDF. What I mean is that in your example a filter is at "17 0 obj", and that is the only thing that is checked. If you want to trigger PDF-HUL-45, I'll send you an example.

@asciim0
Contributor Author

asciim0 commented Nov 22, 2024

I took a look at the file that Sam shared with me. It seems that the error is triggered by the indirect reference. The error was thrown at the end of obj 183:
183 0 obj [/ASCII85Decode /LZWDecode]

obj 183 is referenced by obj 184:
184 0 obj <</Filter 183 0 R /Length 185 0 R>> stream

As far as I understand the spec, arrays (like all objects) can be represented by indirect objects and filter values can be names or arrays ... and therefore also indirect objects. The syntax of the array looks fine.
I therefore believe that there is still a possible case of a false positive for this error, as shown here.

I also still don't understand what the malformed filter then checks :-P
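
To make the expectation concrete, here is a minimal sketch of how a /Filter value given as an indirect reference could be resolved before its type is tested. The types and the object table are hypothetical and do not reflect JHOVE's implementation. Rejecting dangling or cyclic references is an assumption of this sketch; a real validator might instead treat them as an absent key.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for parsed PDF objects. These are NOT JHOVE's real classes.
interface Node {}

final class NameNode implements Node {
    final String name;
    NameNode(String name) { this.name = name; }
}

final class ArrayNode implements Node {
    final List<Node> items;
    ArrayNode(Node... items) { this.items = Arrays.asList(items); }
}

final class RefNode implements Node {
    final int objNum;
    RefNode(int objNum) { this.objNum = objNum; }
}

final class FilterResolver {
    private FilterResolver() {}

    /** Follows indirect references; returns null for a missing target or a reference cycle. */
    static Node resolve(Node value, Map<Integer, Node> objects) {
        Set<Integer> seen = new HashSet<>();
        Node current = value;
        while (current instanceof RefNode) {
            int num = ((RefNode) current).objNum;
            if (!seen.add(num)) return null;   // cycle detected
            current = objects.get(num);        // null if the object is not defined
        }
        return current;
    }

    /** True if the value, after resolution, is a name or an array whose items resolve to names. */
    static boolean isNameOrNameArray(Node value, Map<Integer, Node> objects) {
        Node resolved = resolve(value, objects);
        if (resolved instanceof NameNode) return true;
        if (resolved instanceof ArrayNode) {
            for (Node item : ((ArrayNode) resolved).items) {
                if (!(resolve(item, objects) instanceof NameNode)) return false;
            }
            return true;
        }
        return false;                          // unresolved reference or any other type
    }
}
```

With object 183 registered as the array [/ASCII85Decode /LZWDecode], calling `isNameOrNameArray(new RefNode(183), objects)` returns true, which matches the reading of the spec given above.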

carlwilson self-assigned this Dec 5, 2024
carlwilson added this to the JHOVE 1.34 milestone Dec 5, 2024