PDF-hul: various issues with parsing PDFs #927
Possibly related - this blog post describes an issue where JHOVE's parsing goes wrong for some PDFs with mixed Unicode/octal encoded text strings (post links to example file).
I'd like to point out that the blog post mentioned is incorrect:
I would also very strongly suggest NOT referring to such an old PDF 1.4 specification! Please use both the ISO 32000 specifications, and especially check for any clarified vendor-neutral wording in ISO 32000-2:2020.
@petervwyatt Yep, I already suspected this (but it's good to have this confirmed). Which actually makes JHOVE's behaviour here even worse, because its inability to parse a perfectly valid file indirectly leads to the reporting of a completely unrelated validation error.
We are the writers of the referred blog post (a ghost story, for Halloween). We will update it a little, based on this discussion, with a changelog. First of all, we are grateful for the discussion about our blog post; we'd like to clarify our views on the matter below.

From the point of view of digital preservation, we still wouldn't recommend having multiple (unnecessary) encoding layers, such as UTF-16BE wrapped in octal encoding; it may not be a sensible practice. In the long term, each encoding layer raises the risk of causing problems in the future. JHOVE is probably not the only software to get confused by multilayered encoding of metadata. In future migrations, these kinds of issues will need to be identified and somehow handled with whatever software support is available then, unless we handle them today when we encounter them. This is actually the main point of our blog post. Should JHOVE give an info message about a string starting with "\376\377", which most likely has a multilayered encoding, instead of just skipping it?

About the PDF Reference: we'll update the quote in the blog to the current revision (ISO 32000-2:2020, ch. 7.9.2.2.1). The paragraph still describes 254 and 255 as the first two bytes of a text string, so in our case there is not really much of a difference from the previous wording, although we admit that it does not specifically forbid re-encoding into a multilayered encoding.

ISO 32000-2:2020 is unclear regarding octal codes. In ch. 7.3.4.2, "\ddd" is described as "character code ddd". This may be understood to mean that a "character code" should resolve to a character when decoded (from some code page), which can be confusing in combination with UTF-16BE, whose characters are 2-4 bytes. Instead of the term "character code", we would use e.g. "code of a character byte" (i.e. it may also be only part of a character) or, more broadly, "(octal) code of a byte". On the other hand, ISO 32000-2:2020 also states in ch. 7.3.4.2 that any 8-bit value can be given either as a byte or with the octal "notation described". We feel that a user of the standard can get confused and has to interpret whether "notation described" refers to the coding of a character or of a byte (for each \ddd). We have certainly learned more about the PDF file format thanks to the discussion here.
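To make the multilayered encoding concrete, here is a small sketch (not JHOVE's actual code; the function name is mine, and the input string is reconstructed from this thread's example) that resolves the \ddd octal escapes of a PDF literal string into raw bytes and then applies the UTF-16BE byte-order marker. Other literal-string escapes such as \n or \( are omitted for brevity.

```python
OCTAL = "01234567"

def decode_pdf_literal(raw: str) -> bytes:
    """Resolve \\ddd octal escapes (1-3 octal digits, per ISO 32000-2, 7.3.4.2)."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw) and raw[i + 1] in OCTAL:
            j = i + 1
            while j < min(i + 4, len(raw)) and raw[j] in OCTAL:
                j += 1
            out.append(int(raw[i + 1 : j], 8) & 0xFF)  # high-order overflow ignored
            i = j
        else:
            out.append(ord(raw[i]) & 0xFF)
            i += 1
    return bytes(out)

# Producer value as discussed in the thread (BOM + "PDF Phantom" + trailing NUL):
raw = r"\376\377\000P\000D\000F\000 \000P\000h\000a\000n\000t\000o\000m\000\000"
data = decode_pdf_literal(raw)
assert data[:2] == b"\xfe\xff"        # UTF-16BE byte-order marker
text = data[2:].decode("utf-16-be")
print(repr(text))                      # 'PDF Phantom\x00', trailing NUL kept
```

The octal layer and the UTF-16BE layer are independent: the escapes only restore bytes, and only then does the BOM say how those bytes form characters.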
Out of curiosity I did a little test using some of my favourite PDF mangling tools and libraries. First I created a modified version of the PDF in a hex editor, where I changed the value of the XMP Producer field to "OPF Phantom". This way we can easily see which field(s) each tool actually reports. Below are the commands and results for all tools/libraries.

ExifTool

Result:

<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about='phantom_modified_xmp.pdf'
xmlns:et='http://ns.exiftool.org/1.0/' et:toolkit='Image::ExifTool 12.60'
xmlns:ExifTool='http://ns.exiftool.org/ExifTool/1.0/'
xmlns:System='http://ns.exiftool.org/File/System/1.0/'
xmlns:File='http://ns.exiftool.org/File/1.0/'
xmlns:PDF='http://ns.exiftool.org/PDF/PDF/1.0/'
xmlns:XMP-x='http://ns.exiftool.org/XMP/XMP-x/1.0/'
xmlns:XMP-pdf='http://ns.exiftool.org/XMP/XMP-pdf/1.0/'>
<ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
<System:FileName>phantom_modified_xmp.pdf</System:FileName>
<System:Directory>.</System:Directory>
<System:FileSize>5.9 kB</System:FileSize>
<System:FileModifyDate>2024:11:09 00:22:05+00:00</System:FileModifyDate>
<System:FileAccessDate>2024:11:09 00:22:46+00:00</System:FileAccessDate>
<System:FileInodeChangeDate>2024:11:09 00:22:05+00:00</System:FileInodeChangeDate>
<System:FilePermissions>-rw-rw-r--</System:FilePermissions>
<File:FileType>PDF</File:FileType>
<File:FileTypeExtension>pdf</File:FileTypeExtension>
<File:MIMEType>application/pdf</File:MIMEType>
<PDF:PDFVersion>1.4</PDF:PDFVersion>
<PDF:Linearized>No</PDF:Linearized>
<PDF:PageCount>1</PDF:PageCount>
<PDF:Title>Boo</PDF:Title>
<PDF:CreateDate>2024:10:29 13:43:30Z</PDF:CreateDate>
<PDF:Producer>PDF Phantom</PDF:Producer>
<XMP-x:XMPToolkit>Image::ExifTool 12.71</XMP-x:XMPToolkit>
<XMP-pdf:Producer>OPF Phantom</XMP-pdf:Producer>
</rdf:Description>
</rdf:RDF>

ExifTool correctly decodes the octal escape sequences (PDF:Producer), and also extracts the XMP value (XMP-pdf:Producer).

Pdfcpu
Result:
Pdfcpu correctly decodes the octal escape sequences (PDF Producer).

pdfinfo (Poppler)
Result:
Poppler correctly decodes the octal escape sequences (Producer).

VeraPDF
Result includes:

<informationDict>
<entry key="Title">Boo</entry>
<entry key="Producer">PDF Phantom#x000000</entry>
<entry key="CreationDate">2024-10-29T13:43:30.000Z</entry>
</informationDict>

VeraPDF does decode the octal escape sequences, but shows a null character at the end (edit: this is actually part of the string!).

Apache Tika
Result:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.4"/>
<meta name="pdf:docinfo:title" content="Boo"/>
<meta name="pdf:hasXFA" content="false"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2024-10-29T13:43:30Z"/>
<meta name="dc:format" content="application/pdf; version=1.4"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:hasCollection" content="false"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="Boo"/>
<meta name="Content-Length" content="5906"/>
<meta name="pdf:hasMarkedContent" content="false"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="pdf:producer" content="OPF Phantom"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="resourceName" content="phantom_modified_xmp.pdf"/>
<meta name="pdf:hasXMP" content="true"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="PDF Phantom�"/>
<meta name="pdf:docinfo:created" content="2024-10-29T13:43:30Z"/>
<title>Boo</title>
</head>
<body><div class="page"><p/>
</div>
</body></html>

Tika reports both strings, but like VeraPDF shows the null character at the end of the octal escape sequences (pdf:docinfo:producer).

Qpdf
Output contains:

"9 0 R": {
"/CreationDate": "D:20241029134330Z00'00'",
"/Producer": "PDF Phantom\u0000",
"/Title": "Boo"
}

Qpdf reports the octal escape sequences, but like VeraPDF shows a null character at the end.

Pdftk
Result:
Pdftk correctly decodes the octal escape sequences.

PyMuPDF

Using this simple test script:

import pprint
import pymupdf
myPDF = "phantom_modified_xmp.pdf"
doc = pymupdf.open(myPDF)
metadata = doc.metadata
pprint.pp(metadata)

Result:
PyMuPDF does decode the octal escape sequences correctly.

Conclusion

All the above tools and libraries are able to decode the octal escape sequences. VeraPDF, Tika and Qpdf show a null character at the end of the producer string, but this character is also part of the source. So JHOVE's behaviour really seems to be the exception here.
Let's clarify a few things first:
In the PDF file in question, the conventional PDF DocInfo Producer key is formally specified as a "text string", so it might be Unicode if the correct BOM bytes are present, as they are here in the form of the octal pair indicating UTF-16BE. This same string could have been a hex string too.

The technically correct string that is stored is "PDF Phantom<NUL>"; however, a lot of software will swallow the explicit <NUL>, mostly because of the way programming languages store their strings (e.g. C/C++), or when the string is passed to the operating system, since many O/S output systems are UTF-8 and trim to printables only. In this case a human might assess that the final <NUL> byte has no value, but it may have been intentional on the part of the producing application (perhaps it indicates a version or is some other form of proprietary data; we don't know for sure). When this data is transcoded to UTF-8 to save into the XMP Metadata stream, the technically correct solution is to ensure that this trailing <NUL> is again included.

There is also nothing in the core PDF spec that states the conventional DocInfo dictionary and XMP Metadata stream values have to be identical, or that limits the information in any way. It makes very good sense, but it is not mandated. Time-honoured convention, based on limitations in the UI of viewers, means that very long, multi-line, or other advanced uses of Unicode that are technically permitted in PDF Unicode strings should not be used.

A more aggressive test would be to put <NUL> or other non-printables mid-string in the PDF DocInfo Producer key and see what happens. Does the data get truncated at the first NUL or other non-printable? Is the output from the tools mangled? A typical example is that PDF Unicode text strings can include BCP-47 2-character language escape sequences; some tools display these, some tools don't (and if you use a screen reader or other assistive technologies, these might be very important to you!).
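The NUL-swallowing behaviour described here is easy to illustrate. This sketch (not taken from any of the tools tested above) contrasts a byte-faithful reading of the stored value with an emulated C-style consumer that treats NUL as a string terminator:

```python
# The value as stored in the PDF: "PDF Phantom" plus an explicit trailing NUL.
stored = b"PDF Phantom\x00"

# A byte-faithful reader keeps all 12 bytes, NUL included:
assert len(stored) == 12
assert stored.decode("latin-1") == "PDF Phantom\x00"

# Emulate C's strlen()/printf("%s"): everything from the first NUL is lost.
c_view = stored.split(b"\x00", 1)[0]
assert c_view == b"PDF Phantom"

# A NUL mid-string, as in the more aggressive test suggested above,
# loses even more for such a consumer:
tricky = b"PDF\x00Phantom"
assert tricky.split(b"\x00", 1)[0] == b"PDF"
```

This is why several tools in the comparison above print a visible null (or a replacement character) while others silently drop it: the difference is in the consumer, not in the stored data.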
@petervwyatt Thanks for the additional clarifications, I just updated my last comment to clear up (hopefully!) the confusing terminology.
FYI a short blog post based on this: |
We're happy that our blog post has led to a good discussion and experiments. We thank you for your effort! We modified our blog post and also linked "Escape from the Phantom of the PDF" in it. Related to this, we also reported a small request to PDF Specification Issues.
@jmlehton Thank you! I have proposed a solution to your issue, but we won't meet again until the new year to get it formally agreed. |
Some issues noted about parsing PDFs:

- `{` and `}` are not PDF delimiter tokens except within Type 4 PostScript functions (i.e. they are PS delimiters only), so using them elsewhere is incorrect. This was a long-standing error in PDF specifications.
- The PDF-hul header check is for `%PDF-1`, but the spec says it is `%PDF-` followed by any digit (0-9), `.`, and another digit, so PDF 2.0 files should report as a PDF file, but with an unsupported PDF version, until such time as you support PDF 2.0. JHOVE currently reports PDF 2.0 files as a bytestream, which is incorrect. See here.
- PDF-hul crashes if a PDF hex-string contains EOL characters. This is permitted by the PDF spec, as whitespace can occur in hex-strings and EOLs are considered whitespace. (For what it is worth, hex-strings and literal strings are the only 2 types of PDF tokens or keywords that can span multiple lines.)
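For the hex-string point, a spec-conformant reader simply ignores whitespace (including EOLs) between the angle brackets. A minimal sketch, assuming ISO 32000-2, 7.3.4.3 semantics (whitespace ignored, an odd final digit padded with zero); the function name is mine:

```python
# PDF whitespace characters: NUL, tab, LF, FF, CR, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "

def parse_hex_string(src: bytes) -> bytes:
    """Read a PDF hex string <...>, skipping any embedded whitespace."""
    assert src.startswith(b"<") and src.endswith(b">")
    digits = bytes(c for c in src[1:-1] if c not in PDF_WHITESPACE)
    if len(digits) % 2:
        digits += b"0"  # odd digit count: last digit is padded with 0
    return bytes.fromhex(digits.decode("ascii"))

# EOLs inside the string must not crash the parser:
assert parse_hex_string(b"<48656C\r\n6C6F>") == b"Hello"
assert parse_hex_string(b"<7>") == b"\x70"
```

A parser that tokenizes hex strings line by line, rather than scanning to the closing `>`, is exactly the kind of implementation that fails on such input.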
- There seem to be assumptions with PDF-hul-xx error codes that a key with an explicit null value is invalid, whereas the PDF spec states that such keys should be ignored (same as not present). An easy test is to set `/Annots null` on any page and compare behaviour to not having an `/Annots` entry present.
- A Java exception gets thrown if cross-reference sub-section marker lines (of 2 integers) start with a negative number (i.e. for the object number).
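A hypothetical minimal page object for the `/Annots null` test might look like the following (object numbers and other entries are illustrative; per the PDF spec, a dictionary entry whose value is null shall be treated as if the key were absent):

```
3 0 obj
<< /Type /Page
   /Parent 2 0 R
   /MediaBox [0 0 612 792]
   /Annots null          % should behave exactly like no /Annots entry
>>
endobj
```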
- FileSpecification.java does not account for the `UF` entry added with PDF 1.7. This was noticed from a code review.
- There is something strange going on when encountering empty names (i.e. just a `/` followed by nothing, which is a valid PDF name). PDump correctly lists one as a Name object with the empty string `""`, but if 2 empty names are appended to a trailer dictionary (i.e. a valid key/value dictionary entry) then JHOVE doesn't work properly...
- Please consider adding support for UTF-8 text strings introduced with PDF 2.0. This was noted from a code review. Also note that UTF-8 strings do occur in some pre-PDF 2.0 files...
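The UTF-8 text-string request fits naturally into BOM-based dispatch. A sketch of text-string decoding along the lines of ISO 32000-2, 7.9.2.2; note that the fallback here uses Latin-1 as a shorthand, whereas real PDFDocEncoding differs from Latin-1 in several code points, so a full implementation needs a proper mapping table:

```python
def decode_text_string(data: bytes) -> str:
    """Decode a PDF text string by its byte-order marker."""
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")   # UTF-16BE (PDF 1.x and 2.0)
    if data.startswith(b"\xef\xbb\xbf"):
        return data[3:].decode("utf-8")       # UTF-8 BOM, new in PDF 2.0
    # No BOM: PDFDocEncoding. Latin-1 is only an approximation here.
    return data.decode("latin-1")

assert decode_text_string(b"\xfe\xff\x00B\x00o\x00o") == "Boo"
assert decode_text_string(b"\xef\xbb\xbfBoo") == "Boo"
assert decode_text_string(b"Boo") == "Boo"
```

Supporting the UTF-8 branch costs one extra prefix check, which is why UTF-8 strings in pre-PDF 2.0 files can also be handled by the same code path.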