Two crashes when parsing PDFs #26

Open
TACIXAT opened this issue Jul 7, 2020 · 2 comments
Labels
good first issue Good for newcomers

TACIXAT commented Jul 7, 2020

Files from Govdocs that trigger the crashes:

000899.pdf
001940.pdf

Parsing PDF obj 62 0
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
    for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
    yield from submatch_iter
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
    yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 38, in _emit_dict
    value_end = value[-1].offset.offset + len(value[-1].token)
IndexError: list index out of range

Parsing PDF obj 424 0
Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
    for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
    yield from submatch_iter
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
    yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in _emit_dict
    ''.join(v.token for v in value),
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in <genexpr>
    ''.join(v.token for v in value),
AttributeError: 'str' object has no attribute 'token'
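
For anyone picking this up: both tracebacks end in `_emit_dict`, and they suggest two value shapes it does not expect. The first looks like an empty value list (`value[-1]` raises `IndexError`), the second like a plain `str` where a token object was expected. Below is a minimal sketch of a defensive helper, not PolyFile's actual code. `Token`, `Offset`, `value_text_and_end`, and `fallback_end` are hypothetical names that only model the attributes visible in the tracebacks (`.token` and `.offset.offset`).

```python
from collections import namedtuple

# Hypothetical stand-ins for the parser's token objects; only the two
# attributes the tracebacks touch (.token and .offset.offset) are modeled.
Offset = namedtuple("Offset", ["offset"])
Token = namedtuple("Token", ["token", "offset"])


def value_text_and_end(value, fallback_end):
    """Return (text, end_offset) for a dictionary value without raising.

    `value` may be a list of tokens, an empty list, a plain str, or a list
    that mixes tokens and strs. `fallback_end` is used whenever no end
    offset can be derived from the last element.
    """
    if isinstance(value, str):
        # A bare string has neither .token nor .offset.
        return value, fallback_end
    if not value:
        # Case from the first traceback: value[-1] on an empty list.
        return "", fallback_end
    # Case from the second traceback: tolerate str elements in the list.
    text = "".join(v if isinstance(v, str) else v.token for v in value)
    last = value[-1]
    if isinstance(last, str):
        return text, fallback_end
    return text, last.offset.offset + len(last.token)
```

Whether falling back to a caller-supplied offset is the right behavior depends on how the emitted matches are consumed, so treat this as a starting point for a fix rather than a patch.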

nealmcb commented Feb 2, 2021

I also hit the latter bug on the first file I tried (a large one that I think was generated from a docx file), under Python 3.8: AttributeError: 'str' object has no attribute 'token'

$ polyfile --version
PolyFile version 0.1.7

@ESultanik
Collaborator

This was likely fixed by switching PDF parser implementations. We will confirm and close if that is the case.

ESultanik added the "good first issue" label Apr 15, 2022