OSSFuzz Integration #949

Merged: 21 commits, Jun 28, 2024

capuanob
Contributor

@capuanob capuanob commented Mar 10, 2024

Pull request

This pull request contains the changes needed to integrate pdfminer.six with OSS-Fuzz for continuous fuzzing, as discussed in Issue 918.

In short, this PR adds atheris (the fuzzing framework) as a development dependency, and a new fuzzing directory containing a corpus, some initial harnesses, the necessary CI file to integrate the project into OSS-Fuzz, and a build script to be used by ClusterFuzz to prepare for nightly fuzzing.

In addition, it fixes two simple bugs that caused crashes too early in fuzzing and prevented further progress.

How Has This Been Tested?

The fuzzing harnesses are tests in and of themselves, so they were tested by running them and analyzing their coverage.

NOTE: The CIFuzz.yml job will fail until Google merges the necessary pdfminer Dockerfile into their OSS-Fuzz repository. This can only be done after this PR is merged.

Checklist

  • I have read CONTRIBUTING.md.
  • I have added a concise human-readable description of the change to CHANGELOG.md.
  • I have tested that this fix is effective or that this feature works.
  • I have added docstrings to newly created methods and classes.
  • I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

@capuanob
Contributor Author

@pietermarsman Hey Pieter, pinging for visibility. Looking forward to getting this integrated and uncovering bugs!

@capuanob
Contributor Author

@pietermarsman Following up on this, I spent a good amount of time on this and would love to see it integrated!

@capuanob
Contributor Author

capuanob commented Jun 10, 2024

@goulu @jstockwin @pudo @tataganesh @pietermarsman I hope all is well. I would really appreciate a review of this PR.

@pietermarsman
Member

Hi @capuanob,

Thanks for your time on this. This repo is maintained at a very slow pace. But it is maintained, so your work won't go to waste.

I haven't run the code yet, will do so later today. But it looks good. Great that you were already able to find and fix some vulnerabilities.

I have three initial comments:

  • It looks like you have copied the testing PDFs. Are these even used? Can you use their equivalents from the samples directory instead?
  • Is it common to have fuzzing as a top-level directory? I guess it has the same status as the tests and tools, so it seems like the right place. But I'm always reluctant to add top-level files and directories.
  • Is it also possible and useful to run the tests locally? In that case we can/should perhaps add the commands to the noxfile.py.

@capuanob
Contributor Author

capuanob commented Jun 24, 2024 via email

@pietermarsman
Member

pietermarsman commented Jun 24, 2024

> I could update this to copy from the samples directory at run-time, rather than hosting them twice

Yes, that is preferable. Maybe use a glob to specify them.

> For the projects that I have done, I've either had fuzzing be a
> top-level directory or a sub-directory of testing. I can adjust to
> whichever you prefer.

I guess I prefer to have it as a top-level directory, since the tests directory is really the "pytest directory". So no change is required here.

> While it is possible, I would discourage having it be a local test for
> a few reasons. Since fuzzing isn't entirely deterministic, you won't get
> the same kind of consistency as you would from unit tests. Secondly, there
> isn't a defined end-point for fuzzing so it'd be difficult to set an
> arbitrary 'timeout' and be able to definitively say that something has been
> sufficiently tested.

Ok, good to know.

I've run out of time for today, so will continue on this next Monday. I noticed that my understanding of fuzzing is very minimal. Do you have any good resources that I can use to improve my understanding?

Edit:
I've found this tutorial which gave me a good understanding. I do not yet fully understand how the fuzzer can efficiently mutate the corpus PDF files to generate new valid ones, though. But perhaps it just tries a gazillion times.

@capuanob
Contributor Author

@pietermarsman I've removed the corpus files from this commit.

Now, instead, the Dockerfile that I will submit to Google's OSS-Fuzz project after this PR is integrated (you can see this Dockerfile here if you are interested) will glob the simple*.pdf files into a corpus.
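
A minimal sketch of the globbing step, written in Python for illustration. The actual Dockerfile is not shown here and may do this differently; `build_seed_corpus`, its parameters, and the output file name are assumptions. OSS-Fuzz conventionally picks up seed inputs from a `<fuzzer_name>_seed_corpus.zip` archive placed next to the fuzzer, which is the shape this sketch produces.

```python
import glob
import os
import zipfile
from typing import List


def build_seed_corpus(
    samples_dir: str, out_zip: str, pattern: str = "simple*.pdf"
) -> List[str]:
    """Zip every file in samples_dir that matches pattern.

    Returns the archived file names in sorted order. The function name and
    signature are hypothetical; they are not part of pdfminer.six or OSS-Fuzz.
    """
    matches = sorted(glob.glob(os.path.join(samples_dir, pattern)))
    names: List[str] = []
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in matches:
            name = os.path.basename(path)
            # Store files flat, without the samples directory prefix.
            archive.write(path, arcname=name)
            names.append(name)
    return names
```

For example, `build_seed_corpus("samples", "extract_text_fuzzer_seed_corpus.zip")` would archive only the `simple*.pdf` samples, so the files are no longer hosted twice in the repository.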

You've got the right idea on how it makes mutations to the input, since it typically requires gazillions of mutations. To give a bit more context, the atheris Python library will "instrument" the library with extra instructions in each code block (i.e., any branch) at runtime. The fuzzing framework strives to achieve depth and breadth in its coverage and will analyze its current code block boundaries to determine what kind of intelligent changes could be made to get deeper into the parsing code.

With time, the fuzzing corpus evolves to be more and more robust. ClusterFuzz also provides great insights on current fuzz-blockers that can be overcome with future PRs to improve the fuzzers.
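
The feedback loop described above can be sketched with a toy example. This is not how atheris or libFuzzer are implemented: `toy_parser`, `mutate`, and `fuzz` are hypothetical names, coverage is recorded by hand rather than by instrumentation, and the mutation strategy is deliberately naive. The point is only that keeping inputs that reach new code lets the corpus advance one branch at a time.

```python
import random
from typing import List, Set, Tuple


def toy_parser(data: bytes, hits: Set[str]) -> None:
    """Hypothetical target: each nested check records a coverage point."""
    hits.add("entry")
    if data[:1] == b"%":
        hits.add("percent")
        if data[1:2] == b"P":
            hits.add("P")
            if data[2:3] == b"D":
                hits.add("D")
                if data[3:4] == b"F":
                    hits.add("F")


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one random position with a random byte."""
    if not data:
        return bytes([rng.randrange(256)])
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(
    seed: bytes, iterations: int, rng: random.Random
) -> Tuple[List[bytes], Set[str]]:
    """Keep every mutated input that reaches coverage not seen before."""
    corpus = [seed]
    seen: Set[str] = set()
    toy_parser(seed, seen)
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        hits: Set[str] = set()
        toy_parser(candidate, hits)
        if not hits <= seen:  # new coverage: this input is worth keeping
            seen |= hits
            corpus.append(candidate)
    return corpus, seen
```

Starting from a seed such as `b"AAAA"`, each kept input gets one byte closer to the innermost branch, whereas blind random generation would have to guess all four bytes at once.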

Another thing I'd be interested in exploring in the future is grammar-based fuzzing. I've never implemented one myself, but am aware of the technique. You can use a grammar (say, the grammar for a PDF) to guide smarter mutations.
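
As a rough illustration of the idea, here is a minimal grammar-based generator. The grammar below is a hypothetical toy that only loosely resembles PDF object syntax; it is not the real PDF grammar, and `GRAMMAR` and `generate` are names invented for this sketch.

```python
import random
from typing import Dict, List

# Hypothetical toy grammar. The first production of every rule leads to
# terminals, which lets the generator force termination at a depth limit.
GRAMMAR: Dict[str, List[List[str]]] = {
    "<object>": [["<number>"], ["<name>"], ["<array>"], ["<dict>"]],
    "<array>": [["[", "<items>", "]"]],
    "<items>": [["<object>"], ["<object>", " ", "<items>"]],
    "<dict>": [["<<", "<name>", " ", "<object>", ">>"]],
    "<name>": [["/Type"], ["/Root"], ["/Kids"]],
    "<number>": [["0"], ["1"], ["42"]],
}


def generate(
    symbol: str, rng: random.Random, depth: int = 0, max_depth: int = 6
) -> str:
    """Expand symbol into a string by picking random productions."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        options = options[:1]  # past the limit, take the terminating choice
    production = rng.choice(options)
    return "".join(
        generate(part, rng, depth + 1, max_depth) for part in production
    )
```

Every string produced this way has balanced brackets and well-formed names, so a parser spends its time on deeper logic instead of rejecting malformed syntax early.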

Happy to provide more context if desired!

@capuanob
Contributor Author

@pietermarsman I just saw your question about resources on fuzz-testing. The LibFuzzer docs are great. I'd also suggest:

https://google.github.io/oss-fuzz/ (More details on this specific program and ClusterFuzz - which you will get access to)

The researchers at Trail of Bits have a good guide on fuzzing as well. https://appsec.guide/docs/fuzzing/python/

The Python section isn't fully built out, but atheris also uses LibFuzzer (which is commonly used for C/C++ fuzzing).

Member

@pietermarsman pietermarsman left a comment

Top! It is looking great!

Code is running smoothly on my machine and I've already found a couple more bugs. So I'm curious about the results from ClusterFuzz.

I've scattered the MR with a bunch of micro-management comments to get it into the same shape as the rest of the code base. Most important is adding the fuzzing directory to the noxfile.py testing dirs so that all the code quality checks run on it.

I can do a couple of small commits to help if you're ok with that.

Resolved review threads (outdated):

  • CHANGELOG.md
  • fuzzing/extract_text_fuzzer.py (5 threads)
  • pdfminer/pdfdocument.py

@capuanob
Contributor Author

capuanob commented Jun 27, 2024 via email

@pietermarsman
Member

pietermarsman commented Jun 27, 2024

I've fixed all my own comments and think this is now ready.

One big change I made was to subclass all exceptions that pdfminer raises from PSException, so that the exception handling is now a bit easier. Encapsulating this is also good for the package.

If you confirm that the current code still works with ClusterFuzz I'll merge it.

@capuanob
Contributor Author

capuanob commented Jun 28, 2024 via email

@capuanob
Contributor Author

@pietermarsman Can confirm that all builds succeed with ClusterFuzz

@pietermarsman pietermarsman enabled auto-merge June 28, 2024 15:24
@pietermarsman pietermarsman disabled auto-merge June 28, 2024 15:24
@pietermarsman pietermarsman added this pull request to the merge queue Jun 28, 2024
Merged via the queue into pdfminer:master with commit ff359dc Jun 28, 2024
9 of 11 checks passed
@pietermarsman
Member

Done. Thanks for everything! 👏

jonathanmetzman pushed a commit to google/oss-fuzz that referenced this pull request Jul 1, 2024
This pull request integrates the Dockerfile needed to build the fuzzers
for pdfminer.six, as merged into upstream in this
[PR](pdfminer/pdfminer.six#949).
@pietermarsman
Member

pietermarsman commented Jul 3, 2024

Hi @capuanob ,

I've checked out OSS-Fuzz and monorail but could not find any useful output yet. The build seems to succeed. But there are no issues opened yet. And the coverage suggests that nothing gets past the is_valid_byte_stream check. Maybe the corpus is not loaded correctly.

@capuanob
Contributor Author

capuanob commented Jul 5, 2024 via email

@pietermarsman
Member

@capuanob

Take your time. Good luck with the move!

Some thoughts:

  1. Not sure what changed, but something seems to be working now. There are 3 issues now.
  2. But the coverage is still very low.
  3. I noticed here the corpus is copied to $SRC/corpus but the working dir is $SRC/pdfminer.six. I'm not sure if that is correct. Seems like the corpus is outside the workdir.
  4. I noticed there are integration rewards. Are you applying for those?
  5. I noticed there is the option to file GitHub issues as well. Can you set that up for pdfminer.six?

@capuanob
Contributor Author

capuanob commented Jul 7, 2024 via email

@capuanob
Contributor Author

capuanob commented Jul 8, 2024

@pietermarsman Put in a PR with Google to add GitHub issues, feel free to follow here

@pietermarsman
Member

Thanks for integrating the GitHub issues. And 🤞 your latest PR fixes the coverage. As for the integration rewards, am I eligible as well? A reward for working on OSS, that almost seems too good to be true 😃

@capuanob
Contributor Author

capuanob commented Jul 8, 2024 via email

DavidKorczynski pushed a commit to google/oss-fuzz that referenced this pull request Jul 8, 2024
This feature was requested by the project maintainer, as seen
[here](pdfminer/pdfminer.six#949 (comment))