-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent bigger files from being checked in #11508
Conversation
the 1 MB file size limit is arbitrary, but in line with gihub's suggestions i experimented with this check in the following test PRs
The error reporting isn't ideal. the offending file name is NOT printed out to console. It's only visible on the Summary page of the workflow. Hopefully this can be improved in the check itself (offered a patch here: james-callahan/gha-prevent-big-files#3), or the code inlined. I did that, but then using existing action looks so more aesthetically... |
We couldn't use third-party github actions, see #11259 (comment). |
@lewiszlw thanks for pointing this out. i will inline the action code in the check |
220e658
to
a7667fa
Compare
@lewiszlw inlined the action. thanks again for pointing this out. This is how the example error looks like |
.github/workflows/repo_check.yml
Outdated
steps: | ||
- uses: actions/checkout@v4 | ||
with: | ||
fetch-depth: 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it mean to fetch all commits in PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means non-shallow clone.
this is necessary for the references ${{ github.event.pull_request.base.sha }}
and ${{ github.event.pull_request.head.sha }}
to work
.github/workflows/repo_check.yml
Outdated
- name: Check size of new Git objects | ||
env: | ||
# 1 MB ought to be enough for anybody. | ||
# TODO in case we may want to consciously commit a bigger file to the repo without using Git LFS we may disable the check e.g. with a label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if this is a separate workflow then committers can enable/disable the workflow check on the project level https://github.com/apache/datafusion/actions/workflows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't know what branch protections there are configured for this repo. Committers may be also ignore such error. Ignoring workflow erros is definitely a bad habit, but I don't think we risk creating any habit. Unless things work incorrectly, the false positive should be very very rare.
.github/workflows/repo_check.yml
Outdated
if [ "${size}" -gt "${MAX_FILE_SIZE_BYTES}" ]; then | ||
exit_code=1 | ||
echo "Object ${id} [${path}] has size ${size}, exceeding ${MAX_FILE_SIZE_BYTES} limit." >&2 | ||
echo "::error file=${path}::File ${path} has size ${size}, exceeding ${MAX_FILE_SIZE_BYTES} limit." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to duplicate file path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first file=${path}
is necessary to create annotation on the file itself. I suppose github UI would display it if the offending file is a text file. This doesn't show up on the build output log page.
The second File ${path}
is necessary for the error to be easy to understand when looking at the build output log page.
See #11508 (comment) for visualization.
.github/workflows/repo_check.yml
Outdated
# Skip objects which are not files (commits, trees) | ||
if [ ! -z "${path}" ]; then | ||
size="$(git cat-file -s "${id}")" | ||
if [ "${size}" -gt "${MAX_FILE_SIZE_BYTES}" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nit: should we do an early return exit=1 if we find the large file? or we wanna to output all of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the intention was to output all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @findepi it looks promising imho
a7667fa
to
f25d8bd
Compare
Thanks @comphead for your review! |
Thanks @findepi
|
correct.
good question
Keeping failure in "deletion is a separate commit" case is right if the PR will be merged or rebase-merged, but unnecessary if the PR will be squash-merged. If the PRs are always squash-merged, we can improve the script to take this into account. But it can be view as overengineering, on the optimistic assumption that this will fail rarely anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm thanks @findepi I think this PR is good
@alamb there is a small side effect described above which may give a false positive and basically may require PR recreate, which doesn't seem a big problem for me but for large complex PRs the user might be distressed to recreate it after hard rebase. But this is pretty rare case
or squashing commits (e.g. via git rebase) |
I agree with @findepi that there is a fairly easy workaround that doesn't need to recreate the PR (it would require rebasing it). I think we should try this and we can always adjust as neessary if we find something isn't ideal Thank you very much @findepi and @comphead -- I think this is a nice improvement |
See #10422 for rationale.