-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ci][fix] Fix cuda_exp ci #5438
Conversation
See #5425 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for noticing and proposing a fix for this! But I think there's a much simpler fix that will achieve the same goal.
I think doing the following would prevent adding new complexity to .ci/test.sh
:
- replace
treelearner:
in.github/workflows/cuda.yml
withtask:
- remove this line:
LightGBM/.github/workflows/cuda.yml
Line 14 in 702db13
task: cuda - change the following line to
TASK="${{ matrix.task }}"
:LightGBM/.github/workflows/cuda.yml
Line 89 in 702db13
TASK=${{ env.task }}
I remember we had a discussion in #4630 about not adding more complexity to test.sh
: #4630 (comment).
In addition to that request...right now, both cuda_exp
CI jobs are failing with the following error
Fatal Python error: Aborted
We should never merge a PR that will result in CI being broken on master
, so please do one of these:
- push a fix for that issue to this PR
- remove the
cuda_exp
CI jobs in this PR, document the work to add them back in a separate issue, merge this PR, and then push a follow-up PR fixingcuda_exp
and adding back the CI jobs
I hope a fix can just be pushed here, but I don't know how complicated it will be to debug this. If it is very complicated, then we might as well eliminate those two cuda_exp
jobs that are not actually testing cuda_exp
, to save CI time.
And if the "make an issue and fix this later" approach is taken, then I think we should not merge any more cuda_exp
PRs until those CI jobs have been re-enabled.
I've added the label As I mentioned in #5403 (comment) and #5413 (comment), please add a labels when you create pull requests here, so your PR will be correctly categorized in the release notes. |
Thanks for the suggestion. Already applied via 0206da8 and eb34dd5.
I'd prefer the first choice, and already pushed the fixes via ca7df38.
Thanks again for the reminder. |
This is ready. @jameslamb @guolinke Could you please review this again. If it looks good to you, let's merge this so that we can work on related PRs. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for .github/workflows/cuda.yml
and tests/python_package_test/test_utilities.py
except one minor simplification.
CHECK_EQ(gradients_pointer_, gradients_.data()); | ||
CHECK_EQ(hessians_pointer_, hessians_.data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't these checks slow? Maybe move them under #ifdef DEBUG
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. These checks are fast.
dismissing to prevent blocking, as I might not be able to review again today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any additional comments.
Please address @StrikerRUS 's suggestions, and then I think @guolinke should re-review this (since you've added significant changes since his original review) prior to this being merged.
Thanks for fixing this!
Co-authored-by: Nikita Titov <[email protected]>
/gha run r-valgrind Workflow R valgrind tests has been triggered! 🚀 Status: success ✔️. |
/gha run r-solaris |
@shiyu1994 Solaris support was removed in #5226. The But thank you for being so thorough! |
@guolinke @jameslamb @StrikerRUS Thanks for reviewing this. I'll merge this since all tests have been passed and it is blocking many related PRs including #4827, #5425 and #4266. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Unfortunately, in a very long time (since the merge of #4630), the ci tests for
cuda_exp
has not actually run.Since the task name for both
cuda
andcuda_exp
iscuda
LightGBM/.github/workflows/cuda.yml
Line 14 in 702db13
in
test.sh
thecuda_exp
option cannot be detected, and it will always run for devicecuda
!LightGBM/.ci/test.sh
Lines 198 to 199 in 702db13
That's a serious mistake. And I think we need to fix it right now for our on-going development of
cuda_exp
.