
integration-with-profiles tests failed in CI with "Failed to execute kubectl auth" #136

Closed
orfeas-k opened this issue Sep 12, 2023 · 2 comments · Fixed by #147
Labels
bug Something isn't working

Comments

@orfeas-k
Contributor

orfeas-k commented Sep 12, 2023

integration-with-profiles tests failed in CI with the following error

FAILED tests/integration/test_charm_with_profile.py::test_authorization_for_creating_resources[examples/tfjob.yaml] - AssertionError: Failed to execute kubectl auth (1): no

The error seems intermittent, since rerunning the CI fixed it. I created this issue to document that we have come across this, in case we stumble upon it again.

Reproduce

Not sure how to reproduce since the error seems intermittent.
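
For reference, here is a minimal sketch of the kind of check the failing assertion wraps, assuming the test shells out to kubectl auth can-i via subprocess; the resource, namespace, and service-account names below are illustrative, not the test's actual values:

# Hypothetical sketch, not the actual test code: `kubectl auth can-i` prints
# "no" and exits 1 when access is denied, which matches the failure message
# "Failed to execute kubectl auth (1): no".
import subprocess

def can_create(resource: str, namespace: str, service_account: str) -> subprocess.CompletedProcess:
    return subprocess.run(
        [
            "kubectl", "auth", "can-i", "create", resource,
            f"--as=system:serviceaccount:{namespace}:{service_account}",
            "--namespace", namespace,
        ],
        capture_output=True,
        text=True,
    )

result = can_create("tfjobs.kubeflow.org", "user123", "default-editor")
assert result.stdout.strip() == "yes", (
    f"Failed to execute kubectl auth ({result.returncode}): {result.stdout.strip()}"
)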

Logs

Nothing looks off in the CI logs. These are the last workload logs:

2023-09-12T12:10:41.199Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.machinelock machinelock.go:202 created rotating log file "/var/log/machine-lock.log" with max size 10 MB and max backups 5
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.machinelock machinelock.go:186 machine lock released for training-operator/0 uniter (run start hook)
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.worker.uniter.operation executor.go:121 lock released for training-operator/0
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.worker.uniter resolver.go:188 no operations in progress; waiting for changes
2023-09-12T12:10:41.200Z [container-agent] 2023-09-12 12:10:41 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] idle:

and these are the last operator logs:

2023-09-12T12:10:14.116Z [training-operator] 2023-09-12T12:10:14Z	INFO	Starting workers	{"controller": "mpijob-controller", "worker count": 1}
2023-09-12T12:10:14.920Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.002027076s 200
2023-09-12T12:10:27.782Z [pebble] GET /v1/plan?format=yaml 3.65407ms 200
2023-09-12T12:10:39.591Z [pebble] GET /v1/plan?format=yaml 133.702µs 200
@orfeas-k orfeas-k added the bug Something isn't working label Sep 12, 2023
@orfeas-k orfeas-k changed the title from 'integration-with-profiles tests failed in CI with Failed to execute kubectl auth' to 'integration-with-profiles tests failed in CI with "Failed to execute kubectl auth"' Sep 12, 2023
@ca-scribner
Contributor

Debugging this locally. I can confirm that this works:

  • deploy kubeflow-profiles and kubeflow-roles
  • create a profile user123 manually
  • try kubectl auth can-i create pytorchjob --as=system:serviceaccount:user123:default-editor --namespace user123 --> result: no, with `Warning: the server doesn't have a resource type 'pytorchjob'` (this is the expected result)
  • locally build and deploy training-operator from main
  • try kubectl auth can-i create pytorchjob --as=system:serviceaccount:user123:default-editor --namespace user123 --> result: yes

So at least sometimes, locally, this does work.
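
The "create a profile user123 manually" step above amounts to applying a Profile resource; a sketch of that step follows, with the owner field as an assumed placeholder based on the upstream Profile CRD:

# Hypothetical sketch of the manual Profile creation step; the owner name is a
# placeholder. Equivalent to `kubectl apply -f profile.yaml`.
import subprocess

PROFILE_MANIFEST = """\
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: user123
spec:
  owner:
    kind: User
    name: user123@example.com
"""

subprocess.run(["kubectl", "apply", "-f", "-"],
               input=PROFILE_MANIFEST, text=True, check=True)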

@ca-scribner
Copy link
Contributor

Looking through my local setup, I see a ~3s delay between a Profile object being created and the RoleBinding that grants that profile's ServiceAccount permission to use Kubeflow resources being created. I think that delay is the cause of this flaky test failure. To test this, I added some instrumentation and can see that we are using the Profile before the profile-controller has finished instantiating it.
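
Roughly the kind of instrumentation described: after creating the Profile, poll the new namespace until the default-editor RoleBinding shows up and report how long it took (a sketch with assumed names, not the exact instrumentation used):

# Hypothetical instrumentation sketch: measure the gap between Profile creation
# and the profile-controller creating the default-editor RoleBinding. Assumes
# the Profile "user123" was applied immediately before this runs and that the
# RoleBinding name contains "default-editor".
import subprocess
import time

NAMESPACE = "user123"

start = time.monotonic()
while True:
    rolebindings = subprocess.run(
        ["kubectl", "get", "rolebindings", "-n", NAMESPACE, "-o", "name"],
        capture_output=True, text=True,
    ).stdout
    if "default-editor" in rolebindings:
        break
    time.sleep(0.2)

print(f"RoleBinding appeared after {time.monotonic() - start:.1f}s")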

orfeas-k pushed a commit that referenced this issue Nov 22, 2023
…#147)

Adds logic to give a new Profile enough time to be initialized before we try
to use it, as well as better logging to make it clear what is happening.

Fixes #136
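
The wait logic could look something like the following: retry the authorization check for a bounded time instead of asserting on the first attempt, so a freshly created Profile has a chance to be initialized (a sketch using tenacity; the actual implementation in #147 may differ):

# Hypothetical sketch of the fix: retry the kubectl auth check until the
# profile-controller has finished setting up the namespace, or time out.
import subprocess
from tenacity import retry, stop_after_delay, wait_fixed

@retry(stop=stop_after_delay(120), wait=wait_fixed(5), reraise=True)
def assert_can_create(resource: str, namespace: str, service_account: str) -> None:
    result = subprocess.run(
        [
            "kubectl", "auth", "can-i", "create", resource,
            f"--as=system:serviceaccount:{namespace}:{service_account}",
            "--namespace", namespace,
        ],
        capture_output=True,
        text=True,
    )
    assert result.stdout.strip() == "yes", (
        f"Failed to execute kubectl auth ({result.returncode}): {result.stdout.strip()}"
    )

assert_can_create("tfjobs.kubeflow.org", "user123", "default-editor")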