Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: bump training-operator v1.7 -> v1.8 #162

Merged

Conversation

DnPlas
Copy link
Contributor

@DnPlas DnPlas commented May 31, 2024

This commit introduces the following changes:

  • The charm now renders and applies a ValidatingWebhookConfiguration resource for
    training-operator CRDs.
  • The charm will render the Service to also serve on port 9443 for the webhook service.
  • The oci-image is updated to v1.8 of the training-operator
  • The training-operator Deployment now has a volume mount for mounting
    the secret that is used by the cert-controller to generate and rotate
    certificates for the ValidatingWebhookConfiguration
  • The training-operator Deployment will now take an argument so the webhook service can use the training-operator workload's Service instead of the default
  • Updates the examples directory with examples from kubeflow/training-operator v1.8-branch

Fixes #159

DnPlas added 5 commits May 31, 2024 12:21
This commit refactors the way the training-operator is deployed, as instead of using
a sidecar container that runs the workload, we are now applying the Deployment and all
the Kubernetes resources required by the training-operator controller to be able to mange
training resources.
We are introducing this change in preparation for the upcoming 1.8 version, as it introduces the
hard dependency on a Kubernetes Secret mounted in a volume for the training-operator workload
to start. For more details please refer to #159.
This commit introduces the following changes:
* The charm now renders and applies a ValidatingWebhookConfiguration resource for
  training-operator CRDs.
* The charm will now patch the Service to also serve on port 9443 for the webhook service.
* The oci-image is updated to v1.8 of the training-operator
* The training-operator Deployment now has a volume mount for mounting
  the secret that is used by the cert-controller to generate and rotate
  certificates for the ValidatingWebhookConfiguration

Fixes #159
DnPlas and others added 3 commits June 18, 2024 16:37
* pin integration test dependencies, refactor constants in tests (#155)

Pins dependencies in the integration tests to their corresponding channels for this release.

Ref: canonical/bundle-kubeflow#866

Co-authored-by: Andrew Scribner <[email protected]>
* refactor: deploy the training-operator with kubernetes resources

This commit refactors the way the training-operator is deployed, as instead of using
a sidecar container that runs the workload, we are now applying the Deployment and all
the Kubernetes resources required by the training-operator controller to be able to mange
training resources.
We are introducing this change in preparation for the upcoming 1.8 version, as it introduces the
hard dependency on a Kubernetes Secret mounted in a volume for the training-operator workload
to start. For more details please refer to #159.
@DnPlas DnPlas marked this pull request as ready for review June 19, 2024 12:18
@DnPlas DnPlas requested a review from a team as a code owner June 19, 2024 12:18
Copy link
Contributor

@orfeas-k orfeas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/charm.py Outdated Show resolved Hide resolved
src/charm.py Outdated Show resolved Hide resolved
@DnPlas
Copy link
Contributor Author

DnPlas commented Jun 19, 2024

Thank you @DnPlas, leaving some comments plus IIUC the following changes are missing:

* image + `secretGenerator` field https://github.com/kubeflow/manifests/pull/2699/files#diff-23b73c27a3cd384ed58c680eac804ff86be38f48688cc94742c51c530e4164e8

* changes in aggregation roles https://github.com/kubeflow/manifests/pull/2699/files#diff-8b80f589b6814910a625e07ed23ff511ef76e81cf009dad169eba0f26ed04497

* changes in training-operator role https://github.com/kubeflow/manifests/pull/2699/files#diff-a88903de8c7c015c3e145b77718f9cee059aedfc24e3b0f37d246f979a4bb15b

I have updated the ClusterRole and aggregated rules, as well as the image tag, but the secretGenerator is a field in the kustomize.yaml file rather than in the manifests; it is used for rendering the Secret's metadata.name field.

@DnPlas DnPlas force-pushed the KF-5692-training-1.8-dev-branch branch 2 times, most recently from 5bb0371 to 32ecee6 Compare June 19, 2024 17:07
@DnPlas
Copy link
Contributor Author

DnPlas commented Jun 19, 2024

kubeflow/training-operator#2143 could be affecting (intermittently) the CI.

Copy link
Contributor

@orfeas-k orfeas-k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments

src/charm.py Show resolved Hide resolved
.github/workflows/integrate.yaml Show resolved Hide resolved
src/templates/secret.yaml.j2 Show resolved Hide resolved
@DnPlas
Copy link
Contributor Author

DnPlas commented Jun 26, 2024

Please review #171 before this, it should help working around the CI issue.

@DnPlas
Copy link
Contributor Author

DnPlas commented Jul 1, 2024

This PR is now affected by #174

Copy link
Contributor

@NohaIhab NohaIhab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx @DnPlas, almost there
We need to update the yamls in /examples, I see some images for the examples were updated in the training-opeartor 1.8 branch

Copy link
Contributor

@NohaIhab NohaIhab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DnPlas DnPlas merged commit 8f21f06 into KF-5692-training-1.8-dev-branch Jul 4, 2024
8 checks passed
@DnPlas DnPlas deleted the KF-5692-upgrade-trainig-operator-1.8 branch July 4, 2024 11:56
DnPlas added a commit that referenced this pull request Jul 4, 2024
This commit introduces the following changes:

* The charm now renders and applies a ValidatingWebhookConfiguration resource for 
  training-operator CRDs.
* The charm will render the Service to also serve on port 9443 for the webhook service.
* The oci-image is updated to v1.8 of the training-operator
* The training-operator Deployment now has a volume mount for mounting
  the secret that is used by the cert-controller to generate and rotate
  certificates for the ValidatingWebhookConfiguration
* The training-operator Deployment will now take an argument so the webhook service can use the training-operator workload's Service instead of the default
* Updates the examples directory with examples from kubeflow/training-operator v1.8-branch

Fixes #159
DnPlas added a commit that referenced this pull request Jul 9, 2024
…` for workload, also bumps training-operator 1.7->1.8 (#167)

* pin integration test dependencies, refactor constants in tests (#164)
* refactor: deploy the training-operator with kubernetes resources (#161)
* chore: bump training-operator v1.7 -> v1.8 (#162)
* refactor: apply a workload Service instead of using juju created one (#173)
* tests: skip test_upgrade due to #170 (#171)
* build, tests: bump charmed-kubeflow-chisme 0.4.0 -> 0.4.1 (#172)

Fixes #159
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants