Training operator CICD improvements (kubeflow#2779)
* Add the networkpolicies

Signed-off-by: juliusvonkohout <[email protected]>

* rework the training operator tests

Signed-off-by: juliusvonkohout <[email protected]>

* fix the comments

Signed-off-by: juliusvonkohout <[email protected]>

* fix filename

Signed-off-by: juliusvonkohout <[email protected]>

* try to fix the permissions

Signed-off-by: juliusvonkohout <[email protected]>

* try to fix the permissions

Signed-off-by: juliusvonkohout <[email protected]>

* change to the user namespace

Signed-off-by: juliusvonkohout <[email protected]>

* update the image to rc.1

Signed-off-by: juliusvonkohout <[email protected]>

* fixes

Signed-off-by: juliusvonkohout <[email protected]>

* fixes

Signed-off-by: juliusvonkohout <[email protected]>

* fixes

Signed-off-by: juliusvonkohout <[email protected]>

* fixes

Signed-off-by: juliusvonkohout <[email protected]>

* fixes

Signed-off-by: juliusvonkohout <[email protected]>

---------

Signed-off-by: juliusvonkohout <[email protected]>
juliusvonkohout authored Jul 26, 2024
1 parent 591349d commit 5ac0da5
Showing 10 changed files with 116 additions and 71 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/linting_bash_python_yaml_files.yaml
@@ -1,6 +1,6 @@
name: Proper linting on Bash, Python, and YAML files

-on: [push, pull_request]
+on: [pull_request]

jobs:
format_python_files:
2 changes: 1 addition & 1 deletion .github/workflows/model_registry_test.yaml
@@ -61,7 +61,7 @@ jobs:
'http://localhost:8081/api/model_registry/v1alpha3/registered_models?pageSize=100&orderBy=ID&sortOrder=DESC' \
-H 'accept: application/json'
-# for these steps below ensure same steps as kserve (ie: Istio with ext external authentication, cert-manager, knative) so to achieve same setup
+# for these steps below ensure same steps as kserve (ie: Istio with external authentication, cert-manager, knative) so to achieve same setup
- name: Port forward Istio gateway
run: |
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')
43 changes: 0 additions & 43 deletions .github/workflows/train_operator_test.yaml

This file was deleted.

57 changes: 57 additions & 0 deletions .github/workflows/training_operator_test.yaml
@@ -0,0 +1,57 @@
name: Build & Apply Training Operator manifests in KinD
on:
  pull_request:
    paths:
    - .github/workflows/training_operator_test.yaml
    - apps/training-operator/upstream/**
    - tests/gh-actions/kind-cluster.yaml
    - tests/gh-actions/install_kind.sh
    - tests/gh-actions/install_kustomize.sh
    - tests/gh-actions/install_istio.sh
    - common/istio*/**
    - tests/gh-actions/kf-objects/tfjob.yaml

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Install KinD
      run: ./tests/gh-actions/install_kind.sh

    - name: Create KinD Cluster
      run: kind create cluster --config tests/gh-actions/kind-cluster.yaml

    - name: Install kustomize
      run: ./tests/gh-actions/install_kustomize.sh

    - name: Install kubectl
      run: ./tests/gh-actions/install_kubectl.sh

    - name: Install Istio with external authentication
      run: ./tests/gh-actions/install_istio_with_ext_auth.sh

    - name: Install cert-manager
      run: ./tests/gh-actions/install_cert_manager.sh

    - name: Create kubeflow namespace
      run: kustomize build common/kubeflow-namespace/base | kubectl apply -f -

    - name: Install KF Multi Tenancy
      run: ./tests/gh-actions/install_multi_tenancy.sh

    - name: Install kubeflow-istio-resources
      run: kustomize build common/istio-1-22/kubeflow-istio-resources/base | kubectl apply -f -

    - name: Create KF Profile
      run: kustomize build common/user-namespace/base | kubectl apply -f -

    - name: Install training operator
      run: ./tests/gh-actions/install_training_operator.sh

    - name: Create a PyTorchJob
      run: |
        kubectl create -f tests/gh-actions/kf-objects/training_operator_job.yaml -n kubeflow-user-example-com
        kubectl wait --for=condition=Succeeded PyTorchJob pytorch-simple -n kubeflow-user-example-com --timeout 600s
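
When the PyTorchJob never reaches `Succeeded`, the `kubectl wait` call in the last step reports only a timeout. A minimal sketch of a wrapper that prints some diagnostics in that case is shown here. `wait_for_job` is a hypothetical helper and is not part of this commit; it assumes `kubectl` is on the PATH and already points at the KinD cluster.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): wait for a PyTorchJob to
# succeed and, if the wait fails, print the job description and pod list.
wait_for_job() {
  local name="$1" namespace="$2" timeout="${3:-600s}"

  if kubectl wait --for=condition=Succeeded PyTorchJob "${name}" \
      -n "${namespace}" --timeout="${timeout}"; then
    echo "PyTorchJob ${name} succeeded"
    return 0
  fi

  # The wait failed or timed out: collect diagnostics on stderr.
  echo "PyTorchJob ${name} did not succeed within ${timeout}" >&2
  kubectl describe PyTorchJob "${name}" -n "${namespace}" >&2 || true
  kubectl get pods -n "${namespace}" -o wide >&2 || true
  return 1
}

# Example call, using the names from the workflow above:
# wait_for_job pytorch-simple kubeflow-user-example-com 600s
```

The helper returns a non-zero status on failure, so a workflow step that calls it would still fail, only with more context in the log.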
8 changes: 4 additions & 4 deletions common/networkpolicies/base/training-operator-webhook.yaml
@@ -13,8 +13,8 @@ spec:
# https://www.elastic.co/guide/en/cloud-on-k8s/1.1/k8s-webhook-network-policies.html
# The kubernetes api server must reach the webhook
ingress:
-- ports:
-- protocol: TCP
-port: 9443
+- ports:
+- protocol: TCP
+port: 9443
policyTypes:
-- Ingress
+- Ingress
3 changes: 3 additions & 0 deletions tests/gh-actions/install_multi_tenancy.sh
@@ -7,3 +7,6 @@ kubectl -n kubeflow wait --for=condition=Ready pods -l kustomize.component=profi

echo "Installing Multitenancy Kubeflow Roles"
kustomize build common/kubeflow-roles/base | kubectl apply -f -

echo "Installing Multitenancy Network policies"
kustomize build common/networkpolicies/base | kubectl apply -f -
9 changes: 9 additions & 0 deletions tests/gh-actions/install_training_operator.sh
@@ -0,0 +1,9 @@
#!/bin/bash
set -euo pipefail
echo "Installing training operator ..."

cd apps/training-operator/upstream
kustomize build overlays/kubeflow | kubectl apply -f -
kubectl wait --for=condition=Ready pods --all --all-namespaces --timeout=600s \
  --field-selector=status.phase!=Succeeded
cd -
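
Applying manifests that contain CRDs and webhooks can fail transiently on a fresh cluster, and with `set -e` a single failed `kubectl apply` aborts the script. A minimal sketch of a retry wrapper is shown here. `retry` is a hypothetical helper and is not part of this commit; the attempt count and delay in the example call are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical helper (not part of the commit): run a command up to
# <attempts> times, sleeping <delay> seconds between failed attempts.
retry() {
  local attempts="$1" delay="$2" try=1
  shift 2

  until "$@"; do
    if [ "${try}" -ge "${attempts}" ]; then
      echo "command failed after ${attempts} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${try} failed, retrying in ${delay}s" >&2
    try=$((try + 1))
    sleep "${delay}"
  done
}

# Example call (values are assumptions, not taken from the commit):
# retry 5 10 bash -c 'kustomize build overlays/kubeflow | kubectl apply -f -'
```

Because the command runs as the condition of `until`, a failing attempt does not trigger `set -e`; only the final `return 1` does.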
@@ -15,7 +15,7 @@ spec:
spec:
containers:
- name: test
-image: kubeflownotebookswg/jupyter-scipy:v1.9.0-rc.1
+image: kubeflownotebookswg/jupyter-scipy:v1.9.0
imagePullPolicy: IfNotPresent
resources:
limits:
21 changes: 0 additions & 21 deletions tests/gh-actions/kf-objects/tfjob.yaml

This file was deleted.

40 changes: 40 additions & 0 deletions tests/gh-actions/kf-objects/training_operator_job.yaml
@@ -0,0 +1,40 @@
# from https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml
# and disabled istio as stated in the documentation https://www.kubeflow.org/docs/components/training/user-guides/pytorch/
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
