Installing KubeFlow Training with CodeFlare SDK
- Access the spawner page by going to your Open Data Hub dashboard
- Install the kubeflow-training SDK for the KubeFlow Training Operator

Getting the training-operator SDK to work from a Jupyter Data Science Notebook
Kubeflow Training Operator: https://github.com/kubeflow/training-operator
The goal is to get one of these training examples running: https://www.kubeflow.org/docs/components/training/
0.1 An OpenShift cluster
0.2 Logged in to the OpenShift console UI
0.3 Also logged in to the terminal with oc login
0.4 An opendatahub namespace created:
oc new-project opendatahub
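If you want a quick sanity check that you're logged in and the namespace exists, you can run:
oc whoami
oc get project opendatahub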
Using your terminal where you're logged in with oc login, issue this command to install the Open Data Hub operator:
cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/opendatahub-operator.openshift-operators: ""
  name: opendatahub-operator
  namespace: openshift-operators
spec:
  channel: fast
  installPlanApproval: Automatic
  name: opendatahub-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: opendatahub-operator.v2.4.0
EOF
You can check it started with:
oc get pods -n openshift-operators
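You can also confirm that the operator's CSV reached the Succeeded phase (the exact CSV name may differ depending on the version that gets installed):
oc get csv -n openshift-operators | grep opendatahub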
Next, create a DataScienceCluster:
cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: opendatahub-operator
  name: example-dsc
  namespace: opendatahub
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Managed
    workbenches:
      managementState: Managed
EOF
You can check that the pods all started with:
oc get pods -n opendatahub
and it should look like this:
NAME READY STATUS RESTARTS AGE
kuberay-operator-5d9567bdf4-7rt2n 1/1 Running 0 59s
notebook-controller-deployment-6468bbf669-rlt64 1/1 Running 0 71s
odh-dashboard-649fdc86bb-2jdv2 2/2 Running 0 73s
odh-dashboard-649fdc86bb-pgkzw 2/2 Running 0 73s
odh-notebook-controller-manager-86d9b47b54-s9g45 1/1 Running 0 72s
Next, install the Kubeflow Training Operator:
oc apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
Note: I'm trying the master branch here instead:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
You can see that it started with:
oc get pods -n kubeflow
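If you prefer a non-interactive check, you can also wait for the operator deployment to report Available (the deployment name training-operator comes from the standalone overlay applied above):
oc wait --for=condition=Available deployment/training-operator -n kubeflow --timeout=300s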
Note, if you're having pull issues from docker.io, you can change your deployment to pull from quay.io instead with this:
oc set image deployment training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n kubeflow
Note: the initContainer pulls from docker.io/alpine:3.10 automatically, which causes trouble on some clusters that are rate-limited by docker.io. To get around this, you can run the following command to patch the training-operator to use a different registry for the initContainer:
oc patch deployment training-operator -n kubeflow --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager", "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'
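To confirm the patch took effect, you can print the manager container's command and look for the new flag (just a sanity check):
oc get deployment training-operator -n kubeflow -o jsonpath='{.spec.template.spec.containers[0].command}'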
4. Access the Open Data Hub dashboard at:
https://odh-dashboard-$ODH_NAMESPACE.apps.<your cluster's uri>
You can find it with this command:
oc get route -n opendatahub |grep dash
For example: https://odh-dashboard-odh.apps.jimbig412.cp.fyre.ibm.com/
- If prompted, give it your kubeadmin user and password
- If prompted, grant it access as well
4.1 On the far left, click on "Data Science Projects" and then click on Create a Data Science Project. (This will create a new namespace with that name.)
for example:
Name: demo-dsp
Description: Demo's DSP
Then press "Create"
4.2 Within your new Data Science Project, select "Create workbench"
- give it a name, like "demo-wb"
- choose "Jupyter Data Science" for the image
- click "Create workbench" at the bottom.
4.3 You'll see the status as "Starting" initially.
- Once it's in the running status, click on the blue "Open" link in the workbench to get access to the notebook.
4.4 Click on the black "Terminal" tile under the Other section to open up a terminal window.
Inside this terminal, do an "oc login" so that terminal has access to your OpenShift Cluster. For example:
oc login --token=sha256~lamzJ-exoR16UsbltkT-l0nKCL7XTSvLqqB4i54psBM --server=https://api.jimmed414.cp.fyre.ibm.com:6443
4.5 Now you should be able to see the pods on your OpenShift cluster. For example:
oc get pods
Will return the pods in your newly created namespace:
NAME READY STATUS RESTARTS AGE
demo-wb-0 2/2 Running 0 14m
5. In your Jupyter Notebook image, install the kubeflow-training SDK for KubeFlow Training Operator:
pip install kubeflow-training
Note: if you want the SDK from the main branch instead, clone the repo (step 5.1) and then install it by hand (step 5.2).
5.1 In your Jupyter Notebook, clone the training-operator repo:
git clone https://github.com/kubeflow/training-operator.git
5.2 If you're installing the SDK by hand, do this:
cd /opt/app-root/src/training-operator/sdk/python
pip install -e .
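To verify which copy of the SDK is active in the notebook environment, you can check the installed package and try the import (TrainingClient is the SDK's main client class):
pip show kubeflow-training
python -c "from kubeflow.training import TrainingClient; print('kubeflow-training import OK')"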
5.3 On the left-hand side, expand the path to find simple.yaml:
--> training-operator --> examples --> pytorch --> simple.yaml
5.4 Open up simple.yaml and change the namespace:
from:
kubeflow
to
demo-dsp
And then save the simple.yaml
Note, if your system is rate-limited pulling images from docker.io, then you can also switch the simple.yaml to use this image instead.
Change (in 2 places):
docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
to
quay.io/jbusche/pytorch-mnist:v1beta1-45c5727
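If you'd rather make both of these edits from the terminal instead of the file editor, sed one-liners like these (using the paths and image names from above) do the same thing:
cd /opt/app-root/src/training-operator/examples/pytorch
sed -i 's/namespace: kubeflow/namespace: demo-dsp/' simple.yaml
sed -i 's|docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727|quay.io/jbusche/pytorch-mnist:v1beta1-45c5727|g' simple.yaml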
5.5 Using the terminal, apply the simple.yaml:
cd /opt/app-root/src/training-operator/examples/pytorch
oc apply -f simple.yaml
and then watch it:
watch oc get pods,pytorchjobs -n demo-dsp
and it should look like this when it's all done:
Every 2.0s: oc get pods,pytorchjobs -n demo-dsp demo-wb-0: Thu Feb 1 18:45:46 2024
NAME READY STATUS RESTARTS AGE
pod/demo-wb-0 2/2 Running 0 12m
pod/pytorch-simple-master-0 0/1 Completed 0 6m
pod/pytorch-simple-worker-0 0/1 Completed 0 6m
NAME STATE AGE
pytorchjob.kubeflow.org/pytorch-simple Succeeded 6m
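Before deleting the job, you can also look at the training output from the master pod (the pod name is taken from the listing above):
oc logs pytorch-simple-master-0 -n demo-dsp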
5.6 Then you can delete it:
oc delete pytorchjob pytorch-simple
6.0 Let's try a CPU demo. Note, for this demo, we had to build an image using the given Dockerfile. I've built the image and have stored it out on Quay.io.
6.1 On the left-hand side, navigate to training-operator --> examples --> pytorch --> cpu-demo
6.2 Double-click on the demo.yaml and change it in two places:
image: pytorch-cpu:py3.8
to
image: quay.io/jbusche/pytorch-cpu:py3.8
6.3 Save the demo.yaml and then using the terminal, submit it:
cd /opt/app-root/src/training-operator/examples/pytorch/cpu-demo
oc apply -f demo.yaml
6.4 and then watch it:
watch oc get pods,pytorchjobs -n demo-dsp
and it should look like this when it's all done:
Every 2.0s: oc get pods,pytorchjobs -n demo-dsp demo-wb-0: Thu Feb 1 19:07:06 2024
NAME READY STATUS RESTARTS AGE
pod/demo-wb-0 2/2 Running 0 34m
pod/torchrun-cpu-master-0 0/1 Completed 0 2m41s
pod/torchrun-cpu-worker-0 0/1 Completed 0 2m41s
NAME STATE AGE
pytorchjob.kubeflow.org/torchrun-cpu Succeeded 2m41s
6.5 And then you can delete it when you're done:
oc delete pytorchjob torchrun-cpu
7.0 Let's try an MNIST example. Note, for this demo, we had to build an image using the given Dockerfile. I've built the image and have stored it out on Quay.io.
7.1 On the left-hand side, navigate to training-operator --> examples --> pytorch --> mnist --> v1
7.2 Double-click on the pytorch_job_mnist_gloo.yaml and change it in two places:
image: gcr.io/<your_project>/pytorch_dist_mnist:latest
to
image: quay.io/jbusche/pytorch_dist_mnist:latest
and also change the GPU count to 0 if you don't have any GPUs:
nvidia.com/gpu: 1
to
nvidia.com/gpu: 0
7.3 Save the file and then, using the terminal, submit it:
cd /opt/app-root/src/training-operator/examples/pytorch/mnist/v1
oc apply -f pytorch_job_mnist_gloo.yaml
7.4 and then watch it:
watch oc get pods,pytorchjobs -n demo-dsp
and it should look like this when it's all done (mine took about 9 minutes):
Every 2.0s: oc get pods,pytorchjobs -n demo-dsp demo-wb-0: Thu Feb 1 20:11:02 2024
NAME READY STATUS RESTARTS AGE
pod/demo-wb-0 2/2 Running 0 98m
pod/pytorch-dist-mnist-gloo-master-0 0/1 Completed 0 8m53s
pod/pytorch-dist-mnist-gloo-worker-0 0/1 Completed 0 8m53s
NAME STATE AGE
pytorchjob.kubeflow.org/pytorch-dist-mnist-gloo Succeeded 8m53s
7.5 And then you can delete it when you're done:
oc delete pytorchjob pytorch-dist-mnist-gloo
Problem 1. The release version of the kubeflow-training SDK doesn't have all the constants needed to run the examples.
Usually you would install the sdk with this command:
pip install kubeflow-training
and you'd get what is currently the 1.7.0 version of the SDK.
But to actually get it to work with the examples, you'd want to build from the main branch:
git clone https://github.com/kubeflow/training-operator.git
cd /opt/app-root/src/training-operator/sdk/python
pip install -e .
Problem 2. We need to add the workbench service account to the training-operator clusterrolebinding so that it has permission to create pytorchjobs.
oc edit clusterrolebinding training-operator
and at the end, append the service account you're using, for example:
- kind: ServiceAccount
name: demo-wb
namespace: demo-dsp
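If you'd rather not hand-edit the clusterrolebinding, an equivalent JSON patch (using the example demo-wb service account in the demo-dsp namespace) would be:
oc patch clusterrolebinding training-operator --type='json' -p='[{"op": "add", "path": "/subjects/-", "value": {"kind": "ServiceAccount", "name": "demo-wb", "namespace": "demo-dsp"}}]'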
Problem 3. For the SDK to be able to pull the log results, a rule needs to be added to the training-operator clusterrole under the pods resources. Do this:
oc edit clusterrole training-operator
and under the resources list that contains - pods, add:
- pods/log
otherwise, we see:
cannot get resource \"pods/log\" in API group \"\" in the namespace \"demo-dsp\"","reason":"Forbidden"
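For reference, after the edit the rule that contains pods should look roughly like this (the verbs shown are only illustrative; keep whatever verbs your clusterrole already lists):
- apiGroups:
  - ""
  resources:
  - pods
  - pods/log
  verbs:
  - get
  - list
  - watch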
Problem 4. For the SDK to be able to use PVCs, appending the following to the same training-operator clusterrole rule fixes that error too. Do this:
oc edit clusterrole training-operator
and under the resources list that contains - pods, add:
- persistentvolumeclaims
otherwise, we see:
forbidden: User \"system:serviceaccount:demo-dsp:demo-wb\" cannot create resource \"persistentvolumeclaims\" in API group \"\" in the namespace \"demo-dsp\"","reason":"Forbidden","details":{"kind":"persistentvolumeclaims"},"code":403}
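You can verify that the workbench service account now has both permissions (using the example names from above):
oc auth can-i get pods/log --as=system:serviceaccount:demo-dsp:demo-wb -n demo-dsp
oc auth can-i create persistentvolumeclaims --as=system:serviceaccount:demo-dsp:demo-wb -n demo-dsp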
There are four SDK demos available here: https://github.com/kubeflow/training-operator/tree/master/examples/sdk
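The demos live in the repo you cloned earlier, so you can browse to them in the workbench file browser or list them from the terminal:
ls /opt/app-root/src/training-operator/examples/sdk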