
[Docs] add sample RayCluster using kube-rbac-proxy for dashboard access control #2578

Merged

Conversation

andrewsykim (Collaborator)

Why are these changes needed?

This adds an example RayCluster using https://github.com/brancz/kube-rbac-proxy for access control to the Ray dashboard.

There will be a follow-up PR to reference this example in a guide / tutorial.
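For context, the pattern is: the head pod runs kube-rbac-proxy as a sidecar that listens on the usual dashboard port (8265) and forwards authorized requests to the real dashboard, which moves to another port (8443, judging by the test logs below). A minimal sketch of that shape, not the merged manifest; the image tag and flag values here are assumptions:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster-with-auth
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-port: '8443'           # move the real dashboard off 8265
    template:
      spec:
        serviceAccountName: kube-rbac-proxy
        containers:
        - name: ray-head
          image: rayproject/ray:2.39.0
        - name: kube-rbac-proxy        # sidecar fronting the dashboard
          image: quay.io/brancz/kube-rbac-proxy:v0.18.1   # tag assumed
          ports:
          - containerPort: 8265        # clients connect here instead
          args:
          # Plain HTTP listener, since the tests below use http://localhost:8265;
          # a production setup would prefer --secure-listen-address with TLS.
          - --insecure-listen-address=0.0.0.0:8265
          - --upstream=http://127.0.0.1:8443/
          - --config-file=/etc/kube-rbac-proxy/config.yaml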

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

andrewsykim (Collaborator, Author)

How to test:

Apply the manifest:

$ kubectl apply -f ray-operator/config/samples/ray-cluster.auth.yaml
configmap/kube-rbac-proxy created
serviceaccount/kube-rbac-proxy created
clusterrolebinding.rbac.authorization.k8s.io/kube-rbac-proxy created
clusterrole.rbac.authorization.k8s.io/kube-rbac-proxy created
raycluster.ray.io/ray-cluster-with-auth created
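Those cluster-scoped objects exist so the proxy itself can call the Kubernetes authn/authz APIs: kube-rbac-proxy validates each bearer token with a TokenReview and authorizes the caller with a SubjectAccessReview. A sketch of what such a ClusterRole typically grants (rules assumed, not copied from the sample):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-rbac-proxy
rules:
# Validate bearer tokens presented by dashboard clients.
- apiGroups: ["authentication.k8s.io"]
  resources: ["tokenreviews"]
  verbs: ["create"]
# Ask the API server whether the authenticated caller is allowed through.
- apiGroups: ["authorization.k8s.io"]
  resources: ["subjectaccessreviews"]
  verbs: ["create"]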

Check the cluster:

$ kubectl get po
NAME                                                 READY   STATUS    RESTARTS   AGE
ray-cluster-with-auth-head-bfmj6                     2/2     Running   0          30s
ray-cluster-with-auth-worker-group-worker-8msfl      1/1     Running   0          30s
ray-cluster-with-auth-worker-group-worker-969zs      1/1     Running   0          30s
ray-cluster-with-auth-worker-group-worker-bb7z5      1/1     Running   0          30s
ray-cluster-with-auth-worker-group-worker-hnvrw      1/1     Running   0          30s

Create a session:

$ kubectl ray session ray-cluster-with-auth &
Ray Dashboard: http://localhost:8265
Ray Interactive Client: http://localhost:10001

Forwarding from 127.0.0.1:8265 -> 8265
Forwarding from [::1]:8265 -> 8265
Forwarding from 127.0.0.1:10001 -> 10001
Forwarding from [::1]:10001 -> 10001

Submit a job and expect a 401:

$ ray job submit --working-dir .  -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
Handling connection for 8265
Traceback (most recent call last):
  File "/usr/local/google/home/andrewsy/code/python/.env/bin/ray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2612, in main
    return cli()
           ^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/cli.py", line 264, in submit
    client = _get_sdk_client(
             ^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/cli.py", line 29, in _get_sdk_client
    client = JobSubmissionClient(
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/sdk.py", line 109, in __init__
    self._check_connection_and_version(
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 248, in _check_connection_and_version
    self._check_connection_and_version_with_url(min_version, version_error_message)
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 267, in _check_connection_and_version_with_url
    r.raise_for_status()
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: http://localhost:8265/api/version

Use a dummy token with no access and expect a 403:

$ kubectl create serviceaccount dummy
serviceaccount/dummy created

$ export RAY_JOB_HEADERS="{\"Authorization\": \"Bearer $(kubectl create token dummy)\"}"

$ ray job submit --working-dir .  -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
Handling connection for 8265
Handling connection for 8265
Job submission server address: http://localhost:8265
Handling connection for 8265
2024-11-26 23:11:54,039	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_7cdffea633c32bf1.zip.
2024-11-26 23:11:54,041	INFO packaging.py:530 -- Creating a file package for local directory '.'.
Handling connection for 8265
Traceback (most recent call last):
  File "/usr/local/google/home/andrewsy/code/python/.env/bin/ray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2612, in main
    return cli()
           ^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/cli.py", line 273, in submit
    job_id = client.submit_job(
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/job/sdk.py", line 214, in submit_job
    self._upload_working_dir_if_needed(runtime_env)
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 398, in _upload_working_dir_if_needed
    upload_working_dir_if_needed(runtime_env, upload_fn=_upload_fn)
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/_private/runtime_env/working_dir.py", line 98, in upload_working_dir_if_needed
    upload_fn(working_dir, excludes=excludes)
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 391, in _upload_fn
    self._upload_package_if_needed(
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 377, in _upload_package_if_needed
    self._upload_package(
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 358, in _upload_package
    self._raise_error(r)
  File "/usr/local/google/home/andrewsy/code/python/.env/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
    raise RuntimeError(
RuntimeError: Request failed with status code 403: Forbidden (user=system:serviceaccount:default:dummy, verb=update, resource=, subresource=)
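That 403 is kube-rbac-proxy running a SubjectAccessReview for system:serviceaccount:default:dummy and getting a deny: the proxy maps the request's HTTP method onto a Kubernetes verb (the package upload is a PUT, hence verb=update) and asks the API server whether the caller may perform it. Exactly which RBAC rule satisfies the check depends on the resourceAttributes in the proxy's config file; if the sample authorizes against non-resource URLs, granting a principal access would look roughly like this hypothetical ClusterRole (bound to the user or service account with a ClusterRoleBinding):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ray-dashboard-access            # hypothetical name
rules:
- nonResourceURLs: ["/*"]               # every dashboard path behind the proxy
  verbs: ["get", "create", "update", "delete"]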

Use my personal access token with admin access:

$ export RAY_JOB_HEADERS="{\"Authorization\": \"Bearer $(gcloud auth print-access-token)\"}"
$ ray job submit --working-dir .  -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
Handling connection for 8265
Handling connection for 8265
Job submission server address: http://localhost:8265
Handling connection for 8265
2024-11-26 23:12:47,382	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_7cdffea633c32bf1.zip.
2024-11-26 23:12:47,385	INFO packaging.py:530 -- Creating a file package for local directory '.'.
Handling connection for 8265
Handling connection for 8265

-------------------------------------------------------
Job 'raysubmit_YeabgKrUavLfaQz2' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_YeabgKrUavLfaQz2
  Query the status of the job:
    ray job status raysubmit_YeabgKrUavLfaQz2
  Request the job to be stopped:
    ray job stop raysubmit_YeabgKrUavLfaQz2

Handling connection for 8265
Tailing logs until the job exits (disable with --no-wait):
Handling connection for 8265
2024-11-26 15:12:49,427	INFO worker.py:1456 -- Using address 10.112.1.39:6379 set in the environment variable RAY_ADDRESS
2024-11-26 15:12:49,427	INFO worker.py:1596 -- Connecting to existing Ray cluster at address: 10.112.1.39:6379...
2024-11-26 15:12:49,438	INFO worker.py:1772 -- Connected to Ray cluster. View the dashboard at 10.112.1.39:8443
{'node:__internal_head__': 1.0, 'node:10.112.1.39': 1.0, 'object_store_memory': 11373512292.0, 'memory': 38654705664.0, 'CPU': 10.0, 'node:10.112.0.42': 1.0, 'node:10.112.2.27': 1.0, 'node:10.112.1.40': 1.0, 'node:10.112.0.41': 1.0}
Handling connection for 8265

------------------------------------------
Job 'raysubmit_YeabgKrUavLfaQz2' succeeded
------------------------------------------

spec:
  containers:
  - name: ray-worker
    image: rayproject/ray:2.34.0
Member:

How about using Ray 2.39.0 instead, in case Ray 2.34.0 has an issue with the dashboard agent?

Collaborator Author:

done

containers:
- name: ray-worker
  image: rayproject/ray:2.34.0
  resources:
Member:

Use fewer resources to ensure users can follow the example in their environments.

Collaborator Author:

done

spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
Member:

Can a user directly send a request to port 8443?

Collaborator Author:

Yes, you can still send requests to 8443. You would have to use a NetworkPolicy or similar to ensure only port 8265 is reachable.

I tried to bind all processes in the head container to 127.0.0.1, but I think that broke some communication with the raylets.

andrewsykim (Collaborator, Author), Nov 27, 2024:

Oh, never mind, I made a mistake in my testing. dashboard-host: 127.0.0.1 actually works, so users cannot directly access port 8443. The only ways in would be exec'ing into the container or port-forwarding.
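So the final sample can simply bind the dashboard to loopback, where only the sidecar (which shares the pod's network namespace) can reach it:

headGroupSpec:
  rayStartParams:
    dashboard-host: '127.0.0.1'   # dashboard unreachable from outside the pod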

      memory: "4Gi"
  readinessProbe:
    httpGet:
      path: "/api/gcs_healthz"
Member:

This may cause issues when the raylet or dashboard agent crashes.

Collaborator Author:

Okay, I'll update to use exec probes, but we should revisit #2360.
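For reference, an exec-style readiness probe for the head would look something like this, reusing the wget-and-grep pattern from the worker probe quoted later in this thread. The port is an assumption and must match wherever the dashboard actually listens inside the pod (8443 here, not 8265, since 8265 now answers 401 through the proxy):

readinessProbe:
  exec:
    command:
    - bash
    - -c
    - wget -T 2 -q -O- http://localhost:8443/api/gcs_healthz | grep success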

kevin85421 self-assigned this on Nov 27, 2024.
andrewsykim force-pushed the add-ray-cluster-auth-sample branch 2 times, most recently from 513be21 to c7d26bd on November 27, 2024 at 04:29.
andrewsykim force-pushed the add-ray-cluster-auth-sample branch from c7d26bd to 63479be on November 27, 2024 at 04:31.
kevin85421 (Member) left a comment:

I guess KubeRay also requires some updates. For example, the HTTP requests from the KubeRay operator to the Ray dashboard for RayJob/RayService. Are these follow-up PRs?

Would you mind opening an umbrella issue to track the progress of the entire KubeRay K8s-native auth effort? Thanks!

template:
  metadata:
  spec:
    serviceAccountName: kube-rbac-proxy
Member:

Note that when the autoscaler is enabled, KubeRay automatically reconciles RBAC resources if serviceAccountName is not specified.

The current configuration may prevent the autoscaler from working. That's fine for this YAML, but it is worth mentioning in the documentation.

Collaborator Author:

Good point. For the autoscaler case, we'll need to add additional resources to the autoscaler ClusterRole; see the sketch below.
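A rough, untested sketch of the kind of extra rules that service account would need with autoscaling enabled; the resource list only approximates what KubeRay normally grants the autoscaler automatically:

# Hypothetical additional ClusterRole rules for the kube-rbac-proxy
# ServiceAccount when the autoscaler runs in the head pod.
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "patch", "delete"]
- apiGroups: ["ray.io"]
  resources: ["rayclusters"]
  verbs: ["get", "list", "watch", "patch"]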

andrewsykim (Collaborator, Author):

I guess KubeRay also requires some updates. For example, the HTTP requests in RayJob/RayService from the KubeRay operator to the Ray dashboard. Are these follow-up PRs?

Yes, these are things we'll need to automate eventually for RayJob / RayService. But I think authentication for RayJob / RayService is not a high priority, because users typically don't use them interactively the way they use RayCluster. Authentication matters most for long-lived RayClusters used interactively, and in that scenario I don't think we need any changes in KubeRay, except for eventually automating the installation of the sidecar.

command:
- bash
- -c
- wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep
Member:

Open an issue to avoid users having to configure the probes.


kevin85421 merged commit 0532645 into ray-project:master on Nov 27, 2024. 25 checks passed.