
Admitted RayJobs remain in pending state when manageJobsWithoutQueueName is true #1568

Closed
astefanutti opened this issue Jan 11, 2024 · 15 comments

@astefanutti
Member

What happened:

When a RayJob managed by Kueue configured with manageJobsWithoutQueueName: true is admitted, it remains in the pending state.

The Job that KubeRay creates to submit the actual job to the Ray cluster stays suspended.
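
For what it's worth, a quick way to confirm the symptom (a hedged diagnostic; the Job name is a placeholder) is to check the suspend flag on that submitter Job, which per the above keeps reporting true even after admission:

$> kubectl get job <submitter-job-name> -o jsonpath='{.spec.suspend}'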

What you expected to happen:

The RayJob should run successfully.

How to reproduce it (as minimally and precisely as possible):

  1. Set manageJobsWithoutQueueName: true in the Kueue configuration (see the sketch below)
  2. Create a RayJob
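
A minimal sketch of the configuration change in step 1, assuming the standard Kueue Configuration API (config.kueue.x-k8s.io/v1beta1); only the relevant field is shown:

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
# ... other fields left unchanged ...
manageJobsWithoutQueueName: true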

Anything else we need to know?:

Relates to #1434.

Environment:

  • Kubernetes version (use kubectl version): v1.25.3
  • Kueue version (use git describe --tags --dirty --always): v0.6.0-devel-146-ged81667f-dirty
@astefanutti astefanutti added the kind/bug Categorizes issue or PR as related to a bug. label Jan 11, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2024
@tenzen-y
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2024
@k8s-triage-robot

/lifecycle stale
@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 9, 2024
@k8s-triage-robot

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 8, 2024
@tenzen-y
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 14, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2024
@mimowo
Contributor

mimowo commented Dec 12, 2024

I'm wondering whether this is more related to #1434 or to the child-owner management. There have been numerous changes in Kueue to child-parent management, so it would be good to re-test end-to-end whether this remains a problem.

@mimowo
Contributor

mimowo commented Dec 12, 2024

cc @dgrove-oss @andrewsykim who recently worked on related aspects of the problem.

@mimowo
Contributor

mimowo commented Dec 12, 2024

/remove-lifecycle stale
This issue is looking for a contributor to re-test it end-to-end.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 12, 2024
@kaisoz
Contributor

kaisoz commented Dec 12, 2024

/assign

I can take care of this, but if any other contributor also wants to have a look, that's more than fine with me 😊

@dgrove-oss
Contributor

Assuming KubeRay links the RayJob and the RayCluster via a controller ref, I agree this should work now.
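
For reference, one way to verify that assumption (the resource name is a placeholder) is to dump the RayCluster's ownerReferences and check that the entry pointing at the RayJob has controller: true:

$> kubectl get raycluster <raycluster-name> -o jsonpath='{.metadata.ownerReferences}'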

@andrewsykim
Member

The submitter Job, which runs "ray job submit", may not be accounted for. But a reasonable workaround is to add labels that map the submitter Job to a specific local queue.
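
For illustration, the label such a workaround would rely on is Kueue's standard queue-name label; how it gets onto the KubeRay-generated submitter Job depends on the setup, so treat this as a sketch with a placeholder queue name:

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # placeholder LocalQueue name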

@kaisoz
Contributor

kaisoz commented Dec 17, 2024

I've tested this on both main and v0.9.1 and I can confirm that this works now. For each version I:

  1. Deployed a kind cluster, Kueue, and the Ray operator
  2. Modified the Kueue ConfigMap to set manageJobsWithoutQueueName: true, then restarted the Kueue controller pods (see the sketch after this list)
  3. Applied the single-clusterqueue setup from the examples
  4. Deployed the RayJob from the example
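
A rough sketch of step 2, assuming the default installation names from the release manifests (kueue-system namespace, kueue-manager-config ConfigMap, kueue-controller-manager Deployment):

$> kubectl -n kueue-system edit configmap kueue-manager-config    # set manageJobsWithoutQueueName: true
$> kubectl -n kueue-system rollout restart deployment kueue-controller-manager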

After deploying the RayJob, I can see it initializing:

$> kubectl get rayjobs
NAME            JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                 START TIME             END TIME   AGE
rayjob-sample                Initializing        rayjob-sample-raycluster-jmmwb   2024-12-17T08:38:28Z              8s

The submitter Job is created and runs:

$> kubectl get jobs
NAME            STATUS    COMPLETIONS   DURATION   AGE
rayjob-sample   Running   0/1           3s         3s
$> kubectl describe job rayjob-sample   
Name:             rayjob-sample
Namespace:        default
Selector:         batch.kubernetes.io/controller-uid=cc396c82-06b5-48a5-8de7-ea093d003eeb
Labels:           app.kubernetes.io/created-by=kuberay-operator
                  ray.io/originated-from-cr-name=rayjob-sample
                  ray.io/originated-from-crd=RayJob
Annotations:      <none>
Controlled By:    RayJob/rayjob-sample
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Suspend:          false
Backoff Limit:    2
Start Time:       Tue, 17 Dec 2024 09:38:58 +0100
Pods Statuses:    0 Active (0 Ready) / 0 Succeeded / 2 Failed
Pod Template:
  Labels:  batch.kubernetes.io/controller-uid=cc396c82-06b5-48a5-8de7-ea093d003eeb
           batch.kubernetes.io/job-name=rayjob-sample
           controller-uid=cc396c82-06b5-48a5-8de7-ea093d003eeb
           job-name=rayjob-sample
  Containers:
   ray-job-submitter:
    Image:      rayproject/ray:2.9.0
    Port:       <none>
    Host Port:  <none>
    Command:
      ray
      job
      submit
      --address
      http://rayjob-sample-raycluster-jmmwb-head-svc.default.svc.cluster.local:8265
      --runtime-env-json
      {"env_vars":{"counter_name":"test_counter"},"pip":["requests==2.26.0","pendulum==2.1.2"]}
      --submission-id
      rayjob-sample-q9jht
      --
      python
      /home/ray/samples/sample_code.py
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     500m
      memory:  200Mi
    Environment:
      PYTHONUNBUFFERED:       1
      RAY_DASHBOARD_ADDRESS:  rayjob-sample-raycluster-jmmwb-head-svc.default.svc.cluster.local:8265
      RAY_JOB_SUBMISSION_ID:  rayjob-sample-q9jht
    Mounts:                   <none>
  Volumes:                    <none>
  Node-Selectors:             <none>
  Tolerations:                <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  21s   job-controller  Created pod: rayjob-sample-th89s
  Normal  SuccessfulCreate  8s    job-controller  Created pod: rayjob-sample-fg7zv

@mimowo
Contributor

mimowo commented Dec 17, 2024

sgtm, thank you for testing @kaisoz. Still, the automated sanity tests for Ray will be useful: #3829
/close

@k8s-ci-robot
Contributor

@mimowo: Closing this issue.

In response to this:

sgtm, thank you for testing @kaisoz .
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
