v1.1.0 release

@kevin85421 kevin85421 released this 23 Mar 04:05
· 333 commits to master since this release
8adc538

Highlights

  • RayJob improvements

  • Structured logging

    • In KubeRay v1.1.0, we have changed the KubeRay logs to JSON format, and each log message includes context information such as the custom resource’s name and reconcileID. Hence, users can filter the logs down to those associated with a particular RayCluster, RayJob, or RayService CR by its name.
  • RayService improvements

    • Refactor the health check mechanism to improve stability.
    • Deprecate deploymentUnhealthySecondThreshold and serviceUnhealthySecondThreshold to avoid unintentionally preparing a new RayCluster custom resource.
  • TPU multi-host PodSlice support

    • The KubeRay team is actively working with the Google GKE and TPU teams on integration. The required changes in KubeRay have already been completed. The GKE team will complete some tasks on their side this week or next. Then, users should be able to use multi-host TPU PodSlice with a static RayCluster (without autoscaling).
  • Stop publishing images on DockerHub; instead, we will only publish on Quay.
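
As an illustration of the structured logging highlight above, here is a minimal sketch of filtering JSON log lines by custom resource name. The log key layout (a top-level key named after the CR kind holding a "name" field, alongside "reconcileID") and the sample lines are assumptions made up for this example, not the confirmed KubeRay log schema; adjust the keys to match what your operator actually emits.

```python
import json


def filter_logs_by_cr(lines, kind, name):
    """Yield parsed JSON log records that reference the given custom resource.

    Assumption: each record stores the CR reference under a key equal to the
    CR kind (e.g. "RayCluster") as an object with a "name" field.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        ref = record.get(kind)
        if isinstance(ref, dict) and ref.get("name") == name:
            yield record


# Hypothetical sample lines; real operator output may use different keys.
SAMPLE_LOGS = [
    '{"level":"info","msg":"reconciling","RayCluster":{"name":"demo-cluster","namespace":"default"},"reconcileID":"a1"}',
    '{"level":"info","msg":"reconciling","RayJob":{"name":"demo-job","namespace":"default"},"reconcileID":"b2"}',
    '{"level":"info","msg":"pods ready","RayCluster":{"name":"demo-cluster","namespace":"default"},"reconcileID":"c3"}',
    'not a json line',
]

matches = list(filter_logs_by_cr(SAMPLE_LOGS, "RayCluster", "demo-cluster"))
for record in matches:
    print(record["reconcileID"], record["msg"])
```

In practice the input lines would come from the operator Pod's log output rather than a hard-coded list; the filtering logic stays the same.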

RayJob

RayJob state machine refactor

  • [RayJob][Status][1/n] Redefine the definition of JobDeploymentStatusComplete (#1719, @kevin85421)
  • [RayJob][Status][2/n] Redefine ready for RayCluster to avoid using HTTP requests to check dashboard status (#1733, @kevin85421)
  • [RayJob][Status][3/n] Define JobDeploymentStatusInitializing (#1737, @kevin85421)
  • [RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls (#1743, @kevin85421)
  • [RayJob][Status][5/n] Refactor getOrCreateK8sJob (#1750, @kevin85421)
  • [RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL (#1762, @kevin85421)
  • [RayJob][Status][7/n] Define JobDeploymentStatusNew explicitly (#1772, @kevin85421)
  • [RayJob][Status][8/n] Only a RayJob with the status Running can transition to Complete at this moment (#1774, @kevin85421)
  • [RayJob][Status][9/n] RayJob should not pass any changes to RayCluster (#1776, @kevin85421)
  • [RayJob][10/n] Add finalizer to the RayJob when the RayJob status is JobDeploymentStatusNew (#1780, @kevin85421)
  • [RayJob][Status][11/n] Refactor the suspend operation (#1782, @kevin85421)
  • [RayJob][Status][12/n] Resume suspended RayJob (#1783, @kevin85421)
  • [RayJob][Status][13/n] Make suspend operation atomic by introducing the new status Suspending (#1798, @kevin85421)
  • [RayJob][Status][14/n] Decouple the Initializing status and Running status (#1801, @kevin85421)
  • [RayJob][Status][15/n] Unify the codepath for the status transition to Suspended (#1805, @kevin85421)
  • [RayJob][Status][16/n] Refactor Running status (#1807, @kevin85421)
  • [RayJob][Status][17/n] Unify the codepath for status updates (#1814, @kevin85421)
  • [RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay (#1831, @kevin85421)
  • [RayJob][Status][19/n] Transition to Complete if the K8s Job fails (#1833, @kevin85421)
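
To make the refactored status model easier to picture, the following is a simplified sketch of a transition table built only from the status names that appear in the PR titles above (New, Initializing, Running, Suspending, Suspended, Complete). The exact set of allowed transitions is an assumption inferred from those titles, and the enum values are illustrative; the KubeRay controller source is the authority on the real behavior.

```python
from enum import Enum


class JobDeploymentStatus(Enum):
    # Names follow the PR titles; the string values are illustrative only.
    NEW = "New"
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    SUSPENDING = "Suspending"
    SUSPENDED = "Suspended"
    COMPLETE = "Complete"


S = JobDeploymentStatus

# Assumed transition table (a sketch, not the controller's actual logic).
ALLOWED_TRANSITIONS = {
    S.NEW: {S.INITIALIZING},
    S.INITIALIZING: {S.RUNNING, S.SUSPENDING},
    S.RUNNING: {S.COMPLETE, S.SUSPENDING},  # only Running may reach Complete
    S.SUSPENDING: {S.SUSPENDED},            # Suspending makes suspension atomic
    S.SUSPENDED: {S.NEW},                   # resuming restarts the lifecycle
    S.COMPLETE: set(),                      # terminal in this sketch
}


def can_transition(current, target):
    """Return True if the assumed table allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def validate_path(path):
    """Return True if every consecutive pair in path is an allowed transition."""
    return all(can_transition(a, b) for a, b in zip(path, path[1:]))


happy_path = [S.NEW, S.INITIALIZING, S.RUNNING, S.COMPLETE]
suspend_resume = [S.RUNNING, S.SUSPENDING, S.SUSPENDED, S.NEW]
print(validate_path(happy_path), validate_path(suspend_resume))
```

Keeping the transitions in one table mirrors the goal of the "unify the codepath for status updates" change: every status change can be checked in a single place.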

Others

  • [Refactor] Remove global utils.GetRayXXXClientFuncs (#1727, @rueian)
  • [Feature] Warn Users When Updating the RayClusterSpec in RayJob CR (#1778, @Yicheng-Lu-llll)
  • Add apply configurations to generated client (#1818, @astefanutti)
  • RayJob: inject RAY_DASHBOARD_ADDRESS environment variable for user provided submitter templates (#1852, @andrewsykim)
  • [Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus Complete and a JobStatus SUCCEEDED (#1919, @kevin85421)
  • add toleration for GPUs in sample pytorch RayJob (#1914, @andrewsykim)
  • Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data (#1891, @andrewsykim)
  • rayjob controller: refactor environment variable check in unit tests (#1870, @andrewsykim)
  • RayJob: don't delete submitter job when ShutdownAfterJobFinishes=true (#1881, @andrewsykim)
  • rayjob controller: update EndTime to always be the time when the job deployment transitions to Complete status (#1872, @andrewsykim)
  • chore: remove ConfigMap from ray-job.kueue-toy-sample.yaml (#1976, @kevin85421)
  • [Kueue] Add a sample YAML for Kueue toy sample (#1956, @kevin85421)
  • [RayJob] Support ActiveDeadlineSeconds (#1933, @kevin85421)
  • [Feature][RayJob] Support light-weight job submission (#1893, @kevin85421)
  • [RayJob] Add JobDeploymentStatusFailed Status and Reason Field to Enhance Observability for Flyte/RayJob Integration (#1942, @Yicheng-Lu-llll)
  • [RayJob] Refactor Rayjob E2E Tests to Use Server-Side Apply (#1927, @Yicheng-Lu-llll)
  • [RayJob] Rewrite RayJob envtest (#1916, @kevin85421)
  • [Chore][RayJob] Remove the TODO of verifying the schema of RayJobInfo because it is already correct (#1911, @rueian)
  • [RayJob] Set missing CPU limit (#1899, @kevin85421)
  • [RayJob] Set the timeout of the HTTP client from 2 mins to 2 seconds (#1910, @kevin85421)
  • [Feature][RayJob] Support light-weight job submission with entrypoint_num_cpus, entrypoint_num_gpus and entrypoint_resources (#1904, @rueian)
  • [RayJob] Improve dashboard client log (#1903, @kevin85421)
  • [RayJob] Validate whether runtimeEnvYAML is a valid YAML string (#1898, @kevin85421)
  • [RayJob] Add additional print columns for RayJob (#1895, @andrewsykim)
  • [Test][RayJob] Transition to Complete if the JobStatus is STOPPED (#1871, @kevin85421)
  • [RayJob] Inject RAY_SUBMISSION_ID env variable for user provided submitter template (#1868, @kevin85421)
  • [RayJob] Transition to Complete if the JobStatus is STOPPED (#1855, @kevin85421)
  • [RayJob][Kueue] Move limitation check to validateRayJobSpec (#1854, @kevin85421)
  • [RayJob] Validate RayJob spec (#1813, @kevin85421)
  • [Test][RayJob] Kueue happy-path scenario (#1809, @kevin85421)
  • [RayJob] Delete the Kubernetes Job and its Pods immediately when suspending (#1791, @rueian)
  • [Feature][RayJob] Remove the deprecated RuntimeEnv from CRD. Use RuntimeEnvYAML instead. (#1792, @rueian)
  • [Bug][RayJob] Avoid nil pointer dereference (#1756, @kevin85421)
  • [RayJob]: Add RayJob with RayCluster spec e2e test (#1636, @astefanutti)

Logging

RayService

Health-check mechanism refactor

  • [RayService][Health-Check][1/n] Offload the health check responsibilities to K8s and RayCluster (#1656, @kevin85421)
  • [RayService][Health-Check][2/n] Remove the hotfix to prevent unnecessary HTTP requests (#1658, @kevin85421)
  • [RayService][Health-Check][3/n] Update the definition of HealthLastUpdateTime for DashboardStatus (#1659, @kevin85421)
  • [RayService][Health-Check][4/n] Remove the health check for Ray Serve applications. (#1660, @kevin85421)
  • [RayService][Health-Check][5/n] Remove unused variable deploymentUnhealthySecondThreshold (#1664, @kevin85421)
  • [RayService][Health-Check][6/n] Remove ServiceUnhealthySecondThreshold (#1665, @kevin85421)
  • [RayService][Health-Check][7/n] Remove LastUpdateTime from multiple places (#1666, @kevin85421)
  • [RayService][Health-Check][8/n] Add readiness / liveness probes (#1674, @kevin85421)

Others

RayCluster

  • [GCS FT] Enhance observability of redis cleanup job (#1709, @evalaiyc98)
  • [Feature] Support for overwriting the generated ray start command with a user-specified container command (#1704, @kevin85421)
  • Support suspension of RayClusters (#1711, @andrewsykim)
  • fix: validate RayCluster name with validating webhook (#1732, @davidxia)
  • [Hotfix][Bug] suspend is not a stateless operation (#1741, @kevin85421)
  • chore: remove HeadGroupSpec.Replicas from raycluster_types.go (#1589, @davidxia)
  • chore: remove all deprecated HeadGroupSpec.replicas (#1588, @davidxia)
  • Add volcano taskSpec annotations to pod (#1754, @Tongruizhe)
  • [Nit] Remove redundant code snippet (#1810, @evalaiyc98)
  • [Chore] Improve the appearance of compute resources status in the output of kubectl describe (#1802, @kevin85421)
  • [Refactor][GCS FT] Use DeleteAllOf to delete cluster pods before cleaning up redis (#1785, @rueian)
  • [Feature][GCS FT] Best-effort redis cleanup job (#1766, @rueian)
  • feat: show RayCluster's total resources (#1748, @davidxia)
  • [Feature] Adding RAY_CLOUD_INSTANCE_ID as unique id for Ray node (#1759, @kevin85421)
  • [Refactor] Use RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV as timeout of status check in tests (#1755, @rueian)
  • Check existing pods for suspended RayCluster before calling DeleteCollection (#1745, @andrewsykim)
  • [Refactor][RayCluster] Replace RayClusterReconciler.Log with LogConstructor (#1952, @rueian)
  • ray-operator: disallow pod creation in namespaces outside of RayCluster namespace (#1951, @andrewsykim)
  • [Bug][GCS FT] Clean up the Redis key before the head Pod is deleted (#1989, @kevin85421)
  • [Refactor][RayCluster] Make ray.io/group=headgroup be constant (#1970, @rueian)
  • [Feature][autoscaler v2] Set RAY_NODE_TYPE_NAME when starting ray node (#1973, @kevin85421)
  • [Refactor][RayCluster] RayClusterHeadPodsAssociationOptions and RayClusterWorkerPodsAssociationOptions (#2023, @rueian)

Helm charts

  • introduce batch.jobs rules for multiple namespace role (#1707, @riccardomc)
  • Add common containerEnv section to Helm Chart (#1932, @chainlink)
  • Update securityContext values.yaml for kuberay-operator to safe defaults. (#1896, @vinayakankugoyal)
  • RayCluster Helm: Make volumeMounts and volumes optional for workers (#1689, @calizarr)
  • Exposing min/max replica counts for default worker group (#1963, @sercanCyberVision)
  • [Fix][Helm chart] Move service.headService -> head.headService in values.yaml (#1998, @jjaniec)
  • Add seccompProfile.type=RuntimeDefault to kuberay-operator. (#1955, @vinayakankugoyal)
  • [Bug] Reconciler error when changing the value of nameOverride in values.yaml of helm installation for Ray Cluster (#1966, @chrisxstyles)

TPU

KubeRay API Server

CI

Documentation

Others