v1.1.0 release
Highlights
- RayJob improvements
  - Gang / Priority scheduling with Kueue:
  - ActiveDeadlineSeconds (new field): Controls the lifecycle of a RayJob. See this doc and #1933 for more details.
  - submissionMode (new field): Users can specify “K8sJobMode” or “HTTPMode”. The default value is “K8sJobMode”. In HTTPMode, the submitter K8s Job is not created. Instead, KubeRay sends an HTTP request to the Ray head Pod to create a Ray job. See this doc and #1893 for more details.
  - Fixed many stability issues.
- Structured logging
  - In KubeRay v1.1.0, the KubeRay logs are in JSON format, and each log message includes context information such as the custom resource’s name and reconcileID. Users can therefore filter the logs for a specific RayCluster, RayJob, or RayService CR by its name.
- RayService improvements
  - Refactored the health check mechanism to improve stability.
  - Deprecated `deploymentUnhealthySecondThreshold` and `serviceUnhealthySecondThreshold` to avoid unintentionally preparing a new RayCluster custom resource.
- TPU multi-host PodSlice support
  - The KubeRay team is actively working with the Google GKE and TPU teams on integration. The required changes in KubeRay have already been completed. The GKE team will complete some tasks on their side this week or next. Then, users should be able to use multi-host TPU PodSlice with a static RayCluster (without autoscaling).
- Stop publishing images on DockerHub; instead, we will only publish on Quay.
  - https://quay.io/repository/kuberay/operator?tab=tags
  - Users should use `docker pull quay.io/kuberay/operator:v1.1.0` instead of `docker pull kuberay/operator:v1.1.0`.
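
For deployments that still reference the DockerHub image name, the registry switch amounts to a prefix rewrite. The following is a minimal sketch, not part of KubeRay: the helper name `to_quay` is hypothetical, and it assumes every KubeRay image keeps the same repository name and tag on Quay, as the operator image above does.

```python
def to_quay(image: str) -> str:
    """Rewrite a DockerHub-style KubeRay image reference to its Quay equivalent.

    Hypothetical helper for illustration; it assumes the repository name and
    tag are unchanged between the two registries.
    """
    # "docker.io/" is the explicit DockerHub registry host; a bare
    # "kuberay/..." reference resolves to DockerHub implicitly.
    for prefix in ("docker.io/kuberay/", "kuberay/"):
        if image.startswith(prefix):
            return "quay.io/kuberay/" + image[len(prefix):]
    # Already on Quay, or not a KubeRay image: leave untouched.
    return image
```

Passing `kuberay/operator:v1.1.0` yields the Quay reference shown in the bullet above, and references that already start with `quay.io/` are returned unchanged.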
RayJob
RayJob state machine refactor
- [RayJob][Status][1/n] Redefine the definition of JobDeploymentStatusComplete (#1719, @kevin85421)
- [RayJob][Status][2/n] Redefine `ready` for RayCluster to avoid using HTTP requests to check dashboard status (#1733, @kevin85421)
- [RayJob][Status][3/n] Define JobDeploymentStatusInitializing (#1737, @kevin85421)
- [RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls (#1743, @kevin85421)
- [RayJob][Status][5/n] Refactor getOrCreateK8sJob (#1750, @kevin85421)
- [RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL (#1762, @kevin85421)
- [RayJob][Status][7/n] Define JobDeploymentStatusNew explicitly (#1772, @kevin85421)
- [RayJob][Status][8/n] Only a RayJob with the status Running can transition to Complete at this moment (#1774, @kevin85421)
- [RayJob][Status][9/n] RayJob should not pass any changes to RayCluster (#1776, @kevin85421)
- [RayJob][10/n] Add finalizer to the RayJob when the RayJob status is JobDeploymentStatusNew (#1780, @kevin85421)
- [RayJob][Status][11/n] Refactor the suspend operation (#1782, @kevin85421)
- [RayJob][Status][12/n] Resume suspended RayJob (#1783, @kevin85421)
- [RayJob][Status][13/n] Make suspend operation atomic by introducing the new status `Suspending` (#1798, @kevin85421)
- [RayJob][Status][14/n] Decouple the Initializing status and Running status (#1801, @kevin85421)
- [RayJob][Status][15/n] Unify the codepath for the status transition to `Suspended` (#1805, @kevin85421)
- [RayJob][Status][16/n] Refactor `Running` status (#1807, @kevin85421)
- [RayJob][Status][17/n] Unify the codepath for status updates (#1814, @kevin85421)
- [RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay (#1831, @kevin85421)
- [RayJob][Status][19/n] Transition to `Complete` if the K8s Job fails (#1833, @kevin85421)
Others
- [Refactor] Remove global utils.GetRayXXXClientFuncs (#1727, @rueian)
- [Feature] Warn Users When Updating the RayClusterSpec in RayJob CR (#1778, @Yicheng-Lu-llll)
- Add apply configurations to generated client (#1818, @astefanutti)
- RayJob: inject RAY_DASHBOARD_ADDRESS environment variable for user provided submitter templates (#1852, @andrewsykim)
- [Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus `Complete` and a JobStatus `SUCCEEDED` (#1919, @kevin85421)
- add toleration for GPUs in sample pytorch RayJob (#1914, @andrewsykim)
- Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data (#1891, @andrewsykim)
- rayjob controller: refactor environment variable check in unit tests (#1870, @andrewsykim)
- RayJob: don't delete submitter job when ShutdownAfterJobFinishes=true (#1881, @andrewsykim)
- rayjob controller: update EndTime to always be the time when the job deployment transitions to Complete status (#1872, @andrewsykim)
- chore: remove ConfigMap from ray-job.kueue-toy-sample.yaml (#1976, @kevin85421)
- [Kueue] Add a sample YAML for Kueue toy sample (#1956, @kevin85421)
- [RayJob] Support ActiveDeadlineSeconds (#1933, @kevin85421)
- [Feature][RayJob] Support light-weight job submission (#1893, @kevin85421)
- [RayJob] Add JobDeploymentStatusFailed Status and Reason Field to Enhance Observability for Flyte/RayJob Integration (#1942, @Yicheng-Lu-llll)
- [RayJob] Refactor Rayjob E2E Tests to Use Server-Side Apply (#1927, @Yicheng-Lu-llll)
- [RayJob] Rewrite RayJob envtest (#1916, @kevin85421)
- [Chore][RayJob] Remove the TODO of verifying the schema of RayJobInfo because it is already correct (#1911, @rueian)
- [RayJob] Set missing CPU limit (#1899, @kevin85421)
- [RayJob] Set the timeout of the HTTP client from 2 mins to 2 seconds (#1910, @kevin85421)
- [Feature][RayJob] Support light-weight job submission with entrypoint_num_cpus, entrypoint_num_gpus and entrypoint_resources (#1904, @rueian)
- [RayJob] Improve dashboard client log (#1903, @kevin85421)
- [RayJob] Validate whether runtimeEnvYAML is a valid YAML string (#1898, @kevin85421)
- [RayJob] Add additional print columns for RayJob (#1895, @andrewsykim)
- [Test][RayJob] Transition to `Complete` if the JobStatus is STOPPED (#1871, @kevin85421)
- [RayJob] Inject RAY_SUBMISSION_ID env variable for user provided submitter template (#1868, @kevin85421)
- [RayJob] Transition to `Complete` if the JobStatus is STOPPED (#1855, @kevin85421)
- [RayJob][Kueue] Move limitation check to validateRayJobSpec (#1854, @kevin85421)
- [RayJob] Validate RayJob spec (#1813, @kevin85421)
- [Test][RayJob] Kueue happy-path scenario (#1809, @kevin85421)
- [RayJob] Delete the Kubernetes Job and its Pods immediately when suspending (#1791, @rueian)
- [Feature][RayJob] Remove the deprecated RuntimeEnv from CRD. Use RuntimeEnvYAML instead. (#1792, @rueian)
- [Bug][RayJob] Avoid nil pointer dereference (#1756, @kevin85421)
- [RayJob]: Add RayJob with RayCluster spec e2e test (#1636, @astefanutti)
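
The two new RayJob fields from the Highlights (submissionMode, #1893, and ActiveDeadlineSeconds, #1933) can be illustrated with a small manifest builder. This is a hedged sketch rather than an official example: the function `build_rayjob` is hypothetical, the lower-camel-case keys (`submissionMode`, `activeDeadlineSeconds`) and the `ray.io/v1` API version are assumptions to verify against the CRD reference, and the manifest is deliberately incomplete (it omits the cluster specification a real RayJob needs).

```python
import json

# The two modes named in the release notes; "K8sJobMode" is the default.
VALID_SUBMISSION_MODES = {"K8sJobMode", "HTTPMode"}


def build_rayjob(name, entrypoint, submission_mode="K8sJobMode",
                 active_deadline_seconds=None):
    """Build a partial RayJob manifest as a plain dict (illustrative only)."""
    if submission_mode not in VALID_SUBMISSION_MODES:
        raise ValueError(f"unknown submissionMode: {submission_mode!r}")

    spec = {"entrypoint": entrypoint, "submissionMode": submission_mode}

    if active_deadline_seconds is not None:
        if active_deadline_seconds <= 0:
            raise ValueError("activeDeadlineSeconds must be positive")
        spec["activeDeadlineSeconds"] = active_deadline_seconds

    return {
        "apiVersion": "ray.io/v1",  # assumed API version
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": spec,  # cluster specification intentionally omitted
    }


if __name__ == "__main__":
    # HTTPMode: no submitter K8s Job; KubeRay sends the request itself.
    manifest = build_rayjob("demo-job", "python my_script.py",
                            submission_mode="HTTPMode",
                            active_deadline_seconds=600)
    print(json.dumps(manifest, indent=2))
```

The JSON printed by the script can be completed with a cluster specification before being applied; validating the mode up front mirrors the kind of check added in "Validate RayJob spec" (#1813), though the operator's own validation rules may differ.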
Logging
- Support json structured logging (#1912, @andrewsykim)
- [Structure Logging][1/n] Make the format of the controller name consistent (#1938, @kevin85421)
- [Structure Logging][2/n] Add context to each log message (#1945, @kevin85421)
- [structure logging][3/n] Remove verbosity (#1953, @kevin85421)
- [Refactor][1/n] Replace logrus with logr to keep logging consistent (#1835, @rueian)
- [Refactor] Remove any unnecessary logger (#1894, @kevin85421)
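
Because each JSON log message now carries the custom resource's name and a reconcileID, operator logs can be filtered per CR with a few lines of code. The sketch below is an assumption-laden illustration: the exact key layout of KubeRay's log records is not specified in these notes, so the function matches either a top-level `name` key or any nested object that has a `name` key, and the sample lines are invented for demonstration.

```python
import json


def _mentions(record, cr_name):
    """Return True if the record refers to the CR name (key layout assumed)."""
    if record.get("name") == cr_name:
        return True
    return any(
        isinstance(value, dict) and value.get("name") == cr_name
        for value in record.values()
    )


def logs_for_cr(lines, cr_name):
    """Yield parsed JSON log records that mention the given CR name."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if isinstance(record, dict) and _mentions(record, cr_name):
            yield record


# Invented sample lines; real KubeRay records may use different keys.
SAMPLE = [
    '{"level":"info","msg":"reconciling","RayCluster":'
    '{"name":"demo","namespace":"default"},"reconcileID":"abc"}',
    '{"level":"info","msg":"reconciling","RayJob":'
    '{"name":"other","namespace":"default"},"reconcileID":"def"}',
    "plain text line that is not JSON",
]
```

In practice the `lines` argument could be `sys.stdin` fed from the operator Pod's log output, and grouping the yielded records by `reconcileID` separates individual reconcile loops for the same resource.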
RayService
Health-check mechanism refactor
- [RayService][Health-Check][1/n] Offload the health check responsibilities to K8s and RayCluster (#1656, @kevin85421)
- [RayService][Health-Check][2/n] Remove the hotfix to prevent unnecessary HTTP requests (#1658, @kevin85421)
- [RayService][Health-Check][3/n] Update the definition of HealthLastUpdateTime for DashboardStatus (#1659, @kevin85421)
- [RayService][Health-Check][4/n] Remove the health check for Ray Serve applications. (#1660, @kevin85421)
- [RayService][Health-Check][5/n] Remove unused variable deploymentUnhealthySecondThreshold (#1664, @kevin85421)
- [RayService][Health-Check][6/n] Remove ServiceUnhealthySecondThreshold (#1665, @kevin85421)
- [RayService][Health-Check][7/n] Remove LastUpdateTime from multiple places (#1666, @kevin85421)
- [RayService][Health-Check][8/n] Add readiness / liveness probes (#1674, @kevin85421)
Others
- [Refactor] Define the value type of the concurrent map explicitly to avoid type conversion (#1789, @kevin85421)
- [Refactor] Rename EnableAgentService to EnableServeService (#1673, @kevin85421)
- [Refactor][RayService] Use ServeServiceNameForRayService to get the k8s svc name for a RayService (#1931, @rueian)
- [RayService] Refactor to Rely More on RayService Status in RayService E2E Tests (#1928, @Yicheng-Lu-llll)
- [RayService] Add New Status: NumServeEndpoints (#1901, @Yicheng-Lu-llll)
- [RayService] Avoid Duplicate Serve Service (#1867, @Yicheng-Lu-llll)
- [RayService][Bug] Serve Service May Select Pods That Are Actually Unready for Serving Traffic (#1856, @Yicheng-Lu-llll)
- [RayService] Deprecate the built-in ingress support of RayService (#1843, @kevin85421)
- [RayService][Status][1/n] Remove DashboardStatus (#1839, @kevin85421)
- [RayService][Hotfix] Hotfix for Flaky Zero Downtime Rollout Test (#1837, @Yicheng-Lu-llll)
- [RayService][Status][2/n] Remove WaitForDashboard (#1840, @kevin85421)
- [RayService][HA] Fix flaky tests (#1823, @kevin85421)
- [RayService] Move HTTP Proxy's Health Check to Readiness Probe for Workers (#1808, @Yicheng-Lu-llll)
- [RayService] Fixed issue where the custom serve port is not reflected in the serve health check for worker Pods (#1816, @Yicheng-Lu-llll)
- [RayService] Remove everything related to Ray Serve V1 API (#1790, @kevin85421)
- [RayService] Unify multi-app and single-app codepath (#1787, @architkulkarni)
- [RayService] Remove serve v1 API (#1779, @architkulkarni)
- [RayService] Allow updating WorkerGroupSpecs without rolling out new cluster (#1734, @architkulkarni)
- [RayService] Use DashboardPort for RayService instead of DashboardAgentPort (#1742, @architkulkarni)
- [rayservice] Remove dagdriver from ray_v1alpha1_rayservice.yaml (#1649, @zcin)
- Fix Log to indicate we are Using DashboardPort in RayService (#2001, @Yicheng-Lu-llll)
- [RayService] fix kubebuilder printcolumn annotations for RayService (#1981, @andrewsykim)
- [RayService] Address Recent Flakiness in RayService Zero Downtime Rollout Test (#1979, @Yicheng-Lu-llll)
RayCluster
- [GCS FT] Enhance observability of redis cleanup job (#1709, @evalaiyc98)
- [Feature] Support for overwriting the generated ray start command with a user-specified container command (#1704, @kevin85421)
- Support suspension of RayClusters (#1711, @andrewsykim)
- fix: validate RayCluster name with validating webhook (#1732, @davidxia)
- [Hotfix][Bug] `suspend` is not a stateless operation (#1741, @kevin85421)
- chore: remove `HeadGroupSpec.Replicas` from `raycluster_types.go` (#1589, @davidxia)
- chore: remove all deprecated `HeadGroupSpec.replicas` (#1588, @davidxia)
- Add volcano taskSpec annotations to pod (#1754, @Tongruizhe)
- [Nit] Remove redundant code snippet (#1810, @evalaiyc98)
- [Chore] Improve the appearance of compute resources status in the output of `kubectl describe` (#1802, @kevin85421)
- [Refactor][GCS FT] Use DeleteAllOf to delete cluster pods before cleaning up redis (#1785, @rueian)
- [Feature][GCS FT] Best-effort redis cleanup job (#1766, @rueian)
- feat: show RayCluster's total resources (#1748, @davidxia)
- [Feature] Adding RAY_CLOUD_INSTANCE_ID as unique id for Ray node (#1759, @kevin85421)
- [Refactor] Use RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV as timeout of status check in tests (#1755, @rueian)
- Check existing pods for suspended RayCluster before calling DeleteCollection (#1745, @andrewsykim)
- [Refactor][RayCluster] Replace RayClusterReconciler.Log with LogConstructor (#1952, @rueian)
- ray-operator: disallow pod creation in namespaces outside of RayCluster namespace (#1951, @andrewsykim)
- [Bug][GCS FT] Clean up the Redis key before the head Pod is deleted (#1989, @kevin85421)
- [Refactor][RayCluster] Make ray.io/group=headgroup be constant (#1970, @rueian)
- [Feature][autoscaler v2] Set RAY_NODE_TYPE_NAME when starting ray node (#1973, @kevin85421)
- [Refactor][RayCluster] RayClusterHeadPodsAssociationOptions and RayClusterWorkerPodsAssociationOptions (#2023, @rueian)
Helm charts
- introduce batch.jobs rules for multiple namespace role (#1707, @riccardomc)
- Add common containerEnv section to Helm Chart (#1932, @chainlink)
- Update securityContext values.yaml for kuberay-operator to safe defaults. (#1896, @vinayakankugoyal)
- RayCluster Helm: Make volumeMounts and volumes optional for workers (#1689, @calizarr)
- Exposing min/max replica counts for default worker group (#1963, @sercanCyberVision)
- [Fix][Helm chart] Move service.headService -> head.headService in values.yaml (#1998, @jjaniec)
- Add seccompProfile.type=RuntimeDefault to kuberay-operator. (#1955, @vinayakankugoyal)
- [Bug] Reconciler error when changing the value of nameOverride in values.yaml of helm installation for Ray Cluster (#1966, @chrisxstyles)
TPU
- Add NumOfHosts to WorkerGroupSpec (CRD change only) (#1834, @richardsliu)
- [Refactor][Multi-host] Create a function to associate RayCluster and the headless svc (#1948, @kevin85421)
- TPU Multi-Host Support (#1913, @ryanaoleary)
- Build Headless Service for Multi-Host TPU Worker Pods (#1920, @ryanaoleary)
- [TPU] Add envtests for multi-host (#1950, @kevin85421)
- Add NumOfHosts to RayCluster helm-chart template (#1969, @ryanaoleary)
- Add v4 TPU manifests samples (#1968, @richardsliu)
- Add missing labels on RayCluster TPU manifests (#1987, @richardsliu)
KubeRay API Server
- removed serve v1 support (#1825, @blublinsky)
- Fixing Python client handling of env from (#1845, @blublinsky)
- Enhancements to e2e test, adding Autoscaling (#1765, @blublinsky)
- added support for secure API server build (#1749, @blublinsky)
- add autoscaler support (#1699, @blublinsky)
- fixed JobSubmission API (#1717, @blublinsky)
- fixed some bugs in e2e-test (#1682, @blublinsky)
- Increased time precision using uint (#1675, @blublinsky)
- Fixed the issue with jobSubmitter resources (#1676, @blublinsky)
- Adding capability to create ray cluster with serve support -clean (#1672, @blublinsky)
- Added Job submission support to the API server (#1639, @blublinsky)
- Add end to end tests to apiserver (#1460, @z103cb)
- Flip Min and max replicas for apiserver workerNodeSpec (#1638, @tedhtchang)
- Added security to the API server (#1677, @blublinsky)
CI
- Clean up WorkersToDelete field during the CI test (#1763, @Yicheng-Lu-llll)
- [Bug] Clean up WorkersToDelete after the scaling process finishes (#1747, @kevin85421)
- Upgrade dependencies to address CVEs (#1865, @ChristianZaccaria)
- run ./hack/update-codegen.sh in generate make target (#1848, @andrewsykim)
- fix applyconfiguration generated code (#1847, @andrewsykim)
- Improve flexibility in RayCluster yaml test (#1812, @evalaiyc98)
- Bump tj-actions/verify-changed-files from 11.1 to 17 in /.github/workflows (#1795, @dependabot[bot])
- Only build/push Multi Arch images when merging to master (#1764, @Yicheng-Lu-llll)
- Upgrade to address High CVEs (#1731, @ChristianZaccaria)
- Publish Multi Arch images (#1716, @tedhtchang)
- [test] Upgrade envtest to latest version (#1720, @astefanutti)
- Bump golang.org/x/net from 0.14.0 to 0.17.0 in /experimental (#1701, @dependabot[bot])
- Upgrade Kubernetes dependencies to v0.28.3 and Golang to 1.20 (#1648, @astefanutti)
- chore: mark generated files as such (#1663, @davidxia)
- Update kind version. (#1957, @vinayakankugoyal)
- [Refactor] Rewrite RayCluster envtest (#1949, @kevin85421)
- Make KubeRay Operator Image FIPS compliant (#1633, @anishasthana)
- [CI] Fix image release pipeline (#1878, @kevin85421)
- [CI] Do not load Ray into `kind` cluster (#1863, @architkulkarni)
- [CI] stream operator logs from kind in go e2e tests (#1793, @rueian)
- [CI] Fix variable initializations used in test case declarations (#1775, @rueian)
- [CI] Stop to publish new images to DockerHub (#1702, @kevin85421)
- [CI] Skip the flaky compatibility test `test_detached_actor` until ray-project/ray#41343 (#1694, @rueian)
- [CI]: Kuberay operator e2e tests (#1575, @astefanutti)
- [CI] Don't need to publish the security proxy image (#1885, @kevin85421)
- Remove generate target from build/test targets (#1874, @andrewsykim)
- [CI] Fix apiserver test in image-release process (#1880, @kevin85421)
- [CI] Stop publishing images to DockerHub (#1926, @Yicheng-Lu-llll)
- [CI] Don't push new images to DockerHub (#1923, @kevin85421)
- [CI] Use quay as the default image registry (#1939, @kevin85421)
- [Refactor][envtest] Centralize all helpers in envtest for better DX (#1977, @rueian)
- Use standard golang image as build image and distroless image as base image for kuberay operator. (#1967, @vinayakankugoyal)
- [CI] Pin crd-ref-docs to v0.0.10 (#1988, @kevin85421)
- ray-operator: parameterize Test_ShouldDeletePod (#2000, @MadhavJivrajani)
- [Test][RayCluster] Test redis cleanup job in the e2e compatibility test (#2026, @rueian)
- Bump google.golang.org/protobuf from 1.32.0 to 1.33.0 in /experimental (#1992, @dependabot[bot])
- Bump google.golang.org/protobuf from 1.32.0 to 1.33.0 in /cli (#1993, @dependabot[bot])
Documentation
- [Doc] Improve DEVELOPMENT.md by adding more guidances (#1794, @rueian)
- [Ray 2.9.0 Release] Update Ray versions from 2.8.0 to 2.9.0 (#1770, @architkulkarni)
- docs: add comment explaining `util.go:calculatePodResource()` (#1767, @davidxia)
- Fix typo in DEVELOPMENT.md (#1698, @kevin85421)
- chore: Update K8s compatibility (#1696, @kevin85421)
- [Doc] Add deprecations to ServiceUnhealthySecondThreshold and DeploymentUnhealthySecondThreshold (#1688, @rueian)
- [Doc] Add blogs and talks to readme (#1691, @architkulkarni)
- Update feature-request.yml (#1907, @anyscalesam)
- Update bug-report.yml (#1906, @anyscalesam)
- [Doc] Support consistency check for API reference in CI (#1655, @rudeigerc)
- [Doc] Support CRD docs generation (#1625, @rudeigerc)
- [Doc] Update release docs (#1621, @kevin85421)
- Post release 1.0.0 (#1651, @kevin85421)
- Update CHANGELOG for v1.0.0 (#1650, @kevin85421)
- [release][v1.1.0] Improve release doc and update KubeRay API server chart's repository (#1960, @kevin85421)
- add best practices for ray cluster on ACK from Alibaba Cloud blog (#1985, @kadisi)
Others
- Use a default user agent 'kuberay-operator' instead of the default user-agent from controller-runtime (#1982, @andrewsykim)
- [Telemetry] KubeRay version and CRD (#2024, @kevin85421)
- chore: improve coverage for `util.go:CheckAllPodsRunning()` (#1929, @davidxia)
- Fixes to shorten generated Route name with consideration for namespace (#1883, @neilisaur)
- [Bug] Fix rebase error (#1897, @kevin85421)
- Refactor to Ensure Consistent Use of CRDType (#1892, @Yicheng-Lu-llll)
- Fix versioning in sample manifests (#1857, @andrewsykim)
- [Feature] Split `ray.io/originated-from` into `ray.io/originated-from-cr-name` and `ray.io/originated-from-crd` (#1864, @kevin85421)
- Add `ray.io/originated-from` labels (#1830, @rueian)
- Add structured config and default sidecar container configuration (#1822, @andrewsykim)
- [CRD] Sync v1alpha1 CRD with v1 CRD (#1788, @kevin85421)
- [CRD] Delete CRD v1alpha1 (#1771, @kevin85421)
- Revert "[CRD] Delete CRD v1alpha1 (#1771)" (#1784, @kevin85421)
- [CRD] Delete CRD v1alpha1 (#1771, @kevin85421)
- chore: add `kuberay-` name prefix to validating webhook Service (#1729, @davidxia)
- chore webhook: change K8s annotations to use `kuberay-operator` (#1730, @davidxia)
- [Refactor] Move constant.go from common to utils to avoid circular dependency (#1726, @kevin85421)
- Update overwrite-container-cmd example (#1722, @kevin85421)
- [Refactor] Standardize all `k8s.io/api/core/v1` imports as `corev1` (#1721, @rueian)
- [Bug] Avoid assigning an entry to a map that is nil (#1715, @kevin85421)
- [Feature] Override the `block` option of `rayStartParams` to true (#1718, @rueian)
- Set imagePullPolicy in manager.yaml (#1710, @evalaiyc98)
- fix operator: remove unused mutating and conversion webhook configs (#1705, @davidxia)
- updated python client (#1700, @blublinsky)
- Add flag leader-election-namespace (#1624, @chenk008)
- feat: add all three CRDs to the all category (#1683, @davidxia)
- chore: Remove the sanity check YAML for Quay (#1695, @kevin85421)
- [Post Ray 2.8.0 Release] Update Ray versions to Ray 2.8.0 (#1678, @kevin85421)
- Add validating webhook (#1584, @davidxia)