Issues: kubeflow/training-operator
Issues list
Optimize the time for creating pods and services when creating new Job.
kind/feature
lifecycle/needs-triage
#2361
opened Dec 23, 2024 by
lishangyuzi
Manage Kubeflow TrainJobs in a multi-cluster environment
area/sdk
kind/discussion
kind/feature
#2358
opened Dec 19, 2024 by
andreyvelich
[SDK] Show available Runtime accelerators to users
area/sdk
kind/discussion
kind/feature
#2355
opened Dec 13, 2024 by
andreyvelich
[SDK] Get the correct TrainJob components using get_job() API
area/sdk
kind/feature
release/2.0
#2348
opened Dec 11, 2024 by
andreyvelich
[SDK] Snapshot users' workspace into distributed TrainJob workload
area/sdk
kind/feature
#2347
opened Dec 10, 2024 by
andreyvelich
Setting restartPolicy and backoff limit seems to have no effect.
kind/bug
lifecycle/needs-triage
#2342
opened Dec 2, 2024 by
shaoqingyang
"zero-trust" security / networking for training jobs
kind/feature
lifecycle/needs-triage
#2341
opened Nov 29, 2024 by
astefanutti
KEP-2170: Add AMD ROCm Torch Distributed Training Runtime
area/runtime
kind/feature
#2335
opened Nov 26, 2024 by
astefanutti
How can I change the default MASTER_ADDR in Pytorchjob?
kind/bug
lifecycle/needs-triage
#2331
opened Nov 22, 2024 by
Jmengfei
PyTorchJob didn't create worker pod, seems to hang
kind/bug
lifecycle/needs-triage
#2327
opened Nov 15, 2024 by
Twilighter9527
KEP-2170: Support hundreds and thousands of worker nodes for a single training Job
kind/feature
#2318
opened Nov 4, 2024 by
tenzen-y
Kubeflow Training Operator Logo
kind/discussion
kind/feature
#2314
opened Oct 29, 2024 by
andreyvelich
Use Debian images for Python components in the Training Operator V2
good first issue
help wanted
kind/feature
#2311
opened Oct 28, 2024 by
andreyvelich
KEP-2170: Add unit and E2E tests for model and dataset initializers
kind/feature
#2305
opened Oct 23, 2024 by
andreyvelich
Pytorch job running with pod exception unable to recover after retry
kind/bug
lifecycle/needs-triage
#2300
opened Oct 22, 2024 by
shaoqingyang
KEP-2170: Replace UPSERT operation for the objects with SSA PATCH
kind/feature
#2297
opened Oct 20, 2024 by
tenzen-y
KEP-2170: Implement Job Pipeline Framework plugins
kind/feature
#2290
opened Oct 18, 2024 by
tenzen-y
4 of 5 tasks
Add environment variables to containers
kind/feature
lifecycle/needs-triage
#2284
opened Oct 16, 2024 by
tarekabouzeid
KEP-2170: Migrate the container resource calculation mechanism to k/k library
kind/cleanup
kind/feature
#2280
opened Oct 10, 2024 by
tenzen-y
Document the spec.managedBy field and its use for MultiKueue
area/docs
kind/feature
#2279
opened Oct 9, 2024 by
mimowo
PET_NNODES env var for PyTorchJobs is incorrect when elasticPolicy is set
kind/bug
lifecycle/needs-triage
#2277
opened Oct 8, 2024 by
alenawang