
azcopy can't resume cloning job if controller pod is killed during job run #1950

Open
RomanBednar opened this issue Jun 24, 2024 · 4 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@RomanBednar
Contributor

What happened:

While testing Azure File cloning in OpenShift, we noticed that if the controller pod running an azcopy job is killed, any clone PVC still being copied gets stuck in the Pending phase.

This seems to be a consequence of azcopy relying on job plan files (stored at AZCOPY_JOB_PLAN_LOCATION, or ~/.azcopy by default) to track jobs; these files are lost if they are kept on an ephemeral volume.
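
A minimal sketch of how the plan-file location could be persisted instead, assuming a hypothetical PVC named azcopy-plan (the AZCOPY_JOB_PLAN_LOCATION environment variable and the azcopy jobs resume subcommand are real; everything else here is illustrative):

```yaml
# Sketch only: persist azcopy job plan files across controller restarts.
# The PVC name "azcopy-plan" and the mount path are hypothetical.
spec:
  containers:
    - name: csi-azurefile-controller
      env:
        - name: AZCOPY_JOB_PLAN_LOCATION   # where azcopy writes/reads job plan files
          value: /mnt/azcopy-plan
      volumeMounts:
        - name: azcopy-plan
          mountPath: /mnt/azcopy-plan
  volumes:
    - name: azcopy-plan
      persistentVolumeClaim:
        claimName: azcopy-plan             # instead of the default emptyDir
# A restarted pod could then pick up unfinished jobs with
# "azcopy jobs resume <job-id>" against the surviving plan files.
```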

Is this a known limitation, or is there a recommended solution?

What you expected to happen:

The clone job should survive the loss of the controller pod.

How to reproduce it:

  1. Create Azure File PVC/PV
  2. Create a new clone PVC referencing the origin PVC as a source
  3. Kill Azure File leader controller pod
  4. Clone PVC is stuck in Pending state

Anything else we need to know?:

Checking the Helm charts in this repo, the destination for those job plan files appears to be ephemeral (an emptyDir volume), so the issue would occur with this deployment as well.
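
For illustration (not a verbatim excerpt from the charts), an emptyDir volume of this shape is what makes the plan files ephemeral: its contents are deleted together with the pod.

```yaml
# Illustrative only: an emptyDir lives and dies with its pod, so any
# azcopy job plan files written into it are lost when the pod is killed.
volumes:
  - name: azcopy
    emptyDir: {}
```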

Environment:

  • CSI Driver version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@andyzhangx
Member

Yes, that's a limitation: the job status can only be stored locally in ~/.azcopy, so once the controller pod is restarted, copy jobs cannot be continued.
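
If the plan files were persisted (see the sketch above), one way to pick the jobs back up might be an init step like the following; azcopy jobs list and azcopy jobs resume are real azcopy subcommands, but the container wiring and output parsing here are assumptions:

```yaml
# Hypothetical sketch: resume any unfinished azcopy jobs found in the
# persisted plan-file directory before the controller starts working.
initContainers:
  - name: azcopy-resume
    image: azcopy:latest                   # placeholder image
    env:
      - name: AZCOPY_JOB_PLAN_LOCATION
        value: /mnt/azcopy-plan
    volumeMounts:
      - name: azcopy-plan
        mountPath: /mnt/azcopy-plan
    command:
      - /bin/sh
      - -c
      # Note: SAS-authenticated jobs may also need --source-sas /
      # --destination-sas to be re-supplied on resume.
      - |
        for id in $(azcopy jobs list | awk '/^JobId:/ {print $2}'); do
          azcopy jobs resume "$id" || true
        done
```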

@jsafrane
Contributor

jsafrane commented Jul 9, 2024

How can users detect such a state? How can they recover / resume?

Who deletes incompletely cloned volumes in Azure? There is no PV created for them. How is a user supposed to find them in the first place?

I am afraid cloning is not very useful if it can leak volumes that need manual cleanup.

@gnufied
Contributor

gnufied commented Sep 20, 2024

Can Kubernetes Job objects be used for scheduling these azcopy operations, so that the cloning operations can be tracked separately? If the controller pod dies, cloning can continue, and when the controller pod restarts, it can find the existing cloning Jobs (via a label or similar) and then continue with creation of the PV, etc.
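
A rough sketch of that idea, assuming a hypothetical tracking label and that source/destination URLs are injected by the controller (nothing below is an existing API of this driver):

```yaml
# Hypothetical sketch: run the copy as a Kubernetes Job so it outlives the
# controller pod; the label lets a restarted controller rediscover it.
apiVersion: batch/v1
kind: Job
metadata:
  name: azcopy-clone-pvc-1234                    # hypothetical name
  labels:
    azurefile.csi.azure.com/clone: pvc-1234      # hypothetical tracking label
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: azcopy
          image: azcopy:latest                   # placeholder image
          # $(VAR) expansion in command/args is standard Kubernetes behavior.
          command: ["azcopy", "copy", "$(SRC_URL)", "$(DST_URL)", "--recursive"]
          env:
            - name: SRC_URL                      # injected by the controller (hypothetical)
              value: "https://<account>.file.core.windows.net/src?<sas>"
            - name: DST_URL
              value: "https://<account>.file.core.windows.net/dst?<sas>"
```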

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 19, 2024