Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements simple k8s API retries #102

Merged
merged 13 commits into from
Oct 14, 2019
Merged

Implements simple k8s API retries #102

merged 13 commits into from
Oct 14, 2019

Conversation

dleehr
Copy link
Member

@dleehr dleehr commented Oct 9, 2019

Uses tenacity to retry API calls that may fail due to temporary unavailable k8s API

This work was intended to tolerate automated GKE cluster master upgrades (during which the API server is unavailable). While it addresses that and continues the workflow after the API server returns, the upgrade/autoscaling was causing nodes to delete running pods and terminate them prematurely.

In the interest of seeing a large workflow run, we'll not use auto-scaling on GKE for now, and revisit this later if necessary.

Fixes the APIExceptions, and ConnectionErrors related to #96. The Error Collecting Output issues were occurring in my testing when a GKE node was killed and terminated pods running on it.

@dleehr dleehr requested a review from johnbradley October 9, 2019 17:46
Copy link
Contributor

@johnbradley johnbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider more specific retry logic for deleting pods and documenting the new retry environment variables.

calrissian/retry.py Show resolved Hide resolved
calrissian/k8s.py Show resolved Hide resolved
- 404 means pod not found, so the deletion is unnecessary
@dleehr dleehr merged commit 3f781c7 into master Oct 14, 2019
@dleehr dleehr deleted the k8s-api-retry-simple branch October 14, 2019 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants