scale up/down catalog-web instances as needed #4549

Closed
2 tasks done
FuhuXia opened this issue Dec 7, 2023 · 2 comments
Labels
bug (Software defect or bug), O&M (Operations and maintenance tasks for the Data.gov platform)

Comments

@FuhuXia
Member

FuhuXia commented Dec 7, 2023

User Story

In order to keep the catalog site performing well, the data.gov team wants to scale up the catalog-web instance count during a CPU usage spike, and scale back to the normal count once the spike is over to save memory.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN catalog-web CPU usage is checked on a regular basis
    WHEN AVG CPU usage is above 320% for 2 consecutive checks
    THEN catalog-web instances are scaled up by 2 (max 9 total)
    AND write logs as comment to a sticky issue on the catalog repo.

  • GIVEN catalog-web CPU usage is checked on a regular basis
    WHEN AVG CPU usage is below 250% for 2 consecutive checks
    THEN catalog-web instances are scaled down by 2 (minimum 5, or as defined in the manifest)
    AND write logs as comment to a sticky issue on the catalog repo.
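
The two criteria above can be sketched as a single decision function. This is a minimal illustration in Python, not the code in the actual script: the function name `target_instances`, the constant names, and the idea of passing in a list of recent average CPU readings are all assumptions made for the example. Only the thresholds (320%, 250%), the step of 2, the maximum of 9, and the minimum of 5 come from the criteria.

```python
from typing import Sequence

# Values taken from the acceptance criteria.
HIGH_CPU = 320.0        # scale up when average CPU is above this (percent)
LOW_CPU = 250.0         # scale down when average CPU is below this (percent)
STEP = 2                # instances added or removed per scaling action
MAX_INSTANCES = 9       # upper bound on total instances
MIN_INSTANCES = 5       # lower bound; the manifest may define another value
CONSECUTIVE_CHECKS = 2  # readings that must agree before acting


def target_instances(
    current: int,
    recent_avg_cpu: Sequence[float],
    min_instances: int = MIN_INSTANCES,
) -> int:
    """Return the instance count to scale to (hypothetical helper).

    `recent_avg_cpu` holds the average CPU readings, oldest first.
    The current count is returned unchanged when no action is needed.
    """
    if len(recent_avg_cpu) < CONSECUTIVE_CHECKS:
        return current  # not enough history to decide yet

    window = recent_avg_cpu[-CONSECUTIVE_CHECKS:]

    if all(cpu > HIGH_CPU for cpu in window):
        # Never move below the current count when scaling up.
        return max(current, min(current + STEP, MAX_INSTANCES))

    if all(cpu < LOW_CPU for cpu in window):
        # Never move above the current count when scaling down.
        return min(current, max(current - STEP, min_instances))

    return current
```

Keeping the decision as a pure function, separate from whatever command performs the scaling, makes the thresholds easy to test without touching the platform.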

Background

This is to address catalog performance issues under stress.
cloud.gov does not offer an auto-scaling feature yet, so we have to implement our own.

Security Considerations (required)

None.

Sketch

This should be added to the restart script we already have. We use the same script for both actions, but run the CPU check (and scaling, when needed) more frequently than the restart. For example, we run the CPU check every 5 minutes and the restart every 30 minutes. We do not want them to run in different scripts, so that the two actions cannot overlap: an ongoing restart would confuse the CPU check.

We run auto scaling every 5 minutes. If any other deployment is ongoing, the task quits, waits another 5 minutes, and then tries the auto scale again.
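
The scheduling idea described here can be sketched as follows. This is an illustration under stated assumptions, not the real script: the function `actions_for_tick`, the action names, and the choice to run the CPU check before the restart on a shared tick are hypothetical. It also assumes that the restart, like the auto scale, is skipped while another deployment is ongoing, which the description states only for the auto scale.

```python
from typing import List

CHECK_INTERVAL_MIN = 5     # CPU check (and scaling, when needed) interval
RESTART_INTERVAL_MIN = 30  # restart interval


def actions_for_tick(minute: int, deployment_in_progress: bool) -> List[str]:
    """Decide what a single run of the combined script should do.

    `minute` is the scheduled time of this run in minutes, assumed to be
    a multiple of CHECK_INTERVAL_MIN. Both actions live in one script and
    run one after the other, so they cannot overlap.
    """
    if deployment_in_progress:
        # Quit now; the next scheduled run retries 5 minutes later.
        return []

    actions: List[str] = []
    if minute % CHECK_INTERVAL_MIN == 0:
        actions.append("scale-check")
    if minute % RESTART_INTERVAL_MIN == 0:
        # Assumption: restart runs after the CPU check on a shared tick,
        # so the check never reads CPU figures during a restart.
        actions.append("restart")
    return actions
```

With these intervals, every sixth run performs both actions in sequence and the other runs perform only the CPU check.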

@FuhuXia FuhuXia added the bug Software defect or bug label Dec 7, 2023
@gujral-rei gujral-rei moved this to 📔 Product Backlog in data.gov team board Dec 7, 2023
@jbrown-xentity jbrown-xentity added the O&M Operations and maintenance tasks for the Data.gov platform label Dec 7, 2023
@FuhuXia FuhuXia mentioned this issue Dec 8, 2023
10 tasks
@FuhuXia
Member Author

FuhuXia commented Jan 17, 2024

catalog-web could be struggling for hours before being manually scaled up.

image

@gujral-rei gujral-rei moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Jan 18, 2024
@FuhuXia FuhuXia moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Jan 24, 2024
@FuhuXia FuhuXia self-assigned this Jan 24, 2024
@FuhuXia FuhuXia moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Feb 7, 2024
@FuhuXia
Member Author

FuhuXia commented Feb 7, 2024

It is deployed and is auto scaling prod.

...
Running command: datagov/bin/check-and-renew catalog-web scale
No job running for app catalog-web
Current total instances: 5
Average CPU is 331.72. Too High.
Scaling up to 7
Scaling catalog-web to 7
Scaling app catalog-web in org gsa-datagov / space prod as ***...
...

@FuhuXia FuhuXia closed this as completed Feb 7, 2024
@github-project-automation github-project-automation bot moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Feb 7, 2024
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Feb 15, 2024