
Project 4 : Custos deployment and architecture load testing

Shubham Mohapatra edited this page May 5, 2022 · 8 revisions

Custos: A Distributed Systems Case Study

  1. We deployed the Custos architecture on Jetstream at js-168-203.jetstream-cloud.org.

  2. We created a wrapper Flask application based on the Custos Python SDK.

  3. We load tested the deployed architecture with JMeter as part of the final assignment.

  4. More on the deployment procedure here.

Implementation Details:

  1. Our testing strategy uses REST for synchronous requests, so results may be limited by the capacity of the Flask application to handle a large volume of requests.

  2. The app is containerized and served by a Gunicorn WSGI server with 2 workers for better local load balancing.

  3. More on the app here.
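
A minimal sketch of a Gunicorn configuration file matching the two-worker setup described above. Only the worker count comes from this page; the file name `gunicorn.conf.py`, the bind address and port, and the timeout are assumptions for illustration, not values taken from our deployment.

```python
# gunicorn.conf.py: hypothetical config for the Flask wrapper application
bind = "0.0.0.0:5000"  # assumed listen address and port
workers = 2            # two worker processes, as described above
timeout = 60           # assumed: seconds before a silent worker is restarted
accesslog = "-"        # "-" writes the access log to stdout
```

Such a file would typically be passed with `gunicorn -c gunicorn.conf.py app:app`, where `app:app` (module name and Flask object name) is also an assumption.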

Operational Testing Strategy:

Load Test Plan

● Hardware Setup : The hardware setup for the Custos architecture is as follows :

-> Platform - Jetstream 1
-> 3 nodes, each an m1.medium VM with 6 cores, 16 GB RAM, and a 60 GB SSD.
-> Setup - 1 master and 2 workers
-> Set of replicas - 1

● Tool to be used for Load Testing : JMeter

● Tests to be performed : Spike Testing and Endurance testing

● Container Orchestration System to be used: Kubernetes

● Memory and CPU cores Usage:

[Three images, alt text: AggResponseTimesCustos]

The 5 services with the highest CPU usage, which could hamper the system's performance :

  1. custos-configuration-service (CPU cores used - 61m)
  2. keycloak-operator (CPU cores used - 31m)
  3. iam-admin-core-service (CPU cores used - 28m)
  4. consul (CPU cores used - 25m)
  5. consul (CPU cores used - 23m)

The 5 services with the highest memory usage, which could hamper the system's performance :

  1. mysql (Memory used - 697 MiB)
  2. keycloak (Memory used - 526 MiB)
  3. identity-core-service (Memory used - 417 MiB)
  4. iam-admin-core-service (Memory used - 412 MiB)
  5. user-profile-core-service (Memory used - 395 MiB)
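
The two lists above can be totalled and cross-checked with a short script. This is a sketch that assumes the figures are Kubernetes resource quantities (millicores such as `61m`, and mebibytes written as `697Mi`); the helper names are hypothetical and the values are copied from the lists.

```python
# Values copied from the CPU and memory lists above.
CPU_USAGE = [
    ("custos-configuration-service", "61m"),
    ("keycloak-operator", "31m"),
    ("iam-admin-core-service", "28m"),
    ("consul", "25m"),
    ("consul", "23m"),
]
MEMORY_USAGE = [
    ("mysql", "697Mi"),
    ("keycloak", "526Mi"),
    ("identity-core-service", "417Mi"),
    ("iam-admin-core-service", "412Mi"),
    ("user-profile-core-service", "395Mi"),
]


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity ("61m" or "2") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory_mib(quantity: str) -> float:
    """Convert a memory quantity ("512Ki", "697Mi", "2Gi") to MiB."""
    factors = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}
    for suffix, factor in factors.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


total_cpu = sum(parse_cpu_millicores(q) for _, q in CPU_USAGE)
total_mem = sum(parse_memory_mib(q) for _, q in MEMORY_USAGE)
# Services that appear in both lists are candidates to look at first.
in_both = {name for name, _ in CPU_USAGE} & {name for name, _ in MEMORY_USAGE}

print(f"top-5 CPU total: {total_cpu}m")
print(f"top-5 memory total: {total_mem:.0f} MiB")
print(f"in both lists: {sorted(in_both)}")
```

With the values above, only iam-admin-core-service appears in both lists.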

● Hosted Custos on: https://js-168-203.jetstream-cloud.org:30079

● Custos UI hosted on https://js-156-203.jetstream-cloud.org/

● Details :
"client_id": "custos-uipafoibwln4vv052ere-10000001",
"client_secret": "Hh8PrnbtYHsMq2eekVFofVN6VC6kqDdWkt5hkZ1M",
"client_name":"scapsulators",
"requester_email":"[email protected]",
"admin_username":"productionadmin",
"admin_first_name":"Shubham",
"admin_last_name":"Mohapatra",
"admin_email":"[email protected]",
"contacts":["[email protected]","[email protected]"],
"redirect_uris":["http://localhost:8080/callback*",
"https://js-168-203.jetstream-cloud.org/callback*"],
"scope":"openid profile email org.cilogon.userinfo",
"domain":"js-168-203.jetstream-cloud.org",
"admin_password":"shubham123",
"client_uri":"https://js-168-203.jetstream-cloud.org/",
"logo_uri":"https://js-168-203.jetstream-cloud.org/",
"application_type":"web",
"comment":"Custos super tenant for production"

1. SPIKE TESTING RESULTS

What is Spike Testing :

Spike testing is a type of performance testing in which an application receives a sudden and extreme increase or decrease in load. The goal of spike testing is to determine the behavior of a software application when it receives extreme variations in traffic.

SpikeTestPlan

Results :

The Custos system can handle a maximum load of 1100 requests/minute over a span of 60 seconds with a 0.01% error rate and a throughput of 3.7 requests/sec.

Analysis :

We observed that the Custos system handled a maximum load of 1100 requests/minute over a span of 60 seconds with a 0.01% error rate and a throughput of 3.7 requests/sec. Increasing the load beyond this produced a significant error rate. The graphs show that the Registers-users and create-groups services have the longest response times.
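
Per-service response times like those discussed above can also be computed directly from a JMeter results file. The sketch below assumes results were saved in JMeter's CSV format with the `label`, `elapsed`, and `success` columns; the sample rows and sampler names are synthetic placeholders, not data from our run.

```python
import csv
import io
from collections import defaultdict

# Synthetic sample in JMeter's CSV result layout (only the columns used here).
SAMPLE_JTL = """timeStamp,elapsed,label,responseCode,success
1651700000000,120,sampler-a,200,true
1651700000500,480,sampler-a,200,true
1651700001000,90,sampler-b,200,true
1651700001500,110,sampler-b,500,false
1651700002000,260,sampler-c,200,true
"""


def summarize(jtl_text: str) -> dict:
    """Return per-label sample count, mean elapsed time (ms), and error rate."""
    elapsed = defaultdict(list)
    errors = defaultdict(int)
    for row in csv.DictReader(io.StringIO(jtl_text)):
        elapsed[row["label"]].append(int(row["elapsed"]))
        if row["success"].lower() != "true":
            errors[row["label"]] += 1
    return {
        label: {
            "samples": len(times),
            "mean_ms": sum(times) / len(times),
            "error_rate": errors[label] / len(times),
        }
        for label, times in elapsed.items()
    }


summary = summarize(SAMPLE_JTL)
# The label with the highest mean response time is the first one to investigate.
slowest = max(summary, key=lambda label: summary[label]["mean_ms"])
print(slowest, summary[slowest])
```

For a real run, the text of the results file would be passed to `summarize` in place of `SAMPLE_JTL`.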

a. AGGREGATE SERVICE RESPONSE TIMES

AggResponseTimesCustos

b. LATENCY OVER TIME

SpikeTestResponseTimesCustos

2. ENDURANCE TESTING RESULTS:

What is Endurance Testing :

Endurance testing (also called soak testing, longevity testing, or capacity testing) is a type of non-functional testing that checks whether a software system can sustain a high expected load over a long period of time.

AggResponseTimesCustos

Results :

The Custos system can handle a maximum sustained load of 1500 requests/minute for 30 minutes with a 0% error rate and a throughput of 2.4 requests/sec.

Analysis :

We observed that the Custos system handled a load of 1500 requests/minute for 30 minutes with a 0% error rate and a throughput of 2.4 requests/sec. Increasing the load beyond this produced a significant error rate. The graphs show that the Allocate-users-to-groups and Allocate-child-group-to-parent-group services take the longest to execute requests.
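
As a quick unit conversion, the sketch below turns a per-minute load into a per-second rate and a total request count. It assumes that 1500 requests/minute is the load offered by JMeter for the full 30 minutes; the function name is hypothetical. The throughput reported above is the value measured by JMeter and is not derived from this calculation.

```python
def offered_load(requests_per_minute: float, duration_minutes: float) -> dict:
    """Convert a per-minute load into a per-second rate and a total request count."""
    return {
        "requests_per_second": requests_per_minute / 60,
        "total_requests": requests_per_minute * duration_minutes,
    }


endurance = offered_load(requests_per_minute=1500, duration_minutes=30)
print(endurance)  # {'requests_per_second': 25.0, 'total_requests': 45000}
```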

Plots :

a. AGGREGATE SERVICE RESPONSE TIMES

AggResponseTimesCustos

b. LATENCY OVER TIME

LoadTestResponseTimesCustos

Proposed architecture changes based on the tests we conducted :

  1. Most microservices in the Custos architecture depend on a single microservice, the Custos config service, which makes it a single point of failure. We can scale up the Custos config service or split its functionality across multiple microservices so that it no longer acts as a single point of failure.
  2. Keycloak and Vault are internal services of the Custos architecture, yet they are publicly accessible. Their configuration should be changed so that they cannot be accessed publicly, which would make the Custos architecture more secure.
  3. Custos has separate services for IAM core and for identity and access management. These two services could be combined if possible, since too many microservices degrade system performance.
  4. There is no error logging for the APIs in the Custos architecture. Logging should be added so that developers can debug failures and retrace actions.
  5. The 5 services that consume the most memory and the most CPU are listed above. We suggest changing these services so that they consume less memory and CPU, which would improve system performance.
  6. The spike and endurance tests revealed the services with the highest response times (documented above). These services should be optimized so that the overall response times of the system improve.
  7. For testing, we deployed the Custos architecture on a single master node and 2 worker nodes. Scaling up the services and using larger VMs would improve the performance of the Custos architecture.
  8. If Custos uses any external dependencies or libraries, they should be verified as safe and secure to use.
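
To illustrate proposal 4, here is a minimal sketch of error logging using Python's standard `logging` module, in the style that could be applied to the Flask wrapper. The logger name, the decorator, and the `create_group` handler are hypothetical; this is not how Custos itself logs.

```python
import functools
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("custos_wrapper")  # hypothetical logger name


def log_errors(func):
    """Log each call, and log the full traceback before re-raising on failure."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("calling %s", func.__name__)
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception("%s failed", func.__name__)
            raise
    return wrapper


@log_errors
def create_group(name: str) -> dict:
    """Hypothetical handler standing in for a call into Custos."""
    if not name:
        raise ValueError("group name must not be empty")
    return {"group": name}


print(create_group("admins"))
```

Re-raising after logging leaves error handling to the caller while still recording which call failed and why.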