Fault Tolerance Testing

We ran fault tolerance tests on the data microservice, our heaviest microservice, with a replica count of 3. While JMeter was driving load from 100 user threads, we manually deleted 1 pod to evaluate performance with 1 pod down.
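To make the manual deletion step repeatable during a load test, it could be scripted roughly as follows. This is only a sketch: the pod name and namespace are hypothetical placeholders, and it assumes `kubectl` is installed and already pointed at the right cluster.

```python
import subprocess


def build_delete_pod_command(pod_name, namespace="default"):
    """Return the kubectl invocation that deletes a single pod."""
    return ["kubectl", "delete", "pod", pod_name, "--namespace", namespace]


def delete_pod(pod_name, namespace="default"):
    """Run the deletion; raises CalledProcessError if kubectl fails."""
    subprocess.run(build_delete_pod_command(pod_name, namespace), check=True)


# Hypothetical pod name; real names can be listed with `kubectl get pods`.
# Only the command is printed here so the sketch has no side effects.
print(" ".join(build_delete_pod_command("data-service-example-pod")))
```

Calling `delete_pod(...)` with a real pod name while the JMeter test plan is running would reproduce the scenario described above.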

As expected, throughput remained the same (20.4/min) after the pod was deleted, because Kubernetes spawned a replacement pod to maintain the replica count specified in the deployment YAML.
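The replacement behaviour can be pictured with a toy model of a reconciliation loop. This is an illustrative sketch only, not how Kubernetes is implemented: the `reconcile` function and the `data-N` pod names are made up for the example, and it only handles scaling up to the desired count.

```python
import itertools


def reconcile(running_pods, desired_replicas, name_source):
    """Return a pod list topped up to desired_replicas with fresh names."""
    pods = list(running_pods)
    while len(pods) < desired_replicas:
        pods.append(next(name_source))
    return pods


names = (f"data-{i}" for i in itertools.count())

pods = reconcile([], 3, names)      # initial rollout: data-0, data-1, data-2
pods.remove("data-1")               # manual deletion, as in the test
pods = reconcile(pods, 3, names)    # 2 < 3, so one replacement is added
print(pods)                         # ['data-0', 'data-2', 'data-3']
```

The point of the model is that the replacement is a new pod with a new identity, so any work held by the deleted pod is not carried over.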

The service showed an error rate of 4%. Its response time is high (8-10 seconds under normal conditions), so requests are likely to be in flight on a pod at any given moment. When the pod was deleted, the requests in its queue were rejected, resulting in errors.

Below is a demonstration of our fault tolerance testing. Deleting 1 pod manually:

After deletion, a new pod replica is created by Kubernetes:

Aggregate graph results after deletion of pod:

The throughput impact of the deletion was negligible; however, the error rate rose to 4% because of the pod restart.
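Figures like these can also be recomputed from the raw JMeter results file rather than read off the graph. The sketch below assumes the results were saved as CSV with a header row containing the `timeStamp` (epoch milliseconds), `elapsed` (milliseconds), and `success` columns. Throughput is approximated as samples divided by the time from the first sample's start to the last sample's end. The rows in `SAMPLE` are invented purely to exercise the function and are not our measurements.

```python
import csv
import io


def summarize_jtl(csv_text):
    """Compute sample count, error percentage and throughput per minute."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no samples in results")
    total = len(rows)
    failures = sum(1 for r in rows if r["success"].strip().lower() != "true")
    start_ms = min(int(r["timeStamp"]) for r in rows)
    end_ms = max(int(r["timeStamp"]) + int(r["elapsed"]) for r in rows)
    minutes = (end_ms - start_ms) / 60000
    return {
        "samples": total,
        "error_pct": 100.0 * failures / total,
        "throughput_per_min": total / minutes if minutes > 0 else float("nan"),
    }


# Made-up rows for illustration only (not data from our test run).
SAMPLE = """timeStamp,elapsed,label,success
0,9000,data,true
10000,8000,data,true
20000,10000,data,false
51000,9000,data,true
"""

summary = summarize_jtl(SAMPLE)
print(summary)
```

Running the same function over the results recorded before and after the pod deletion would give a direct comparison of throughput and error rate.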

Response time after deleting 1 pod during execution of requests: