Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add stateless redis disruptor proposal #331

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

nadiamoe
Copy link
Member

@nadiamoe nadiamoe commented Sep 6, 2023

Description

This PR adds a proposal to add a redis disruptor. Details about the use cases and implementation proposals can be found in the added design doc.

@pablochacin pablochacin changed the title docs: add stateless redis disruptor propoosal docs: add stateless redis disruptor proposal Sep 12, 2023
Copy link
Collaborator

@pablochacin pablochacin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in my comments, I would prefer that we explore the stateful proxy as a goal and leave the stateless proxy as a PoC stage toward that goal as part of the implementation plan instead of as an alternative implementation.

The rationale for this is that the project should not aim for simple implementations but for meaningful developers experiences, and given the limitations of the stateless proxy, I'm not sure we would like to offer it to the users.


## Background

Caching services like Redis are a common way to improve the performance of distributed systems, but sometimes make difficult to know or estimate how a given system will behave when the caching service stops behaving as expected. Non-catastrophic failure modes such as increase of latency, or unexpected miss rate increase can affect a distributed system in qualitative ways and lead to catastrophic failure.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Caching services like Redis are a common way to improve the performance of distributed systems, but sometimes make difficult to know or estimate how a given system will behave when the caching service stops behaving as expected. Non-catastrophic failure modes such as increase of latency, or unexpected miss rate increase can affect a distributed system in qualitative ways and lead to catastrophic failure.
Caching services like Redis are a common way to improve the performance of applications, but sometimes it is difficult to know or estimate how a given system will behave when the caching service stops behaving as expected. Non-catastrophic failure modes such as an increase in latency, or unexpected miss rate increase can affect a system in significant ways and lead to catastrophic failure.


## Problem statement

A big reliability challenge, and a common source of incidents, is the [metastable behavior](http://charap.co/metastable-failures-in-distributed-systems/) ([archive.org](https://web.archive.org/web/20230502171335/http://charap.co/metastable-failures-in-distributed-systems/)) of distributes systems when using common patterns such as caching. A common example of a metastable failure is system that is responding well to a certain load thanks to warm cache, but when this cache is lost due to an instance restart, or a node failure, the sudden cache miss events overload the backing database preventing the system from recovering.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A big reliability challenge, and a common source of incidents, is the [metastable behavior](http://charap.co/metastable-failures-in-distributed-systems/) ([archive.org](https://web.archive.org/web/20230502171335/http://charap.co/metastable-failures-in-distributed-systems/)) of distributes systems when using common patterns such as caching. A common example of a metastable failure is system that is responding well to a certain load thanks to warm cache, but when this cache is lost due to an instance restart, or a node failure, the sudden cache miss events overload the backing database preventing the system from recovering.
A big reliability challenge, and a common source of incidents, is the [metastable behavior](http://charap.co/metastable-failures-in-distributed-systems/) ([archive.org](https://web.archive.org/web/20230502171335/http://charap.co/metastable-failures-in-distributed-systems/)) of applications when using caching. In this scenario, the application is responding well to a certain load thanks to a warm cache, but when this cache is lost due to an instance restart, or a node failure, the sudden cache miss events overload the backing database preventing the system from recovering.


## Goals

Add baseline redis faulting functionality to the disruptor, so application and platform engineers can perform tests and understand how their systems respond to non-ideal conditions. This document proposes the addition of two types of faults:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Add baseline redis faulting functionality to the disruptor, so application and platform engineers can perform tests and understand how their systems respond to non-ideal conditions. This document proposes the addition of two types of faults:
Add Redis fault injection functionality to the disruptor, so application and platform engineers can perform tests and understand how their systems respond to non-ideal conditions. This document proposes the addition of two types of faults:


Without the requirement of being able to correlate responses with the requests that originated them, a RESP proxy can be made stateless. This reduces the complexity at the cost of, as expected, not being to correlate those responses. However, it should still be possible to meet the goals above with an stateless proxy.

A stateless RESP proxy accepts connections from Redis clients. It will read messages sent by clients, parse them, and decide if any action is necessary, such as modifying the request, or delaying it. It simply passes through responses from the server back to the client, without needing to decode them. A stateless proxy always needs to forward requests, modified or not, to the upstream server. As it is not aware of the flow of responses, it should be compatible with server pushes without needing any additional logic.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand from the description of the fault injection in the following sections, I think this approach is rather limiting:

  1. Having to change the keys in the upstream requests instead of intercepting and modifying the responses. This may not have any side effects, but still, I found it "hacky"
  2. Not allowing latency per command, but per message (i understand, a message can have multiple commands)

I would like to evaluate the complexity of an alternative approach that is aware of the responses.


### Advantages

- Easier to implement and less error-prone than a stateful proxy
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it is valid to implement a PoC using the stateless approach, but I would prefer that we address a full-fledged implementation in this design document.


#### Disadvantages

- Code is more complex, requiring more development time and increasing the surface for bugs to appear.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when this is a valid concern, I think we should explore this option and leave the stateless proxy as a PoC of the final goal.

@pablochacin
Copy link
Collaborator

Regarding the complexity of implementing the Redis protocol, we can explore and learn from existing projects:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants