
GetOrGenerateKeys locking leads to significant service degradation with high request saturation #3863

Open
4 of 5 tasks
awill1988 opened this issue Oct 25, 2024 · 7 comments · Fixed by #3870 · May be fixed by #3883
Labels
bug Something is not working.

Comments

@awill1988
Contributor

awill1988 commented Oct 25, 2024

Preflight checklist

Ory Network Project

No response

Describe the bug

When operating Hydra in production, tracing shows that GetOrGenerateKeys is called routinely by the Authorization Endpoint, Token Endpoint, and WellKnown endpoints. That function leverages two RWMutexes to coordinate access to the key sets kept in memory.

As identified in #3662, there can be problems when the tail latency of the database query to retrieve JWKs is above a certain threshold under highly concurrent conditions.

Additionally, the locking logic in GetOrGenerateKeys could take RLock() while reading from the database and acquire the write lock only when it needs to generate keys, enabling more throughput on these endpoints (see the diff under Potential Fix below).

Reproducing the bug

To recreate the issue locally:

Strategy

  1. Database Latency - Run Hydra with artificial latency added to the persister_jwk GetKeySet call to mimic real-world conditions (>= 25ms).
  2. Request Saturation - Use a load-testing tool such as bombardier to saturate the server with hundreds of concurrent client connections against the /.well-known/openid-configuration endpoint.

Steps

  1. Install bombardier: go install github.com/codesenberg/bombardier@latest
  2. Run cockroachdb instance: docker compose -f quickstart.yml -f quickstart-cockroach.yml up -d cockroachd
  3. Run migrations: docker compose -f quickstart.yml -f quickstart-cockroach.yml run hydra-migrate
  4. Add artificial latency to persister_jwk.go: time.Sleep(25 * time.Millisecond) (see the sketch after these steps)
  5. Run code: DSN='cockroach://root@localhost:26257/defaultdb?sslmode=disable&max_conns=20&max_idle_conns=4' go run . serve all --dev -c contrib/quickstart/5-min/hydra.yml
  6. Run the load test with 270 - 300 concurrent connections: bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration
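
For step 4, the patch amounts to inserting a sleep at the top of the persister's GetKeySet. As an alternative that keeps a compilable, self-contained example, the lookup can be wrapped with a delay instead; this is a sketch only, and the keySetGetter interface and the go-jose import path are assumptions, not Hydra's exact definitions:

package jwklatency

import (
    "context"
    "time"

    jose "github.com/go-jose/go-jose/v3"
)

// keySetGetter is the assumed shape of the persister's key-set lookup.
type keySetGetter interface {
    GetKeySet(ctx context.Context, set string) (*jose.JSONWebKeySet, error)
}

// slowKeySetGetter injects latency that mimics a slow database round-trip.
type slowKeySetGetter struct {
    delay time.Duration // e.g. 25 * time.Millisecond, as in step 4
    next  keySetGetter
}

func (s slowKeySetGetter) GetKeySet(ctx context.Context, set string) (*jose.JSONWebKeySet, error) {
    time.Sleep(s.delay)
    return s.next.GetKeySet(ctx, set)
}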

Expected Outcome

Request saturation would result in a moderate service degradation. Here's a test with only 100 concurrent connections:

❯ bombardier -d 30s -t 30s -a -c 100 -l http://localhost:4444/.well-known/openid-configuration
Bombarding http://localhost:4444/.well-known/openid-configuration for 30s using 100 connection(s)
[=====================================================================================================================] 30s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec        26.14      17.31      57.67
  Latency         3.62s   708.85ms      3.90s
  Latency Distribution
     50%      3.83s
     75%      3.86s
     90%      3.89s
     95%      3.89s
     99%      3.90s
  HTTP codes:
    1xx - 0, 2xx - 881, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:    54.73KB/s

Actual Results

Once the growth in request handler timing reaches roughly 10s, request latency degrades significantly:

❯ bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration
Bombarding http://localhost:4444/.well-known/openid-configuration for 30s using 270 connection(s)
[=====================================================================================================================] 30s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec         8.70      17.32      52.68
  Latency        28.31s     22.94s      0.85m
  Latency Distribution
     50%     41.07s
     75%      0.85m
     90%      0.85m
     95%      0.85m
     99%      0.85m
  HTTP codes:
    1xx - 0, 2xx - 519, 3xx - 0, 4xx - 0, 5xx - 0
    others - 10
  Errors:
    the server closed connection before returning the first response byte. Make sure the server returns 'Connection: close' response header before closing the connection - 10
  Throughput:    19.69KB/s

Potential Fix

If GetOrGenerateKeys were to take a read lock for the database read and acquire the write lock only when it needs to generate keys, this request saturation would not have the same effect, though this may not provide the synchronization guarantees intended by the authors.

 func GetOrGenerateKeys(ctx context.Context, r InternalRegistry, m Manager, set, kid, alg string) (private *jose.JSONWebKey, err error) {
-       getLock(set).Lock()
-       defer getLock(set).Unlock()
-
+       getLock(set).RLock()
        keys, err := m.GetKeySet(ctx, set)
+       getLock(set).RUnlock()
+
        if errors.Is(err, x.ErrNotFound) || keys != nil && len(keys.Keys) == 0 {
                r.Logger().Warnf("JSON Web Key Set \"%s\" does not exist yet, generating new key pair...", set)
+               getLock(set).Lock()
+               defer getLock(set).Unlock()
                keys, err = m.GenerateAndPersistKeySet(ctx, set, kid, alg, "sig")
                if err != nil {
                        return nil, err

With this change applied, the same load test with 270 concurrent connections no longer degrades:

❯ bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration
Bombarding http://localhost:4444/.well-known/openid-configuration for 30s using 270 connection(s)
[===============================================================================================================================================] 30s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec      1379.25     412.60    3602.03
  Latency      195.31ms   321.83ms      9.35s
  Latency Distribution
     50%   110.06ms
     75%   199.68ms
     90%   377.27ms
     95%   565.53ms
     99%      1.44s
  HTTP codes:
    1xx - 0, 2xx - 41610, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     2.82MB/s

Relevant log output

No response

Relevant configuration

No response

Version

v2.2.0

On which operating system are you observing this issue?

Linux

In which environment are you deploying?

Kubernetes

Additional Context

This was observed in a production Kubernetes deployment where 4 replicas of the Hydra pod were receiving thousands of requests per second, which led to a significant increase in latency. Upon examining the traces, the large gap between spans before persistence.sql.GetKeySet turned out to be a signature of the issue.

@awill1988 awill1988 added the bug Something is not working. label Oct 25, 2024
@awill1988 awill1988 changed the title from "GetOrGenerateKeys locking leads to significant service latency degradation under high concurrency" to "GetOrGenerateKeys locking leads to significant service degradation with high request saturation" Oct 25, 2024
@aeneasr
Member

aeneasr commented Oct 26, 2024

Nice find! SGTM

@terev
Contributor

terev commented Oct 27, 2024

Since Hydra can scale horizontally, maybe it makes sense to use a distributed lock in this situation? It seems like optimistic locking might fit this scenario best, considering there are many readers but few writers. The algorithm I'm thinking of for generating a keyset is:

  • Query latest key for keyset
  • If no key, initialize version to 0
  • Else return key
  • Generate key and execute insert command
  • If success, return key
  • Else if duplicate key collision occurs query latest key for keyset and return key

To de-duplicate this operation within a single Hydra instance, it may make sense to use something like singleflight (https://pkg.go.dev/golang.org/x/sync/singleflight).
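
A rough sketch of that per-instance de-duplication (the keySetGetter interface and go-jose import path are assumptions, not Hydra's actual types):

package jwkdedupe

import (
    "context"

    jose "github.com/go-jose/go-jose/v3"
    "golang.org/x/sync/singleflight"
)

type keySetGetter interface {
    GetKeySet(ctx context.Context, set string) (*jose.JSONWebKeySet, error)
}

var group singleflight.Group

// getKeySetDeduped collapses concurrent GetKeySet calls for the same set into
// a single call to the underlying manager within this process.
func getKeySetDeduped(ctx context.Context, m keySetGetter, set string) (*jose.JSONWebKeySet, error) {
    v, err, _ := group.Do(set, func() (interface{}, error) {
        return m.GetKeySet(ctx, set)
    })
    if err != nil {
        return nil, err
    }
    return v.(*jose.JSONWebKeySet), nil
}

Note that the shared call runs under the first caller's context, so cancellation semantics would need some care.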

Additionally, I could see this being used to ease implementing automatic time-based key rolling. Nice find btw!

@aeneasr
Member

aeneasr commented Oct 28, 2024

I like the idea, but it’s not needed. The lock is just there to check the local state (are we initialized or not). Could probably be done with an atomic int
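
One way to read that suggestion, as a sketch only (hypothetical helpers, not Hydra's code): keep a per-set atomic flag so that, once a set is known to exist locally, readers skip the lock entirely.

package jwkinitflag

import (
    "sync"
    "sync/atomic"
)

// initialized records, per key set, whether this instance has already
// confirmed that the set exists (sync.Map keeps the sketch self-contained).
var initialized sync.Map // map[string]*atomic.Bool

func initFlag(set string) *atomic.Bool {
    v, _ := initialized.LoadOrStore(set, new(atomic.Bool))
    return v.(*atomic.Bool)
}

// isInitialized is the fast path: when true, callers can read the key set
// without taking the write lock at all.
func isInitialized(set string) bool { return initFlag(set).Load() }

// markInitialized is called once the key set has been read or generated.
func markInitialized(set string) { initFlag(set).Store(true) }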

@terev
Contributor

terev commented Oct 28, 2024

That's fair, it seems like there's a small race condition for initialization with two or more Hydra instances. But I think everything would continue to work even if it was triggered. The only real impact is that there would be extra keys generated and potentially used for signing.

@harvhan

harvhan commented Dec 2, 2024

Hi team, I see that there are two open PRs for this issue. Just curious, is there currently a blocker for approving them?

@alnr
Contributor

alnr commented Dec 3, 2024

I've only skimmed those PRs but they are IMO both too complicated and don't solve the issue cleanly.

Reposting my recommended implementation strategy from here:

  1. SELECT ... FROM hydra_jwk
     a) if found, return
     b) if not found, generate a new one in memory (could be done under a mutex/waitgroup with a limit to throttle CPU usage)
  2. INSERT INTO hydra_jwk ON CONFLICT DO NOTHING
  3. SELECT ... FROM hydra_jwk
  4. Continue with the key returned from the last SELECT.
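
A rough sketch of that flow using database/sql directly (column names and the generate callback are simplified assumptions; the real persister goes through its own abstractions, encrypts key material, and relies on the table's uniqueness constraint for ON CONFLICT):

package jwkupsert

import (
    "context"
    "database/sql"
    "errors"
)

// getOrCreateKey implements: select -> generate in memory -> insert-on-conflict -> re-select.
func getOrCreateKey(ctx context.Context, db *sql.DB, set string, generate func() (kid, keyData string, err error)) (string, error) {
    const latest = `SELECT keydata FROM hydra_jwk WHERE sid = $1 ORDER BY created_at DESC LIMIT 1`

    var keyData string
    err := db.QueryRowContext(ctx, latest, set).Scan(&keyData)
    if err == nil {
        return keyData, nil // a) found: return it
    }
    if !errors.Is(err, sql.ErrNoRows) {
        return "", err
    }

    // b) not found: generate in memory (could be throttled by a mutex/semaphore).
    kid, newKey, err := generate()
    if err != nil {
        return "", err
    }

    // A concurrent instance may have won the race; the insert is then a no-op.
    if _, err := db.ExecContext(ctx,
        `INSERT INTO hydra_jwk (sid, kid, keydata) VALUES ($1, $2, $3) ON CONFLICT DO NOTHING`,
        set, kid, newKey); err != nil {
        return "", err
    }

    // Re-select so every instance continues with the same persisted key.
    if err := db.QueryRowContext(ctx, latest, set).Scan(&keyData); err != nil {
        return "", err
    }
    return keyData, nil
}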

@terev
Contributor

terev commented Dec 4, 2024

@alnr I initially posted this option here as well but it was vetoed. I agree with you this seems like a reasonable approach.

FWIW the PR I submitted would get us local deduplication, CPU throttling, fewer db trips, and less network saturation, which are beneficial even if the strategy is adapted to utilize the uniqueness constraint.
