[BUG] Port Exhaustion Causing Indexing Failures and Partial Snapshots #16883

Open
ashking94 opened this issue Dec 19, 2024 · 0 comments · May be fixed by #16788
Labels
bug Something isn't working Storage:Snapshots


ashking94 commented Dec 19, 2024

Describe the bug

Port exhaustion is causing indexing failures and partial snapshots in OpenSearch clusters under high indexing load. The problem manifests in the following ways:

  1. Periodic spikes in 5xx HTTP status codes during indexing operations.
  2. Exceptions with the message "Cannot assign requested address" appearing in logs, particularly during stale segment deletion.
  3. Failures in translog uploads due to the same "Cannot assign requested address" error.
  4. Partial snapshots due to shard failures, with error messages indicating metadata files are not present for certain primary terms and generations.
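When the "Cannot assign requested address" errors above appear, one way to check whether sockets are piling up is to count them by TCP state. The following is a minimal sketch, assuming a Linux host where `/proc/net/tcp` is readable; the function name `count_tcp_states` is illustrative and not part of OpenSearch.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    # The first line is the column header; each later line is one socket.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[3] is the hex state code; unknown codes are kept as-is.
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts
```

On the affected node this could be run as `count_tcp_states(open("/proc/net/tcp").read())`. IPv6 sockets are listed separately in `/proc/net/tcp6`. A large and growing `TIME_WAIT` count during stale segment deletion would be consistent with the port exhaustion described here, though it does not prove it.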

Root Cause:

The issue appears to stem from the synchronous S3 client creating a new socket for every request under high load, which leads to port exhaustion. This primarily affects operations such as stale segment deletion, where a large number of files can become eligible for deletion between events.
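The difference between opening a socket per request and reusing a connection can be illustrated outside OpenSearch. The sketch below is not the S3 client; it is a self-contained stand-in that uses a throwaway local HTTP server from the Python standard library, and all names in it (`_OkHandler`, `local_ports_used`, `demo`) are hypothetical. It records which local ports the client consumes in each mode.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class _OkHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows keep-alive connections

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def local_ports_used(host, port, requests, reuse):
    """Return the set of client-side ports used to send `requests` GETs."""
    ports = set()
    conn = None
    for _ in range(requests):
        if conn is None or not reuse:
            conn = http.client.HTTPConnection(host, port, timeout=5)
        conn.request("GET", "/")
        ports.add(conn.sock.getsockname()[1])
        conn.getresponse().read()
        if not reuse:
            conn.close()  # the closed socket's port is not immediately free
    if conn is not None:
        conn.close()
    return ports


def demo(requests=20):
    """Return (ports used with a new socket per request, ports used with reuse)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), _OkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        fresh = local_ports_used(host, port, requests, reuse=False)
        reused = local_ports_used(host, port, requests, reuse=True)
        return len(fresh), len(reused)
    finally:
        server.shutdown()
        server.server_close()
```

With reuse, every request travels over the same socket, so only one local port is consumed. Without reuse, each request takes a fresh ephemeral port, and each closed socket lingers in `TIME_WAIT` on the side that closed first, which is how a high request rate can run a host out of ports.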

Impact:

  • Degraded indexing performance
  • Incomplete or failed snapshots

Related component

Storage:Snapshots

To Reproduce

The issue can be reproduced by:

  1. Creating a single-node, remote-store-enabled OpenSearch domain on a large instance type.
  2. Creating an index with a high number of primary shards (e.g., 200) and no replicas.
  3. Initiating heavy indexing operations.
  4. Creating snapshots periodically.
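Steps 2 and 4 above map onto the standard create-index and create-snapshot REST endpoints. The sketch below only builds the requests and does not send them. The base URL, index name, repository name, and snapshot name are placeholders, and enabling remote store on the domain (step 1) is assumed to have been done separately.

```python
import json
import urllib.request


def create_index_request(base_url, index, primary_shards=200):
    """Build a PUT request for an index with many primaries and no replicas."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": 0,
            }
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_snapshot_request(base_url, repository, snapshot):
    """Build a PUT request that starts a snapshot in an existing repository."""
    return urllib.request.Request(
        url=f"{base_url}/_snapshot/{repository}/{snapshot}",
        method="PUT",
    )
```

Either request could be sent with `urllib.request.urlopen(req)` against a test cluster, for example `create_index_request("http://localhost:9200", "port-exhaustion-test")`, with the snapshot request repeated on a timer while indexing is running.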

Expected behavior

Ports should not be exhausted, since a default setting limits the maximum number of connections to 500.
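A rough back-of-the-envelope model suggests why a cap on concurrent connections may not be enough if each request opens and closes its own socket: the cap bounds open connections, but closed sockets still hold their local port while in `TIME_WAIT`. The sketch below assumes the common Linux defaults of an ephemeral range of 32768 to 60999 and a 60-second `TIME_WAIT`. The 0.1-second request duration in the usage note is a made-up figure, not a measurement from this issue.

```python
def ephemeral_ports(low=32768, high=60999):
    """Number of local ports in the (assumed default) Linux ephemeral range."""
    return high - low + 1


def max_request_rate(max_connections, request_seconds):
    """Requests per second that a concurrency cap allows (Little's law)."""
    return max_connections / request_seconds


def ports_held(requests_per_second, request_seconds, time_wait_seconds=60.0):
    """Steady-state local ports tied up when every request uses a new socket.

    Each socket holds its port while the request runs and then for the
    TIME_WAIT period after the client closes it.
    """
    return requests_per_second * (request_seconds + time_wait_seconds)
```

For example, with 500 connections and requests that take 0.1 seconds, `max_request_rate(500, 0.1)` allows 5000 requests per second, and `ports_held(5000, 0.1)` comes to roughly 300,000 ports, far more than the 28,232 returned by `ephemeral_ports()`. The port limit applies per destination address and port, so the real threshold depends on how many S3 endpoints the client connects to.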

Observed behavior

  1. Indexing rate dips and 5xx error spikes occur at regular intervals, coinciding with snapshot operations.
  2. Snapshot status shows as PARTIAL with multiple shard failures.

Additional Details

No response
