[BUG] Port Exhaustion Causing Indexing Failures and Partial Snapshots #16883

ashking94 · 2024-12-19T09:10:13Z

Describe the bug

An issue has been identified where port exhaustion is causing indexing failures and partial snapshots in OpenSearch clusters with high indexing loads. The problem manifests in the following ways:

Periodic spikes in 5xx HTTP status codes during indexing operations.
Exceptions with the message "Cannot assign requested address" appearing in logs, particularly during stale segment deletion.
Failures in translog uploads due to the same "Cannot assign requested address" error.
Partial snapshots due to shard failures, with error messages indicating metadata files are not present for certain primary terms and generations.
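
To check whether "Cannot assign requested address" really is local port exhaustion, one option is to count sockets by TCP state on the affected node. The sketch below parses text in the Linux `/proc/net/tcp` format. The sample rows are made up for illustration and are not taken from the affected cluster. Only three state codes are mapped, and IPv6 sockets (listed in `/proc/net/tcp6`) are not covered.

```python
from collections import Counter

# Subset of the kernel's TCP state codes as they appear in the "st" column.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "08": "CLOSE_WAIT"}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        counts[TCP_STATES.get(fields[3], "OTHER")] += 1
    return counts


# Illustrative sample only (hypothetical rows, not real cluster output).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:9C40 0100007F:01BB 06 00000000:00000000 00:00000000 00000000     0        0 0
   1: 0100007F:9C41 0100007F:01BB 06 00000000:00000000 00:00000000 00000000     0        0 0
   2: 0100007F:9C42 0100007F:01BB 01 00000000:00000000 00:00000000 00000000  1000        0 12345
"""

print(count_tcp_states(SAMPLE))
```

On a live node the same function can be fed `open("/proc/net/tcp").read()`. A TIME_WAIT count in the tens of thousands during the error spikes would be consistent with port exhaustion.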

Root Cause:

The issue appears to stem from the synchronous S3 client creating new sockets for every request under high load, leading to port exhaustion. This primarily affects operations like stale segment deletion, which can involve a large number of files becoming eligible for deletion between events.
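
As a rough illustration of why one socket per request can exhaust ports, the sketch below estimates how many new outbound connections per second a host can sustain to a single destination address and port. It assumes the common Linux defaults of an ephemeral port range of 32768 to 60999 and a 60 second TIME_WAIT, and that the client side closes each connection. These are assumptions, not values measured on the affected cluster.

```python
# Back-of-the-envelope estimate only; the defaults below are assumed
# Linux values, not settings read from the affected node.

def sustainable_new_connections_per_sec(port_low=32768, port_high=60999, time_wait_s=60):
    """Rough ceiling on new outbound connections per second to one destination.

    Each connection closed by the client keeps its local port in TIME_WAIT
    for time_wait_s seconds, so at steady state at most
    (usable ports / time_wait_s) new sockets can be opened per second
    before no free local port is left.
    """
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_s


rate = sustainable_new_connections_per_sec()
print(f"~{rate:.0f} new connections/sec per destination")
```

Under these assumptions the ceiling is a few hundred new connections per second, which a burst of stale segment deletions across many shards could plausibly exceed if connections are not reused.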

Impact:

Degraded indexing performance
Incomplete or failed snapshots

Related component

Storage:Snapshots

To Reproduce

The issue can be reproduced by:

Creating a single-node, remote-store-enabled OpenSearch domain with a large instance type.
Creating an index with a high number of primary shards (e.g., 200) and no replicas.
Initiating heavy indexing operations.
Creating snapshots periodically.
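
A minimal sketch of the index and snapshot requests from the steps above, built as (method, path, body) tuples without sending them. The index name, repository name, and snapshot name are hypothetical placeholders. Remote store configuration and the indexing load itself are not shown.

```python
import json


def create_index_request(index_name, primary_shards=200):
    """Build the request that creates an index with no replicas."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": 0,
            }
        }
    }
    return "PUT", f"/{index_name}", json.dumps(body)


def create_snapshot_request(repository, snapshot_name):
    """Build the request that starts a snapshot in an existing repository."""
    return "PUT", f"/_snapshot/{repository}/{snapshot_name}", ""


# Hypothetical names, used only for illustration.
print(create_index_request("repro-index"))
print(create_snapshot_request("repro-repo", "snap-1"))
```

Sending the snapshot request on a timer while indexing runs should approximate the periodic snapshots described in the last step.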

Expected behavior

Ports should not be exhausted, since a default setting limits the maximum number of connections to 500.

Observed behaviour

Indexing rate dips and 5xx error spikes occur at regular intervals, coinciding with snapshot operations.
Snapshot status shows as PARTIAL with multiple shard failures.

Additional Details

No response

ashking94 added the bug and untriaged labels Dec 19, 2024

github-actions bot added the Storage:Snapshots label Dec 19, 2024

github-project-automation bot added this to Storage Project Board Dec 19, 2024

github-project-automation bot moved this to 🆕 New in Storage Project Board Dec 19, 2024

ashking94 removed the untriaged label Dec 19, 2024

ashking94 linked a pull request Dec 19, 2024 that will close this issue

Use async client for delete blob or path in S3 Blob Container #16788

