Cache the shard routings with no weight for faster access #12989

Merged
merged 5 commits into from
Apr 3, 2024

Conversation

backslasht
Contributor

@backslasht backslasht commented Mar 31, 2024

Description

The list of shards that serve a query is determined for every request, and the node weights guide shard selection. IndexRoutingTable currently caches the shard routings with weights for faster access. When the fail-open option is enabled, however, shards with no weight are also returned, ordered after the shards with weights. They are used as a fallback if the shards with weights can't be used due to an error.

The shard routings with no weight are not cached, so computing them does a full loop on every request. This adds search latency when either the number of shards to query or the number of nodes in the cluster is high, and the impact is greatest when both are high.

This change introduces a cache for shard routings with no weights, similar to the existing cache for shard routings with weights.
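
For readers who want the shape of the idea, here is a minimal Java sketch, for illustration only. It is not the code in this PR: the names (`NoWeightShardCacheSketch`, `ShardCopy`, `shardsWithNoWeight`, `fullScans`) and the zone-keyed weight map are hypothetical simplifications, and the sketch assumes that a zone missing from the weight map counts as weighted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; class and method names are hypothetical, not OpenSearch APIs. */
final class NoWeightShardCacheSketch {

    /** A shard copy, reduced to the node that hosts it and that node's zone. */
    static final class ShardCopy {
        final String nodeId;
        final String zone;

        ShardCopy(String nodeId, String zone) {
            this.nodeId = nodeId;
            this.zone = zone;
        }
    }

    /** Immutable cache entry: the weights it was built for, plus the result. */
    private static final class Entry {
        final Map<String, Double> weights;
        final List<ShardCopy> noWeightShards;

        Entry(Map<String, Double> weights, List<ShardCopy> noWeightShards) {
            this.weights = weights;
            this.noWeightShards = noWeightShards;
        }
    }

    private final List<ShardCopy> copies;
    private volatile Entry cached; // one field, so key and value are published together
    private int fullScans;         // instrumentation for this example only

    NoWeightShardCacheSketch(List<ShardCopy> copies) {
        this.copies = List.copyOf(copies);
    }

    /** Returns the copies whose zone has weight 0, scanning only when the weights change. */
    List<ShardCopy> shardsWithNoWeight(Map<String, Double> zoneWeights) {
        Entry entry = cached;
        if (entry != null && entry.weights.equals(zoneWeights)) {
            return entry.noWeightShards; // cache hit: no loop over the copies
        }
        synchronized (this) {
            entry = cached; // re-check under the lock before recomputing
            if (entry == null || !entry.weights.equals(zoneWeights)) {
                fullScans++;
                List<ShardCopy> result = new ArrayList<>();
                for (ShardCopy copy : copies) {
                    // Assumption of this sketch: a zone missing from the map counts as weighted.
                    if (zoneWeights.getOrDefault(copy.zone, 1.0) == 0.0) {
                        result.add(copy);
                    }
                }
                entry = new Entry(Map.copyOf(zoneWeights), List.copyOf(result));
                cached = entry;
            }
            return entry.noWeightShards;
        }
    }

    /** Number of full loops performed so far. */
    synchronized int fullScans() {
        return fullScans;
    }
}
```

In this sketch, repeated calls with equal weights return the same cached list, and the full loop over the copies runs again only when the weights differ from the ones the cache was built for.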

Check List

  • [x] New functionality includes testing.
    • [x] All tests pass
  • [ ] New functionality has been documented.
    • [ ] New functionality has javadoc added
  • [ ] Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • [x] Commits are signed per the DCO using --signoff
  • [ ] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • [ ] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Prabhakar Sithanandam <[email protected]>
@backslasht backslasht added the Performance (This is for any performance related enhancements or bugs) label Mar 31, 2024
Contributor

github-actions bot commented Mar 31, 2024

Compatibility status:

Checks if related components are compatible with change 547e3ab

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/performance-analyzer.git]

Contributor

❌ Gradle check result for 396c0df: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Collaborator

@Bukhtawar Bukhtawar left a comment

Minor comments, otherwise looks good

Contributor

@stephen-crawford stephen-crawford left a comment

Looks good to me

Contributor

@anshu1106 anshu1106 left a comment

Thanks, Prabhakar. Changes look good to me.

Contributor

github-actions bot commented Apr 2, 2024

❌ Gradle check result for 396c0df: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Apr 2, 2024

✅ Gradle check result for 08fac5a: SUCCESS


codecov bot commented Apr 2, 2024

Codecov Report

Attention: Patch coverage is 90.69767% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 71.37%. Comparing base (b15cb0c) to head (547e3ab).
Report is 124 commits behind head on main.

Files | Patch % | Lines
...search/cluster/routing/IndexShardRoutingTable.java | 90.69% | 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12989      +/-   ##
============================================
- Coverage     71.42%   71.37%   -0.05%     
- Complexity    59978    60354     +376     
============================================
  Files          4985     5025      +40     
  Lines        282275   284399    +2124     
  Branches      40946    41190     +244     
============================================
+ Hits         201603   202998    +1395     
- Misses        63999    64605     +606     
- Partials      16673    16796     +123     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Contributor

github-actions bot commented Apr 2, 2024

❕ Gradle check result for 037aa62: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Contributor

github-actions bot commented Apr 2, 2024

❌ Gradle check result for cc24120: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Apr 3, 2024

❌ Gradle check result for 547e3ab: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Apr 3, 2024

✅ Gradle check result for 547e3ab: SUCCESS

@Bukhtawar Bukhtawar self-requested a review April 3, 2024 08:09
@Bukhtawar Bukhtawar merged commit fb5d036 into opensearch-project:main Apr 3, 2024
34 checks passed
@Bukhtawar Bukhtawar added the backport 2.x (Backport to 2.x branch) label Apr 3, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 3, 2024
* Cache the shard routings with no weight for faster access

Signed-off-by: Prabhakar Sithanandam <[email protected]>
Co-authored-by: Prabhakar Sithanandam <[email protected]>
(cherry picked from commit fb5d036)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Bukhtawar pushed a commit that referenced this pull request Apr 3, 2024
…13050)

* Cache the shard routings with no weight for faster access

Signed-off-by: Prabhakar Sithanandam <[email protected]>
Co-authored-by: Prabhakar Sithanandam <[email protected]>
@backslasht backslasht deleted the shard_routing_fix branch April 3, 2024 10:35
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…-project#12989)

* Cache the shard routings with no weight for faster access

Signed-off-by: Prabhakar Sithanandam <[email protected]>
Co-authored-by: Prabhakar Sithanandam <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Labels
  • backport 2.x (Backport to 2.x branch)
  • Performance (This is for any performance related enhancements or bugs)
  • skip-changelog
5 participants