Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use correct cancelAfter broadcast when resources are exhausted #1767

Merged

Conversation

danieldoglas
Copy link
Contributor

@danieldoglas danieldoglas commented Jun 4, 2024

Details

Currently, we're facing issues with the replication timing. Because that's happening, the number of replication threads is growing too much, causing resource exhaustion.

Considering that all replication messages can arrive in parallel and not necessarily in order, followers could get stuck in the following case:

  • receives commit 1, creates thread 1
  • receives commit 3, creates thread 2
  • receive commit 2, tries to create thread 3 but resources are exhausted and it fails

Since no threads were created for commit 2, it is never applied. Thread 2, which depends on commit 2 to be completed, gets stuck in an infinite loop. The server then never goes back to a searching state, and can only go back to the pool after a restart.

Fixed Issues

Fixes GH_LINK

Tests


Internal Testing Reminder: when changing bedrock, please compile auth against your new changes

@danieldoglas danieldoglas self-assigned this Jun 4, 2024
sqlitecluster/SQLiteNode.cpp Outdated Show resolved Hide resolved
sqlitecluster/SQLiteNode.cpp Outdated Show resolved Hide resolved
@@ -1932,7 +1939,7 @@ void SQLiteNode::_changeState(SQLiteNodeState newState) {
// If we were following, and now we're not, we give up an any replications.
if (_state == SQLiteNodeState::FOLLOWING) {
_replicationThreadsShouldExit = true;
uint64_t cancelAfter = _leaderCommitNotifier.getValue();
uint64_t cancelAfter = commitIDToCancelAfter > 0 ? commitIDToCancelAfter : _leaderCommitNotifier.getValue();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just use the boolean value of commitIDToCancelAfter instead of comparing to 0.

// and waiting for the transaction that failed will be stuck in an infinite loop. To prevent that
// we're changing the state to SEARCHING and sending the cancelAfter property to drop all threads
// that depend on the transaction that failed to be threaded.
uint64_t cancelAfter = message.calcU64("NewCount") - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to check NewCount against 0 first, but this is an edge case that I don't think we can actually trigger.

@danieldoglas danieldoglas requested a review from johnmlee101 June 4, 2024 18:01
@tylerkaraszewski tylerkaraszewski merged commit 69a1ba7 into main Jun 4, 2024
1 check passed
@tylerkaraszewski tylerkaraszewski deleted the dsilva_useCorrectCancelAfterWhenResourcesAreExhausted branch June 4, 2024 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants