Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decrement thread count on exception. #1929

Merged
merged 2 commits into from
Oct 31, 2024

Conversation

tylerkaraszewski
Copy link
Contributor

@tylerkaraszewski tylerkaraszewski commented Oct 31, 2024

Details

So, we increment the thread count here, before starting a replication thread:

auto threadID = _replicationThreadCount.fetch_add(1);

Normally, we will decrement this inside the replication thread when it exits, which is handled here:

ScopedDecrement<decltype(_replicationThreadCount)> decrementer(_replicationThreadCount);

However, if the thread fails to start and throws system_error, we will get to this point:

_changeState(SQLiteNodeState::SEARCHING, message.calcU64("NewCount") - 1);

Without ever having run the thread, and thus, without the thread count ever being decremented. Further, change state will wait for the thread count to be 0 before it completes, which means the above line blocks and we never log the warning or throw the error.

Here's where _changeState blocks on this:

while (_replicationThreadCount) {
if (infoCount % 100 == 0) {
SINFO("Waiting for " << _replicationThreadCount << " remaining replication threads.");
}
infoCount++;
usleep(10'000);
}

The fix is to decrement the counter if we hit the exception case. I've also moved logging the warning to above the _changeState call so that it will be visible sooner that we've hit this exception.

Fixed Issues

Fixes https://github.com/Expensify/Expensify/issues/440475

Tests


Internal Testing Reminder: when changing bedrock, please compile auth against your new changes

@tylerkaraszewski tylerkaraszewski self-assigned this Oct 31, 2024
@flodnv
Copy link
Contributor

flodnv commented Oct 31, 2024

cc @danieldoglas since IIRC you looked at this code this year - #1767

Copy link
Contributor

@flodnv flodnv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great investigation and explanation! 👍

SWARN("Caught system_error starting _replicate thread with " << _replicationThreadCount.load() << " threads. e.what()=" << e.what());
_changeState(SQLiteNodeState::SEARCHING, message.calcU64("NewCount") - 1);
STHROW("Error starting replicate thread so giving up and reconnecting.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens after we throw here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should go synchronizing as if the node was just turned on.

@tylerkaraszewski tylerkaraszewski merged commit 583a5cb into main Oct 31, 2024
1 check passed
@tylerkaraszewski tylerkaraszewski deleted the tyler-fix-stuck-state-change branch October 31, 2024 19:36
Copy link
Contributor

@danieldoglas danieldoglas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that counter existed when I changed this, thanks for finding and fixing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants