Decrement thread count on exception. #1929

tylerkaraszewski · 2024-10-31T17:21:28Z

Details

So, we increment the thread count here, before starting a replication thread:

Bedrock/sqlitecluster/SQLiteNode.cpp

Line 1644 in a30c7e1

auto threadID = _replicationThreadCount.fetch_add(1);

Normally, we will decrement this inside the replication thread when it exits, which is handled here:

Bedrock/sqlitecluster/SQLiteNode.cpp

Line 199 in a30c7e1

ScopedDecrement<decltype(_replicationThreadCount)> decrementer(_replicationThreadCount);
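For readers unfamiliar with the pattern, here is a minimal sketch of how a scoped decrement guard behaves. `ScopedDecrementSketch` and `countAfterWork` are hypothetical names for illustration only; this is not Bedrock's actual `ScopedDecrement` implementation, just the general RAII idea it appears to rely on.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical stand-in for a scoped decrement guard (not Bedrock's real class).
// The destructor decrements the counter however the scope is left.
template <typename CounterType>
class ScopedDecrementSketch {
  public:
    explicit ScopedDecrementSketch(CounterType& counter) : _counter(counter) {}
    ~ScopedDecrementSketch() { --_counter; }

    // Copying would cause a double decrement, so forbid it.
    ScopedDecrementSketch(const ScopedDecrementSketch&) = delete;
    ScopedDecrementSketch& operator=(const ScopedDecrementSketch&) = delete;

  private:
    CounterType& _counter;
};

// Simulates the body of a worker: the caller increments, the guard decrements.
int countAfterWork(bool shouldThrow) {
    std::atomic<int> threadCount{0};
    threadCount.fetch_add(1); // Done by the caller before the work starts.
    try {
        ScopedDecrementSketch<std::atomic<int>> decrementer(threadCount);
        if (shouldThrow) {
            throw std::runtime_error("simulated failure inside the worker");
        }
    } catch (const std::runtime_error&) {
        // The guard's destructor already ran during stack unwinding.
    }
    return threadCount.load();
}
```

The guard only helps if the code that constructs it actually runs. If the thread never starts, nothing ever constructs the guard, which is the gap this PR addresses.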

However, if the thread fails to start and throws system_error, we will get to this point:

Bedrock/sqlitecluster/SQLiteNode.cpp

Line 1655 in a30c7e1

_changeState(SQLiteNodeState::SEARCHING, message.calcU64("NewCount") - 1);

We reach that line without ever having run the thread, and thus without the thread count ever being decremented. Further, _changeState waits for the thread count to reach 0 before it completes, which means the above line blocks forever and we never log the warning or throw the error.

Here's where _changeState blocks on this:

Bedrock/sqlitecluster/SQLiteNode.cpp

Lines 1874 to 1880 in a30c7e1

while (_replicationThreadCount) {
    if (infoCount % 100 == 0) {
        SINFO("Waiting for " << _replicationThreadCount << " remaining replication threads.");
    }
    infoCount++;
    usleep(10'000);
}
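To make the failure mode concrete, the following sketch reproduces it in isolation. Everything here is an assumption for illustration: `failingThreadStart` stands in for a thread launch that throws `std::system_error`, and `waitForZero` is a bounded version of the loop above so the example terminates instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <system_error>
#include <thread>

// Hypothetical stand-in for a thread launch that fails with system_error.
void failingThreadStart() {
    throw std::system_error(
        std::make_error_code(std::errc::resource_unavailable_try_again),
        "simulated thread start failure");
}

// Bounded variant of the wait loop: gives up after maxPolls polls.
// Returns true only if the counter reached zero.
bool waitForZero(const std::atomic<int>& count, int maxPolls) {
    for (int i = 0; i < maxPolls; i++) {
        if (count.load() == 0) {
            return true;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return count.load() == 0;
}

// Reproduces the pre-fix sequence: increment, failed start, no decrement.
bool leakedCountBlocksWait() {
    std::atomic<int> threadCount{0};
    threadCount.fetch_add(1);
    try {
        failingThreadStart();
    } catch (const std::system_error&) {
        // Pre-fix behaviour: nothing decrements the counter here.
    }
    // The counter is stuck at 1, so the wait can never succeed.
    return !waitForZero(threadCount, 5);
}
```

With an unbounded loop, as in the original code, the same sequence would never return.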

The fix is to decrement the counter if we hit the exception case. I've also moved logging the warning to above the _changeState call so that it will be visible sooner that we've hit this exception.
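A minimal sketch of the shape of this fix, under stated assumptions: `tryStart` and its `warning` output parameter are hypothetical, with the string standing in for the SWARN log line. It is not the actual diff, only an illustration of decrementing in the exception path and recording the warning before doing anything that might wait on the counter.

```cpp
#include <atomic>
#include <functional>
#include <string>
#include <system_error>

// Hypothetical helper: increments the counter, then tries to start a thread.
// On success the started thread owns the decrement (via its scoped guard).
// On failure no thread exists, so the decrement has to happen here.
bool tryStart(std::atomic<int>& threadCount,
              const std::function<void()>& startThread,
              std::string& warning) {
    threadCount.fetch_add(1);
    try {
        startThread();
        return true;
    } catch (const std::system_error& e) {
        // Undo the increment, since the thread that would have done it never ran.
        threadCount.fetch_sub(1);
        // Record the warning first, before any step that waits on the counter.
        warning = std::string("Caught system_error starting thread: ") + e.what();
        return false;
    }
}

// Runs the failure path and reports the counter afterwards.
int countAfterFailedStart() {
    std::atomic<int> threadCount{0};
    std::string warning;
    tryStart(threadCount, [] {
        throw std::system_error(
            std::make_error_code(std::errc::resource_unavailable_try_again));
    }, warning);
    return threadCount.load();
}
```

After a failed start the counter is back at its previous value, so a subsequent wait for zero can complete.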

Fixed Issues

Fixes https://github.com/Expensify/Expensify/issues/440475

Tests

Internal Testing Reminder: when changing bedrock, please compile auth against your new changes

flodnv · 2024-10-31T18:55:30Z

cc @danieldoglas since IIRC you looked at this code this year - #1767

flodnv

Thanks for the great investigation and explanation! 👍

flodnv · 2024-10-31T19:03:05Z

sqlitecluster/SQLiteNode.cpp

                    SWARN("Caught system_error starting _replicate thread with " << _replicationThreadCount.load() << " threads. e.what()=" << e.what());
+                    _changeState(SQLiteNodeState::SEARCHING, message.calcU64("NewCount") - 1);
                    STHROW("Error starting replicate thread so giving up and reconnecting.");


What happens after we throw here?

We should go back to synchronizing, as if the node had just been turned on.

danieldoglas

I didn't know that counter existed when I changed this, thanks for finding and fixing!

tylerkaraszewski added 2 commits October 31, 2024 10:12

78962d9 I think this fixes the issue.

7aaec43 Warn before changing state

tylerkaraszewski self-assigned this Oct 31, 2024

tylerkaraszewski requested review from flodnv and cead22 October 31, 2024 17:23

cead22 approved these changes Oct 31, 2024

tylerkaraszewski requested a review from mjasikowski October 31, 2024 18:56

flodnv approved these changes Oct 31, 2024

flodnv reviewed Oct 31, 2024

tylerkaraszewski merged commit 583a5cb into main Oct 31, 2024
1 check passed

tylerkaraszewski deleted the tyler-fix-stuck-state-change branch October 31, 2024 19:36

danieldoglas reviewed Oct 31, 2024
