
Fix NPE issue during recovery add #3620

Open — wants to merge 9 commits into base: master
Conversation

zhaohaidao (Contributor):
Description of the changes in this PR:
Fix NPE issue during recovery add

Motivation

The issue details are in #3618.

Changes

Check if the op has been recycled after handling bookie failure.

Master Issue: #3618


In order to uphold a high standard of quality for code contributions, Apache BookKeeper runs various precommit
checks for pull requests. A pull request can only be merged when it passes precommit checks.


Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

If this PR is a BookKeeper Proposal (BP):

  • Make sure the PR title is formatted like:
    <BP-#>: Description of bookkeeper proposal
    e.g. BP-1: 64 bits ledger id support
  • Attach the master issue link in the description of this PR.
  • Attach the google doc link if the BP is written in Google Doc.

Otherwise:

  • Make sure the PR title is formatted like:
    <Issue #>: Description of pull request
    e.g. Issue 123: Description ...
  • Make sure tests pass via mvn clean apache-rat:check install spotbugs:check.
  • Replace <Issue #> in the title with the actual Issue number.

@@ -266,6 +266,10 @@ public void run() {
         // We are about to send. Check if we need to make an ensemble change
         // because of delayed write errors
         lh.maybeHandleDelayedWriteBookieFailure();
+        // This op maybe recycled during bookie failures
+        if (maybeRecycled()) {
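Note: the hunk shows only the call site of the new guard. The following is a minimal sketch of what maybeRecycled() presumably checks, going by the review discussion below; it is an assumption, not necessarily the PR's exact code.

// Assumed sketch, not necessarily the PR's implementation: the recycle path resets the
// op's fields (including lh) before returning it to the pool, so a null lh is taken to
// mean "this op was recycled while handling the bookie failure and run() must stop here".
private boolean maybeRecycled() {
    return lh == null;
}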
Contributor:
So maybeHandleDelayedWriteBookieFailure triggers something that recycles this pending add op (I haven't looked that closely)?

Then the fix masks the problem.
One side of the problem is that lh becomes null / results in an NPE later.
Another side of the problem is that the pending add op can be recycled and immediately reused by another add.

zhaohaidao (Contributor, Author):

That's right. maybeHandleDelayedWriteBookieFailure will replace the failed bookie with a healthy one. However, if a BKNotEnoughBookiesException is thrown, maybeHandleDelayedWriteBookieFailure will recycle this pending add op.
You can see the recycle logic in the following code:

@Override
void handleBookieFailure(final Map<Integer, BookieId> failedBookies) {
    // handleBookieFailure should always run in the ordered executor thread for this
    // ledger, so this synchronized should be unnecessary, but putting it here now
    // just in case (can be removed when we validate threads)
    synchronized (metadataLock) {
        ...
        try {
            List<BookieId> newEnsemble = EnsembleUtils.replaceBookiesInEnsemble(
                    clientCtx.getBookieWatcher(), metadata, currentEnsemble, failedBookies, logContext);
        ...
        } catch (BKException.BKNotEnoughBookiesException e) {
            LOG.error("Could not get additional bookie to remake ensemble, closing ledger: {}", ledgerId);

            handleUnrecoverableErrorDuringAdd(e.getCode());
            return;
        }


zhaohaidao (Contributor, Author):

So maybeHandleDelayedWriteBookieFailure triggers something that recycles this pending add op (I haven't looked that closely)?

Then the fix masks the problem. One side of the problem is that lh becomes null / results in an NPE later. Another side of the problem is that the pending add op can be recycled and immediately reused by another add.

  • One side of the problem is that lh becomes null / results in an NPE later:
    If I understand correctly, only maybeHandleDelayedWriteBookieFailure has a chance to recycle this operation, causing lh to become null, which leads to this NPE issue. Although this behavior does not match the recycle conditions described in the code comments below, I think it is reasonable.
    private void maybeRecycle() {
        /**
         * We have opportunity to recycle two objects here.
         * PendingAddOp#toSend and LedgerHandle#pendingAddOp
         *
         * A. LedgerHandle#pendingAddOp: This can be released after all 3 conditions are met.
         *    - After the callback is run
         *    - After safeRun finished by the executor
         *    - Write responses are returned from all bookies
         *      as BookieClient Holds a reference from the point the addEntry requests are sent.
         *
         * B. ByteBuf can be released after 2 of the conditions are met.
         *    - After the callback is run as we will not retry the write after callback
         *    - After safeRun finished by the executor
         * BookieClient takes and releases on this buffer immediately after sending the data.
         *
         * The object can only be recycled after the above conditions are met
         * otherwise we could end up recycling twice and all
         * joy that goes along with that.
         */
        if (hasRun && callbackTriggered) {
            ReferenceCountUtil.release(toSend);
            toSend = null;
        }
        // only recycle a pending add op after it has been run.
        if (hasRun && toSend == null && pendingWriteRequests == 0) {
            recyclePendAddOpObject();
        }
    }

  • Another side of the problem is that the pending add op can be recycled and immediately reused by another add: I don't understand this very well; can you explain in more detail?

Contributor:

Another side of the problem is that the pending add op can be recycled and immediately reused by another add.

I don't understand this very well, can you explain in more detail?

        lh.maybeHandleDelayedWriteBookieFailure();
        
        // Add Op is recycled at this point; maybeRecycled() has a chance to check lh == null,
        // but the recycler can give that op to another thread/request (via a RECYCLER.get() call),
        // so it can be re-initialized with a different lh and other data,
        // so maybeRecycled returns false and the rest of the method is executed against a different op.
        // To avoid such problems and avoid locks we use ordered executors / guarantee order of calls.
        
        if (maybeRecycled()) {

@zhaohaidao I hope this helps
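To make the reuse hazard concrete, here is a minimal standalone sketch, assuming the RECYCLER mentioned above is a Netty io.netty.util.Recycler; the Op class and its entryId field are illustrative stand-ins, not BookKeeper code.

import io.netty.util.Recycler;

public class RecyclerReuseSketch {

    // Toy stand-in for PendingAddOp: a pooled object whose fields are reset on recycle.
    static final class Op {
        private final Recycler.Handle<Op> handle;
        long entryId = -1;   // stands in for the op's state (lh, entryId, payload, ...)

        private Op(Recycler.Handle<Op> handle) {
            this.handle = handle;
        }

        void recycle() {
            entryId = -1;           // the real PendingAddOp also nulls out its fields here
            handle.recycle(this);   // hand the instance back to the pool
        }
    }

    static final Recycler<Op> RECYCLER = new Recycler<Op>() {
        @Override
        protected Op newObject(Recycler.Handle<Op> handle) {
            return new Op(handle);
        }
    };

    public static void main(String[] args) {
        Op op = RECYCLER.get();
        op.entryId = 946;

        // The op gets recycled from within its own run(), e.g. via the
        // BKNotEnoughBookiesException -> handleUnrecoverableErrorDuringAdd path.
        op.recycle();

        // A later add may be handed back the very same instance, re-initialized with new state.
        Op another = RECYCLER.get();
        another.entryId = 947;

        // If the pool reused the instance, the stale reference 'op' now aliases 'another',
        // so any "was I recycled?" check performed through 'op' races against the reuse.
        System.out.println("same instance reused: " + (op == another));
        System.out.println("state seen through the stale reference: " + op.entryId);
    }
}

If the pool hands the recycled instance straight back (typical for same-thread recycling), the stale reference observes the new op's state, which is why a post-hoc recycled check is racy and ordered execution is preferred.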

zhaohaidao (Contributor, Author):


Thanks a lot for your explanation. I submitted a revision based on your suggestions.
Please have a look if you have time.
@dlg99

hangc0276 (Contributor) left a comment:

@zhaohaidao Thanks for your contribution.

The root cause of this NPE is the different implementations of handleBookieFailure in LedgerHandle and ReadOnlyLedgerHandle.

In the LedgerHandle implementation, the ensemble change result is handled in a new task on the same thread:

.run().whenCompleteAsync((metadata, ex) -> {
    if (ex != null) {
        LOG.warn("{}[attempt:{}] Exception changing ensemble", logContext, attempts.get(), ex);
        handleUnrecoverableErrorDuringAdd(BKException.getExceptionCode(ex, WriteException));
    } else if (metadata.getValue().isClosed()) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("{}[attempt:{}] Metadata closed during attempt to replace bookie."
                      + " Another client must have recovered the ledger.", logContext, attempts.get());
        }
        handleUnrecoverableErrorDuringAdd(BKException.Code.LedgerClosedException);
    } else if (metadata.getValue().getState() == LedgerMetadata.State.IN_RECOVERY) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("{}[attempt:{}] Metadata marked as in-recovery during attempt to replace bookie."
                      + " Another client must be recovering the ledger.", logContext, attempts.get());
        }
        handleUnrecoverableErrorDuringAdd(BKException.Code.LedgerFencedException);
    } else {
        if (LOG.isDebugEnabled()) {
            LOG.debug("{}[attempt:{}] Success updating metadata.", logContext, attempts.get());
        }
        List<BookieId> newEnsemble = null;
        Set<Integer> replaced = null;
        synchronized (metadataLock) {
            if (!delayedWriteFailedBookies.isEmpty()) {
                Map<Integer, BookieId> toReplace = new HashMap<>(delayedWriteFailedBookies);
                delayedWriteFailedBookies.clear();
                ensembleChangeLoop(origEnsemble, toReplace);
            } else {
                newEnsemble = getCurrentEnsemble();
                replaced = EnsembleUtils.diffEnsemble(origEnsemble, newEnsemble);
                LOG.info("New Ensemble: {} for ledger: {}", newEnsemble, ledgerId);
                changingEnsemble = false;
            }
        }
        if (newEnsemble != null) { // unsetSuccess outside of lock
            unsetSuccessAndSendWriteRequest(newEnsemble, replaced);
        }
    }
}, clientCtx.getMainWorkerPool().chooseThread(ledgerId));

In ReadOnlyLedgerHandle, the ensemble change result and exceptions are handled in the current task on the current thread, which leads to the NPE.

I suggest keeping it consistent with the LedgerHandle implementation by changing handleUnrecoverableErrorDuringAdd(e.getCode()); to executeOrdered(() -> handleUnrecoverableErrorDuringAdd(e.getCode()));
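A minimal sketch of what that suggestion would look like in the ReadOnlyLedgerHandle#handleBookieFailure catch block quoted earlier, assuming executeOrdered submits the runnable to the ledger's ordered executor; only the deferral changes.

} catch (BKException.BKNotEnoughBookiesException e) {
    LOG.error("Could not get additional bookie to remake ensemble, closing ledger: {}", ledgerId);

    // Defer the error handling to a separate task on the ledger's ordered executor,
    // mirroring LedgerHandle's whenCompleteAsync(...) callback, instead of recycling
    // the pending add op synchronously inside the current PendingAddOp#run().
    executeOrdered(() -> handleUnrecoverableErrorDuringAdd(e.getCode()));
    return;
}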

zhaohaidao (Contributor, Author) commented Nov 8, 2022:

Thanks for your advice. I don't quite understand why processing in a different task on the current thread would not be a problem. Could you please explain this in more detail?
@hangc0276

zhaohaidao (Contributor, Author):


Thanks a lot for your explanation. I submitted a revision based on your suggestions.
Please have a look if you have time.
@hangc0276

hangc0276 (Contributor):

@merlimat @eolivelli @dlg99 @zymap @horizonzy Please help take a look at this PR, thanks a lot.

Lists.newArrayList(b1, b2, b3)));
ReadOnlyLedgerHandle lh = new ReadOnlyLedgerHandle(clientCtx, 0, md, BookKeeper.DigestType.CRC32C,
ClientUtil.PASSWD, true);
lh.notifyWriteFailed(0, b1);
Member:

In the normal case this can't happen: lh.notifyWriteFailed(0, b1); should not come before op.run(), so I guess it is just for the test.
I'm curious why the problem happened. Could you add a test that mocks the real case? Thanks a lot.

zhaohaidao (Contributor, Author):

Let me describe a scenario where the problem might occur.
The client writes entries 945 and 946 to the same ledger (id=1256) in succession, with 3 copies of each entry. The write of 945 succeeds, but one of the bookies returns an exception, so LedgerHandle.delayedWriteFailedBookies will not be empty. Then, while writing 946, if a BKNotEnoughBookiesException is thrown, the NPE occurs.

It is a little troublesome to construct the above scenario, so in my test case I ensure from the start that delayedWriteFailedBookies is not empty (via lh.notifyWriteFailed(0, b1)), which lets me quickly reproduce the NPE.

It should be noted that this is a scenario I deduced from the error log; the log only shows the write failure of 946 caused by BKNotEnoughBookiesException.

zhaohaidao (Contributor, Author):

Sorry, this change may introduce a new problem. Consider this case: the ledger handles asyncAddEntry 10000 times, so 10000 PendingAddOps are added to LedgerHandle#pendingAddOps.

The 10000 PendingAddOps will be executed one by one later. Before this change, errorOutPendingAdds was a sync operation: it takes all the PendingAddOps from LedgerHandle#pendingAddOps and then submits their callbacks, after which every PendingAddOp#callbackTriggered is set to true. Because every PendingAddOp is executed in the thread pool, each one will run at some point.

if (callbackTriggered) {
    // this should only be true if the request was failed due
    // to another request ahead in the pending queue,
    // so we can just ignore this request
    maybeRecycle();
    return;
}

Right now we check whether PendingAddOp#callbackTriggered is true; if it is, the op returns directly without sending any request to the server, which is quick.
But if we make errorOutPendingAdds async, all 10000 PendingAddOps will send requests to the server, because the Runnables are ordered: errorOutPendingAdds only runs after the 10000 PendingAddOps have executed. That is not acceptable.
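A minimal standalone sketch of the ordering behavior behind this concern; a plain single-threaded executor stands in for the per-ledger ordered executor, and the numbers are illustrative.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class OrderedErrorOutSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the per-ledger ordered executor: tasks run strictly in submission order.
        ExecutorService ordered = Executors.newSingleThreadExecutor();
        AtomicInteger sent = new AtomicInteger();

        // 10000 pending add ops are already queued; each one "sends" its request when it runs.
        for (int i = 0; i < 10_000; i++) {
            ordered.execute(sent::incrementAndGet);
        }

        // If errorOutPendingAdds is submitted as just another ordered task, it cannot
        // short-circuit the queue: every previously queued op runs (and sends) first.
        ordered.execute(() ->
                System.out.println("errorOutPendingAdds runs after " + sent.get() + " ops already ran"));

        ordered.shutdown();
        ordered.awaitTermination(10, TimeUnit.SECONDS);
    }
}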

@horizonzy Thank you very much for your careful explanation; your concern is very reasonable. However, in the scenario you described, the BK client writes multiple entries to the same ledger, and if the first entry fails, LedgerHandle fails the subsequent requests by calling errorOutPendingAdds asynchronously.
There is a difference between the ReadOnlyLedgerHandle implementation and the LedgerHandle implementation; the latter is a sync operation.
The change in this PR actually follows the handleUnrecoverableErrorDuringAdd implementation of LedgerHandle, per the suggestion of @hangc0276.
Another fix I can think of is to interrupt the execution of safeRun when a NotEnoughBookiesException is encountered, but this may require adding an exception to the signature of handleBookieFailure, which I understand may go against the original design intention. Do you have any better suggestions? I'd welcome further discussion.

It is worth mentioning that although LedgerHandle.handleUnrecoverableErrorDuringAdd may also call errorOutPendingAdds synchronously, that code path will never be taken, because the ledger recovery process always goes through the ReadOnlyLedgerHandle method.

// close the ledger and send fails to all the adds in the pipeline
void handleUnrecoverableErrorDuringAdd(int rc) {
    if (getLedgerMetadata().getState() == LedgerMetadata.State.IN_RECOVERY) {
        // we should not close ledger if ledger is recovery mode
        // otherwise we may lose entry.
        errorOutPendingAdds(rc);
        return;
    }
    LOG.error("Closing ledger {} due to {}", ledgerId, BKException.codeLogger(rc));
    asyncCloseInternal(NoopCloseCallback.instance, null, rc);
}

horizonzy (Member):

There are several places that invoke handleUnrecoverableErrorDuringAdd; LedgerHandle also invokes it, so this PR also changes the behavior of LedgerHandle.

@@ -1769,7 +1769,7 @@ void handleUnrecoverableErrorDuringAdd(int rc) {
     if (getLedgerMetadata().getState() == LedgerMetadata.State.IN_RECOVERY) {
         // we should not close ledger if ledger is recovery mode
         // otherwise we may lose entry.
-        errorOutPendingAdds(rc);
+        executeOrdered(() -> errorOutPendingAdds(rc));
zhaohaidao (Contributor, Author):

@horizonzy I may not have explained it clearly. Although this is changed to an async operation, this code block will never be executed, because in IN_RECOVERY mode the ReadOnlyLedgerHandle method is always the one called.
What do you think about rolling back the change to LedgerHandle and keeping only the change to ReadOnlyLedgerHandle?

Member:

I see.

zhaohaidao (Contributor, Author):

rerun failure checks

(3 similar comments)

dlg99 (Contributor) left a comment:

LGTM.
Thank you for fixing this and adding the test.

zhaohaidao (Contributor, Author):

Hey @hangc0276, can you help continue reviewing this PR when you are free? Also, the CI seems to be stuck and cannot be re-triggered.

zhaohaidao (Contributor, Author):

rerun failure checks

hangc0276 closed this Oct 7, 2023
hangc0276 reopened this Oct 7, 2023
zhaohaidao (Contributor, Author):

rerun failure checks

(3 similar comments)

eolivelli removed this from the 4.17.0 milestone May 3, 2024