-
Notifications
You must be signed in to change notification settings - Fork 907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix NPE issue during recovery add #3620
base: master
Are you sure you want to change the base?
Fix NPE issue during recovery add #3620
Conversation
@@ -266,6 +266,10 @@ public void run() { | |||
// We are about to send. Check if we need to make an ensemble change | |||
// because of delayed write errors | |||
lh.maybeHandleDelayedWriteBookieFailure(); | |||
// This op maybe recycled during bookie failures | |||
if (maybeRecycled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybeHandleDelayedWriteBookieFailure triggers something that recycles this pending add op (I haven't looked that closely)?
Then the fix masks the problem.
One side of the problem is that lh becomes null / results in NPE later.
Another side of the problem is that pending add opp can be recycled and immediately reused by another add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. maybeHandleDelayedWriteBookieFailure will replace failed bookie with healthy one. However if NotEnoughBookieException is throwed. maybeHandleDelayedWriteBookieFailure will recycle this pending add op.
You can see this recycle logic in followed code logic
`
@Override
void handleBookieFailure(final Map<Integer, BookieId> failedBookies) {
// handleBookieFailure should always run in the ordered executor thread for this
// ledger, so this synchronized should be unnecessary, but putting it here now
// just in case (can be removed when we validate threads)
synchronized (metadataLock) {
...
try {
List<BookieId> newEnsemble = EnsembleUtils.replaceBookiesInEnsemble(
clientCtx.getBookieWatcher(), metadata, currentEnsemble, failedBookies, logContext);
...
} catch (BKException.BKNotEnoughBookiesException e) {
LOG.error("Could not get additional bookie to remake ensemble, closing ledger: {}", ledgerId);
handleUnrecoverableErrorDuringAdd(e.getCode());
return;
}
`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybeHandleDelayedWriteBookieFailure triggers something that recycles this pending add op (I haven't looked that closely)?
Then the fix masks the problem. One side of the problem is that lh becomes null / results in NPE later. Another side of the problem is that pending add opp can be recycled and immediately reused by another add.
- One side of the problem is that lh becomes null / results in NPE later:
If I understand correctly, only maybeHandleDelayedWriteBookieFailure has a chance to recycle this operation causing lh to become null which leads to this npe issue. Although this behavior is not as described in the followed comments about recycle condition, I think it is reasonable.
private void maybeRecycle() {
/**
* We have opportunity to recycle two objects here.
* PendingAddOp#toSend and LedgerHandle#pendingAddOp
*
* A. LedgerHandle#pendingAddOp: This can be released after all 3 conditions are met.
* - After the callback is run
* - After safeRun finished by the executor
* - Write responses are returned from all bookies
* as BookieClient Holds a reference from the point the addEntry requests are sent.
*
* B. ByteBuf can be released after 2 of the conditions are met.
* - After the callback is run as we will not retry the write after callback
* - After safeRun finished by the executor
* BookieClient takes and releases on this buffer immediately after sending the data.
*
* The object can only be recycled after the above conditions are met
* otherwise we could end up recycling twice and all
* joy that goes along with that.
*/
if (hasRun && callbackTriggered) {
ReferenceCountUtil.release(toSend);
toSend = null;
}
// only recycle a pending add op after it has been run.
if (hasRun && toSend == null && pendingWriteRequests == 0) {
recyclePendAddOpObject();
}
}
- Another side of the problem is that pending add opp can be recycled and immediately reused by another add.: I don't understand this very well, can you explain in more detail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another side of the problem is that pending add opp can be recycled and immediately reused by another add.
I don't understand this very well, can you explain in more detail?
lh.maybeHandleDelayedWriteBookieFailure();
// Add Op is recycled at this point, maybeRecycled() has a chance to check lh == null
// but the recycler can give that op to another thread/request (via RECYCLER.get() call)
// so it can be re-initialized with different lh and other data
// so maybeRecycled returns false and the rest of the method is executed against different op
// to avoid such problems and avoid locks we use ordered executors / guarantee order of calls
if (maybeRecycled()) {
@zhaohaidao I hope this helps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another side of the problem is that pending add opp can be recycled and immediately reused by another add.
I don't understand this very well, can you explain in more detail?
lh.maybeHandleDelayedWriteBookieFailure(); // Add Op is recycled at this point, maybeRecycled() has a chance to check lh == null // but the recycler can give that op to another thread/request (via RECYCLER.get() call) // so it can be re-initialized with different lh and other data // so maybeRecycled returns false and the rest of the method is executed against different op // to avoid such problems and avoid locks we use ordered executors / guarantee order of calls if (maybeRecycled()) {@zhaohaidao I hope this helps
Thanks a lot for your explanation. I submitted a revision based on your suggestions.
Please have a look if you have time.
@dlg99
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhaohaidao Thanks for your contribution.
The root cause of this NPE is the different implementations of handleBookieFailure
in LedgerHandle
and ReadOnlyLedgerHandle
.
In the LedgerHandle
implementation, it handles ensemble change results in a new task with the same thread
bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java
Lines 1962 to 2004 in 43bdfb3
.run().whenCompleteAsync((metadata, ex) -> { | |
if (ex != null) { | |
LOG.warn("{}[attempt:{}] Exception changing ensemble", logContext, attempts.get(), ex); | |
handleUnrecoverableErrorDuringAdd(BKException.getExceptionCode(ex, WriteException)); | |
} else if (metadata.getValue().isClosed()) { | |
if (LOG.isDebugEnabled()) { | |
LOG.debug("{}[attempt:{}] Metadata closed during attempt to replace bookie." | |
+ " Another client must have recovered the ledger.", logContext, attempts.get()); | |
} | |
handleUnrecoverableErrorDuringAdd(BKException.Code.LedgerClosedException); | |
} else if (metadata.getValue().getState() == LedgerMetadata.State.IN_RECOVERY) { | |
if (LOG.isDebugEnabled()) { | |
LOG.debug("{}[attempt:{}] Metadata marked as in-recovery during attempt to replace bookie." | |
+ " Another client must be recovering the ledger.", logContext, attempts.get()); | |
} | |
handleUnrecoverableErrorDuringAdd(BKException.Code.LedgerFencedException); | |
} else { | |
if (LOG.isDebugEnabled()) { | |
LOG.debug("{}[attempt:{}] Success updating metadata.", logContext, attempts.get()); | |
} | |
List<BookieId> newEnsemble = null; | |
Set<Integer> replaced = null; | |
synchronized (metadataLock) { | |
if (!delayedWriteFailedBookies.isEmpty()) { | |
Map<Integer, BookieId> toReplace = new HashMap<>(delayedWriteFailedBookies); | |
delayedWriteFailedBookies.clear(); | |
ensembleChangeLoop(origEnsemble, toReplace); | |
} else { | |
newEnsemble = getCurrentEnsemble(); | |
replaced = EnsembleUtils.diffEnsemble(origEnsemble, newEnsemble); | |
LOG.info("New Ensemble: {} for ledger: {}", newEnsemble, ledgerId); | |
changingEnsemble = false; | |
} | |
} | |
if (newEnsemble != null) { // unsetSuccess outside of lock | |
unsetSuccessAndSendWriteRequest(newEnsemble, replaced); | |
} | |
} | |
}, clientCtx.getMainWorkerPool().chooseThread(ledgerId)); |
In the ReadOnlyLedgerHandle
, it handles the ensemble change results and exceptions in the current task and current thread, which will lead to the NPE.
bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadOnlyLedgerHandle.java
Line 232 in 43bdfb3
handleUnrecoverableErrorDuringAdd(e.getCode()); |
I suggest keeping the same with LedgerHandle
implementation to change handleUnrecoverableErrorDuringAdd(e.getCode());
to executeOrdered(() -> handleUnrecoverableErrorDuringAdd(e.getCode()));
Thanks for your advice. I don't quite understand why processing in a different task and current thread would not be a problem. Could you please explain this in more detail? |
Thanks a lot for your explanation. I submitted a revision based on your suggestions. |
@merlimat @eolivelli @dlg99 @zymap @horizonzy Please help take a look at this Pr, thanks a lot. |
bookkeeper-server/src/test/java/org/apache/bookkeeper/client/PendingAddOpTest.java
Outdated
Show resolved
Hide resolved
Lists.newArrayList(b1, b2, b3))); | ||
ReadOnlyLedgerHandle lh = new ReadOnlyLedgerHandle(clientCtx, 0, md, BookKeeper.DigestType.CRC32C, | ||
ClientUtil.PASSWD, true); | ||
lh.notifyWriteFailed(0, b1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the normal case, it can't happen.
lh.notifyWriteFailed(0, b1);
should not before op.run()
, it just for test I guess.
I'm curious about why the problem happened. Could you add a test to mock the real case? thanks a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to give a scenario where the problem might occur.
The client tries to write 945 and 946 to the same ledger (id=1256) successively, and writes 3 copies of each entry. Write 945 is successful but one of the bookies returns an exception, so LedgerHandle.delayedWriteFailedBookies will not be empty. Then when writing 946, if a BKNotEnoughBookiesException is thrown, an NPE issue will occur.
It is a little troublesome to construct the above scenario, so I ensure that delayedWriteFailedBookies is not empty (by lh.notifyWriteFailed(0, b1)) from the beginning of my test case, so that I can quickly verify the npe issue.
It should be noted that this is a scenario I deduced from the error log. In the log, only 946 write failures caused by BKNotEnoughBookiesException can be seen.
@horizonzy Thank you very much for your careful explanation, your concern is very reasonable. However in the scenario you described, when bkclient writes multiple messages to the same ledger. If the first message fails, LedgerHandle makes subsequent requests fail by calling errorOutPendingAdds asynchronously. It is worth mentioning that although LedgerHandle.handleUnrecoverableErrorDuringAdd may also call errorOutPendingAdds synchronously, the logic of this code will never happen, because the ledger recovery process will definitely call the ReadOnlyLedgeHandle method. bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java Lines 1767 to 1777 in 7b5b6b2
|
@@ -1769,7 +1769,7 @@ void handleUnrecoverableErrorDuringAdd(int rc) { | |||
if (getLedgerMetadata().getState() == LedgerMetadata.State.IN_RECOVERY) { | |||
// we should not close ledger if ledger is recovery mode | |||
// otherwise we may lose entry. | |||
errorOutPendingAdds(rc); | |||
executeOrdered(() -> errorOutPendingAdds(rc)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@horizonzy I may not have explained it clearly. Although this is changed to an async operation, this code block will never be executed, because when in IN_RECOVERY mode, the method of ReadOnlyLedgerHandle must be called.
What do you think if I roll back the changes to LedgerHandle and only keep the changes to ReadOnlyLedgerHandle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
rerun failure checks |
3 similar comments
rerun failure checks |
rerun failure checks |
rerun failure checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thank you for fixing this and adding the test.
Hey, @hangc0276 Can you help me continue to review this pr when you are free? Also, the ci seems to be stuck and cannot be re-triggered. |
rerun failure checks |
…e-issue-during-recovery-add
rerun failure checks |
rerun failure checks |
2 similar comments
rerun failure checks |
rerun failure checks |
Descriptions of the changes in this PR:
Fix NPE issue during recovery add
Motivation
The issue details can is #3618
Changes
Check if the op has been recycled after handling bookie failure
Master Issue: #3618