[improve][misc][WIP] Detect "double release" and "use after release" bugs with recycled objects #22110
After making the changes, there are a lot of unit test failures. I haven't had a chance to look into the details. This PR is still a very early proposal about how to start detecting "double release" and "use after release" bugs.
Recycled objects are widely used in Pulsar, not only for EntryImpl. The first concern is whether we should apply the checks to all of these places. For example, the client side could also use recycled objects.
The second concern is that Pulsar may currently allow a recycled object to be accessed behind a "null check". We need to investigate such cases.
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java
@BewareMyPower Yes, that's a good point. It looks like the AbstractCASReferenceCounted/AbstractReferenceCounted base class prevents a lot of the problems, since an exception would be thrown at some point on any execution path that contains a "double release" or "use after release" bug.
Yes, those are cases where the problem is getting hidden. It would be great to find a better way to track down the issues. That's why I started this PR which is more like an experiment to find some way that would work for detecting "double release" and "use after release" bugs and also raise awareness of such bug patterns with the recycled objects. These are bug patterns that most Java developers have never had to deal with because of Java's garbage collection. With recycled objects and Netty ByteBufs, that all changes.
I guess there's also the possibility of "double release" and "use after release" bugs with Netty ByteBufs. In Netty, there's the leak detector for detecting when you don't release buffers, but there seems to be nothing for detecting the "use after release" bugs. I guess "double release" would be detected with the io.netty.buffer.AbstractReferenceCountedByteBuf base class, which will throw an exception on double release. It seems that the solution might be a Java Agent written with Byte Buddy etc. which would add additional checks with byte code instrumentation when the agent is activated. I'm thinking of something like https://github.com/reactor/BlockHound but for a completely different purpose: to help detect "use after release" bugs.
Getting back to this one more time. Netty protects against most "use after release" ByteBuf bugs by setting the fields to null and the NPEs would be popping up as a sign of issues. Therefore it's extremely important that NPEs aren't suppressed with null checks.
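To make the point about suppressed NPEs concrete, here is a minimal sketch. The class `RecycledHolder` and its methods are hypothetical and exist only for illustration; they are not Netty or Pulsar classes. It assumes an object whose field is set to null when it is recycled, and compares an accessor guarded by a null check with one that fails fast.

```java
/** Hypothetical recycled holder, for illustration only (not a Netty or Pulsar class). */
final class RecycledHolder {
    private byte[] data;

    RecycledHolder(byte[] data) {
        this.data = data;
    }

    /** Clears state when the instance is handed back to its pool. */
    void recycle() {
        data = null;
    }

    /** Anti-pattern: the null check turns a use-after-release into a silent 0. */
    int lengthWithNullCheck() {
        return data != null ? data.length : 0;
    }

    /** Fail-fast: throws NullPointerException if called after recycle(). */
    int lengthFailFast() {
        return data.length;
    }

    public static void main(String[] args) {
        RecycledHolder holder = new RecycledHolder(new byte[] {1, 2, 3});
        holder.recycle();
        // The bug stays hidden: a plausible-looking value is returned.
        System.out.println(holder.lengthWithNullCheck());
        try {
            holder.lengthFailFast();
        } catch (NullPointerException e) {
            // The bug surfaces at the point of misuse.
            System.out.println("use after release detected");
        }
    }
}
```

With the null check in place, the caller gets a value that looks valid and the misuse goes unnoticed; without it, the failure shows up at the call site where the released object was used.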
Nice work @lhotari
My comments are mostly questions - 1 minor suggestion on naming. Other than that LGTM.
Also, because it is coupled w/ the "emergency brake" property, it seems harmless to add in.
The remaining piece will be to find out where we are currently doing the null checks and remove them.
    return entry;
}

private static EntryImpl getEntryFromRecycler() {
👍🏻 factoring this out makes it easier to follow (and add other common behaviors in one place later if needed - basic DRY stuff).
    setRefCnt(1);
}

public static <T extends AbstractValidatingReferenceCounted> T getAndCheck(Recycler<T> recycler) {
[NIT] It is not really doing a "check". In the ACRC it was called getEntryFromRecycler. I think it would be helpful to name them the same thing (sans "entry"), maybe getInstanceFromRecycler.
}

public final void resetRefCnt() {
    setRefCnt(1);
It is unfortunate that setRefCnt is protected. Otherwise, this could have been done in a single util rather than inserted into the hierarchy (composition over inheritance). Also, the code in both hierarchies is almost identical.
- The checkOnAccess var is just a static lookup.
- The checkRefCount and getFromRecycler methods could be passed the counter or the recycler, respectively.
Not the end of the world but my brain is having trouble leaving it alone.
@@ -1374,8 +1375,10 @@ protected ProducerImpl.ChunkedMessageCtx newObject(
 };

 public static ChunkedMessageCtx get(int totalChunks) {
-    ChunkedMessageCtx chunkedMessageCtx = RECYCLER.get();
-    chunkedMessageCtx.setRefCnt(totalChunks);
+    ChunkedMessageCtx chunkedMessageCtx = getAndCheck(RECYCLER);
Is setting ref count to 1 and then retaining N-1 less performant than setting ref count to N?
OR
Is setting ref count to 1 and then retaining N-1 more functionally correct than setting ref count to N?
Pattern used in Netty to detect "use after release" bugs:
There are 2 ByteBuf reference count handling issues in Bookkeeper client that have been recently fixed: apache/bookkeeper#4289 and apache/bookkeeper#4293.
I recently discovered another bug pattern with ByteBufs which is due to an incorrect assumption of how Netty ByteBuf reference counting works for derived buffers. Writing some pseudo code to explain:
The assumption could be that the above code is correct. Why is it wrong? To understand how a duplicate buffer works, one could take a look at the source, in AbstractPooledDerivedByteBuf. The reference count of the duplicated buffer (extends AbstractPooledDerivedByteBuf in the case of pooled buffers) is independent of the parent buffer. It will call
The Netty 4.1.111.Final upgrade will prevent some problems in this area, more details in #22892.
Motivation
In Pulsar, users have reported issues that could be caused by "double release" or "use after release" bugs with recycled objects.
Here are some issues that could potentially be caused by "double release" or "use after release" bugs:
#22035
#21892
Another example of such a potential issue: #21421/#21933. It is possible that these issues are fixed by apache/bookkeeper#4196. The other possible root cause is a "double release" or "use after release" bug that corrupts the buffer and causes the checksum calculation to fail.
Outside of Java, there's a known bug pattern called "double-free" or "Doubly freeing memory" (with malloc).
The "double release" bug pattern is similar, but happens with the recycled object pattern using Netty's Recycler, which Pulsar uses for performance reasons.
There is also a "use after free" bug pattern. Something similar can happen with Netty recycled objects when an object instance gets used after it has been released.
The solution in this PR attempts to help detect "double release" and "use after release" bugs.
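As a rough illustration of both bug patterns and of the kind of check that can surface them, here is a self-contained sketch. It is not the code in this PR: `PooledEntry`, its tiny pool, and the method names are hypothetical stand-ins for a Recycler-backed class. The system property name is the one mentioned under Modifications below; treating a missing property as "checks enabled" is an assumption.

```java
import java.util.ArrayDeque;

/** Hypothetical pooled object; a stand-in for a Recycler-backed class, not Pulsar's code. */
final class PooledEntry {
    // Property name taken from this PR's description; the "true" default is an assumption.
    static final boolean CHECK_ON_ACCESS =
            Boolean.parseBoolean(System.getProperty("pulsar.refcount.check.on_access", "true"));

    // Minimal single-threaded pool standing in for Netty's Recycler.
    private static final ArrayDeque<PooledEntry> POOL = new ArrayDeque<>();

    private int refCnt;
    private String payload;

    private PooledEntry() {}

    /** Takes an instance from the pool (or creates one) and starts it at refCnt 1. */
    static PooledEntry create(String payload) {
        PooledEntry entry = POOL.isEmpty() ? new PooledEntry() : POOL.pop();
        entry.refCnt = 1;
        entry.payload = payload;
        return entry;
    }

    /** Accessor guarded by the reference count check. */
    String payload() {
        checkRefCount();
        return payload;
    }

    /** Drops one reference; fails on "double release" instead of re-pooling twice. */
    void release() {
        if (refCnt <= 0) {
            throw new IllegalStateException("double release, refCnt=" + refCnt);
        }
        if (--refCnt == 0) {
            payload = null;
            POOL.push(this); // the instance may now be handed to another caller
        }
    }

    /** Fails on "use after release" when checks are enabled. */
    private void checkRefCount() {
        if (CHECK_ON_ACCESS && refCnt < 1) {
            throw new IllegalStateException("use after release, refCnt=" + refCnt);
        }
    }

    public static void main(String[] args) {
        PooledEntry entry = PooledEntry.create("hello");
        entry.release();
        try {
            entry.release();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Without the check in `release()`, the second release would push the same instance into the pool twice, so two unrelated callers could later receive and overwrite the same object. That is the kind of corruption the issues listed above could stem from.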
Modifications
- [...] .setRefCnt(1) calls in production code, since that could hide real issues
- [...] -Dpulsar.refcount.check.on_access=false.
- [...] -Dpulsar.refcount.check.on_access=false into the bin/pulsar script by default if we are afraid of the performance overhead. For all tests, we should be running with checks enabled so that we could find the source of problems.
Documentation
doc
doc-required
doc-not-needed
doc-complete