Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Bookkeeper returns mac mismatch while reading #21421

Closed
1 of 2 tasks
KannarFr opened this issue Oct 23, 2023 · 23 comments · Fixed by #22484
Closed
1 of 2 tasks

[Bug] Bookkeeper returns mac mismatch while reading #21421

KannarFr opened this issue Oct 23, 2023 · 23 comments · Fixed by #22484
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@KannarFr
Copy link
Contributor

KannarFr commented Oct 23, 2023

Search before asking

  • I searched in the issues and found nothing similar.

Version

2.11.3 to 3.1.0

Minimal reproduce step

N/A

What did you expect to see?

No error

What did you see instead?

After bumped I got a lot of exceptions as:

2023-10-23T15:01:29,964+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Mac mismatch while reading L16702 E0 from bookie: yo-bookkeeper-c3-n3
2023-10-23T15:01:29,965+0000 [BookKeeperClientWorker-OrderedExecutor-7-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L16702 E0-E0, Sent to [yo-bookkeeper-c3-n14, yo-bookkeeper-c3-n3, yo-bookkeeper-c3-n7], Heard from [yo-bookkeeper-c3-n14, yo-bookkeeper-c3-n3, yo-bookkeeper-c3-n7] : bitset = {0, 1, 2}, Error = 'Entry digest does not match'. First unread entry is (-1, rc = null)
2023-10-23T15:01:29,966+0000 [main] ERROR org.apache.bookkeeper.client.BookKeeperAdmin - Error reading entry 0 from ledger 16702
org.apache.bookkeeper.client.BKException$BKDigestMatchException: Entry digest does not match
        at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.client.SyncCallbackUtils$SyncReadCallback.readComplete(SyncCallbackUtils.java:229) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.client.LedgerHandle$4.onFailure(LedgerHandle.java:818) ~[org.apache.bookkeeper-bookkeeper-server-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) ~[?:?]
        at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
        at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107) ~[org.apache.bookkeeper-bookkeeper-common-4.16.2.jar:4.16.2]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.94.Final.jar:4.1.94.Final]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
org.apache.pulsar.broker.service.schema.exceptions.SchemaException: Entry digest does not match -  ledger=103622248 - operation=Failed to read entry - entry=0
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.bkException(BookkeeperSchemaStorage.java:711)
        at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage$Functions.lambda$getLedgerEntry$0(BookkeeperSchemaStorage.java:655)
        at org.apache.bookkeeper.client.LedgerHandle$4.onFailure(LedgerHandle.java:818)
        at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38)
        at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
        at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
        at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:833)

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@KannarFr KannarFr added the type/bug The PR fixed a bug or issue reported a bug label Oct 23, 2023
@KannarFr KannarFr changed the title [Bug] [Bug] Bookkeeper returns Mac mismatch while reading Oct 23, 2023
@KannarFr KannarFr changed the title [Bug] Bookkeeper returns Mac mismatch while reading [Bug] Bookkeeper returns mac mismatch while reading Oct 23, 2023
@KannarFr
Copy link
Contributor Author

KannarFr commented Oct 23, 2023

I'm rollbacking brokers (only) to 2.11.

EDIT: It fixed the issue. So there is an issue with rendering on the client side on v2.11+.

@KannarFr
Copy link
Contributor Author

A bit more context

2023-10-23T13:35:47,556+0000 [BookKeeperClientWorker-OrderedExecutor-9-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L103556923 E0-E0, Sent to [yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n4], Heard from [yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n4] : bitset = {0, 1, 2}, Error = 'Entry digest does not match'. First unread entry is (-1, rc = null)
2023-10-23T13:35:47,556+0000 [BookKeeperClientWorker-OrderedExecutor-9-0] WARN  org.apache.pulsar.broker.service.ServerCnx - [/192.168.2.5:48818][persistent://user_421c40fa-009e-433c-a93b-fc77551a2406/logs/__change_events-partition-0][multiTopicsReader-02b7f937a6] Failed to create consumer: consumerId=2104, Entry digest does not match -  ledger=103556923 - operation=Failed to read entry - entry=0

Where Read of ledger entry failed: L**103556923** E0-E0 really looks like trying to read an empty ledger.

$ bookkeeper shell ledgermetadata -ledgerid 103556923
2023-10-23T21:28:51,808+0000 [main] INFO  org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=5, writeQuorumSize=3, ackQuorumSize=2, state=CLOSED, length=8336, lastEntryId=0, digestType=CRC32C, password=base64:, ensembles={0=[yo-bookkeeper-c3-n4, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n9, yo-bookkeeper-c3-n15]}, customMetadata={component=base64:c2NoZW1h, application=base64:cHVsc2Fy, pulsar/schemaId=base64:dXNlcl80MjFjNDBmYS0wMDllLTQzM2MtYTkzYi1mYzc3NTUxYTI0MDYvbG9ncy9fX2NoYW5nZV9ldmVudHM=}}

@miton18
Copy link
Contributor

miton18 commented Oct 26, 2023

We hit the same errors on Pulsar v3.0.1

@hangc0276
Copy link
Contributor

We hit the same errors on Pulsar v3.0.1

@miton18 Do you have any way to reproduce this issue?

@miton18
Copy link
Contributor

miton18 commented Oct 26, 2023

We hit the same errors on Pulsar v3.0.1

@miton18 Do you have any way to reproduce this issue?

sadly, no

It happens when upgrading brokers to v3.x.x

@hangc0276
Copy link
Contributor

We hit the same errors on Pulsar v3.0.1

@miton18 Do you have any way to reproduce this issue?

sadly, no

It happens when upgrading brokers to v3.x.x

Only upgrade broker, doesn't upgrade the bookie, right?

@miton18
Copy link
Contributor

miton18 commented Oct 26, 2023

bookies has been upgraded to v3.1.0 without issues

@miton18
Copy link
Contributor

miton18 commented Oct 26, 2023

ok, trying to read a ledger from bookkeeper itself also throw the same error...

023-10-26T09:40:26,062+0000 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=5, writeQuorumSize=3, ackQuorumSize=3, state=CLOSED, length=8336, lastEntryId=0, digestType=CRC32C, password=base64:, ensembles={0=[yo-bookkeeper-c3-n12, yo-bookkeeper-c3-n8, yo-bookkeeper-c3-n14, yo-bookkeeper-c3-n6, yo-bookkeeper-c3-n3]}, customMetadata={component=base64:c2NoZW1h, application=base64:cHVsc2Fy, pulsar/schemaId=base64:b3JnYV8yMWQ3ZjZkMy04Mzk3LTQ3ODItOWE2NS05MWVjYWMxYTQ1OTUvbG9ncy9fX2NoYW5nZV9ldmVudHM=}}

@KannarFr
Copy link
Contributor Author

Looks fixed in 3.1.1.

@lhotari
Copy link
Member

lhotari commented Nov 9, 2023

There are some reports on Pulsar Slack that this issue happens when bookkeeperUseV2WireProtocol=false or TLS is enabled for Bookkeeper. @hangc0276 I wonder what the possible bug fix has been if this is already fixed?

@Debashish-Mallick
Copy link

Looks fixed in 3.1.1.

@KannarFr Have you checked with 3.0.XX version. Basically Its working, if we set bookkeeperUseV2WireProtocol=true in broker configmap.

@KannarFr
Copy link
Contributor Author

No, I didn't but should I use v2 (legacy) wire proto? I mean this can fix the issue, but this is a jump in the past, nope? :D

@Debashish-Mallick
Copy link

Looks fixed in 3.1.1.

@KannarFr Have you checked with 3.0.XX version. Basically Its working, if we set bookkeeperUseV2WireProtocol=true in broker configmap.

@lhotari Could you please help me to understand , are there any impact If we make bookkeeperUseV2WireProtocol=true ?

@lhotari
Copy link
Member

lhotari commented Nov 20, 2023

Could you please help me to understand , are there any impact If we make bookkeeperUseV2WireProtocol=true

@Debashish-Mallick bookkeeperUseV2WireProtocol=true is the recommended default with Pulsar:

pulsar/conf/broker.conf

Lines 973 to 974 in 9b643c8

# Use older Bookkeeper wire protocol with bookie
bookkeeperUseV2WireProtocol=true

However, when using bookkeeperUseV2WireProtocol=true, I believe that TLS is not supported. See apache/bookkeeper#2300 and https://github.com/apache/pulsar-helm-chart/blob/8cb3c18377e547e5de4ca19e63138ca1387aef59/charts/pulsar/templates/broker-configmap.yaml#L131-L147 for more information.

I am not exactly sure about the impact since I haven't tested this configuration myself. My concern is that TLS wouldn't be used when the intention is to use TLS between brokers and bookies.

However, it's possible that apache/bookkeeper#2300 change added support for TLS for v2, but it's hard to know the full context of that change.
One possibility is that it's completely valid and recommended to use bookkeeperUseV2WireProtocol=true, also with TLS after the BK PR 2300 fix that was made a long time ago.

@miton18
Copy link
Contributor

miton18 commented Nov 20, 2023

bookkeeperUseV2WireProtocol=false
is mandatory to have a working SQL setup
https://pulsar.apache.org/docs/3.1.x/sql-deployment-configurations/#get-the-last-message-in-a-topic

@codelipenghui
Copy link
Contributor

apache/bookkeeper#4140 has figured out the issue. It's a regression from 3.0.

@lhotari
Copy link
Member

lhotari commented Jan 17, 2024

@hangc0276 @poorbarcode Do I understand correctly that apache/bookkeeper#4140 will be released in bookkeeper version 4.16.4 ? What is the schedule for the Bookkeeper 4.16.4 release with the fix?
I also asked at https://lists.apache.org/thread/tcy93c8vrb475j64tdotlklxblrcm728 .

@lhotari
Copy link
Member

lhotari commented Jan 29, 2024

Bookkeeper 4.16.4 has been upgraded in #21983 . Closing this issue since that should contain the fix.

@lhotari lhotari closed this as completed Jan 29, 2024
@lhotari
Copy link
Member

lhotari commented Jan 29, 2024

Reopening until latest concerns about other checksum related bugs in BK are resolved.

@lhotari
Copy link
Member

lhotari commented Mar 29, 2024

Pending fix in Bookkeeper will be included in 4.16.5 release.
apache/bookkeeper#4196

@pqab
Copy link

pqab commented Apr 10, 2024

We are using 4.16.3 and encountering this error with 1-2 events occurring randomly on a daily basis, and we have a workaround to skip the specific position and resend the data to recover it. May I know if you are aware of any other workarounds available before the 4.16.5 release?

@lhotari
Copy link
Member

lhotari commented Apr 11, 2024

We are using 4.16.3 and encountering this error with 1-2 events occurring randomly on a daily basis, and we have a workaround to skip the specific position and resend the data to recover it. May I know if you are aware of any other workarounds available before the 4.16.5 release?

@pqab I'm not aware of a workaround. There's also a concern that there are some other problems (#22103 with some linked BK issue) that cause similar symptoms.
BK 4.16.5 has been released and there's PR to upgrade in Pulsar, #22484. This will be included in Pulsar releases 3.0.5 and 3.2.3 . ETA for the release is within a few weeks. In the meantime, it's possible to build custom builds of Pulsar.

@lhotari
Copy link
Member

lhotari commented May 28, 2024

I believe that this issue is a duplicate of #22601. I have made a fix to bookkeeper which fixes the issue: apache/bookkeeper#4404

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants