Bookie 99% ledger disk usage #4100

mnit016 · 2023-10-08T09:42:32Z

BUG REPORT

One bookie in the cluster filled its ledger disk and switched to read-only mode.
Hi there, I'm facing an issue in Pulsar 2.9.5: ledger disk usage on a bookie increased to 99.9%. This looks similar to #1908, but I can't find any ledger information in the error logs that would tell me what to clean up.
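
For anyone checking the same symptom, here is a minimal sketch of a disk usage check on the ledger directory. It is not BookKeeper's own logic: the 0.95 value is an assumption meant to mirror the bookie's `diskUsageThreshold` setting, so check your own bookie configuration, and the function names are hypothetical.

```python
import shutil

# Assumption: 0.95 mirrors the bookie's diskUsageThreshold setting.
# Check your own bookie configuration for the real value.
ASSUMED_THRESHOLD = 0.95


def usage_fraction(total_bytes, used_bytes):
    """Return the used share of a filesystem as a value between 0 and 1."""
    return used_bytes / total_bytes


def over_threshold(path, threshold=ASSUMED_THRESHOLD):
    """Return True if the filesystem holding `path` is above the threshold."""
    usage = shutil.disk_usage(path)
    return usage_fraction(usage.total, usage.used) > threshold


if __name__ == "__main__":
    import sys

    # Pass the ledger directory as the first argument; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(target, "over threshold:", over_threshold(target))
```

Running it against the ledger directory mount shows whether the disk is past the assumed threshold before the bookie reaches 99%.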

To Reproduce
<N/A>

Expected behavior
How can I recover from this? Is there any workaround or solution?

Screenshots
In Splunk, starting several days ago, many "Entering Safepoint region..." / "Leaving safepoint region..." messages appear in the logs.
A day later, "Exception ledger flush" / "Error in Rocksdb put" errors start.

Four hours after that, they turn into "Error during flush".

These errors continue up to and after the point where "Ledger directory ... is out-of-space" appears.

Additional context
Pulsar 2.9.5 - K8s 1.26.3

mnit016 · 2023-10-09T14:21:46Z

Update:
I've just found a loop in the logs: new entry log files were being created faster and faster until the bookie ledger disk was completely full.

SingleDirectoryDbLedgerStorage - Write cache is full, triggering flush
SyncThread - Exception flushing ledgers
RocksDBException: while fdatasync ..../current/ledgers/000xxx.log: Resource temporarily unavailable

After that, several log lines like the following appear:

EntryLogManagerBase - Creating a new entry log file : createNewLog = false, reachEntryLogLimit = true
EntryLogManagerBase - Flushing entry logger xxxxx back to filesystem, pending for syncing entry loggers : [....]
EntryLoggerAllocator - Created new entry log file .../ledgers/current/xxxx.log for logId xxxxx
...
SyncThread - Exception flushing ledgers
RocksDBException: while fdatasync ..../current/ledgers/000xxx.log: Resource temporarily unavailable

I'm trying to figure out which topic's ledger kept being unavailable.
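
To confirm that new entry log files really are being created faster and faster, the creation messages can be counted per hour. This is a minimal sketch: it assumes each log line starts with an ISO-8601 timestamp such as `2023-10-09T14:21:46`, which is a hypothetical format, so adjust the regular expression to your log layout. The function name is also hypothetical.

```python
import re
from collections import Counter

# Assumption: each log line starts with an ISO-8601 timestamp
# (e.g. 2023-10-09T14:21:46 ...). Adjust to your actual log format.
HOUR_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2})")

# Message text taken from the EntryLoggerAllocator line quoted above.
MARKER = "Created new entry log file"


def entry_logs_created_per_hour(lines):
    """Count 'Created new entry log file' messages, grouped by hour."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = HOUR_PREFIX.match(line)
        if match:
            counts[match.group(1)] += 1
    # Sort by hour so a rising trend is easy to read.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    import sys

    # Usage: python count_entry_logs.py < bookie.log
    for hour, count in entry_logs_created_per_hour(sys.stdin).items():
        print(hour, count)
```

A count that grows from hour to hour would match the loop described above.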

mnit016 · 2023-10-09T14:47:34Z

I can't find the file mentioned in the log below anywhere in the bookie storage:

RocksDBException: while fdatasync ....bookkeeper/ledgers/current/ledgers/000xxx.log: Resource temporarily unavailable

mnit016 · 2023-10-09T15:49:14Z

There are more than 5000 files matching /bookkeeper/ledgers/current/*.log.
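
A count like this can be reproduced with a short script that also sums the file sizes. This is a minimal sketch; the function name is hypothetical, and the default path is simply the directory mentioned above.

```python
from pathlib import Path


def summarize_entry_logs(directory):
    """Return (file_count, total_bytes) for *.log files directly under `directory`."""
    files = [p for p in Path(directory).glob("*.log") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


if __name__ == "__main__":
    import sys

    # Defaults to the directory from this report; pass another path as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "/bookkeeper/ledgers/current"
    count, total = summarize_entry_logs(target)
    print(f"{count} entry log files, {total / 2**30:.2f} GiB")
```

Running it a few times at intervals shows whether the number of entry log files is still growing.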

mnit016 · 2023-10-10T04:05:28Z

Something might have been wrong with the retention data.
Reducing the retention configuration solved the problem, and the bookie is back to normal.

mnit016 added the type/bug label Oct 8, 2023

mnit016 closed this as completed Oct 10, 2023
