Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runbook: clarify MimirIngesterReachingSeriesLimit errors and retries #9410

Merged
merged 3 commits into from
Dec 20, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 10 additions & 2 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,15 @@ If nothing obvious from the above, check for increased load:

### MimirIngesterReachingSeriesLimit

This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new series, while appending samples to existing ones will continue to succeed.
This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is close to reaching the limit.
The threshold is set at 80% to give the chance to react before the limit is reached.
After the limit is reached, write requests to the ingester fail for new series. Appending samples to existing ones continue to succeed.

Note that the error responses sent back to the sender are classified as "server errors" (5xx), which should result in a retry by the sender.
While this situation continues, these retries stall the flow of data, and newer data queues up on the sender.
If the condition is cleared in a short time, service can be restored with no data loss.
bboreham marked this conversation as resolved.
Show resolved Hide resolved

This is different to what happens when the `max_global_series_per_user` limit is exceeded, which is considered a "client error" (4xx). In this case, excess data is discarded.

In case of **emergency**:

Expand Down Expand Up @@ -123,7 +131,7 @@ How to **fix** it:

### MimirIngesterReachingTenantsLimit

This alert fires when the `max_tenants` per ingester instance limit is enabled and the actual number of tenants in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new tenants, while they will continue to succeed for previously existing ones.
This alert fires when the `max_tenants` per ingester instance limit is enabled and the actual number of tenants in an ingester is reaching the limit. Once the limit is reached, write requests to the ingester will fail (5xx) for new tenants, while they will continue to succeed for previously existing ones.

The per-tenant memory utilisation in ingesters includes the overhead of allocations for TSDB stripes and chunk writer buffers. If the tenant number is high, this may contribute significantly to the total ingester memory utilization. The size of these allocations is controlled by `-blocks-storage.tsdb.stripe-size` (default 16KiB) and `-blocks-storage.tsdb.head-chunks-write-buffer-size-bytes` (default 4MiB), respectively.

Expand Down
Loading