Add snapshot monitors for elasticsearch (#60)
* add snapshot monitors for elasticsearch

* fix missing default on variables
Aohzan authored Nov 15, 2024
1 parent c6d6293 commit 31fb26d
Showing 4 changed files with 203 additions and 35 deletions.
44 changes: 31 additions & 13 deletions database/elasticsearch/README.md
@@ -23,22 +23,24 @@ Creates DataDog monitors with the following checks:
 - Elasticsearch average search fetch latency
 - Elasticsearch average search query latency
 - Elasticsearch average Young-generation garbage collections latency
-- Elasticsearch change alert on the average time spent by tasks in the queue
-- Elasticsearch change alert on the number of currently active queries
-- Elasticsearch change alert on the number of query cache evictions
-- Elasticsearch change alert on the number of request cache evictions
-- Elasticsearch change alert on the number of search fetches currently running
-- Elasticsearch change alert on the total number of evictions from the fielddata cache
-- ElasticSearch Cluster has unassigned shards
-- ElasticSearch Cluster is initializing shards
-- ElasticSearch Cluster is relocating shards
-- ElasticSearch Cluster status not green
-- ElasticSearch does not respond
-- ElasticSearch free space < 10%
+- Elasticsearch change alert on the average time spent by tasks in the queue on {{cluster_name}}
+- Elasticsearch change alert on the number of currently active queries on {{cluster_name}}
+- Elasticsearch change alert on the number of query cache evictions on {{node_name}}
+- Elasticsearch change alert on the number of request cache evictions on {{node_name}}
+- Elasticsearch change alert on the number of search fetches currently running on {{cluster_name}}
+- Elasticsearch change alert on the total number of evictions from the fielddata cache on {{node_name}}
+- ElasticSearch Cluster has unassigned shards on {{cluster_name}}
+- ElasticSearch Cluster is initializing shards on {{cluster_name}}
+- ElasticSearch Cluster is relocating shards on {{cluster_name}}
+- ElasticSearch Cluster status not green on {{cluster_name}}
+- ElasticSearch does not respond on {{server}}:{{port}}
+- ElasticSearch free space < 10% on {{node_name}}
 - Elasticsearch JVM HEAP memory usage
 - Elasticsearch JVM memory Old usage
 - Elasticsearch JVM memory Young usage
-- Elasticsearch number of current open HTTP connections anomaly detected
+- Elasticsearch number of current open HTTP connections anomaly detected on {{node_name}}
+- Elasticsearch {{policy}} snapshot deletion failure on {{cluster_name}}
+- Elasticsearch {{policy}} snapshot failed on {{cluster_name}}

 <!-- BEGIN_TF_DOCS -->
 ## Requirements
@@ -85,6 +87,8 @@ Creates DataDog monitors with the following checks:
 | [datadog_monitor.request_cache_evictions_change](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.search_query_change](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.search_query_latency](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.slm_snapshot_deletion_failures](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
+| [datadog_monitor.slm_snapshots_failed](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |
 | [datadog_monitor.task_time_in_queue_change](https://registry.terraform.io/providers/DataDog/datadog/latest/docs/resources/monitor) | resource |

 ## Inputs
@@ -258,6 +262,18 @@ Creates DataDog monitors with the following checks:
 | <a name="input_search_query_latency_threshold_warning"></a> [search\_query\_latency\_threshold\_warning](#input\_search\_query\_latency\_threshold\_warning) | Cluster Status warning threshold | `string` | `10` | no |
 | <a name="input_search_query_latency_time_aggregator"></a> [search\_query\_latency\_time\_aggregator](#input\_search\_query\_latency\_time\_aggregator) | Time aggregator for the Cluster Status monitor | `string` | `"avg"` | no |
 | <a name="input_search_query_latency_timeframe"></a> [search\_query\_latency\_timeframe](#input\_search\_query\_latency\_timeframe) | Timeframe for the Cluster Status monitor | `string` | `"last_15m"` | no |
+| <a name="input_slm_snapshot_deletion_failures_enabled"></a> [slm\_snapshot\_deletion\_failures\_enabled](#input\_slm\_snapshot\_deletion\_failures\_enabled) | Flag to enable SLM Snapshot deletion failures monitor | `string` | `"true"` | no |
+| <a name="input_slm_snapshot_deletion_failures_extra_tags"></a> [slm\_snapshot\_deletion\_failures\_extra\_tags](#input\_slm\_snapshot\_deletion\_failures\_extra\_tags) | Extra tags for SLM Snapshot deletion failures monitor | `list(string)` | `[]` | no |
+| <a name="input_slm_snapshot_deletion_failures_message"></a> [slm\_snapshot\_deletion\_failures\_message](#input\_slm\_snapshot\_deletion\_failures\_message) | Custom message for SLM Snapshot deletion failures monitor | `string` | `""` | no |
+| <a name="input_slm_snapshot_deletion_failures_threshold_critical"></a> [slm\_snapshot\_deletion\_failures\_threshold\_critical](#input\_slm\_snapshot\_deletion\_failures\_threshold\_critical) | SLM Snapshot deletion failures critical threshold | `string` | `1` | no |
+| <a name="input_slm_snapshot_deletion_failures_time_aggregator"></a> [slm\_snapshot\_deletion\_failures\_time\_aggregator](#input\_slm\_snapshot\_deletion\_failures\_time\_aggregator) | Time aggregator for SLM Snapshot deletion failures monitor | `string` | `"avg"` | no |
+| <a name="input_slm_snapshot_deletion_failures_timeframe"></a> [slm\_snapshot\_deletion\_failures\_timeframe](#input\_slm\_snapshot\_deletion\_failures\_timeframe) | SLM Snapshot deletion failures timeframe | `string` | `"last_5m"` | no |
+| <a name="input_slm_snapshots_failed_enabled"></a> [slm\_snapshots\_failed\_enabled](#input\_slm\_snapshots\_failed\_enabled) | Flag to enable SLM Snapshots Failed monitor | `string` | `"true"` | no |
+| <a name="input_slm_snapshots_failed_extra_tags"></a> [slm\_snapshots\_failed\_extra\_tags](#input\_slm\_snapshots\_failed\_extra\_tags) | Extra tags for SLM Snapshots Failed monitor | `list(string)` | `[]` | no |
+| <a name="input_slm_snapshots_failed_message"></a> [slm\_snapshots\_failed\_message](#input\_slm\_snapshots\_failed\_message) | Custom message for SLM Snapshots Failed monitor | `string` | `""` | no |
+| <a name="input_slm_snapshots_failed_threshold_critical"></a> [slm\_snapshots\_failed\_threshold\_critical](#input\_slm\_snapshots\_failed\_threshold\_critical) | SLM Snapshots Failed critical threshold | `string` | `1` | no |
+| <a name="input_slm_snapshots_failed_time_aggregator"></a> [slm\_snapshots\_failed\_time\_aggregator](#input\_slm\_snapshots\_failed\_time\_aggregator) | Time aggregator for SLM Snapshots Failed monitor | `string` | `"avg"` | no |
+| <a name="input_slm_snapshots_failed_timeframe"></a> [slm\_snapshots\_failed\_timeframe](#input\_slm\_snapshots\_failed\_timeframe) | SLM Snapshots Failed timeframe | `string` | `"last_5m"` | no |
 | <a name="input_tags"></a> [tags](#input\_tags) | Global variables | `list(string)` | <pre>[<br> "type:database",<br> "provider:elasticsearch",<br> "resource:elasticsearch"<br>]</pre> | no |
 | <a name="input_task_time_in_queue_change_enabled"></a> [task\_time\_in\_queue\_change\_enabled](#input\_task\_time\_in\_queue\_change\_enabled) | Flag to enable Cluster Status monitor | `string` | `"true"` | no |
 | <a name="input_task_time_in_queue_change_extra_tags"></a> [task\_time\_in\_queue\_change\_extra\_tags](#input\_task\_time\_in\_queue\_change\_extra\_tags) | Extra tags for Cluster Status monitor | `list(string)` | `[]` | no |
@@ -295,6 +311,8 @@ Creates DataDog monitors with the following checks:
 | <a name="output_request_cache_evictions_change_id"></a> [request\_cache\_evictions\_change\_id](#output\_request\_cache\_evictions\_change\_id) | id for monitor request\_cache\_evictions\_change |
 | <a name="output_search_query_change_id"></a> [search\_query\_change\_id](#output\_search\_query\_change\_id) | id for monitor search\_query\_change |
 | <a name="output_search_query_latency_id"></a> [search\_query\_latency\_id](#output\_search\_query\_latency\_id) | id for monitor search\_query\_latency |
+| <a name="output_slm_snapshot_deletion_failures_id"></a> [slm\_snapshot\_deletion\_failures\_id](#output\_slm\_snapshot\_deletion\_failures\_id) | id for monitor slm\_snapshot\_deletion\_failures |
+| <a name="output_slm_snapshots_failed_id"></a> [slm\_snapshots\_failed\_id](#output\_slm\_snapshots\_failed\_id) | id for monitor slm\_snapshots\_failed |
 | <a name="output_task_time_in_queue_change_id"></a> [task\_time\_in\_queue\_change\_id](#output\_task\_time\_in\_queue\_change\_id) | id for monitor task\_time\_in\_queue\_change |
 <!-- END_TF_DOCS -->
 ## Related documentation
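
As a rough illustration of how the new inputs and outputs documented in the README might be consumed, here is a minimal sketch of a caller. The module `source` path, the module label `elasticsearch_monitors`, the `team:search` tag, and the output name `slm_snapshots_failed_monitor_id` are assumptions for illustration; any other inputs the module requires are omitted.

```hcl
# Hypothetical caller: the source path and label below are assumptions.
module "elasticsearch_monitors" {
  source = "./database/elasticsearch"

  # Snapshot failure monitor: keep it enabled, widen the evaluation window.
  slm_snapshots_failed_enabled            = "true"
  slm_snapshots_failed_timeframe          = "last_15m"
  slm_snapshots_failed_threshold_critical = 1
  slm_snapshots_failed_extra_tags         = ["team:search"] # illustrative tag

  # Snapshot deletion failure monitor: switched off for this caller.
  slm_snapshot_deletion_failures_enabled = "false"
}

# The README lists an id output per monitor; this re-exports one of them.
output "slm_snapshots_failed_monitor_id" {
  value = module.elasticsearch_monitors.slm_snapshots_failed_id
}
```

The `*_enabled` flags are declared as strings, so they are passed as `"true"` or `"false"` here rather than as booleans.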
79 changes: 79 additions & 0 deletions database/elasticsearch/inputs.tf
@@ -1110,3 +1110,82 @@ variable "not_responding_extra_tags" {
   default     = []
 }

+#
+# SLM - Snapshots Failed
+#
+
+variable "slm_snapshots_failed_enabled" {
+  description = "Flag to enable SLM Snapshots Failed monitor"
+  type        = string
+  default     = "true"
+}
+
+variable "slm_snapshots_failed_message" {
+  description = "Custom message for SLM Snapshots Failed monitor"
+  type        = string
+  default     = ""
+}
+
+variable "slm_snapshots_failed_time_aggregator" {
+  description = "Time aggregator for SLM Snapshots Failed monitor"
+  type        = string
+  default     = "avg"
+}
+
+variable "slm_snapshots_failed_timeframe" {
+  description = "SLM Snapshots Failed timeframe"
+  type        = string
+  default     = "last_5m"
+}
+
+variable "slm_snapshots_failed_threshold_critical" {
+  description = "SLM Snapshots Failed critical threshold"
+  type        = string
+  default     = 1
+}
+
+variable "slm_snapshots_failed_extra_tags" {
+  description = "Extra tags for SLM Snapshots Failed monitor"
+  type        = list(string)
+  default     = []
+}
+
+#
+# SLM - Snapshot deletion failures
+#
+
+variable "slm_snapshot_deletion_failures_enabled" {
+  description = "Flag to enable SLM Snapshot deletion failures monitor"
+  type        = string
+  default     = "true"
+}
+
+variable "slm_snapshot_deletion_failures_message" {
+  description = "Custom message for SLM Snapshot deletion failures monitor"
+  type        = string
+  default     = ""
+}
+
+variable "slm_snapshot_deletion_failures_time_aggregator" {
+  description = "Time aggregator for SLM Snapshot deletion failures monitor"
+  type        = string
+  default     = "avg"
+}
+
+variable "slm_snapshot_deletion_failures_timeframe" {
+  description = "SLM Snapshot deletion failures timeframe"
+  type        = string
+  default     = "last_5m"
+}
+
+variable "slm_snapshot_deletion_failures_threshold_critical" {
+  description = "SLM Snapshot deletion failures critical threshold"
+  type        = string
+  default     = 1
+}
+
+variable "slm_snapshot_deletion_failures_extra_tags" {
+  description = "Extra tags for SLM Snapshot deletion failures monitor"
+  type        = list(string)
+  default     = []
+}
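
The monitor definitions that consume these variables are not shown in this excerpt. The following is only a sketch of how the `slm_snapshots_failed_*` variables might feed a `datadog_monitor` resource, assuming it sits in the same module as the variables above. The metric name, the query shape, the `query alert` type, and the fallback message are all assumptions, not the resource added by this commit.

```hcl
# Sketch only: metric name, query shape, monitor type, and fallback message
# are assumptions, not the resource added by this commit.
resource "datadog_monitor" "slm_snapshots_failed" {
  # The flag is declared as a string, so compare against "true".
  count = var.slm_snapshots_failed_enabled == "true" ? 1 : 0

  name = "Elasticsearch {{policy}} snapshot failed on {{cluster_name}}"
  type = "query alert" # assumed monitor type

  # coalesce() skips the empty default and falls back to a placeholder text.
  message = coalesce(var.slm_snapshots_failed_message, "SLM snapshot failed")

  # Assumed metric and grouping; adjust to what the integration reports.
  query = "${var.slm_snapshots_failed_time_aggregator}(${var.slm_snapshots_failed_timeframe}):max:elasticsearch.slm.snapshots_failed{*} by {cluster_name,policy} >= ${var.slm_snapshots_failed_threshold_critical}"

  monitor_thresholds {
    critical = var.slm_snapshots_failed_threshold_critical
  }

  # var.tags is the module-wide tag list documented in the README inputs.
  tags = concat(var.tags, var.slm_snapshots_failed_extra_tags)
}
```

With `count` driven by the string flag, setting `slm_snapshots_failed_enabled = "false"` would skip creating the monitor in this sketch.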