Monitor and alert on errors in cron jobs #24347
Conversation
@@ -380,6 +380,7 @@ module "monitoring" {
  sns_topic_arns_map = {
    alb_httpcode_5xx            = [module.notify_slack.slack_topic_arn]
    cron_monitoring             = [module.notify_slack.slack_topic_arn]
    cron_job_failure_monitoring = [module.notify_slack_p2.slack_topic_arn]
If this is not specified, the same ARN as cron_monitoring will be used.
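For illustration only, here is a minimal Go sketch of that fallback, assuming a map type like the SNSTopicArnsMap added in this PR; the topicFor helper and the example ARN are hypothetical and not the Lambda's actual code:

package main

import "fmt"

// SNSTopicArnsMap maps an alert name to the SNS topic ARN it publishes to.
type SNSTopicArnsMap map[string]string

// topicFor returns the ARN configured for key and falls back to the
// cron_monitoring ARN when no dedicated ARN is set (hypothetical helper).
func (m SNSTopicArnsMap) topicFor(key string) string {
	if arn, ok := m[key]; ok && arn != "" {
		return arn
	}
	return m["cron_monitoring"]
}

func main() {
	topics := SNSTopicArnsMap{
		"cron_monitoring": "arn:aws:sns:us-east-2:123456789012:fleet-dogfood-alerts", // example ARN
	}
	// No cron_job_failure_monitoring entry, so the cron_monitoring ARN is returned.
	fmt.Println(topics.topicFor("cron_job_failure_monitoring"))
}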
module "notify_slack_p2" { | ||
source = "terraform-aws-modules/notify-slack/aws" | ||
version = "5.5.0" | ||
|
||
sns_topic_name = "fleet-dogfood-p2-alerts" | ||
|
||
slack_webhook_url = var.slack_webhook | ||
slack_channel = "#help-p2" | ||
slack_username = "monitoring" | ||
} | ||
|
Adding a new topic to send these alerts to the P2 channel.
type CronScheduleErrors map[string]error

func (cse CronScheduleErrors) MarshalJSON() ([]byte, error) {
	// Create a temporary map for JSON serialization
	stringMap := make(map[string]string)
	for key, err := range cse {
		if err != nil {
			stringMap[key] = err.Error()
		} else {
			stringMap[key] = ""
		}
	}
	// Serialize the temporary map to JSON
	return json.Marshal(stringMap)
}
New type to hold errors from a cron schedule and turn them into a JSON string.
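As a small usage sketch, marshaling a CronScheduleErrors value yields a plain string-to-string JSON object. The job name and error message below are made up, and the type and method are repeated from the diff above only to keep the snippet self-contained:

package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// CronScheduleErrors maps a job name to the error it returned, as in the diff above.
type CronScheduleErrors map[string]error

// MarshalJSON converts each error to its message (empty string for nil errors).
func (cse CronScheduleErrors) MarshalJSON() ([]byte, error) {
	stringMap := make(map[string]string)
	for key, err := range cse {
		if err != nil {
			stringMap[key] = err.Error()
		} else {
			stringMap[key] = ""
		}
	}
	return json.Marshal(stringMap)
}

func main() {
	cse := CronScheduleErrors{
		"cron_vulnerabilities": errors.New("failed to sync CVE feed"), // hypothetical error
	}
	out, _ := json.Marshal(cse)
	fmt.Println(string(out)) // {"cron_vulnerabilities":"failed to sync CVE feed"}
}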
type NullEvent struct{}
type (
	NullEvent struct{}
	SNSTopicArnsMap map[string]string
Created a map to hold the different topics used for cron system vs cron schedule failure alerts.
Other than needing to separate out the dogfood-specific changes, this looks good from the terraform side and what is in the monitoring lambda. I'm not a code owner for the rest of this, so I will defer review of the rest to others.
We should probably PR the change to dogfood itself separately from the rest of these changes, as the terraform modules use tagged versions.
We won't have tagged a new version of the terraform module before this PR merges, and therefore we cannot specify the version of the monitoring module properly.
See https://github.com/fleetdm/fleet/blob/main/infrastructure/dogfood/terraform/aws-tf-module/main.tf#L371-L372 for how the tags are referenced.
It looks like dogfood is also using a pretty old version of the monitoring module and may require some additional changes regardless. tf-mod-addon-monitoring-v1.4.0 is the current latest version, and I would tag these new changes as tf-mod-addon-monitoring-v1.5.0.
I can handle the integration portion if desired.
I can handle the integration portion if desired.

@rfairburn if you can add the new Slack webhook and a SLACK_G_HELP_P2_WEBHOOK_URL to this and the confidential repo, that'd be 💯; I can handle the rest.
sns_topic_name = "fleet-dogfood-p2-alerts"

slack_webhook_url = var.slack_webhook
This will require a separate webhook from the other, as webhook URLs are tied to channels even though we have to specify a slack_channel separately in the module.
We'll have to configure a new variable and also the GitHub Action to populate it from GitHub secrets and the like. This could all be handled separately after the feature is merged, similarly to the monitoring module itself per my previous comment.
I thought that seemed too good to be true 😏. Makes sense to separate it out as you said, since the code defaults to using the P1 channel for the new alert if a separate P2 topic ARN isn't specified.
🤦 fixed
@mna Looks like this test is flaky. Filed #24652. I've seen it fail before:
Looks good, assuming tests pass.
We should also document this in the README. I didn't think of this initially until just now.
@rfairburn I tried this and it added my doc for the new variable.
Should I just back out that change before committing? Or is there something I'm missing in my setup?
Did you run a
Codecov Report

@@           Coverage Diff           @@
##             main   #24347   +/-   ##
=======================================
  Coverage   63.59%   63.59%
=======================================
  Files        1603     1604     +1
  Lines      152006   152036    +30
  Branches     3960     3960
=======================================
+ Hits        96663    96687    +24
- Misses      47647    47651     +4
- Partials     7696     7698     +2
@rfairburn readme updated! @getvictor fixed another test failure, all green and gtg if you can 👍 again please.
infra changes look good.
LGTM
for #19930
Checklist for submitter
- Changes file added for user-visible changes in changes/, orbit/changes/, or ee/fleetd-chrome/changes.
- SELECT * is avoided, SQL injection is prevented (using placeholders for values in statements).

Details
This PR adds a new feature to the existing monitoring add-on. The add-on will now send an SNS alert whenever a scheduled job like "vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors. The alert contains the job type and the set of errors (there can be multiple, since jobs can have multiple sub-jobs). By default the SNS topic for this new alert is the same as the one for the existing cron system alerts, but it can be configured to use a separate topic (e.g. the dogfood instance will post to a separate Slack channel).
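To make that flow concrete before the detailed list of changes below, here is a minimal Go sketch of the idea: collect per-job errors during a schedule run, keyed by job name. The job type, the runAllJobs helper, and the job names and error messages are illustrative only and are not the actual Fleet implementation:

package main

import (
	"errors"
	"fmt"
)

// job is one sub-job of a schedule (e.g. part of the "vulnerabilities" run).
type job struct {
	name string
	run  func() error
}

// runAllJobs runs each job and collects the error from every job that fails,
// keyed by job name. In the real flow these errors would be serialized (see the
// CronScheduleErrors type earlier in this PR) and stored with the run's cron
// stats, where the monitoring Lambda can pick them up and publish an SNS alert.
func runAllJobs(jobs []job) map[string]error {
	errs := make(map[string]error)
	for _, j := range jobs {
		if err := j.run(); err != nil {
			errs[j.name] = err
		}
	}
	return errs
}

func main() {
	jobs := []job{
		{name: "sync_cve_feed", run: func() error { return errors.New("download failed") }}, // hypothetical
		{name: "cleanup", run: func() error { return nil }},
	}
	if errs := runAllJobs(jobs); len(errs) > 0 {
		fmt.Println(errs) // map[sync_cve_feed:download failed]
	}
}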
The actual changes are:

On the server side:
- Schedule struct: new field to collect errors from jobs
- RunAllJobs: collect err from each job into the new errors var
- Schedule.updateStats and CronStats.UpdateCronStats: accept an errors argument

On the monitor side:
Testing
New automated testing was added for the functional code that gathers and stores errors from cron runs in the database. To test the actual Lambda, I added a row in my cron_stats table with errors, then compiled and ran the Lambda executable locally, pointing it to my local MySQL and localstack instances: