
Monitor and alert on errors in cron jobs #24347

Merged: 33 commits merged into main on Dec 19, 2024

Conversation

@sgress454 (Contributor) commented Dec 3, 2024

for #19930

Checklist for submitter

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.
  • Input data is properly validated, SELECT * is avoided, SQL injection is prevented (using placeholders for values in statements)
  • Added/updated tests
  • If database migrations are included, checked table schema to confirm autoupdate
  • Manual QA for all new/changed functionality

Details

This PR adds a new feature to the existing monitoring add-on. The add-on will now send an SNS alert whenever a scheduled job like "vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors. The alert contains the job type and the set of errors (there can be multiple, since jobs can have multiple sub-jobs). By default the SNS topic for this new alert is the same as the one used for the existing cron system alerts, but it can be configured to use a separate topic (e.g., the dogfood instance will post to a separate Slack channel).

The actual changes are:

On the server side:

  • Add an errors column to the cron_stats table (json DEFAULT NULL)
  • Add an errors var to the Schedule struct to collect errors from jobs
  • In RunAllJobs, collect the err from each job into the new errors var (see the sketch after this list)
  • Update Schedule.updateStats and CronStats.UpdateCronStats to accept an errors argument
  • If provided, update the errors field of the cron_stats table
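For orientation, here is a minimal sketch of that server-side flow. The identifiers jobs, Run, runAllJobs, and updateStats are assumptions for illustration, not the actual Fleet names:

package sketch

import "context"

type job struct {
	ID  string
	Run func(ctx context.Context) error
}

type Schedule struct {
	jobs []job
}

// runAllJobs runs each sub-job, records any error keyed by the sub-job's ID,
// and hands the collected errors to the stats update so they can be written
// to the new cron_stats.errors JSON column. (The PR wraps this map in a
// CronScheduleErrors type, shown later in this review.)
func (s *Schedule) runAllJobs(ctx context.Context) {
	errs := make(map[string]error)
	for _, j := range s.jobs {
		if err := j.Run(ctx); err != nil {
			errs[j.ID] = err
		}
	}
	s.updateStats(ctx, errs)
}

func (s *Schedule) updateStats(ctx context.Context, errs map[string]error) {
	// In the real code this flows through to CronStats.UpdateCronStats,
	// which persists the JSON-marshaled errors to the database.
}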

On the monitor side:

  • Add new SQL query to look for all completed schedules since last run with non-null errors
  • Send an SNS alert with the job ID, name, and errors (sketched below)
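A rough sketch of what that monitor-side step could look like in Go, using the AWS SDK for Go v1. The column names, the lastRun filter, and the function itself are assumptions; the message format mirrors the log output shown under Testing below:

package sketch

import (
	"database/sql"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sns"
)

// alertOnCronErrors finds recently completed schedules that recorded errors
// and publishes one SNS alert per schedule.
func alertOnCronErrors(db *sql.DB, topicArn, lastRun string) error {
	rows, err := db.Query(
		`SELECT id, name, errors, updated_at
		   FROM cron_stats
		  WHERE status = 'completed'
		    AND errors IS NOT NULL
		    AND updated_at > ?`, lastRun)
	if err != nil {
		return err
	}
	defer rows.Close()

	svc := sns.New(session.Must(session.NewSession()))
	for rows.Next() {
		var id int
		var name, errs, updatedAt string
		if err := rows.Scan(&id, &name, &errs, &updatedAt); err != nil {
			return err
		}
		msg := fmt.Sprintf("Fleet cron '%s' (last updated %s) raised errors during its run:\n%s",
			name, updatedAt, errs)
		if _, err := svc.Publish(&sns.PublishInput{
			Message:  aws.String(msg),
			TopicArn: aws.String(topicArn),
		}); err != nil {
			log.Printf("failed to publish alert for job %d: %v", id, err)
		}
	}
	return rows.Err()
}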

Testing

New automated tests were added for the functional code that gathers and stores errors from cron runs in the database. To test the actual Lambda, I added a row with errors to my cron_stats table, then compiled and ran the Lambda executable locally, pointing it at my local MySQL and LocalStack instances:

2024/12/03 14:43:54 main.go:258: Lambda execution environment not found.  Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"

@sgress454 changed the title from "19930 alert cron failures" to "Monitor and alert on errors in cron jobs" on Dec 3, 2024
@@ -380,6 +380,7 @@ module "monitoring" {
  sns_topic_arns_map = {
    alb_httpcode_5xx            = [module.notify_slack.slack_topic_arn]
    cron_monitoring             = [module.notify_slack.slack_topic_arn]
    cron_job_failure_monitoring = [module.notify_slack_p2.slack_topic_arn]
@sgress454 (Contributor, Author) commented:

If this is not specified, the same ARN as cron_monitoring will be used.

Comment on lines 471 to 481
module "notify_slack_p2" {
source = "terraform-aws-modules/notify-slack/aws"
version = "5.5.0"

sns_topic_name = "fleet-dogfood-p2-alerts"

slack_webhook_url = var.slack_webhook
slack_channel = "#help-p2"
slack_username = "monitoring"
}

@sgress454 (Contributor, Author) commented:

Adding a new topic to send these alerts to P2 channel.

Comment on lines +151 to +165
type CronScheduleErrors map[string]error

func (cse CronScheduleErrors) MarshalJSON() ([]byte, error) {
	// Create a temporary map for JSON serialization
	stringMap := make(map[string]string)
	for key, err := range cse {
		if err != nil {
			stringMap[key] = err.Error()
		} else {
			stringMap[key] = ""
		}
	}
	// Serialize the temporary map to JSON
	return json.Marshal(stringMap)
}
@sgress454 (Contributor, Author) commented:

New type to hold errors from a cron schedule and turn them into a JSON string.
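A quick usage sketch of the type above (assuming encoding/json, errors, and fmt are imported; the keys and error message here are illustrative, not taken from a real run):

errs := CronScheduleErrors{
	"mdm_apple_profile_manager": errors.New("something went wrong"),
	"vulnerabilities":           nil, // a job that ran without error serializes as ""
}
b, err := json.Marshal(errs) // invokes the MarshalJSON method above
if err != nil {
	fmt.Println(err)
}
fmt.Println(string(b))
// {"mdm_apple_profile_manager":"something went wrong","vulnerabilities":""}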

type NullEvent struct{}
type (
	NullEvent struct{}
	SNSTopicArnsMap map[string]string
@sgress454 (Contributor, Author) commented:

Created a map to hold the different topics used for cron system vs cron schedule failure alerts.
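A sketch of the lookup-with-fallback behavior this enables, consistent with the earlier note that cron_job_failure_monitoring falls back to the cron_monitoring ARN when not set. The helper name topicArnFor is an assumption, not the actual Lambda code:

// topicArnFor returns the ARN configured for the given alert key; if that key
// has no dedicated topic, it falls back to the cron_monitoring ARN.
func topicArnFor(m SNSTopicArnsMap, key string) string {
	if arn, ok := m[key]; ok && arn != "" {
		return arn
	}
	return m["cron_monitoring"]
}

With this shape, a deployment that only configures cron_monitoring still receives the new failure alerts, just on the existing topic.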

@rfairburn (Contributor) left a comment:

Other than needing to separate out the dogfood-specific changes, this looks good from the terraform side and what is in the monitoring lambda. I'm not a code owner for the rest of this, so I will defer review of the rest to others.

@rfairburn (Contributor) commented:

We should probably PR the change to dogfood itself separately from the rest of these changes, as the terraform modules use tagged versions.

We won't have tagged a new version of the terraform before this PR merges, and therefore we cannot specify the version of the monitoring module properly.

See https://github.com/fleetdm/fleet/blob/main/infrastructure/dogfood/terraform/aws-tf-module/main.tf#L371-L372 for how the tags are referenced.

@rfairburn (Contributor) commented:

It looks like dogfood is also using a pretty old version of the monitoring module and may require some additional changes regardless. tf-mod-addon-monitoring-v1.4.0 is the current latest version, and I would tag these new changes as tf-mod-addon-monitoring-v1.5.0.

I can handle the integration portion if desired.

@sgress454 (Contributor, Author) replied:

I can handle the integration portion if desired.

@rfairburn if you can add the new Slack webhook and a SLACK_G_HELP_P2_WEBHOOK_URL to this and the confidential repo, that'd be 💯; I can handle the rest.


sns_topic_name = "fleet-dogfood-p2-alerts"

slack_webhook_url = var.slack_webhook
@rfairburn (Contributor) commented:

This will require a separate webhook from the other one, as the webhook URLs are tied to channels, even though we have to specify a slack_channel separately in the module.

We'll have to configure a new variable, and also update the GitHub Action to populate it from GitHub secrets and the like. This could all be handled separately after the feature is merged, similarly to the monitoring module itself, per my previous comment.

@sgress454 (Contributor, Author) commented Dec 3, 2024:

I thought that seemed too good to be true 😏. Makes sense to separate it out as you said, since the code defaults to using the P1 channel for the new alert if a separate P2 topic ARN isn't specified.

@sgress454 (Contributor, Author) commented:

@sgress454 Looks like some Go tests are failing:

FAIL: TestUpdateAllCronStatsForInstance (5.31s)
FAIL: TestCleanupCronStats (5.59s)

🤦 fixed

@getvictor (Member) commented Dec 11, 2024:

@mna Looks like this test is flaky. Filed #24652

I've seen it fail before:

=== NAME  TestAsyncLastSeen/timed_flush
    async_test.go:80: 
        	Error Trace:	/home/runner/work/fleet/fleet/server/mdm/nanomdm/storage/mysql/async_test.go:80
        	Error:      	Not equal: 
        	            	expected: "12|3|456"
        	            	actual  : "12|3|4|56"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-12|3|456
        	            	+12|3|4|56
        	Test:       	TestAsyncLastSeen/timed_flush

@getvictor (Member) previously approved these changes Dec 11, 2024 and left a comment:

Looks good, assuming tests pass.

@rfairburn (Contributor) commented:

We should also update the .header.md in the terraform/addons/monitoring section with updated instructions, and then, after running a local terraform init, run terraform-docs markdown . > README.md to update the documentation for the monitoring module so it reflects the latest changes.

I didn't think of this until just now.

@sgress454 (Contributor, Author) replied:

We should also update the .header.md in the terraform/addons/monitoring section with updated instructions, and then, after running a local terraform init, run terraform-docs markdown . > README.md to update the documentation for the monitoring module so it reflects the latest changes.

@rfairburn I tried this and it added my doc for the new cron_job_failure_monitoring option, but also made some updates I didn't understand, like setting the providers section to:

Name     Version
archive  n/a
aws      n/a
null     n/a

Should I just back out that change before committing? Or is there something I'm missing in my terraform-docs command?

@rfairburn (Contributor) replied:

Did you run a terraform init prior to running the docs? It uses that to determine the versions.


codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 75.67568% with 9 lines in your changes missing coverage. Please review.

Project coverage is 63.59%. Comparing base (e472b80) to head (d347593).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...tables/20241210140021_AddErrorsToCronStatsTable.go 54.54% 4 Missing and 1 partial ⚠️
server/fleet/cron_schedules.go 70.00% 2 Missing and 1 partial ⚠️
server/datastore/mysql/cron_stats.go 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #24347   +/-   ##
=======================================
  Coverage   63.59%   63.59%           
=======================================
  Files        1603     1604    +1     
  Lines      152006   152036   +30     
  Branches     3960     3960           
=======================================
+ Hits        96663    96687   +24     
- Misses      47647    47651    +4     
- Partials     7696     7698    +2     
Flag Coverage Δ
backend 64.39% <75.67%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@sgress454 (Contributor, Author) commented:

@rfairburn README updated!

@getvictor Fixed another test failure; all green and gtg if you can 👍 again please.

@rfairburn (Contributor) left a comment:

Infra changes look good.

@getvictor (Member) left a comment:

LGTM

@sgress454 merged commit 6bd9cc8 into main on Dec 19, 2024; 29 checks passed.
@sgress454 deleted the 19930-alert-cron-failures branch on December 19, 2024 at 21:55.