
Monitor and alert on errors in cron jobs #24347

Merged: 33 commits merged into main on Dec 19, 2024

Conversation

@sgress454 (Contributor) commented Dec 3, 2024

for #19930

Checklist for submitter

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.
  • Input data is properly validated, SELECT * is avoided, SQL injection is prevented (using placeholders for values in statements)
  • Added/updated tests
  • If database migrations are included, checked table schema to confirm autoupdate
  • Manual QA for all new/changed functionality

Details

This PR adds a new feature to the existing monitoring add-on. The add-on will now send an SNS alert whenever a scheduled job like "vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors. The alert contains the job type and the set of errors (there can be multiple, since jobs can have multiple sub-jobs). By default the SNS topic for this new alert is the same as the one used for the existing cron system alerts, but it can be configured to use a separate topic (e.g., the dogfood instance will post to a separate Slack channel).

The actual changes are:

On the server side:

  • Add an errors column to the cron_stats table (json DEFAULT NULL)
  • Add an errors var to the Schedule struct to collect errors from jobs
  • In RunAllJobs, collect the err from each job into the new errors var (see the sketch after this list)
  • Update Schedule.updateStats and CronStats.UpdateCronStats to accept an errors argument
  • If provided, update the errors field of the cron_stats table
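For orientation, here is a minimal sketch of that server-side flow. The identifiers jobs, Run, runAllJobs, and updateStats are assumptions for illustration, not the actual Fleet names:

package sketch

import "context"

type job struct {
	ID  string
	Run func(ctx context.Context) error
}

type Schedule struct {
	jobs []job
}

// runAllJobs runs each sub-job, records any error keyed by the sub-job's ID,
// and hands the collected errors to the stats update so they can be written
// to the new cron_stats.errors JSON column. (The PR wraps this map in a
// CronScheduleErrors type, shown later in this review.)
func (s *Schedule) runAllJobs(ctx context.Context) {
	errs := make(map[string]error)
	for _, j := range s.jobs {
		if err := j.Run(ctx); err != nil {
			errs[j.ID] = err
		}
	}
	s.updateStats(ctx, errs)
}

func (s *Schedule) updateStats(ctx context.Context, errs map[string]error) {
	// In the real code this flows through to CronStats.UpdateCronStats,
	// which persists the JSON-marshaled errors to the database.
}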

On the monitor side:

  • Add new SQL query to look for all completed schedules since last run with non-null errors
  • Send an SNS alert with the job ID, name, and errors (sketched below)
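A rough sketch of what that monitor-side step could look like in Go, using the AWS SDK for Go v1. The column names, the lastRun filter, and the function itself are assumptions; the message format mirrors the log output shown under Testing below:

package sketch

import (
	"database/sql"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sns"
)

// alertOnCronErrors finds recently completed schedules that recorded errors
// and publishes one SNS alert per schedule.
func alertOnCronErrors(db *sql.DB, topicArn, lastRun string) error {
	rows, err := db.Query(
		`SELECT id, name, errors, updated_at
		   FROM cron_stats
		  WHERE status = 'completed'
		    AND errors IS NOT NULL
		    AND updated_at > ?`, lastRun)
	if err != nil {
		return err
	}
	defer rows.Close()

	svc := sns.New(session.Must(session.NewSession()))
	for rows.Next() {
		var id int
		var name, errs, updatedAt string
		if err := rows.Scan(&id, &name, &errs, &updatedAt); err != nil {
			return err
		}
		msg := fmt.Sprintf("Fleet cron '%s' (last updated %s) raised errors during its run:\n%s",
			name, updatedAt, errs)
		if _, err := svc.Publish(&sns.PublishInput{
			Message:  aws.String(msg),
			TopicArn: aws.String(topicArn),
		}); err != nil {
			log.Printf("failed to publish alert for job %d: %v", id, err)
		}
	}
	return rows.Err()
}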

Testing

New automated tests were added for the functional code that gathers and stores errors from cron runs in the database. To test the actual Lambda, I added a row with errors to my cron_stats table, then compiled and ran the Lambda executable locally, pointing it at my local MySQL and LocalStack instances:

2024/12/03 14:43:54 main.go:258: Lambda execution environment not found.  Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"

@sgress454 changed the title from "19930 alert cron failures" to "Monitor and alert on errors in cron jobs" on Dec 3, 2024
@@ -380,6 +380,7 @@ module "monitoring" {
  sns_topic_arns_map = {
    alb_httpcode_5xx            = [module.notify_slack.slack_topic_arn]
    cron_monitoring             = [module.notify_slack.slack_topic_arn]
    cron_job_failure_monitoring = [module.notify_slack_p2.slack_topic_arn]
@sgress454 (Contributor, Author) commented:

If this is not specified, the same ARN as cron_monitoring will be used.

Comment on lines 471 to 481
module "notify_slack_p2" {
source = "terraform-aws-modules/notify-slack/aws"
version = "5.5.0"

sns_topic_name = "fleet-dogfood-p2-alerts"

slack_webhook_url = var.slack_webhook
slack_channel = "#help-p2"
slack_username = "monitoring"
}

@sgress454 (Contributor, Author) commented:

Adding a new topic to send these alerts to P2 channel.

Comment on lines +151 to +165
type CronScheduleErrors map[string]error

func (cse CronScheduleErrors) MarshalJSON() ([]byte, error) {
	// Create a temporary map for JSON serialization
	stringMap := make(map[string]string)
	for key, err := range cse {
		if err != nil {
			stringMap[key] = err.Error()
		} else {
			stringMap[key] = ""
		}
	}
	// Serialize the temporary map to JSON
	return json.Marshal(stringMap)
}
@sgress454 (Contributor, Author) commented:

New type to hold errors from a cron schedule and turn them into a JSON string.
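A quick usage sketch of the type above (assuming encoding/json, errors, and fmt are imported; the keys and error message here are illustrative, not taken from a real run):

errs := CronScheduleErrors{
	"mdm_apple_profile_manager": errors.New("something went wrong"),
	"vulnerabilities":           nil, // a job that ran without error serializes as ""
}
b, err := json.Marshal(errs) // invokes the MarshalJSON method above
if err != nil {
	fmt.Println(err)
}
fmt.Println(string(b))
// {"mdm_apple_profile_manager":"something went wrong","vulnerabilities":""}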

type NullEvent struct{}
type (
	NullEvent struct{}
	SNSTopicArnsMap map[string]string
@sgress454 (Contributor, Author) commented:

Created a map to hold the different topics used for cron system vs cron schedule failure alerts.
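A sketch of the lookup-with-fallback behavior this enables, consistent with the earlier note that cron_job_failure_monitoring falls back to the cron_monitoring ARN when not set. The helper name topicArnFor is an assumption, not the actual Lambda code:

// topicArnFor returns the ARN configured for the given alert key; if that key
// has no dedicated topic, it falls back to the cron_monitoring ARN.
func topicArnFor(m SNSTopicArnsMap, key string) string {
	if arn, ok := m[key]; ok && arn != "" {
		return arn
	}
	return m["cron_monitoring"]
}

With this shape, a deployment that only configures cron_monitoring still receives the new failure alerts, just on the existing topic.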

@rfairburn (Contributor) left a comment:

Other than needing to separate out the dogfood-specific changes, this looks good from the terraform side and what is in the monitoring lambda. I'm not a code owner for the rest of this, so I will defer review of the rest to others.

@rfairburn (Contributor) commented:

We should probably PR the change to dogfood itself separately from the rest of these changes, as the terraform modules use tagged versions.

We won't have tagged a new version of the terraform before this PR merges, and therefore we cannot specify the version of the monitoring module properly.

See https://github.com/fleetdm/fleet/blob/main/infrastructure/dogfood/terraform/aws-tf-module/main.tf#L371-L372 for how the tags are referenced.

@rfairburn (Contributor) commented:

It looks like dogfood is also using a pretty old version of the monitoring module and may require some additional changes regardless. tf-mod-addon-monitoring-v1.4.0 is the current latest version, and I would tag these new changes as tf-mod-addon-monitoring-v1.5.0.

I can handle the integration portion if desired.

@sgress454 (Contributor, Author) replied:

I can handle the integration portion if desired.

@rfairburn if you can add the new Slack webhook and a SLACK_G_HELP_P2_WEBHOOK_URL to this and the confidential repo, that'd be 💯; I can handle the rest.


sns_topic_name = "fleet-dogfood-p2-alerts"

slack_webhook_url = var.slack_webhook
@rfairburn (Contributor) commented:

This will require a separate webhook from the other one, as the webhook URLs are tied to channels, even though we have to specify a slack_channel separately in the module.

We'll have to configure a new variable, and also update the GitHub Action to populate it from GitHub secrets and the like. This could all be handled separately after the feature is merged, similarly to the monitoring module itself, per my previous comment.

@sgress454 (Contributor, Author) commented Dec 3, 2024:

I thought that seemed too good to be true 😏. Makes sense to separate it out as you said, since the code defaults to using the P1 channel for the new alert if a separate P2 topic ARN isn't specified.

@sgress454 (Contributor, Author) commented:

@sgress454 Looks like some Go tests are failing:

FAIL: TestUpdateAllCronStatsForInstance (5.31s)
FAIL: TestCleanupCronStats (5.59s)

🤦 fixed

@getvictor (Member) commented Dec 11, 2024:

@mna Looks like this test is flaky. Filed #24652

I've seen it fail before:

=== NAME  TestAsyncLastSeen/timed_flush
    async_test.go:80: 
        	Error Trace:	/home/runner/work/fleet/fleet/server/mdm/nanomdm/storage/mysql/async_test.go:80
        	Error:      	Not equal: 
        	            	expected: "12|3|456"
        	            	actual  : "12|3|4|56"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-12|3|456
        	            	+12|3|4|56
        	Test:       	TestAsyncLastSeen/timed_flush

@getvictor (Member) previously approved these changes Dec 11, 2024 and left a comment:

Looks good, assuming tests pass.

@rfairburn (Contributor) commented:

We should also update the .header.md in the terraform/addons/monitoring section with updated instructions, and then, after running a local terraform init, run terraform-docs markdown . > README.md to update the documentation for the monitoring module so it reflects the latest changes.

I didn't think of this until just now.

@sgress454 (Contributor, Author) replied:

We should also update the .header.md in the terraform/addons/monitoring section with updated instructions, and then, after running a local terraform init, run terraform-docs markdown . > README.md to update the documentation for the monitoring module so it reflects the latest changes.

@rfairburn I tried this and it added my doc for the new cron_job_failure_monitoring option, but also made some updates I didn't understand, like setting the providers section to:

Name     Version
archive  n/a
aws      n/a
null     n/a

Should I just back out that change before committing? Or is there something I'm missing in my terraform-docs command?

@rfairburn (Contributor) replied:

Did you run a terraform init prior to running the docs? It uses that to determine the versions.


codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 75.67568% with 9 lines in your changes missing coverage. Please review.

Project coverage is 63.59%. Comparing base (e472b80) to head (d347593).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...tables/20241210140021_AddErrorsToCronStatsTable.go 54.54% 4 Missing and 1 partial ⚠️
server/fleet/cron_schedules.go 70.00% 2 Missing and 1 partial ⚠️
server/datastore/mysql/cron_stats.go 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #24347   +/-   ##
=======================================
  Coverage   63.59%   63.59%           
=======================================
  Files        1603     1604    +1     
  Lines      152006   152036   +30     
  Branches     3960     3960           
=======================================
+ Hits        96663    96687   +24     
- Misses      47647    47651    +4     
- Partials     7696     7698    +2     
Flag Coverage Δ
backend 64.39% <75.67%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@sgress454 (Contributor, Author) commented:

@rfairburn README updated!

@getvictor Fixed another test failure; all green and gtg if you can 👍 again please.

@rfairburn (Contributor) left a comment:

Infra changes look good.

@getvictor (Member) left a comment:

LGTM

@sgress454 merged commit 6bd9cc8 into main on Dec 19, 2024; 29 checks passed.
@sgress454 deleted the 19930-alert-cron-failures branch on December 19, 2024 at 21:55.