Skip to content

schubergphilis/terraform-aws-mcaf-resource-scheduler

Repository files navigation

🕐 Resource Scheduler

This composition scheduler can be used to schedule compositions with components that can be temporarily stopped, like EC2 instances, RDS instances/clusters and Redshift clusters. It's aimed at:

  • Environments that only run during office hours
  • Environments that only run on-demand

Goal is to minimize cost; either by scaling resources to zero or to their most minimal configuration.

It has the following high level architecture:

Architecture

📚 Features

Supported resource types

The following resoure types and actions can be controlled via this scheduler:

  • EC2 Auto-Scaling Groups
    • Max size, min size, desired capacity
  • EC2 Instances
    • Stop/start the instance
  • ECS Services
    • Set desired tasks
  • EFS File Systems
    • Set provisioned throughput
  • FSx Windows File Systems
    • Set throughput capacity
  • RDS Clusters
    • Stop/start the cluster
  • RDS Instances
    • Stop/start the instance
  • Redshift Clusters
    • Pause/resume the cluster

RDS only support stopping instances / clusters that are not running in multi-AZ mode.

Timezone aware

Schedules are timezone aware so there's no need to change them on any DST changes, keeping https://docs.aws.amazon.com/scheduler/latest/UserGuide/schedule-types.html#daylist-savings-time in mind.

Control order and wait times

The order in which resources are started- and stopped is controllable. These operations are asynchronous and wait times between activities are configurable.

The stop procedure of a schedule is the reverse of the start procedure.

Respecting maintenance- and backup windows

When setting up scheduling, the scheduler will check if configured maintenance- and backup windows on RDS instances / clusters and Redshift clusters overlap with the scheduled start- and stop times of a composition. If they don't, an additional schedule will be setup to make sure the cluster is started during the scheduled maintenance- and backup windows.

Webhooks

Optionally a pair of webhooks can be deployed to trigger starting or stopping an environments based on external events. This allows for on-demand starting or stopping of environments by - for example - a service management ticket, a Slack integration, a custom frontend, a Github workflow, etc.

Webhooks require an API key and can be setup to only allow certain IP addresses. A POST request has to be made to one of the outputted endpoints to trigger a webhook.

🔧 Setup

To setup this module requires a composition of the resources that need to be managed. Based on that input, a state machine is generated. Be aware that the composition dictates order: the resources in the composition are controlled in that order when the composition is started. The order is reversed when the composition is stopped.

Please see the examples folder for code examples of how to implement this module.

Configuration

Resource types require certain parameters in order to function. It's recommended to fill the parameters by refering to existing resources in your TF code.

Resource Resource Type Required Parameters
EC2 Auto-Scaling Group auto_scaling_group name: the name of the auto-scaling group to control
min: the minimal number of instances to run (used on start of composition)
max: the maximum number of instances to run (used on start of composition)
desired: the desired number of instances to run (used on start of composition)
EC2 Instance ec2_instance id: the ID of the instance to control
ECS Service ecs_service cluster_name: the name of the ECS cluster the task runs on
desired: the desired number of tasks (used on start of composition)
name: the name of the ECS task to control
EFS Filesystem efs_file_system id: the ID of the filesystem to control
provisioned_throughput_in_mibps: the provisioned throughput of the filesystem (used on start of composition)
FSX Windows Filesystem fsx_windows_file_system id: the ID of the filesystem to control
throughput_capacity: the throughput capacity of the filesystem (used on start of composition)
RDS Cluster rds_cluster id: the ID of the cluster to control
RDS Instance rds_instance id: the ID of the instance to control
Redshift Cluster redshift_cluster id: the ID of the cluster to control

🚨 Limitations

Schedule mixing

Most of the supported services allow for their own methods of scheduling, either with or without timezone support. This module can not detect existing schedules so overlapping schedules could contradict each other, resulting in unexpected behaviour.

RDS snapshot retention

When an RDS instance is stopped, the stopped time does not count towards the retention time of snapshots. Effectively this means that any snapshot will be retained longer than expected, including any associated cost.

This behaviour is described here: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ManagingAutomatedBackups.html

AWS Backup integration

For AWS Backup to be able to create a backup of an RDS instance, it needs to be running. This module is currently not capable of automatically detecting the schedule of any AWS Backup plans. In such cases you will need to manually align the schedules.

FSx Windows File Systems

FSx Windows File System provisioned throughput is quite expensive. It could be worthwhile to scale down on provisioned throughput during off-hours. Changing provisioned throughput also affects network I/O, memory and disk I/O of the file system.

See https://docs.aws.amazon.com/fsx/latest/WindowsGuide/performance.html for more information.

Throughput can't be changed until 6 hours after the last change was requested. After a throughput change an optimization phase takes place that could take longer than 6 hours; depending on the size off the file system. Throughput can also not be changed during this optimization phase. Use with care.

See https://docs.aws.amazon.com/fsx/latest/WindowsGuide/managing-storage-configuration.html#managing-storage-capacity for more information.

🧪 Development

This module uses the integrated Lambda to abstract some of the more complex functionality away. For redistribution purposes, the following dependencies have been vendorized:

Adding support for more resources

This module is extendable. To add support for more resources, follow these general steps:

  1. In the Lambda code

    1. Add a test
    2. Add the resource controller
    3. Add the resource to the handler
    4. Add the resource to the schema
    5. Make sure tests pass
  2. In the Terraform code

    1. Add the resource to the validations of the resource_composition variable
    2. Add IAM permissions as a new dynamic block to the Lambda policy document
    3. Add an example in the examples folder
    4. Update this README (including the architecture image if required)
    5. Make sure validations pass

Requirements

Name Version
terraform >= 1.5.0
archive >= 2.4.0
aws >= 5.10.0

Providers

Name Version
archive >= 2.4.0
aws >= 5.10.0

Modules

Name Source Version
api_gateway_role github.com/schubergphilis/terraform-aws-mcaf-role v0.3.3
eventbridge_scheduler_role github.com/schubergphilis/terraform-aws-mcaf-role v0.3.3
lambda_role github.com/schubergphilis/terraform-aws-mcaf-role v0.3.3
scheduler_lambda schubergphilis/mcaf-lambda/aws ~> 1.1.2
step_functions_role github.com/schubergphilis/terraform-aws-mcaf-role v0.3.3

Resources

Name Type
aws_api_gateway_api_key.webhooks resource
aws_api_gateway_deployment.webhooks resource
aws_api_gateway_integration.webhook_start_composition resource
aws_api_gateway_integration.webhook_stop_composition resource
aws_api_gateway_integration_response.webhook_start_composition resource
aws_api_gateway_integration_response.webhook_stop_composition resource
aws_api_gateway_method.webhook_start_composition resource
aws_api_gateway_method.webhook_stop_composition resource
aws_api_gateway_method_response.webhook_start_composition_200 resource
aws_api_gateway_method_response.webhook_stop_composition_200 resource
aws_api_gateway_method_settings.webhooks resource
aws_api_gateway_model.webhooks resource
aws_api_gateway_resource.webhook_start_composition resource
aws_api_gateway_resource.webhook_stop_composition resource
aws_api_gateway_rest_api.webhooks resource
aws_api_gateway_rest_api_policy.webhooks resource
aws_api_gateway_stage.webhooks resource
aws_api_gateway_usage_plan.webhooks resource
aws_api_gateway_usage_plan_key.webhooks resource
aws_cloudwatch_log_group.webhooks resource
aws_scheduler_schedule.rds_cluster_backup_start resource
aws_scheduler_schedule.rds_cluster_backup_stop resource
aws_scheduler_schedule.rds_cluster_maintenance_start resource
aws_scheduler_schedule.rds_cluster_maintenance_stop resource
aws_scheduler_schedule.rds_instance_backup_start resource
aws_scheduler_schedule.rds_instance_backup_stop resource
aws_scheduler_schedule.rds_instance_maintenance_start resource
aws_scheduler_schedule.rds_instance_maintenance_stop resource
aws_scheduler_schedule.redshift_cluster_maintenance_start resource
aws_scheduler_schedule.redshift_cluster_maintenance_stop resource
aws_scheduler_schedule.start_composition resource
aws_scheduler_schedule.stop_composition resource
aws_scheduler_schedule_group.scheduler resource
aws_sfn_state_machine.composition_start resource
aws_sfn_state_machine.composition_stop resource
archive_file.scheduler_source data source
aws_caller_identity.current data source
aws_db_instance.managed data source
aws_iam_policy_document.api_gateway_policy data source
aws_iam_policy_document.eventbridge_scheduler_policy data source
aws_iam_policy_document.lambda_policy data source
aws_iam_policy_document.step_functions_policy data source
aws_lambda_invocation.rds_cluster_backup_windows data source
aws_lambda_invocation.rds_cluster_maintenance_windows data source
aws_lambda_invocation.rds_instance_backup_windows data source
aws_lambda_invocation.rds_instance_maintenance_windows data source
aws_lambda_invocation.redshift_cluster_maintenance_windows data source
aws_rds_cluster.managed data source
aws_redshift_cluster.managed data source
aws_region.current data source

Inputs

Name Description Type Default Required
composition_name The name of the controlled composition string n/a yes
kms_key_arn The ARN of the KMS key to use with the Lambda function string n/a yes
resource_composition Resource composition
list(object({
type = string
params = map(any)
}))
n/a yes
start_resources_at Resources start cron expression in selected timezone string n/a yes
stop_resources_at Resources stop cron expression in selected timezone string n/a yes
tags Mapping of tags map(string) {} no
timezone Timezone to execute schedules in string "UTC" no
webhooks Deploy webhooks for external triggers from whitelisted IP CIDR's.
object({
deploy = bool
ip_whitelist = list(string)
private = optional(bool, false)
})
{
"deploy": false,
"ip_whitelist": [],
"private": false
}
no

Outputs

Name Description
api_gateway_stage_arn n/a
start_composition_state_machine_arn n/a
start_composition_webhook_url n/a
stop_composition_state_machine_arn n/a
stop_composition_webhook_url n/a
webhook_api_key n/a