Change our deployment model to use Terraform #5
-
On the issue of developer velocity, CDK dev deployments are easy to clean up even after the local configuration is lost or forgotten, as the whole stack can be deleted in CloudFormation at regular intervals. Terraform would require us to be stricter in our development practice to maintain and destroy environments. I would like to hear from the community whether a Terraform configuration which deploys the CloudFormation template under the hood (via the aws_cloudformation_stack resource: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudformation_stack) would be a good middle ground. We could support this in gdeploy by outputting Terraform-compatible configuration, allowing users to keep the useful features of gdeploy while having a more familiar deployment experience using Terraform.
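As a rough sketch of that middle ground, assuming the published CloudFormation template is reachable by URL (the URL and parameter name below are placeholders, not the real release artifacts):

```hcl
# Terraform wrapping the existing CloudFormation template.
# Template URL and parameter names are placeholders for illustration.
resource "aws_cloudformation_stack" "granted_approvals" {
  name         = "granted-approvals"
  template_url = "https://example-releases.s3.amazonaws.com/granted/Granted.template.json" # placeholder
  capabilities = ["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"]

  parameters = {
    AdministratorGroupId = var.admin_group_id # placeholder parameter name
  }
}
```

Users would then get Terraform planning, state and locking over the stack as a whole, while the stack itself keeps the existing CloudFormation deployment semantics.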
-
This might be a tangential question, but maybe I could speak a little to this bit:
Have you considered publishing the Lambda function code to a registry? That could separate the function versions from the infrastructure versions. The infrastructure tool (CDK or Terraform) could then use its own dependency-control mechanisms to pin the function versions (and automate updates via Dependabot or similar). As a registry for the function code, I'm thinking something like npmjs or PyPI (language dependent), or even packaging them as Docker images and using the container feature when deploying the Lambda functions. Or perhaps use a SAM template and publish to the AWS Serverless Application Repository? Yet another option might be to create an AWS Marketplace offering, even if just listing it for free. That supports CloudFormation-based products, along with "update" paths, which are intended to deploy into the user's AWS account.
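For the container option, the consuming side might look something like this in Terraform (a sketch only; names are placeholders, and note that Lambda requires the image to live in an ECR repository in the deploying account, so a published release image would first be mirrored into it):

```hcl
# ECR repository into which the vendor's release image is mirrored.
resource "aws_ecr_repository" "approvals" {
  name = "granted-approvals"
}

# Lambda deployed from a container image pinned by version tag,
# rather than a zip in a region-local S3 bucket.
resource "aws_lambda_function" "approvals" {
  function_name = "granted-approvals"
  role          = aws_iam_role.lambda.arn # assumed to be defined elsewhere
  package_type  = "Image"
  image_uri     = "${aws_ecr_repository.approvals.repository_url}:v0.9.0"
}
```

Pinning by tag is what would let Dependabot-style automation propose version bumps as ordinary code changes.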
-
An additional goal to add here: we would like to support arbitrary deployment regions rather than requiring users to deploy to specific predefined regions. A limitation of our current approach is that our CloudFormation templates refer to Lambda binaries which must exist in the same region as the template itself. Because of this, we publish templates and binaries to a fixed set of regions. I believe that publishing Lambda functions as Docker containers may help us avoid this problem, as @lorengordon has suggested above.
-
## Generally: TF vs. CF

Before I separately provide input on the differences between CDK and TF approaches for Granted Approvals, I thought it would be useful to transcribe some of the thoughts we've gathered over time in our company on CloudFormation versus Terraform as underpinning IaC options for our customers. I am very aware that CDK attempts to resolve some of the problems with CloudFormation, but in my opinion it brings as many drawbacks as advantages, not gaining any ground on a Terraform approach. For me, Pulumi and Terragrunt (for example) are to Terraform as CDK is to CloudFormation: what you lose is not worth what you might gain.

Caveat 1: There have been some improvements in CloudFormation in the last few years to resolve some problems, and they may not be accounted for in this summary, but on the whole I still believe these are sticking-plasters that don't address fundamental design flaws.

### Cleaning Up

CloudFormation: Delete all my stacks.

Winner: CloudFormation

### Deployment Management

CloudFormation: For some resources, such as Scaling Groups, CloudFormation has native integrations for automatically following different zero-downtime deployment methods, with resources calling back to the stack to signal completion, albeit a little difficult to set up.

Terraform: It depends. With modern Terraform you can hook into all kinds of processes internal to your Terraform run, set up complex dependencies, and involve Lambdas to test completion before moving forward, but you have to design them yourself and there's no "listener" available.

Winner: CloudFormation

### Drift Control

CloudFormation: Can detect some drift, as an optional process external to the deployment lifecycle. It cannot do anything about drift. If your resources are too drifty, CloudFormation will error. Raise a support ticket, or if you want it fixed this week, orphan the resource, delete it, and try again.

Terraform: Designed from the outset to detect and correct drift for any attribute of any resource it manages, as part of the deployment lifecycle.

Winner: Terraform

### Error Handling

CloudFormation: Had any resource fail in any way during a stack deployment, or in any nested stack in a stack of stacks? That's a paddlin'. Throw everything away and start again, no matter how much state or data has accrued or how long the rest of the resources took to deploy. The workaround is to break everything into smaller stacks, so long as they don't nest, and roll your own pipeline tooling to connect your stack deployments.

Terraform: One resource had an issue, so it was skipped and its dependencies were skipped, but everything else succeeded. Fix the problem and carry on from where you left off, with no impact to any resources that you took the time to deploy or that already contain state or data. If you need to recreate something that did get created before an error was encountered, you can choose precisely which resources to recreate and which to leave alone; or you can start from scratch if it suits you to do so.

Winner: Terraform

### Block repetition, code re-use and standardisation

CloudFormation: Nest stacks into a stack of stacks, defining each repeatable block as its own independent stack. Call each iteration of the block as its own static definition. Use mostly hard-coded mappings or blocks of parameters (so long as you don't run out of parameters) or custom Lambda-backed resources to get the inputs for your iterations. Or just repeat the code manually.

Terraform: Use a module. Iterate the module as many times as you need, from 0 to infinity, iterating over any collection data type. Use hundreds of ways to collate your input data from wherever you see fit. Use any data transformation you need to translate variables with complex data types, other resources, data sources from any supported provider and other funky magic to put together what you need and iterate appropriately.

Winner: Terraform

### Multi-region / Multi-Account Deployment

CloudFormation: Multi-region? What's that? Hey CodePipeline, do you know?

Terraform: Multi-region, multi-account, multi-cloud, multi-service deployment is just a native capability. Deploy resources in any region, account, cloud or supported service you like by passing the correct provider configuration.

Winner: Terraform

### Looking up Data

CloudFormation: Invoke a Lambda. Lambda glues everything together. Write a CloudFormation stack to deploy a Lambda function that you write yourself, explicitly expecting a question and response only from CloudFormation, that CloudFormation can then execute in order to get a specific piece of data about a specific external resource for one use case. Or if you're up for it, spend some months writing a company-standard lookup-interface function that allows you to look up multiple things from multiple sources without having to write additional functions. Keep a team on standby to look after your special function.

Terraform: Just look up the data. Use a data source for any supported provider. Or use a shell script. Or a command. Or pull it from a file. Or pull it from an API. Use a variable as an input to a data structure that you pass to a data source to generate a more complicated data set that you then iterate on, where each iteration looks up something else. Look up data from remote Terraform state, whether the state is a live deployment or not. Look up any attribute of any resource you have deployed without declaring any explicit outputs for it. It's your data; have it your way.

Winner: Terraform

### Resource lifecycle management

CloudFormation: First you create new things. Then you destroy old things. It puts the lotion on its skin or else it gets the hose again.

CloudFormation: Drift is not my problem; if a resource I own changes, it's not my problem.

CloudFormation: I protect you. I can stop a whole stack being deleted, or I can make sure not to delete a resource when I delete the stack.

Winner: Terraform

### Data Transformation

CloudFormation: Basic string manipulation functions; otherwise prepare and deploy transforms, Lambda functions and other miscellany to try to turn parameters into anything more nuanced than basic data types.

Terraform: ~100 intrinsic data manipulation functions, from simple maths and string munging to calculating CIDR blocks and converting entire data structures between languages.

Winner: Terraform

### Multi-User Simultaneous Change Planning

CloudFormation: A stack is a living, breathing object that owns the resources it created. If you want to plan any changes, form an orderly queue, or deploy multiple stacks for everyone to play with.

Terraform: So long as you don't want to make any changes, everyone pile on. You can practice all you like so long as you're sharing my state file (see tfscaffold). If you want to simultaneously make changes, even though it's not always the best idea, you can if you want to. Just configure some locking and you can put your changes into the queue.

Winner: Terraform

### Access to Stored Resource State

CloudFormation: Ask my API nicely and I will tell you what I think you need to know; there's a nice web interface!

Terraform: I am an open book. You can read my state file if you want. It's all JSON, so you can parse it with jq. Or you can just ask me to list the resources, or show you all the resource configurations. You can manipulate the state if you want, in case you have any out-of-band changes. I can add things for you, or you can add them yourself, or remove them, or change them.

Winner: Terraform

### Planning Changes: Process

CloudFormation: Upload a new template to an S3 bucket (you have an S3 bucket for that, right?), then I will look at it.
I'll create an unsecured bucket for you if you don't have one. No bucket, no dice.

Terraform: Point me at a state file, existing or not, local or not, and just type plan. You don't need to copy anything anywhere or create any artifacts; just save your file and I can plan with it. And if you just want to plan changes for the resource you're working on, tell me to target it, or a whole module for that matter.

Winner: Terraform

### Planning Changes: Confidence

CloudFormation: If you update a stack, I will tell you what might change in that stack. But only that stack. I'm not a mind-reader! If you have any nested stacks that the first stack feeds outputs into, your guess is as good as mine what will happen to them. I guess you could fake the outputs I give you into a hardcoded temporary template and use that to plan the child stack?

Terraform: If I made it, I will check it. I will tell you what changed outside of my influence, and I will tell you what I'm going to change as a result.

Winner: Terraform

### Language & Syntax

Everyone has a preference here. AWS certainly recognise that JSON is not exactly friendly, and so created a YAML alternative. But I defy anyone to say that HCL is not more friendly to read, write and maintain. It is strict and limited in such a way that it can never get as complex as your average JavaScript application; although it pushes the boundaries, it is still a declarative language, but at the same time flexible and easy to use. Ignoring the issues regarding supported variable data types and intrinsic functions (and why Ref and GetAtt are not the same function), compare these equivalent code snippets.

CloudFormation:

```yaml
content:
  !Join
  - "\n"
  -
    - "ATL_PRODUCT_FAMILY=confluence"
    - !Sub ["ATL_PRODUCT_VERSION=${ConfluenceVersion}", {ConfluenceVersion: !Ref ConfluenceVersion}]
```

Terraform:

```hcl
content = <<EOF
ATL_PRODUCT_FAMILY=confluence
ATL_PRODUCT_VERSION=${var.confluence_version}
EOF
```

CloudFormation:

```yaml
ImageId:
  !FindInMap
  - AWSRegionArch2AMI
  - !Ref AWS::Region
  - !FindInMap
    - AWSInstanceType2Arch
    - !Ref NodeInstanceType
    - Arch
```

Terraform:

```hcl
image_id = var.images[var.region][var.node_type]
```

CloudFormation:

```yaml
Name: !Join ['.', [!Ref 'AWS::StackName', 'db', !Ref 'HostedZone']]
```

Terraform:

```hcl
name = "${var.name}.db.${aws_route53_zone.main.domain_name}"
```

Winner: Terraform

### Parameter / Variable Data Types

CloudFormation: String, Number, List, CommaDelimitedList, AWS Resource IDs, SSM Parameter Names.

Terraform: Any JSON object, in HCL format.

Winner: Terraform

### Interoperability

While I don't hold interoperability as a major concern, and it's not anything I would use to make a decision, it's worth bearing in mind:

CloudFormation: To run Terraform, simply write a stack that starts a complicated custom pipeline that runs a container to download your source and Terraform, and then execute your apply.

Terraform:

```hcl
resource "aws_cloudformation_stack" "main" {
  name = local.csi
  template_body = yamlencode({
    Resources = {
      SlackChannelConfiguration = {
        Type = "AWS::Chatbot::SlackChannelConfiguration"
        Properties = {
          ConfigurationName = local.csi
          IamRoleArn        = aws_iam_role.main.arn
          SlackChannelId    = var.slack_channel_id
          SlackWorkspaceId  = var.slack_workspace_id
          LoggingLevel      = var.log_level
          SnsTopicArns      = [aws_sns_topic.main.arn]
        }
      }
    }
  })
}
```

Winner: Terraform

### Resource Naming Scope

CloudFormation: All of my resources share the same scope. If you want to make a Candy IAM Role, and a Candy IAM Policy, and a Candy IAM Role Policy Attachment, I can't tell the difference between them, so make sure to name them CandyRole, CandyPolicy and CandyRolePolicyAttachment.

Terraform: aws_iam_role.candy, aws_iam_policy.candy, aws_iam_role_policy_attachment.candy.

Winner: Terraform

### Conditions

CloudFormation: Use Parameters to set up simple Conditions with a truthiness you can reference later, in strict circumstances only.

Terraform: Native boolean truthiness with no interim conversions. Set up whatever conditions you need, chained as deeply as you need, with as much interim logic and as many external data values as you need. Use conditional output almost wherever you like.

In the simple case:

CloudFormation:

```yaml
Parameters:
  EnableGeoBlocking:
    Type: String
Conditions:
  EnableGeoBlocking: !Equals [ !Ref EnableGeoBlocking, "true" ]
Resources:
  CloudFrontDistribution:
    Type: "AWS::CloudFront::Distribution"
    Properties:
      DistributionConfig:
        Enabled: true
        Restrictions:
          GeoRestriction:
            !If
            - EnableGeoBlocking
            - RestrictionType: whitelist
              Locations:
                - BE
                - LU
                - NL
            - RestrictionType: none
```

Terraform (one of several ways to achieve the result: making the attributes conditional; you could instead make the whole restriction declaration optional):

```hcl
variable "enable_geo_blocking" {
  type = bool
}

resource "aws_cloudfront_distribution" "main" {
  enabled = true
  restrictions {
    geo_restriction {
      restriction_type = var.enable_geo_blocking ? "whitelist" : "none"
      locations        = var.enable_geo_blocking ? ["BE", "LU", "NL"] : null
    }
  }
}
```

Winner: Terraform

### Size Constraints

CloudFormation: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cloudformation-limits.html. Everything has a limit: number of Parameters, size of a Stack, number of Stacks in an Account, number of Mapping Attributes.

Terraform: How much RAM have you got?

Winner: Terraform
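To make the "Looking up Data" point above concrete, a minimal data-source sketch (the zone name is an illustrative placeholder); no helper Lambdas are involved:

```hcl
# Look up the current region and an existing Route 53 hosted zone directly.
data "aws_region" "current" {}

data "aws_route53_zone" "main" {
  name = "example.com." # placeholder
}

output "zone_info" {
  value = "${data.aws_route53_zone.main.zone_id} in ${data.aws_region.current.name}"
}
```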
-
## Terraform vs. CDK for Granted Approvals

All of the following is subject to IMHO and YMMV.

### Executive Summary

I think that Common Fate's best interests would be served by maintaining both Terraform and CloudFormation infrastructure code, but without CDK, and modifying the purpose of gdeploy to be additive to the configuration and deployment experience rather than a required component of it.

### My Requirements

As an Enterprise systems integrator, my primary concerns for deploying infrastructure for an application such as Granted Approvals:
### Considerations

#### gdeploy

I am personally a big fan of providing a custom binary that solves automation decision-making problems, and provides a quick-start interactive option for users performing a proof of concept, or with little time, skill or resource to properly implement an enterprise-grade approach. In fact, one of my first "DevOps" experiences was creating a deployment pipeline interface for Java applications on VPS using Bash and xdialog, to let developers deploy where they needed to, when they needed to, and incorporate processes like database cleanups and service restarts as logical prescribed menu choices rather than getting their hands dirty. Frankly I think we don't do enough of this type of thing these days, and gdeploy providing this makes me happy.

However, the current use of gdeploy is not as a quick-start for the inexperienced; it is a required component of the deployment process. Documentation does not exist in many places as to how to manually replicate what gdeploy does, as it is expected that gdeploy is the only approach to use. When trying to create a Terraform approach to the infrastructure deployment (necessary to integrate with our business' Enterprise Cloud Accelerator (Landing Zone) solution), this made it very difficult to identify the configuration options I needed to replicate without first having a complete gdeploy-curated deployment that could then be reverse-engineered. Several times I got what I needed by having @chrnorm forward me copies of configuration from a deployment of his own.

This, for me, speaks to pitfalls that I see AWS and Hashicorp both falling into all of the time. They are so focussed on the adoption of their solutions by new users who are just beginning their journey that they sometimes forget about experienced users. You will find many places in AWS documentation where example solutions are described without any encryption, and little to no documentation on how to implement the same solution with all possible encryption capabilities enabled. Hashicorp have made changes to Terraform in the past that reduce flexibility in how you may use it, which have caused problems in our deployment processes: they have tried to make it harder for new users to make mistakes, and in doing so have broken innovative solutions that made use of the flexibility.

My approach to this type of situation would be to first think about the most complex, hardened, best-practice use case and provide solutions that are suitable to that use case, and then provide the wizards, examples and guidance for enabling users with simpler use cases or limited experience. As I would say to AWS: first make sure the API is right and available to the public, and then add the Console UI, instead of the choice they often make to work the other way round, prioritising the console users over the enterprise developers.

Wandering back to the topic at hand: I would use the existence of gdeploy as a way to request configuration items from the user interactively if they wish, which could then output either or both of CloudFormation parameters or Terraform tfvars for use in a deployment. gdeploy is then perfectly capable of deploying stacks or running Terraform for the user, or allowing the user to perform the task themselves. This gives the most flexibility to any type of user. A very advanced user may not use it at all, using the documentation to create configuration, and their own approach to the deployment with whichever code they choose. An intermediate user may use gdeploy to create the configuration, but deploy and/or modify that configuration themselves. The novice user can just keep telling gdeploy what to do until the deployment is complete, probably choosing the CloudFormation option because it's more novice-aws-console-user friendly.

#### Development and Deployment Extensibility

A major benefit of Terraform is how easy it is to modify infrastructure. When a new AWS feature becomes available, for example when EBS Encryption by Default became available, integrating it into Terraform solutions can be as easy as adding a line or two of code. In this particular example, the addition of this file:

```hcl
resource "aws_ebs_encryption_by_default" "main" {
  enabled = true
}
```

Conversely, should we want to make that feature optional per account, making the boolean a defaulted variable would take mere moments. In most cases with Terraform solutions, I do not need to know very much about the existing infrastructure to extend it. I need to know how to define the resource I want, or the attributes I need to change for a resource that exists; I do not need to spend any time learning the manner in which the code is built. This is generally true of CloudFormation too, albeit CloudFormation is usually harder to extend, especially where multiple stacks are tied together, passing data between them, and stacks are reaching their limits.

With CDK, you don't have a declarative language expressing the infrastructure configuration; you have a bespoke application, written for your use-case, the job of which is to output an artefact that is your infrastructure configuration. Your custom application takes a language and concept that is simple and standard and abstracts it into a bespoke way of thinking, requiring significant software development experience and a whole extra language in the developer requirements. The output from that custom application is then an artefact of its own: not a human-manageable declarative script, but something intended only to be read and written by a higher-level construct. Exactly as you could not expect developers to work with minified JavaScript, you can't necessarily expect them to work with templates generated by CDK. You lock out of the process anyone who is not working to your development approach in your custom application.

The reason why, in my opinion, CDK for CloudFormation, CDK for Terraform, Terragrunt, Pulumi and other solutions exist is because of perceived deficiencies and immaturities in the underlying language. Terraform has made great strides to make these solutions redundant through improvements in its core code.
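The per-account toggle for the EBS default-encryption example mentioned above could be sketched as follows (the variable name is illustrative):

```hcl
# Defaulted variable: accounts opt out rather than opt in.
variable "ebs_encryption_enabled" {
  type    = bool
  default = true
}

resource "aws_ebs_encryption_by_default" "main" {
  enabled = var.ebs_encryption_enabled
}
```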
It used to be that, in optionally deploying resources to multiple regions, we had to declare identical resources many times, each of them optionally toggled on for a given region. This would have been cleaner had we adopted a templating structure that could perform iteration for us, but now that we can iterate module calls natively in Terraform, it's a moot issue. There are still occasions where certain things could be made a little simpler by adding some templating, but they are so few and far between as to not be worth sitting a custom templating engine as a dependency in our codebase and processes. The language is mature enough on its own now.

In the case of CloudFormation, there are still sufficient immaturities that AWS have yet to address that the value of CDK is more obvious, but at the same time the arguments above as to CDK making the solution harder to maintain and less standard make it a difficult justification. Additionally, a lot of the trouble that CloudFormation presents is not in the configuration of the resources but in the co-ordination and deployment of multiple stacks and conditional configurations and such, as you might bring together into one CDK App. However, this is just one way of co-ordinating stacks. Given the presence of a custom deployment management tool such as gdeploy, any of that deployment integration logic could be implemented with gdeploy, or left for a system integrator to implement in the CI/CD tooling they happen to be using.

#### Build, Deploy and Run Dependencies

Granted Approvals being actually quite a simple application from an infrastructure perspective, there's not really much complexity to worry about. Deploying the whole thing as a single Terraform module or a single human-derived CloudFormation stack is a simple proposition.
The only challenge comes with building and deploying the Lambda functions, and cloning the website assets and issuing a CloudFront invalidation as part of the deployment. There are many ways to address this requirement, and while CDK's approach is reasonably pleasant for the stand-alone developer, it's not the only efficient way to achieve the goal; so much so that part of the work is already devolved to a frontend-deployer Lambda function. Really it's a choice between building, testing, versioning and deploying the applications as independent applications with their own lifecycles, sourcing the artefacts from a repository, or building the applications from code as part of one end-to-end deployment process. I don't see these options as mutually exclusive. Writing Terraform code that is capable of doing either approach based on a conditional is not particularly difficult. Using gdeploy to effect a build stage is also feasible. The only limitation would be how the CloudFormation option would look without CDK or gdeploy, in which case it may be just a set of user instructions for performing a build, with CommonFate developers using gdeploy for one-shot workstation simplicity and SIs using Terraform picking and choosing the approach that works best for them.

The main priority for me is meeting the requirements as stated at the top. Ideally it is possible to achieve perfectly idempotent deployment, with no external dependencies at all, or with external dependencies imported at build time. By this I mean that whether you build the code yourself, or you bring in pre-built release artefacts from a CommonFate deployment channel, you should be able to do these things prior to your CloudFormation or Terraform execution, so that that execution can be performed with a predictable and guaranteed result in production.

In building the Terraform solution that exists today, I have compromised and used deploy-time dependency resolution, but with a cache, so that the upstream code artefacts are cloned when the deployment occurs, but if the upstream copy is unavailable, a locally cached copy is used instead. This is a compromise, as it only works for same-version deployments; it does not work for version upgrades. This means that I could deploy an upgrade in pre-production and have everything work perfectly, then go to do the same in production and find that, if something happened to the CommonFate release bucket, my production deployment would fail, which would not really be acceptable. I would like to improve this approach, but as it is I am already selecting the upstream artefacts in a particularly dodgy manner, scraping the artefact naming hashes from the CDK-generated CloudFormation file. I did not feel this was a problem I could easily solve without some support from the core development team in abstracting the build and deploy processes a little from the CDK solution, or by building my own application packaging pipeline, which I could and may do, depending on the outcome of this RFD.

#### Continuing to Provide a CloudFormation Option

I personally have no interest in a CloudFormation option for Granted. CloudFormation, as described elsewhere in this RFD, is such a limited solution that we try to avoid including it in our development ecosystem where we can, and so we do not include processes for stack management, static code analysis for CloudFormation, and the other things you do to look after its existence in your software stack. We use CloudFormation only where it is explicitly necessary.

But that is our situation. There are similarly system integrators that are all-in on AWS-native solutions exclusively. There are people for whom the AWS Console is life, and anything they cannot do in the console is too complex for consideration. There are people whose sole purpose in life is to integrate with AWS tooling such as AWS Service Catalog, AWS Marketplace and AWS Control Tower.
My feelings about each of these solutions are irrelevant; people do and will use them. If CommonFate wants to continue to provide the most seamless and inclusive solutions possible for its users, I do not feel that completely dropping support for CloudFormation-managed infrastructure is the way to go (although I probably would choose to do so, against my own interests, just out of principle). I understand completely that the concept of managing two different implementations of the same solution in multiple languages is not immediately the most desirable choice; I think for this situation it is the most practical one, and the correct one.

It is also not an uncommon approach. Consider API developers that provide SDKs in multiple languages. AWS themselves provide Java, NodeJS and Python SDKs because they understand the value of the flexibility, and of not forcing choices that hinder the innovation of developers and integrators. Because the SDKs are an interface to the API, they have the ability to implement their own features without those features being an absolute requirement for porting to the other languages (e.g. different pagination solutions), but they all fundamentally implement the capabilities the APIs offer. Similarly, there is nothing to stop a Terraform solution providing more mature deployment approaches than the CloudFormation option; just because you implement a new capability in the Terraform solution, you don't have to ensure it is implemented in the CloudFormation solution. All that matters is that both solutions provide the same infrastructure that a given application version requires. So if you release 0.9.0, and 0.9.0 requires a new Lambda function for processing events, that function must be implemented in both CF and TF infrastructure, but without much difficulty, as each infra codebase has copypasta to deliver it as only a minor change.

If, however, you release an optional feature for backing up your S3 assets bucket using replication, or an improved way to cache assets, you can choose if and when you want each infrastructure solution to support it. As I have said before, Granted Approvals infrastructure is not very complex. It does not even change often, because it is the Lambda functions that provide most of the application logic. I do not personally believe that the rate of change in infrastructure is sufficient to warrant concerns over the labour required to maintain both Terraform and CloudFormation solutions.

#### CommonFate Terraform Provider

This, for me, is the game changer. The absolute silver bullet of all things. My number one feature request, surpassing all others, is a Terraform provider for managing Access Rules in Granted Approvals. For all of the reasons described above, I do not just want my application and its infrastructure to be defined as code and truly idempotent; I want my application configuration to act the exact same way. I do not want my deployment process to include "and then add these rules" or "and then run these gdeploy commands" or "and then restore this database backup". I want my configuration to be in code that deploys the app configuration immediately after, or even during, the application deployment.

This is how our use of AWS SSO works today. Some of our AWS SSO configuration is done in Granted Approvals, and some of it is done in Terraform. After Terraform has configured all of the appropriate permission sets and other configuration, it then deploys any and all static associations. So, for example, if we have a rule that all developers have read-only access to an account, but they can escalate their privilege to an Admin role for an admin activity, then their ReadOnly access would be configured by Terraform.
Terraform associates the Developer group in SSO (as provided to SSO from AAD SCIM) with the appropriate permission set for the account as a permanent (or at least Terraform-managed) association. It is only when a developer wishes to escalate their privilege that they go to Granted Approvals to make that time-limited change. This means that all of our permission configuration is stored as code in Terraform, except for the Granted Approvals access rules. They are the only aspect of the configuration that has to be replicated dynamically for each deployment. The second I have a Granted Approvals or Common Fate Terraform provider available to me to represent those access rules in Terraform, my entire estate goes back to true idempotency. I no longer need to allow administrators to create and modify access rules using the user interface; I declare the rules in Terraform, and any change to the rules goes through a code review process and an approval process, and is applied via Terraform deployment automation the same as everything else.

This is the one area in which there is no real way for CloudFormation to compete. Other than the mechanism already employed with gdeploy, the only approach I can think of is to deploy configuration to an S3 bucket, which a Lambda function would parse and use to enact changes to the database; but even that approach feels loosely coupled and, well, shonky. It certainly doesn't come close to the capabilities that would be provided by a Terraform provider. Yes, you could use Lambda-backed custom resources to create Access Rules, but you would probably enter a world of pain and limitation to do that, as each rule would need hard-coding in CloudFormation and would take up one of a limited number of resources per stack, and I can't even begin to imagine how you would go about parsing a structure of rules (e.g. in JSON) and implementing resources off the back of it.
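By contrast, with a provider, rules-as-data is a native pattern. A purely hypothetical sketch (this provider does not exist yet; the resource and attribute names are invented for illustration):

```hcl
# Rules declared as a data structure, reviewed like any other code change.
variable "access_rules" {
  type = map(object({
    group    = string
    role     = string
    duration = string
  }))
}

# Hypothetical Common Fate provider resource; one declaration covers all rules.
resource "commonfate_access_rule" "main" {
  for_each = var.access_rules

  name     = each.key
  group    = each.value.group
  role     = each.value.role
  duration = each.value.duration
}
```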
I suppose you could just construct JSON with gdeploy and then send the whole JSON block into a Lambda function. It's possible, it's just not clean. Compared to creating a tidy data structure in HCL in Terraform and then using only one or two resources that iterate over the data structure to create all of your rules (or maybe one resource per group of rules that share a construct), it's just a different ball game.

### Conclusion
-
Adding some of my own thoughts and research from investigating improved approaches to our infra. @lorengordon and @Zordrak - thank you both for sharing your experience and viewpoints here. @Zordrak, I'm very appreciative of how comprehensive your comments are on Terraform vs CDK. To summarise some of the discussion so far:

Based on @lorengordon's suggestion, I published a test Serverless Application Repository (SAM) application. I found that this approach had the following benefits over CDK:
I'll also note here that I would prefer to keep everything on the same release cycle - releasing a new version of the Lambda binaries should also correspond with a new version of the infrastructure, even if there are no underlying infra changes other than the Lambdas. This is purely for simplicity - it avoids us having to maintain a version matrix where version X of the Lambdas requires version Y of the CloudFormation and version Z of the Terraform.

What I am still unsure about is whether we can improve the names of the Lambda function assets by using SAM. SAM uses the MD5 hash of the Lambda zip file as the S3 object key. There is an open issue here on customising asset names over on the SAM CLI repo. We could try to customise this process, but I'm not sure whether it would impact SAM's built-in code signing. Essentially, we may need to do more steps ourselves, but this may be acceptable in order to get readable function names.

### CommonFate Terraform Provider

(this is referencing @Zordrak's reply above)

This is a fairly high priority on our roadmap, and there is no question in my opinion that Terraform is the ideal way to go for infrastructure-as-code configuration of Access Rules. Using CloudFormation and CRDs for defining Access Rules definitely feels like a suboptimal solution, and I agree with all of the points you've raised @Zordrak. Regardless of whether you deploy the infra layer as a SAM stack or as native Terraform infra, we will be encouraging users to use Terraform to manage their Access Rules as code.

### My proposed way forwards

I propose that we make the following changes to the Granted Approvals codebase:
Personally, my viewpoint is that we should compromise on Terraform deploying the SAM application as our 'official' Terraform infra approach, to minimise the amount of infra work we are doing (there are many application-level features we want to build, and we are a small core team at the moment). However, given that @Zordrak has created a top-notch native Terraform deployment, I think it is worthwhile to try to maintain both deployments together using the component tagging approach I've mentioned above. So my proposal is that we bring the Terraform deployment up to v0.9.0 and then try to release both TF- and SAM-flavoured deployments for the foreseeable future.

One particular pain point that SAM will not help with is the Frontend Deployer CRD. Ideally, CloudFormation would have a nice way for us to specify "here is a set of frontend assets to host with CloudFront". I note that Terraform has a lot more flexibility here, as we can copy the assets over as part of a deploy.

A limitation in our current Frontend Deployer is that it deletes objects in S3 prior to updating them. If the Frontend Deployer fails during a deployment, this will cause the dashboard to become unavailable. As part of shifting to SAM, I propose that we adjust the Frontend Deployer to use version tags as a 'namespace' for deployments rather than deleting objects. So a frontend hosting bucket with multiple releases would look like this:
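A hypothetical sketch of that versioned layout (the version tags and file names are illustrative):

```text
s3://frontend-hosting-bucket/
├── v0.8.0/
│   ├── index.html
│   └── assets/...
└── v0.9.0/
    ├── index.html
    └── assets/...
```

The deployer (or the CloudFront origin path) would then point at the prefix for the active release, so a failed upload of `v0.9.0` leaves `v0.8.0` serving traffic intact.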
-
Granted Approvals has been developed using a serverless architecture and runs on AWS. When beginning the project we opted to use AWS CDK in TypeScript to define the required infrastructure. This had several benefits, including a simple development deployment workflow (`mage deploy:dev`).

We have, however, run into some trade-offs with CDK. CDK is designed for deploying cloud applications into one's own AWS environment, and is not built for packaging CloudFormation templates to be deployed by users into their own AWS environments. To achieve this, we have written build steps in Mage and TypeScript which synthesize the CloudFormation template and publish assets to S3.
An immediate downside of this approach is that our synthesized assets use names based on SHA256 hashes of their content. This makes it very difficult to interpret which Lambda function is which in our release builds:
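For illustration, the hashed asset keys look something like this (the hash values here are truncated and made up):

```text
asset.0c3a6b9d2f81e5c7a4b0...zip   # which Lambda is this?
asset.9e5d1c7f4a2b8e0d6c3f...zip
asset.7f31e8c2a9d04b65f2c8...zip
```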
Another downside of this approach is that CDK does not play nicely with CloudFormation conditional parameters. This requires us to frequently drop down into L1 Constructs (example), meaning that we lose some of the higher-level abstraction benefits of using CDK in the first place.
In order to manage a Granted Approvals deployment we built `gdeploy`, a deployment helper tool. When running `gdeploy init`, a `granted-deployment.yml` file is created which contains various deployment parameters, like the release version being run, the region being deployed to, and the Access Provider configuration.

While this works well for a user evaluating Granted Approvals, many users have adopted Terraform to manage their cloud infrastructure, and `gdeploy` is not suitable for them. With `gdeploy` we are introducing additional operational workflows for these users to implement, and a new tool which must be managed. Teams may already have CI/CD workflows implemented for Terraform, and now need to add new workflows to use `gdeploy`. For some teams we have spoken with, this adds a high amount of friction to adopting Granted Approvals.

@Zordrak has developed a Terraform-based deployment of Granted Approvals: https://github.com/bjsscloud/terraform-aws-granted-approvals. @Zordrak's Terraform deployment contains some improvements over our current approach to infra:
I'm raising this RFD to discuss whether we should adopt this as the core infrastructure-as-code layer for the Granted Approvals project.
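For context, consuming a Terraform-based deployment like @Zordrak's would slot into a user's existing Terraform codebase as an ordinary module reference. A minimal sketch, where the module source points at the repository above but every input variable name is a hypothetical placeholder (check the module's own documentation for its real inputs):

```hcl
# Illustrative only: the variable names below are assumptions,
# not the module's actual interface.
module "granted_approvals" {
  source = "github.com/bjsscloud/terraform-aws-granted-approvals"

  release_version   = "v0.9.0"
  deployment_region = "eu-west-2"
  idp_type          = "azure-ad"
}
```

The appeal is that this drops straight into existing `terraform plan` / `terraform apply` CI/CD pipelines, which is precisely the friction point described above.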
These are the goals which I propose we should aim to meet with our cloud infrastructure:
I'd love to hear feedback on the following discussion points:

- Is the current `gdeploy` method preventing you from deploying Granted Approvals?
- Should we keep using `granted-deployment.yml` to manage deployment parameters, or is this made invalid by Terraform?