Spacelift checks all the boxes for managing extremely large environments with a lot of state. Since Cloud Posse focuses on deploying large-scale, loosely coupled infrastructure components with Terraform, it's common to have several hundred Terraform states under management.
Virtually every successful business uses accounting software to manage its finances and understand the health of the business. The sheer number of transactions makes it infeasible to reconcile the books by hand. The same is true of modern infrastructure. With hundreds of states managed programmatically with Terraform, and modified constantly by different teams or individuals, the same kind of reconciliation is required to know the health of your infrastructure. This need goes far beyond continuous delivery, and few companies have solved it. With Spacelift, you have an up-to-date view of your assets, liabilities, and tech debt across all environments.
- Drift Detection runs on a customizable schedule and surfaces inconsistencies between what's deployed and what's in Git.
- Reconciliation helps you know what's deployed, what's failing, and what's queued.
- Plan Approvals ensure changes are released when you expect them
- Policy Driven Framework based on OPA (open source standard) is used to trigger runs and enforce permissions. This is like IAM for GitOps.
- Terraform Graph Visualization makes it easier to visualize the entire state across components
- Audit Logs of every change traced back to the commit and filterable by time
- Affordable alternative to other commercial offerings
- Works with more than Terraform (e.g. Pulumi)
- Pull Request Previews show what the proposed changes are before committing them
- Decoupling of Deploy from Release ensures we can merge to trunk and still control when those changes are propagated to environments
- Ephemeral Environments (Auto Deployment, Auto Destruction) enable us to bring up infrastructure with Terraform and destroy it when it's no longer needed
- Self-hosted Runners ensure we're in full control over what is executed in our own VPC, with no public endpoints
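To make the drift detection item above concrete, here is a minimal sketch of how a scheduled check could classify a single component. It is not Spacelift's implementation. It relies on the exit codes of `terraform plan -detailed-exitcode` (0 for no changes, 1 for an error, 2 for pending changes); the `PlanStatus` enum, the function names, and the `workdir` argument are hypothetical names introduced for illustration.

```python
import subprocess
from enum import Enum


class PlanStatus(Enum):
    IN_SYNC = "in_sync"  # exit code 0: the plan found no changes
    ERROR = "error"      # exit code 1: the plan itself failed
    DRIFTED = "drifted"  # exit code 2: the plan found pending changes


def classify_plan_exit_code(code: int) -> PlanStatus:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return PlanStatus.IN_SYNC
    if code == 2:
        return PlanStatus.DRIFTED
    return PlanStatus.ERROR


def check_drift(workdir: str) -> PlanStatus:
    """Run a read-only plan in `workdir` (a hypothetical component directory)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduler would call `check_drift` for each component and alert on anything that is not `IN_SYNC`. Note that a non-empty plan can also mean that merged code has not been applied yet, so a real system has to distinguish that case from drift.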
What level of access do the Spacelift worker pools have?
Spacelift workers are deployed in your environment with the level of permission that we grant them via IAM instance profiles. Any automation that provisions infrastructure requiring changes to IAM effectively needs administrative permissions. Since the Terraform we provision creates IAM roles and policies, workers are provisioned with administrative permissions in every account we grant them access to. Note that this is not a constraint of Spacelift; it is required regardless of the platform that performs the automation.
What happens if Spacelift as a product goes away?
First off, while Spacelift might be a newer brand in the infrastructure space, it's used by publicly traded companies, healthcare companies, banks, institutions, Fortune 500 companies, etc. So, Spacelift is not going away.
But just to entertain the hypothetical, let's consider what would happen. Since we manage all Terraform states in S3, we have the “break glass” capability to leave the platform at any time and can always run Terraform manually. Of course, we would lose all the benefits of the platform.
How tough would it be to move everything to a different platform?
Fortunately, with Spacelift, we can still use S3 as our standard state backend. So if at any time we need to move off of the platform, it's easy. Of course, we'd give up all the benefits but the key here is we're not locked into it.
Why not just use Atlantis?
We used to predominantly recommend Atlantis but stopped doing so a number of years ago. The project was more or less dormant for 2-3 years and only recently started accepting pull requests again. Atlantis was the first project to define a GitOps workflow for Terraform, but it's been left in the dust by newer alternatives.
- With Atlantis, there is no regular reconciliation of what Terraform state has or has not been applied. So with Atlantis we really have no idea of the actual state of anything. We recently helped a customer migrate from Atlantis to Spacelift, and it took 2 months to reconcile all the infrastructure that had drifted.
- With Atlantis, there's no drift detection, but with Spacelift, we detect it nightly (or as frequently as we want)
- With Atlantis, there's no way to manage dependencies between components so that when one component changes, the components that depend on it are updated as well.
- With Atlantis, there's no way to set up OPA policies to trigger runs. The OPA support in Atlantis is very basic.
- With Atlantis, anyone who can run a plan can exfiltrate your root credentials. This has been discussed by others and was recently highlighted at the Defcon 2021 conference.
- With Atlantis, there's no way to limit who can run terraform plan or apply. If you have access to the repo, you can run a terraform plan. If your plan is approved, you can run terraform apply. Cloud Posse even tried to fix it (and maintained our own fork for some time), but the discussion went nowhere and we moved on.
- With Atlantis, there's no way to restrict who has access to unlock workspaces via the web GUI. The only way is to install your own authentication proxy in front of it or restrict it in your load balancer.
- With Atlantis, you have to expose the webhook endpoint publicly to GitHub.
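The dependency point in the list above can be illustrated with a small sketch: given a map of which components depend on which, compute the downstream components that need a new run after a change, in an order that respects their dependencies. This is an illustration only, not how Spacelift or Atlantis work internally; the function name and the component names in the usage note are hypothetical.

```python
from collections import defaultdict, deque
from graphlib import TopologicalSorter


def downstream_run_order(depends_on: dict[str, set[str]], changed: str) -> list[str]:
    """Return the components affected by `changed`, dependencies first.

    `depends_on` maps each component to the components it depends on.
    """
    # Invert the edges so we can walk from a component to its dependents.
    dependents: dict[str, set[str]] = defaultdict(set)
    for component, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(component)

    # Breadth-first walk to collect everything downstream of the change.
    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)

    # Order only the affected components; raises CycleError on a cycle.
    subgraph = {c: depends_on.get(c, set()) & affected for c in affected}
    return list(TopologicalSorter(subgraph).static_order())
```

For example, with `{"vpc": set(), "eks": {"vpc"}, "rds": {"vpc"}, "app": {"eks", "rds"}}` and a change to `vpc`, the result contains `eks` and `rds` (in either order) followed by `app`.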
What about using GitHub Actions/GitLab/Jenkins/etc?
There are plenty of examples of using other tools to implement continuous delivery for Terraform. However, solving for all the edge cases is what makes it so complicated, and those edge cases are seldom, if ever, handled by these approaches.
- Where will you store the plan files which are required for approvals? (plan → approve → apply workflow) Note, these planfiles may contain root-level credentials to things like RDS databases, which cannot be avoided.
- How will you clean up those planfiles? Should they persist after a terraform apply succeeds or crashes?
- How will you implement approval steps? If the approval is denied, how will you clean up the terraform planfile?
- If you have multiple open PRs (e.g. many plans) for one workspace, after applying one, all other plans need to be invalidated. How will you implement that invalidation?
- Git is only one source of truth for infrastructure as code. Data sources are another (e.g. Terraform remote state). How will you confirm that your state is current and update it when it drifts? When it drifts, how will you be notified?
- How will you know that your infrastructure changes are applied everywhere? If a build fails, but the code is already merged, how do you escalate and ensure it's resolved?
- If you need to lock an environment from being updated, how will you do it?
- How will you suggest the changes? If the plan is to comment on the PR, that gets VERY noisy and everyone subscribed will receive the notification. Runs may also accidentally leak secrets in the output. GitHub comments are limited to 65K bytes, which means large plans will need to be split across multiple comments.
- What happens if you have multiple PRs merged that want to modify the same environment? How will you enforce an ordered consistency?
- How will you restrict who can run terraform plans and applies? Furthermore, how will you restrict it to specific environments?
- How will you provide short-lived IAM credentials to the Terraform processes? Any hardcoded credentials that get exposed are a major liability.
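Two of the questions above are easy to underestimate, so here is a rough sketch of what a homegrown pipeline would have to implement for them: splitting a large plan across several PR comments, and invalidating saved plans once another plan has been applied to the same workspace. Everything here is an assumption made for illustration: the 65,000-byte default simply leaves headroom under the limit mentioned above, and the `SavedPlan` record and its `state_serial` field (the serial of the state a plan was computed against) are hypothetical.

```python
from dataclasses import dataclass


def split_for_comments(text: str, limit: int = 65_000) -> list[str]:
    """Split plan output on line boundaries into chunks of at most `limit` bytes."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        if line_size > limit:
            raise ValueError("a single line exceeds the comment size limit")
        if current and size + line_size > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("".join(current))
    return chunks


@dataclass
class SavedPlan:
    pr_number: int
    workspace: str
    state_serial: int  # serial of the state this plan was computed against
    valid: bool = True


def invalidate_stale_plans(
    plans: list[SavedPlan], workspace: str, current_serial: int
) -> list[SavedPlan]:
    """Mark plans for `workspace` computed against an older state as invalid."""
    stale = []
    for plan in plans:
        if (
            plan.workspace == workspace
            and plan.valid
            and plan.state_serial < current_serial
        ):
            plan.valid = False
            stale.append(plan)
    return stale
```

Even this small sketch leaves open where the plan records live, who deletes the underlying planfiles, and how the owners of the invalidated PRs are notified, which is the kind of edge case the questions above are pointing at.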
Why not use Terraform Cloud?
Terraform Cloud is prohibitively expensive for most non-enterprise customers we work with, and possibly 10x the cost of Spacelift. Terraform Cloud for Teams doesn't permit self-hosted runners and requires hardcoded IAM credentials in each workspace. That's insane and we cannot recommend it. Terraform Cloud for Business (and higher) supports self-hosted runners, which can leverage AWS IAM instance profiles, but the number of runners is a significant factor in the cost. When leveraging several hundred loosely coupled Terraform workspaces, there is a significant need for a lot of workers for short periods of time. Unfortunately, even if those workers are only online briefly, you need to commit to paying for them for the full month on an annualized basis. Terraform Cloud also requires that you use their state backend, which means there's no way to “break glass” and run Terraform if they are down. If you want to migrate off of Terraform Cloud, you need to migrate the state of hundreds of workspaces out of the platform and into another state backend.