Kelsey Hightower wrote Kubernetes the Hard Way almost a decade ago, and he was clear from the first sentence about what it wasn't. It wasn't a deployment guide. It wasn't a recommendation. The whole point was to walk you through standing up a cluster yourself, by hand, so you'd see what the abstractions normally hide — and then go pick a managed offering with a much better understanding of what you were running.
Here's the Terraform equivalent. Not how I'd recommend running Terraform. What it actually takes.
Kelsey's piece is meant to be run, command by command. This one shows plenty of commands too — but as illustrations of what teams stitch together, not steps to follow verbatim. Same spirit, one level up.
A note up front. If you're new to infrastructure as code and you're staring at this list thinking "all of this for terraform apply?" — that's fair. For a hello-world, most of it is overkill. This isn't a list for hello-world. It's the list of decisions a team makes on the way to production-grade, maintainable Terraform that holds up across years, environments, teams, and regions. If you're already there, none of this will be a surprise. If you're not, this is the road.
To keep the list legible, I've grouped it into three phases: things you design before you write much code, things you build to make it run, and things you operate to keep it running. The phases overlap in practice — every "design" decision gets revisited the first time it survives contact with reality — but they're a useful way to read.
Decisions that shape everything that comes after. Easier to make once, deliberately, than to migrate later.
This is the first decision and the easiest to get wrong. One repo or many. If one, how do infrastructure changes coordinate with application changes — same PR, separate PRs, gated by approval? If many, how do they share modules, state, and conventions? Folder structure inside each repo. Naming. Where stacks live, where root modules live, where shared code lives.
Everything a framework would encode for you, you'll encode by hand in conventions, READMEs, and tribal knowledge. Either way, it's a decision — not a discovery — and the longer you wait to make it deliberately, the more migration work you've signed up for later.
There isn't one common approach — there are four, each with a distinct failure mode.
The most common starting point. Top-level folders are environments; root modules nest inside.
.
├── prod/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── backend.tf
│   │   └── terraform.tfvars
│   ├── eks/
│   └── rds/
├── staging/
│   ├── vpc/
│   └── eks/
├── dev/
│   ├── vpc/
│   └── eks/
└── modules/          # shared child modules
    └── networking/

Layering between environment and root module is ad hoc; every new piece of infrastructure makes its own placement choice; the first multi-region or second-account need breaks the convention.
All four work. Each one defers the same set of questions — where centralized logs go, where DNS zones get managed, how multi-region deployments work, why you want more AWS accounts than you think — to a README that gets written later. The cost of deferring shows up as migration work the first time the layout has to change, usually with cookiecutter scripts and an INFRASTRUCTURE.md papering over the gaps in the meantime.
There's a second dimension on top of this. Some teams take it the other way and split each root module into its own repository — one repo for VPC, one for EKS, one for RDS. The four layouts above still apply inside each repo, at smaller scale. The new problem is across the seam: keeping shared modules, conventions, and toolchain in sync across the fleet of repos, and managing version pinning between root modules whose outputs feed each other.
You don't just need Terraform or OpenTofu. Most teams also rely on the cloud CLI for whichever cloud they're on — aws, gcloud, or az — to bootstrap accounts, exchange credentials, and reach what doesn't live in IaC. If you're running Kubernetes, you'll likely also want kubectl and helm. There's a long tail of utilities too: jq for parsing JSON in your wrappers, curl for grabbing remote state or hitting webhooks. Pin the IaC binary alone and the rest drift; the next plan looks fine on your laptop and breaks in CI because somebody's kubectl is two minor versions ahead.
Pick one install method. Pin every version. Now do it again per environment, because production probably can't move at the same pace as dev. Now figure out how to promote a version through environments, and how to communicate the change so nobody runs the wrong binary against the wrong state.
Your team uses Mac, Linux, and Windows — or they will, eventually. You don't know who you'll hire. CI uses something else again. The install method has to work on all of those and produce identical behavior. And on top of that, some root modules will deliberately stay behind on older versions of the IaC binary because the upgrade refactor isn't worth the cost — even as the rest of the toolchain moves forward. So your version-pinning story isn't one number; it's a graph.
The same need to reproduce a toolchain across every laptop and every runner is one of the points I made in Build Your IDP Last. It applies here too.
The common approach is to pick a version manager and commit a manifest file. But every option covers a different slice of the toolchain, so most teams end up combining two or three:
# .tool-versions
terraform 1.9.8
kubectl 1.31.2
helm 3.16.2
jq 1.7.1

asdf is plugin-based, language-aware, and the de facto choice for polyglot teams. It covers Terraform, kubectl, and helm cleanly. jq works through a community plugin whose maintenance comes and goes. curl isn't pinnable — it's whatever the OS ships. Plugin behavior diverges across Mac/Linux/Windows (asdf doesn't run on Windows at all without WSL), and CI runners typically don't bootstrap asdf, so you bolt on setup-terraform and azure/setup-helm actions and now have two parallel install paths to keep in sync.
The common thread: each option covers a different slice of the toolchain, none of them cover all of it cleanly across laptop + CI + every OS your team uses, and the moment one developer joins on a platform the chosen tool doesn't support, the version-pinning story breaks down and somebody has to merge their way out.
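One way teams catch that drift early is a small pre-flight check that compares the manifest against what is actually installed. The following is a minimal sketch, assuming an asdf-style `.tool-versions` file like the one above; the function names `pinned_version` and `check_pin` are hypothetical, and the caller is responsible for extracting the installed version from each tool.

```shell
# Look up the version pinned for a tool in an asdf-style manifest.
# Usage: pinned_version <tool> [manifest]   (manifest defaults to .tool-versions)
pinned_version() {
  awk -v tool="$1" '
    $1 == tool { print $2; found = 1; exit }
    END { if (!found) exit 1 }
  ' "${2:-.tool-versions}"
}

# Compare the pin against a version string the caller obtained from the tool.
# Returns 0 on match, 1 on mismatch, 2 when the tool has no pin at all.
# Usage: check_pin <tool> <actual-version> [manifest]
check_pin() {
  want=$(pinned_version "$1" "${3:-.tool-versions}") || {
    echo "$1: no pin found" >&2
    return 2
  }
  if [ "$want" != "$2" ]; then
    echo "$1: pinned $want, found $2" >&2
    return 1
  fi
}

# Example call (needs terraform and jq on the PATH):
#   check_pin terraform "$(terraform version -json | jq -r .terraform_version)"
```

Running the same check as the first step of both the laptop wrapper and the CI job is what makes a mismatch visible before a plan runs, rather than after.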
SSO for humans, ideally with short-lived role assumption. IAM users where you can't avoid them. In automation, OIDC tokens with subject-claim trust policies, exchanged for cloud credentials at the start of every run. That exchange happens outside Terraform, because Terraform is downstream of having credentials — so you encode it somewhere your runner can do reliably, somewhere your developers can do locally, and ideally those two paths look the same.
This is one of those things that looks small until you have ten repos, three clouds, and a contractor who needs read-only access to two of them.
The common approach is a patchwork: saml2aws, aws-vault, granted/assume, and the AWS Extend Switch Roles Chrome extension on the laptop side, plus aws-actions/configure-aws-credentials in CI — each tool covering a piece of the path. Laptop and runner end up with two flows that have to be kept in sync by hand. Whatever shape you settle on for AWS, the same shape gets repeated for GCP and Azure with different tools and different conventions.
There's also an artifact you won't find in any of those tools: a ~/.aws/config file on every developer's machine, populated with the right SSO start URL, role ARNs, regions, and profile names per account. That file isn't in your repo, so no PR keeps it honest. Teams either ship a shell script that generates it on first run, or maintain an internal wiki page with the canonical snippet for new hires to copy-paste — and both go out of date the first time an account is added or a role is renamed. You find out which developers are stale the next time someone says their plan looks wrong.
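The generator-script route can be sketched as a function that prints one profile stanza per account. This is an illustration only: the function name `sso_profile`, the `SSO_START_URL` and `SSO_REGION` variables, and the placeholder start URL are assumptions, while the keys themselves (`sso_start_url`, `sso_region`, `sso_account_id`, `sso_role_name`, `region`) are the ones the AWS CLI reads from its config file.

```shell
# Print one AWS CLI profile stanza for SSO-based access.
# SSO_START_URL and SSO_REGION are placeholders a team would set once.
# Usage: sso_profile <profile> <account-id> <role-name> <region>
sso_profile() {
  cat <<EOF
[profile $1]
sso_start_url = ${SSO_START_URL:-https://example.awsapps.com/start}
sso_region = ${SSO_REGION:-us-east-1}
sso_account_id = $2
sso_role_name = $3
region = $4

EOF
}

# Example: build a team-managed config from a list kept in the repo.
#   sso_profile prod-admin 123456789012 AdminAccess us-east-2 >  aws-config
#   sso_profile dev-admin  210987654321 AdminAccess us-east-2 >> aws-config
#   export AWS_CONFIG_FILE="$PWD/aws-config"
```

Generating into a file inside the repo and pointing `AWS_CONFIG_FILE` at it keeps the account list under review, at the cost of one more thing every shell session has to export.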
Cloud credentials are the first hop. Most teams need more. Container registry credentials for docker push and docker pull against ECR, GHCR, or a third-party registry. A fresh kubeconfig for EKS, GKE, or AKS. Maybe Helm chart repos. Maybe a private package registry.
None of those come for free. Each requires a CLI call (aws ecr get-login-password for ECR, gh auth token | docker login ghcr.io ... for GHCR, aws eks update-kubeconfig for EKS, the GCP and Azure equivalents) or a purpose-built helper that exchanges your IAM credentials for short-lived tokens. You wire those into the same flow that runs Terraform — locally and in CI — or your developers and your pipelines spend their day chasing 401s.
The common approach is to stick a target in the task runner — make login, just login, an npm script — that shells out to the CLI calls each downstream tool needs. The wrappers don't address token expiry: the token is still valid when the job starts, then expires mid-run and surfaces as a 401 halfway through a docker push or kubectl apply. And every downstream service has its own bespoke incantation — laptop and CI each do it a different way:
# Laptop
aws ecr get-login-password --region us-east-2 \
  | docker login --username AWS --password-stdin \
    123456789012.dkr.ecr.us-east-2.amazonaws.com

# CI
- uses: aws-actions/amazon-ecr-login@v2

The token is good for ~12 hours, then docker push returns a 401 and the job dies in the middle of an image upload. Multi-account or multi-region pipelines need the call repeated with a different --region and registry URL per account, so the "one-line login" multiplies into a matrix the wrapper has to track. And the laptop incantation and the marketplace action are two different code paths reaching for the same credential — when one breaks in CI, you can't reproduce it locally without diverging again.
On top of that, the wrapper assumes everyone has those CLIs installed at the same versions and that the tool behaves the same way on Windows, Linux, and Mac. In CI the handoff is done with marketplace actions; on the laptop it's the raw CLI. If you ever want to reproduce a CI failure locally, the divergence between the two paths is permanent.
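The expiry gap is mostly arithmetic, which a wrapper can do before each long-running step. Here is a minimal sketch, assuming the wrapper records when it last logged in; the function name `needs_refresh` and the idea of storing an issue timestamp are illustrative, not something any of the CLIs above provide.

```shell
# Return 0 (refresh needed) when a token will expire within the given margin.
# Usage: needs_refresh <issued-epoch> <ttl-seconds> <margin-seconds> [now-epoch]
needs_refresh() {
  now=${4:-$(date +%s)}
  expires=$(( $1 + $2 ))
  [ $(( expires - now )) -le "$3" ]
}

# Example: an ECR token lasts about 12 hours (43200 s); re-login when less
# than an hour remains. ISSUED_AT is whatever the wrapper saved at login time.
#   if needs_refresh "$ISSUED_AT" 43200 3600; then make login; fi
```

The margin has to be at least as long as the slowest step that uses the token, which is itself a number somebody has to measure and maintain.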
Chicken, meet egg. Terraform's remote state lives in a bucket — and Terraform can't create that bucket, because it needs it to run. So you decide how the bucket gets bootstrapped. Maybe a one-time CloudFormation template, maybe a script, maybe a special "zeroth" Terraform run with a local backend you migrate later. Pick a path. Document it. Run it once per environment. And keep the bootstrap stack out of arm's reach of routine workflows — if it can be destroyed, it can take the state for every stack in the environment with it.
The common approach is a shell script that creates the bucket, configures versioning and encryption, then migrates the bootstrap state into the bucket it just created:
aws s3api create-bucket --bucket my-tfstate-prod --region us-east-2 \
  --create-bucket-configuration LocationConstraint=us-east-2
aws s3api put-bucket-versioning --bucket my-tfstate-prod \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket my-tfstate-prod \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Migrate the bootstrap state from local backend into the bucket
terraform init -migrate-state \
  -backend-config="bucket=my-tfstate-prod" \
  -backend-config="key=bootstrap/terraform.tfstate" \
  -backend-config="region=us-east-2"

Alternatively, a CloudFormation template that owns the bucket forever. That puts a different IaC tool in charge of the foundation of your IaC, with its own update path, its own drift behavior, and its own bus factor. The shell-script alternative gets treated as a one-shot, even though reproducible environments imply running it again every time you spin up a new one — which makes bootstrap a first-class concept rather than a one-off script. In practice, many teams reuse a single state bucket across every environment, which works until blast radius or compliance scoping comes up.
Production Terraform wants the same shape of config that every other config-management tradition has settled on: organization defaults at the bottom, per-environment overrides on top, per-root-module tweaks on top of that, all layered together. Helm values, Kustomize overlays, Ansible group_vars — different ecosystems, same DRY pattern. Terraform doesn't do it. Variable values get replaced, not deep-merged: hand it two .tfvars files that both set tags = {...} and the second one wins outright — the keys don't combine. So the moment you want DRY, layered config — and you will — you have to encode the layering yourself, outside the language. .tfvars files. CLI -var flags. TF_VAR_* environment variables. JSON files generated at runtime. They all work, none of them compose, and most teams end up stitching two or three of them together with a task runner that picks the right -var-file order per call site.
The common approach is a mix of .tfvars files, TF_VAR_* exports, direnv rules, and
Makefile/Justfile/Taskfile targets that wrap the right -var and -var-file flags onto the Terraform binary.
Configuration design ends up living in the task runner — the layering you wanted ("org defaults → environment
overrides → root-module tweaks") exists only in the order arguments get assembled by whichever Make target you
happened to invoke. Change a value in one place; three places later still hold the old one. The canonical path is
unreconstructable, and a plan diverges between laptop and CI without an obvious reason.
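To make the ordering explicit rather than implicit in a Make target, some teams pull the layering into one function that every call site shares. The sketch below assumes a hypothetical layout of `<vars-dir>/org.tfvars`, `<vars-dir>/<env>.tfvars`, and `<vars-dir>/<env>/<root>.tfvars`; the function name `var_file_args` is also an assumption. It relies only on the documented behavior that later `-var-file` flags override earlier ones, value by value and without deep-merging.

```shell
# Print -var-file flags in precedence order, lowest layer first.
# Later files replace earlier values wholesale (maps are not merged).
# Missing layers are skipped so a root module may omit its own overrides.
# Usage: var_file_args <vars-dir> <env> <root>
var_file_args() {
  for f in "$1/org.tfvars" "$1/$2.tfvars" "$1/$2/$3.tfvars"; do
    [ -f "$f" ] && printf '%s\n' "-var-file=$f"
  done
  return 0
}

# Example (relies on word splitting, so paths must not contain spaces):
#   terraform plan $(var_file_args ../../vars prod vpc)
```

This only centralizes the order. A `tags` map set in both `org.tfvars` and `prod.tfvars` still ends up as the `prod.tfvars` value alone.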
Most teams want a standard set of tags on every resource — Environment, Owner, CostCenter, Project, the rest. That set has to be defined once, applied in every root module, and kept in sync as it evolves. If it lives in tribal knowledge, half your fleet won't have it. If it lives in a shared module, you'd better make sure every root module imports it.
It's a small decision that compounds. Cost allocation, attribution, security audits, cleanup of orphaned resources — all of that gets harder fast when tagging isn't consistent.
In practice, this becomes its own ongoing job — chasing tag-set drift across modules, writing custom validators, leaving the same PR-review comments over and over.
The common approach is a tags module imported by every root, copy-paste reminders in PR templates, and — at the deep
end of the rabbit hole — a tool like Bridgecrew's yor that literally rewrites
your Terraform code to inject tags. Each new root module is a fresh place where the import can be forgotten, and an
updated tag set has to make its way through every consumer of the module by hand.
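A custom validator for the forgotten-import case can be as small as a grep over each root module. This is a rough sketch under stated assumptions: root modules sit one directory below a common parent, and the shared module is called with the block label `module "tags"`. Both are hypothetical conventions, and a text match like this says nothing about whether the tags are actually passed to resources.

```shell
# List root-module directories that never reference the shared tags module.
# The default pattern 'module "tags"' is a hypothetical naming convention.
# Usage: roots_missing_tags <stacks-dir> [pattern]
roots_missing_tags() {
  pattern=${2:-'module "tags"'}
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue
    # -F: fixed string, -q: quiet, -s: ignore directories with no .tf files
    grep -qsF "$pattern" "$dir"*.tf || printf '%s\n' "${dir%/}"
  done
}

# Example CI gate: fail the job when any root module is listed.
#   missing=$(roots_missing_tags terraform)
#   [ -z "$missing" ] || { echo "missing tags module: $missing" >&2; exit 1; }
```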
Now the things you actually write to make Terraform usable as a system, not just a CLI.
CI/CD isn't optional anymore. To ship infrastructure-as-code at the speed developers ship application code, every team running Terraform at meaningful scale needs PR-gated plan, automated apply, an audit trail, and parity between what got reviewed and what got applied. That's table stakes for operating Terraform in a team, not a phase-2 deliverable. Without it, changes back up behind whoever has the laptop with the right credentials, plans drift from reality, "who applied what?" becomes a Slack archaeology project, and the infrastructure side of every release turns into the bottleneck the rest of engineering waits on.
The rest of this section is what that machinery actually costs to build by hand.
Bringing infrastructure up from zero is its own choreography. Bootstrap the backend. Seed the org. Prime the IAM roles. In the right order. With the right credentials. That's not Terraform — that's the thing that runs Terraform.
Copy-paste from a README only goes so far. The same handful of commands gets repeated across stacks, environments, and laptops, and soon enough most teams reach for a tool to automate the sequence. Those tools are called task runners — make, just, go-task, plain shell wrappers. Pick one. Document it. The bar to clear is local reproducibility — the same target has to work on a developer's laptop and in CI, with the same arguments, the same toolchain, and the same outcome. This same runner is also the thing your team will reach for when they need to repeat the same orchestration for sibling pipelines — Packer for golden images, Helm chart releases, schema migrations, whatever else you bake alongside Terraform. The questions you answer here repeat over and over.
The common approach is a Makefile that grew its own DSL. A few targets in, the file is already doing what a programming language is for — argument parsing, conditionals, string manipulation — without the tools to make it readable, and behavior diverges across whichever OS the new hire happens to use:
# Makefile
STACK   ?= $(error STACK is required, e.g. make plan STACK=vpc-prod-ue2)
ROOT    := $(firstword $(subst -, ,$(STACK)))
ENV     := $(word 2,$(subst -, ,$(STACK)))
# Region code is the third dash-separated field (ue2 in vpc-prod-ue2).
REGION  := $(word 3,$(subst -, ,$(STACK)))
WORKDIR := terraform/$(ROOT)
TFVARS  := -var-file=../../vars/org.tfvars \
           -var-file=../../vars/$(ENV).tfvars \
           -var-file=./$(REGION).tfvars
BACKEND := -backend-config=../../backends/$(ENV).hcl

.PHONY: plan apply _check-creds _init
plan apply: _check-creds _init
	cd $(WORKDIR) && terraform $@ $(TFVARS) $(if $(filter apply,$@),-auto-approve,)

# Assumed credential check; the original target body is not shown.
_check-creds:
	aws sts get-caller-identity >/dev/null

_init:
	cd $(WORKDIR) && terraform init -reconfigure $(BACKEND) >/dev/null

Pure HCL is enough until it isn't. The classic example: HashiCorp's Terraform doesn't allow variables in the backend block, and until recently didn't allow them in module.source either. The backend is evaluated before the core boots, so bucket = var.state_bucket is rejected. (OpenTofu 1.8 added early static evaluation that lifts this restriction for variables and locals in both backend and module-source contexts — but that's OpenTofu, not Terraform, and it doesn't reach data sources or runtime values.)
The moment you want the same root module to deploy to multiple regions or accounts, the backend changes per deployment, and you're left juggling -backend-config flags forever or templating the file. The same story plays out with provider configurations that vary per environment, and with monkey-patching third-party modules where you can't change the upstream.
So you pick a templating tool. You wire it into your task runner. You make sure CI runs it before terraform init. And you've got one more thing to maintain.
The common approach is cookiecutter, envsubst, or a hand-rolled Jinja step in CI. But Terraform can't call any
of them — something outside Terraform has to, which means the "native Terraform" workflow is already gone before
terraform init ever runs. The pre-step becomes the actual interface to your stack, and dev and CI drift the
moment they don't render exactly the same file the same way.
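The smallest version of that pre-step is a heredoc rather than a templating tool. The sketch below is illustrative: the function name `render_backend` is hypothetical, the key layout `<root>/<env>/<region>.tfstate` follows the examples used elsewhere in this piece, and it assumes the state bucket lives in the same region as the deployment, which will not hold for every team.

```shell
# Render a backend block for one deployment of a root module.
# Assumes the state bucket is in the deployment's own region.
# Usage: render_backend <bucket> <root> <env> <region>
render_backend() {
  cat <<EOF
terraform {
  backend "s3" {
    bucket = "$1"
    key    = "$2/$3/$4.tfstate"
    region = "$4"
  }
}
EOF
}

# Example: write the file before init, identically on laptop and in CI.
#   render_backend my-tfstate-prod vpc prod us-east-2 > backend.tf
#   terraform init -reconfigure
```

Because the rendered `backend.tf` differs per deployment, it usually has to be git-ignored, which is one more way the file on disk stops matching what was reviewed.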
As your organization grows and the team expands, infrastructure repos multiply — and so does duplication. The same VPC pattern, the same EKS pattern, the same RDS pattern shows up in three teams' codebases, drifting independently. A common response is a library of reusable root modules teams can share — versioned, deployed by reference, the same pattern as a private package registry but for infrastructure.
This is a different problem from sharing child modules. Child modules don't own state; they're building blocks you combine inside a root module to make something deployable, and Terraform already knows how to fetch them. A root-module library is for the deployable units themselves — each one a directory you run terraform apply against, with its own state file — and that's the piece Terraform doesn't ship a way to share.
Terraform runs from a local directory. The folder you point terraform apply at has to be on disk — there's no module "x" { source = "git::..." } for the root module itself. So if the root module lives somewhere else, something has to put a copy of it on disk before init runs.
Terraform does ship terraform init -from-module=git::..., which copies a remote source into the current directory once. It's scaffolding rather than a versioned dependency mechanism — there's no per-run pin that survives the next checkout — so most teams reach for an explicit copy mechanism instead.
A root module is the directory you run terraform apply against. It owns the state file, and it's where Terraform
actually executes. A root module calls child modules with module "x" { source = "..." } — and Terraform
happily fetches those children from Git or a registry.
A root module cannot embed another root module. There is no module "x" { source = "git::..." } for the directory
Terraform itself runs in — only for the children it calls. That gap is the entire reason step 10 exists.
That gap is invisible while you're inside one repo and one team. The moment a shared library exists, you're the one building the copy mechanism.
So you pick a mechanism for getting third-party module source onto disk: vendoring (committing the source into your repo), Git submodules, git subtree, a fetcher script that runs before init, a package-manager-style tool. Vendoring has a nice property worth naming on its own — the source lives in your tree, so PR diffs show exactly what was deployed at any commit. Other mechanisms keep the source out of your tree and trade that auditability for a smaller repo. Neither is wrong; they're different tradeoffs.
Whatever you pick, you're now maintaining a dependency-management story alongside the rest of it.
The common approach is whichever of these mechanisms a team grabbed first — Git submodules and git subtree are the
two most common, with vendor-pull scripts and package-manager-style tools filling out the rest. Any of them works.
The mechanism has to keep the same version pinned consistently across every laptop and every runner; the failure mode
is drift between those copies, not the choice of tool itself.
Add a remote root module to your platform repo:
git submodule add -b main \
  git@github.com:acme/tf-root-vpc.git \
  stacks/vpc

Submodules track commits, not refs, so pinning to a tag is a separate step:

cd stacks/vpc && git checkout v2.0.0 && cd ../..
git add stacks/vpc && git commit -m "Pin vpc to v2.0.0"

Every fresh clone and every CI runner has to re-init or the directory is empty:

git submodule update --init --recursive

Forget that step once and terraform init runs against an empty directory. The .gitmodules URL/branch and the actually-checked-out commit drift independently — reviewers see "Submodule changed" in a PR without seeing what changed.
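A cheap guard against the empty-directory case is to refuse to run Terraform when the root module has no configuration files. This is a minimal sketch with a hypothetical function name, `require_root`; it checks only that some `.tf` or `.tf.json` file exists, not that the checked-out commit is the pinned one.

```shell
# Fail fast when a root-module directory has no Terraform files,
# for example because a submodule was never initialised.
# Usage: require_root <dir>
require_root() {
  for f in "$1"/*.tf "$1"/*.tf.json; do
    [ -f "$f" ] && return 0
  done
  echo "$1: no .tf files found; was 'git submodule update --init --recursive' run?" >&2
  return 1
}

# Example:
#   require_root stacks/vpc && (cd stacks/vpc && terraform init)
```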
Once you're in a monorepo, you don't want every change to trigger every plan. A typo fix in a README shouldn't replan production. You need tooling that reads the diff, understands which root modules are affected by which files, and runs Terraform only on those. This is independent of Terraform itself, and Terraform doesn't ship it.
So you write a path-based CI matrix. Or a Bash script. Or you adopt a tool that understands stack dependencies. Either way, it's one more layer of CI you own.
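The Bash-script option can be sketched in a few lines on top of `git diff --name-only`, with no third-party Action involved. The function name `changed_stacks` and the `terraform/<stack>/...` layout are assumptions for illustration. The sketch only maps paths to the directory they sit in; it has no idea that a change under a shared `modules/` folder affects every root module that calls it.

```shell
# Read changed file paths on stdin and print the unique stack names
# (the second path segment) for files inside <prefix>/<stack>/...
# Files directly under <prefix>/ (such as a README) are ignored.
# Usage: git diff --name-only origin/main...HEAD | changed_stacks terraform
changed_stacks() {
  awk -F/ -v prefix="$1" '$1 == prefix && NF >= 3 { print $2 }' | sort -u
}
```

Turning that list into a CI matrix, and deciding what a change to shared code should trigger, is the part that grows.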
The common approach is tj-actions/changed-files — for years, the most popular pattern for path-based CI matrices. A critical CI primitive is now a third-party dependency. In March 2025, exactly that Action was compromised (tracked as CVE-2025-30066), and the malicious version exfiltrated CI secrets from thousands of repos across the ecosystem before anyone noticed. The point isn't that this Action was uniquely careless; it's that the supply-chain surface was always there. The shape of what teams wire up — changed-files resolves the diff, an awk-and-jq shim turns the file list into a matrix, a fan-out job runs terraform plan per module — looks like this:
# .github/workflows/plan.yml
on: pull_request
jobs:
  detect:
    runs-on: ubuntu-latest
    outputs:
      stacks: ${{ steps.matrix.outputs.stacks }}
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with: { fetch-depth: 0 }
      - id: changed
        uses: tj-actions/changed-files@95690f9ece77c1740f4a55b7f1de9023ed6b1f87 # v46.0.5
        with: { files: terraform/** }
      - id: matrix
        run: |
          echo "stacks=$(echo '${{ steps.changed.outputs.all_changed_files }}' \
            | tr ' ' '\n' \
            | awk -F/ '{print $2}' | sort -u \
            | jq -R -s -c 'split("\n") | map(select(length>0))')" >> "$GITHUB_OUTPUT"
  plan:
    needs: detect
    if: needs.detect.outputs.stacks != '[]'
    strategy:
      # Block style: an expression inside a YAML flow mapping does not parse.
      matrix:
        stack: ${{ fromJSON(needs.detect.outputs.stacks) }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - run: make plan STACK=${{ matrix.stack }}-prod-ue2

Eventually, one root module isn't enough. There's always a reason — governance, security blast radius, scale, performance, parallelism, team ownership. We wrote about this in Service-Oriented Terraform. Decomposition isn't really a Terraform decision; it's an organizational one.
The moment you decompose, you've opened a fresh category of problems. How do roots pass values to each other? Remote state lookups, SSM parameters, a service catalog? Where does each piece's state live, and who can read it? When workload teams own their own repos, how do they reuse modules and reach into shared infrastructure consistently?
These are real architectural questions, and Terraform leaves them entirely to you.
The common approach, once decomposition starts to bite, is terragrunt with dependency blocks pointing at
remote-state buckets across repos. Some teams reach for Makefile target dependencies, or Bazel, to express the DAG
of root-module dependencies. Terragrunt deserves credit here — it pioneered durable patterns for multi-account
organization, dependency wiring, and per-environment config inheritance, and a lot of teams are still on it for good
reason. It solves one slice. CI integration, plan rendering, drift detection, PR comments, supply-chain pinning, and
OIDC are still on you, on top of a second DSL to learn, version, and debug.
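For teams that express the DAG by hand, the ordering itself is a topological sort, which the standard `tsort` utility already does. The sketch below is an illustration, not a replacement for any of the tools above: the function name `apply_order` and the input format (one line per stack, followed by the stacks it depends on) are assumptions.

```shell
# Print stacks in an order where every stack comes after its dependencies.
# Input on stdin: "<stack> [<dependency> ...]" per line, for example
#   vpc
#   eks vpc
#   app eks vpc
# A stack with no dependencies is emitted as a self-pair, which tsort
# treats as "this node exists" without implying any ordering.
# Usage: apply_order < deps.txt
apply_order() {
  awk '
    NF == 1 { print $1, $1 }
    NF >= 2 { for (i = 2; i <= NF; i++) print $i, $1 }
  ' | tsort
}

# Example:
#   apply_order < deps.txt | while read -r stack; do
#     make apply STACK="$stack-prod-ue2"
#   done
```

The ordering is the easy part. Passing outputs between the stacks, and deciding what happens when one apply in the chain fails, is still unaddressed.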
Up to here, everything's been about getting the system to work. Now somebody else has to use it. Take stock of what a new hire — or any engineer touching an unfamiliar root module — has to know to run a single terraform plan. The right binary version, on the path. Cloud credentials, refreshed. Registry credentials, handed off. The right working directory — which root module, in which folder, in which repo, especially after decomposition. The right -var-file flags in the right order, because the config layering from step 6 only resolves correctly if the task runner assembles them correctly. The right -backend-config, if the backend is templated. And whichever workspace selection the layout demands.
That's a checklist that lives in someone's head, in a Makefile target nobody remembers naming, or in a Slack thread from six months ago. Terraform doesn't ship a "what should I run, here?" — so the command, the folder, the flags, and the prerequisites stay there, and onboarding is whatever the previous hire wrote down before getting pulled into something else.
Teams settle in one of two places. One is a Makefile or Justfile that grows a target per root module
(make plan-vpc-prod, make plan-eks-dev, …). The wall of targets becomes the new discoverability problem: an
engineer asking "what do I run for vpc in prod-uw2?" either finds the matching target or reads the wrapper to figure
out what to type. The other is a thinner wrapper that takes a stack name as an argument — which works, but only after
the developer already knows the stack name, the -var-file order, and which -backend-config to pass. Both relocate
the long argument list rather than encoding the layout from step 1.
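The thinner-wrapper variant usually starts with a parser for the stack name. Below is a minimal sketch, assuming the `<root>-<env>-<region>` naming used in this piece (`vpc-prod-ue2`); the function name `parse_stack` is hypothetical. It splits from the right, so a root module name that itself contains dashes still parses.

```shell
# Split a stack name like "vpc-prod-ue2" into root, environment, region code.
# Prints "root env region" on success; returns 1 on a malformed name.
# Usage: parse_stack <stack>
parse_stack() {
  case "$1" in
    *-*-*) ;;
    *) echo "expected <root>-<env>-<region>, got '$1'" >&2; return 1 ;;
  esac
  region=${1##*-}        # text after the last dash
  rest=${1%-*}           # everything before the last dash
  envname=${rest##*-}    # the field before the region
  root=${rest%-*}        # whatever remains, dashes included
  printf '%s %s %s\n' "$root" "$envname" "$region"
}

# Example:
#   set -- $(parse_stack vpc-prod-ue2)   # $1=vpc $2=prod $3=ue2
```

The parser moves the convention into code, but the developer still has to know the stack name before anything can be parsed.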
Same "plan the VPC in prod, us-east-2," three different places teams park the incantation:
cd terraform/vpc
terraform init -reconfigure \
  -backend-config=../../backends/prod.hcl \
  -backend-config="key=vpc/prod/us-east-2.tfstate"
terraform workspace select prod-ue2 || terraform workspace new prod-ue2
terraform plan \
  -var-file=../../vars/org.tfvars \
  -var-file=../../vars/prod.tfvars \
  -var-file=./prod-ue2.tfvars \
  -var region=us-east-2 \
  -var environment=prod \
  -out=plan.out

Once your team is running plans, the next surface is review. Reviewers don't open a terminal; they look at the PR, the CI job UI, and whatever a deploy posts back. The next four steps are about that review surface — making CI legible to the people who didn't write the change. None of it ships in Terraform.
Raw terraform plan output is not friendly. Inside a GitHub Actions job UI, it's a wall of green plus signs and red minus signs. You want a clean, scannable summary at the top of the job — what's changing, where, and how much. That's a tool, an action, or a script. Pick one.
The common approach is tfcmt — wrap terraform plan and write the
formatted result to $GITHUB_STEP_SUMMARY so it lands at the top of the job UI:
# .github/workflows/plan.yml (excerpt)
- name: Install tfcmt
  run: |
    curl -fsSL https://github.com/suzuki-shunsuke/tfcmt/releases/download/v4.14.5/tfcmt_linux_amd64.tar.gz \
      | tar -xz -C /usr/local/bin tfcmt
- name: Plan + summary
  run: |
    tfcmt --output "$GITHUB_STEP_SUMMARY" plan -patch -- \
      terraform plan -no-color -out=plan.out

It's one more pinned dependency — a binary you curl-install at a fixed release, with its own maintainer, its own release cadence, and its own seat on your supply-chain surface.
Reviewers shouldn't have to click into the job to see what's changing. A PR comment with the plan summary is now table stakes. You'll need an action that posts it, updates it on subsequent pushes (rather than spamming a new one), and survives force-pushes without leaving stale comments behind. That action either exists, or you write one, or you live with the spam.
The common approach is to take the markdown summary from step 14 and post it through
peter-evans/create-or-update-comment, paired with
peter-evans/find-comment to find the existing sticky and update it
in place instead of spamming a new one on every push:
# .github/workflows/plan.yml (excerpt — continues from step 14)
- name: Build PR comment body
  run: |
    {
      echo '<!-- terraform-plan:vpc -->'
      echo '## Terraform plan: vpc'
      echo
      echo '```diff'
      terraform show -no-color plan.out
      echo '```'
    } > plan.md
- name: Find existing plan comment
  id: find
  uses: peter-evans/find-comment@b30e6a3c0ed37e7c023ccd3f1db5c6c0b0c23aad # v4.0.0
  with:
    issue-number: ${{ github.event.pull_request.number }}
    comment-author: github-actions[bot]
    body-includes: "<!-- terraform-plan:vpc -->"
- name: Post or update plan comment
  uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5.0.0
  with:
    issue-number: ${{ github.event.pull_request.number }}
    comment-id: ${{ steps.find.outputs.comment-id }}
    edit-mode: replace
    body-path: plan.md

You now own a small protocol — the HTML-comment marker <!-- terraform-plan:vpc --> is the only thing that lets find-comment re-locate the sticky. One marker per stack and per environment, or comments collide and overwrite each other. GitHub also caps a single comment at 65,536 characters; large estates blow past that and the workflow has to truncate or split. And there are two more pinned dependencies — find-comment and create-or-update-comment — on the same supply-chain surface step 11 already warned about.
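The truncation half of that can be sketched as a filter placed between `terraform show` and the file the comment is posted from. The function name `fit_comment` and the notice text are hypothetical. It counts bytes rather than characters, which is conservative for the 65,536-character cap but can cut a multi-byte character in half, and it does not close an open code fence in the truncated body.

```shell
# Trim a comment body read from stdin so it fits within <limit> bytes,
# appending a notice when something was cut. Defaults to GitHub's 65,536.
# Usage: terraform show -no-color plan.out | fit_comment [limit] > plan.md
fit_comment() {
  limit=${1:-65536}
  notice='[plan truncated; see the job log for the full output]'
  body=$(cat)
  size=$(( $(printf '%s' "$body" | wc -c) ))
  if [ "$size" -le "$limit" ]; then
    printf '%s' "$body"
  else
    # Reserve room for the notice plus the newline that precedes it.
    keep=$(( limit - ${#notice} - 1 ))
    printf '%s' "$body" | head -c "$keep"
    printf '\n%s' "$notice"
  fi
}
```

Splitting across several comments instead of truncating means one marker per chunk, which makes the sticky-comment protocol above correspondingly more involved.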
If you spin up preview environments per pull request, you need a place to surface the URL. GitHub's Deployments API is the right surface — it gives you a clean status indicator on the PR and a deployments tab on the repo. Pick a tool that posts there. Make sure it cleans up the deployment when the PR closes, or you'll have a graveyard of stale "active" environments before long.
The common approach is bobheadxi/deployments or
chrnorm/deployment-action — both unofficial, both written in
TypeScript, both adding to the supply-chain surface. Almost nobody owns the GitHub App side of preview-environment
plumbing in-house, so the question quietly becomes which third-party Action you bet on and how you handle it when the
maintainer disappears.
If subsequent CI steps consume Terraform outputs — uploading assets to a freshly-created bucket, triggering a deployment to a freshly-created cluster — you need to translate those outputs into GitHub-style environment variables or step outputs. Pick a tool. Test it against complex output types. Keep it working when Terraform's output format shifts between versions.
The common approach is terraform output -json piped through jq and echoed into $GITHUB_OUTPUT:
# Scalar string output: works fine
echo "bucket_name=$(terraform output -raw bucket_name)" >> "$GITHUB_OUTPUT"
# Nested object output: has to use the delimited multiline form.
# Generate the delimiter once; the opening and closing lines must match exactly.
delim="EOF_VPC_$(uuidgen)"
{
  echo "vpc_config<<${delim}"
  terraform output -json vpc_config | jq -c .
  echo "${delim}"
} >> "$GITHUB_OUTPUT"
Two things stack here. First, complex output types (nested objects, lists of objects, anything that isn't a scalar
string) need a custom jq expression per output, and that expression lands as a copy-pasted shell snippet in every
workflow that consumes the value, drifting the moment one of them gets edited. Second, $GITHUB_OUTPUT uses a
delimited multiline format (name<<DELIM … value … DELIM) for any value containing newlines or JSON. Producing that
correctly out of jq for a nested object means picking a delimiter that can't appear in the value, getting the quoting
right, and handling the case where jq emits multiple lines. The first nested output is usually where this breaks —
truncated values surface in downstream steps, and the per-workflow shell snippet ends up never quite the same in two
places.
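One way to keep that snippet from drifting is to write the delimiter handling once and call it everywhere. The sketch below is an assumption about how a team might do that, not an established tool: emit_output is a hypothetical name, and the delimiter prefix is arbitrary.

```sh
#!/bin/sh
# Hypothetical helper: append one name/value pair to $GITHUB_OUTPUT using the
# delimited multiline form, with a delimiter generated once per call.
emit_output() (
  name="$1"; value="$2"
  while :; do
    # 16 random hex characters; try again in the unlikely case the value contains them.
    delim="ghadelim_$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')"
    case "$value" in
      *"$delim"*) continue ;;
      *) break ;;
    esac
  done
  printf '%s<<%s\n%s\n%s\n' "$name" "$delim" "$value" "$delim" >> "$GITHUB_OUTPUT"
)

# Example wiring (not run here):
#   emit_output vpc_config "$(terraform output -json vpc_config | jq -c .)"
```

Because the same form handles single-line and multi-line values, scalar outputs can go through the helper too, which leaves one code path to test instead of two.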
A short aside on those last four. Each one is a small piece of CI ergonomics, and each typically gets solved by reaching for a third-party GitHub Action, most of them written in TypeScript with their own transitive node_modules dependency tree. Browse a popular collection like dflook/terraform-github-actions and count: terraform-fmt, terraform-validate, terraform-plan, terraform-apply, terraform-output, terraform-version, terraform-new-workspace, terraform-destroy-workspace, and so on. A team running real CI ends up pinning a dozen or more, each one expanding the supply-chain surface area of your infrastructure pipeline. The compromise of tj-actions/changed-files in March 2025 made the cost of that surface area concrete: a single popular Action was modified to exfiltrate CI secrets across thousands of repos before anyone noticed. The point isn't that GitHub Actions are dangerous. By the fifth Action, you've assembled a supply chain you didn't design.
By the time a team's CI is doing all of those things (change detection, OIDC, plan, sticky comment, deployment, output piping), a single deploy job stitches together nine Actions from six separate maintainers, each pinned to a SHA:
# .github/workflows/apply.yml (excerpt)
jobs:
  apply:
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read, pull-requests: write, deployments: write }
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - uses: hashicorp/setup-terraform@b9cd54a3c349d3f38e8881555d616ced269862dd # v3.1.2
      - uses: aws-actions/configure-aws-credentials@b47578312673ae6fa5b5096b330d9fbac3d116df # v4.2.1
        with: { role-to-assume: arn:aws:iam::123456789012:role/CIRole, aws-region: us-east-2 }
      - uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
      - uses: bobheadxi/deployments@648679e8e4915b27893bd7dbc35cb504dc915bc8 # v1.5.0
        id: deployment
        with: { step: start, env: prod }
      - uses: dflook/terraform-plan@5fc11949b8db4d3f3e75bea2e2cc1f6d11afcc8a # v2.5.0
        id: plan
        with: { path: terraform/vpc, var_file: vars/prod.tfvars }
      - uses: suzuki-shunsuke/tfcmt-action@b1f9f7a0b5b8b2dbcd0fce2e0a6b3c0d8a3f1c2e # v1.2.0
        with: { config: .tfcmt.yaml }
      - uses: dflook/terraform-apply@5fc11949b8db4d3f3e75bea2e2cc1f6d11afcc8a # v2.5.0
        with: { path: terraform/vpc, var_file: vars/prod.tfvars, auto_approve: true }
      - uses: dflook/terraform-output@5fc11949b8db4d3f3e75bea2e2cc1f6d11afcc8a # v2.5.0
        id: tf-output
        with: { path: terraform/vpc }
      - uses: bobheadxi/deployments@648679e8e4915b27893bd7dbc35cb504dc915bc8 # v1.5.0
        if: always()
        # Expressions are quoted: a bare ${{ ... }} is not valid inside a YAML flow mapping.
        with:
          step: finish
          status: "${{ job.status }}"
          deployment_id: "${{ steps.deployment.outputs.deployment_id }}"
          env_url: "${{ steps.tf-output.outputs.app_url }}"
The system is built. Now you have to live with it.
Once you have dozens of root modules across a handful of environments, just seeing what's there becomes a job. Which root modules are deployed where? What does the merged configuration look like for prod-us-west-2? Which stacks reference which root modules, with what overrides?
You'll want a CLI that can list root modules, list stacks, describe the composed config for any one of them, and answer those questions without grepping through directories. If you don't have one, your developers will start writing little scripts that do half of it. Then those scripts will diverge.
The common approach is some flavor of directory walk taped to the wiki as the "how to see what we have" snippet:
# What root modules do we have?
find terraform -maxdepth 1 -mindepth 1 -type d
# Which stacks reference each one?
grep -rl "terraform/vpc" stacks/
It relies on the directory tree being the source of truth, which reads back layout, not configuration, so it can't
answer "what does the merged config look like for prod-us-west-2" or "which stacks override what." Each developer who
needs that writes their own little script that does half of it, and the scripts diverge. A SaaS runner like Terraform
Cloud or Spacelift answers "what's deployed where" via its workspace list — but not "what's the composed config for
this one"; that question is still on you.
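For a sense of what those little scripts look like, here is a sketch of the layout-only half. It assumes root modules sit one directory deep and that stack files mention them by path, as in the grep above. The function name is hypothetical, and it still says nothing about merged configuration or overrides.

```sh
#!/bin/sh
# Hypothetical inventory helper. Assumes root modules live one directory deep
# under $1 and stack files under $2 mention them by path (e.g. "terraform/vpc").
# Prints one "module<TAB>stack-file" line per reference.
list_module_refs() (
  modules_dir="$1"; stacks_dir="$2"
  for module_path in "$modules_dir"/*/; do
    [ -d "$module_path" ] || continue
    name=$(basename "$module_path")
    # Require a non-name character (or end of line) after the path so that
    # "terraform/vpc" does not also match "terraform/vpc-peering".
    grep -rlE -- "${modules_dir}/${name}([^[:alnum:]_-]|\$)" "$stacks_dir" 2>/dev/null |
      sort |
      while IFS= read -r stack_file; do
        printf '%s\t%s\n' "$name" "$stack_file"
      done
  done
)

# Example wiring (not run here):
#   list_module_refs terraform stacks
```

Even this much encodes a convention (how stacks name their root module), which is exactly the kind of detail that differs between two developers' versions of the script.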
The people who use what you've provisioned aren't all running terraform plan. Somebody has to write secrets into Secrets Manager once the database is up. Somebody has to set parameters in Parameter Store that the app reads on boot. Somebody has to upload an artifact to the S3 bucket the app consumes. Somebody has to roll a credential at 3 a.m. The boring everyday work that sits between "infrastructure exists" and "the app uses it." You'll write down the commands they run — the playbooks — and that's a piece of the platform too.
Where do those playbooks live, though? Makefiles are convenient until you need to pass arguments cleanly, which Make doesn't do well. Justfiles handle arguments better. go-task is solid. Plain shell scripts only behave consistently across Mac, Linux, and Windows if you're disciplined about POSIX — and the moment somebody on Windows joins the team, that discipline breaks. Whichever you pick is one more binary to install on every laptop and every runner, one more set of conventions to teach, one more thing to keep current.
So you pick something. You commit to it. You make sure it works on every laptop your team carries. And you keep adding to it as the system grows, because every new piece of infrastructure ships with a new playbook for whoever consumes it.
The common approach is a Makefile with thirty targets and a README section called "Common Tasks." Make doesn't do
argument-passing well, behaves differently on Windows, and the README section lags the targets every time a new
playbook is added. And there isn't one Make: macOS ships GNU make 3.81 (released in 2006), most Linux distros and CI
runners ship GNU make 4.x, and BSD make is a different dialect again. They diverge past the basics: 3.81 lacks
.ONESHELL, .RECIPEPREFIX, $(file ...), and grouped targets, and BSD make uses its own conditional and include
syntax. Either you write to GNU make 4.x and document brew install make (then call it as gmake), or you write to a
portable subset that gives up half of what made you reach for Make in the first place. A representative slice
follows; note the %: ; @: no-op at the bottom, wired in to eat positional arguments because Make doesn't actually
support them:
# Makefile (excerpt; actual file is ~30 targets)
# NOTE: written for GNU make 4.x; macOS ships GNU make 3.81, which lacks several 4.x features.
# `brew install make` and invoke as `gmake`, or document the divergence in onboarding.
.PHONY: login seed-secrets set-params upload-fixtures rotate-secret refresh-creds
ENV ?= dev
NAME ?= $(error NAME is required, e.g. make rotate-secret NAME=db-password)
login: ; aws sso login --profile $(ENV)-admin
seed-secrets: login ; ./scripts/seed-secrets.sh $(ENV) $(filter-out $@,$(MAKECMDGOALS))
set-params: login ; ./scripts/set-params.sh $(ENV)
upload-fixtures: login ; aws s3 sync ./fixtures s3://$(ENV)-app-fixtures --profile $(ENV)-admin
rotate-secret: login ; aws secretsmanager rotate-secret --secret-id app/$(ENV)/$(NAME) --profile $(ENV)-admin
refresh-creds: login ; aws sts get-caller-identity --profile $(ENV)-admin
# Eats positional arguments to `make` so `make seed-secrets my-branch` doesn't error.
%: ; @:
Twenty steps in. A new hire should be able to read the docs and ship safely. That means everything above (layout, install, auth, state, runner, config, tags, templating, module sourcing, decomposition, inventory, playbooks, CI) has to be written down and kept current. Per-root-module reference docs are their own pipeline: variables, outputs, providers, and resources go out of date the moment someone edits the HCL. The community answer is generation, and the de facto tool is terraform-docs.
Beyond the docs themselves, each accumulated tool needs a maintenance story: who tracks releases, how upgrades land, and ideally a contract test or two so a dependency bump doesn't quietly break the next workflow run. Most teams don't write those tests, and find out something broke when a workflow fails on what looked like a routine change. Tools also come and go: driftctl entered maintenance mode in 2023, popular Actions get archived, and the supply-chain surface widens with every one you adopt. In March 2025, the compromise of tj-actions/changed-files (tracked as CVE-2025-30066) exfiltrated CI secrets from thousands of repos before anyone noticed.
And there's the cognitive load. The first month for a new engineer is a dozen unrelated tools, each with its own configuration, command structure, update cadence, and the workarounds the team patched in to make them play together. None of that is anybody's fault. It's the natural result of solving twenty-one decisions independently. It's just a lot to absorb before you can ship.
The common approach is terraform-docs in a pre-commit hook for the per-module table, alongside a docs/ folder for
architecture and onboarding. The reference table is regenerated on every change; the prose docs update on whatever
cadence the team writes them.
Once everything above is in place, there's still one last thing.
Reality and your state file diverge. It's not an if. Providers change behavior between versions, and a re-apply you didn't run silently changes the truth. Click-ops happens during incidents — somebody fixes prod through the console at 2 a.m., and your Terraform code doesn't know. And, occasionally, you've been breached and you don't know it — an attacker has changed something live and your code is the last place that'll catch it.
Detection is the easier half. You schedule a periodic plan. Diff the result. Pipe it to a Slack channel. Terraform doesn't do any of that for you, so you build it.
Reconciliation is the harder half — the part nobody wants to design upfront. What does drift resolve to? Some drifts get auto-applied back to source. Some get a ticket. Some block deploys until a human reviews them. Some need a person to decide whether the source or the live resource is right. None of that is a terraform plan flag; it's a workflow you build, with approval gates, audit trails, and paging policies attached.
This is the capstone — the step that makes everything before it actually keep working over time. Without it, drift accumulates silently, and your codebase slowly turns into a fossil of how things used to be configured.
If you're just running Terraform, the common approach is a scheduled GitHub Action that runs
terraform plan -detailed-exitcode per root module on a cron, posts the diff to a Slack channel, and opens a PR or
ticket when the exit code is 2 (changes present; exit code 1 means the plan itself failed). That works on day one.
The failure mode it ages into is that there's no policy distinguishing drift that should auto-reconcile from drift
that should page someone, so every drift gets the same treatment, which after a quarter is "ignored." The Slack
channel and cron get whatever attention the original author still has cycles for, and driftctl (the tool many teams
reached for to enrich the diff) has been in maintenance mode since 2023.
Either way, the reconciliation-policy layer — auto-revert vs. ticket vs. block-the-deploy vs. page on-call, plus
who's allowed to override any of those at 2 a.m. — is additional tooling on top of terraform plan. You build it
yourself, or you pay for a SaaS that does it for you (Terraform Cloud, Spacelift, env0, Scalr, others). The choice
is real, but the layer doesn't disappear.
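The smallest possible version of that layer is a function from plan result and stack to action. The sketch below is only that: the stack-name patterns and action names are assumptions for illustration, not a recommended policy. Approval gates, audit trails, and overrides would all sit on top.

```sh
#!/bin/sh
# Hypothetical reconciliation policy: map a `terraform plan -detailed-exitcode`
# result and a stack name to one action. Patterns and action names are
# illustrative assumptions only.
drift_action() (
  exit_code="$1"; stack="$2"
  case "$exit_code" in
    0) echo "none" ;;                      # no changes: live matches code
    2) case "$stack" in                    # changes present: drift
         sandbox-*) echo "auto-apply" ;;
         prod-*)    echo "block-deploys" ;;
         *)         echo "ticket" ;;
       esac ;;
    *) echo "page" ;;                      # 1 or anything else: the plan itself failed
  esac
)

# Example wiring inside the scheduled job (not run here):
#   terraform -chdir=terraform/vpc plan -detailed-exitcode -input=false > plan.txt 2>&1
#   action=$(drift_action "$?" "prod-us-west-2")
```

Keeping the policy in one function at least makes it reviewable. The hard questions (who may override it, and what gets recorded when they do) are not answered by a case statement.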
And per-stack drift is only the bottom layer. The view a team running infrastructure at scale actually wants is fleet-wide: which stacks are drifting right now, which have been perma-drifting for weeks (the drift nobody plans to fix because reverting would break something else), which workflow runs are failing and at what rate, and — when something does regress — what change to that stack landed most recently. That's change-failure-rate-and-MTTR for infrastructure: the DORA layer, applied to the platform. Terraform doesn't get you anywhere near it. Building it the Hard Way is OpenTelemetry from GitHub Actions into Prometheus, Grafana, or Datadog, with stack-and-root-module labels you keep clean across hundreds of workflow runs and hand-built dashboards somebody has to own. It's possible. It's also a platform engineer for a quarter, and then a permanent owner.
Look back at the list. Each of those twenty-one crossroads has its own small ecosystem of choices: a version manager, a templating tool, a fetch mechanism, an Action for change detection, an Action for the plan summary, an Action for deployments, a tool for outputs, a renderer for docs, a runner for drift. Compound twenty-one decisions across that catalog and you're maintaining dozens of independently-versioned tools, glue scripts, and CI shims with no shared conventions anywhere.
Or one tool that handles the whole set with the same conventions all the way through.
Most teams end up with the dozens, not by choice but one decision at a time. Each individual decision is reasonable; the result is the system this post just walked you through.
The other path is a framework — one tool that's already made the decisions, with the same conventions across every step. Atmos is the one we built and the one we use. It's not the only valid choice — Terragrunt has been doing real work in this space for years, and a team with its own framework that fits should keep using it. The point of this post isn't which framework. It's that the choice exists, and the cost of treating it as one decision instead of twenty-one is what determines whether the system you're running in three years is something you designed or something that happened to you.
The companion to this post — Terraform the Easy Way — walks the same crossroads with concrete Atmos snippets at each one, so you can see what each decision looks like once a framework has made it for you.
If you're somewhere in this list and want a second set of eyes on what to keep, what to consolidate, and where a framework would buy you the most leverage right now, let's talk.