Terraform the Hard Way

terraform · devops · infrastructure-as-code · platform-engineering · ci-cd · github-actions

by Erik Osterman, CEO & Founder of Cloud Posse
May 8, 2026

Kelsey Hightower wrote Kubernetes the Hard Way almost a decade ago, and he was clear from the first sentence about what it wasn't. It wasn't a deployment guide. It wasn't a recommendation. The whole point was to walk you through standing up a cluster yourself, by hand, so you'd see what the abstractions normally hide — and then go pick a managed offering with a much better understanding of what you were running.

Here's the Terraform equivalent. Not how I'd recommend running Terraform. What it actually takes.

A note up front. If you're new to infrastructure as code and you're staring at this list thinking "all of this for terraform apply?" — that's fair. For a hello-world, most of it is overkill. We get it. This isn't a list for hello-world. It's the list of decisions a team makes on the way to production-grade, maintainable Terraform that holds up across years, environments, teams, and regions. If you're already there, none of this will be a surprise. If you're not, this is the road.

To keep the list legible, I've grouped it into three phases: things you design before you write much code, things you build to make it run, and things you operate to keep it running. The phases overlap in practice — every "design" decision gets revisited the first time it survives contact with reality — but they're a useful way to read.

Design

Decisions that shape everything that comes after. Easier to make once, deliberately, than to migrate later.

1. Decide your repo layout

This is the first decision and the easiest to get wrong. One repo or many. If one, how do infrastructure changes coordinate with application changes — same PR, separate PRs, gated by approval? If many, how do they share modules, state, and conventions? Folder structure inside each repo. Naming. Where stacks live, where root modules live, where shared code lives.

Everything a framework would encode for you, you'll encode by hand in conventions, READMEs, and tribal knowledge. Either way, it's a decision — not a discovery — and the longer you wait to make it deliberately, the more migration work you've signed up for later.

Common approach

There isn't one common approach — there are four, each with a distinct failure mode.

  • Folder per environment (prod/, dev/, staging/). The most common starting point. Layering between environment and root module is ad hoc; every new piece of infrastructure makes its own placement choice; the first multi-region or second-account need breaks the convention.
  • Folder per app or service, with environments as subfolders inside. Environment-wide concerns — org defaults, regional overrides, shared networking — have nowhere clean to live, so cross-app coordination ends up in scripts and tribal knowledge.
  • Folder per root module, often with prod.tfvars and dev.tfvars next to it. Fifty leaf folders with identical backend.tf and provider.tf boilerplate; promoting a config change across environments means editing N files; nothing enforces consistency between siblings.
  • One folder with everything — a single root module that keeps growing. Monolithic state, plan times that balloon, blast radius that's the whole estate, and a decomposition project later that takes quarters.

All four are stable starting points, and all four quietly accumulate the same gap: a README that punts on the real questions — where centralized logs go, where DNS zones get managed, how multi-region deployments work, why you want more AWS accounts than you think — and a layout that stays inconsistent forever, with cookiecutter scripts and an INFRASTRUCTURE.md papering over the gaps until untangling it later costs more than getting it right would have.
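For concreteness, here's the shape of the first option, the folder-per-environment layout, with hypothetical names; the near-identical backend.tf siblings are where the boilerplate problem starts:

# Folder-per-environment layout (names are hypothetical)
.
├── prod/
│   ├── main.tf            # grows into a per-environment monolith
│   ├── backend.tf         # near-identical copy in every sibling
│   └── terraform.tfvars
├── staging/
│   ├── main.tf
│   ├── backend.tf
│   └── terraform.tfvars
└── dev/
    ├── main.tf
    ├── backend.tf
    └── terraform.tfvars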

2. Pick how you'll install your toolchain

You don't just need Terraform or OpenTofu. You need kubectl for cluster operations, helm for chart releases, jq for parsing JSON in your wrappers, curl for grabbing remote state files or hitting webhooks, and the cloud CLIs (aws, gcloud, az) for everything in step 4. Pin the IaC binary alone and the rest drift; the next plan looks fine on your laptop and breaks in CI because somebody's kubectl is two minor versions ahead.

Pick one install method. Pin every version. Now do it again per environment, because production probably can't move at the same pace as dev. Now figure out how to promote a version through environments, and how to communicate the change so nobody runs the wrong binary against the wrong state.

Your team uses Mac, Linux, and Windows — or they will, eventually. You don't know who you'll hire. CI uses something else again. The install method has to work on all of those and produce identical behavior. And on top of that, some root modules will deliberately stay behind on older versions of the IaC binary because the upgrade refactor isn't worth the cost — even as the rest of the toolchain moves forward. So your version-pinning story isn't one number; it's a graph.

The same need to reproduce a toolchain across every laptop and every runner is one of the points I made in Build Your IDP Last. It applies here too.

Common approach

The common approach is to pick a version manager and commit a manifest file. But every option covers a different slice of the toolchain, so most teams end up combining two or three:

# .tool-versions
terraform 1.9.8
kubectl   1.31.2
helm      3.16.2
jq        1.7.1

Take asdf, the de facto choice for polyglot teams: plugin-based and language-aware, it covers Terraform, kubectl, and helm cleanly. jq works through a community plugin whose maintenance comes and goes. curl isn't pinnable — it's whatever the OS ships. Plugin behavior diverges across Mac/Linux/Windows (asdf doesn't run on Windows at all without WSL), and CI runners typically don't bootstrap asdf, so you bolt on setup-terraform and azure/setup-helm actions and now have two parallel install paths to keep in sync.

The common thread: each option covers a different slice of the toolchain, none of them cover all of it cleanly across laptop + CI + every OS your team uses, and the moment one developer joins on a platform the chosen tool doesn't support, the version-pinning story breaks down and somebody has to merge their way out.

3. Authenticate to your cloud

SSO for humans, ideally with short-lived role assumption. IAM users where you can't avoid them. In automation, OIDC tokens with subject-claim trust policies, exchanged for cloud credentials at the start of every run. That exchange happens outside Terraform, because Terraform is downstream of having credentials — so you encode it somewhere your runner can execute reliably, somewhere your developers can run locally, and ideally those two paths look the same.

This is one of those things that looks small until you have ten repos, three clouds, and a contractor who needs read-only access to two of them.

Common approach

The common approach is a patchwork: saml2aws, aws-vault, granted/assume, the AWS Extend Switch Roles Chrome extension on the laptop side, plus aws-actions/configure-aws-credentials in CI — each tool covering a piece of the path. But the problem with this is that it leads to two divergent auth flows — laptop and runner — that have to be kept in sync by hand. And whatever shape you settle on for AWS, you repeat — different tools, different conventions — for GCP and Azure, so every additional cloud multiplies the divergence rather than amortizing the work.

There's also an artifact you won't find in any of those tools: a ~/.aws/config file on every developer's machine, populated with the right SSO start URL, role ARNs, regions, and profile names per account. That file isn't in your repo, so no PR keeps it honest. Teams either ship a shell script that generates it on first run, or maintain an internal wiki page with the canonical snippet for new hires to copy-paste — and both go out of date the first time an account is added or a role is renamed. You find out which developers are stale the next time someone says their plan looks wrong.
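The shape of that file, with placeholder values. Every developer carries some variant of it, and nothing in the repo checks it:

# ~/.aws/config (placeholder start URL and account IDs)
[sso-session acme]
sso_start_url = https://acme.awsapps.com/start
sso_region    = us-east-2

[profile prod-admin]
sso_session    = acme
sso_account_id = 123456789012
sso_role_name  = AdministratorAccess
region         = us-east-2

[profile dev-admin]
sso_session    = acme
sso_account_id = 210987654321
sso_role_name  = AdministratorAccess
region         = us-east-2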

4. Hand auth off to your downstream tools

Cloud credentials are the first hop. Most teams need more. Container registry credentials for docker push and docker pull against ECR, GHCR, or a third-party registry. A fresh kubeconfig for EKS, GKE, or AKS. Maybe Helm chart repos. Maybe a private package registry.

None of those come for free. Each requires a CLI call (aws ecr get-login-password, aws eks update-kubeconfig, the GCP and Azure equivalents) or a purpose-built helper that exchanges your IAM credentials for short-lived tokens. You wire those into the same flow that runs Terraform — locally and in CI — or your developers and your pipelines spend their day chasing 401s.

Common approach

The common approach is to stick a target in the task runner — make login, just login, an npm script — that shells out to the CLI calls each downstream tool needs. But the wrapper doesn't address the most expensive failure mode: silent token expiry surfacing as a 401 mid-pipeline. The token expires, the wrapper has no idea, and the next push or kubectl apply fails halfway through whatever sequence the job was running. And every downstream service has its own bespoke incantation — laptop and CI each do it a different way:

# Laptop
aws ecr get-login-password --region us-east-2 \
  | docker login --username AWS --password-stdin \
      123456789012.dkr.ecr.us-east-2.amazonaws.com
# CI
- uses: aws-actions/amazon-ecr-login@v2

The token is good for ~12 hours, then docker push returns a 401 and the job dies in the middle of an image upload. Multi-account or multi-region pipelines need the call repeated with a different --region and registry URL per account, so the "one-line login" multiplies into a matrix the wrapper has to track. And the laptop incantation and the marketplace action are two different code paths reaching for the same credential — when one breaks in CI, you can't reproduce it locally without diverging again.

On top of that, the wrapper assumes everyone has those CLIs installed at the same versions, behaving the same way on Windows, Linux, and Mac.

5. Decide how state is bootstrapped and stored

Chicken, meet egg. Terraform's remote state lives in a bucket — and Terraform can't create that bucket, because it needs it to run. So you decide how the bucket gets bootstrapped. Maybe a one-time CloudFormation template, maybe a script, maybe a special "zeroth" Terraform run with a local backend you migrate later. Pick a path. Document it. Run it once per environment. And hope nobody fat-fingers a destroy against the bootstrap stack.

Common approach

The common approach is a shell script that creates the bucket, configures versioning and encryption, then migrates the bootstrap state into the bucket it just created:

aws s3api create-bucket --bucket my-tfstate-prod --region us-east-2 \
  --create-bucket-configuration LocationConstraint=us-east-2
aws s3api put-bucket-versioning --bucket my-tfstate-prod \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket my-tfstate-prod \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
 
# Migrate the bootstrap state from local backend into the bucket
terraform init -migrate-state \
  -backend-config="bucket=my-tfstate-prod" \
  -backend-config="key=bootstrap/terraform.tfstate" \
  -backend-config="region=us-east-2"

Alternatively, a CloudFormation template that owns the bucket forever. But the problem with this is that it leads to a different IaC tool managing the foundation of your IaC — its own update path, its own drift behavior, its own bus factor. And it isn't a one-shot operation, even though it gets treated like one: if you actually want reproducible environments, you bootstrap state every time you spin up a new one — which means this should be a first-class concept in how you manage Terraform, not a brittle script. When it isn't, teams quietly take the obvious shortcut and reuse a single state bucket across every environment, which works fine until somebody asks about blast radius or compliance scoping.

6. Decide how configuration flows in

Production Terraform wants the same shape of config that every other config-management tradition has settled on: organization defaults at the bottom, per-environment overrides on top, per-root-module tweaks on top of that, all layered together. Helm values, Kustomize overlays, Ansible group_vars — different ecosystems, same DRY pattern. Terraform doesn't do it. Variable values get replaced, not deep-merged: hand it two .tfvars files that both set tags = {...} and the second one wins outright — the keys don't combine.

So the moment you want DRY, layered config — and you will — you have to encode the layering yourself, outside the language. .tfvars files. CLI -var flags. TF_VAR_* environment variables. JSON files generated at runtime. They all work, none of them compose, and most teams end up stitching two or three of them together with a task runner that picks the right -var-file order per call site.
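The replacement behavior is easy to demonstrate. A minimal example, with hypothetical file names:

# org.tfvars
tags = {
  Team       = "platform"
  CostCenter = "eng"
}

# prod.tfvars
tags = {
  Environment = "prod"
}

# terraform plan -var-file=org.tfvars -var-file=prod.tfvars
# resolves tags to { Environment = "prod" }: Team and CostCenter are gone,
# because the second map replaces the first wholesale.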

Common approach

The common approach is a mix of .tfvars files, TF_VAR_* exports, direnv rules, and Makefile/Justfile/Taskfile targets that wrap the right -var and -var-file flags onto the Terraform binary. But the problem with this is that it leads to configuration design leaking into the task runner — the layering you actually wanted ("org defaults → environment overrides → root-module tweaks") lives nowhere as data, only in the order arguments get assembled by whichever Make target you happened to invoke. You change a value in one place, three places later still hold the old one, the canonical path is unreconstructable, and a plan mysteriously diverges between laptop and CI.

7. Pick a tagging strategy

Most teams want a standard set of tags on every resource — Environment, Owner, CostCenter, Project, the rest. That set has to be defined once, applied in every root module, and kept in sync as it evolves. If it lives in tribal knowledge, half your fleet won't have it. If it lives in a shared module, you'd better make sure every root module imports it.

It's a small decision that compounds. Cost allocation, attribution, security audits, cleanup of orphaned resources — all of that gets harder fast when tagging isn't consistent.

In practice, we see teams spend a remarkable amount of platform-engineering time on just this — chasing tag-set drift across modules, writing custom validators, leaving the same PR-review comments over and over — and most fleets are still inconsistent at audit time.

Common approach

The common approach is a tags module imported by every root, copy-paste reminders in PR templates, and — at the deep end of the rabbit hole — a tool like Bridgecrew's yor that literally rewrites your Terraform code to inject tags. But the problem with this is that it leads to half the fleet having the current tag set and half having last year's, and copy-paste reminders catch maybe two-thirds of new modules. When a security company has to ship a code-generation tool just to keep tagging consistent, that's a signal the underlying problem isn't yours alone.
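On AWS specifically, the provider's default_tags block takes some of the sting out: declare the standard set once per provider configuration and every taggable resource under it inherits the tags. A minimal sketch:

# providers.tf: default_tags propagates to every taggable resource
provider "aws" {
  region = var.region

  default_tags {
    tags = {
      Environment = var.environment
      Owner       = "platform"
      CostCenter  = "eng"
      Project     = var.project
    }
  }
}

It only covers resources under that provider configuration, and only on AWS, so the shared-module and multi-cloud halves of the problem remain.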

Build

Now the things you actually write to make Terraform usable as a system, not just a CLI.

CI/CD isn't optional anymore. To ship infrastructure-as-code at the speed developers ship application code, every team running Terraform at meaningful scale needs PR-gated plan, automated apply, an audit trail, and parity between what got reviewed and what got applied. That's table stakes for operating Terraform in a team, not a phase-2 deliverable. Without it, changes back up behind whoever has the laptop with the right credentials, plans drift from reality, "who applied what?" becomes a Slack archaeology project, and the infrastructure side of every release turns into the bottleneck the rest of engineering waits on.

The rest of this section is what that machinery actually costs to build by hand.

8. Pick a task runner

Bringing infrastructure up from zero is its own choreography. Bootstrap the backend. Seed the org. Prime the IAM roles. In the right order. With the right credentials. That's not Terraform — that's the thing that runs Terraform.

So pick a task runner. make, just, go-task, plain shell. Pick one. Document it. The bar to clear is local reproducibility — the same target has to work on a developer's laptop and in CI, with the same arguments, the same toolchain, and the same outcome. This same runner is also the thing your team will reach for when they need to repeat the same orchestration for sibling pipelines — Packer for golden images, Helm chart releases, schema migrations, whatever else you bake alongside Terraform. The questions you answer here repeat over and over.

Common approach

The common approach is a Makefile that grew its own DSL. But the problem with this is that it leads to an internal language only the original author can read, brittle argument-passing, and broken behavior on whichever OS the new hire happens to use. A few targets in, the file is already doing what a programming language is for, badly:

# Makefile
STACK    ?= $(error STACK is required, e.g. make plan STACK=vpc-prod-ue2)
ROOT     := $(firstword $(subst -, ,$(STACK)))
ENV      := $(word 2,$(subst -, ,$(STACK)))
REGION   := $(word 3,$(subst -, ,$(STACK)))  # assumes exactly three hyphen-separated tokens
WORKDIR  := terraform/$(ROOT)
TFVARS   := -var-file=../../vars/org.tfvars \
            -var-file=../../vars/$(ENV).tfvars \
            -var-file=./$(REGION).tfvars
BACKEND  := -backend-config=../../backends/$(ENV).hcl
 
.PHONY: plan apply
plan apply: _check-creds _init
	cd $(WORKDIR) && terraform $@ $(TFVARS) $(if $(filter apply,$@),-auto-approve,)
 
_init:
	cd $(WORKDIR) && terraform init -reconfigure $(BACKEND) >/dev/null

_check-creds:
	@aws sts get-caller-identity >/dev/null 2>&1 || { echo "AWS credentials missing or expired"; exit 1; }

9. Reach for a templating tool

Pure HCL is enough until it isn't. The classic example: HashiCorp's Terraform doesn't allow variables in the backend block, and until recently didn't allow them in module.source either. The backend is evaluated before the core boots, so bucket = var.state_bucket is rejected. (OpenTofu 1.8 added early static evaluation that lifts this restriction for variables and locals in both backend and module-source contexts — but that's OpenTofu, not Terraform, and it doesn't reach data sources or runtime values.)

The moment you want the same root module to deploy to multiple regions or accounts, the backend changes per deployment, and you're left juggling -backend-config flags forever or templating the file. The same story plays out with provider configurations that vary per environment, and with monkey-patching third-party modules where you can't change the upstream.

So you pick a templating tool. You wire it into your task runner. You make sure CI runs it before terraform init. And you've got one more thing to maintain.

Common approach

The common approach is cookiecutter, envsubst, or a hand-rolled Jinja step in CI. But Terraform can't call any of them — something outside Terraform has to, which means the "native Terraform" workflow is already gone before terraform init ever runs. The pre-step becomes the actual interface to your stack, and dev and CI drift the moment they don't render exactly the same file the same way.

You author a template tree plus a cookiecutter.json that declares its prompts:

# cookiecutter.json
{
  "region": "us-east-1",
  "state_bucket": "acme-tf-state"
}
# template/{{cookiecutter.region}}/backend.tf
terraform {
  backend "s3" {
    bucket = "{{ cookiecutter.state_bucket }}"
    key    = "{{ cookiecutter.region }}/terraform.tfstate"
    region = "{{ cookiecutter.region }}"
  }
}

A developer scaffolds a stack interactively, answering the prompts:

cookiecutter ./template
# region [us-east-1]: us-west-2
# state_bucket [acme-tf-state]:

CI has to do the same thing non-interactively, with the answers passed as arguments:

cookiecutter ./template --no-input \
  region=us-west-2 \
  state_bucket=acme-tf-state

cookiecutter is a generator, not a renderer — it scaffolds a new directory once. Updating a stack you already scaffolded means hand-merging the new template output into the live tree, so drift between the template and the stacks you've already shipped is the default state.

10. Get remote root modules onto disk

Terraform has no story for materializing remote source code locally — and it's worse than just the root-module gap. Two things compound here. First, the directory you run terraform apply against has to be a local directory on disk; there's no module "x" { source = "git::..." } for the root. Second, even for the child modules Terraform can fetch, terraform init re-fetches them every time — there's no shared cache across runs or runners, no lockfile of source bytes, no way to say "I already have these, use them." Every plan pays the network round-trip.

Root modules vs. child modules

A root module is the directory you run terraform apply against. It owns the state file, and it's where Terraform actually executes. A root module calls child modules with module "x" { source = "..." } — and Terraform happily fetches those children from Git or a registry.

A root module cannot embed another root module. There is no module "x" { source = "git::..." } for the directory Terraform itself runs in — only for the children it calls. That gap is the entire reason step 10 exists.

That's invisible while you're inside one repo and one team. The moment you're not — once shared root modules need to live in a platform repo and be consumed by workload repos, or once you have enough runners that re-fetching child modules on every job adds up — Terraform leaves you to build the missing piece yourself.

So you pick a mechanism for getting third-party module source onto disk: vendoring (committing the source into your repo), Git submodules, git subtree, a fetcher script that runs before init, a package-manager-style tool. Vendoring has a nice property worth naming on its own — the source lives in your tree, so PR diffs show exactly what was deployed at any commit. Other mechanisms keep the source out of your tree and trade that auditability for a smaller repo. Neither is wrong; they're different tradeoffs.

Whatever you pick, you're now maintaining a dependency-management story alongside the rest of it.

Common approach

The common approach is whichever of these mechanisms a team grabbed first — Git submodules and git subtree are the two most common, with vendor-pull scripts and package-manager-style tools filling out the rest. Any of them works. But the problem with this is that there are now two different fetch behaviors in the system — child modules pulled by init on every run, and root modules pulled by whatever mechanism you chose on whatever cadence it runs — and they have to be kept in version-pinning sync with each other across every laptop and every runner. Drift between those two paths is the failure mode, not the choice of tool.

Add a remote root module to your platform repo:

git submodule add -b main \
  git@github.com:acme/tf-root-vpc.git \
  stacks/vpc

Submodules track commits, not refs, so pinning to a tag is a separate step:

cd stacks/vpc && git checkout v2.0.0 && cd ../..
git add stacks/vpc && git commit -m "Pin vpc to v2.0.0"

Every fresh clone and every CI runner has to re-init or the directory is empty:

git submodule update --init --recursive

Forget that step once and terraform init runs against an empty directory. The .gitmodules URL/branch and the actually-checked-out commit drift independently — reviewers see "Submodule changed" in a PR without seeing what changed.

11. Make CI git-aware

Once you're in a monorepo, you don't want every change to trigger every plan. A typo fix in a README shouldn't replan production. You need tooling that reads the diff, understands which root modules are affected by which files, and runs Terraform only on those. This is independent of Terraform itself, and Terraform doesn't ship it.

So you write a path-based CI matrix. Or a Bash script. Or you adopt a tool that understands stack dependencies. Either way, it's one more layer of CI you own.

Common approach

The common approach is tj-actions/changed-files — for years, the most popular pattern for path-based CI matrices. But the problem with this is that it leads to a critical CI primitive built on a third-party dependency you don't own. In March 2025, CVE-2025-30066 compromised exactly that Action across the ecosystem and exfiltrated CI secrets from thousands of repos before anyone noticed. The point isn't that this Action was uniquely careless; it's that the supply-chain surface was always there. The shape of what teams wire up — changed-files resolves the diff, an awk-and-jq shim turns the file list into a matrix, a fan-out job runs terraform plan per module — looks like this:

# .github/workflows/plan.yml
on: pull_request
jobs:
  detect:
    runs-on: ubuntu-latest
    outputs:
      stacks: ${{ steps.matrix.outputs.stacks }}
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with: { fetch-depth: 0 }
      - id: changed
        uses: tj-actions/changed-files@95690f9ece77c1740f4a55b7f1de9023ed6b1f87 # v46.0.5
        with: { files: terraform/** }
      - id: matrix
        run: |
          echo "stacks=$(echo '${{ steps.changed.outputs.all_changed_files }}' \
            | tr ' ' '\n' \
            | awk -F/ '{print $2}' | sort -u \
            | jq -R -s -c 'split("\n") | map(select(length>0))')" >> "$GITHUB_OUTPUT"
 
  plan:
    needs: detect
    if: needs.detect.outputs.stacks != '[]'
    strategy:
      matrix: { stack: ${{ fromJSON(needs.detect.outputs.stacks) }} }
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - run: make plan STACK=${{ matrix.stack }}-prod-ue2

12. Decompose your monolithic root module

Eventually, one root module isn't enough. There's always a reason — governance, security blast radius, scale, performance, parallelism, team ownership. We wrote about this in Service-Oriented Terraform. Decomposition isn't really a Terraform decision; it's an organizational one.

The moment you decompose, you've opened a fresh category of problems. How do roots pass values to each other? Remote state lookups, SSM parameters, a service catalog? Where does each piece's state live, and who can read it? When workload teams own their own repos, how do they reuse modules and reach into shared infrastructure consistently?

These are real architectural questions, and Terraform leaves them entirely to you.
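The most common wiring is a remote-state lookup. A sketch, assuming the bucket layout from step 5 and a vpc root that publishes private_subnet_ids as an output:

# app root module reads outputs published by the vpc root
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "my-tfstate-prod"
    key    = "vpc/prod/us-east-2.tfstate"
    region = "us-east-2"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.small"
  subnet_id     = data.terraform_remote_state.vpc.outputs.private_subnet_ids[0]
}

Every consumer couples itself to the producer's state layout and needs read access to its bucket, which is exactly the access-scoping question above.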

Common approach

The common approach, once decomposition starts to bite, is terragrunt with dependency blocks pointing at remote-state buckets across repos. The brave try Makefile target dependencies, or even Bazel, to express the DAG of root-module dependencies. Terragrunt deserves credit here — it pioneered durable patterns for multi-account organization, dependency wiring, and per-environment config inheritance, and a lot of teams are still on it for good reason. But the problem with this is that it only solves one slice of the puzzle. You still own CI integration, plan rendering, drift detection, PR comments, supply-chain pinning, OIDC, and the rest of the steps in this post — and now you're learning, versioning, and debugging a second DSL on top of Terraform to get there. Bazel is far more machinery than the problem requires.
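For the same vpc-to-app handoff, the terragrunt shape looks like this (paths hypothetical):

# app/terragrunt.hcl
dependency "vpc" {
  config_path = "../vpc"
}

inputs = {
  subnet_ids = dependency.vpc.outputs.private_subnet_ids
}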

13. Make running Terraform discoverable

Take stock of what a developer now has to know to run a single terraform plan. The right binary version, on the path. Cloud credentials, refreshed. Registry credentials, handed off. The right working directory — which root module, in which folder, in which repo, especially after decomposition. The right -var-file flags in the right order, because the config layering from step 6 only resolves correctly if the task runner assembles them correctly. The right -backend-config, if the backend is templated. And whichever workspace selection the layout demands.

That's a checklist a new hire shouldn't have to memorize and an experienced engineer shouldn't have to reconstruct for an unfamiliar root module. Terraform doesn't ship a "what should I run here?" — so the command, the folder, the flags, and the prerequisites all live in someone's head, in a Makefile target nobody remembers naming, or in a Slack thread from six months ago.

Common approach

Teams settle in one of two places, and both move the incantation rather than removing it. One is a Makefile or Justfile that grows a target per root module (make plan-vpc-prod, make plan-eks-dev, …). The wall of targets becomes the new discoverability problem: an engineer asking "what do I run for vpc in prod-uw2?" either finds the matching target or doesn't, and when they don't, they're back to reading the wrapper to figure out what to type. The other is a thinner wrapper that takes a stack name as an argument — which works, but only after the developer already knows the stack name, the -var-file order, and which -backend-config to pass. Either way, the canonical invocation lives in shell history, not in the system, and "what command did you actually run?" is the first question in every incident thread.

Same "plan the VPC in prod, us-east-2," three different places teams park the incantation. None of them encode the layout you decided on in step 1; they just relocate the long argument list:

cd terraform/vpc
terraform init -reconfigure \
  -backend-config=../../backends/prod.hcl \
  -backend-config="key=vpc/prod/us-east-2.tfstate"
terraform workspace select prod-ue2 || terraform workspace new prod-ue2
terraform plan \
  -var-file=../../vars/org.tfvars \
  -var-file=../../vars/prod.tfvars \
  -var-file=./prod-ue2.tfvars \
  -var region=us-east-2 \
  -var environment=prod \
  -out=plan.out

14. Render a readable CI job summary

Raw terraform plan output is not friendly. Inside a GitHub Actions job UI, it's a wall of green and red plus signs. You want a clean, scannable summary at the top of the job — what's changing, where, and how much. That's a tool, an action, or a script. Pick one.

Common approach

The common approach is tfcmt — wrap terraform plan and write the formatted result to $GITHUB_STEP_SUMMARY so it lands at the top of the job UI:

# .github/workflows/plan.yml (excerpt)
- name: Install tfcmt
  run: |
    curl -fsSL https://github.com/suzuki-shunsuke/tfcmt/releases/download/v4.14.5/tfcmt_linux_amd64.tar.gz \
      | tar -xz -C /usr/local/bin tfcmt
 
- name: Plan + summary
  run: |
    tfcmt --output "$GITHUB_STEP_SUMMARY" plan -patch -- \
      terraform plan -no-color -out=plan.out

But the problem with this is that it leads to one more pinned dependency — a binary you curl-install at a fixed release, with its own maintainer, its own release cadence, and its own seat on your supply-chain surface.

15. Post a plan summary as a PR comment

Reviewers shouldn't have to click into the job to see what's changing. A PR comment with the plan summary is now table stakes. You'll need an action that posts it, updates it on subsequent pushes (rather than spamming a new one), and survives force-pushes without leaving stale comments behind. That action either exists, or you write one, or you live with the spam.

Common approach

The common approach is to take the markdown summary from step 14 and post it through peter-evans/create-or-update-comment, paired with peter-evans/find-comment to find the existing sticky and update it in place instead of spamming a new one on every push:

# .github/workflows/plan.yml (excerpt — continues from step 14)
- name: Build PR comment body
  run: |
    {
      echo '<!-- terraform-plan:vpc -->'
      echo '## Terraform plan: vpc'
      echo
      echo '```diff'
      terraform show -no-color plan.out
      echo '```'
    } > plan.md
 
- name: Find existing plan comment
  id: find
  uses: peter-evans/find-comment@b30e6a3c0ed37e7c023ccd3f1db5c6c0b0c23aad # v4.0.0
  with:
    issue-number: ${{ github.event.pull_request.number }}
    comment-author: github-actions[bot]
    body-includes: "<!-- terraform-plan:vpc -->"
 
- name: Post or update plan comment
  uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5.0.0
  with:
    issue-number: ${{ github.event.pull_request.number }}
    comment-id: ${{ steps.find.outputs.comment-id }}
    edit-mode: replace
    body-path: plan.md

But the problem with this is that you now own a small protocol — the HTML-comment marker <!-- terraform-plan:vpc --> is the only thing that lets find-comment re-locate the sticky. One marker per stack and per environment, or comments collide and overwrite each other. GitHub also caps a single comment at 65,536 characters; large estates blow past that and the workflow has to truncate or split. And you've added two more pinned dependencies — find-comment and create-or-update-comment — to the same supply-chain surface step 11 already warned about.

16. Wire preview environments to the Deployments API

If you spin up preview environments per pull request, you need a place to surface the URL. GitHub's Deployments API is the right surface — it gives you a clean status indicator on the PR and a deployments tab on the repo. Pick a tool that posts there. Make sure it cleans up the deployment when the PR closes, or you'll have a graveyard of stale "active" environments before long.

Common approach

The common approach is bobheadxi/deployments or chrnorm/deployment-action — both unofficial, both written in TypeScript, both adding to the supply-chain surface. But the problem is that almost nobody owns the GitHub App side of preview-environment plumbing in-house, so the question quietly becomes which third-party Action you bet on and how you handle it when the maintainer disappears.
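The shape of the wiring with bobheadxi/deployments, where the preview URL convention is an assumption:

# .github/workflows/preview.yml (excerpt)
- uses: bobheadxi/deployments@648679e8e4915b27893bd7dbc35cb504dc915bc8 # v1.5.0
  id: deployment
  with:
    step: start
    token: ${{ secrets.GITHUB_TOKEN }}
    env: pr-${{ github.event.number }}

- run: make apply STACK=preview-pr-${{ github.event.number }}

- uses: bobheadxi/deployments@648679e8e4915b27893bd7dbc35cb504dc915bc8 # v1.5.0
  if: always()
  with:
    step: finish
    token: ${{ secrets.GITHUB_TOKEN }}
    status: ${{ job.status }}
    deployment_id: ${{ steps.deployment.outputs.deployment_id }}
    env_url: https://pr-${{ github.event.number }}.preview.example.com

A second workflow on pull_request closed then has to run step: deactivate-env, or the stale-"active" graveyard accumulates on its own.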

17. Pipe Terraform outputs to downstream steps

If subsequent CI steps consume Terraform outputs — uploading assets to a freshly-created bucket, triggering a deployment to a freshly-created cluster — you need to translate those outputs into GitHub-style environment variables or step outputs. Pick a tool. Test it against complex output types. Keep it working when Terraform's output format shifts between versions.

Common approach

The common approach is terraform output -json piped through jq and echoed into $GITHUB_OUTPUT:

# Scalar string output — works fine
echo "bucket_name=$(terraform output -raw bucket_name)" >> "$GITHUB_OUTPUT"
 
# Nested object output — has to use the delimited multiline form
delim="EOF_VPC_$(uuidgen)"   # same delimiter must open and close the block
{
  echo "vpc_config<<${delim}"
  terraform output -json vpc_config | jq -c .
  echo "${delim}"
} >> "$GITHUB_OUTPUT"

But the problem with this is two things stacked. First, complex output types — nested objects, lists of objects, anything that isn't a scalar string — need a custom jq expression per output, and that expression lands as a copy-pasted shell snippet in every workflow that consumes the value, drifting the moment one of them gets edited. Second, $GITHUB_OUTPUT uses a delimited multiline format (name<<DELIM … value … DELIM) for any value containing newlines or JSON. Producing that correctly out of jq for a nested object means picking a delimiter that can't appear in the value, getting the quoting right, and handling the case where jq emits multiple lines. Most teams get this wrong on the first nested output, see truncated values surface in downstream steps, and end up with a per-workflow shell snippet that's never quite the same in two places.

A short aside on those last four. Each one is a small piece of CI ergonomics, and each typically gets solved by reaching for a third-party GitHub Action — most written in TypeScript with their own transient node_modules dependency tree. Browse a popular collection like dflook/terraform-github-actions and count: terraform-fmt, terraform-validate, terraform-plan, terraform-apply, terraform-output, terraform-version, terraform-new-workspace, terraform-destroy-workspace, and on. A team running real CI ends up pinning a dozen or more, each one expanding the supply-chain surface area of your infrastructure pipeline. The compromise of tj-actions/changed-files in March 2025 made the cost of that surface area concrete: a single popular Action was modified to exfiltrate CI secrets across thousands of repos before anyone noticed. The point isn't that GitHub Actions are dangerous. It's that "I'll just grab an Action for it" stops being a free decision somewhere around the fifth one.

The end result

By the time a team's CI is doing all of those things — change detection, OIDC, plan, sticky comment, deployment, output piping — a single deploy job stitches together eight separate maintainers' Actions, each pinned to its own SHA:

# .github/workflows/apply.yml (excerpt)
jobs:
  apply:
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read, pull-requests: write, deployments: write }
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - uses: hashicorp/setup-terraform@b9cd54a3c349d3f38e8881555d616ced269862dd # v3.1.2
      - uses: aws-actions/configure-aws-credentials@b47578312673ae6fa5b5096b330d9fbac3d116df # v4.2.1
        with: { role-to-assume: arn:aws:iam::123456789012:role/CIRole, aws-region: us-east-2 }
      - uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
      - uses: bobheadxi/deployments@648679e8e4915b27893bd7dbc35cb504dc915bc8 # v1.5.0
        id: deployment
        with: { step: start, env: prod }
      - uses: dflook/terraform-plan@5fc11949b8db4d3f3e75bea2e2cc1f6d11afcc8a # v2.5.0
        id: plan
        with: { path: terraform/vpc, var_file: vars/prod.tfvars }
      - uses: suzuki-shunsuke/tfcmt-action@b1f9f7a0b5b8b2dbcd0fce2e0a6b3c0d8a3f1c2e # v1.2.0
        with: { config: .tfcmt.yaml }
      - uses: dflook/terraform-apply@5fc11949b8db4d3f3e75bea2e2cc1f6d11afcc8a # v2.5.0
        with: { path: terraform/vpc, var_file: vars/prod.tfvars, auto_approve: true }
      - uses: dflook/terraform-output@5fc11949b8db4d3f3e75bea2e2cc1f6d11afcc8a # v2.5.0
        id: tf-output
        with: { path: terraform/vpc }
      - uses: bobheadxi/deployments@648679e8e4915b27893bd7dbc35cb504dc915bc8 # v1.5.0
        if: always()
        with: { step: finish, status: ${{ job.status }}, deployment_id: ${{ steps.deployment.outputs.deployment_id }}, env_url: ${{ steps.tf-output.outputs.app_url }} }

Operate

The system is built. Now you have to live with it.

18. Inventory what you've got

Once you have dozens of root modules across a handful of environments, just seeing what's there becomes a job. Which root modules are deployed where? What does the merged configuration look like for prod-us-west-2? Which stacks reference which root modules, with what overrides?

You'll want a CLI that can list root modules, list stacks, describe the composed config for any one of them, and answer those questions without grepping through directories. If you don't have one, your developers will start writing little scripts that do half of it. Then those scripts will diverge.

Common approach

The common approach is some flavor of directory walk taped to the wiki as the "how to see what we have" snippet:

# What root modules do we have?
find terraform -maxdepth 1 -mindepth 1 -type d
 
# Which stacks reference each one?
grep -rl "terraform/vpc" stacks/

But the problem with this is that it relies on the directory tree being the source of truth — which reads back layout, not configuration, so it can't answer "what does the merged config look like for prod-us-west-2" or "which stacks override what." Every developer who needs that ends up writing their own little script that does half of it, and those scripts diverge until nobody knows which one is right. And if you've already bought a SaaS runner like Terraform Cloud or Spacelift, its workspace list answers "what's deployed where" — but not "what's the composed config for this one"; that question is still on you.

19. Encode operator playbooks for everyone else

The people who use what you've provisioned aren't all running terraform plan. The Kubernetes admin needs to refresh a kubeconfig and kubectl exec into a pod. The app developer needs to log into the registry and push an image. The on-call needs to roll a credential at 3 a.m. You'll write down the commands they run — the playbooks — and that's a piece of the platform too.

Where do those playbooks live, though? Makefiles are convenient until you need to pass arguments cleanly, which Make doesn't do well. Justfiles handle arguments better. go-task is solid. Plain shell scripts only behave consistently across Mac, Linux, and Windows if you're disciplined about POSIX — and the moment somebody on Windows joins the team, that discipline breaks. And there we have it: whichever you pick, it's one more tool to add to the stack — one more binary to install on every laptop and every runner, one more set of conventions to teach, one more thing to keep current.

So you pick something. You commit to it. You make sure it works on every laptop your team carries. And you keep adding to it as the system grows, because every new piece of infrastructure ships with a new playbook for whoever consumes it.

Common approach

The common approach is a Makefile with thirty targets and a README section called "Common Tasks." But the problem with this is that it leads to argument-passing Make doesn't do well, broken behavior on Windows, and a README section that lags the targets every time a new playbook is added. And the kicker: there isn't one Make. BSD make ships on macOS, GNU make ships on most Linux distros and CI runners, and they diverge on everything past the basics — conditionals, includes, $(shell), $(call), pattern rules, .PHONY semantics. Either you write to the GNU subset and document brew install make (then call it as gmake), or you write to a portable subset that gives up half of what made you reach for Make in the first place. A representative slice — note the %: ; @: no-op at the bottom, wired in to eat positional arguments because Make doesn't actually support them:

# Makefile (excerpt — actual file is ~30 targets)
# NOTE: requires GNU make; BSD make on macOS will fail on the conditionals below.
# `brew install make` and invoke as `gmake`, or document the divergence in onboarding.
 
.PHONY: login kubeconfig ecr-login seed-fixtures rotate-secret port-forward exec logs
 
ENV  ?= dev
NS   ?= default
POD  ?= $(error POD is required, e.g. make exec POD=api-7d4b8)
NAME ?= $(error NAME is required, e.g. make rotate-secret NAME=db-password)
 
login:        ; aws sso login --profile $(ENV)-admin
kubeconfig:   ; aws eks update-kubeconfig --name $(ENV)-cluster --alias $(ENV) --profile $(ENV)-admin
ecr-login:    ; aws ecr get-login-password --region us-east-2 --profile $(ENV)-admin \
                | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-2.amazonaws.com
seed-fixtures: login ; ./scripts/seed-fixtures.sh $(ENV) $(filter-out $@,$(MAKECMDGOALS))
rotate-secret: login ; aws secretsmanager rotate-secret --secret-id app/$(ENV)/$(NAME) --profile $(ENV)-admin
port-forward: kubeconfig ; kubectl -n $(NS) port-forward $(POD) 8080:8080
exec:         kubeconfig ; kubectl -n $(NS) exec -it $(POD) -- /bin/sh
logs:         kubeconfig ; kubectl -n $(NS) logs -f $(POD)
 
# Eats positional arguments to `make` so `make seed-fixtures my-branch` doesn't error.
%: ; @:
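For contrast, the same playbooks in a Justfile. just takes positional arguments natively, which is most of what the Make version is fighting; a sketch:

# Justfile: same playbooks, real positional arguments
env := env_var_or_default("ENV", "dev")

kubeconfig:
    aws eks update-kubeconfig --name {{env}}-cluster --alias {{env}} --profile {{env}}-admin

exec pod ns="default": kubeconfig
    kubectl -n {{ns}} exec -it {{pod}} -- /bin/sh

logs pod ns="default": kubeconfig
    kubectl -n {{ns}} logs -f {{pod}}

Now just exec api-7d4b8 reads the way the Make version never will. The tradeoff from the paragraph above still stands: just is one more binary on every laptop and every runner.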

20. Document the whole thing end-to-end

Twenty steps in. A new hire should be able to read the docs and ship safely. That means everything above — layout, install, auth, registries, state, runner, config, tags, templating, module sourcing, decomposition, inventory, playbooks, CI — has to be written down somewhere current. And it has to stay current, because every one of those moving parts will get touched in the next year.

That's the system-level half. There's a second half underneath it: per-root-module reference docs. Every root module has a set of variables, outputs, providers, and resources, and the moment somebody edits the HCL, the README that lists them is out of date. Hand-writing that table is a dead end — the answer the community settled on years ago is generation. Read the HCL, render a Markdown table, commit the result. The de facto tool is terraform-docs, and it's the right answer to the problem it solves. It's just one more thing to add to the stack — another binary to install on every laptop and every runner, another pinned version, another pre-commit hook or CI check to keep wired in — alongside everything else this post has already accumulated.

So you've got two documentation pipelines: the hand-written architecture docs at the system level, and the generated reference docs per root module. They don't share a substrate, neither knows about your stack configuration, and both have to live.

Common approach

The common approach is a docs/ folder that hasn't been touched since the last hire, plus terraform-docs wired into a pre-commit hook to keep each module's input/output table honest. terraform-docs is the right tool for that table — it just only solves the reference half, and it's another tool you install, version, and keep wired in. Meanwhile the architecture, runbook, and onboarding docs in docs/ rot at their usual pace, and a new hire's first week becomes "figure out what's actually true" instead of "ship something." You end up maintaining two unrelated documentation pipelines, neither of them aware of your stack configuration.
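The hook wiring itself is small. A sketch using the official terraform-docs hook, with the module path an assumption:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/terraform-docs/terraform-docs
    rev: v0.18.0
    hooks:
      - id: terraform-docs-go
        args: ["markdown", "table", "--output-file", "README.md", "./terraform/vpc"]

Note the hard-coded module path: it's one hook entry per root module, which is one more list that has to track the layout you chose in step 1.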

21. Detect drift and reconcile it

Once everything above is in place, there's still one last thing.

Reality and your state file diverge. It's not an if. Providers change behavior between versions, and a re-apply you didn't run silently changes the truth. Click-ops happens during incidents — somebody fixes prod through the console at 2 a.m., and your Terraform code doesn't know. And, occasionally, you've been breached and you don't know it — an attacker has changed something live and your code is the last place that'll catch it.

Detection is the easier half. You schedule a periodic plan. Diff the result. Pipe it to a Slack channel that nobody reads anymore. Terraform doesn't do any of that for you, so you build it.

Reconciliation is the harder half — the part nobody wants to design upfront. What does drift resolve to? Some drifts get auto-applied back to source. Some get a ticket. Some block deploys until a human reviews them. Some need a person to decide whether the source or the live resource is right. None of that is a terraform plan flag; it's a workflow you build, with approval gates, audit trails, and paging policies attached.

This is the capstone — the step that makes everything before it actually keep working over time. Without it, drift accumulates silently, and your codebase slowly turns into a fossil of how things used to be configured.

Common approach

If you're just running Terraform, the common approach is a scheduled GitHub Action that runs terraform plan -detailed-exitcode per root module on a cron, posts the diff to a Slack channel, and opens a PR or ticket on a non-zero exit code. That works on day one. The failure mode it ages into is that there's no policy distinguishing drift that should auto-reconcile from drift that should page someone — so every drift gets the same treatment, which after a quarter is "ignored." The Slack channel is read by nobody, the cron is owned by whoever set it up, and driftctl (the tool most teams reached for to enrich the diff) has been in maintenance mode since 2023.
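A sketch of that cron, assuming the layout from step 13 and eliding the credential exchange from step 3:

# .github/workflows/drift.yml
on:
  schedule:
    - cron: "0 6 * * *"
jobs:
  drift:
    runs-on: ubuntu-latest
    permissions: { contents: read, issues: write }
    strategy:
      matrix: { stack: [vpc, eks, rds] }  # hypothetical stack list
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - run: |
          cd terraform/${{ matrix.stack }}
          terraform init -backend-config=../../backends/prod.hcl
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = drift
          code=0
          terraform plan -detailed-exitcode -no-color -out=plan.out || code=$?
          if [ "$code" -eq 2 ]; then
            terraform show -no-color plan.out > drift.txt
            gh issue create --title "Drift: ${{ matrix.stack }}" --body-file drift.txt
          fi
          test "$code" -ne 1
        env:
          GH_TOKEN: ${{ github.token }}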

Either way, the reconciliation-policy layer — auto-revert vs. ticket vs. block-the-deploy vs. page on-call, plus who's allowed to override any of those at 2 a.m. — is additional tooling on top of terraform plan. You build it yourself, or you pay for a SaaS that does it for you (Terraform Cloud, Spacelift, env0, Scalr, others). The choice is real, but the layer doesn't disappear.

And per-stack drift is only the bottom layer. The view a team running infrastructure at scale actually wants is fleet-wide: which stacks are drifting right now, which have been perma-drifting for weeks (the drift nobody plans to fix because reverting would break something else), which workflow runs are failing and at what rate, and — when something does regress — what change to that stack landed most recently. That's change-failure-rate-and-MTTR for infrastructure: the DORA layer, applied to the platform. Terraform doesn't get you anywhere near it. Building it the Hard Way is OpenTelemetry from GitHub Actions into Prometheus, Grafana, or Datadog, with stack-and-root-module labels you keep clean across hundreds of workflow runs and hand-built dashboards somebody has to own. It's possible. It's also a platform engineer for a quarter, and then a permanent owner.

One Tool, or Four

Look back at the list. At every one of those crossroads, you have the same four choices.

  • A glue script that someone wrote in a hurry and nobody owns six months later.
  • A handful of third-party GitHub Actions wired into your pipeline — some abandoned, all of them direct exposure to supply-chain compromise. npm and the GitHub Actions marketplace have historically been the most common path for these attacks; that's where the ecosystem's hits have landed, regardless of how careful any individual maintainer is.
  • Something you build in-house — and now have to staff.

Or — fourth option — pick one tool that handles the whole thing as a coherent set: install, auth, state, monorepo awareness, decomposition, templating, module sourcing, tagging, CI ergonomics, inventory, playbooks, drift detection, and reconciliation, with the same conventions all the way through.

That tool is Atmos. It's the framework we built, and it's the one we use ourselves. It's not the only valid choice — there are others worth considering, and a team that already has its own framework that works should keep using it. But pretending the choice doesn't exist is what gets a team three years into running Terraform with twenty internal scripts, a handful of abandoned Actions, a broken bootstrap, and nobody left who remembers how the Friday-night cold-start ritual works.

The companion to this post — Terraform the Easy Way — walks the same crossroads with concrete Atmos snippets at each one, so you can see what each decision looks like once a framework has made it for you.

If you're somewhere in this list and want a second set of eyes on what to keep, what to consolidate, and where a framework would buy you the most leverage right now, let's talk.
