Terraform the Hard Way

By Erik Osterman, CEO & Founder of Cloud Posse
May 08 2026

Kelsey Hightower wrote Kubernetes the Hard Way almost a decade ago, and he was clear from the first sentence about what it wasn't. It wasn't a deployment guide. It wasn't a recommendation. The whole point was to walk you through standing up a cluster yourself, by hand, so you'd see what the abstractions normally hide — and then go pick a managed offering with a much better understanding of what you were running.

Here's the Terraform equivalent. Not how I'd recommend running Terraform. What it actually takes.

Kelsey's piece is meant to be run, command by command. This one shows plenty of commands too — but as illustrations of what teams stitch together, not steps to follow verbatim. Same spirit, one level up.

A note up front. If you're new to infrastructure as code and you're staring at this list thinking "all of this for terraform apply?" — that's fair. For a hello-world, most of it is overkill. This isn't a list for hello-world. It's the list of decisions a team makes on the way to production-grade, maintainable Terraform that holds up across years, environments, teams, and regions. If you're already there, none of this will be a surprise. If you're not, this is the road.

To keep the list legible, I've grouped it into three phases: things you design before you write much code, things you build to make it run, and things you operate to keep it running. The phases overlap in practice — every "design" decision gets revisited the first time it survives contact with reality — but they're a useful way to read.

Design

Decisions that shape everything that comes after. Easier to make once, deliberately, than to migrate later.

1
Decide your repo layout

This is the first decision and the easiest to get wrong. One repo or many. If one, how do infrastructure changes coordinate with application changes — same PR, separate PRs, gated by approval? If many, how do they share modules, state, and conventions? Folder structure inside each repo. Naming. Where stacks live, where root modules live, where shared code lives.

Everything a framework would encode for you, you'll encode by hand in conventions, READMEs, and tribal knowledge. Either way, it's a decision — not a discovery — and the longer you wait to make it deliberately, the more migration work you've signed up for later.

Common approach

There isn't one common approach — there are four, each with a distinct failure mode.

The most common starting point is environment-first: top-level folders are environments, and root modules nest inside.

.
├── prod/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── backend.tf
│   │   └── terraform.tfvars
│   ├── eks/
│   └── rds/
├── staging/
│   ├── vpc/
│   └── eks/
├── dev/
│   ├── vpc/
│   └── eks/
└── modules/                # shared child modules
    └── networking/

The failure mode: the layering between environment and root module is ad hoc, every new piece of infrastructure makes its own placement choice, and the first multi-region or second-account need breaks the convention.

All four work. Each one defers the same set of questions — where centralized logs go, where DNS zones get managed, how multi-region deployments work, why you want more AWS accounts than you think — to a README that gets written later. The cost of deferring shows up as migration work the first time the layout has to change, usually with cookiecutter scripts and an INFRASTRUCTURE.md papering over the gaps in the meantime.

There's a second dimension on top of this. Some teams take it the other way and split each root module into its own repository — one repo for VPC, one for EKS, one for RDS. The four layouts above still apply inside each repo, at smaller scale. The new problem is across the seam: keeping shared modules, conventions, and toolchain in sync across the fleet of repos, and managing version pinning between root modules whose outputs feed each other.

2
Pick how you'll install your toolchain

You don't just need Terraform or OpenTofu. Most teams also rely on the cloud CLI for whichever cloud they're on — aws, gcloud, or az — to bootstrap accounts, exchange credentials, and reach what doesn't live in IaC. If you're running Kubernetes, you'll likely also want kubectl and helm. There's a long tail of utilities too: jq for parsing JSON in your wrappers, curl for grabbing remote state or hitting webhooks. Pin the IaC binary alone and the rest drift; the next plan looks fine on your laptop and breaks in CI because somebody's kubectl is two minor versions ahead.

Pick one install method. Pin every version. Now do it again per environment, because production probably can't move at the same pace as dev. Now figure out how to promote a version through environments, and how to communicate the change so nobody runs the wrong binary against the wrong state.

Your team uses Mac, Linux, and Windows — or they will, eventually. You don't know who you'll hire. CI uses something else again. The install method has to work on all of those and produce identical behavior. And on top of that, some root modules will deliberately stay behind on older versions of the IaC binary because the upgrade refactor isn't worth the cost — even as the rest of the toolchain moves forward. So your version-pinning story isn't one number; it's a graph.

The same need to reproduce a toolchain across every laptop and every runner is one of the points I made in Build Your IDP Last. It applies here too.

Common approach

The common approach is to pick a version manager and commit a manifest file. But every option covers a different slice of the toolchain, so most teams end up combining two or three:

# .tool-versions
terraform 1.9.8
kubectl   1.31.2
helm      3.16.2
jq        1.7.1

asdf is plugin-based and language-aware, the de facto choice for polyglot teams. It covers Terraform, kubectl, and helm cleanly. jq works through a community plugin whose maintenance comes and goes. curl isn't pinnable; it's whatever the OS ships. Plugin behavior diverges across Mac, Linux, and Windows (asdf doesn't run on Windows at all without WSL), and CI runners typically don't bootstrap asdf, so you bolt on the hashicorp/setup-terraform and azure/setup-helm actions and now have two parallel install paths to keep in sync.

The common thread: each option covers a different slice of the toolchain, none of them cover all of it cleanly across laptop + CI + every OS your team uses, and the moment one developer joins on a platform the chosen tool doesn't support, the version-pinning story breaks down and somebody has to merge their way out.
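
To make "pin every version" concrete, here's a minimal sketch of a drift check against a manifest like the one above. The function names pinned_version and check_pin are made up for illustration, and the sketch assumes an asdf-style .tool-versions file with one "tool version" pair per line.

```shell
#!/usr/bin/env bash
# Sketch only: pinned_version and check_pin are hypothetical helper names.

# pinned_version <tool> [manifest]
# Prints the version pinned for <tool>; exits non-zero if the tool is not listed.
pinned_version() {
  local tool="$1" manifest="${2:-.tool-versions}"
  awk -v t="$tool" '$1 == t { print $2; found = 1 } END { exit !found }' "$manifest"
}

# check_pin <tool> <actual-version> [manifest]
# Returns 0 when the versions match, 1 on drift, 2 when the tool is unpinned.
# The caller supplies the actual version, because every CLI reports it differently.
check_pin() {
  local tool="$1" actual="$2" manifest="${3:-.tool-versions}" want
  if ! want="$(pinned_version "$tool" "$manifest")"; then
    echo "$tool: not pinned in $manifest" >&2
    return 2
  fi
  if [ "$actual" != "$want" ]; then
    echo "$tool: have $actual, want $want" >&2
    return 1
  fi
}
```

One way to call it is check_pin terraform "$(terraform version -json | jq -r .terraform_version)", run as the first target of the task runner on both laptop and CI. It only detects drift; it doesn't install anything, so the install-method question stays open.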

3
Authenticate to your cloud

SSO for humans, ideally with short-lived role assumption. IAM users where you can't avoid them. In automation, OIDC tokens with subject-claim trust policies, exchanged for cloud credentials at the start of every run. That exchange happens outside Terraform, because Terraform is downstream of having credentials — so you encode it somewhere your runner can do reliably, somewhere your developers can do locally, and ideally those two paths look the same.

This is one of those things that looks small until you have ten repos, three clouds, and a contractor who needs read-only access to two of them.

Common approach

The common approach is a patchwork: saml2aws, aws-vault, granted/assume, the AWS Extend Switch Roles Chrome extension on the laptop side, plus aws-actions/configure-aws-credentials in CI — each tool covering a piece of the path. Laptop and runner end up with two flows that have to be kept in sync by hand. Whatever shape you settle on for AWS, the same shape gets repeated for GCP and Azure with different tools and different conventions.

There's also an artifact you won't find in any of those tools: a ~/.aws/config file on every developer's machine, populated with the right SSO start URL, role ARNs, regions, and profile names per account. That file isn't in your repo, so no PR keeps it honest. Teams either ship a shell script that generates it on first run, or maintain an internal wiki page with the canonical snippet for new hires to copy-paste — and both go out of date the first time an account is added or a role is renamed. You find out which developers are stale the next time someone says their plan looks wrong.
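
The generator-script variant usually looks something like the sketch below. It assumes IAM Identity Center profiles using the sso-session format, and it reads a checked-in account list so the PR history at least covers the inputs. The function name, the list format, and every value in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: render_aws_config is a hypothetical helper name.

# render_aws_config <session-name> <sso-start-url> <sso-region>
# Reads "<profile> <account-id> <role-name> <region>" lines on stdin and prints
# an AWS config file to stdout. Blank lines and lines starting with # are skipped.
render_aws_config() {
  local session="$1" start_url="$2" sso_region="$3"
  local profile account role region
  printf '[sso-session %s]\nsso_start_url = %s\nsso_region = %s\n' \
    "$session" "$start_url" "$sso_region"
  while read -r profile account role region || [ -n "$profile" ]; do
    [ -n "$profile" ] || continue
    case "$profile" in '#'*) continue ;; esac
    printf '\n[profile %s]\n' "$profile"
    printf 'sso_session = %s\n' "$session"
    printf 'sso_account_id = %s\n' "$account"
    printf 'sso_role_name = %s\n' "$role"
    printf 'region = %s\n' "$region"
  done
}
```

Usage would be along the lines of render_aws_config acme https://example.awsapps.com/start us-east-1 < accounts.txt > ~/.aws/config. Note that this overwrites the file, so anything a developer added by hand is gone, which is one reason these scripts tend to rot.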

4
Hand auth off to your downstream tools

Cloud credentials are the first hop. Most teams need more. Container registry credentials for docker push and docker pull against ECR, GHCR, or a third-party registry. A fresh kubeconfig for EKS, GKE, or AKS. Maybe Helm chart repos. Maybe a private package registry.

None of those come for free. Each requires a CLI call (aws ecr get-login-password for ECR, gh auth token | docker login ghcr.io ... for GHCR, aws eks update-kubeconfig for EKS, the GCP and Azure equivalents) or a purpose-built helper that exchanges your IAM credentials for short-lived tokens. You wire those into the same flow that runs Terraform — locally and in CI — or your developers and your pipelines spend their day chasing 401s.

Common approach

The common approach is to stick a target in the task runner (make login, just login, an npm script) that shells out to the CLI calls each downstream tool needs. The wrappers don't address token expiry: the token is still valid when the job starts, expires mid-run, and surfaces as a 401 halfway through a docker push or kubectl apply. And every downstream service has its own bespoke incantation, which laptop and CI each perform a different way:

# Laptop
aws ecr get-login-password --region us-east-2 \
  | docker login --username AWS --password-stdin \
      123456789012.dkr.ecr.us-east-2.amazonaws.com
# CI
- uses: aws-actions/amazon-ecr-login@v2

The token is good for ~12 hours, then docker push returns a 401 and the job dies in the middle of an image upload. Multi-account or multi-region pipelines need the call repeated with a different --region and registry URL per account, so the "one-line login" multiplies into a matrix the wrapper has to track. And the laptop incantation and the marketplace action are two different code paths reaching for the same credential — when one breaks in CI, you can't reproduce it locally without diverging again.

On top of that, the wrapper assumes everyone has those CLIs installed at the same versions and that each tool behaves the same way on Windows, Linux, and Mac. Add that to the split between marketplace actions in CI and the raw CLI on the laptop, and reproducing a CI failure locally means fighting both kinds of divergence at once.
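
Here's a rough sketch of what the multi-account, multi-region matrix turns into once the wrapper has to track it. The function names and the DRY_RUN switch are made up for illustration, and the hostname pattern assumes the standard AWS partition (it matches the registry URL in the laptop example above).

```shell
#!/usr/bin/env bash
# Sketch only: ecr_registry and ecr_login_all are hypothetical helper names.

# ecr_registry <account-id> <region>
# Prints the ECR registry hostname for that account and region.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# ecr_login_all
# Reads "<account-id> <region>" pairs on stdin and logs Docker in to each one.
# With DRY_RUN=1 it only prints the registries it would log in to.
ecr_login_all() {
  local account region registry
  while read -r account region; do
    [ -n "$account" ] || continue
    registry="$(ecr_registry "$account" "$region")"
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "would log in to $registry"
      continue
    fi
    # </dev/null keeps aws from consuming the loop's stdin.
    aws ecr get-login-password --region "$region" </dev/null \
      | docker login --username AWS --password-stdin "$registry" || return 1
  done
}
```

This runs the same way on a laptop and in CI, but it does nothing about expiry. A long job still needs to call it again before the token lapses.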

5
Decide how state is bootstrapped and stored

Chicken, meet egg. Terraform's remote state lives in a bucket — and Terraform can't create that bucket, because it needs it to run. So you decide how the bucket gets bootstrapped. Maybe a one-time CloudFormation template, maybe a script, maybe a special "zeroth" Terraform run with a local backend you migrate later. Pick a path. Document it. Run it once per environment. And keep the bootstrap stack out of arm's reach of routine workflows — if it can be destroyed, it can take the state for every stack in the environment with it.

Common approach

The common approach is a shell script that creates the bucket, configures versioning and encryption, then migrates the bootstrap state into the bucket it just created:

aws s3api create-bucket --bucket my-tfstate-prod --region us-east-2 \
  --create-bucket-configuration LocationConstraint=us-east-2
aws s3api put-bucket-versioning --bucket my-tfstate-prod \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket my-tfstate-prod \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
 
# Migrate the bootstrap state from local backend into the bucket
terraform init -migrate-state \
  -backend-config="bucket=my-tfstate-prod" \
  -backend-config="key=bootstrap/terraform.tfstate" \
  -backend-config="region=us-east-2"

Alternatively, a CloudFormation template that owns the bucket forever. That puts a different IaC tool in charge of the foundation of your IaC, with its own update path, its own drift behavior, and its own bus factor. The shell-script alternative gets treated as a one-shot, even though reproducible environments imply running it again every time you spin up a new one — which makes bootstrap a first-class concept rather than a one-off script. In practice, teams reuse a single state bucket across every environment, which works until blast radius or compliance scoping comes up.
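
If bootstrap is a first-class concept, the script has to be safe to run more than once. A minimal sketch of that idea follows; the function name is hypothetical, it covers only bucket creation and versioning, and it assumes a region other than us-east-1 (which rejects a LocationConstraint).

```shell
#!/usr/bin/env bash
# Sketch only: ensure_state_bucket is a hypothetical helper name.

# ensure_state_bucket <bucket> <region>
# Creates the bucket only when head-bucket says it is not reachable, then
# re-applies versioning either way so a re-run converges on the same settings.
# Caveat: head-bucket also fails on a bucket owned by someone else, in which
# case create-bucket fails too and the function returns non-zero.
ensure_state_bucket() {
  local bucket="$1" region="$2"
  if aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1; then
    echo "exists: $bucket"
  else
    aws s3api create-bucket --bucket "$bucket" --region "$region" \
      --create-bucket-configuration "LocationConstraint=$region" >/dev/null || return 1
    echo "created: $bucket"
  fi
  aws s3api put-bucket-versioning --bucket "$bucket" \
    --versioning-configuration Status=Enabled
}
```

Called once per environment, for example ensure_state_bucket my-tfstate-dev us-east-2, it gives each environment its own bucket instead of one shared across all of them.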

6
Decide how configuration flows in

Production Terraform wants the same shape of config that every other config-management tradition has settled on: organization defaults at the bottom, per-environment overrides on top, per-root-module tweaks on top of that, all layered together. Helm values, Kustomize overlays, Ansible group_vars — different ecosystems, same DRY pattern. Terraform doesn't do it. Variable values get replaced, not deep-merged: hand it two .tfvars files that both set tags = {...} and the second one wins outright — the keys don't combine. So the moment you want DRY, layered config — and you will — you have to encode the layering yourself, outside the language. .tfvars files. CLI -var flags. TF_VAR_* environment variables. JSON files generated at runtime. They all work, none of them compose, and most teams end up stitching two or three of them together with a task runner that picks the right -var-file order per call site.

Common approach

The common approach is a mix of .tfvars files, TF_VAR_* exports, direnv rules, and Makefile/Justfile/Taskfile targets that wrap the right -var and -var-file flags onto the Terraform binary. Configuration design ends up living in the task runner — the layering you wanted ("org defaults → environment overrides → root-module tweaks") exists only in the order arguments get assembled by whichever Make target you happened to invoke. Change a value in one place; three places later still hold the old one. The canonical path is unreconstructable, and a plan diverges between laptop and CI without an obvious reason.
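
To show where the layering actually lives, here's a sketch of the flag assembly a task runner ends up doing. The function name and the directory arguments are assumptions; the file names follow the org, environment, and per-region pattern used elsewhere in this post.

```shell
#!/usr/bin/env bash
# Sketch only: var_file_args is a hypothetical helper name.

# var_file_args <vars-dir> <root-dir> <env> <region>
# Prints one -var-file flag per line, lowest precedence first:
# org defaults, then the environment, then the root module's own file.
# Layers whose file does not exist are skipped.
var_file_args() {
  local vars_dir="$1" root_dir="$2" env="$3" region="$4" f
  for f in "$vars_dir/org.tfvars" \
           "$vars_dir/$env.tfvars" \
           "$root_dir/$env-$region.tfvars"; do
    if [ -f "$f" ]; then
      printf -- '-var-file=%s\n' "$f"
    fi
  done
}
```

The order is the whole contract: Terraform applies later -var-file flags over earlier ones, and it still replaces values rather than deep-merging them. A helper like this centralizes the order but doesn't change that behavior.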

7
Pick a tagging strategy

Most teams want a standard set of tags on every resource — Environment, Owner, CostCenter, Project, the rest. That set has to be defined once, applied in every root module, and kept in sync as it evolves. If it lives in tribal knowledge, half your fleet won't have it. If it lives in a shared module, you'd better make sure every root module imports it.

It's a small decision that compounds. Cost allocation, attribution, security audits, cleanup of orphaned resources — all of that gets harder fast when tagging isn't consistent.

In practice, this becomes its own ongoing job — chasing tag-set drift across modules, writing custom validators, leaving the same PR-review comments over and over.

Common approach

The common approach is a tags module imported by every root, copy-paste reminders in PR templates, and — at the deep end of the rabbit hole — a tool like Bridgecrew's yor that literally rewrites your Terraform code to inject tags. Each new root module is a fresh place where the import can be forgotten, and an updated tag set has to make its way through every consumer of the module by hand.
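
The "custom validators" tend to look like the sketch below: a jq filter over the JSON form of a plan. It assumes jq is installed and that tags are set on each resource's tags attribute; tags that arrive only through provider-level defaults may not appear there. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: missing_tags is a hypothetical helper name. Requires jq.

# missing_tags <required-tags-as-json-array>
# Reads `terraform show -json plan.out` output on stdin and prints one line per
# planned resource that has a tags attribute but lacks a required key.
missing_tags() {
  jq -r --argjson required "$1" '
    .resource_changes[]?
    | select(.change.after != null)
    | select(.change.after | has("tags"))
    | .address as $addr
    | ($required - ((.change.after.tags // {}) | keys)) as $missing
    | select($missing | length > 0)
    | "\($addr): missing \($missing | join(", "))"
  '
}
```

In CI it would run as terraform show -json plan.out | missing_tags '["Environment","Owner","CostCenter"]', failing the job when the output is non-empty. It's one more script that has to know the tag set, which is the drift problem in miniature.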

Build

Now the things you actually write to make Terraform usable as a system, not just a CLI.

CI/CD isn't optional anymore. To ship infrastructure-as-code at the speed developers ship application code, every team running Terraform at meaningful scale needs PR-gated plan, automated apply, an audit trail, and parity between what got reviewed and what got applied. That's table stakes for operating Terraform in a team, not a phase-2 deliverable. Without it, changes back up behind whoever has the laptop with the right credentials, plans drift from reality, "who applied what?" becomes a Slack archaeology project, and the infrastructure side of every release turns into the bottleneck the rest of engineering waits on.

The rest of this section is what that machinery actually costs to build by hand.

8
Pick a task runner

Bringing infrastructure up from zero is its own choreography. Bootstrap the backend. Seed the org. Prime the IAM roles. In the right order. With the right credentials. That's not Terraform — that's the thing that runs Terraform.

Copy-paste from a README only goes so far. The same handful of commands gets repeated across stacks, environments, and laptops, and soon enough most teams reach for a tool to automate the sequence. Those tools are called task runners — make, just, go-task, plain shell wrappers. Pick one. Document it. The bar to clear is local reproducibility — the same target has to work on a developer's laptop and in CI, with the same arguments, the same toolchain, and the same outcome. This same runner is also the thing your team will reach for when they need to repeat the same orchestration for sibling pipelines — Packer for golden images, Helm chart releases, schema migrations, whatever else you bake alongside Terraform. The questions you answer here repeat over and over.

Common approach

The common approach is a Makefile that grew its own DSL. A few targets in, the file is already doing what a programming language is for — argument parsing, conditionals, string manipulation — without the tools to make it readable, and behavior diverges across whichever OS the new hire happens to use:

# Makefile
STACK    ?= $(error STACK is required, e.g. make plan STACK=vpc-prod-ue2)
# STACK is <root>-<env>-<region>, e.g. vpc-prod-ue2
ROOT     := $(word 1,$(subst -, ,$(STACK)))
ENV      := $(word 2,$(subst -, ,$(STACK)))
REGION   := $(word 3,$(subst -, ,$(STACK)))
WORKDIR  := terraform/$(ROOT)
TFVARS   := -var-file=../../vars/org.tfvars \
            -var-file=../../vars/$(ENV).tfvars \
            -var-file=./$(REGION).tfvars
BACKEND  := -backend-config=../../backends/$(ENV).hcl

.PHONY: plan apply _check-creds _init
plan apply: _check-creds _init
	cd $(WORKDIR) && terraform $@ $(TFVARS) $(if $(filter apply,$@),-auto-approve,)

# Fail fast when no valid AWS credentials are loaded
_check-creds:
	aws sts get-caller-identity >/dev/null

_init:
	cd $(WORKDIR) && terraform init -reconfigure $(BACKEND) >/dev/null

9
Reach for a templating tool

Pure HCL is enough until it isn't. The classic example: HashiCorp's Terraform doesn't allow variables in the backend block, and until recently didn't allow them in module.source either. The backend is evaluated before the core boots, so bucket = var.state_bucket is rejected. (OpenTofu 1.8 added early static evaluation that lifts this restriction for variables and locals in both backend and module-source contexts — but that's OpenTofu, not Terraform, and it doesn't reach data sources or runtime values.)

The moment you want the same root module to deploy to multiple regions or accounts, the backend changes per deployment, and you're left juggling -backend-config flags forever or templating the file. The same story plays out with provider configurations that vary per environment, and with monkey-patching third-party modules where you can't change the upstream.

So you pick a templating tool. You wire it into your task runner. You make sure CI runs it before terraform init. And you've got one more thing to maintain.

Common approach

The common approach is cookiecutter, envsubst, or a hand-rolled Jinja step in CI. But Terraform can't call any of them — something outside Terraform has to, which means the "native Terraform" workflow is already gone before terraform init ever runs. The pre-step becomes the actual interface to your stack, and dev and CI drift the moment they don't render exactly the same file the same way.

You author a template tree plus a cookiecutter.json that declares its prompts:

{
  "region": "us-east-1",
  "state_bucket": "acme-tf-state"
}

# template/{{cookiecutter.region}}/backend.tf
terraform {
  backend "s3" {
    bucket = "{{ cookiecutter.state_bucket }}"
    key    = "{{ cookiecutter.region }}/terraform.tfstate"
    region = "{{ cookiecutter.region }}"
  }
}

A developer scaffolds a stack interactively, answering the prompts:

cookiecutter ./template
# region [us-east-1]: us-west-2
# state_bucket [acme-tf-state]:

CI has to do the same thing non-interactively, with the answers passed as arguments:

cookiecutter ./template --no-input \
  region=us-west-2 \
  state_bucket=acme-tf-state

cookiecutter is a generator, not a renderer — it scaffolds a new directory once. Updating a stack you already scaffolded means hand-merging the new template output into the live tree, so drift between the template and the stacks you've already shipped is the default state.
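
For contrast, a renderer is something you can re-run on every init. Here's a minimal sketch using a plain shell heredoc instead of a templating tool. The function name is hypothetical, and the values mirror the cookiecutter example above.

```shell
#!/usr/bin/env bash
# Sketch only: render_backend is a hypothetical helper name.

# render_backend <state-bucket> <region>
# Prints a backend.tf to stdout. The same inputs always produce the same file,
# so it can be regenerated before every init instead of scaffolded once.
render_backend() {
  local bucket="$1" region="$2"
  cat <<EOF
terraform {
  backend "s3" {
    bucket = "${bucket}"
    key    = "${region}/terraform.tfstate"
    region = "${region}"
  }
}
EOF
}
```

Usage would be render_backend acme-tf-state us-west-2 > backend.tf followed by terraform init -reconfigure. It's still a pre-step outside Terraform, so the point above stands: the wrapper is the interface, not terraform init.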

10
Fetch remote root modules

As your organization grows and the team expands, infrastructure repos multiply — and so does duplication. The same VPC pattern, the same EKS pattern, the same RDS pattern shows up in three teams' codebases, drifting independently. A common response is a library of reusable root modules teams can share — versioned, deployed by reference, the same pattern as a private package registry but for infrastructure.

This is a different problem from sharing child modules. Child modules don't own state; they're building blocks you combine inside a root module to make something deployable, and Terraform already knows how to fetch them. A root-module library is for the deployable units themselves — each one a directory you run terraform apply against, with its own state file — and that's the piece Terraform doesn't ship a way to share.

Terraform runs from a local directory. The folder you point terraform apply at has to be on disk — there's no module "x" { source = "git::..." } for the root module itself. So if the root module lives somewhere else, something has to put a copy of it on disk before init runs.

Terraform does ship terraform init -from-module=git::..., which copies a remote source into the current directory once. It's scaffolding rather than a versioned dependency mechanism — there's no per-run pin that survives the next checkout — so most teams reach for an explicit copy mechanism instead.

Root modules vs. child modules

A root module is the directory you run terraform apply against. It owns the state file, and it's where Terraform actually executes. A root module calls child modules with module "x" { source = "..." } — and Terraform happily fetches those children from Git or a registry.

A root module cannot embed another root module. There is no module "x" { source = "git::..." } for the directory Terraform itself runs in — only for the children it calls. That gap is the entire reason step 10 exists.

That gap is invisible while you're inside one repo and one team. The moment a shared library exists, you're the one building the copy mechanism.

So you pick a mechanism for getting third-party module source onto disk: vendoring (committing the source into your repo), Git submodules, git subtree, a fetcher script that runs before init, a package-manager-style tool. Vendoring has a nice property worth naming on its own — the source lives in your tree, so PR diffs show exactly what was deployed at any commit. Other mechanisms keep the source out of your tree and trade that auditability for a smaller repo. Neither is wrong; they're different tradeoffs.

Whatever you pick, you're now maintaining a dependency-management story alongside the rest of it.

Common approach

The common approach is whichever of these mechanisms a team grabbed first — Git submodules and git subtree are the two most common, with vendor-pull scripts and package-manager-style tools filling out the rest. Any of them works. The mechanism has to keep the same version pinned consistently across every laptop and every runner; the failure mode is drift between those copies, not the choice of tool itself.

Add a remote root module to your platform repo:

git submodule add -b main \
  git@github.com:acme/tf-root-vpc.git \
  stacks/vpc

Submodules track commits, not refs, so pinning to a tag is a separate step:

cd stacks/vpc && git checkout v2.0.0 && cd ../..
git add stacks/vpc && git commit -m "Pin vpc to v2.0.0"

Every fresh clone and every CI runner has to re-init or the directory is empty:

git submodule update --init --recursive

Forget that step once and terraform init runs against an empty directory. The .gitmodules URL/branch and the actually-checked-out commit drift independently — reviewers see "Submodule changed" in a PR without seeing what changed.
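
A guard against the empty-directory case can be sketched like this. It relies on git submodule status prefixing uninitialized submodules with a "-"; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: uninitialized_submodules is a hypothetical helper name.

# uninitialized_submodules
# Reads `git submodule status` output on stdin and prints the path of every
# submodule that has not been initialized (lines that start with "-").
# Paths containing spaces are not handled.
uninitialized_submodules() {
  awk 'substr($0, 1, 1) == "-" { print $2 }'
}

# Example guard before terraform init (left commented out so that sourcing
# this file has no side effects):
#   missing="$(git submodule status --recursive | uninitialized_submodules)"
#   [ -z "$missing" ] || { echo "uninitialized: $missing" >&2; exit 1; }
```

It catches the forgotten init, but not the second failure mode: a submodule that is initialized and checked out at a different commit than the one the PR pinned shows up with a "+" prefix and would need its own check.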

11
Plan only what changed

Once you're in a monorepo, you don't want every change to trigger every plan. A typo fix in a README shouldn't replan production. You need tooling that reads the diff, understands which root modules are affected by which files, and runs Terraform only on those. This is independent of Terraform itself, and Terraform doesn't ship it.

So you write a path-based CI matrix. Or a Bash script. Or you adopt a tool that understands stack dependencies. Either way, it's one more layer of CI you own.

Common approach

The common approach is tj-actions/changed-files, for years the most popular building block for path-based CI matrices. That makes a critical CI primitive a third-party dependency. In March 2025 the Action was compromised (tracked as CVE-2025-30066): the malicious versions printed CI secrets into the workflow logs of repos that ran them, and thousands of repos depended on the Action. The point isn't that this Action was uniquely careless; it's that the supply-chain surface was always there. The shape of what teams wire up looks like this: changed-files resolves the diff, an awk-and-jq shim turns the file list into a matrix, and a fan-out job runs terraform plan per module.

# .github/workflows/plan.yml
on: pull_request
jobs:
  detect:
    runs-on: ubuntu-latest
    outputs:
      stacks: ${{ steps.matrix.outputs.stacks }}
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with: { fetch-depth: 0 }
      - id: changed
        uses: tj-actions/changed-files@95690f9ece77c1740f4a55b7f1de9023ed6b1f87 # v46.0.5
        with: { files: terraform/** }
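      - id: matrix
        env:
          # Pass the file list through the environment so file names are never
          # interpolated into the script text itself.
          CHANGED: ${{ steps.changed.outputs.all_changed_files }}
        run: |
          echo "stacks=$(printf '%s\n' "$CHANGED" \
            | tr ' ' '\n' \
            | awk -F/ '{print $2}' | sort -u \
            | jq -R -s -c 'split("\n") | map(select(length>0))')" >> "$GITHUB_OUTPUT"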
 
  plan:
    needs: detect
    if: needs.detect.outputs.stacks != '[]'
    strategy:
      matrix:
        stack: ${{ fromJSON(needs.detect.outputs.stacks) }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - run: make plan STACK=${{ matrix.stack }}-prod-ue2
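
The Bash-script alternative, with no marketplace dependency, can be sketched in a few lines. It assumes root modules live at terraform/<root>/ as in the workflow above; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: changed_stacks is a hypothetical helper name.

# changed_stacks
# Reads changed file paths on stdin, one per line, and prints the unique
# root-module directory names under terraform/. Files directly in terraform/
# and files outside it are ignored.
changed_stacks() {
  awk -F/ '$1 == "terraform" && NF >= 3 { print $2 }' | sort -u
}
```

Locally and in CI the input would come from something like git diff --name-only origin/main...HEAD | changed_stacks. Like the workflow above, it only sees paths: a change to a shared child module won't replan the roots that consume it unless you teach it the dependency graph.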

12
Decompose your monolithic root module

Eventually, one root module isn't enough. There's always a reason — governance, security blast radius, scale, performance, parallelism, team ownership. We wrote about this in Service-Oriented Terraform. Decomposition isn't really a Terraform decision; it's an organizational one.

The moment you decompose, you've opened a fresh category of problems. How do roots pass values to each other? Remote state lookups, SSM parameters, a service catalog? Where does each piece's state live, and who can read it? When workload teams own their own repos, how do they reuse modules and reach into shared infrastructure consistently?

These are real architectural questions, and Terraform leaves them entirely to you.

Common approach

The common approach, once decomposition starts to bite, is Terragrunt with dependency blocks pointing at remote-state buckets across repos. Some teams reach for Makefile target dependencies, or Bazel, to express the DAG of root-module dependencies. Terragrunt deserves credit here: it pioneered durable patterns for multi-account organization, dependency wiring, and per-environment config inheritance, and a lot of teams are still on it for good reason. But it solves one slice. CI integration, plan rendering, drift detection, PR comments, supply-chain pinning, and OIDC are still on you, on top of a second DSL to learn, version, and debug.
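
For a sense of what "express the DAG" means by hand, here's a sketch built on the standard tsort utility. The edge list format, the function names, and the stack names in the usage line are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: apply_order and run_in_order are hypothetical helper names.

# apply_order
# Reads "<dependency> <dependent>" pairs on stdin and prints the root modules
# in an order where every dependency comes before the things that need it.
# tsort reports an error if the pairs contain a cycle.
apply_order() {
  tsort
}

# run_in_order <command...>
# Reads the same pairs on stdin and runs the command once per root module, in
# dependency order, stopping at the first failure.
run_in_order() {
  local stack
  apply_order | while read -r stack; do
    "$@" "$stack" </dev/null || return 1
  done
}
```

For example, printf 'vpc eks\nvpc rds\neks apps\n' | run_in_order echo plan prints vpc before eks and rds, and eks before apps. It orders the runs and nothing else; passing outputs between roots is still a separate problem.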

13
Onboard the rest of your team

Up to here, everything's been about getting the system to work. Now somebody else has to use it. Take stock of what a new hire — or any engineer touching an unfamiliar root module — has to know to run a single terraform plan. The right binary version, on the path. Cloud credentials, refreshed. Registry credentials, handed off. The right working directory — which root module, in which folder, in which repo, especially after decomposition. The right -var-file flags in the right order, because the config layering from step 6 only resolves correctly if the task runner assembles them correctly. The right -backend-config, if the backend is templated. And whichever workspace selection the layout demands.

That's a checklist that lives in someone's head, in a Makefile target nobody remembers naming, or in a Slack thread from six months ago. Terraform doesn't ship a "what should I run, here?" — so the command, the folder, the flags, and the prerequisites stay there, and onboarding is whatever the previous hire wrote down before getting pulled into something else.

Common approach

Teams settle in one of two places. One is a Makefile or Justfile that grows a target per root module (make plan-vpc-prod, make plan-eks-dev, …). The wall of targets becomes the new discoverability problem: an engineer asking "what do I run for vpc in prod-uw2?" either finds the matching target or reads the wrapper to figure out what to type. The other is a thinner wrapper that takes a stack name as an argument — which works, but only after the developer already knows the stack name, the -var-file order, and which -backend-config to pass. Both relocate the long argument list rather than encoding the layout from step 1.

Same "plan the VPC in prod, us-east-2," three different places teams park the incantation: the raw commands, a per-stack Makefile target, or a thin wrapper. The raw commands look like this:

cd terraform/vpc
terraform init -reconfigure \
  -backend-config=../../backends/prod.hcl \
  -backend-config="key=vpc/prod/us-east-2.tfstate"
terraform workspace select prod-ue2 || terraform workspace new prod-ue2
terraform plan \
  -var-file=../../vars/org.tfvars \
  -var-file=../../vars/prod.tfvars \
  -var-file=./prod-ue2.tfvars \
  -var region=us-east-2 \
  -var environment=prod \
  -out=plan.out
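
One partial answer to "what should I run, here?" is to derive the list of stacks from the files on disk. A minimal sketch, assuming each root module directory holds one <env>-<region>.tfvars file per deployment as in the commands above; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: list_stacks is a hypothetical helper name.

# list_stacks [terraform-dir]
# Prints one stack name per line, derived from the layout:
#   <terraform-dir>/<root>/<env>-<region>.tfvars  ->  <root>-<env>-<region>
# Any other .tfvars file in a root module directory would show up as a stack,
# so this only works if the naming convention is followed strictly.
list_stacks() {
  local base="${1:-terraform}" f root name
  for f in "$base"/*/*.tfvars; do
    [ -f "$f" ] || continue   # the glob matched nothing
    root="$(basename "$(dirname "$f")")"
    name="$(basename "$f" .tfvars)"
    printf '%s-%s\n' "$root" "$name"
  done | sort
}
```

Running list_stacks from the repo root gives a new hire the set of valid stack names. It doesn't tell them what any stack contains or which ones are safe to touch.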

Once your team is running plans, the next surface is review. Reviewers don't open a terminal; they look at the PR, the CI job UI, and whatever a deploy posts back. The next four steps are about that review surface — making CI legible to the people who didn't write the change. None of it ships in Terraform.

14
Render a readable CI job summary

Raw terraform plan output is not friendly. Inside a GitHub Actions job UI, it's a wall of green plus signs and red minus signs. You want a clean, scannable summary at the top of the job: what's changing, where, and how much. That's a tool, an action, or a script. Pick one.

Common approach

The common approach is tfcmt — wrap terraform plan and write the formatted result to $GITHUB_STEP_SUMMARY so it lands at the top of the job UI:

# .github/workflows/plan.yml (excerpt)
- name: Install tfcmt
  run: |
    curl -fsSL https://github.com/suzuki-shunsuke/tfcmt/releases/download/v4.14.5/tfcmt_linux_amd64.tar.gz \
      | tar -xz -C /usr/local/bin tfcmt
 
- name: Plan + summary
  run: |
    tfcmt --output "$GITHUB_STEP_SUMMARY" plan -patch -- \
      terraform plan -no-color -out=plan.out

It's one more pinned dependency — a binary you curl-install at a fixed release, with its own maintainer, its own release cadence, and its own seat on your supply-chain surface.
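
The script option, for comparison, can be sketched without any extra binary. It assumes the plan text contains Terraform's usual "Plan: N to add, N to change, N to destroy." line and treats its absence as zero changes, so the caller still has to check the plan's exit code. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: plan_summary_md is a hypothetical helper name.

# plan_summary_md <stack>
# Reads `terraform plan -no-color` output on stdin and prints a Markdown table
# with the add, change, and destroy counts.
plan_summary_md() {
  local stack="$1" line add=0 change=0 destroy=0
  line="$(grep -E '^Plan: .*to add' | tail -n 1)"
  if [ -n "$line" ]; then
    add="$(printf '%s\n' "$line" | sed -E 's/.* ([0-9]+) to add.*/\1/')"
    change="$(printf '%s\n' "$line" | sed -E 's/.* ([0-9]+) to change.*/\1/')"
    destroy="$(printf '%s\n' "$line" | sed -E 's/.* ([0-9]+) to destroy.*/\1/')"
  fi
  printf '| Stack | Add | Change | Destroy |\n'
  printf '|---|---|---|---|\n'
  printf '| %s | %s | %s | %s |\n' "$stack" "$add" "$change" "$destroy"
}
```

In a workflow step the output would be appended with plan_summary_md vpc < plan.txt >> "$GITHUB_STEP_SUMMARY". It trades a pinned dependency for a regex over human-readable output, which can break when the output format changes.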

15
Post a plan summary as a PR comment

Reviewers shouldn't have to click into the job to see what's changing. A PR comment with the plan summary is now table stakes. You'll need an action that posts it, updates it on subsequent pushes (rather than spamming a new one), and survives force-pushes without leaving stale comments behind. That action either exists, or you write one, or you live with the spam.

Common approach

The common approach is to take the markdown summary from step 14 and post it through peter-evans/create-or-update-comment, paired with peter-evans/find-comment to find the existing sticky and update it in place instead of spamming a new one on every push:

# .github/workflows/plan.yml (excerpt — continues from step 14)
- name: Build PR comment body
  run: |
    {
      echo '<!-- terraform-plan:vpc -->'
      echo '## Terraform plan: vpc'
      echo
      echo '```diff'
      terraform show -no-color plan.out
      echo '```'
    } > plan.md
 
- name: Find existing plan comment
  id: find
  uses: peter-evans/find-comment@b30e6a3c0ed37e7c023ccd3f1db5c6c0b0c23aad # v4.0.0
  with:
    issue-number: ${{ github.event.pull_request.number }}
    comment-author: github-actions[bot]
    body-includes: "<!-- terraform-plan:vpc -->"
 
- name: Post or update plan comment
  uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5.0.0
  with:
    issue-number: ${{ github.event.pull_request.number }}
    comment-id: ${{ steps.find.outputs.comment-id }}
    edit-mode: replace
    body-file: plan.md

You now own a small protocol — the HTML-comment marker <!-- terraform-plan:vpc --> is the only thing that lets find-comment re-locate the sticky. One marker per stack and per environment, or comments collide and overwrite each other. GitHub also caps a single comment at 65,536 characters; large estates blow past that and the workflow has to truncate or split. And there are two more pinned dependencies — find-comment and create-or-update-comment — on the same supply-chain surface step 11 already warned about.
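Both of those constraints can be handled in a few lines of shell before the comment is posted. What follows is a minimal sketch, not part of either Action: plan_marker and truncate_plan are hypothetical helper names, the stack:environment marker format is just one possible convention, and the 60,000-byte default is an assumed budget that leaves headroom under GitHub's limit for the heading and the code fence.

```shell
# Sketch only: helper names, marker format, and the default budget are assumptions.

# Build a sticky-comment marker that is unique per stack and environment.
plan_marker() {
  printf '<!-- terraform-plan:%s:%s -->' "$1" "$2"
}

# Read plan text on stdin and print it cut to at most $1 bytes (default 60000),
# appending a notice when anything was dropped.
truncate_plan() (
  limit="${1:-60000}"
  body="$(cat)"
  size=$(( $(printf '%s' "$body" | wc -c) ))
  if [ "$size" -le "$limit" ]; then
    printf '%s\n' "$body"
  else
    printf '%s' "$body" | head -c "$limit"
    printf '\n\n... plan truncated at %s bytes; see the job log for the full output\n' "$limit"
  fi
)
```

Truncate the plan text before wrapping it in the diff fence, so the closing fence is never the part that gets cut. The budget is counted in bytes, which is conservative against a character limit, but a cut can still land in the middle of a multi-byte character.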

16
Wire preview environments to the Deployments API

If you spin up preview environments per pull request, you need a place to surface the URL. GitHub's Deployments API is the right surface — it gives you a clean status indicator on the PR and a deployments tab on the repo. Pick a tool that posts there. Make sure it cleans up the deployment when the PR closes, or you'll have a graveyard of stale "active" environments before long.

Common approach

The common approach is bobheadxi/deployments or chrnorm/deployment-action — both unofficial, both written in TypeScript, both adding to the supply-chain surface. Almost nobody owns the GitHub App side of preview-environment plumbing in-house, so the question quietly becomes which third-party Action you bet on and how you handle it when the maintainer disappears.

17
Pipe Terraform outputs to downstream steps

If subsequent CI steps consume Terraform outputs — uploading assets to a freshly-created bucket, triggering a deployment to a freshly-created cluster — you need to translate those outputs into GitHub-style environment variables or step outputs. Pick a tool. Test it against complex output types. Keep it working when Terraform's output format shifts between versions.

Common approach

The common approach is terraform output -json piped through jq and echoed into $GITHUB_OUTPUT:

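# Scalar string output: works fine
echo "bucket_name=$(terraform output -raw bucket_name)" >> "$GITHUB_OUTPUT"
 
# Nested object output: has to use the delimited multiline form.
# Generate the delimiter once; calling uuidgen twice would produce two
# different strings, and the closing line would never match the opening one.
delim="EOF_VPC_$(uuidgen)"
{
  echo "vpc_config<<${delim}"
  terraform output -json vpc_config | jq -c .
  echo "${delim}"
} >> "$GITHUB_OUTPUT"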

Two things stack here. First, complex output types (nested objects, lists of objects, anything that isn't a scalar string) need a custom jq expression per output, and that expression lands as a copy-pasted shell snippet in every workflow that consumes the value, drifting the moment one of them gets edited. Second, $GITHUB_OUTPUT requires a delimited multiline format (name<<DELIM … value … DELIM) for any value containing newlines, which pretty-printed JSON always does. Producing that correctly out of jq for a nested object means picking a delimiter that can't appear in the value, writing the same delimiter on the opening and closing lines, getting the quoting right, and handling the case where jq emits multiple lines. The first nested output is usually where this breaks: truncated values surface in downstream steps, and the per-workflow shell snippet ends up never quite the same in two places.
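One way to stop that snippet from drifting is to put the delimiter handling in a single function. This is a sketch under assumptions, not a standard helper: set_output is a hypothetical name, and it reads random bytes from /dev/urandom with od in place of uuidgen so it runs on images where uuidgen isn't installed.

```shell
# Sketch only: set_output is a hypothetical helper name.
# Append NAME and a possibly multi-line VALUE to the file named by $GITHUB_OUTPUT.
set_output() (
  name="$1"
  value="$2"
  # One delimiter, generated once and reused for the opening and closing lines.
  delim="ghadelim_$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')"
  # Refuse to write if the value happens to contain the delimiter.
  case "$value" in
    *"$delim"*) echo "delimiter collision for $name" >&2; exit 1 ;;
  esac
  {
    printf '%s<<%s\n' "$name" "$delim"
    printf '%s\n' "$value"
    printf '%s\n' "$delim"
  } >> "$GITHUB_OUTPUT"
)
```

A workflow step would then call something like set_output vpc_config "$(terraform output -json vpc_config)" and leave the delimiter out of the per-workflow shell entirely.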

A short aside on those last four. Each one is a small piece of CI ergonomics, and each typically gets solved by reaching for a third-party GitHub Action, most written in TypeScript with their own transitive node_modules dependency tree. Browse a popular collection like dflook/terraform-github-actions and count: terraform-fmt, terraform-validate, terraform-plan, terraform-apply, terraform-output, terraform-version, terraform-new-workspace, terraform-destroy-workspace, and on. A team running real CI ends up pinning a dozen or more, each one expanding the supply-chain surface area of your infrastructure pipeline. The compromise of tj-actions/changed-files in March 2025 made the cost of that surface area concrete: a single popular Action was modified to exfiltrate CI secrets across thousands of repos before anyone noticed. The point isn't that GitHub Actions are dangerous. It's that by the fifth Action, you've assembled a supply chain you didn't design.
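If you want to see how much of that surface is actually pinned, a grep over the workflow directory gets you a first answer. This is a rough sketch: unpinned_actions is a hypothetical helper name, and the pattern assumes each uses: value sits on a single line.

```shell
# Sketch only: unpinned_actions is a hypothetical helper name.
# Print every `uses:` reference that is not pinned to a full 40-character
# commit SHA. Local (./path) and docker:// references are skipped.
unpinned_actions() {
  grep -rhoE 'uses:[[:space:]]*[^[:space:]#]+' "$@" \
    | sed -E 's/^uses:[[:space:]]*//' \
    | tr -d "\"'" \
    | grep -vE '^(\./|docker://)' \
    | grep -vE '@[0-9a-f]{40}$' \
    | sort -u
}
```

Run it as unpinned_actions .github/workflows. It tells you what isn't pinned, not whether the SHA you pinned is one you should trust.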

The end result

By the time a team's CI is doing all of those things (change detection, OIDC, plan, sticky comment, deployment, output piping), a single deploy job stitches together nine separate Actions from six different maintainers, each pinned to a SHA:

# .github/workflows/apply.yml (excerpt)
jobs:
  apply:
    runs-on: ubuntu-latest
    permissions: { id-token: write, contents: read, pull-requests: write, deployments: write }
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - uses: hashicorp/setup-terraform@b9cd54a3c349d3f38e8881555d616ced269862dd # v3.1.2
      - uses: aws-actions/configure-aws-credentials@b47578312673ae6fa5b5096b330d9fbac3d116df # v4.2.1
        with: { role-to-assume: arn:aws:iam::123456789012:role/CIRole, aws-region: us-east-2 }
      - uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
      - uses: bobheadxi/deployments@648679e8e4915b27893bd7dbc35cb504dc915bc8 # v1.5.0
        id: deployment
        with: { step: start, env: prod }
      - uses: dflook/terraform-plan@5fc11949b8db4d3f3e75bea2e2cc1f6d11afcc8a # v2.5.0
        id: plan
        with: { path: terraform/vpc, var_file: vars/prod.tfvars }
      - uses: suzuki-shunsuke/tfcmt-action@b1f9f7a0b5b8b2dbcd0fce2e0a6b3c0d8a3f1c2e # v1.2.0
        with: { config: .tfcmt.yaml }
      - uses: dflook/terraform-apply@5fc11949b8db4d3f3e75bea2e2cc1f6d11afcc8a # v2.5.0
        with: { path: terraform/vpc, var_file: vars/prod.tfvars, auto_approve: true }
      - uses: dflook/terraform-output@5fc11949b8db4d3f3e75bea2e2cc1f6d11afcc8a # v2.5.0
        id: tf-output
        with: { path: terraform/vpc }
      - uses: bobheadxi/deployments@648679e8e4915b27893bd7dbc35cb504dc915bc8 # v1.5.0
        if: always()
        with: { step: finish, status: "${{ job.status }}", deployment_id: "${{ steps.deployment.outputs.deployment_id }}", env_url: "${{ steps.tf-output.outputs.app_url }}" }

Operate

The system is built. Now you have to live with it.

18
Inventory what you've got

Once you have dozens of root modules across a handful of environments, just seeing what's there becomes a job. Which root modules are deployed where? What does the merged configuration look like for prod-us-west-2? Which stacks reference which root modules, with what overrides?

You'll want a CLI that can list root modules, list stacks, describe the composed config for any one of them, and answer those questions without grepping through directories. If you don't have one, your developers will start writing little scripts that do half of it. Then those scripts will diverge.

Common approach

The common approach is some flavor of directory walk taped to the wiki as the "how to see what we have" snippet:

# What root modules do we have?
find terraform -maxdepth 1 -mindepth 1 -type d
 
# Which stacks reference each one?
grep -rl "terraform/vpc" stacks/

It relies on the directory tree being the source of truth — which reads back layout, not configuration, so it can't answer "what does the merged config look like for prod-us-west-2" or "which stacks override what." Each developer who needs that writes their own little script that does half of it, and the scripts diverge. A SaaS runner like Terraform Cloud or Spacelift answers "what's deployed where" via its workspace list — but not "what's the composed config for this one"; that question is still on you.
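The slightly more grown-up version of that wiki snippet loops over every root module and prints who references it. It's a sketch with assumptions baked in: inventory is a hypothetical function name, root modules are assumed to be the immediate subdirectories of the first argument, and stacks are assumed to mention a module by its terraform/<name> path.

```shell
# Sketch only: inventory is a hypothetical helper; the layout is an assumption.
# For each directory under $1 (root modules), list the files under $2 (stacks)
# that mention its terraform/<name> path.
inventory() (
  modules_dir="$1"
  stacks_dir="$2"
  for module in "$modules_dir"/*/; do
    [ -d "$module" ] || continue
    module="${module%/}"
    name="${module##*/}"
    # Require a non-name character (or end of line) after the path so that
    # terraform/vpc does not also match terraform/vpc-peering.
    refs="$(grep -rlE "terraform/$name([^[:alnum:]_-]|\$)" "$stacks_dir" 2>/dev/null \
      | sort | paste -s -d ' ' - || true)"
    printf '%s: %s\n' "$name" "${refs:-(unreferenced)}"
  done
)
```

Called as inventory terraform stacks from the repo root, it prints one line per root module. It has the same blind spot as the one-liners it replaces: it reads layout and string matches, not the merged configuration.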

19
Encode operator playbooks for everyone else

The people who use what you've provisioned aren't all running terraform plan. Somebody has to write secrets into Secrets Manager once the database is up. Somebody has to set parameters in Parameter Store that the app reads on boot. Somebody has to upload an artifact to the S3 bucket the app consumes. Somebody has to roll a credential at 3 a.m. The boring everyday work that sits between "infrastructure exists" and "the app uses it." You'll write down the commands they run — the playbooks — and that's a piece of the platform too.

Where do those playbooks live, though? Makefiles are convenient until you need to pass arguments cleanly, which Make doesn't do well. Justfiles handle arguments better. go-task is solid. Plain shell scripts only behave consistently across Mac, Linux, and Windows if you're disciplined about POSIX — and the moment somebody on Windows joins the team, that discipline breaks. Whichever you pick is one more binary to install on every laptop and every runner, one more set of conventions to teach, one more thing to keep current.

So you pick something. You commit to it. You make sure it works on every laptop your team carries. And you keep adding to it as the system grows, because every new piece of infrastructure ships with a new playbook for whoever consumes it.

Common approach

The common approach is a Makefile with thirty targets and a README section called "Common Tasks." Make doesn't do argument-passing well, behaves differently on Windows, and the README section lags the targets every time a new playbook is added. And there isn't one Make: most Linux distros and CI runners ship GNU make 4.x, macOS ships GNU make 3.81 (a 2006 release that predates many newer features), and the BSDs ship BSD make, which diverges from GNU on everything past the basics: conditionals, includes, $(shell), $(call), pattern rules. Either you write to current GNU make and document brew install make (then call it as gmake), or you write to a portable subset that gives up half of what made you reach for Make in the first place. A representative slice follows; note the %: ; @: no-op at the bottom, wired in to eat positional arguments because Make doesn't actually support them:

# Makefile (excerpt — actual file is ~30 targets)
# NOTE: requires GNU make; $(error) and $(filter-out) are GNU functions that BSD make does not implement.
# macOS ships GNU make 3.81; `brew install make` and invoke as `gmake` for a current release.
 
.PHONY: login seed-secrets set-params upload-fixtures rotate-secret refresh-creds
 
ENV  ?= dev
NAME ?= $(error NAME is required, e.g. make rotate-secret NAME=db-password)
 
login:           ; aws sso login --profile $(ENV)-admin
seed-secrets:    login ; ./scripts/seed-secrets.sh $(ENV) $(filter-out $@,$(MAKECMDGOALS))
set-params:      login ; ./scripts/set-params.sh $(ENV)
upload-fixtures: login ; aws s3 sync ./fixtures s3://$(ENV)-app-fixtures --profile $(ENV)-admin
rotate-secret:   login ; aws secretsmanager rotate-secret --secret-id app/$(ENV)/$(NAME) --profile $(ENV)-admin
refresh-creds:   login ; aws sts get-caller-identity --profile $(ENV)-admin
 
# Eats positional arguments to `make` so `make seed-secrets my-branch` doesn't error.
%: ; @:

20
Document the whole thing end-to-end

Twenty steps in. A new hire should be able to read the docs and ship safely. That means everything above (layout, install, auth, state, runner, config, tags, templating, module sourcing, decomposition, inventory, playbooks, CI) has to be written down and kept current. Per-root-module reference docs are their own pipeline: variables, outputs, providers, and resources go out of date the moment someone edits the HCL. The community answer is generation, and the de facto tool is terraform-docs.

Beyond the docs themselves, each accumulated tool needs a maintenance story: who tracks releases, how upgrades land, and ideally a contract test or two so a dependency bump doesn't quietly break the next workflow run. Most teams don't write those tests, and find out something broke when a workflow fails on what looked like a routine change. Tools also come and go: driftctl entered maintenance mode in 2023, popular Actions get archived, and the supply-chain surface widens with every one you adopt. In March 2025, the compromise of tj-actions/changed-files (tracked as CVE-2025-30066) exfiltrated CI secrets from thousands of repos before anyone noticed.

And there's the cognitive load. The first month for a new engineer is a dozen unrelated tools — each with its own configuration, command structure, update cadence, and the workarounds the team patched around to make them play together. None of that is anybody's fault. It's the natural result of solving twenty-one decisions independently. It's just a lot to absorb before you can ship.

Common approach

The common approach is terraform-docs in a pre-commit hook for the per-module table, alongside a docs/ folder for architecture and onboarding. The reference table is regenerated on every change; the prose docs update on whatever cadence the team writes them.

21
Detect drift and reconcile it

Once everything above is in place, there's still one last thing.

Reality and your state file diverge. It's not an if. Providers change behavior between versions, and a re-apply you didn't run silently changes the truth. Click-ops happens during incidents — somebody fixes prod through the console at 2 a.m., and your Terraform code doesn't know. And, occasionally, you've been breached and you don't know it — an attacker has changed something live and your code is the last place that'll catch it.

Detection is the easier half. You schedule a periodic plan. Diff the result. Pipe it to a Slack channel. Terraform doesn't do any of that for you, so you build it.

Reconciliation is the harder half — the part nobody wants to design upfront. What does drift resolve to? Some drifts get auto-applied back to source. Some get a ticket. Some block deploys until a human reviews them. Some need a person to decide whether the source or the live resource is right. None of that is a terraform plan flag; it's a workflow you build, with approval gates, audit trails, and paging policies attached.

This is the capstone — the step that makes everything before it actually keep working over time. Without it, drift accumulates silently, and your codebase slowly turns into a fossil of how things used to be configured.

Common approach

If you're just running Terraform, the common approach is a scheduled GitHub Action that runs terraform plan -detailed-exitcode per root module on a cron, posts the diff to a Slack channel, and opens a PR or ticket when the exit code is 2 (changes present; exit code 1 means the plan itself failed). That works on day one. The failure mode it ages into is that there's no policy distinguishing drift that should auto-reconcile from drift that should page someone, so every drift gets the same treatment, which after a quarter is "ignored." The Slack channel and cron get whatever attention the original author still has cycles for, and driftctl (the tool many teams reached for to enrich the diff) has been in maintenance mode since 2023.
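The detection half leans on one detail of -detailed-exitcode: 0 means no changes, 1 means the plan errored, and 2 means there are changes to apply. A minimal sketch of the per-module check follows. The names classify_plan_exit, check_drift, and PLAN_CMD are hypothetical, and PLAN_CMD exists only so the logic can be exercised without Terraform installed.

```shell
# Sketch only: function names and the PLAN_CMD override are assumptions.

# Map the exit status of `terraform plan -detailed-exitcode` to a label.
classify_plan_exit() {
  case "$1" in
    0) echo clean ;;   # no changes
    2) echo drift ;;   # changes present
    *) echo error ;;   # the plan itself failed
  esac
}

# Run a plan inside one root module directory and print "<dir> <label>".
check_drift() (
  dir="$1"
  cd "$dir" || exit 1
  # Capture the status without tripping `set -e` in the calling script.
  ${PLAN_CMD:-terraform plan -detailed-exitcode -input=false -lock=false -no-color} \
    >/dev/null 2>&1 && status=0 || status=$?
  printf '%s %s\n' "$dir" "$(classify_plan_exit "$status")"
)
```

Keeping error separate from drift matters on a cron: an expired credential should not open the same ticket as a changed security group.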

Either way, the reconciliation-policy layer — auto-revert vs. ticket vs. block-the-deploy vs. page on-call, plus who's allowed to override any of those at 2 a.m. — is additional tooling on top of terraform plan. You build it yourself, or you pay for a SaaS that does it for you (Terraform Cloud, Spacelift, env0, Scalr, others). The choice is real, but the layer doesn't disappear.
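At its smallest, that policy layer is a lookup from "what drifted, and where" to "what happens next." The sketch below is illustrative only: drift_action is a hypothetical name, and the rules (which environments and resource types page, block, revert, or ticket) are invented examples, not a recommendation.

```shell
# Sketch only: drift_action and every rule below are hypothetical examples.
# Arguments: environment, Terraform resource type.
# Prints one of: page | block-deploy | auto-revert | ticket
drift_action() {
  case "$1:$2" in
    prod:aws_iam_*|prod:aws_security_group*) echo page ;;
    prod:*)                                  echo block-deploy ;;
    *:aws_autoscaling_group)                 echo auto-revert ;;
    *)                                       echo ticket ;;
  esac
}
```

The case statement is the easy part. The approval gates, the audit trail, and the question of who may override a rule at 2 a.m. are what turn it into a workflow.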

And per-stack drift is only the bottom layer. The view a team running infrastructure at scale actually wants is fleet-wide: which stacks are drifting right now, which have been perma-drifting for weeks (the drift nobody plans to fix because reverting would break something else), which workflow runs are failing and at what rate, and — when something does regress — what change to that stack landed most recently. That's change-failure-rate-and-MTTR for infrastructure: the DORA layer, applied to the platform. Terraform doesn't get you anywhere near it. Building it the Hard Way is OpenTelemetry from GitHub Actions into Prometheus, Grafana, or Datadog, with stack-and-root-module labels you keep clean across hundreds of workflow runs and hand-built dashboards somebody has to own. It's possible. It's also a platform engineer for a quarter, and then a permanent owner.
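To make the arithmetic concrete, here is the change-failure-rate calculation on its own, stripped of the telemetry pipeline. It's a sketch under assumptions: failure_rates is a hypothetical name, and the input is assumed to be one "stack conclusion" pair per line, exported from wherever your workflow run history lives.

```shell
# Sketch only: failure_rates and its input format are assumptions.
# Read "stack conclusion" lines on stdin (conclusion is success or failure)
# and print "stack runs failures rate%" per stack, sorted by stack name.
failure_rates() {
  awk '
    { runs[$1]++; if ($2 == "failure") fails[$1]++ }
    END {
      for (s in runs)
        printf "%s %d %d %.0f%%\n", s, runs[s], fails[s], 100 * fails[s] / runs[s]
    }
  ' | sort
}
```

This covers one number for one window of runs. The labels, the retention, the dashboards, and the owner are the parts that take the quarter.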

A Dozen Ways, or One

Look back at the list. Each of those twenty-one crossroads has its own small ecosystem of choices — a version manager, a templating tool, a fetch mechanism, an Action for change detection, an Action for the plan summary, an Action for deployments, a tool for outputs, a renderer for docs, a runner for drift. Compound twenty-one decisions across that catalog and you're maintaining dozens of independently-versioned tools, glue scripts, and CI shims with the same conventions nowhere in particular.

Or one tool that handles the whole set with the same conventions all the way through.

That's what most teams end up with — the dozens, not by choice, but one decision at a time. Each individual decision is reasonable; the result is the system this post just walked you through.

The other path is a framework — one tool that's already made the decisions, with the same conventions across every step. Atmos is the one we built and the one we use. It's not the only valid choice — Terragrunt has been doing real work in this space for years, and a team with its own framework that fits should keep using it. The point of this post isn't which framework. It's that the choice exists, and the cost of treating it as one decision instead of twenty-one is what determines whether the system you're running in three years is something you designed or something that happened to you.

The companion to this post — Terraform the Easy Way — walks the same crossroads with concrete Atmos snippets at each one, so you can see what each decision looks like once a framework has made it for you.

If you're somewhere in this list and want a second set of eyes on what to keep, what to consolidate, and where a framework would buy you the most leverage right now, let's talk.
