No One Tells You How

Issue #4 · infrastructure operating model

by Erik Osterman, CEO & Founder of Cloud Posse

No one tells you how to do this part.

They tell you how to write Terraform. They tell you how to install a Helm chart. They tell you how to create a GitHub Actions workflow, configure OIDC, store a secret, run a scanner, build a container, push a manifest, and wire up a notification.

Every one of those things has documentation.

What no one really tells you is how all of it is supposed to work together.

That is the part every platform team is left to invent. The operating model. The lifecycle around the tools. The answer to questions like: how does this run locally? How does the same thing run in CI? How do credentials work in both places? How do we initialize secrets across every environment, then inject them without leaking? How do we validate this before it touches a real account? How do Kubernetes deployments fit next to Terraform? How do rendered artifacts get committed for Flux or Argo CD? How does a new engineer know the path?

And now: how is an agent supposed to do all of that reliably if the path was never encoded anywhere?

Most teams answer those questions one reasonable decision at a time.

One Action. One wrapper. One script. One Make target. One README section. One scanner. One deploy step. One exception because production is special. One more environment variable because the previous one did not quite fit.

None of those decisions are irrational. That is the important part.

Bash is everywhere. GitHub Actions is convenient. Make is familiar. A wrapper script is often the fastest way to encode the thing you just learned. The first version of most infrastructure platforms is not designed. It accumulates.

And for a while, accumulation looks like progress.

The Part Terraform Solved

Terraform proved something important: infrastructure can be declarative.

You can describe resources, review the change, plan the diff, apply it, and keep state. That was a real shift. It gave teams a shared language for infrastructure changes and moved a huge amount of work out of clickops and into code review.

That part worked.

But Terraform did not solve the entire lifecycle. It was never going to.

It does not decide how your engineers work locally. It does not give you a universal credential model across laptop and CI. It does not make every tool version magically appear on every machine. It does not make secrets safe across every command. It does not tell you how to run realistic infrastructure against local emulators. It does not decide how your GitOps artifacts get rendered and committed. It does not turn SAST scan results into evidence for auditors. It does not encode your operating model so an agent can follow it.

And that is fine. Terraform is a great general-purpose tool.

But the lifecycle around Terraform is where the mess grows.

The most obvious place this shows up is Kubernetes. Most teams do not reach for Terraform as their primary Kubernetes deployment tool, and with good reason. Kubernetes has better native answers. Helm, Helmfile, Flux, Argo CD, KRO, ACK, and combinations of all of them exist because Kubernetes deployments have their own shape, their own reconciliation model, and their own operational expectations.

So the platform team does the sensible thing. Terraform manages the cloud resources. Helm or Helmfile manages releases of third-party software and dependencies. Flux or Argo CD reconciles manifests. KRO or ACK brings cloud resources closer to the cluster and simplifies application deployments for developers. GitHub Actions coordinates the workflow. A scanner checks the results. A script glues together the parts that do not naturally agree.

Again, every individual choice makes sense.

The aggregate is the problem.

Now the real operating model lives between the tools. It lives in the wrapper that knows which stack name maps to which namespace. It lives in the GitHub Action that knows when to render but not apply. It lives in the script that knows how to assume the right role. It lives in the one Make target that works locally but not in CI, and the other one that works in CI but not locally. It lives in that outdated paragraph in the README from six months ago.

Terraform is still declarative.

The lifecycle is not.

The Operating Model Becomes Invisible

This is the part that bothers me most.

The more successful the glue becomes, the harder it is to see.

New engineers do not experience "twenty-five tools." They experience one step at a time. Install this. Export that. Run this target. Push a branch. Wait for CI. Check the comment. Rerun the failed job. Ask someone why staging is different.

The team does not experience supply-chain surface area as one big decision either. It accumulates the same way. One third-party Action to detect changes. One Action to set up Terraform. One Action to post a comment. One Action to update a deployment. One tool to render a summary. One script to pass outputs downstream.

Then an incident happens somewhere in the ecosystem and suddenly the surface area is visible.

The problem is not that every third-party Action is bad. The problem is that the deploy path is now a collection of independently maintained projects, scripts, snippets, environment variables, and assumptions. Nobody meant to build an infrastructure runtime out of all that. But that is what happened.

And because the operating model is scattered, it becomes hard to do the things platform teams actually need to do:

inspect it as a system
reproduce it on a laptop
validate it before CI
explain it to a new engineer
secure the whole path
produce evidence for an auditor
reuse it across repositories
teach it to an agent

And because everyone got there one reasonable step at a time, the answer is usually some version of: "It's just how it's done."

This is where I think AI makes the problem worse before it makes it better.

Not because AI is bad. I use it every day. I think it is already good enough to change how infrastructure work gets done.

But agents amplify the patterns they are given.

If the pattern is "copy this workflow, add another Action, shell out to this script, parse this output, and hope the environment variables are right," an agent will happily do that faster than a human. It will string together tools. It will find examples. It will create more glue. It may even pick dependencies with known supply-chain risk because, from the agent's point of view, that is what the surrounding codebase taught it to do.

The disappointing version of AI for infrastructure is not that it cannot write Terraform.

It can.

The disappointing version is that it writes infrastructure into an operating model nobody can fully explain.

That is not an AI problem. That is a systems problem.

AI needs an operating model.

What Would Be Better?

The better version is not "replace all scripts with magic."

Scripts are useful. Escape hatches matter. Teams will always need custom behavior.

The better version is that the common lifecycle becomes declarative enough that humans, CI, and agents can all reason about it the same way.

Auth should be configuration.

Secrets should be configuration.

Toolchains should be configuration.

Workflows should be configuration.

Dependencies should be configuration.

Local emulators should be configuration.

Containers should be configuration.

Kubernetes and Helm releases should sit beside the rest of the stack instead of living in a separate universe.

Git operations should be a capability, not a pile of bespoke git commands hidden in a deploy job.

The point is not to force every tool into one lowest-common-denominator abstraction. Terraform should still be Terraform. Helm should still be Helm. Flux and Argo CD should still do what they are good at. The point is to give the lifecycle around them a consistent runtime.

Same command. Same auth. Same secrets. Same toolchain. Same stack model. Same behavior locally and in CI.
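To make "same behavior locally and in CI" a little more concrete, here is a minimal sketch in Python. It is not Atmos configuration or any real CLI: the `Lifecycle` fields, the `plan_commands` helper, and every command string it emits are hypothetical names invented for illustration. The only idea it is meant to show is that when the lifecycle is declared as data, the laptop and the pipeline can derive the same steps from it and differ only in how the identity is obtained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lifecycle:
    """Hypothetical declaration of what one component needs in order to run."""

    component: str
    identity: str                    # logical identity name, never a credential
    toolchain: dict[str, str]        # tool name -> pinned version
    secrets: tuple[str, ...] = ()    # secret names, never values
    steps: tuple[str, ...] = ("validate", "plan", "apply")


def plan_commands(lc: Lifecycle, environment: str) -> list[str]:
    """Derive the command list from the declaration.

    `environment` only selects how the identity is obtained; everything else
    comes from the declaration. The command strings are made up.
    """
    login = {"local": "sso", "ci": "oidc"}[environment]
    commands = [f"auth login --identity {lc.identity} --via {login}"]
    commands += [f"tool install {name}@{version}"
                 for name, version in sorted(lc.toolchain.items())]
    commands += [f"secret inject {name}" for name in lc.secrets]
    commands += [f"run {step} {lc.component}" for step in lc.steps]
    return commands


if __name__ == "__main__":
    vpc = Lifecycle(
        component="vpc",
        identity="plat-dev-admin",
        toolchain={"terraform": "1.9.0"},
        secrets=("DATADOG_API_KEY",),
    )
    for env in ("local", "ci"):
        print(env, plan_commands(vpc, env))
```

In this sketch the two plans are identical after the first command. That is the property worth reviewing in a pull request, and it is also something an agent can check rather than guess.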

That is the part agents can use.

Not because agents need everything simplified into one button. Because agents need a system they can inspect. They need capabilities with names. They need repeatable commands. They need configuration that declares intent. They need skills that say how to accomplish a task using known runtime capabilities, not open-ended instructions to rummage through arbitrary scripts and infer what the team probably meant.

That is when AI starts to feel different.

Not "generate me some Terraform and good luck."

More like: "add this component, wire the secrets, validate it locally against an emulator, render the Kubernetes release, commit the artifact to the GitOps repo, and open the PR."

That should not require inventing a new operating model every time.
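One way to picture "capabilities with names" and "skills" is the sketch below. The capability names, the `OPEN_RELEASE_PR` skill, and the `undeclared_steps` helper are all assumptions made up for this example; they mirror the task described above and are not taken from Atmos. The point is only that a skill built from named capabilities can be checked before anything runs.

```python
# Hypothetical capability names; a real runtime would publish its own list.
CAPABILITIES = frozenset({
    "component.add",
    "secrets.wire",
    "emulator.validate",
    "kubernetes.render",
    "git.commit",
    "pr.open",
})

# A skill is modeled here as an ordered list of capability names.
OPEN_RELEASE_PR = [
    "component.add",
    "secrets.wire",
    "emulator.validate",
    "kubernetes.render",
    "git.commit",
    "pr.open",
]


def undeclared_steps(skill, capabilities=CAPABILITIES):
    """Return the steps in a skill that no declared capability provides."""
    return [step for step in skill if step not in capabilities]


if __name__ == "__main__":
    print(undeclared_steps(OPEN_RELEASE_PR))
    print(undeclared_steps(["git.commit", "shell.curl-pipe-bash"]))
```

A skill that reaches for something outside the declared set is flagged up front instead of turning into another piece of glue.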

Where Atmos Fits

This is what Atmos is becoming: an open-source runtime for infrastructure.

Not another point tool. Not another pile of CI glue. A runtime, and in some ways a harness, where more of the lifecycle can be declared, inspected, repeated, and handed to agents.

Terraform and OpenTofu are part of it. So are Kubernetes and Helm. So are containers, emulators, workflows, hooks, secrets, stores, vendoring, toolchains, native CI behavior, Git operations, and scaffolding.

That list sounds broad because the lifecycle is broad.

If you want infrastructure to be reproducible, the runtime has to understand more than terraform apply. It has to understand how the component is configured, which tools it needs, which identity it uses, which secrets are required, what it depends on, how it runs locally, how it runs in CI, what artifacts it emits, and how the next step consumes them.
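"What it depends on" is the easiest of these to show in code. The sketch below assumes a hypothetical set of components and a hand-written dependency map; none of the names come from Atmos. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to derive an order in which every component runs after the things it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical components: each name maps to the set of components it depends on.
DEPENDS_ON = {
    "vpc": set(),
    "eks-cluster": {"vpc"},
    "rds": {"vpc"},
    "app-release": {"eks-cluster", "rds"},
}


def run_order(depends_on):
    """Return an order in which every component follows its dependencies.

    Raises graphlib.CycleError if the declared dependencies form a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    try:
        print(run_order(DEPENDS_ON))
    except CycleError as err:
        print(f"dependency cycle: {err.args[1]}")
```

Because the map is plain data, the same ordering is available to a laptop, a pipeline, or an agent, and a cycle is reported as an error instead of surfacing as a failed deploy.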

That is why the recent Atmos work matters.

Container components and emulators make local development real. You can stand up AWS, GCP, Azure, or K3s-style local targets and validate workflows without starting in a cloud account. Kubernetes and Helm support can live beside Terraform instead of off to the side. Step types and hooks turn lifecycle behavior into declared workflow steps instead of one-off shell scripts. Native CI keeps the same command path between laptop and pipeline. Secrets management gives the runtime ownership of masking and injection. Git operations let rendered artifacts move into the repositories Flux or Argo CD already watch. Scaffolding starts to make whole architectures reproducible from the beginning instead of merely described after the fact.
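As a rough illustration of what "ownership of masking and injection" can mean, here is a small Python sketch. It is not how Atmos implements secrets; `mask_secrets` and `run_with_secrets` are hypothetical helpers, and the secret name and value are invented. The child process receives the secrets as environment variables, and anything it prints passes through the masker before a human or a log sees it.

```python
import os
import subprocess
import sys


def mask_secrets(text, secrets, placeholder="***"):
    """Replace every secret value found in `text` with a placeholder."""
    # Longest values first, so a secret that contains another is fully masked.
    for value in sorted(secrets.values(), key=len, reverse=True):
        if value:
            text = text.replace(value, placeholder)
    return text


def run_with_secrets(argv, secrets):
    """Run a command with secrets injected as env vars; return masked stdout."""
    env = {**os.environ, **secrets}
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=False)
    return mask_secrets(result.stdout, secrets)


if __name__ == "__main__":
    # A child process that carelessly prints its token.
    leaky = [sys.executable, "-c", "import os; print('token=' + os.environ['API_TOKEN'])"]
    print(run_with_secrets(leaky, {"API_TOKEN": "s3cr3t-value"}))
```

When the runtime owns both halves, individual scripts do not each have to remember to avoid echoing a value. Simple string replacement like this will miss encoded or transformed secrets, so treat it as a sketch of the idea and not a complete defense.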

Put together, the direction is simple:

The operating model should be configuration.

When that happens, humans can review it. CI can repeat it. Emulators can validate it. Agents can reason about it. Skills can declare how to get real tasks done.

That is the part I am most excited about.

Because AI for infrastructure could be so much better than "write some code and hope the scripts work."

It can be a real collaborator once the system gives it something coherent to collaborate with.

And that is the shift I think platform teams should be looking for. Not more glue. Not another CI-only ritual. A runtime for the lifecycle that no one ever told us how to build.

That runtime is Atmos.

If this sounds familiar, take a look at where your operating model lives today. Is it declared in one place your team can inspect, test, and teach to agents? Or is it scattered across workflows, scripts, READMEs, and memory?

If you want to see the direction we are taking, start with Atmos or run through the quick start.

Erik Osterman, Founder & CEO, Cloud Posse
