Why Companies Fail at Terraform

Erik Osterman | Cloud Architecture & Platforms, DevOps

Managing cloud infrastructure has become an increasingly complex task, especially at scale. To keep up with the demands of modern software development, companies need the organizational and operational leverage that infrastructure as code (IaC) provides. Terraform is a popular choice for implementing IaC due to its broad support and great ecosystem. However, despite its popularity, Terraform is not a silver-bullet solution, and there are many ways to get it wrong. In fact, even the most experienced practitioners can fall victim to nuanced and subtle failures in design and implementation that can lead to technical debt and insecure, sprawling cloud infrastructure. 

This article explores some of the common mistakes and pitfalls that companies face when attempting to implement Terraform for infrastructure as code. We won't focus on fixing a failed plan or when a “terraform apply” goes sideways but rather on the underlying design and implementation issues that can cause small problems to snowball into major challenges.

Design Failures

The initial choices made when designing a Terraform-based cloud infrastructure implementation will be some of the most consequential; their impact will ripple throughout the project's lifecycle. It is beyond critical that engineering organizations make the right choices during this phase. Unfortunately, poor design decisions are common. There is a persistent tendency to focus most of the design effort on the application itself while the infrastructure configuration is neglected. The resulting technical debt and cultural missteps can grind technical objectives to a halt.

“Not Invented Here” Syndrome

Otherwise known as “re-inventing the wheel”: engineering teams often feel compelled to design their own configuration from scratch, regardless of complexity or time commitment, because they “don't trust” third-party code, or it doesn't precisely match their specifications. When engineers insist on building their own solutions, they may duplicate work already done elsewhere, wasting time, effort, and resources. Re-inventing the wheel may also be indicative of underlying cultural issues; isolation from external innovations or ideas, as well as resistance to change, will significantly hamper the effectiveness of DevOps initiatives.

Not Designing for Scalability or Modularity

Too often, infrastructure implementations aren't designed to take full advantage of one of Terraform's most powerful features: modularity. Designing application infrastructure for scalability is a must-have in the modern tech ecosystem, and Terraform provides the tools to get it done.

What often ends up happening in these scenarios is the following:

  1. Design phase finishes; time to build!
  2. There is a significant push and effort to get a working application stack shipped.
  3. Some Terraform configuration is hastily written in the root of the application repository.
  4. Version 1 is complete! The compute nodes and databases are deployed with Terraform. Success!

Unfortunately, step 5 never comes, and the entire process is repeated when it comes time to deploy a new application. If the existing application stack needs an updated infrastructure configuration, such as a new region, it's just added to the same root module. Over time, this grows and grows into an unmanageable mess. What about deploying to different environments like QA or Staging? What about disaster recovery? Without the use of modules, engineers are forced to duplicate efforts and code, violating the principle of “Don't Repeat Yourself” and creating a huge pile of tech debt.
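As a minimal sketch of the alternative (the module path, input names, and values here are hypothetical), a reusable module lets each environment instantiate the same configuration with different parameters instead of copy-pasting resources:

module "app_staging" {
  source      = "./modules/app-stack"
  environment = "staging"
  node_count  = 2
}

module "app_prod" {
  source      = "./modules/app-stack"
  environment = "prod"
  node_count  = 6
}

With this pattern, a new environment or region becomes a few lines of inputs rather than another copy of the root module.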

No Design Collaboration with Developers

One of the original goals of DevOps was to foster closer collaboration between development and operations staff. Sadly, modern development environments still maintain unnecessary silos between these two groups. Nearly fully formed application designs are kicked over to DevOps and Cloud engineers to implement, without any collaboration on the design and its potential issues: the very antithesis of DevOps culture.

A key indicator of success in modern software development is the number of successful deployments over a given period of time; deployment velocity is crucial to achieving business goals. If design choices and feedback yo-yo between disparate groups and engineering teams are at odds over technical ownership and agency, then deployment velocity will suffer.

Unrealistic Deadlines or Goals

Design failures and breakdowns also carry a human cost: engineers end up facing unrealistic goals and are expected to meet unreasonable deadlines for project implementation.

Because the design was done in a vacuum, feature delivery estimates don't account for the effort of implementing the infrastructure. Looming deadlines inevitably lead to shortcuts, which lead to tech debt, which leads to security issues. Engineers classically underestimate time and effort, along with the likelihood of getting it right the first time. The planning was done without considering infrastructure goals or security outcomes, assuming a “perfect world” without interruptions. Then the real world sets in, and the inevitable firefighting kills many projects before they ever get off the ground.

Implementation Failures

A well-thought-out design is crucial, but ensuring that the implementation phase is executed effectively is equally essential. If good design is not followed by proper implementation, engineering teams may find themselves dealing with a new set of challenges. Design failures, if caught early, are easier to rectify. Implementation failures tend to be far more costly and time-consuming. 

Not Enforcing Standards and Conventions

Terraform codebases that have never had naming standards or usage conventions enforced become a mess, and it can be tough to walk that back once the problem has set in at scale.

Common breakdowns with standards and conventions include:

  • No consistent way of naming deployed resources (prod-db, db-PROD). 
  • Inability to deploy resources multiple times because they are not adequately disambiguated with input parameters. 
  • Hardcoded settings that should be parameterized. 
  • One application stack separates folders by the environment; another might use one root module with multiple workspaces. Others stick all environments in one root module with a single workspace with three dozen parameters.
  • Sometimes IAM policies are defined with the aws_iam_policy_document data source, sometimes as a HEREDOC string, and other times inline via aws_iam_role_policy.
  • One configuration organizes resource types by files; another organizes application components by files.

Here’s an example of an inconsistency that’s very hard to walk back from: the choice of hyphens vs. underscores in resource names, with a random mishmash of snake_case, camelCase, and PascalCase:

resource "aws_instance" "frontend_web_server" {

...

}

vs.

resource "aws_instance" "frontend-webServer" {

...

}

(and by the way, why are these web servers static EC2 instances anyways!)

While both resource declarations will correctly create an EC2 instance, even if their configuration is identical, they'll have different resource notations and references in the state file and Terraform backend. This may seem innocuous, but it becomes a much bigger problem in larger codebases with lots of interpolated values and outputs.
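Every downstream reference must match the declared name exactly, so the mishmash propagates. A hypothetical pair of outputs shows how the inconsistency leaks into everything that consumes these resources:

output "web_ip" {
  value = aws_instance.frontend_web_server.private_ip
}

output "web-ip" {
  value = aws_instance.frontend-webServer.private_ip
}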

These little inconsistencies can add up to untenable tech debt over time that makes understanding the infrastructure complex for newer engineers and can engender a sense of fear or reluctance towards change.

Allowing Code and Resources to Sprawl

As mentioned in a previous section, Terraform provides first-class constructs to enable modularity and the DRY principle: modules and workspaces. Terraform deployments full of repeated code are hard to test and work with.

When greeted with a sprawling, unmanaged tangle of Terraform resources, engineers often follow their first impulse: start from scratch with a greenfield configuration. Unfortunately, this approach often exacerbates the existing problem. Resource sprawl represents a two-fold problem for most software organizations: it leads to increased costs and a weakened security posture. You can't secure something if you don't know it exists in the first place. Using hardened Terraform modules enables the creation of opinionated infrastructure configurations that can be reused for different applications, reducing duplicated effort and improving security. Workspaces can be used to extend this pattern further, isolating environment-specific resources into their own contexts. Terraform also has a robust ecosystem of third-party modules; it's possible your team can have a nearly complete implementation just by grabbing something off the shelf.
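As a sketch of that pattern (the bucket naming scheme and version pin are assumptions; the module shown is the community terraform-aws-modules S3 module from the public registry), a hardened module combined with the workspace name keeps environments isolated without duplicating code:

module "logs_bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "~> 4.0"

  bucket = "acme-${terraform.workspace}-logs" # e.g., acme-staging-logs
}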

Not Leveraging CI/CD Automation

Continuous Integration/Continuous Delivery (CI/CD) pipelines are the foundational pillar upon which modern software delivery automation is built. They enable software to be checked in, tested, integrated, and deployed faster and with fewer errors. Terraform code should be treated like application code: checked into version control, linted, tested, and deployed automatically.

Deploying Terraform from a single workstation is usually the first usage pattern developers and engineers employ when learning Terraform. However, single-workstation deploys are not suitable for large-scale application environments with production deployments. Terraform provides several features meant for multi-user environments, including automation-friendly command switches, remote state management, and state locking. Laptop deploys don't scale in team environments, represent a single point of failure (SPOF) in infrastructure engineering, and offer no audit trail.
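A minimal sketch of shared state with locking (the bucket and table names are hypothetical): every engineer and CI job reads the same state, and the DynamoDB table prevents concurrent applies from clobbering each other.

terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "app-stack/terraform.tfstate"
    region         = "us-east-2"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

In the pipeline itself, switches like -input=false and -no-color make plan and apply runs non-interactive and log-friendly.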

Not Using Policy Engines and Testing to Enforce Security

Policy engines like OPA (Open Policy Agent) and Sentinel can enforce security standards, such as preventing public S3 bucket configurations or roles with overly broad permissions. However, enforcing policy at scale depends on automation. IAM solves the problem of how systems talk to each other while limiting access to only what's needed, but it isn't sufficient on its own to govern what gets deployed. Organizations that don't implement policy engines to check infrastructure configuration are often left blind to insecure or non-standard infrastructure being deployed; they depend instead on manual, time-consuming, and error-prone review. Policies provide the necessary guardrails to enable teams to become autonomous.
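OPA policies are written in Rego and Sentinel policies in its own DSL, so they're out of scope for an HCL example; as a lightweight, Terraform-native complement to a real policy engine, a variable validation can encode the same guardrail idea (the variable name and message are hypothetical):

variable "bucket_acl" {
  type    = string
  default = "private"

  validation {
    condition     = !contains(["public-read", "public-read-write"], var.bucket_acl)
    error_message = "Public bucket ACLs are not allowed."
  }
}

Unlike a policy engine, this only guards inputs the module author thought to validate; engine-enforced policies apply across every plan in the organization.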

Operational Failures

Operational failures can significantly hinder the success of Terraform-based cloud infrastructure projects. These kinds of failures are common not just in infrastructure but in all types of software engineering and application development: organizations tend to neglect soft, non-development initiatives like consistent documentation and maintenance of existing systems.

Not Building a Knowledge Foundation

Engineering organizations often don't devote the time or resources to building a solid knowledge foundation. Documentation, architecture drawings, architectural decision records (ADRs), code pairing, and training are all needed to help ensure continuity and project health.

The consequences of not making this critical investment won't necessarily manifest right away; the first-generation engineers to work on the project will have the most relevant context and freshest memory when confronting issues. However, what happens when staff inevitably turns over? Without the context behind important implementation decisions and a written record of implicit, accumulated knowledge, project continuity and the overall sustainability of the infrastructure will suffer. Documentation also provides valuable information for stakeholders and non-team members, allowing them to onboard to a shared understanding without taking up the valuable time of project engineers.

Re-inventing the Wheel: Part 2

Without a solid knowledge foundation, new engineers and project maintainers will be tempted to start from scratch rather than take the time to understand and ultimately fix existing implementations. Starting a new implementation over an existing, unmaintained one leads to a “Now you have two problems” situation.

Facilitating knowledge transfer should be one of the primary goals of a knowledge base. Documentation helps transfer knowledge between team members, departments, or organizations. It provides a reference for future developers who may need to work on the system, reducing the risk of knowledge loss and ensuring the project remains maintainable and scalable.

Focusing on the New Without Paying Down Technical Debt

Once a project is shipped, it is quickly forgotten, and all design/development effort is focused on the next new project. No engineer wants to get stuck with the maintenance of old projects. Over time, this leads to the accumulation of tech debt. In the case of Terraform, which often sees several new releases per year with breaking changes, this means older infrastructure configurations become brittle and locked to out-of-date versions.

Organizations that want to fix this need to approach the problem in a cultural context; only rewarding new feature delivery is a sure way to end up with a lot of needless greenfield development work. Creating a technical culture that rewards craftsmanship and maintenance as much as building new projects will lead to better, more secure applications and a better overall engineering culture.

Getting Terraform Right Means Leveraging the Right Resources

Avoiding the failures described in this article isn't strictly a technical exercise; fostering a good engineering culture is essential. This involves making pragmatic and forward-looking design choices that can adapt to future requirements. Additionally, embracing Terraform's robust ecosystem of existing resources while avoiding the push to build everything internally can significantly streamline processes and prevent re-inventing the wheel. Organizations that commit to these objectives will reap the long-term rewards of high-performing infrastructure and empowered engineering teams.

How to Pick Your Primary AWS Region?

Erik Osterman | Cloud Architecture & Platforms, DevOps

While your company might operate in multiple regions, one region should typically be selected as the primary region. Certain resources will not be geographically distributed, and these should be provisioned in this default region.

When building out your AWS infrastructure from scratch, it's a good time to revisit decisions that might have been made years ago. Newer AWS regions might be better suited to the business.

Customer Proximity

One good option is picking a default region that is closest to where the majority of end-users reside.

Business Headquarters

Frequently, we see companies select the default region closest to where the majority of business operations take place. This is especially true if most of the services in the default region will be consumed by the business itself.

Stability

When operating on AWS, selecting a region other than us-east-1 is advisable, as it is (or at least used to be) the default region for most AWS users. It has historically had the most service interruptions, presumably because it is one of the most heavily used regions and operates at a scale much larger than other AWS regions. Therefore, we advise using us-east-2 over us-east-1; the latency between these two regions is minimal.
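As a hedged sketch of what this looks like in Terraform (the alias name is an assumption), us-east-2 serves as the primary region, with a us-east-1 provider alias kept around for the handful of services pinned to that region, such as ACM certificates used by CloudFront:

provider "aws" {
  region = "us-east-2" # primary region
}

provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1" # for global services that require us-east-1
}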

High Availability / Availability Zones

Not all AWS regions support the same number of availability zones. Many regions offer only two availability zones, while a minimum of three is recommended when operating Kubernetes to avoid “split-brain” problems.
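A quick, hypothetical way to check a candidate region is to point the AWS provider at it and count its zones:

data "aws_availability_zones" "available" {
  state = "available"
}

output "az_count" {
  value = length(data.aws_availability_zones.available.names)
}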

Cost

Not all regions cost the same to operate. On the other hand, if you have significant resources deployed in an existing region, migrating to a new region could be cost-prohibitive; data transfer costs are not cheap, and petabyte-scale S3 buckets would be costly to migrate.

Service Availability

Not all regions offer the full suite of AWS services or receive new services at the same rate as others. The newest regions frequently lack many of the newer services. Other times, certain regions receive platform infrastructure updates slower than others. Also, AWS now offers Local Zones (e.g. us-west-2-lax-1a) which operate a subset of AWS services.

Instance Types

Not all instance types are available in all regions.

Latency

The latency between infrastructure across regions could be a factor. See cloudping.co/grid for more information.

Should You Run Stateful Systems via Container Orchestration?

Erik Osterman | Cloud Architecture & Platforms, DevOps

Recently, it was brought up that ThoughtWorks now says:

We recommend caution in managing stateful systems via container orchestration platforms such as Kubernetes. Some databases are not built with native support for orchestration — they don’t expect a scheduler to kill and relocate them to a different host. Building a highly available service on top of such databases is not trivial, and we still recommend running them on bare metal hosts or a virtual machine (VM) rather than to force-fit them into a container orchestration platform

https://www.thoughtworks.com/radar/techniques/managing-stateful-systems-via-container-orchestration

This is just more FUD that deserves to be cleared up. First, not all container management platforms are the same; I can only speak from experience about what this means for Kubernetes. When used properly, Kubernetes is ideally suited to run these kinds of workloads.

NOTE: Just so we're clear: our recommendation for production-grade infrastructure is to always use a fully-managed service like RDS, Kinesis, MSK, or ElastiCache rather than self-hosting, whether on Kubernetes or bare-metal/VMs. Of course, that only works if these services meet your requirements.

To set the record straight, Kubernetes won't randomly kill Pods and relocate them to a different host if configured correctly. First, by setting requested resources equal to the limits, the pods get the Guaranteed QoS (Quality of Service) class: the highest priority, and the last to be evicted under node pressure. Then, by setting a PodDisruptionBudget, we can be very explicit about what sort of “SLA” we want on our pods.
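Here's a minimal sketch of the PodDisruptionBudget side, expressed in Terraform via the hashicorp/kubernetes provider (the name, namespace, and labels are hypothetical): it insists that at least two replicas survive any voluntary disruption.

resource "kubernetes_pod_disruption_budget_v1" "db" {
  metadata {
    name      = "postgres-pdb"
    namespace = "databases"
  }

  spec {
    min_available = 2

    selector {
      match_labels = {
        app = "postgres"
      }
    }
  }
}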

The other recommendation is to use the appropriate workload controller for the Pods. For databases, it's typically recommended to use StatefulSets (formerly called PetSets, for a good reason!). With StatefulSets, we get the same kinds of lifecycle semantics we're used to when working with discrete VMs: stable network identities, and the assurance that there won't ever be 2 concurrent pods (“Pets”) with the same name. We've experienced first hand how some applications like Kafka hate it when their network identity changes. StatefulSets solve that.
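As a hypothetical sketch (image tag, sizes, and names are assumptions), here's what such a StatefulSet looks like, again via the hashicorp/kubernetes provider; note that requests equal limits, which yields the Guaranteed QoS class described above:

resource "kubernetes_stateful_set_v1" "kafka" {
  metadata {
    name      = "kafka"
    namespace = "databases"
  }

  spec {
    service_name = "kafka-headless" # stable per-pod DNS names
    replicas     = 3

    selector {
      match_labels = { app = "kafka" }
    }

    template {
      metadata {
        labels = { app = "kafka" }
      }

      spec {
        container {
          name  = "kafka"
          image = "apache/kafka:3.7.0" # hypothetical tag

          resources {
            requests = { cpu = "2", memory = "8Gi" }
            limits   = { cpu = "2", memory = "8Gi" } # requests == limits => Guaranteed QoS
          }
        }
      }
    }
  }
}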

If StatefulSets are not enough of a guarantee, we can provision dedicated node pools. These node pools can even run on bare metal to assuage even the staunchest critics of virtualization. Using taints and tolerations, we can ensure that the databases run exactly where we want them; there's no risk that a “spot instance” will randomly nuke the pod. Then, using affinity rules, we can ensure that the Kubernetes scheduler spreads the workloads across different physical nodes as best as possible.
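A hypothetical sketch of the taint side (the node name and label values are assumptions; in practice the taint is usually applied by the node pool itself, e.g., a managed node group's taint setting, rather than per node). The matching toleration and node selector would then be added to the StatefulSet's pod template:

resource "kubernetes_node_taint" "dedicated_db" {
  metadata {
    name = "ip-10-0-1-23.us-east-2.compute.internal" # hypothetical node
  }

  taint {
    key    = "dedicated"
    value  = "database"
    effect = "NoSchedule"
  }
}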

Lastly, Kubernetes, above all else, is a framework for consistent cloud operations. It exposes all the primitives developers need to codify the business logic required to operate even the most complex business applications. Contrast this with ThoughtWorks' recommendation of running applications on bare metal hosts or a virtual machine (VM) rather than “force-fitting” them into a container orchestration platform: when you roll your own, almost no organization possesses the in-house skill set to orchestrate and automate such a system effectively. In fact, this kind of skill set used to be possessed only by technology companies like Google and Netflix. Kubernetes has leveled the playing field.

Using Kubernetes Operators, the business logic of how to operate a highly available legacy application or cloud-native application can be captured and codified. There's an ever-growing list of operators, and companies have popped up whose whole business model is building robust operators to manage databases in Kubernetes. Because this business logic is captured in code, it can be iterated on and improved. As companies encounter new edge cases, the operator can address them so that everyone benefits. Contrast that with the traditional “snowflake” approach, where every company implements its own kind of Rube Goldberg apparatus: hard lessons learned are not shared, and we're back in the dark ages of cloud computing.

As with any tool, it's the operator's responsibility to know how to use it. There are a lot of ways to blow one's leg off with Kubernetes, but when used the right way, it will unlock the superpowers your organization needs.

Rock Solid WordPress

Erik Osterman | Cloud Architecture & Platforms

Learn how Cloud Posse recently architected and implemented WordPress for massive scale on Amazon EC2. We'll show you exactly which tools we used and our recipe for securing and powering WordPress setups on AWS using Elastic Beanstalk, EFS, CodePipeline, Memcached, Aurora, and Varnish.

Managing Secrets in Production

Erik Osterman | Cloud Architecture & Platforms

A secret is any sensitive piece of information (like a password, API token, or TLS private key) that must be kept safe. This presentation is a practical guide covering what we've done at Cloud Posse to lock down secrets in production. It includes our approach to avoiding the same pitfalls that ShapeShift encountered when they were hacked. The techniques presented are compatible with automated cloud environments and even legacy systems.