Why Companies Fail at Terraform

Erik Osterman | Cloud Architecture & Platforms, DevOps

Managing cloud infrastructure has become an increasingly complex task, especially at scale. To keep up with the demands of modern software development, companies need the organizational and operational leverage that infrastructure as code (IaC) provides. Terraform is a popular choice for implementing IaC due to its broad support and great ecosystem. However, despite its popularity, Terraform is not a silver-bullet solution, and there are many ways to get it wrong. In fact, even the most experienced practitioners can fall victim to nuanced and subtle failures in design and implementation that can lead to technical debt and insecure, sprawling cloud infrastructure. 

This article explores some of the common mistakes and pitfalls that companies face when attempting to implement Terraform for infrastructure as code. We won't focus on fixing a failed plan or when a “terraform apply” goes sideways but rather on the underlying design and implementation issues that can cause small problems to snowball into major challenges.

Design Failures

The initial choices made when designing a Terraform-based cloud infrastructure implementation are some of the most consequential; their impact ripples throughout the project's lifecycle. It is critical that engineering organizations make the right choices during this phase. Unfortunately, poor design decisions are common: there is a persistent tendency to focus most of the design effort on the application itself while the infrastructure configuration is neglected. The resulting technical debt and cultural missteps can grind technical objectives to a halt.

“Not Invented Here” Syndrome

Otherwise known as “re-inventing the wheel”: engineering teams often feel compelled to design their own configuration from scratch, regardless of complexity or time commitment, because they “don't trust” third-party code or because it doesn't precisely match their specifications. When engineers insist on building their own solutions, they duplicate work already done elsewhere, wasting time, effort, and resources. Re-inventing the wheel can also indicate underlying cultural issues; isolation from external innovations and ideas, as well as resistance to change, will significantly hamper the effectiveness of DevOps initiatives.

Not Designing for Scalability or Modularity

Too often, infrastructure implementations aren't designed to take full advantage of one of Terraform's most powerful features: modularity. Designing application infrastructure for scalability is a must-have in the modern tech ecosystem, and Terraform provides the tools to get it done.

What often ends up happening in these scenarios is the following:

  1. Design phase finishes; time to build!
  2. There is a significant push and effort to get a working application stack shipped.
  3. Some Terraform configuration is hastily written in the root of the application repository.
  4. Version 1 is complete! The compute nodes and databases are deployed with Terraform. Success!

Unfortunately, step 5 never comes, and the entire process is repeated when it comes time to deploy a new application. If the existing application stack needs an updated infrastructure configuration, such as a new region, it's just added to the same root module. Over time, this grows and grows into an unmanageable mess. What about deploying to different environments like QA or Staging? What about disaster recovery? Without the use of modules, engineers are forced to duplicate efforts and code, violating the principle of “Don't Repeat Yourself” and creating a huge pile of tech debt.
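
For example, here is a minimal sketch (the module path and its inputs are hypothetical) of what the missing step 5 looks like: extract the stack into a reusable module so each environment's root module shrinks to a handful of inputs.

module "app" {
  source = "../../modules/app-stack" # hypothetical shared module

  environment    = "staging"
  region         = "us-east-2"
  instance_count = 2
}

# Production reuses the same module with different inputs instead of
# duplicating the resources in yet another root module.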

No Design Collaboration with Developers

One of the original goals of DevOps was to foster closer collaboration between development and operations staff. Sadly, modern development environments still maintain unnecessary silos between these two groups. Nearly fully formed application designs are handed off to DevOps and cloud engineers to implement, without any collaboration on the design or its potential issues; it is the antithesis of DevOps culture.

A key indicator of success in modern software development is the number of successful deployments over a given period of time; deployment velocity is crucial to achieving business goals. If design choices and feedback yo-yo between disparate groups and engineering teams are at odds over technical ownership and agency, then deployment velocity will suffer.

Unrealistic Deadlines or Goals

Design failures and breakdowns lead to a variety of issues, and much of the resulting burden lands on engineers, who are then expected to meet unrealistic goals and unreasonable deadlines for project implementation.

Feature delivery doesn't account for the effort of implementing the infrastructure because the design was done in a vacuum. Looming deadlines inevitably lead to shortcuts, which lead to tech debt, which leads to security issues. Engineers classically underestimate the time and effort required and overestimate the likelihood of getting it right the first time. The planning was done without considering infrastructure goals or security outcomes, assuming a “perfect world” without interruptions. Then the “real world” sets in, and the inevitable firefighting kills a lot of projects before they ever get off the ground.

Implementation Failures

A well-thought-out design is crucial, but ensuring that the implementation phase is executed effectively is equally essential. If good design is not followed by proper implementation, engineering teams may find themselves dealing with a new set of challenges. Design failures, if caught early, are easier to rectify. Implementation failures tend to be far more costly and time-consuming. 

Not Enforcing Standards and Conventions

Terraform codebases that have never had naming standards or usage conventions enforced are a mess, and it can be tough to walk back from this one once it has set in at scale.

Common breakdowns with standards and conventions include:

  • No consistent way of naming deployed resources (prod-db, db-PROD). 
  • Inability to deploy resources multiple times because they are not adequately disambiguated with input parameters. 
  • Hardcoded settings that should be parameterized. 
  • One application stack separates environments into folders; another might use one root module with multiple workspaces. Others stick all environments in a single root module and workspace with three dozen parameters.
  • Sometimes an IAM policy is built with a data source, sometimes with a HEREDOC, and other times inline via aws_iam_role_policy.
  • One configuration organizes resource types by files; another organizes application components by files.

Here’s an example of an inconsistency that’s very hard to walk back from: hyphens vs. underscores in resource names, with a random mishmash of snake_case, camelCase, and kebab-case:

resource "aws_instance" "frontend_web_server" {

...

}

vs.

resource "aws_instance" "frontend-webServer" {

...

}

(And by the way, why are these web servers static EC2 instances anyway?)

While both resource declarations will correctly create an EC2 instance, even if their configuration is identical, they'll have different resource notations and references in the state file and Terraform backend. This may seem innocuous, but it becomes a much bigger problem in larger codebases with lots of interpolated values and outputs.
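
A small illustration of how the inconsistency leaks outward: every interpolation and output has to match whichever style that particular stack happened to use.

# References must match the declared resource name exactly:
output "web_instance_id" {
  value = aws_instance.frontend_web_server.id
}

# ...while the other stack has to be referenced as aws_instance.frontend-webServer.id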

These little inconsistencies can add up to untenable tech debt over time that makes understanding the infrastructure complex for newer engineers and can engender a sense of fear or reluctance towards change.

Allowing Code and Resources to Sprawl

As mentioned in a previous section, Terraform provides a variety of conventions to enable modularity and the DRY principle: modules and workspaces. Terraform deployments with repeated code are hard to test and work with.

When greeted with a sprawling, unmanaged tangle of Terraform resources, engineers often follow their first impulse: start from scratch with a greenfield configuration. Unfortunately, this approach often exacerbates the existing problem. Resource sprawl represents a two-fold problem for most software organizations: it leads to increased costs and a weakened security posture. You can't secure something if you don't know it exists in the first place.

Using hardened Terraform modules enables the creation of opinionated infrastructure configurations that can be reused for different applications, reducing duplicated effort and improving security. Workspaces can be used to extend this pattern further, isolating environment-specific resources into their own contexts. Terraform also has a robust ecosystem of third-party modules; it's possible your team can have a nearly complete implementation just by grabbing something off the shelf.
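
For example, a hedged sketch using a popular community module (pin versions according to your own policy and review any module before adopting it):

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "app-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-2a", "us-east-2b", "us-east-2c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}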

Not leveraging CI/CD automation

Continuous Integration/Continuous Delivery (CI/CD) pipelines are the foundational pillar upon which modern software delivery automation is built. They enable software to be checked in, tested, integrated, and deployed faster and without error. Terraform code should be treated like application code: checked into version control, linted, tested, and deployed automatically.

Deploying Terraform from a single workstation is usually the first usage pattern developers and engineers learn. However, single-workstation deploys are not suitable for large-scale environments with production deployments. Terraform provides several features meant for multi-user environments, including automation-friendly command switches, remote state management, and state locking. Laptop deploys don’t scale in team environments, represent a single point of failure (SPOF) in infrastructure engineering, and offer no audit trail.
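
A minimal sketch of the first step away from laptop deploys (bucket and table names are placeholders): remote state in S3 with DynamoDB locking, so a CI/CD pipeline and every engineer share one source of truth.

terraform {
  backend "s3" {
    bucket         = "example-terraform-state"  # placeholder bucket name
    key            = "app/prod/terraform.tfstate"
    region         = "us-east-2"
    dynamodb_table = "terraform-state-lock"     # placeholder table; enables state locking
    encrypt        = true
  }
}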

Not using policy engines and testing to enforce security

Policy engines like OPA and Sentinel can enforce security standards, such as preventing public S3 bucket configurations or roles with overly broad permissions, but they depend on automation to be implemented properly at scale. IAM solves the problem of how systems talk to each other while limiting access to only what's needed, but it isn’t sufficient on its own. Organizations that don't use policy engines to check infrastructure configuration are often left blind to insecure or non-standard infrastructure being deployed, relying instead on manual, time-consuming, and error-prone review. Policies provide the guardrails necessary for teams to become autonomous.
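
As a hedged illustration of what such a guardrail should catch (the bucket name is a placeholder), a policy evaluating the plan with OPA/Conftest or Sentinel would reject a configuration like this before it is ever applied:

resource "aws_s3_bucket" "reports" {
  bucket = "example-reports-bucket" # placeholder name
}

# A public ACL is exactly the kind of non-compliant change a policy engine
# should block automatically rather than leave to manual review.
resource "aws_s3_bucket_acl" "reports" {
  bucket = aws_s3_bucket.reports.id
  acl    = "public-read"
}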

Operational Failures

Operational failures can significantly hinder the success of Terraform-based cloud infrastructure projects. These failures are common not just in infrastructure but in all types of software engineering and application development: organizations tend to neglect soft, non-development initiatives like consistent documentation and maintenance of existing systems.

Not Building a Knowledge Foundation

Engineering organizations often don't devote the time or resources to building a solid knowledge foundation. Documentation, architecture drawings, architectural decision records (ADRs), code pairing, and training are all needed to help ensure continuity and project health.

The consequences of not making this critical investment won't necessarily manifest right away; the first-generation engineers to work on the project will have the most relevant context and freshest memory when confronting issues. However, what happens when staff inevitably turns over? Without the context behind important implementation decisions and a written record of implicit, accumulated knowledge, project continuity and the overall sustainability of the infrastructure will suffer. Documentation also provides valuable information for stakeholders and non-team members, allowing them to onboard to a shared understanding without taking up the valuable time of project engineers.

Re-inventing the Wheel: Part 2

Without a solid knowledge foundation, new engineers and project maintainers will be tempted to start from scratch rather than take the time to understand and ultimately fix existing implementations. Starting a new implementation over an existing, unmaintained one leads to a “Now you have two problems” situation.

Facilitating knowledge transfer should be one of the primary goals of a knowledge base. Documentation helps transfer knowledge between team members, departments, or organizations. It provides a reference for future developers who may need to work on the system, reducing the risk of knowledge loss and ensuring the project remains maintainable and scalable.

Focusing on the New Without Paying Down Technical Debt

Once a project is shipped, it is quickly forgotten, and all design/development effort is focused on the next new project. No engineer wants to get stuck with the maintenance of old projects. Over time, this leads to the accumulation of tech debt. In the case of Terraform, which often sees several new releases per year with breaking changes, this means older infrastructure configurations become brittle and locked to out-of-date versions.

Organizations that want to fix this need to approach the problem in a cultural context; only rewarding new feature delivery is a sure way to end up with a lot of needless greenfield development work. Creating a technical culture that rewards craftsmanship and maintenance as much as building new projects will lead to better, more secure applications and a better overall engineering culture.

Getting Terraform Right Means Leveraging the Right Resources

Avoiding the failures described in this article isn't strictly a technical exercise; fostering a good engineering culture is essential. This involves making pragmatic and forward-looking design choices that can adapt to future requirements. Additionally, embracing Terraform's robust ecosystem of existing resources while avoiding the push to build everything internally can significantly streamline processes and prevent re-inventing the wheel. Organizations that commit to these objectives will reap the long-term rewards of high-performing infrastructure and empowered engineering teams.

What the heck is a DevOps Accelerator?

Erik Osterman | DevOps

Cloud Posse is the foremost DevOps Accelerator, specializing in everything from venture-backed tech startups to F100 enterprises. This might sound like marketing hype to some, but it's our mission—it means so much more to us.

A “DevOps Accelerator” is a subcategory of Professional Services. It means that the provider has a proven, systematic, and repeatable process to help companies achieve a certain level of organizational “DevOps” maturity. The right accelerator for your business will have a similar ideology for how you would like to implement and run the technical operations of an organization. It will have demonstrated its ability to do this successfully for other businesses and will have tons of pre-existing materials that save you the cost of building it all from scratch. Some accelerators may require ongoing licenses and subscriptions, while others just give it away for free so that they can evolve at a faster rate by leveraging their community.

Working with an accelerator means you benefit from all their continuous learning across the various market segments they operate in. It means you benefit from their 20/20 hindsight—as your 20/20 foresight; you avoid all the common pitfalls associated with going it alone and instead get it right the first time and on time.

The Process

What does such a process look like? It starts with having a pre-existing project plan (e.g. in Jira) that takes a business through the journey of owning its infrastructure and includes the systems that go along with that, from beginning to end. It's much more than knowing how to write Infrastructure as Code (e.g. Terraform). It's knowing all the Design Decisions that go along with it so you can build a library of Architectural Decision Records [1]. It's knowing the proper order of operations to execute it. And it's getting all the pieces to fit together across the various areas of your organization. It involves laying a solid foundation, building a platform on top of it to consistently run your services, creating a process to deliver your software consistently and reliably to that platform, and ensuring you have the observability and operational experience to drive it.

For some companies, it may also mean achieving various security benchmarks for compliance required to operate in their industry. Once this is all said and done, there needs to be a process to keep everything current. There must also be materials for training, documentation, and knowledge transfer across the organization so that it sticks and doesn't become another failed initiative.

Our approach follows the “4-Layers of Infrastructure” model, and we always start at the bottom and build our way up to the top.

Lay a Solid Foundation

For example, for us, that means starting with your Foundational Infrastructure, which is how you manage your AWS Organization, divide it up into different Organizational Units, and then segment workloads across accounts, VPCs, and subnets. A solid foundation is required for building anything, just like in the physical world. Without it, you end up with something like the Millennium Tower in San Francisco, which has a sinking problem [2]. It doesn't matter how awesome or beautiful the architecture is on top of the foundation if it is at the mercy of what's beneath it.

A few of the considerations that must be addressed at the very beginning are how to operate in multiple regions, naming conventions, CIDR allocations, and so forth. Knowing your requirements for security & compliance is essential so you get things right from the get-go, with minimal re-engineering.

Build out your Platform

When you build out the platform, you decide how you want to deliver your services consistently. There are countless ways to go about this, but whatever the choice, it has ramifications for how you do release engineering (e.g. CI/CD), the way applications are built, tested, and deployed, and how you achieve subsequent benchmarks for compliance. A good platform will have just the right level of abstraction necessary to standardize the delivery without obfuscating how everything works and limiting the business's ability to capitalize on new trends.

A common pitfall is that companies prematurely attempt to abstract too much of their platform before they have the real-world experience to form an educated opinion on it. The wrong platform can be just as detrimental to the business as the right one is beneficial. A smaller business should be wary of emulating the platforms used by massively successful companies like Netflix and Spotify, which have legions of engineers supporting them. Instead, it needs to focus on a nimble platform that isn't too trendy but isn't so limited that all business value is lost.

Working with a DevOps Accelerator will ensure you get the right amount of platform for the stage you are at today. While things like a Service Mesh are tremendously advantageous for one business, it can be the Achilles Heel of the next if they don't have the in-house expertise to leverage it.

Rollout Shared Services

Once the platform is in place, we're ready to deploy the shared services that are needed before we begin delivering applications on top of the platform. Businesses commonly require things like self-hosted CI/CD runners (e.g. for GitHub Actions), GitOps tooling like Atlantis or Spacelift, or observability tools like Datadog. These are part of the shared services layer because they extend the platform's capabilities.

Deploy your Applications

Modern engineering organizations are polyglots as a result of evolving with the times. They'll have something like a React frontend with a service-oriented backend consisting of Go microservices, legacy Rails apps, and maybe some high-performance Rust microservices. We can never assume that software development trends and the landscape will remain the same. That's why, when we work with customers to implement their release engineering patterns, we implement the reusable workflows and composable GitHub Actions that make build-test-and-deploy consistent and straightforward for every language, framework, and platform. We treat CI/CD as an interface for delivering your applications; therefore, it must be consistent and repeatable.

Achieve Benchmark Security & Compliance

Security and compliance are as much a technical need as a strategic business advantage. Frequently, our customers want to outmaneuver their competition by achieving various benchmarks of compliance so they can land bigger deals or use compliance as a major differentiator. Amazon understands this, which is why it offers an entire suite of security-oriented products that make achieving compliance easier. We've implemented native support for this in Terraform as part of our Security and Compliance offering, which is available as part of our Reference Architecture for AWS.

Alternatives

Several other approaches are commonly taken when building out the infrastructure and platform for a business. These approaches may sound familiar, as may the problems associated with them. If you've tried any of them and failed, maybe the DevOps Accelerator route is for you.

Build it all In-House

By and large, most companies build their infrastructure using entirely in-house resources. They either pull from their existing talent pool or hire a team to build it. The advantage is that the business may think it knows exactly what it needs, so it can start building immediately. It's a rewarding process for engineers because developers are inventive, natural creators who love to build new things.

To go this route, the business must allocate a budget, pull the resources aside to focus entirely on building out the next generation of its platform, and then construct the plan for getting there.

The risk is that the company hasn't done this before. While the team is smart and accomplished, it lacks an end-to-end plan for getting there and will likely underestimate the time and effort required to complete the project. In engineering, it's all too common to underestimate the level of effort required and the impact of distractions on timelines, and to overestimate the likelihood of succeeding on the first implementation.

Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.

Douglas Hofstadter 

More commonly, we see that the engineers get pulled back into firefighting mode. Now the business has two problems: maintaining the old infrastructure while attempting to innovate on the new infrastructure, all while exhausted from fighting fires.

Once you've built this shiny new infrastructure, it moves into the second stage: maintenance. This is a boring phase compared to building it. We see many companies hemorrhage employees once the infrastructure is built, because the fun is over; the company is left holding the bag, and the people who built it may no longer be there to maintain it.

On the other hand, working with a DevOps Accelerator means the plan is already in place and has been vetted and iterated on with each implementation the accelerator executes. This continual learning leads to a better outcome with less risk. The time estimates will be more accurate; the outcome will be guaranteed upfront. Everything is always better on the second iteration; now imagine a DevOps Accelerator that performs the same work a dozen or more times per year.

Hire an Independent Contractor

This relatively safe and proven path works well for smaller businesses whose infrastructure requirements are satisfied by the output of a sole individual. Independent Contractors are the best choice when you have a small project and know what you need. You can often negotiate better rates when working directly with a freelancer since there are no middlemen. Other times, companies hire contractors directly because the relationship is so successful.

It can be problematic, however, since as a business you legally have no control over when or how the Independent Contractor performs the work. They are an independent business and, under the law, need to operate like one and have multiple customers. Customer demands are unpredictable, project scopes can explode unexpectedly, and life's surprises can pull them away from your project. The best contractors will want to please all of their customers, but this can easily overwhelm them if demands exceed their capacity.

You must also be aware of the laws in your jurisdiction when working with Independent Contractors. Many states are beginning to tighten their laws around working with Independent Contractors; take California as an example.

California AB5 Law

California is very strict on this under the AB5 law [3]. Most companies' understanding of how AB5 works is outdated, as the law was re-interpreted in 2019 in Vazquez v. Jan-Pro Franchising International, Inc. [4], which held that Dynamex applies retroactively, meaning enforcement reaches back to the date the law was originally introduced. If Independent Contractors do not pass the ABC test, they may be considered employees under the law, and there's no amount of contract language that can circumvent or prevent it. All that matters is where the work is performed; if you are a California company or the Independent Contractor is based in California, the state will claim jurisdiction. The longer you work with the Independent Contractor, the greater the risk that they will be classified as an employee. The worst part is that there are almost no consequences for the Independent Contractor, so the hiring entity bears all the financial risk. There's almost no way for a business to verify that an Independent Contractor meets the “Business Services Provider” criteria of the AB5 law, and many Independent Contractors are unaware of how this law works. As of 2020, the California EDD resumed all payroll tax audits targeting the 2 million independent contractors in California [5].

Working with an established DevOps Accelerator eliminates these risks. The DevOps Accelerator is responsible for complying with all local employment laws. It handles all the staffing and business continuity issues and has the preexisting materials ready to perform the implementation. The hiring company can continue to focus on its core product rather than get distracted by implementing its next-generation infrastructure and platform, a non-core competency of the business. When the project completes, the hiring company can scale back the services from the accelerator, or scale them back up later if it needs more help. It offers the best of both forms of engagement.

Partner with a DevOps Professional Services Company

Our industry is rife with Professional Services Companies (thousands of them) that will work closely with you to implement a fully custom solution that meets your exact requirements. The first obstacle is deciding which one to go with when every one of them claims to be the industry leader. You might reach out to your AWS Account Representative for referrals, and they'll recommend AWS Partners that are a good fit for your stage. Working with a major professional services company might be the safest option—you'll avoid the risks of working with an independent contractor—but it's fraught with problems.

The first challenge is how to vet them and their solution. They'll almost always say they can't show you what it looks like because their work is confidential and protected by NDAs between them and their clients. That's fair, but it still puts you at the disadvantage of not knowing what you're buying before you commit. So instead, you'll need to rely on Case Studies, if they even have them, which paint a pretty picture of the outcome but leave out all the dirty details.

Throughout this process, you'll mostly talk with sales reps and account managers. If you're a technical organization, you'll be frustrated by the hyperbole, flashy pitch decks, and lack of technical detail in their explanations. You'll have little knowledge of who you'll actually be working with, and chances are they'll take a junior engineer and mark them up 2-4x. It's a profitable business for them, but you're ultimately left holding the bag: custom infrastructure built for you might sound great, but it's a hornet's nest to maintain in the long run. Of course, this is what they hope for, so that there'll be a steady stream of follow-up work. As a result, these sorts of projects cost way more than budgeted.

Pitfalls of Custom Software Development [6]

While the professional services company may have a library of case studies, that doesn't mean it has a repeatable process. On top of that, unless they bring pre-existing materials, they may be reimplementing everything for you, reinventing the wheel every time; each implementation is a snowflake that will stop evolving when the contract ends. Any continued learning by the professional services company will not be passed along to you.

Contrast that with working with a DevOps Accelerator. The accelerator will have a proven, documented process that they follow every time, ensuring consistent results without creating unmaintainable snowflakes. They can show you exactly what you will get before you begin.

You should expect to receive live demos as part of the presales process, with clear and concise answers on what exactly you will receive — less handwaving and no fancy PowerPoint presentations. You should meet directly with highly technical engineers and skip all the b.s. sales mumbo-jumbo that makes developers roll their eyes.

Their solution will not be conveniently gated behind confidentiality agreements. The best accelerators will leverage large amounts of Open Source as a licensing and distribution model, ensuring that you do not miss out on continued learnings and are not on the hook for expensive ongoing license fees. It's like buying and owning a Tesla: you continually receive free Over-the-Air (OTA) upgrades containing bug fixes and enhancements that increase the longevity of your investment.

Outsource It

Outsourcing is probably the most economical way of building out your infrastructure, but it's fraught with risk. In the risk/reward paradigm, you are rewarded for taking bigger risks when they pay off, but you need to know what you're doing. Do not outsource your cloud architecture and its implementation to the same partner unless you have the in-house expertise to validate everything delivered to you, one pull request at a time. Be very cautious if you don't know what you need, because you may be sold something that will simply need to be redesigned at the next stage of your growth. Make sure all the work is performed in repositories you control, with regular commits and demonstrated functionality. Don't sign off on anything until you've seen it in action. Also be advised that it's almost impossible to conduct meaningful background checks in many foreign countries, and contract terms are difficult to enforce; if you need to, it will be very costly and conducted in unfamiliar jurisdictions.

On the flip side, when you work with an onshore DevOps Accelerator, you buy a known quantity – a working solution that you can vet before starting. You have the peace of mind of knowing that if anything goes seriously wrong, you have recourse; the provider should guarantee their work and carry sufficient insurance to do the work they perform. The services provider will know how their solution scales as your business evolves and will continually invest in the solution, which benefits the buyer over the long term.

Use a Managed Services Provider (MSP)

Working with a Managed Services Provider makes sense if you're a nontechnical organization and there's no strategic need to understand how your infrastructure works. Your business has no competitive advantage in controlling all the toggles, and you have very few opinions on how it gets done or implemented. This is powerful when you can focus entirely on delivering your product and are not held up by the minutiae of infrastructure.

The problem is that it's like “outsourcing” what should be your competitive advantage. While it may make a lot of sense to outsource transactional areas of your business like HR and accounting, infrastructure relates directly to your product. When you adopt DevOps methods, you level the business up across multiple divisions, enabling you to be nimble and rapidly respond to market trends. That's why this method will fail in the long run, even with initial success. Remember, most “Case Studies” are conducted immediately after a project's implementation, the time at which it's most likely to be successful; they are not longitudinal studies of how the solution performs years later.

In the DevOps Accelerator model, you have the benefits associated with working together with an MSP without the risks of outsourcing your advantage. They will work directly with you to ensure you have everything you need to own and operate your infrastructure – including infrastructure as code, documentation, and processes for day-2 operations. They will remain engaged even after the initial scope of work is completed and provide you with the ongoing support you need until you have the operational excellence to do it all yourself. Because you own this business area, the methods become part of your processes. These will evolve and become the strategic advantage you need in a competitive market landscape, enabling you to phase shift and out-tack your competition.

Next Steps

If you're curious about what benefits may await you working with a DevOps Accelerator, we encourage you to reach out. You can expect a refreshingly different experience. No hungry sales reps. Just an honest review of your problems with an experienced, highly technical cloud architect who can demonstrate with live examples how we can address them using our proven process.


References

1. https://docs.aws.amazon.com/prescriptive-guidance/latest/architectural-decision-records/architectural-decision-records.pdf
2. https://en.wikipedia.org/wiki/Millennium_Tower_(San_Francisco)#Sinking_and_tilting_problem
3. https://www.dir.ca.gov/dlse/faq_independentcontractor.htm
4. Vazquez v. Jan-Pro Franchising International, Inc.
5. https://www.prnewswire.com/news-releases/ca-edd-confirms-it-has-resumed-tax-audits-relating-to-the-misclassification-of-10-million-contractors-301125947.html
6. https://the-innovator.club/index.php/2020/11/09/what-happens-when-what-you-deliver-is-not-what-the-customer-expected/

The Difficulty of Managing AWS Security Groups with Terraform

Jeremy | Security & Compliance, Announcements

Cloud Posse recently overhauled its Terraform module for managing security groups and rules. We rely on this module to provide a consistent interface for managing AWS security groups and associated security group rules across our Open Source Terraform modules.

This new module can be used very simply, but under the hood it is quite complex, because it attempts to handle numerous interrelationships, restrictions, and a few bugs. It offers a choice between zero service interruption for updates to a security group not referenced by other security groups (by replacing the security group with a new one) and brief service interruptions for security groups that must be preserved. Another enhancement: you can now provide the ID of an existing security group to modify, or, by default, the module will create a new security group and apply the given rules to it.
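
As a hedged sketch of basic usage (the module source, name, and vpc_id inputs are assumptions; the rule attributes mirror the aws_security_group_rule resource, as described later in this post):

module "web_sg" {
  source = "cloudposse/security-group/aws"
  # version = "x.x.x"  # pin to a released version

  name   = "web"      # assumed naming input
  vpc_id = var.vpc_id # assumed input

  # The defaults, shown explicitly: replace the security group on rule changes
  # rather than preserving its ID.
  create_before_destroy      = true
  preserve_security_group_id = false

  rules = [
    {
      key         = "https-from-anywhere"
      type        = "ingress"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    },
  ]
}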

Avoiding Service Interruptions

It is desirable to avoid having service interruptions when updating a security group. This is not always possible due to the way Terraform organizes its activities and the fact that AWS will reject an attempt to create a duplicate of an existing security group rule. There is also the issue that while most AWS resources can be associated with and disassociated from security groups at any time, there remain some that may not have their security group association changed, and an attempt to change their security group will cause Terraform to delete and recreate the resource.

The 2 Ways Security Group Changes Cause Service Interruptions

Changes to a security group can cause service interruptions in 2 ways:

  1. Changing rules may be implemented as deleting existing rules and creating new ones. During the period between deleting the old rules and creating the new rules, the security group will block traffic intended to be allowed by the new rules.
  2. Changing rules may be implemented as creating a new security group with the new rules and replacing the existing security group with the new one (then deleting the old one). This usually works with no service interruption when all resources referencing the security group are part of the same Terraform plan. However, if, for example, the security group ID is referenced in a security group rule in a security group that is not part of the same Terraform plan, then AWS will not allow the existing (referenced) security group to be deleted, and even if it did, Terraform would not know to update the rule to reference the new security group.

The key question you need to answer to decide which configuration to use is “Will anything break if the security group ID changes?” If not, then use the defaults create_before_destroy = true and preserve_security_group_id = false and do not worry about providing “keys” for security group rules. This is the default because it is the easiest and safest solution when the way the security group is being used allows it.

If things will break when the security group ID changes, then set preserve_security_group_id to true. Also read and follow the guidance below about keys and limiting Terraform security group rules to a single AWS security group rule if you want to mitigate against service interruptions caused by rule changes. Note that even in this case, you probably want to keep create_before_destroy = true because otherwise, if some change requires the security group to be replaced, Terraform will likely succeed in deleting all the security group rules but fail to delete the security group itself, leaving the associated resources completely inaccessible. At least with create_before_destroy = true, the new security group will be created and used where Terraform can make the changes, even though the old security group will still fail to be deleted.
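
In module terms, that configuration is a hedged sketch like the following (only the two inputs discussed here are shown):

module "db_sg" {
  source = "cloudposse/security-group/aws"
  # ... other inputs as above ...

  create_before_destroy      = true # still useful when replacement is unavoidable
  preserve_security_group_id = true # keep the security group ID stable for external references
}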

The 3 Ways to Mitigate Against Service Interruptions

Security Group create_before_destroy = true

The most important option is create_before_destroy which, when set to true (the default), ensures that a new replacement security group is created before an existing one is destroyed. This is particularly important because a security group cannot be destroyed while it is associated with a resource (e.g. a load balancer), but “destroy before create” behavior causes Terraform to try to destroy the security group before disassociating it from associated resources, so plans fail to apply with the error:

Error deleting security group: DependencyViolation: resource sg-XXX has a dependent object

With “create before destroy” set, and any resources dependent on the security group as part of the same Terraform plan, replacement happens successfully:

  1. New security group is created
  2. Resource is associated with the new security group and disassociated from the old one
  3. Old security group is deleted successfully because there is no longer anything associated with it

(If a resource is dependent on the security group and is also outside the scope of the Terraform plan, the old security group will fail to be deleted and you will have to address the dependency manually.)

Note that the module's default configuration of create_before_destroy = true and preserve_security_group_id = false will force the “create before destroy” behavior on the target security group, even if the module did not create it and instead you provided a target_security_group_id.

Unfortunately, creating a new security group is not enough to prevent a service interruption. Keep reading for more on that.

Setting Rule Changes to Force Replacement of the Security Group

A security group by itself is just a container for rules. It only functions as desired when all the rules are in place. If using the Terraform default “destroy before create” behavior for rules, even when using create_before_destroy for the security group itself, an outage occurs when updating the rules or security group because the order of operations is:

  1. Delete existing security group rules (triggering a service interruption)
  2. Create the new security group
  3. Associate the new security group with resources and disassociate the old one (which can take a substantial amount of time for a resource like a NAT Gateway)
  4. Create the new security group rules (restoring service)
  5. Delete the old security group

To resolve this issue, the module's default configuration of create_before_destroy = true and preserve_security_group_id = false causes any change in the security group rules to trigger the creation of a new security group. With that, a rule change causes operations to occur in this order:

  1. Create the new security group
  2. Create the new security group rules
  3. Associate the new security group with resources and disassociate the old one
  4. Delete the old security group rules
  5. Delete the old security group

Preserving the Security Group

There can be a downside to creating a new security group with every rule change. If you want to prevent the security group ID from changing unless absolutely necessary, perhaps because the associated resource does not allow the security group to be changed or because the ID is referenced somewhere (like in another security group's rules) outside of this Terraform plan, then you need to set preserve_security_group_id to true.

The main drawback of this configuration is that there will normally be a service outage during an update because existing rules will be deleted before replacement rules are created. Using keys to identify rules can help limit the impact, but even with keys, simply adding a CIDR to the list of allowed CIDRs will cause that entire rule to be deleted and recreated, causing a temporary access denial for all of the CIDRs in the rule. (For more on this and how to mitigate against it, see The Importance of Keys below.)

Also, note that setting preserve_security_group_id to true does not prevent Terraform from replacing the security group when modifying it is not an option, such as when its name or description changes. However, if you can control the configuration adequately, you can maintain the security group ID and eliminate the impact on other security groups by setting preserve_security_group_id to true. We still recommend leaving create_before_destroy set to true for the times when the security group must be replaced to avoid the DependencyViolation described above.

Defining Security Group Rules

We provide several different ways to define rules for the security group for a few reasons:

  • Terraform type constraints make it difficult to create collections of objects with optional members
  • Terraform resource addressing can cause resources that did not actually change to be nevertheless replaced (deleted and recreated), which, in the case of security group rules, then causes a brief service interruption
  • Terraform resource addresses must be known at plan time, making it challenging to create rules that depend on resources being created during apply and at the same time are not replaced needlessly when something else changes
  • When Terraform rules can be successfully created before being destroyed, there is no service interruption for the resources associated with that security group (unless the security group ID is used in other security group rules outside of the scope of the Terraform plan)

The Importance of Keys

If you are relying on the “create before destroy” behavior for the security group and security group rules, you can skip this section and much of the discussion about keys in the later sections because keys do not matter in this configuration. However, if you are using the “destroy before create” behavior, a full understanding of keys applied to security group rules will help you minimize service interruptions due to changing rules.

When creating a collection of resources, Terraform requires each resource to be identified by a key so that each resource has a unique “address” and Terraform uses these keys to track changes to resources. Every security group rule input to this module accepts optional identifying keys (arbitrary strings) for each rule. If you do not supply keys, then the rules are treated as a list, and the index of the rule in the list will be used as its key. Note that not supplying keys, therefore, has the unwelcome behavior that removing a rule from the list will cause all the rules later in the list to be destroyed and recreated. For example, changing [A, B, C, D] to [A, C, D] causes rules 1(B), 2(C), and 3(D) to be deleted and new rules 1(C) and 2(D) to be created.

We allow you to specify keys (arbitrary strings) for each rule to mitigate this problem. (Exactly how you specify the key is explained in the next sections.) Going back to our example, if the initial set of rules were specified with keys, e.g. [{A: A}, {B: B}, {C: C}, {D: D}], then removing B from the list would only cause B to be deleted, leaving C and D intact.

Note, however, two cautions. First, the keys must be known at terraform plan time and therefore cannot depend on resources that will be created during apply. Second, in order to be helpful, the keys must remain consistently attached to the same rules. For example, if you did the following:

rule_map = { for i, v in rule_list : i => v }

Then you will have merely recreated the initial problem by using a plain list. If you cannot attach meaningful keys to the rules, there is no advantage to specifying keys at all.

Avoid One Terraform Rule = Many AWS Rules

A single security group rule input can actually specify multiple security group rules. For example, ipv6_cidr_blocks takes a list of CIDRs. However, AWS security group rules do not allow for a list of CIDRs, so the AWS Terraform provider converts that list of CIDRs into a list of AWS security group rules, one for each CIDR. (This is the underlying cause of several AWS Terraform provider bugs, such as #25173.) As of this writing, any change to any element of such a rule will cause all the AWS rules specified by the Terraform rule to be deleted and recreated, causing the same kind of service interruption we sought to avoid by providing keys for the rules, or, when create_before_destroy = true, causing a complete failure as Terraform tries to create duplicate rules which AWS rejects. To guard against this issue, when not using the default behavior, you should avoid the convenience of specifying multiple AWS rules in a single Terraform rule and instead create a separate Terraform rule for each source or destination specification.
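
A hedged sketch combining both recommendations, stable keys and one source per Terraform rule (attribute names follow the aws_security_group_rule resource, as defined below):

rules = [
  {
    key         = "https-from-vpc-a" # stable key: removing another rule won't shift this one
    type        = "ingress"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.10.0.0/16"]   # one CIDR per Terraform rule = one AWS rule
  },
  {
    key         = "https-from-vpc-b"
    type        = "ingress"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.20.0.0/16"]
  },
]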

rules and rules_map inputs

This module provides 3 ways to set security group rules. You can use any or all of them at the same time.

The easy way to specify rules is via the rules input. It takes a list of rules. (We will define a rule a bit later.) The problem is that a Terraform list must be composed of elements of the exact same type, and rules can be any of several different Terraform types. So to get around this restriction, the second way to specify rules is via the rules_map input, which is more complex.

Why is the input so complex?

The rules_map input takes an object.

  • The attribute names (keys) of the object can be anything you want, but need to be known during terraform plan, which means they cannot depend on any resources created or changed by Terraform.
  • The values of the attributes are lists of rule objects, each representing one Security Group Rule. As explained above, each object in the list must be exactly the same type. To use multiple types, you must put them in separate lists and put the lists in a map with distinct keys, as shown in the sketch below.
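
A hedged sketch of that shape (keys and values are placeholders):

rules_map = {
  cidr_rules = [
    {
      key         = "https-from-office"
      type        = "ingress"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["203.0.113.0/24"]
    },
  ]

  sg_rules = [
    {
      key                      = "all-from-bastion"
      type                     = "ingress"
      from_port                = 0
      to_port                  = 0
      protocol                 = "-1"
      source_security_group_id = var.bastion_security_group_id # hypothetical variable
    },
  ]
}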

Definition of a Rule

For our module, a rule is defined as an object. The attributes and values of the rule objects are fully compatible (have the same keys and accept the same values) as the Terraform aws_security_group_rule resource, except

  • The security_group_id will be ignored, if present
  • You can include an optional key attribute. Its value must be unique among all security group rules in the security group, and it must be known in the Terraform “plan” phase, meaning it cannot depend on anything being generated or created by Terraform.

If provided, the key attribute value will be used to identify the Security Group Rule to Terraform to prevent Terraform from modifying it unnecessarily. If the key is not provided, Terraform will assign an identifier based on the rule's position in its list, which can cause a ripple effect of rules being deleted and recreated if a rule gets deleted from the start of a list, causing all the other rules to shift position. See “Unexpected changes…” below for more details.

Important Notes

Unexpected changes during plan and apply

When configuring this module for “create before destroy” behavior, any change to a security group rule will cause an entirely new security group to be created with all new rules. This can make a small change look like a big one, but is intentional and should not cause concern.

As explained above under The Importance of Keys, when using “destroy before create” behavior, security group rules without keys are identified by their indices in the input lists. If a rule is deleted and the other rules move closer to the start of the list, those rules will be deleted and recreated. This can make a small change look like a big one when viewing the output of Terraform plan, and will likely cause a brief (seconds) service interruption.

You can avoid this for the most part by providing the optional keys, and limiting each rule to a single source or destination. Rules with keys will not be changed if their keys do not change and the rules themselves do not change, except in the case of rule_matrix, where the rules are still dependent on the order of the security groups in source_security_group_ids. You can avoid this by using rules instead of rule_matrix when you have more than one security group in the list. You cannot avoid this by sorting the source_security_group_ids, because that leads to the “Invalid for_each argument” error because of terraform#31035.

Invalid for_each argument

You can supply many rules as inputs to this module, and they (usually) get transformed into aws_security_group_rule resources. However, Terraform works in 2 steps: a plan step where it calculates the changes to be made, and an apply step where it makes the changes. This is so you can review and approve the plan before changing anything. One big limitation of this approach is that it requires that Terraform be able to count the number of resources to create without the benefit of any data generated during the apply phase. So if you try to generate a rule based on something you are creating at the same time, you can get an error like

Error: Invalid for_each argument
The "for_each" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created.

This module uses lists to minimize the chance of that happening, as all it needs to know is the length of the list, not the values in it, but this error still can happen for subtle reasons. Most commonly, using a function like compact on a list will cause the length to become unknown (since the values have to be checked and nulls removed). In the case of source_security_group_ids, just sorting the list using sort will cause this error. (See terraform#31035.) If you run into this error, check for functions like compact somewhere in the chain that produces the list and remove them if you find them.
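
A hedged illustration of the pattern to avoid (variable names are hypothetical):

# Length unknown at plan time: compact() has to inspect the values to drop nulls,
# so Terraform cannot predict how many rules it will create.
# source_security_group_ids = compact([var.alb_sg_id, var.bastion_sg_id])

# Length known at plan time: pass the list through unmodified.
source_security_group_ids = [var.alb_sg_id, var.bastion_sg_id]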

WARNINGS and Caveats

Setting inline_rules_enabled is not recommended and NOT SUPPORTED: Any issues arising from setting inline_rules_enabled = true (including issues about setting it to false after setting it to true) will not be addressed because they flow from fundamental problems with the underlying aws_security_group resource. The setting is provided for people who know and accept the limitations and trade-offs and want to use it anyway. The main advantage is that when using inline rules, Terraform will perform “drift detection” and attempt to remove any rules it finds in place but not specified inline. See this post for a discussion of the difference between inline and resource rules and some of the reasons inline rules are not satisfactory.

KNOWN ISSUE (#20046): If you set inline_rules_enabled = true, you cannot later set it to false. If you try, Terraform will complain and fail. You will either have to delete and recreate the security group or manually delete all the security group rules via the AWS console or CLI before applying inline_rules_enabled = false.

Objects not of the same type: Any time you provide a list of objects, Terraform requires that all objects in the list be the exact same type. This means that all objects in the list have exactly the same set of attributes and that each attribute has the same type of value in every object. So while some attributes are optional for this module, if you include an attribute in any of the objects in a list, you have to include that same attribute in all of them. In rules where an attribute would otherwise be omitted, include it with a value of null, unless the value is a list type, in which case set it to [] (an empty list), due to #28137.
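
A hedged sketch of keeping two different rules type-consistent:

rules = [
  {
    key         = "https"
    type        = "ingress"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    self        = null # unused here, but present so both objects have the same attributes
  },
  {
    key         = "node-to-node"
    type        = "ingress"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = []   # list-typed attribute: use [] rather than null (see #28137)
    self        = true
  },
]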

How to Pick Your Primary AWS Region?

Erik Osterman | Cloud Architecture & Platforms, DevOps

While your company might operate in multiple regions, one region should typically be selected as the primary region. Certain resources will not be geographically distributed, and these should be provisioned in this default region.
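
In Terraform terms, a hedged sketch of this pattern (the region choices and bucket name are placeholders) is a default provider pinned to the primary region plus aliases for any additional regions:

provider "aws" {
  region = "us-east-2" # primary region: non-geographically-distributed resources land here
}

provider "aws" {
  alias  = "west"
  region = "us-west-2" # secondary region, used only where explicitly referenced
}

resource "aws_s3_bucket" "replica" {
  provider = aws.west                 # opt a resource into the secondary region
  bucket   = "example-replica-bucket" # placeholder
}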

When building out your AWS infrastructure from scratch, it's a good time to revisit decisions that might have been made decades ago. Many new AWS regions might be better suited for the business.

Customer Proximity

One good option is picking a default region that is closest to where the majority of end-users reside.

Business Headquarters

Frequently, we see companies select the default region closest to where the majority of business operations take place. This is especially true if most of the services in the default region will be consumed by the business itself.

Stability

When operating on AWS, selecting a region other than us-east-1 is advisable, as it is (or used to be) the default region for most AWS users. It has historically had the most service interruptions, presumably because it is one of the most heavily used regions and operates at a scale much larger than other AWS regions. We therefore advise using us-east-2 over us-east-1; the latency between these two regions is minimal.

High Availability / Availability Zones

Not all AWS regions support the same number of availability zones. Some regions offer only two availability zones, while a minimum of three is recommended when operating Kubernetes to avoid “split-brain” problems.

Cost

Not all regions cost the same to operate. On the other hand, if you have significant resources deployed in an existing region, migrating to a new region could be cost-prohibitive; data transfer costs are not cheap, and petabyte-scale S3 buckets would be costly to migrate.

Service Availability

Not all regions offer the full suite of AWS services or receive new services at the same rate as others. The newest regions frequently lack many of the newer services, and some regions receive platform infrastructure updates more slowly than others. AWS also offers Local Zones (e.g. us-west-2-lax-1a), which operate only a subset of AWS services.

Instance Types

Not all instance types are available in all regions.

Latency

The latency between infrastructure across regions could be a factor. See cloudping.co/grid for more information.
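Once you've settled on a primary region, one common pattern is to encode that decision as the default AWS provider and declare any secondary regions as provider aliases. The sketch below assumes that pattern; the region names and bucket names are placeholders.

```hcl
provider "aws" {
  region = "us-east-2" # primary region: non-geographically-distributed resources live here
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2" # example secondary region
}

# Resources default to the primary region...
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-artifacts-primary" # hypothetical bucket name
}

# ...and opt into a secondary region explicitly.
resource "aws_s3_bucket" "artifacts_replica" {
  provider = aws.secondary
  bucket   = "example-artifacts-replica" # hypothetical bucket name
}
```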


Should You Run Stateful Systems via Container Orchestration?

Erik Osterman | Cloud Architecture & Platforms, DevOps

4 min read

Recently it was brought up that ThoughtWorks now says:

We recommend caution in managing stateful systems via container orchestration platforms such as Kubernetes. Some databases are not built with native support for orchestration — they don’t expect a scheduler to kill and relocate them to a different host. Building a highly available service on top of such databases is not trivial, and we still recommend running them on bare metal hosts or a virtual machine (VM) rather than to force-fit them into a container orchestration platform

https://www.thoughtworks.com/radar/techniques/managing-stateful-systems-via-container-orchestration

This is just more FUD that deserves to be cleared up. First, not all container management platforms are the same. I can only speak from experience about what this means for Kubernetes, and Kubernetes is ideally suited to run these kinds of workloads when used properly.

NOTE: Just so we're clear, our recommendation for production-grade infrastructure is to always use a fully-managed service like RDS, Kinesis, MSK, or ElastiCache rather than self-hosting, whether on Kubernetes or on bare-metal/VMs. Of course, that only works if these services meet your requirements.

To set the record straight, Kubernetes won't randomly kill Pods and relocate them to a different host if it is configured correctly. First, by setting requested resources equal to the limits, the Pods receive the Guaranteed QoS (Quality of Service) class, the highest tier, which makes them the last to be evicted under node pressure. Then, by setting a PodDisruptionBudget, we can be very explicit about what sort of “SLA” we want for our Pods.
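A minimal sketch of both settings follows; the names, image, and replica counts are illustrative rather than a production configuration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kafka-0 # illustrative; in practice this Pod would come from a StatefulSet
  labels:
    app: kafka
spec:
  containers:
    - name: kafka
      image: confluentinc/cp-kafka:7.4.0 # example image
      resources:
        requests: # requests == limits puts the Pod in the Guaranteed QoS class
          cpu: "2"
          memory: 8Gi
        limits:
          cpu: "2"
          memory: 8Gi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  minAvailable: 2 # never voluntarily disrupt below 2 matching Pods
  selector:
    matchLabels:
      app: kafka
```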

The other recommendation is to use the appropriate workload controller for the Pods. For databases, it's typically recommended to use StatefulSets (formerly called PetSets, for good reason!). With StatefulSets, we get the same kinds of lifecycle semantics we're used to when working with discrete VMs: stable network identities, ordered rollouts, and the assurance that there will never be two concurrent Pods (“Pets”) with the same name. We've experienced first-hand how some applications like Kafka hate it when their network address changes. StatefulSets solve that.

If StatefulSets are not enough of a guarantee, we can provision dedicated node pools. These node pools can even run on bare metal to assuage even the staunchest critics of virtualization. Using taints and tolerations, we can ensure that the databases run exactly where we want them; there's no risk that a “spot instance” will randomly nuke the Pod. Then, using anti-affinity rules, we can ensure that the Kubernetes scheduler spreads the workloads across distinct physical nodes.
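Putting that together, a StatefulSet might look roughly like the sketch below. The node-pool label and taint (dedicated=database), the headless Service name, and the image are assumptions made for illustration.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka # assumes a headless Service named "kafka" provides stable DNS identities
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      # Only land on the dedicated (possibly bare-metal) node pool.
      tolerations:
        - key: dedicated
          operator: Equal
          value: database
          effect: NoSchedule
      nodeSelector:
        dedicated: database
      # Spread the brokers across distinct nodes.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: kafka
              topologyKey: kubernetes.io/hostname
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.4.0 # example image
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```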

Lastly, Kubernetes above all else is a framework for consistent cloud operations. It exposes all the primitives that developers need to codify the business logic required to operate even the most complex business applications. Contrast this with ThoughtWorks' recommendation of running applications on bare metal hosts or a virtual machine (VM) rather than “force-fitting” them into a container orchestration platform: when you “roll your own”, you quickly find that almost no organization possesses the in-house skill set to orchestrate and automate such a system effectively. In fact, this kind of skill set used to be possessed only by technology companies like Google and Netflix. Kubernetes has leveled the playing field.

Using Kubernetes Operators, the business logic of how to operate a highly available legacy or cloud-native application can be captured and codified. There's an ever-growing list of operators, and companies have popped up whose whole business model is built around robust operators for managing databases in Kubernetes. Because this business logic is captured in code, it can be iterated on and improved. As companies encounter new edge cases, the operator can be updated to handle them, so everyone benefits. Contrast that with the traditional “snowflake” approach, where every company implements its own kind of Rube Goldberg apparatus: hard lessons learned are not shared, and we're back in the dark ages of cloud computing.

As with any tool, it's the operator's responsibility to know how to operate it. There are plenty of ways to blow your leg off with Kubernetes, but when used the right way, it will unlock the superpowers your organization needs.

Fun Facts About the Kubernetes Ingress

Jeremy | DevOps

2 min read

Here are some important things to know about the Kubernetes Ingress resource as implemented by ingress-nginx (which is one of the Ingress controllers we support at Cloud Posse).

  • An Ingress must send traffic to a Service in the same Namespace as the Ingress
  • An Ingress, if it uses a TLS secret, must use a Secret from the same Namespace as the Ingress
  • It is completely legal and supported to have multiple Ingresses defined for the same host; this is how you can have one host refer to two Services in different namespaces (see the sketch after this list)
  • Multiple ingresses for the same host are mostly merged together as if they were one ingress, with some exceptions:
    • While the ingresses must refer to resources in their own namespaces, the multiple ingresses can be in different namespaces
    • Annotations that can be applied to only one ingress are applied only to that ingress
    • In case of conflicts, such as multiple TLS certificates or server-scoped annotations, the oldest rule wins
    • These rules are defined more rigorously in the documentation
  • Because of the way multiple Ingresses are merged, you can have an Ingress in one namespace that defines the TLS secret and the external DNS name target, leave that entirely undefined in the other Ingresses, and yet they will all be served with the same TLS certificate
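For example, here is a rough sketch of two Ingresses in different namespaces sharing one host, with only the first one defining TLS. The hostnames, namespaces, and Service names are made up, and the manifests assume the ingress-nginx controller and the networking.k8s.io/v1 API.

```yaml
# Namespace "frontend": serves / and defines TLS for the shared host.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend
  namespace: frontend
spec:
  ingressClassName: nginx
  tls:
    - hosts: ["app.example.com"]
      secretName: app-example-com-tls # Secret must live in this same namespace
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend # Service must live in this same namespace
                port:
                  number: 80
---
# Namespace "api": serves /api on the same host; no TLS block here,
# yet it is still served under the certificate defined above.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  namespace: api
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```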

The paths section of the Ingress deserves some special attention, too:

  • The interpretation of the path is implementation-dependent. GCE ingress treats the path as an exact match, while Nginx treats it as a prefix. Starting with Kubernetes 1.18, there is a pathType field that can be Exact, Prefix, or ImplementationSpecific, but the default behavior remains implementation-dependent. Generally, Helm charts appear to expect the path to be interpreted the way Nginx does.
  • The general rule is that the longest matching path wins, but it gets complicated with regular expressions (more below)
  • Prior to Kubernetes 1.18 (and maybe even then), there is no way for an Ingress to specify a native Nginx exact path match. The closest you can come is to use a regex match, but regex matches are case-insensitive. Furthermore, adding a regex path to an Ingress makes all the paths of that Ingress case-insensitive regexes, by default rooted as prefixes (see the sketch after this list).
  • The catch here is that it is still the longest rule that wins, even over an exact match: /[abc]pi/ will take precedence over /api/
  • There is a simple explainer of priorities and gotchas with multiple path rules in the ingress-nginx documentation and a fuller explanation in this tutorial from Digital Ocean.
  • With Nginx, if a path ends in /, then it creates an implied 301 permanent redirect from the path without the trailing / unless that path is also defined. path: /api/ will cause https://host/api to redirect to https://host/api/. (Not sure if this applies to regex paths.)
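Here is a rough sketch of the regex approach with ingress-nginx, using the nginx.ingress.kubernetes.io/use-regex annotation. The host, namespace, and Service name are placeholders; note that enabling the annotation turns every path in this Ingress into a case-insensitive regex.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-health
  namespace: api
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true" # all paths below become case-insensitive regexes
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api/v1/health$ # anchored regex: as close to an "exact" match as you get here
            pathType: ImplementationSpecific # the appropriate pathType for regex paths
            backend:
              service:
                name: api
                port:
                  number: 80
```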