Foundational Infrastructure

The foundational infrastructure for a business determines what it will or will not be capable of achieving. As in the physical world, a weak foundation [1] will hamper even the most well-built applications. It's incredibly difficult to make fundamental changes to live foundations, which is why getting it right the first time is so important.

When we terraform your AWS organization, we implement the concept of “Landing Zones” using pure Terraform to organize your infrastructure across a dozen or more accounts [2]. A landing zone is a concept from the AWS Well-Architected Framework, which is the prescriptive guidance provided by AWS on how to structure a multi-account AWS environment properly. It provides a baseline to get started with a multi-account architecture, identity and access management, governance, data security, network design, and logging. The accounts are grouped into logical “Organizational Units” (OUs) that, when combined with Service Control Policies (SCPs), enable tight controls over what can or cannot happen within each account. Organizational CloudTrail trails ensure that audit logs from all accounts flow into an audit account with strict IAM access controls to prevent tampering. Once the landing zone is provisioned, we can start to deploy workloads and applications.
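As a rough illustration of how an SCP constrains what can happen inside an account, the sketch below builds a policy document that denies tampering with CloudTrail. The function name, the statement ID, and the choice of denied actions are assumptions made for this example; they are not the policies delivered in an engagement, and in practice the document would be attached to an OU through Terraform rather than printed.

```python
import json


def deny_cloudtrail_tampering_scp() -> dict:
    """Build an illustrative SCP document (hypothetical example policy)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Explicit deny: applies to every principal in the accounts
                # the policy is attached to, including account administrators.
                "Sid": "DenyCloudTrailTampering",
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                    "cloudtrail:UpdateTrail",
                ],
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_cloudtrail_tampering_scp(), indent=2))
```

Because an SCP sets the outer boundary of what an account may do, a deny like this holds even if a role inside the account is later granted broader IAM permissions.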

We lay a solid IAM foundation with an SSO integration backed by your Identity Provider (IdP).

A hierarchical DNS architecture ensures that AWS accounts can leverage DNS without risking an impact on other zones. We further distinguish between Service Discovery domains and Vanity Domains (aka Branded Domains). Service Discovery domains follow programmatic naming conventions that need to consider things like cloud provider, region, and stage, while vanity domains are the ones used by marketing for landing pages or SEO.
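To make the idea of a programmatic naming convention concrete, here is a minimal sketch that assembles a service discovery hostname from its parts. The label order (service, stage, region, provider) and the `example.net` base domain are assumptions chosen for illustration; the actual convention is one of the design decisions made during an engagement.

```python
def service_discovery_domain(service: str, stage: str, region: str,
                             provider: str, base_domain: str) -> str:
    """Compose a service discovery hostname (hypothetical label order)."""
    labels = [service, stage, region, provider]
    for label in labels:
        # Keep the check simple: letters, digits and hyphens only.
        if not label or not label.replace("-", "").isalnum():
            raise ValueError(f"invalid DNS label: {label!r}")
    return ".".join(label.lower() for label in labels) + "." + base_domain.lower()


if __name__ == "__main__":
    print(service_discovery_domain("api", "prod", "us-east-1", "aws", "example.net"))
```

Generating names from a function like this keeps them predictable, so automation can derive an endpoint from the stage and region alone instead of looking it up.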

VPCs are the last part we provision as part of the foundation. Once the VPCs are up, everything else can start to be built, such as the Foundational Platform. The critical thing when provisioning VPCs is determining a subnet allocation strategy that ensures non-overlapping CIDRs across accounts. It should also future-proof you for obvious things like growth (running out of IPs sucks) and for less apparent considerations like acquisitions and third-party network peering.
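The allocation problem can be sketched with Python's standard `ipaddress` module: carve one block per account out of a shared supernet, then verify that no two blocks overlap. The `10.0.0.0/8` supernet, the `/16` block size, and the function names are assumptions for illustration only; the right sizes depend on expected growth and on any networks you may need to peer with later.

```python
import ipaddress
from itertools import combinations


def allocate_vpc_cidrs(supernet: str, accounts: list, new_prefix: int) -> dict:
    """Assign each account the next free block of size /new_prefix."""
    blocks = ipaddress.ip_network(supernet).subnets(new_prefix=new_prefix)
    allocation = {}
    for account in accounts:
        block = next(blocks, None)
        if block is None:
            raise ValueError(f"supernet {supernet} exhausted at {account!r}")
        allocation[account] = block
    return allocation


def find_overlaps(cidrs: list) -> list:
    """Return every pair of networks that share address space."""
    return [(a, b) for a, b in combinations(cidrs, 2) if a.overlaps(b)]


if __name__ == "__main__":
    plan = allocate_vpc_cidrs("10.0.0.0/8", ["prod", "staging", "dev"], 16)
    for account, cidr in plan.items():
        print(account, cidr)
    print("overlaps:", find_overlaps(list(plan.values())))
```

Running the same overlap check against the CIDRs of a partner or an acquired company before peering is a cheap way to catch conflicts early.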

Everything we build is designed with multi-region and multi-tenancy in mind. Through many hard lessons, we've adopted patterns that make it easier to build large-scale distributed systems, since many of our customers serve enterprise businesses with complex regulatory requirements that demand account-level isolation.

Example Implementation

Provision AWS Multi-Account Architecture with Terraform

Raise all necessary account limits via AWS Support
Implement a multi-region capable operating model using multi-region naming conventions
Provision Member accounts. Additional accounts may be added easily later.
- root
- core
  - identity
  - audit
  - security
  - network
  - dns
  - auto
  - artifacts
- plat
  - prod
  - staging
  - dev
  - sandbox
Provision centralized CloudTrail bucket in the audit account with mandatory encryption, private ACLs, and lifecycle rules to reduce ongoing costs
Provision Organizational CloudTrail Audit Logs to log to the centralized S3 bucket in the audit account
Provision account settings, including account aliases, account password policies and account S3 bucket policies
Provision account budgets (optional)
Provision centralized ECR registry to host Docker images (e.g., for infrastructure and applications)
Provision a combination of AWS Secrets Manager (ASM) and SSM Parameter Store for platform infrastructure secrets (KMS encryption)
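The account layout listed above can be captured as a small data structure, which is handy when looping over accounts in automation. The sketch below mirrors the list as written; the variable and function names are hypothetical.

```python
from typing import Optional

# The root account sits outside the two organizational units.
ROOT_ACCOUNT = "root"

ORGANIZATIONAL_UNITS = {
    "core": ["identity", "audit", "security", "network", "dns", "auto", "artifacts"],
    "plat": ["prod", "staging", "dev", "sandbox"],
}


def all_accounts() -> list:
    """Flatten the hierarchy into a single list of account names."""
    accounts = [ROOT_ACCOUNT]
    for members in ORGANIZATIONAL_UNITS.values():
        accounts.extend(members)
    return accounts


def ou_for_account(account: str) -> Optional[str]:
    """Return the OU an account belongs to, or None for the root account."""
    if account == ROOT_ACCOUNT:
        return None
    for ou, members in ORGANIZATIONAL_UNITS.items():
        if account in members:
            return ou
    raise KeyError(f"unknown account: {account}")


if __name__ == "__main__":
    for name in all_accounts():
        print(name, "->", ou_for_account(name))
```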

Provision New AWS Organization with Terraform

Provision net-new top-level root account for the organization
Provision Organizational Units (OUs)
- core: for core governance accounts
- plat: for platform accounts (e.g., dev, staging, prod)
Provision example Service Control Policies (SCP)
Provision Terraform State Backend architecture, using a combination of S3 buckets with mandatory encrypted objects and DynamoDB tables for state locking

Terraform Support

Our reference architecture includes native terraform support for the following related services.

Frequently Asked Questions

Do you only work with US-based companies?

We work with companies anywhere in the world. While most of our customers are based in the United States, we've worked with companies in the United Kingdom, Germany, Australia, Hong Kong, India, Argentina, etc. Our team is distributed across the US and Eastern Europe.

When can we get started?

We can start as soon as you sign our Statement of Work. Typically we see this process take 2-3 weeks from the first introductory call to the start of our engagement.

Here's the checklist we'll need to complete before we can start:

Execute Mutual NDA (ours or yours)
Collaborate on Engagement Workbook via Google Docs
Execute Statement of Work and Master Services Agreement
Deposit Payment
Kick-off!

We can kick off the initial introductory call immediately, so please make sure that you schedule it today.

After talking with you and assessing if we're a proper fit, we'll execute a Mutual NDA and then send over an Engagement Workbook so we can gather all the requirements for your project and estimate the cost.

How did you get the idea to offer this solution?

“Rising tide floats all boats”

After working with so many startups over the years, it became very apparent that a lot of what tech companies need is repeatable. Also, figuring out how to get all the Open Source components working together was always a big challenge.

As consultants, we needed to find a way to consistently deliver the results our customers expect. Starting from scratch is simply not feasible if we want to scale our business.

Therefore we decided to use an Open Source business model whereby all reusable pieces of infrastructure code are released on our GitHub under the permissive Apache 2 software license. This ensures that we can continue to iterate on everything we develop for our customers. Everyone wins.

If everything is open sourced, why don't teams just do it themselves instead of working with Cloud Posse?

Anyone is free to fork our repositories and try it themselves, but our support eliminates the guesswork and shortens the time it takes to implement correctly.

Think of it like this: anyone can walk into a hardware store and pick up the materials to build a house. Very few people can build a house that won't fall down if they don't have the experience of using all the tools and hardware correctly. We fill the gap by providing the knowledge and experience to get you where you want to be faster than doing it yourself.

Can we get a live demo of it in action?

Of course! We can't wait to show you what it can do.

Book your appointment today.

Do you provide documentation?

Cloud Posse provides documentation as part of its engagements, but it is written for experienced developers. If different documentation is required, it can be created upon request.

What will be the composition of the team?

It really depends on when a contract begins and who on our team is on the bench. Generally, we like to put (2) engineers on a project so we have cross-training and continuity in the event a member needs to take time off. Our team is geographically distributed across the continental US as well as Eastern Europe. Throughout the course of a project, we may move team members between projects depending on their subject matter expertise.

Do you provide ongoing support?

We provide entirely optional ongoing support for customers who've gone through our DevOps Accelerator.

By and large, most of our customers take over the day-to-day management of their infrastructure.

We're here though to help out anywhere you need it.

We do not provide 24×7 “on-call” (aka PagerDuty) support.

How will we interact with the team?

Slack. You will have direct access to the team via a shared Slack channel between our respective teams.
Zoom. We'll have weekly scheduled cadence calls via Zoom to review the current progress, blockers and give product demos in your environment. These calls can be recorded and shared with your team.
Google Drive. We also recommend creating a shared Team Drive folder via Google Docs for the sharing of relevant design docs, agendas or other materials.
Trello. We manage the project via a Trello Team created specifically for each engagement. We invite your team and our team to this team and create (1) board per sprint. This allows us to standardize our process while providing transparency along the way.
Office Hours. Most engagements include a “Documentation & Training” sprint, during which we arrange weekly “Office Hours” via Zoom (recorded) to answer any questions your team may have as they begin to kick the tires.

What exactly do we get when we buy?

We'll deliver the end-to-end solution you've seen in all of our demos. It will be preconfigured for your environments under your AWS accounts. We'll create new GitHub repos that will contain all the infrastructure code you need.

Along the way, we'll show you the ropes and how to operate it. In the long run, you'll be responsible for operating it but we'll stick around for as long as you need our help.

Will this really work for our startup?

It depends. Your best bet is to schedule a discovery call with us so we can go over your specific concerns. Assuming your software runs on Linux and that you're able to make any necessary code changes to ensure your applications are “12 Factor App” compliant, there's a very high likelihood we'll be able to help you out.

What happens if we wait to pull the trigger?

What’s it costing your business if you wait?

The longer you wait, the more time & effort you'll waste on maintenance rather than innovation. The more tech debt you'll amass. The more opportunities you'll miss.

Your developers will be less productive, which means you'll be paying more while getting less done in return.

The sooner you streamline your operations the faster you will move:

Reduce your opportunity cost and capitalize on the investment sooner
Release more features to customers faster
Control operating costs to do more for less

Not to mention, your developers will love you for making their lives easier. The last thing developers want is to do things by hand.

How is Cloud Posse different from Gruntwork?

Gruntwork doesn't provide open access to all their modules; they are a subscription service. Cloud Posse open sources everything.
All of our code is on GitHub and can be forked and used with no concerns about licensing issues (Apache 2.0).
Gruntwork's Reference Architecture requires Terragrunt.
Gruntwork is not a consulting company. They do not help with hands-on implementation. That's left up to you.
We provide a comprehensive project plan consisting of hundreds of implementation tasks and design decisions that we execute together with your team.
Our Slack community is free for anyone to join, not just paying customers.
Because our work is Open Source, there's a lower barrier to getting started. That's why it's in use by thousands and thousands of companies. We receive dozens of Pull Requests every week enhancing our modules and fixing bugs.

What kind of companies do you generally work with?

We work with companies who need to own their infrastructure as their competitive advantage.

Our customers are typically post-Series A technology startups who are seeing success in the market and need to accelerate their DevOps adoption in order to take their company to the next level.

They are backed by some of the biggest names in the industry and are solving really difficult problems with technology.

What is a DevOps Accelerator?

We help companies own their infrastructure in record time by building it together with your team and then showing them the ropes. We stick around for as long as it takes for you to become successful.

Our SweetOps™ process eliminates the guesswork so you get everything you need for a successful cloud migration from the bottom up.

We Build Your Infrastructure. We implement everything you need from your cloud platform using Infrastructure as Code.
You Own It. You achieve it in just a few months. We show you how to ride it along the way.
You Drive It. Customize everything or anything you want. It's your infrastructure.

You get a predictable outcome that is delivered on time and within budget.

There are no long term commitments. No license fees. No strings attached.

Plus, we stick around for as long as you need our help.

Sounds like a pretty good deal, right?

Is a subscription required?

No, it's absolutely FREE for anyone to attend.

What is our responsibility?

Can you help me understand where the boundaries of Cloud Posse's responsibilities end, and where ours would start?

Cloud Posse's mission is to help companies own their infrastructure. We accelerate this journey by architecting your 4 layers with you and by taking the lead on the implementation. Since we have an opinionated framework, customers will need to learn how to leverage everything for their use cases. This will sometimes mean altering how you build and deploy your services.

Getting Started With Us

We always prefer to start with a green-field approach, where we build your infrastructure from the ground up together with your team. As part of our process, we'll walk you through all of the required design decisions, ensuring you have sufficient context to make informed decisions. This is why we expect our customers to have someone on their engineering team invested in the outcome. This part is absolutely critical, as it ensures what we deliver suits your business needs. Everything we do is delivered by pull request for your review and we will happily provide documentation on anything you want. Along the way, we'll assign homework exercises and provide ample documentation. This approach provides the best opportunity to gain a deep hands-on understanding of our solution.

We encourage you to ask as many questions as you want and challenge our assumptions. You also can volunteer for any task you want to take on as “homework” and we'll help you out as needed.

When You Own It

Once our job is done, this is where you take the driver's seat. We'll help you get everything set up for a smooth transition from your heritage environment to your shiny new infrastructure. Rest assured that we'll stick around until your team is confident and has the know-how to operate these platforms in production. We don't expect teams to pick this up overnight, which is why we'll stay engaged for as long as you need. We're happy to answer questions and jump on Zoom for pair programming sessions.

Day-2 Operations

After our engagement, you will have a solid foundation powering your apps, and all the tools you need for infrastructure operations. This means your team is responsible for the ongoing maintenance, including upgrades (e.g. EKS clusters and all open-source software), patching systems, incident response, triaging, SRE (e.g. adding monitors and alerts), as well as security operations (responding to incidents, staying on top of vulnerabilities/CVEs). Cloud Posse is continuously updating its Open Source module ecosystem, but it's your responsibility to regularly update your infrastructure. Staying on top of these things is critical for a successful long-term outcome with minimal technical debt.

For companies that want to focus more on their business and less on maintenance, we provide ongoing support engagements exclusively for customers that have completed our accelerator.

Check out our approach to learn more!

What is the typical lifecycle of a small change?

Can you walk through the typical lifecycle of a small change that you might help us with, specifically with how it relates to coordinating changes between your team and ours?

Every change in your environment starts with a pull request, because our solution is built with a fully GitOps-driven approach. Depending on the CODEOWNERS configuration for the repository, branch protections will require all pull requests to have approvals by specific stakeholders, in addition to requiring all checks to pass. We also try to automate as much of the review process as possible. For example, when the pull request is opened, it automatically kicks off a job to validate the change against your environment so you can see the impact of any change.

The coordination needed is simply about figuring out who will be responsible for each part of the release process. The tooling handles the rest and we have a policy-driven approach (Open Policy Agent) to enforce it.

This includes:

Who will submit the pull request, which is entirely dependent on your comfort level with the change, or if you prefer us to take the lead.
Reviewing the pull request and applying changes to it as needed.
Approving and merging the pull request.
Validating and confirming the changes.

The toolchain in your CI/CD process provides Slack notifications and full audit history of everything that happens to give you optimal visibility and traceability.

Lastly, where applicable we implement blue/green rollout strategies for releases, but there are edge cases where a change could be disruptive to active development or live services. In such cases, these would be carefully coordinated to be released at an approved time.

Can we use our existing AWS organization?

What are the risks/gotchas/benefits of having everything be fully in a separate AWS Organization with a net-new top-level “root” account versus just adding some new OU to our existing root account? Could you share some details of why you recommend starting fresh to help us decide what we should do? For example, are there security issues, Terraform issues, etc. with reusing the existing root account?

Cloud Posse strongly recommends building out a new AWS Organization (also called the “root” account). While the choice is ultimately up to the customer, we do not recommend reusing your existing organization.

Here's why:

Liability. For us to provision accounts and set up the IAM architecture from the ground up, we need to be organizational admins (aka “God” in your existing infrastructure). This means we carry the liability of potentially breaking your existing environments (never fun), and you carry the added liability of us having access to production systems. We cannot sign a BAA.
Fresh start. Establish a clear divide between the old world and the new world. Let's treat the old environment as tainted and the new environment as hermetically sealed, where every change is introduced via Infrastructure-as-Code and Continuous Delivery. Declaring tech debt bankruptcy on your legacy infrastructure frees your company to focus on innovation over toil.
Mitigation. Old accounts can eventually be transferred into the new organization as-is, although we only recommend doing that under extenuating circumstances. Using VPC Peering or a Transit Gateway, the old world can be bridged with the new world until the transition is complete.

What if we have existing Terraform code?

Chances are we can reuse what you have if it makes sense. This is especially true of any code that is hyperspecific to your applications.

On the other hand, where we can leverage our existing service catalog, it will reduce your tech debt by offloading the ongoing maintenance of those components to the community. We'll also be much better equipped to support any questions you have related to our reusable modules or extend any functionality.

Common examples of reusable building blocks are things like VPCs, Subnets, RDS Postgres/MySQL Databases, Redis/Memcached clusters, ActiveMQ/RabbitMQ clusters, EKS/ECS Clusters, etc. We have over 150 reusable Terraform modules that are well-maintained, actively tested with terratest, and community-supported.

Who will be the Tech Lead/Architect for our project, and what assurance do we have that the lead will be fully allocated throughout the project (for continuity)?

We'll embed 1-2 engineers to work with your team through your project. Our preferred approach is to have multiple leads working in parallel so that we can ensure continuity throughout the engagement. It is Cloud Posse's responsibility to ensure that continuity. We have various subject matter experts that we'll swap in and out of the project, and I'll be directly involved through the entire process. We achieve greater continuity by documenting everything as we go, opening pull requests for all work, synchronizing branches regularly, and keeping all tasks well-defined in Jira. Every single call is recorded and shared with our team (via Gong). In addition, all design decisions are recorded in Jira issues and referenced throughout the project for context. Typically we have one engineer allocated to each sprint and parallelize work by commencing multiple concurrent sprints. You can expect 4-6 Cloud Posse engineers to be involved and contributing.

Why do you recommend Spacelift?

Spacelift checks off all the boxes for managing extremely large environments with a lot of state management. Since Cloud Posse's focus is on deploying large-scale loosely coupled infrastructure components with Terraform, it's common to have several hundred terraform states under management.

Every successful business in existence uses accounting software to manage its finances and understand the health of its business. The sheer number of transactions makes it infeasible to reconcile the books by hand. The same is true of modern infrastructure. With hundreds of states managed programmatically with Terraform, and modified constantly by different teams or individuals, the same kind of state reconciliation is required to know the health of your infrastructure. This need goes far beyond continuous delivery and few companies have solved it. With Spacelift, you have an up-to-date view of your assets, liabilities & tech debt across all environments.

Major benefits

Drift Detection runs on a customizable schedule and surfaces inconsistencies between what's deployed and what's in git.
Reconciliation helps you know what's deployed, what's failing, and what's queued.
Plan Approvals ensures changes are released when you expect them
Policy Driven Framework based on OPA (open source standard) is used to trigger runs and enforce permissions. This is like IAM for GitOps.
Terraform Graph Visualization makes it easier to visualize the entire state across components
Audit Logs of every change traced back to the commit and filterable by time
Affordable alternative to other commercial offerings
Works with more than Terraform (e.g. Pulumi)
Pull Request Previews show what the proposed changes are before committing them
Decoupling of Deploy from Release ensures we can merge to trunk and still control when those changes are propagated to environments
Ephemeral Environments (Auto Deployment, Auto Destruction) enable us to bring up infrastructure with Terraform and destroy it when it's no longer needed
Self-hosted Runners ensure we're in full control over what is executed in our own VPC, with no public endpoints
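To show what drift detection boils down to, the sketch below uses the Terraform CLI's `-detailed-exitcode` flag, where a plan exits with 0 when nothing would change, 2 when changes are pending, and 1 on error. This is not how Spacelift implements the feature; it is only a minimal illustration of the underlying check, and the function names and status strings are assumptions.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` results to a status string."""
    if code == 0:
        return "in-sync"   # plan succeeded with an empty diff
    if code == 2:
        return "drifted"   # plan succeeded and found pending changes
    return "error"         # exit code 1 (or anything unexpected)


def detect_drift(workdir: str) -> str:
    """Run a plan in `workdir` (requires terraform on PATH) and classify it."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)


if __name__ == "__main__":
    print(classify_plan_exit_code(2))
```

Running a check like this on a schedule for every component is the manual equivalent of nightly drift detection; the hard part a platform adds is tracking and surfacing the results across hundreds of states.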

What level of access do the Spacelift worker pools have?

Spacelift Workers are deployed in your environment with the level of permission that we grant them via IAM instance profiles. When provisioning any infrastructure that requires modifying IAM, the minimum permission is administrative. Thus, workers are provisioned with administrative permissions in all accounts that we grant access to since the terraform we provision requires creating IAM roles and policies. Note, this is not a constraint of Spacelift; this is required regardless of the platform that performs the automation.

What happens if Spacelift as a product goes away?

First off, while Spacelift might be a newer brand in the infrastructure space, it's used by publicly traded companies, healthcare companies, banks, institutions, Fortune 500 companies, etc. So, Spacelift is not going away.

But just to entertain the hypothetical, let's consider what would happen. Since we manage all terraform states in S3, we have the “break glass” capability to leave the platform at any time and can always run terraform manually. Of course, we would lose all the benefits.

How tough would it be to move everything to a different platform?

Fortunately, with Spacelift, we can still use S3 as our standard state backend. So if at any time we need to move off of the platform, it's easy. Of course, we'd give up all the benefits but the key here is we're not locked into it.

Why not just use Atlantis?

We used to predominantly recommend Atlantis but stopped doing so a number of years ago. The project was more or less dormant for 2-3 years, and only recently started accepting Pull Requests again. Atlantis was the first project to define a GitOps workflow for Terraform, but it's been left in the dust compared to newer alternatives.

With Atlantis, there is no regular reconciliation of what Terraform state has or has not been applied, so we really have no idea of the actual state of anything. We helped a recent customer migrate from Atlantis to Spacelift, and it took 2 months to reconcile all the infrastructure that had drifted.
With Atlantis, there's no drift detection, but with Spacelift, we detect it nightly (or as frequently as we want).
With Atlantis, there's no way to manage dependencies between components, so that when one component changes, any other components that depend on it are updated.
With Atlantis, there's no way to set up OPA policies to trigger runs. The OPA support in Atlantis is very basic.
With Atlantis, anyone who can run a plan can exfiltrate your root credentials. This has been discussed by others and was recently highlighted at the DEF CON 2021 conference.
With Atlantis, there's no way to limit who can run terraform plan or apply. If you have access to the repo, you can run a terraform plan. If your plan is approved, you can run terraform apply. Cloud Posse even tried to fix it (and maintained our own fork for some time), but the discussion went nowhere and we moved on.
With Atlantis, there's no way to restrict who has access to unlock workspaces via the web GUI. The only way is to install your own authentication proxy in front of it or restrict it in your load balancer.
With Atlantis, you have to expose the webhook endpoint publicly to GitHub.
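Dependency management between components amounts to walking a dependency graph: when one component changes, everything downstream of it needs to be re-planned, in order. The following is a minimal sketch of that idea using Python's standard `graphlib`; the component names and the `DEPENDS_ON` map are hypothetical and do not represent any particular tool's implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical components mapped to the components they depend on.
DEPENDS_ON = {
    "vpc": set(),
    "dns": set(),
    "eks": {"vpc"},
    "rds": {"vpc"},
    "app": {"eks", "rds", "dns"},
}


def affected_components(changed: str, depends_on: dict) -> list:
    """Return `changed` plus all transitive dependents, in apply order."""
    affected = {changed}
    grew = True
    while grew:  # keep expanding until no new dependents are found
        grew = False
        for component, deps in depends_on.items():
            if component not in affected and deps & affected:
                affected.add(component)
                grew = True
    # Order only the affected subset, dependencies first.
    subgraph = {c: depends_on[c] & affected for c in affected}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(affected_components("vpc", DEPENDS_ON))
```

In this example a change to `vpc` triggers `eks`, `rds` and then `app`, while `dns` is left alone because nothing about it changed.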

What about using GitHub Actions?

We provide a suitable alternative to Spacelift using GitHub Actions for companies looking to unify their deployments under one common platform.

It's an entirely free and open-source alternative that uses atmos and works with self-hosted GitHub Runners as well as GitHub Cloud.

What about using GitLab/Jenkins/etc?

There are plenty of examples of using other tools to implement continuous delivery for Terraform. However, solving for all the edge cases is what makes it so complicated, and these approaches seldom, if ever, handle them. For example:

Where will you store the plan files which are required for approvals? (plan → approve → apply workflow) Note, these planfiles may contain root-level credentials to things like RDS databases, which cannot be avoided.
How will you clean up those planfiles? Should they persist after a terraform apply succeeds or crashes?
How will you implement approval steps? If the approval is denied, how will you clean up the terraform planfile?
If you have multiple open PRs (e.g. many plans) for one workspace, after applying one, all other plans need to be invalidated. How will you implement that invalidation?
Git is only one source of truth for infrastructure as code. Data sources are another (e.g. terraform remote state). How will you reconcile that your state is current and update it when it drifts? When it drifts, how will you be notified?
How will you know that your infrastructure changes are applied everywhere? If a build fails, but the code is already merged, how do you escalate and ensure it's resolved?
If you need to lock an environment from being updated, how will you do it?
How will you suggest the changes? If the plan is to comment on the PR, that gets VERY noisy, and everyone subscribed will receive the notification. Runs may also accidentally leak secrets in the output. GitHub comments are limited to 65K bytes, which means large plans will need to be split across multiple comments.
What happens if multiple PRs are merged that want to modify the same environment? How will you enforce ordered consistency?
How will you restrict who can run terraform plans and applies? Furthermore, how will you restrict it to specific environments?
How will you provide short-lived IAM credentials to the Terraform processes? Any hardcoded credentials that are exposed will be a major liability.
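One of the questions above, splitting a large plan across several comments, gives a feel for how much glue code these edge cases need. The sketch below splits text into chunks that each fit under a byte limit, breaking on line boundaries where possible. The default of 65,000 bytes is an assumption that leaves a little headroom under the limit mentioned above, and the function name is hypothetical; posting the chunks and redacting secrets are left out entirely.

```python
def split_for_comments(text: str, limit: int = 65_000) -> list:
    """Split text into chunks of at most `limit` UTF-8 bytes each."""
    if limit < 4:
        raise ValueError("limit must be at least 4 bytes")

    # First pass: break the text into pieces that individually fit.
    pieces = []
    for line in text.splitlines(keepends=True):
        if len(line.encode("utf-8")) <= limit:
            pieces.append(line)
        else:
            # A UTF-8 character is at most 4 bytes, so limit // 4
            # characters can never exceed the byte limit.
            step = limit // 4
            pieces.extend(line[i:i + step] for i in range(0, len(line), step))

    # Second pass: pack pieces greedily into chunks.
    chunks, current, size = [], [], 0
    for piece in pieces:
        piece_size = len(piece.encode("utf-8"))
        if current and size + piece_size > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(piece)
        size += piece_size
    if current:
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    plan_output = "resource will be updated in-place\n" * 5000
    parts = split_for_comments(plan_output)
    print(len(parts), "comments needed")
```

Even this small piece says nothing about ordering the comments, cleaning up stale ones, or avoiding notification noise, which is the point of the list above.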

Why not use Terraform Cloud?

Terraform Cloud is prohibitively expensive for most non-enterprise customers we work with, and possibly 10x the cost of Spacelift. Terraform Cloud for Teams doesn't permit self-hosted runners and requires hardcoded IAM credentials in each workspace. That's insane and we cannot recommend it. Terraform Cloud for Business (and higher) supports self-hosted runners, which can leverage AWS IAM instance profiles, but the number of runners is a significant factor in the cost. When leveraging several hundred loosely coupled Terraform workspaces, there is a significant need for a lot of workers for short periods of time. Unfortunately, even if those are only online for a short period of time, you need to commit to paying for them for the full month on an annualized basis. Terraform Cloud also requires that you use their state backend, which means there's no way to “break glass” and run Terraform if they are down. If you want to migrate off of Terraform Cloud, you need to migrate the state of hundreds of workspaces out of the platform and into another state backend.

What Distinguishes us from All the Rest?

Based on Open Source. Everything we do is available for free today on our GitHub. This is our proof we know what we're talking about. “What You See is What You Get” – no other company can provide such a comprehensive solution based on Open Source.
Free Weekly Office Hours. Our commitment to helping others is in our DNA. We want to ensure you get the maximum value out of your investment.
Massive Community Adoption ensures our projects get regular updates and bug fixes.

References
[1] https://en.wikipedia.org/wiki/Millennium_Tower_(San_Francisco)#Sinking_and_tilting_problem
[2] https://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-migration/aws-landing-zone.html
