Your foundational platform is how you consistently deliver your applications. For everything else, we recommend using a fully managed service.
We support two popular options for platforms, including Kubernetes using EKS and ECS Fargate. While we love Kubernetes, we fully recognize it is not the right option for every business. Therefore, as part of our Engagement Workbook process, we work with you to figure out what is the right fit for you at the stage you're at.
For Kubernetes, we provision clusters across multiple account stages, along with deploying all the backing services required by the platform. We support Spot Instances, GPUs, Auto Scaling, and pretty much everything else you would expect from Kubernetes.
These are a few notable controllers we provision by default:
- We deploy the
amazon-load-balancer-controllerwhich provides native integration with Amazon's Application Load Balancers (ALBs) as well as Network Load Balancers (NLBs). One important thing to note is that on Kubernetes, it's not uncommon to have multiple Ingress controllers deployed at the same time.
- We deploy
cert-managerto manage TLS certificates; for example, it can manage AWS PrivateCA certificates that work with ALBs.
- We deploy
external-dnsto automatically manage Route53 records.
Everything we deploy on Kubernetes is backed by Infrastructure as Code. Our standard approach to packaging applications for Kubernetes is with Helm and is typically deployed by either Terraform, Helmfile or ArgoCD (or a combination of these tools). IAM Service Account Roles (IRSA) are used everywhere possible to eliminate hardcoded credentials.
Elastic Container Service (ECS) Fargate
For ECS, we'll typically provision both public and private ECS clusters. ALBs and API Gateways are then provisioned for each cluster respectively. We use GitHub Actions in combination with Spacelift to manage all updates to the clusters.
Common platform-level services like AWS WAF, AWS Global Accelerator and AWS Client VPN are popular options for our customers. We have all the terraform modules ready to go.
Building out the platform also involves establishing a service catalog consisting of the most common building blocks—things like your RDS Aurora Databases, S3 buckets, Redis clusters, etc. We leverage our vast catalog of 200+ terraform modules and pull in only what you need.
Our reference architecture includes native terraform support for the following related services.
Frequently Asked Questions
We can start as soon as you sign our Statement of Work. Typically we see this process take 2-3 weeks from the first introductory call to the start of our engagement.
Here's our checklist we'll need to complete before we can start.
- Execute Mutual NDA (ours or yours)
- Collaborate on Engagement Workbook via Google Docs
- Execute Statement of Work, and Master Services Agreement
- Deposit Payment
We can kick off the initial introductory call immediately, so please make sure that you schedule it today.
After talking with you and assessing if we're a proper fit, we'll execute a Mutual NDA and then send over an Engagement Workbook so we can gather all the requirements for your project and estimate the cost.
- Based Open Source. Everything we do is available for free today on our GitHub. This is our proof we know what we're talking about. “What You See is What You Get” – no other company can provide such a comprehensive solution based on Open Source.
- Free Weekly Office Hours Our commitment to helping others is in our DNA. We want to make sure you get the maximum value out of your investment.
- Massive Community Adoption ensures our projects get regular updates and bug fixes.
If everything is open sourced, why don't teams just do it themselves instead of work with Cloud Posse?
Anyone is free to fork our repositories and try themselves, but our support eliminates the guesswork and shortens the time it takes to implement correctly.
Think of it like this: anyone can walk into a hardware store and pick up the materials to build a house. Very few people can build a house that won't fall down if they don't have the experience of using all the tools and hardware correctly. We fill the gap by providing the knowledge and experience to get you where you want to be faster than doing it yourself.
Cloud Posse does offer documentation as part of the engagements but the audience is for experienced developers, so if different documentation is required, these can be created upon request.
It really depends on when a contract begins and who on our team is on the bench. Generally, we like to put (2) engineers on a project so we have cross-training and continuity in the event a member needs to take time off. Our team is geographically distributed across the continental US as well as Eastern Europe. Throughout the course of a project, we may move team members between projects depending on their subject matter expertise.
We provide entirely optional ongoing support for customers who've gone through our DevOps Accelerator.
By in large, most of our customers take over the day to day management of their infrastructure.
We're here though to help out anywhere you need it.
We do not provide 24×7 “on-call” (aka PagerDuty) support.
- Slack. You will have direct access to the team via a shared Slack channel between our respective teams.
- Zoom. We'll have weekly scheduled cadence calls via Zoom to review the current progress, blockers and give product demos in your environment. These calls can be recorded and shared with your team.
- Google Drive. We also recommend creating a shared Team Drive folder via Google Docs for the sharing of relevant design docs, agendas or other materials.
- Trello. We manage the project via a Trello Team created specifically for each engagement. We invite your team and our team to this team and create (1) board per sprint. This allows us to standardize our process while providing transparency along the way.
- Office Hours. Most engagements include a “Documentation & Training” sprint, we arrange a weekly “Office Hours” via Zoom (recorded) to answer any questions your team may have as they begin to kick the tires.
We'll deliver the end-to-end solution you've seen in all of our demos. It will be preconfigured for your environments under your AWS accounts. We'll create new GitHub repos that will contain all the infrastructure code you need.
Along the way, we'll show you the ropes and how to operate it. In the long run, you'll be responsible for operating it but we'll stick around for as long as you need our help.
It depends. Your best bet is to schedule a discovery call with us so we can go over your specific concerns. Assuming your software runs on Linux and that you're able to make any necessary code changes to ensure your applications are “12 Factor App” compliant, there's a very high likelihood we'll be able to help you out.
What’s it costing your business if you wait?
The longer you wait, the more time & effort you'll waste on maintenance rather than innovation. The more tech debt you'll amass. The more opportunities you'll miss.
Your developers will be less productive, which means you'll be paying more while getting less done in return.
The sooner you streamline your operations the faster you will move:
- Reduce your opportunity cost and capitalize on the investment sooner
- Release more features to customers faster
- Control operating costs to do more for less
Not to mention, your developers will love you for making their lives easier. The last thing developers want is to do things by hand.
- Gruntwork doesn't provide open access to all their modules, they are a subscription service. Cloud Posse open sources everything.
- All of our code is in GitHub and can be forked and used with no concerns about licensing issues (APACHE2).
- Gruntwork's Reference Architecture requires Terragrunt
- Gruntwork is not a consulting company. They do not help with hands-on implementation. That's left up to you.
- We provide a comprehensive project plan consisting of hundreds of implementation tasks and design decisions that we execute together with your team.
- Our Slack community is free for anyone to join, not just paying customers.
- Because our work is Open Source, there's a lower barrier to getting started. That's why it's in use by thousands and thousands of companies. We receive dozens of Pull Requests every week enhancing our modules and fixing bugs.
We'll answer this based on our experience.
For Terraform Continuous Integration (CI), we use GitHub Actions with all of our modules. This works very well for us since we rely on GitHub. Then on a nightly basis, we run aws-nuke to clean up our environments, since failing tests frequently orphan resources that cost money and can conflict with other tests.
For a proper Terraform Continuous Delivery (CD) workflow, we think your best bet is to start with a SaaS solution and learn from that. Your options are Terraform Cloud, Scalr, Spacelift. Terraform CD is non-trivial to do well. You can easily stick Terraform into any pipeline, but a well-built terraform CD pipeline will have a
terraform plan → planfile → approval →
apply workflow. You'll need to stash the planfile somewhere and the planfile may contain secrets.
We work with companies who need to own their infrastructure as their competitive advantage.
Our customers are typically post-Series A technology startups who are seeing success in the market and need to accelerate their DevOps adoption in order to take their company to the next level.
They are backed by some of the biggest names in the industry and are solving really difficult problems with technology.
We help companies own their infrastructure in record time by building it together with your team and then showing them the ropes. We stick around for as long as it takes for you to become successful.
Our SweetOps™ process eliminates the guesswork so you get everything you need for a successful cloud migration from the bottom up.
- We Build Your Infrastructure. We implement everything you need from your cloud platform using Infrastructure as Code.
- You Own It. You achieve it in just a few months. We show you how to ride it along the way.
- You Drive It. Customize everything or anything you want. It's your infrastructure.
You get a predictable outcome that is delivered on time and within budget.
There are no long term commitments. No license fees. No strings attached.
Plus, we stick around for as long as you need our help.
Sounds like pretty good deal, right?
No, it's absolutely FREE for anyone to attend.
Can you help me understand where the boundaries of CloudPosse's responsibilities end, and where ours would start?
Cloud Posse's mission is to help companies own their infrastructure. We accelerate this journey by architecting your 4 layers with you and by taking the lead on the implementation. Since we have an opinionated framework, customers will need to learn how to leverage everything for their use cases. This will sometimes mean altering how you build and deploy your services.
Getting Started With Us
We always prefer to start with a green-field approach, where we build your infrastructure from the ground up together with your team. As part of our process, we'll walk you through all of the required design decisions, ensuring you have sufficient context to make informed decisions. This is why we expect our customers to have someone on their engineering team invested in the outcome. This part is absolutely critical, as it ensures what we deliver suits your business needs. Everything we do is delivered by pull request for your review and we will happily provide documentation on anything you want. Along the way, we'll assign homework exercises and provide ample documentation. This approach provides the best opportunity to gain a deep hands-on understanding of our solution.
We encourage you to ask as many questions as you want and challenge our assumptions. You also can volunteer for any task you want to take on as “homework” and we'll help you out as needed.
When You Own It
Once our job is done, this is where you take the driver's seat. We'll help you get everything set up for a smooth transition from your heritage environment to your shiny new infrastructure. Rest assured that we'll stick around until your team is confident and has the know-how to operate these platforms in production. We don't expect teams to pick this up overnight, that's why we'll stay engaged for as long as you need. We're happy to answer questions and jump on Zoom for pair programming sessions.
After our engagement, you will have a solid foundation powering your apps, and all the tools you need for infrastructure operations. This means your team is responsible for the ongoing maintenance, including upgrades (e.g. EKS clusters, and all open-source software), patching systems, incident response, triaging, SRE (e.g. adding monitors and alerts), as well as security operations (responding to incidents, staying on top of vulnerabilities/ CVEs). Cloud Posse is continuously updating its Open Source module ecosystem, but it's your responsibility to regularly update your infrastructure. Staying on top of these things is critical for a successful long-term outcome, with minimal technical debt.
For companies that want to focus more on their business and less on maintenance, we provide ongoing support engagements exclusively for customers that have completed our accelerator.
Check out our approach to learn more!
Can you walk through the typical lifecycle of a small change that you might help us with, specifically with how it relates to coordinating changes between your team and ours?
Every change in your environment starts with submitting a pull request as our solution is built with a fully GitOps driven approach. Depending on the
CODEOWNERS configuration for the repository, branch protections will require all pull requests to have approvals by specific stakeholders, in addition to requiring all checks to pass. We also try to automate as much of the review process as possible. For example, when the pull request is opened, it automatically kicks off a job to validate the change against your environment so you can see the impact of any change.
The coordination needed is simply about figuring out who will be responsible for each part of the release process. The tooling handles the rest and we have a policy-driven approach (Open Policy Agent) to enforce it.
- Who will submit the pull request, which is entirely dependent on your comfort level with the change, or if you prefer us to take the lead.
- Reviewing the pull request and applying changes to it as needed.
- Approving and merging the pull request.
- Validating and confirming the changes.
The toolchain in your CI/CD process provides Slack notifications and full audit history of everything that happens to give you optimal visibility and traceability.
Lastly, where applicable we implement blue/green rollout strategies for releases, but there are edge cases where a change could be disruptive to active development or live services. In such cases, these would be carefully coordinated to be released at an approved time.
Who will be the Tech Lead/Architect for our project and assurance that the lead will be fully allocated throughout the project (for continuity)?
We'll embed 1-2 engineers to work with your team through your project. Our preferred approach is to have multiple leads working in parallel so that we can ensure continuity throughout the engagement. Working with Cloud Posse, it is our responsibility to ensure continuity throughout this engagement. We have various subject matter experts that we'll swap in and out of the project and I'll be directly involved through the entire process. The way we achieve greater continuity is by ensuring everything is well documented as we go, opening pull requests for all work, synchronizing branches regularly, and the tasks are all well-defined in Jira. Every single call is recorded and shared with our team (via Gong), in addition to this, all design decisions are recorded in Jira issues and referenced throughout the project for context. Typically we have one engineer allocated to each Sprint and parallelize work by commencing multiple concurrent sprints. You can expect 4-6 Cloud Posse engineers to be involved and contributing.
Spacelift checks off all the boxes for managing extremely large environments with a lot of state management. Since Cloud Posse's focus is on deploying large-scale loosely coupled infrastructure components with Terraform, it's common to have several hundred terraform states under management.
Every successful business in existence uses accounting software to manage its finances and understand the health of its business. The sheer number of transactions makes it infeasible to reconcile the books by hand. The same is true of modern infrastructure. With hundreds of states managed programmatically with terraform, and modified constantly by different teams or individuals, the same kind of state reconciliation is required to know the health of its infrastructure. This need goes far beyond continuous delivery and few companies have solved it. With Spacelift, you have an up-to-date view of your assets, liabilities & tech debt across all environments.
- Drift Detection runs on a customizable schedule surfaces inconsistencies with what's deployed and what's in git.
- Reconciliation helps you know what's deployed, what's failing, and what's queued.
- Plan Approvals ensures changes are released when you expect them
- Policy Driven Framework based on OPA (open source standard) is used to trigger runs and enforce permissions. This is like IAM for GitOps.
- Terraform Graph Visualization makes it easier to visualize the entire state across components
- Audit Logs of every change traced back to the commit and filterable by time
- Affordable alternative to other commercial offerings
- Works with more than Terraform (e.g. Pulumi)
- Pull Request Previews show what the proposed changes are before committing them
- Decoupling of Deploy from Release ensures we can merge to trunk and still control when those changes are propogated to environments
- Ephemeral Environments (Auto Deployment, Auto Destruction) enables us to bring up infrastructure with terraform and destroy it when it's no longer needed
- Self-hosted Runners ensure we're in full control over what is executed in our own VPC, with no public endpoints
What level of access do the Spacelift worker pools have?
Spacelift Workers are deployed in your environment with the level of permission that we grant them via IAM instance profiles. When provisioning any infrastructure that requires modifying IAM, the minimum permission is administrative. Thus, workers are provisioned with administrative permissions in all accounts that we grant access to since the terraform we provision requires creating IAM roles and policies. Note, this is not a constraint of Spacelift; this is required regardless of the platform that performs the automation.
What happens if Spacelift as a product goes away?
First off, while Spacelift might be a newer brand in the infrastructure space, it's used by publicly traded companies, Healthcare companies, banks, institutions, Fortune 500 companies, etc. So, Spacelift is not going away.
But just to entertain the hypothetical, let's consider what would happen. Since we manage all terraform states in S3, we have the “break glass” capability to leave the platform at any time and can always run terraform manually. Of course, we would lose all the benefits.
How tough would it be to move everything to a different platform?
Fortunately, with Spacelift, we can still use S3 as our standard state backend. So if at any time we need to move off of the platform, it's easy. Of course, we'd give up all the benefits but the key here is we're not locked into it.
Why not just use Atlantis?
We used to predominately recommend Atlantis but stopped doing so a number of years ago. The project was more or less dormant for 2-3 years, and only recently started accepting any Pull Requests. Atlantis was the first project to define a GitOps workflow for Terraform, but it's been left in the dust compared to newer alternatives.
- With Alantis, there is no regular reconcilation of what terraform state has been applied or not applied. So we really have no idea in atlantis the actual state of anything. With a recent customer, we helped migrate them from Atlantis to Spacelift and it took 2 months to reconcile all the infrastructure that had drifted.
- With Atlantis, there's no drift detection, but with spacelift, we detect it nightly (or as frequently as we want)
- With Atlantis, there's no way to manage dependencies of components, so that when one component changes, any other components that depend on it should be updated.
- With Atlantis, there's no way to setup OPA policies to trigger runs. The OPA support in atlantis is very basic.
- With Atlantis, anyone who can run a plan, can exfiltrate your root credentials. This talked about by others and was recently highlighted at the Defcon 2021 conference.
- With Atlantis, there's no way to limit who can run terraform plan or apply. If you have access to the repo, you can run a terraform plan. If your plan is approved, you can run terraform apply. Cloud Posse even tried to fix it (and maintained our own fork for some time), but the dicussion went no where and we moved on.
- With Atlantis, there's no way to restrict who has access to unlock workspaces via the web GUI. The only way is to install your own authetnication proxy in front of it or restrict it in your load balancer.
- With Atlantis, you have to expose the webhook endpoint publically to GitHub.
What about using GitHub Actions/GitLab/Jenkins/etc?
There are plenty of examples of using other tools to implement continuous delivery for Terraform. However, it's solving for all the edge cases which makes it so complicated and therefore seldom, if ever handled by these approaches.
- Where will you store the plan files which are required for approvals? (plan → approve → apply workflow) Note, these planfiles may contain root-level credentials to things like RDS databases, which cannot be avoided.
- How will you clean up those planfiles? Should they persist after a terraform apply succeeds or crashes?
- How will you implement approval steps? If the approval is denied, how will you clean up the terraform planfile?
- If you have multple open PRs (e.g. many plans) for one workspace, after applying one, all other plans need to be invalidated. How will you implement that invalidation?
- Git is only one source of truth for infrastructure as code. Data sources is another (e.g. terraform remote state). How will you reconcile that your state is current and update it when it drifts? When it drifts, how will you be notified?
- How will you know that your infrastructure changes are applied everywhere? If a build fails, but the code is already merged, how do you escalate and ensure it's resolved?
- If you need to lock an environment from being updated, how will you do it?
- How will you suggest the changes? If the plan is to comment on the PR, that gets VERY noisy and everyone subscribed will receive the notification. Runs may also accidentally leak secrets in the output. GitHub comments are limited to 65K bytes, which means large plans will need to be split across multiple comments.
- What happens if you have multiple PRs merged that want to modify the same environment? How will you enforce an ordered consistency?
- How will you restrict who can run terraform plans and applies? Further more, how will you restrict it to specific environments?
- How will you provide the short-lived IAM credentials to the terraform processes? e.g. any hardcoded credentials exposed will be a major liablility
Why not use Terraform Cloud?
Terraform Cloud is prohibitively expensive for most non-enterprise customers we work with, and possibly 10x the cost of Spacelift. Terraform Cloud for Teams doesn't permit self-hosted runners and requires hardcoded IAM credentials in each workspace. That's insane and we cannot recommend it. Terraform Cloud for Business (and higher) support self-hosted runners which can leverage AWS IAM Instance profiles, but the number of runners is a significant factor of the cost. When leveraging several hundred loosely-coupled terraform workspaces, there is a significant need for a lot of workers for short periods of time. Unfortunately, even if those are only online for a short period of time, you need to commit to paying for them for the full month on an annualized basis. Terraform Cloud also requires that you use their state backend, which means there's no way to “break glass” and run terraform if they are down. If you want to migrate off of Terraform Cloud, you need to migrate the state of hundreds of workspaces out of the platform and into another state backend.