AWS Cost Controls


We've all been there: that dreaded email notification from AWS with our new cloud bill. We resolve to reduce our costs next month, but another month rolls around and not much has changed.

Here are some of the tools and tricks at your disposal for controlling your AWS cloud spend and reducing your costs. There's an overwhelming number of resources out there to reduce costs. Knowing which ones to prioritize will largely depend on your unique situation and architecture.

“Free” Money

  • AWS Activate Credits
    • VC backed startups are eligible to receive up to $100,000 in usage credits (one time only)
  • Credits for Testing & POC Implementation

Tricks to Reduce Network Transit Costs

  • Move Data Transfer Costs to the Edge (e.g. CDN)
    • Use Cloudflare to radically reduce transfer costs (Cloudflare has zero bandwidth fees)
    • Use CloudFront
  • Configure an S3 Endpoint in your VPC
  • Optimize NAT Traffic
    • Use NAT instances rather than NAT gateway appliances for dev/test environments
  • Use internal private IPs rather than public IPs or Elastic IPs
    • Use CNAMEs to AWS hostnames for resources that need to be accessed both internally and externally, to take advantage of split-horizon DNS
    • Use A records to Private IPs for hostnames that should be accessed strictly internally
  • Use a Docker Pull-Thru Registry Cache
    • The pull-thru cache reduces ingress traffic, but it's only really effective if you run one cache per AZ
  • Reduce Availability Zones where suitable to reduce cross-zone traffic
    • For dev/test environments, use a single AZ to reduce cross-AZ transit costs.
    • Traffic within an AZ is free.
  • Reduce Inter-zone/region traffic
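As a concrete example of the S3 endpoint tip above, the gateway endpoint is a single API call. Here is a minimal boto3-style sketch; the region, VPC ID, and route table ID are hypothetical placeholders:

```python
# Sketch: route S3 traffic through a free gateway endpoint so it never
# traverses your NAT gateway (avoiding NAT data-processing charges).
# The VPC and route table IDs below are hypothetical placeholders.
def s3_endpoint_params(region, vpc_id, route_table_ids):
    """Build the parameter dict for ec2.create_vpc_endpoint()."""
    return {
        "VpcEndpointType": "Gateway",  # gateway endpoints for S3 are free
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": route_table_ids,
    }

params = s3_endpoint_params("us-east-1", "vpc-0abc1234", ["rtb-0def5678"])
# With boto3: boto3.client("ec2").create_vpc_endpoint(**params)
```

Once the endpoint is in place, S3 API calls from instances in those route tables stay on the AWS network automatically; no application changes are required.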

Shave Compute Costs

  • Invest in an AWS Savings Plan
  • Buy (convertible) Reserved EC2 Instances
    • Reserved Instances are complicated, but “convertible” RIs are a no-brainer.
    • Pro tip: Use to manage this process for you
  • Move to a cheaper AWS region
  • Upgrade to newer generations of EC2 instance families
  • Use EC2 AMD Instances
  • Setup Spot Instances
    • Use to get almost immediate gratification
    • Use spot “fleets”
  • Leverage (or Tune) Auto-scaling Capabilities
    • Use the SpotInst Ocean controller for Kubernetes to “right-size” your EC2 instances
  • Right Size Resources
  • Optimize Kubernetes Workloads
    • Ensure proper memory & CPU requests and limits are set
    • Ensure pods can be rescheduled by setting proper annotations
  • Shutdown unused instances
    • Don’t just stop instances, but terminate instances to prevent racking up EBS costs
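To decide which of these compute levers to pull first, it helps to put rough numbers on them. A toy comparison with hypothetical prices (real rates vary by region, instance type, and term):

```python
# Toy monthly-cost comparison for a fleet of 10 instances under different
# pricing models. All rates and discount percentages are hypothetical;
# look up current prices for your region and instance family.
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate, count=10):
    """Monthly cost for `count` instances at a given hourly rate."""
    return hourly_rate * HOURS_PER_MONTH * count

on_demand = monthly_cost(0.096)        # illustrative on-demand rate
reserved = monthly_cost(0.096 * 0.60)  # ~40% off with a convertible RI
spot = monthly_cost(0.096 * 0.30)      # spot often trades at a steep discount
```

Even with made-up numbers, the ordering is the point: spot beats reservations, which beat on-demand, so interrupt-tolerant workloads should move to spot first.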

Reduce Database Costs

  • Use Aurora Serverless where appropriate
  • Use containerized databases for dev/test environments
  • “Right Size” Compute and Storage
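For dev/test environments, a containerized database is often good enough and adds nothing to the bill beyond the instance it runs on. A minimal docker-compose sketch; the image tag and credentials are illustrative only:

```yaml
# docker-compose.yml -- throwaway Postgres for dev/test only.
# The password and tag below are illustrative; never reuse real credentials.
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: devonly
    ports:
      - "5432:5432"
    tmpfs:
      - /var/lib/postgresql/data   # data is intentionally ephemeral
```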

Reduce Storage Costs

  • Backup your data in S3 rather than EBS or EFS to save big
  • Add Lifecycle Rules for Retention Policies on S3 buckets
    • Automatically rotate logs to Glacier or delete after N days
  • Use reduced redundancy for less mission-critical artifacts such as logs
  • Optimize EBS Volumes
    • Reduce oversized volumes
  • Optimize EBS Volume Snapshots
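The lifecycle rules mentioned above boil down to a small policy document. Here is a sketch of the structure S3 accepts (via put_bucket_lifecycle_configuration); the prefix and day counts are examples, not recommendations:

```python
# Sketch of an S3 lifecycle configuration: transition logs to Glacier
# after 30 days and delete them after a year. The "logs/" prefix and the
# day counts are illustrative placeholders.
lifecycle = {
    "Rules": [
        {
            "ID": "rotate-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }
    ]
}
# With boto3: boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle)
```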

Use the various tools at your disposal

  • Komiser (Free/Open Source)
  • KubeCost (Free/Open Source) – Tool to gain visibility into operating costs inside of your Kubernetes clusters
  • Goldilocks (Free/Open Source) – Tool to help “Right Size” your Kubernetes Pods

    Goldilocks Screenshot

  • (SaaS)
  • CloudCheckr (SaaS)
Author Details
Erik Osterman is a technical evangelist and insanely passionate DevOps guru with over a decade of hands-on experience architecting systems for AWS. After leading major cloud initiatives at CBS Interactive as the Director of Cloud Architecture, he founded Cloud Posse, a DevOps Accelerator that helps high-growth Startups and Fortune 500 Companies own their infrastructure in record time by building it together with customers and showing them the ropes.

Change Management


Here are some of the tools and tricks at your disposal for enforcing a modern change control process.

  • Use a Version Control System
    • With a VCS like GitHub, you'll be able to point to every change
    • With branches, you'll be able to keep pending changes out of master until they are ready
  • Use Infrastructure as Code
    • Define the business logic of infrastructure in reusable modules
    • Separate the business logic from the configuration
    • Stick all the code & configuration into VCS
  • Use Automation (E.g. “Operations by Pull Request”)
    • Eliminate humans from running commands by hand
    • Pipelines promote code and configuration changes through environments
  • Use a Pull Request workflow with Code Reviews & Approvals
  • Use Pipeline Approval Steps
  • Use Notifications
    • Send a slack notification for every deployment
    • Comment on GitHub Commit SHA for every deployment of that commit
  • Use Branch Protections
    • Require Pull Request Approvals
    • Dismiss approvals if changes are pushed
    • Require status checks to pass
    • Enforce CODEOWNERS
    • Use teams to denote stakeholders (e.g. @secops, @dba, @qa, or @frontend) to ensure approvals from subject matter experts
    • Use narrowly scoped paths for teams (E.g. terraform/iam/* @secops)
  • Use Policy Enforcement
    • Tools like Open Policy Agent, conftest, and tfsec help to define contracts and enforce them
    • Integrate the tools with your CI/CD pipelines (execute pipelines from the master branch to ensure they cannot be bypassed in the PR/branch)
  • Use Multiple Accounts, Stages
    • Test changes in isolation and use a formal process to promote changes
  • Use Version Pinning
    • Always pin your dependencies to a version (e.g. using semver or commit SHAs)
    • Never overwrite any version of the software, always create a new release/tag
    • Pinning to master or latest does not count!
  • Use Feature Flags
    • Feature flags can ensure that functionality is only turned on when it's ready and easily disabled
    • Controls around feature flag access limit who can toggle it and a change log of when it was modified
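To illustrate the kind of contract tools like Open Policy Agent and conftest enforce, here is the same idea reduced to plain Python: reject any change that introduces a public S3 bucket. The plan structure below is a deliberately simplified stand-in, not the real Terraform JSON schema:

```python
# Simplified policy check in the spirit of OPA/conftest: fail the pipeline
# if any S3 bucket in the (hypothetical, simplified) plan is public.
def violations(resources):
    """Return a list of human-readable policy violations."""
    problems = []
    for r in resources:
        if r.get("type") == "s3_bucket" and r.get("acl") == "public-read":
            problems.append(f"{r['name']}: public buckets are forbidden")
    return problems

plan = [
    {"type": "s3_bucket", "name": "assets", "acl": "public-read"},
    {"type": "s3_bucket", "name": "backups", "acl": "private"},
]
errors = violations(plan)  # non-empty => block the merge/deploy
```

In a real pipeline the equivalent check would be written as a Rego policy and run with conftest against the actual plan output, with a non-zero exit code failing the build.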




12 Factor App Checklist


Here are some practical considerations to help determine the 12-factor readiness of an application for deployment on Kubernetes. We like the 12-factor app pattern because these are the ideal types of applications to run on Kubernetes. While pretty much anything can run on Kubernetes (e.g. worst-case as a StatefulSet), applications that match the characteristics below will make the best use of the High Availability and scaling features Kubernetes has to offer. For example, these types of apps are typically suitable for Deployments that can leverage the HorizontalPodAutoscaler functionality. These actionable recommendations are based on our “real world” experience of deploying apps and microservices to Kubernetes, and may not follow the precise official/canonical definitions of the “12-factor” app.

  • Codebase
    • One-repo-per-service (poly-repo) architectures (preferred, not required)
    • Use a Git-based workflow with PRs & approval process
    • Dockerized with a Dockerfile
    • Automated tests exist to validate application works
  • Dependencies
    • Service dependencies are explicitly declared in the configuration (e.g. DB_HOST)
    • Services are loosely coupled so that they can be started in any order
    • Application dependencies are explicitly pinned in manifests (e.g. package.json)
    • Use semver rather than commit SHAs for pinning, where possible
  • Config
    • All configuration settings are passed via environment variables and not hardcoded.
    • Services can be dynamically reconfigured without recompilation (e.g. by changing settings)
    • Use DNS-based service discovery (instead of IPs or depending on Consul); use short DNS names with search domains rather than FQHNs
    • Use the AWS SDK's automatic configuration (e.g. rely on the default credential chain rather than validating that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set)
  • Backing Services
    • Use object storage where files are needed (not local storage)
    • Use external databases (e.g. Postgres, MySQL, Redis, etc) to persist state
    • Use environment variables for configuration (e.g. timeouts, endpoints, etc)
    • Use configurable timeouts on connections, responses from backends
  • Build, Release, Run
    • Well-defined process to build (e.g. compile) the application and start it (e.g. a Makefile)
    • Dockerfile defines ENTRYPOINT to run the application
    • Docker composition (docker-compose.yml) can bring up the environment for automated testing
    • Cut releases on merge to master (preferred, not required); use semver
  • Processes
    • All processes must expose health check endpoints (/healthz)
    • Should not depend on a process manager (e.g. pm2)
    • Should exit non-zero upon fatal errors
    • Must respond to SIGTERM and exit gracefully
    • Health checks should not depend on the health of the backing services
    • Does not require privileged execution (e.g. root)
  • Port binding
    • Dockerfiles define the port the service listens on (e.g. with EXPOSE)
    • Services should listen on a preconfigured bind-address and port
    • Should listen on non-privileged ports  (> 1024)
  • Concurrency
    • Application can be run any number of times in parallel (e.g. no expectation of locking)
    • Application does not maintain a large pool of persistent database connections (e.g. configurable pool size)
    • Application uses database transactions, if applicable; avoids deadlocks
    • Application does not depend on sticky sessions; requests can hit any process
  • Disposability
    • Should be entirely stateless (e.g. not maintain any local state, all state offloaded to backing services)
    • Processes can be easily created or destroyed without any orchestrated shutdown process
    • No POSIX filesystem required for persistence (local caching/buffering okay)
  • Dev/prod parity
    • All environments function the same way when configured with the same settings
    • Flags should enable/disable functionality without knowledge of stage or environment (e.g. do not use if ($environment == 'dev') { ... })
    • Do not use hostnames for conditional/routing logic (that's the job of Ingress)
  • Logs
    • Logs are emitted to stdout
    • Events are structured event streams (e.g. JSON)
    • Do not write logs to disk (to mitigate the need for log rotation)
  • Admin processes
    • Database migrations should be automated and run as a separate container
    • Cronjobs can be run as a separate container
    • Batch processing should run as a separate container
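Several of the checks above (config from the environment, structured logs on stdout, graceful SIGTERM handling) fit in a minimal sketch of a 12-factor-style process; the variable names and defaults are illustrative:

```python
import json
import os
import signal
import sys

# Config comes from the environment (the Config factor); the defaults here
# are illustrative dev-only fallbacks.
PORT = int(os.environ.get("PORT", "8080"))
DB_HOST = os.environ.get("DB_HOST", "localhost")

def log(event, **fields):
    """Emit one structured (JSON) event on stdout -- no files, no rotation."""
    line = json.dumps({"event": event, **fields})
    print(line)
    return line

def handle_sigterm(signum, frame):
    """Exit gracefully so the orchestrator can reschedule the pod."""
    log("shutdown", reason="SIGTERM")
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
log("startup", db_host=DB_HOST, port=PORT)
# A real service would now bind 0.0.0.0:PORT and serve GET /healthz.
```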

CI/CD Platform Requirements

Considerations for CI/CD Platforms

Must Support

  • Must support “pipelines as code” and auto-discover branches (ideally repos too)
    • E.g. Jenkinsfile, codefresh.yml, .circleci/config.yml, .travis.yml
  • Must support shared secrets and settings
    • E.g. GitHub Actions does not support shared secrets
  • Must support easy parallelization
    • Speed is critical. One way to speed up tests is to parallelize steps.
  • Must support easy integration with Kubernetes (I don’t want to manage the setup)
    • First-class Kubernetes support is essential
    • Minimal extra tooling should be required for Kubernetes-backed deployments
  • Must use container-backed steps
  • Must support webhook events from PRs originating from untrusted forks (E.g. open-source projects)
  • Must support ChatOps style requests
    • E.g. comments on PRs can trigger pipelines, slack commands can retry builds
  • Must support SSO
  • Must integrate with Slack
    • Slack notifications should be customizable
    • Slack should replace the need for most email notifications
  • Must be affordable
    • Platform should not require long-term commitments (SaaS)
    • Cost $10-20/user max
    • Support “unlimited” builds or have a pay-per-build model
    • Startup pricing preferred

Should Support

  • Should support approval steps with Slack notifications (ideally use slack buttons)
  • Should support RBAC/ABAC ACLs (e.g. for approval steps)
  • Should support local execution of pipelines for debugging
  • Should make it easy to discover all open PRs so it’s easy to re-trigger
  • Should support remote debugging (basically drop into any step via remote shell and poke around)
  • Should make it easy to tag multiple versions of a docker image
  • Should support a “library” of pipelines or pipeline steps
  • Should support GitHub deployment notifications
  • Should support multiple OS-build platforms (e.g. iOS, OSX, Windows, Linux, Android, etc)


Go Live Checklist


Here's a typical example of a “Go Live” checklist to run through before greenlighting a cut-over to a new production environment.

Preflight Checklist

  • Rollback Plan
    • Verify Backup Integrity and Recency
    • Test Restore Process
    • Ensure ability to perform software rollbacks with automation (E.g. CI/CD)
  • External Availability Monitoring
    • Enable “Real User Monitoring” (RUM). Establish a 1-2 week baseline before launch
    • Enable external synthetic tests 2-4 weeks before launch to identify any potential stability problems (e.g. during deployments)
  • Tune Kubernetes/ECS
    • Ensure memory and CPU constraints are suitable for production
    • Ensure that all pods have limits and requests set, because this is how the scheduler knows where to put things
    • Ensure services gracefully recover when nodes or pods are killed
  • Alert Escalations
    • Ensure on-call engineers have mobile devices properly configured to alert
    • Review on-call schedule to ensure continuity
    • Prepare run books and link them to alerts
  • Performance Load Tests
    • Replicate production workloads to ensure systems handle as expected
  • Exception Logging
    • Ensure you have frontend/javascript exception logging enabled (E.g. Sentry, Datadog, NewRelic)
  • Reduce DNS TTLs
    • Set TTLs to 60 seconds on branded domains
    • Set the SOA minimum TTLs on your zones to 60 seconds to mitigate the effects of negative DNS caching
  • Prepare Maintenance Page
    • Provide a means to display a maintenance page (if necessary)
    • Should be a static page (e.g. hosted on S3)
  • Schedule Cut Over
    • Identify all relevant parties, stakeholders
    • Communicate scope of migration and any expected downtime
  • Perform End-to-End Tests
    • Verify deployments are working
    • Verify software rollbacks are working
    • Verify autoscaling is working (pods and nodes)
    • Verify TLS certificates are in working order (non-staging)
    • Verify TLD redirects are working
  • Establish Status Page
    • Integrate with all status checks for dependent services
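Lowering DNS TTLs ahead of the cut-over is easy to script. Here is a sketch of a Route 53 change batch that drops a record's TTL to 60 seconds; the zone ID, record name, and IP are hypothetical placeholders:

```python
# Sketch: lower a record's TTL to 60s before cut-over so clients re-resolve
# quickly when the record changes. Name, value, and zone are placeholders.
change_batch = {
    "Comment": "Lower TTL ahead of production cut-over",
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        }
    ],
}
# With boto3: boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z123EXAMPLE", ChangeBatch=change_batch)
```

Remember that the old TTL must expire before the new one takes effect, so make this change at least one old-TTL interval before the migration window.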

Post-Cut-over Checklist

  • Review exception logs
  • Monitor customer support tickets
  • Monitor non-200 status codes for anomalies
    • Spikes in 401 and 403 could indicate authorization problems
    • Spikes in 404 could indicate missing assets
    • Spikes in 30x could indicate redirect problems
    • Spikes in 500 could indicate server misconfigurations
    • Spikes in 503 could indicate problems with the platform (insufficient resources, or low limits)
  • Ensure robots.txt permits indexing
    • Ensure all mainstream crawlers are happy
    • Ensure sitemap is reachable
  • Check Real End User Data
    • Verify that end-user experience has not degraded
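Spotting the status-code spikes described above can be automated with a trivial baseline comparison; the counts and the 3x threshold here are made up for illustration:

```python
# Toy anomaly check: flag status codes whose count grew more than `factor`
# times over the pre-launch baseline. Counts and threshold are illustrative.
def spikes(baseline, current, factor=3.0):
    """Return status codes whose count exceeds factor * baseline count."""
    return sorted(
        code for code, count in current.items()
        if count > factor * baseline.get(code, 1)
    )

baseline = {"200": 10000, "404": 50, "500": 5}
current = {"200": 11000, "404": 400, "500": 40, "503": 25}
alerts = spikes(baseline, current)  # ["404", "500", "503"]
```

In practice you would feed this from your log aggregator or metrics backend and wire any non-empty result into the same alert escalation path as the rest of the launch.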