AWS Cost Controls

We've all been there: that dreaded email notification from AWS with our new cloud bill. We resolve to reduce our costs next month, but another month rolls around and not much has changed.

Here are some of the tools and tricks at your disposal for controlling your AWS cloud spend and reducing your costs. There's an overwhelming number of resources out there to reduce costs. Knowing which ones to prioritize will largely depend on your unique situation and architecture.

“Free” Money

AWS Activate Credits
- VC backed startups are eligible to receive up to $100,000 in usage credits (one time only)
Credits for Testing & POC Implementation

Tricks to Reduce Network Transit Costs

Move Data Transfer Costs to the Edge (e.g. CDN)
- Use Cloudflare to radically reduce transfer costs (Cloudflare has zero bandwidth fees)
- Use CloudFront
Configure an S3 Endpoint in your VPC
- Reduces Egress traffic over your NAT gateways (or instances) – one of the biggest costs
- Use VPC Endpoints with AWS Services (E.g. S3) where possible
Optimize NAT Traffic
- Use NAT instances rather than NAT gateway appliances for dev/test environments
Use internal private IPs rather than public IPs or Elastic IPs
- Use CNAMEs to AWS hostnames for resources that need to be accessed internally and externally to take advantage of the split-horizon DNS
- Use A records to Private IPs for hostnames that should be accessed strictly internally
Use a Docker Pull-Thru Registry Cache
- The pull-thru cache reduces ingress traffic, but it is only really effective when each AZ has its own cache to hit
Reduce Availability Zones where suitable to reduce cross-zone traffic
- For dev/test environments, use a single AZ to reduce cross-AZ transit costs.
- Traffic within an AZ is free when it stays on private IPs.
Reduce Inter-zone/region traffic

Shave Compute Costs

Invest in an AWS Savings Plan
Buy (convertible) Reserved EC2 Instances
- Reserved Instances are complicated, but “convertible” RIs are a no-brainer.
- Pro tip: Use spot.io to manage this process for you
Move to a cheaper AWS region
Upgrade to newer generations of EC2 instance families
Use EC2 AMD Instances
- https://cloudonaut.io/6-new-ways-to-reduce-your-AWS-bill-with-little-effort/
Set up Spot Instances
- Use Spotinst (now Spot.io) to get almost immediate gratification
- Use spot “fleets”
Leverage (or Tune) Auto-scaling Capabilities
- Use the Spotinst (now Spot.io) Ocean controller for Kubernetes to “right-size” your EC2 instances
Right Size Resources
Optimize Kubernetes Workloads
- Ensure proper memory & CPU requests and limits are set
- Ensure pods can be rescheduled by setting proper annotations
Shut down unused instances
- Don't just stop instances; terminate them, because stopped instances keep accruing charges for their attached EBS volumes
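
One way to check the Kubernetes item above (requests and limits on every container) is a small lint over the pod spec. The sketch below assumes the spec has already been loaded into a plain dictionary (for example from YAML or the API); `missing_resources` is a hypothetical helper name, not part of any Kubernetes client.

```python
from typing import Dict, List

# Resource names we expect under both "requests" and "limits"
REQUIRED = ("cpu", "memory")


def missing_resources(pod_spec: Dict) -> List[str]:
    """Return 'container: kind.name' for every unset request or limit."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for kind in ("requests", "limits"):
            values = resources.get(kind) or {}
            for name in REQUIRED:
                if name not in values:
                    problems.append(f"{container.get('name', '?')}: {kind}.{name}")
    return problems
```

For a spec in which a container named `sidecar` sets requests but no limits, the function returns `['sidecar: limits.cpu', 'sidecar: limits.memory']`, which could be used to fail a CI step.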

Reduce Database Costs

Use Aurora Serverless where appropriate
Use containerized databases for dev/test environments
“Right Size” Compute and Storage

Reduce Storage Costs

Backup your data in S3 rather than EBS or EFS to save big
Add Lifecycle Rules for Retention Policies on S3 buckets
- Automatically rotate logs to Glacier or delete after N days
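Use a lower-redundancy, lower-cost storage class (e.g. S3 One Zone-IA) for less mission-critical artifacts such as logs; the older Reduced Redundancy Storage class is no longer recommended by AWS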
Optimize EBS Volumes
- Reduce oversized volumes
Optimize EBS Volume Snapshots
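
The lifecycle rule item above (rotate logs to Glacier or delete after N days) can be sketched as follows. The helper names, the `logs/` prefix, and the day counts are illustrative assumptions; the dictionary follows the shape boto3 expects for `put_bucket_lifecycle_configuration`.

```python
from typing import Dict, List, Optional


def log_lifecycle_rule(prefix: str, glacier_after_days: int,
                       delete_after_days: Optional[int] = None) -> Dict:
    """Build one rule: move objects under `prefix` to Glacier, then optionally expire them."""
    if delete_after_days is not None and delete_after_days <= glacier_after_days:
        raise ValueError("delete_after_days must be greater than glacier_after_days")
    name = prefix.strip("/").replace("/", "-") or "all"
    rule = {
        "ID": f"archive-{name}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
    }
    if delete_after_days is not None:
        rule["Expiration"] = {"Days": delete_after_days}
    return rule


def apply_rules(bucket: str, rules: List[Dict]) -> None:
    """Apply the rules with boto3 (needs AWS credentials).

    Note: this call replaces the bucket's existing lifecycle configuration.
    """
    import boto3  # imported lazily so the rule builder works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={"Rules": rules})
```

For example, `log_lifecycle_rule("logs/", 30, 365)` describes a rule that archives objects under `logs/` after 30 days and deletes them after a year.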

Use the various tools at your disposal

Komiser (Free/Open Source)
KubeCost (Free/Open Source) – Tool to gain visibility into operating costs inside of your Kubernetes clusters
Goldilocks (Free/Open Source) – Tool to help “Right Size” your Kubernetes Pods
SpotInst.com (SaaS)
CloudCheckr (SaaS)

Author Details

Erik Osterman

CEO

Erik Osterman is a technical evangelist and insanely passionate DevOps guru with over a decade of hands-on experience architecting systems for AWS. After leading major cloud initiatives at CBS Interactive as the Director of Cloud Architecture, he founded Cloud Posse, a DevOps Accelerator that helps high-growth Startups and Fortune 500 Companies own their infrastructure in record time by building it together with customers and showing them the ropes.

https://osterman.com

Change Management

Change Control Management

Here are some of the tools and tricks at your disposal for enforcing a modern change control process.

Use a Version Control System
- With a VCS like GitHub, you'll be able to point to every change
- With branches, you'll be able to keep pending changes out of master until they are ready
Use Infrastructure as Code
- Define the business logic of infrastructure in reusable modules
- Separate the business logic from the configuration
- Stick all the code & configuration into VCS
Use Automation (E.g. “Operations by Pull Request”)
- Eliminate humans from running commands by hand
- Pipelines promote code and configuration changes through environments
Use a Pull Request workflow with Code Reviews & Approvals
Use Pipeline Approval Steps
Use Notifications
- Send a slack notification for every deployment
- Comment on GitHub Commit SHA for every deployment of that commit
Use Branch Protections
- Require Pull Request Approvals
- Dismiss approvals if changes are pushed
- Require status checks to pass
- Enforce CODEOWNERS
Use CODEOWNERS
- Use teams to connote stakeholders (e.g. @secops or @dba or @qa or @frontend) to ensure approvals from subject matter experts
- Use narrowly scoped paths for teams (E.g. terraform/iam/* @secops)
Use Policy Enforcement
- Tools like Open Policy Agent, conftest, tfsec help to define contracts and enforce them
- Integrate the tools with your CI/CD pipelines (run the pipeline definition from the master branch so that a PR or branch cannot modify or bypass the checks)
Use Multiple Accounts, Stages
- Test changes in isolation and use a formal process to promote changes
Use Version Pinning
- Always pin your dependencies to a version (e.g. using semver or commit SHAs)
- Never overwrite any version of the software, always create a new release/tag
- Pinning to master or latest does not count!
Use Feature Flags
- Feature flags can ensure that functionality is only turned on when it's ready and easily disabled
- Access controls on feature flags limit who can toggle them, and a change log records when each flag was modified
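
The version pinning rule above lends itself to a simple automated check. The sketch below accepts only an exact semver (optionally prefixed with `v`) or a full 40-character commit SHA, and rejects floating references such as `master`, `latest`, or version ranges. `is_pinned` is a hypothetical helper, and the patterns are a simplification of the full semver grammar.

```python
import re

# Exact versions such as 1.4.2, v0.12.0, 2.0.0-rc.1 or 1.0.0+build.5
SEMVER = re.compile(r"v?\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?")
# Full-length Git commit SHA (abbreviated SHAs are treated as unpinned)
COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned(ref: str) -> bool:
    """Return True only for an exact semver or a full commit SHA."""
    return bool(SEMVER.fullmatch(ref) or COMMIT_SHA.fullmatch(ref))
```

A CI step could run this over every dependency reference and fail the build when any of them returns `False`.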

12 Factor App Checklist

Here are some practical considerations to help determine the 12-factor readiness of an application for deployment on Kubernetes. We like the 12-factor app pattern because these are the ideal types of applications to run on Kubernetes. While pretty much anything can run on Kubernetes (e.g. worst-case as a StatefulSet), applications that match the characteristics below will make the best use of the High Availability and Scaling features Kubernetes has to offer. For example, these types of apps are typically suitable for Deployments that can leverage the HorizontalPodAutoscaler functionality.

These actionable recommendations are based on our “real world” experience of deploying apps and microservices to Kubernetes. They may not follow the precise official/canonical definitions of the “12-factor” app.

Codebase
- One-repo-per-service (poly-repo) architectures (preferred, not required)
- Use a Git-based workflow with PRs & approval process
- Dockerized with a Dockerfile
- Automated tests exist to validate application works
Dependencies
- Service dependencies are explicitly declared in the configuration (e.g. DB_HOST)
- Services are loosely coupled so that they can be started in any order
- Application dependencies are explicitly pinned in manifests (e.g. package.json)
- Use semver rather than commit SHAs for pinning, where possible
Config
- All configuration settings are passed via environment variables and not hardcoded.
- Services can be dynamically reconfigured without recompilation (e.g. by changing settings)
- Use DNS-based service discovery (instead of IPs or a dependency on Consul); use short DNS names with search domains rather than FQDNs
- Use the AWS SDK's automatic credential configuration (e.g. do not check that the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables are set)
Backing Services
- Use object storage where files are needed (not local storage)
- Use external databases (e.g. Postgres, MySQL, Redis, etc) to persist state
- Use environment variables for configuration (e.g. timeouts, endpoints, etc)
- Use configurable timeouts on connections, responses from backends
Build, Release, Run
- Well-defined process to build (e.g. compile) the application and start it (e.g. a Makefile)
- Dockerfile defines ENTRYPOINT to run the application
- Docker composition (docker-compose.yml) can bring up the environment for automated testing
- Cut releases on merge to master (preferred, not required); use semver
Processes
- All processes must expose health check endpoints (/healthz)
- Should not depend on a process manager (e.g. pm2)
- Should exit non-zero upon fatal errors
- Must respond to SIGTERM and exit gracefully
- Health checks should not depend on the health of the backing services
- Does not require privileged execution (e.g. root)
Port binding
- Dockerfile declares the port the service listens on (e.g. with EXPOSE)
- Services should listen on a preconfigured bind-address and port (e.g. 0.0.0.0:8000)
- Should listen on non-privileged ports (>= 1024)
Concurrency
- Application can be run any number of times in parallel (e.g. no expectation of locking)
- Application does not maintain a large pool of persistent database connections (e.g. configurable pool size)
- Application uses database transactions, if applicable; avoids deadlocks
- Application does not depend on sticky sessions; requests can hit any process
Disposability
- Should be entirely stateless (e.g. not maintain any local state, all state offloaded to backing services)
- Processes can be easily created or destroyed without any orchestrated shutdown process
- No POSIX filesystem required for persistence (local caching/buffering okay)
Dev/prod parity
- All environments function the same way when configured with the same settings
- Flags should enable/disable functionality without knowledge of stage or environment (e.g. do not use if ($environment == 'dev') { ... })
- Do not use hostnames for conditional/routing logic (that's the job of Ingress)
Logs
- Logs are emitted to stdout
- Events are structured event streams (e.g. JSON)
- Do not write logs to disk (to mitigate the need for log rotation)
Admin processes
- Database migrations should be automated and run as a separate container
- Cronjobs can be run as a separate container
- Batch processing should run as a separate container
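
Several of the items above (configuration from environment variables, JSON logs on stdout, a non-zero exit on fatal errors, and a graceful response to SIGTERM) can be combined in a small process skeleton. This is a minimal sketch: the variable names `BIND_ADDRESS`, `PORT`, and `DB_POOL_SIZE` and their defaults are assumptions chosen for illustration (only `DB_HOST` appears in the checklist), and the serving loop is a placeholder.

```python
import json
import os
import signal
import sys
import threading
import time
from typing import Mapping

stop = threading.Event()  # set when the process has been asked to terminate


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read all settings from the environment; nothing is hardcoded per stage."""
    return {
        "bind": env.get("BIND_ADDRESS", "0.0.0.0"),
        "port": int(env.get("PORT", "8000")),
        "db_host": env["DB_HOST"],  # required: raises KeyError when undeclared
        "db_pool_size": int(env.get("DB_POOL_SIZE", "5")),
    }


def format_event(message: str, **fields) -> str:
    """Render one structured log event as a single JSON line."""
    return json.dumps({"ts": time.time(), "msg": message, **fields})


def log_event(message: str, **fields) -> None:
    """Write the event to stdout; no log files, so no log rotation."""
    sys.stdout.write(format_event(message, **fields) + "\n")
    sys.stdout.flush()


def handle_sigterm(signum, frame) -> None:
    """Request shutdown so in-flight work can finish before exiting."""
    stop.set()


def main() -> int:
    try:
        config = load_config()
    except (KeyError, ValueError) as exc:
        log_event("invalid configuration", error=repr(exc))
        return 1  # non-zero exit on a fatal error
    signal.signal(signal.SIGTERM, handle_sigterm)
    log_event("started", bind=config["bind"], port=config["port"])
    while not stop.wait(timeout=0.5):
        pass  # placeholder for serving requests
    log_event("stopped")
    return 0
```

A container entrypoint would call `sys.exit(main())`, so that the exit code reflects whether startup succeeded.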

CI/CD Platform Requirements

Considerations for CI/CD Platforms

Must Support

Must support “pipelines as code” and auto-discover branches (ideally repos too)
- E.g. Jenkinsfile, codefresh.yml, .circleci/config.yml, .travis.yml
Must support shared secrets and settings
- E.g. secrets defined once and shared across repositories (early versions of GitHub Actions did not support this)
Must support easy parallelization
- Speed is critical. One way to speed up tests is to parallelize steps.
Must support easy integration with Kubernetes (I don’t want to manage the setup)
- First-class Kubernetes support is essential
- Minimal extra tooling should be required for Kubernetes-backed deployments
Must use container-backed steps
Must support webhook events from PRs originating from untrusted forks (E.g. open-source projects)
Must support ChatOps style requests
- E.g. comments on PRs can trigger pipelines, slack commands can retry builds
Must support SSO
- ideally without enterprise tax
Must integrate with Slack
- Slack notifications should be customizable
- Slack should replace the need for most email notifications
Must be affordable
- Platform should not require long-term commitments (SaaS)
- Cost $10-20/user max
- Support “unlimited” builds or have a pay-per-build model
- Startup pricing preferred

Should Support

Should support approval steps with Slack notifications (ideally use slack buttons)
Should support RBAC/ABAC ACLs (e.g. for approval steps)
Should support local execution of pipelines for debugging
Should make it easy to discover all open PRs so it’s easy to re-trigger
Should support remote debugging (basically drop into any step via remote shell and poke around)
Should make it easy to tag multiple versions of a docker image
Should support a “library” of pipelines or pipeline steps
Should support GitHub deployment notifications
Should support multiple OS-build platforms (e.g. iOS, OSX, Windows, Linux, Android, etc)

Go Live Checklist

Here's a typical example of a “Go Live” checklist to run through before greenlighting a cut-over to a new production environment.

Preflight Checklist

Rollback Plan
- Verify Backup Integrity and Recency
- Test Restore Process
- Ensure ability to perform software rollbacks with automation (E.g. CI/CD)
External Availability Monitoring
- Enable “Real User Monitoring” (RUM). Establish a 1-2 week baseline before launch
- Enable external synthetic tests 2-4 weeks before launch to identify any potential stability problems (e.g. during deployments)
Tune Kubernetes/ECS
- Ensure memory and CPU constraints are suitable for production
- Ensure that all pods have limits and requests set, because requests are how the scheduler decides where to place pods
- Ensure services gracefully recover when something is killed (e.g. nodes and pods)
Alert Escalations
- Ensure on-call engineers have mobile devices properly configured to alert
- Review on-call schedule to ensure continuity
- Prepare run books and link them to alerts
Performance Load Tests
- Replicate production workloads to ensure systems handle as expected
Exception Logging
- Ensure you have frontend/javascript exception logging enabled (E.g. Sentry, Datadog, NewRelic)
Reduce DNS TTLs
- Set TTLs to 60 seconds on branded domains
- Set the SOA minimum (negative-caching) TTL on all top-level (apex) zones to 60 seconds to mitigate the effects of negative DNS caching
Prepare Maintenance Page
- Provide a means to display a maintenance page (if necessary)
- Should be a static page (e.g. hosted on S3)
Schedule Cut Over
- Identify all relevant parties, stakeholders
- Communicate scope of migration and any expected downtime
Perform End-to-End Tests
- Verify deployments are working
- Verify software rollbacks are working
- Verify autoscaling is working (pods and nodes)
- Verify TLS certificates are in working order (non-staging)
- Verify TLD redirects are working
Establish Status Page
- Integrate with all status checks for dependent services

Post-Cut-over Checklist

Review exception logs
Monitor customer support tickets
Monitor non-200 status codes for anomalies
- Spikes in 401 and 403 could indicate authorization problems
- Spikes in 404 could indicate missing assets
- Spikes in 30x could indicate redirect problems
- Spikes in 500 could indicate server misconfigurations
- Spikes in 503 could indicate problems with the platform (insufficient resources, or low limits)
Ensure robots.txt is permitting indexing
- Ensure all mainstream crawlers are happy
- Ensure sitemap is reachable
Check Real End User Data
- Verify that end-user experience has not degraded
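
The status code monitoring above can be approximated by comparing counts from a baseline window with counts from an equally long window after the cut-over. This is a rough sketch: `spikes` is a hypothetical helper, and the `factor` and `min_count` thresholds are arbitrary starting values, not recommendations.

```python
from collections import Counter
from typing import Dict, Mapping, Tuple


def spikes(baseline: Mapping[int, int], current: Mapping[int, int],
           factor: float = 3.0, min_count: int = 10) -> Dict[int, Tuple[int, int]]:
    """Return {status: (baseline_count, current_count)} for suspicious codes.

    A code is flagged when it is not a 2xx success, occurs at least
    `min_count` times, and is at least `factor` times its baseline count
    (a baseline of zero is treated as one to avoid flagging stray requests).
    """
    flagged = {}
    for code, count in current.items():
        if 200 <= code < 300:
            continue
        before = baseline.get(code, 0)
        if count >= min_count and count >= factor * max(before, 1):
            flagged[code] = (before, count)
    return flagged


def count_statuses(codes) -> Counter:
    """Tally status codes taken from access logs or a metrics export."""
    return Counter(codes)
```

With a baseline containing one 503 and a current window containing forty, the result includes `503: (1, 40)`, which per the list above points at platform problems such as insufficient resources.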