Change Management

Change Control Management

Here are some of the tools and tricks at your disposal for enforcing a modern change control process.

  • Use a Version Control System
    • With a VCS like Git (hosted on GitHub), you'll be able to point to every change
    • With branches, you'll be able to keep pending changes out of master until they are ready
  • Use Infrastructure as Code
    • Define the business logic of infrastructure in reusable modules
    • Separate the business logic from the configuration
    • Stick all the code & configuration into VCS
  • Use Automation (E.g. “Operations by Pull Request”)
    • Eliminate humans from running commands by hand
    • Pipelines promote code and configuration changes through environments
  • Use a Pull Request workflow with Code Reviews & Approvals
  • Use Pipeline Approval Steps
  • Use Notifications
    • Send a Slack notification for every deployment
    • Comment on GitHub Commit SHA for every deployment of that commit
  • Use Branch Protections
    • Require Pull Request Approvals
    • Dismiss approvals if changes are pushed
    • Require status checks to pass
    • Enforce CODEOWNERS
  • Use CODEOWNERS
    • Use teams to denote stakeholders (e.g. @secops or @dba or @qa or @frontend) to ensure approvals from subject matter experts
    • Use narrowly scoped paths for teams (E.g. terraform/iam/* @secops)
  • Use Policy Enforcement
    • Tools like Open Policy Agent, conftest, and tfsec help to define contracts and enforce them
    • Integrate the tools with your CI/CD pipelines (execute policy checks from the master branch so they cannot be bypassed in the PR/branch)
  • Use Multiple Accounts, Stages
    • Test changes in isolation and use a formal process to promote changes
  • Use Version Pinning
    • Always pin your dependencies to a version (e.g. using semver or commit SHAs)
    • Never overwrite any version of the software, always create a new release/tag
    • Pinning to master or latest does not count!
  • Use Feature Flags
    • Feature flags can ensure that functionality is only turned on when it's ready and easily disabled
    • Control access to feature flags to limit who can toggle them, and keep a change log of when each flag was modified
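The CODEOWNERS recommendations above can be sketched as follows; the team names and paths are hypothetical examples, not a prescribed layout:

```text
# Hypothetical CODEOWNERS file; team names and paths are examples only

# Fallback owners for everything not matched below
*                 @devops

# Narrowly scoped paths routed to subject matter experts
terraform/iam/*   @secops
migrations/*      @dba
frontend/*        @frontend
```

With branch protections set to "Enforce CODEOWNERS," a PR touching `terraform/iam/` cannot merge without an approval from @secops.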
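A minimal policy-as-contract sketch for conftest, written in Rego; the rule below is a hypothetical example (deny Kubernetes Deployments that do not require a non-root user), not a policy from this document:

```rego
package main

# Deny Kubernetes Deployments that do not require a non-root user.
deny[msg] {
  input.kind == "Deployment"
  not input.spec.template.spec.securityContext.runAsNonRoot
  msg := "Deployments must set securityContext.runAsNonRoot"
}
```

Running `conftest test deployment.yaml` in the pipeline fails the build when the contract is violated.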

About the Author
CEO & Founder / Cloud Posse, LLC

Erik Osterman is a technical evangelist and insanely passionate DevOps guru with over 12 years of hands-on experience architecting systems for AWS. After leading major cloud initiatives at CBS Interactive as the Director of Cloud Architecture, he founded Cloud Posse, a DevOps Accelerator that helps high-growth Startups and Fortune 500 Companies succeed in the cloud by leveraging Terraform and Kubernetes.

12 Factor App Checklist

12 Factor App Checklist

Here are some of the practical considerations to help determine the 12-factor readiness of an application for deployment on Kubernetes. We like the 12-factor app pattern because these are the ideal types of applications to run on Kubernetes. While pretty much anything can run on Kubernetes (e.g. worst-case as a StatefulSet), applications that match the characteristics below will make the best use of the High Availability and Scaling features Kubernetes has to offer. For example, these types of apps are typically suitable for Deployments that can leverage the HorizontalPodAutoscaler functionality. These actionable recommendations are based on our “real world” experience of deploying apps and microservices to Kubernetes. They may not follow the precise official/canonical definitions of the “12-factor” app.

  • Codebase
    • One-repo-per-service (poly-repo) architectures (preferred, not required)
    • Use a Git-based workflow with PRs & approval process
    • Dockerized with a Dockerfile
    • Automated tests exist to validate application works
  • Dependencies
    • Service dependencies are explicitly declared in the configuration (e.g. DB_HOST)
    • Services are loosely coupled so that they can be started in any order
    • Application dependencies are explicitly pinned in manifests (e.g. package.json)
    • Use semver rather than commit SHAs for pinning, where possible
  • Config
    • All configuration settings are passed via environment variables and not hardcoded.
    • Services can be dynamically reconfigured without recompilation (e.g. by changing settings)
    • Use DNS-based service discovery (instead of IP addresses or depending on Consul); use short DNS names with search domains rather than FQDNs
    • Use the AWS SDK's automatic configuration (e.g. do not validate that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables are set)
  • Backing Services
    • Use object storage where files are needed (not local storage)
    • Use external databases (e.g. Postgres, MySQL, Redis, etc) to persist state
    • Use environment variables for configuration (e.g. timeouts, endpoints, etc)
    • Use configurable timeouts on connections, responses from backends
  • Build, Release, Run
    • Well-defined process to build (e.g. compile) the application and start it (e.g. a Makefile)
    • Dockerfile defines ENTRYPOINT to run the application
    • Docker composition (docker-compose.yml) can bring up the environment for automated testing
    • Cut releases on merge to master (preferred, not required); use semver
  • Processes
    • All processes must expose health check endpoints (/healthz)
    • Should not depend on a process manager (e.g. pm2)
    • Should exit non-zero upon fatal errors
    • Must respond to SIGTERM and exit gracefully
    • Health checks should not depend on the health of the backing services
    • Does not require privileged execution (e.g. root)
  • Port binding
    • Dockerfiles declare the port the service listens on (e.g. EXPOSE 8000)
    • Services should listen on a preconfigured bind-address and port (e.g. 0.0.0.0:8000)
    • Should listen on non-privileged ports (> 1024)
  • Concurrency
    • Application can be run any number of times in parallel (e.g. no expectation of locking)
    • Application does not maintain a large pool of persistent database connections (e.g. configurable pool size)
    • Application uses database transactions, if applicable; avoids deadlocks
    • Application does not depend on sticky sessions; requests can hit any process
  • Disposability
    • Should be entirely stateless (e.g. not maintain any local state, all state offloaded to backing services)
    • Processes can be easily created or destroyed without any orchestrated shutdown process
    • No POSIX filesystem required for persistence (local caching/buffering okay)
  • Dev/prod parity
    • All environments function the same way when configured with the same settings
    • Flags should enable/disable functionality without knowledge of stage or environment (e.g. do not use if ($environment == 'dev') { ... })
    • Do not use hostnames for conditional/routing logic (that's the job of Ingress)
  • Logs
    • Logs are emitted to stdout
    • Events are structured event streams (e.g. JSON)
    • Do not write logs to disk (to mitigate the need for log rotation)
  • Admin processes
    • Database migrations should be automated and run as a separate container
    • Cronjobs can be run as a separate container
    • Batch processing should run as a separate container
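A matching Dockerfile sketch for the Build/Release/Run, Port binding, and non-privileged-execution items; the base image, file names, and port are illustrative assumptions:

```dockerfile
FROM python:3.11-slim

# Run as a non-privileged user (no root required)
RUN useradd --create-home app
USER app
WORKDIR /home/app

COPY app.py .

# Declare the (non-privileged) port the service listens on
ENV PORT=8000
EXPOSE 8000

# Well-defined ENTRYPOINT to run the application
ENTRYPOINT ["python", "app.py"]
```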
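Several of the Processes, Config, and Port binding items above fit together in a minimal service. This is a sketch only; the DB_HOST variable and the defaults are hypothetical, and a real app would use a production-grade server:

```python
import os
import signal
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

# Config: all settings come from environment variables, never hardcoded.
PORT = int(os.environ.get("PORT", "8000"))        # non-privileged port (> 1024)
DB_HOST = os.environ.get("DB_HOST", "localhost")  # hypothetical backing service

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Health check endpoint; does not touch backing services.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

def handle_sigterm(signum, frame):
    # Respond to SIGTERM and exit gracefully so the pod can be drained.
    sys.exit(0)

def main():
    signal.signal(signal.SIGTERM, handle_sigterm)
    # Listen on a preconfigured bind-address and non-privileged port.
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()

# main() would be invoked by the container's ENTRYPOINT.
```

Because all configuration is read from the environment, the same image runs unchanged in every stage, which is what makes Dev/prod parity achievable.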

CI/CD Platform Requirements

Considerations for CI/CD Platforms

Must Support

  • Must support “pipelines as code” and auto-discover branches (ideally repos too)
    • E.g. Jenkinsfile, codefresh.yml, .circleci/config.yml, .travis.yml
  • Must support shared secrets and settings
    • E.g. GitHub Actions does not support shared secrets
  • Must support easy parallelization
    • Speed is critical. One way to speed up tests is to parallelize steps.
  • Must support easy integration with Kubernetes (I don’t want to manage the setup)
    • First-class Kubernetes support is essential
    • Minimal extra tooling should be required for Kubernetes-backed deployments
  • Must use container-backed steps
  • Must support webhook events from PRs originating from untrusted forks (E.g. open-source projects)
  • Must support ChatOps style requests
    • E.g. comments on PRs can trigger pipelines, Slack commands can retry builds
  • Must support SSO
  • Must integrate with Slack
    • Slack notifications should be customizable
    • Slack should replace the need for most email notifications
  • Must be affordable
    • Platform should not require long-term commitments (SaaS)
    • Cost $10-20/user max
    • Support “unlimited” builds or have a pay-per-build model
    • Startup pricing preferred
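One way the “pipelines as code,” container-backed steps, and parallelization requirements fit together, sketched here in GitHub Actions syntax (the image, shard counts, and test-runner flag are hypothetical placeholders; the same shape applies to other platforms):

```yaml
name: ci
on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    container: node:18        # container-backed step
    strategy:
      matrix:
        shard: [1, 2, 3, 4]   # parallelize the test suite across 4 shards
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4   # hypothetical test-runner flag
```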

Should Support

  • Should support approval steps with Slack notifications (ideally with Slack buttons)
  • Should support RBAC/ABAC ACLs (e.g. for approval steps)
  • Should support local execution of pipelines for debugging
  • Should make it easy to discover all open PRs so it’s easy to re-trigger their builds
  • Should support remote debugging (basically drop into any step via remote shell and poke around)
  • Should make it easy to tag multiple versions of a docker image
  • Should support a “library” of pipelines or pipeline steps
  • Should support GitHub deployment notifications
  • Should support multiple OS build platforms (e.g. iOS, macOS, Windows, Linux, Android, etc.)

Go Live Checklist

Go Live Checklist

Here's a typical example of a “Go Live” checklist to run through before greenlighting a cut-over to a new production environment.

Preflight Checklist

  • Rollback Plan
    • Verify Backup Integrity and Recency
    • Test Restore Process
  • External Availability Monitoring
    • Enable “Real User Monitoring” (RUM). Establish a 1-2 week baseline.
  • Alert Escalations
    • Ensure on-call engineers have mobile devices properly configured to alert
    • Review on-call schedule to ensure continuity
    • Prepare run books and link them to alerts
  • Performance Load Tests
    • Replicate production workloads to ensure systems handle as expected
  • Exception Logging
  • Reduce DNS TTLs
    • Set TTLs to 60 seconds on branded domains
  • Prepare Maintenance Page
    • Provide a means to display a maintenance page (if necessary)
    • Should be a static page (e.g. hosted on S3)
  • Schedule Cut Over
    • Identify all relevant parties, stakeholders
    • Communicate scope of migration and any expected downtime
  • Perform End-to-End Tests
    • Verify deployments are working
    • Verify software rollbacks are working
    • Verify autoscaling is working (pods and nodes)
    • Verify TLS certificates are in working order (non-staging)
    • Verify TLD redirects are working
  • Establish Status Page
    • Integrate with all status checks

Post-Cut-over Checklist

  • Review exception logs
  • Monitor customer support tickets
  • Monitor non-200 status codes for anomalies
    • Spikes in 404 could indicate missing assets
    • Spikes in 30x could indicate redirect problems
    • Spikes in 500 could indicate server misconfigurations
  • Ensure robots.txt permits indexing
    • Ensure all mainstream crawlers are happy
    • Ensure sitemap is reachable
  • Check Real End User Data
    • Verify that end-user experience has not degraded
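Monitoring for non-200 anomalies can start with something as simple as bucketing status codes by class; a minimal sketch (the sample codes are invented for illustration):

```python
from collections import Counter

def status_class_counts(status_codes):
    """Bucket HTTP status codes by class (2xx, 3xx, 4xx, 5xx)."""
    return Counter(f"{code // 100}xx" for code in status_codes)

# Example: counts from a (hypothetical) sample of access-log entries
codes = [200, 200, 301, 404, 404, 500, 200]
counts = status_class_counts(codes)
# A spike in 4xx could indicate missing assets; a spike in 5xx,
# server misconfigurations; a spike in 3xx, redirect problems.
```

Comparing these counts against the pre-cut-over baseline makes the spikes described above easy to detect.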