Go Live Checklist

Here's a typical example of a “Go Live” checklist to run through before greenlighting a cut-over to a new production environment.

Preflight Checklist

  • Rollback Plan
    • Verify Backup Integrity and Recency
    • Test Restore Process
    • Ensure software rollbacks can be performed with automation (e.g. CI/CD)
  • External Availability Monitoring
    • Enable “Real User Monitoring” (RUM). Establish a 1-2 week baseline before launch
    • Enable external synthetic tests 2-4 weeks before launch to identify any potential stability problems (e.g. during deployments)
  • Tune Kubernetes/ECS
    • Ensure memory and CPU constraints are suitable for production
    • Ensure that all pods have limits and requests set; the scheduler uses requests to decide where to place pods, and limits cap what each container may consume
    • Ensure services recover gracefully when killed (e.g. node or pod termination)
  • Alert Escalations
    • Ensure on-call engineers have mobile devices properly configured to alert
    • Review on-call schedule to ensure continuity
    • Prepare run books and link them to alerts
  • Performance Load Tests
    • Replicate production workloads to ensure systems handle them as expected
  • Exception Logging
    • Ensure you have frontend/JavaScript exception logging enabled (e.g. Sentry, Datadog, New Relic)
  • Reduce DNS TTLs
    • Set TTLs to 60 seconds on branded domains
    • Set the SOA negative-caching TTL (the SOA minimum field) to 60 seconds on each zone to mitigate the effects of negative DNS caching
  • Prepare Maintenance Page
    • Provide a means to display a maintenance page (if necessary)
    • Should be a static page (e.g. hosted on S3)
  • Schedule Cut-over
    • Identify all relevant parties and stakeholders
    • Communicate scope of migration and any expected downtime
  • Perform End-to-End Tests
    • Verify deployments are working
    • Verify software rollbacks are working
    • Verify autoscaling is working (pods and nodes)
    • Verify TLS certificates are in working order (non-staging)
    • Verify TLD redirects are working
  • Establish Status Page
    • Integrate with all status checks for dependent services
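
The "limits and requests" item under Tune Kubernetes/ECS can be checked mechanically before launch. Below is a minimal sketch in Python, assuming a Kubernetes Deployment manifest that has already been parsed into a plain dict (for example from YAML or JSON). The function name find_missing_resources and the sample manifest are hypothetical; only the usual spec.template.spec.containers[].resources layout of a Deployment is assumed.

```python
# Resources that every container is expected to declare (an assumption
# based on the checklist item; adjust to your own policy).
REQUIRED = ("cpu", "memory")


def find_missing_resources(deployment):
    """Return (container, kind, resource) tuples for each unset request or limit."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for kind in ("requests", "limits"):
            values = resources.get(kind) or {}
            for resource in REQUIRED:
                if resource not in values:
                    name = container.get("name", "<unnamed>")
                    missing.append((name, kind, resource))
    return missing


# Hypothetical manifest: "web" is fully specified, "sidecar" has no limits.
example = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "resources": {
            "requests": {"cpu": "250m", "memory": "256Mi"},
            "limits": {"cpu": "500m", "memory": "512Mi"}}},
        {"name": "sidecar", "resources": {
            "requests": {"cpu": "50m", "memory": "64Mi"}}},
    ]}}},
}

if __name__ == "__main__":
    for name, kind, resource in find_missing_resources(example):
        print(f"{name}: missing {kind}.{resource}")
```

Run against the sample manifest, this reports the two missing limits on the "sidecar" container. A check like this could run in CI so that a manifest without requests or limits never reaches the production cluster.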

Post-Cut-over Checklist

  • Review exception logs
  • Monitor customer support tickets
  • Monitor non-200 status codes for anomalies
    • Spikes in 401 and 403 could indicate authorization problems
    • Spikes in 404 could indicate missing assets
    • Spikes in 30x could indicate redirect problems
    • Spikes in 500 could indicate server misconfigurations
    • Spikes in 503 could indicate problems with the platform (insufficient resources, or low limits)
  • Ensure robots.txt permits indexing
    • Ensure all mainstream crawlers can crawl the site without errors
    • Ensure sitemap is reachable
  • Check Real End User Data
    • Verify that end-user experience has not degraded
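
The status-code items above amount to comparing error rates after the cut-over with a baseline taken before it. Below is a minimal sketch in Python, assuming you can export one HTTP status code per request for each period. The function names, the 3x factor, and the 1% floor are illustrative assumptions rather than recommended thresholds; the hints are taken directly from the checklist.

```python
from collections import Counter

# Hints from the checklist, keyed by status code or class.
HINTS = {
    "401": "authorization problems",
    "403": "authorization problems",
    "404": "missing assets",
    "3xx": "redirect problems",
    "500": "server misconfiguration",
    "503": "platform problems (insufficient resources or low limits)",
}


def bucket(status):
    """Group all redirects together; keep every other code as-is."""
    return "3xx" if 300 <= status < 400 else str(status)


def error_rates(statuses):
    """Share of all requests that fall in each non-200 bucket."""
    total = len(statuses)
    if total == 0:
        return {}
    counts = Counter(bucket(s) for s in statuses if s != 200)
    return {code: n / total for code, n in counts.items()}


def find_spikes(baseline, current, factor=3.0, floor=0.01):
    """Return buckets whose current rate is >= floor and >= factor * baseline rate."""
    before = error_rates(baseline)
    after = error_rates(current)
    spikes = {}
    for code, rate in after.items():
        if rate >= floor and rate >= factor * before.get(code, 0.0):
            spikes[code] = HINTS.get(code, "unclassified; investigate")
    return spikes


if __name__ == "__main__":
    # Hypothetical samples: one status code per request.
    baseline = [200] * 97 + [404] * 2 + [301] * 1
    current = [200] * 80 + [404] * 3 + [503] * 12 + [301] * 5
    for code, hint in find_spikes(baseline, current).items():
        print(f"{code}: possible {hint}")
```

With these sample inputs, the 503 and 3xx buckets are flagged, while 404 is not because its rate stays below three times the baseline. In practice the same comparison is usually expressed as an alert rule in the monitoring system; the sketch only illustrates the logic.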