Go Live Checklist

Here's a typical example of a “Go Live” checklist to run through before greenlighting a cut-over to a new production environment.

Preflight Checklist

Rollback Plan
- Verify Backup Integrity and Recency
- Test Restore Process
- Ensure software rollbacks can be performed through automation (e.g. CI/CD)
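The backup checks above could be scripted roughly as follows. This is a minimal sketch: the 24-hour freshness threshold, the function names, and the idea of comparing against a recorded SHA-256 digest are all assumptions for illustration, not part of any particular backup tool.

```python
import hashlib
from datetime import datetime, timedelta, timezone

# Assumption: backups older than 24 hours count as stale. Adjust to your own recovery objectives.
MAX_BACKUP_AGE = timedelta(hours=24)


def backup_is_recent(finished_at, now=None, max_age=MAX_BACKUP_AGE):
    """Return True if the backup finished within max_age of now (timezone-aware datetimes)."""
    now = now or datetime.now(timezone.utc)
    age = now - finished_at
    # A negative age means the timestamp is in the future, which is also suspicious.
    return timedelta(0) <= age <= max_age


def backup_is_intact(data, expected_sha256):
    """Return True if the SHA-256 of the backup bytes matches the digest recorded at backup time."""
    return hashlib.sha256(data).hexdigest() == expected_sha256
```

A matching checksum only shows the file was not corrupted in storage; it does not replace an actual test restore.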
External Availability Monitoring
- Enable “Real User Monitoring” (RUM). Establish a 1-2 week baseline before launch
- Enable external synthetic tests 2-4 weeks before launch to identify any potential stability problems (e.g. during deployments)
Tune Kubernetes/ECS
- Ensure memory and CPU constraints are suitable for production
- Ensure that all pods have resource requests and limits set; the scheduler uses requests to decide where to place pods, and limits cap what each container may consume
- Ensure services recover gracefully when killed (e.g. node and pod terminations)
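One way to audit requests and limits is to walk pod manifests and report any container that lacks them. The sketch below assumes the manifests are already loaded as Python dicts (for example from the `items` array of `kubectl get pods -o json`); the function name is hypothetical.

```python
def containers_missing_resources(pod):
    """Return (container_name, missing_field) pairs for a single pod manifest dict."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources") or {}
        for kind in ("requests", "limits"):
            values = resources.get(kind) or {}
            for resource in ("cpu", "memory"):
                if resource not in values:
                    missing.append((container.get("name", "?"), f"{kind}.{resource}"))
    return missing


# Illustrative manifest: "web" is fully specified, "sidecar" has no memory limit.
example_pod = {
    "spec": {
        "containers": [
            {"name": "web", "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"cpu": "500m", "memory": "512Mi"}}},
            {"name": "sidecar", "resources": {
                "requests": {"cpu": "50m", "memory": "64Mi"},
                "limits": {"cpu": "100m"}}},
        ]
    }
}
print(containers_missing_resources(example_pod))
```

Some teams deliberately omit CPU limits; if that applies, drop `limits.cpu` from the checked fields rather than treating it as a failure.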
Alert Escalations
- Ensure on-call engineers have mobile devices properly configured to receive alerts
- Review on-call schedule to ensure continuity
- Prepare run books and link them to alerts
Performance Load Tests
- Replicate production workloads to ensure systems handle them as expected
Exception Logging
- Ensure you have frontend/JavaScript exception logging enabled (e.g. Sentry, Datadog, New Relic)
Reduce DNS TTLs
- Set TTLs to 60 seconds on branded domains
- Set the SOA TTLs (including the SOA minimum field) for all top-level (apex) domains to 60 seconds to mitigate the effects of negative DNS caching
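To see why the SOA record matters here: under RFC 2308, resolvers cache negative answers (such as NXDOMAIN) for the lesser of the SOA record's own TTL and its MINIMUM field. The sketch below computes that value from SOA record data in the seven-field form printed by `dig +short SOA`; the function name and the example zone are illustrative.

```python
def negative_cache_ttl(soa_record_ttl, soa_rdata):
    """Return the negative-caching TTL in seconds implied by an SOA record.

    soa_rdata holds the seven SOA fields:
    MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM
    """
    fields = soa_rdata.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 SOA fields, got {len(fields)}")
    minimum = int(fields[6])
    # RFC 2308: negative answers are cached for min(SOA TTL, SOA MINIMUM).
    return min(soa_record_ttl, minimum)


# Illustrative record data for a hypothetical zone.
rdata = "ns1.example.com. hostmaster.example.com. 2024010101 7200 3600 1209600 60"
print(negative_cache_ttl(3600, rdata))  # 60
```

If this returns more than 60 for a zone involved in the cut-over, a lookup that briefly fails during migration may stay cached as a failure for that long.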
Prepare Maintenance Page
- Provide a means to display a maintenance page (if necessary)
- Should be a static page (e.g. hosted on S3)
Schedule Cut-over
- Identify all relevant parties and stakeholders
- Communicate scope of migration and any expected downtime
Perform End-to-End Tests
- Verify deployments are working
- Verify software rollbacks are working
- Verify autoscaling is working (pods and nodes)
- Verify TLS certificates are in working order (non-staging)
- Verify top-level (apex) domain redirects are working
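The TLS certificate check can be partly automated with Python's standard library. In this sketch, `fetch_not_after` opens a verified connection, so a certificate from an untrusted issuer (which is typically the case for staging certificates) should fail the handshake instead of returning a date. The function names and the idea of alerting on days remaining are assumptions for illustration.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=5.0):
    """Open a verified TLS connection and return the certificate's notAfter string."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Return days remaining until not_after (e.g. 'Jan  5 09:34:43 2018 GMT')."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400
```

A typical use would be `days_until_expiry(fetch_not_after("www.example.com"))`, with the hostname replaced by each production domain, failing the check if the result drops below a threshold of your choosing.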
Establish Status Page
- Integrate with all status checks for dependent services

Post-Cut-over Checklist

Review exception logs
Monitor customer support tickets
Monitor non-200 status codes for anomalies
- Spikes in 401 and 403 could indicate authorization problems
- Spikes in 404 could indicate missing assets
- Spikes in 30x could indicate redirect problems
- Spikes in 500 could indicate server misconfigurations
- Spikes in 503 could indicate problems with the platform (insufficient resources, or limits set too low)
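A simple way to spot such anomalies is to compare each status code's share of traffic after the cut-over with its share in a baseline window. The sketch below assumes status codes have already been extracted from access logs into lists of integers; the 3x factor and the 1% floor are arbitrary illustrative thresholds.

```python
from collections import Counter


def status_rates(codes):
    """Return each status code's share of total requests."""
    total = len(codes)
    if total == 0:
        return {}
    return {code: count / total for code, count in Counter(codes).items()}


def find_spikes(baseline, current, factor=3.0, floor=0.01):
    """Return non-200 codes whose current share is at least floor
    and more than factor times their baseline share."""
    base = status_rates(baseline)
    flagged = {}
    for code, rate in status_rates(current).items():
        if code == 200:
            continue
        if rate >= floor and rate > factor * base.get(code, 0.0):
            flagged[code] = rate
    return flagged


# Illustrative data: 404s rise slightly, 503s appear for the first time.
before = [200] * 98 + [404] * 2
after = [200] * 80 + [404] * 3 + [503] * 17
print(sorted(find_spikes(before, after)))  # [503]
```

Comparing shares rather than raw counts keeps the check meaningful when traffic volume itself changes after the cut-over.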
Ensure robots.txt is permitting indexing
- Ensure all mainstream crawlers are happy
- Ensure sitemap is reachable
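The robots.txt checks can be sketched with Python's standard `urllib.robotparser`. The crawler names below are just common examples, and `check_robots` is a hypothetical helper; `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt, site_url, agents=("Googlebot", "Bingbot", "*")):
    """Return (blocked_agents, sitemap_urls) for the given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [agent for agent in agents if not parser.can_fetch(agent, site_url)]
    return blocked, parser.site_maps() or []


# A robots.txt carried over from staging often blocks everything.
staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow:\nSitemap: https://example.com/sitemap.xml\n"

print(check_robots(staging, "https://example.com/"))
print(check_robots(production, "https://example.com/"))
```

This only confirms that the sitemap is declared; each returned sitemap URL still has to be fetched separately to confirm it is reachable.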
Check Real End User Data
- Verify that end-user experience has not degraded
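If the RUM baseline collected before launch is available as raw samples, the comparison can be reduced to a percentile check. This is a minimal sketch assuming page load times in milliseconds; the 75th percentile and the 10% tolerance are illustrative choices, not a standard.

```python
import statistics


def p75(samples):
    """Return the 75th percentile of the samples."""
    return statistics.quantiles(samples, n=4)[2]


def experience_degraded(baseline_ms, current_ms, tolerance=0.10):
    """Return True if the current p75 exceeds the baseline p75 by more than tolerance."""
    return p75(current_ms) > p75(baseline_ms) * (1 + tolerance)


# Illustrative samples: the post-cut-over load times are 50% slower.
baseline = list(range(100, 200))
slower = [value * 1.5 for value in baseline]
print(experience_degraded(baseline, slower))    # True
print(experience_degraded(baseline, baseline))  # False
```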