Go Live Checklist
Here's a typical example of a “Go Live” checklist to run through before greenlighting a cut-over to a new production environment.
Preflight Checklist
- Rollback Plan
- Verify Backup Integrity and Recency
- Test Restore Process
- Ensure software rollbacks can be performed with automation (e.g. CI/CD)
- External Availability Monitoring
- Enable “Real User Monitoring” (RUM). Establish a 1-2 week baseline before launch
- Enable external synthetic tests 2-4 weeks before launch to identify any potential stability problems (e.g. during deployments)
- Tune Kubernetes/ECS
- Ensure memory and CPU constraints are suitable for production
- Ensure that all pods have limits and requests set: the scheduler uses requests to decide where to place pods, and limits cap what each container can consume
- Ensure services recover gracefully when killed (e.g. nodes and pods)
- Alert Escalations
- Ensure on-call engineers have mobile devices properly configured to receive alerts
- Review on-call schedule to ensure continuity
- Prepare run books and link them to alerts
- Performance Load Tests
- Replicate production workloads to ensure systems handle them as expected
- Exception Logging
- Ensure you have frontend/JavaScript exception logging enabled (e.g. Sentry, Datadog, New Relic)
- Reduce DNS TTLs
- Set TTLs to 60 seconds on branded domains
- Set the SOA TTL and minimum field to 60 seconds on all TLDs to mitigate the effects of negative DNS caching
- Prepare Maintenance Page
- Provide a means to display a maintenance page (if necessary)
- Should be a static page (e.g. hosted on S3)
- Schedule Cut-over
- Identify all relevant parties and stakeholders
- Communicate scope of migration and any expected downtime
- Perform End-to-End Tests
- Verify deployments are working
- Verify software rollbacks are working
- Verify autoscaling is working (pods and nodes)
- Verify TLS certificates are in working order (non-staging)
- Verify TLD redirects are working
- Establish Status Page
- Integrate with all status checks for dependent services
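The "limits and requests" item above is easy to check by script. Here is a minimal sketch, assuming the input is the JSON printed by `kubectl get pods -o json` already parsed into a dict. The function name `containers_missing_resources` and the sample data are hypothetical, made up for illustration.

```python
def containers_missing_resources(pod_list):
    """Return (pod, container, missing_field) tuples for containers that lack
    a CPU or memory request or limit.

    Assumption: `pod_list` has the shape of `kubectl get pods -o json` output,
    i.e. a dict with an "items" list of pod objects.
    """
    findings = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unnamed>")
        for container in pod.get("spec", {}).get("containers", []):
            resources = container.get("resources") or {}
            for section in ("requests", "limits"):
                values = resources.get(section) or {}
                for resource in ("cpu", "memory"):
                    if resource not in values:
                        findings.append(
                            (pod_name, container.get("name"), f"{section}.{resource}")
                        )
    return findings


# Hypothetical sample: "app" is fully specified, "sidecar" only requests CPU.
SAMPLE_POD_LIST = {
    "items": [
        {
            "metadata": {"name": "web-1"},
            "spec": {
                "containers": [
                    {
                        "name": "app",
                        "resources": {
                            "requests": {"cpu": "250m", "memory": "256Mi"},
                            "limits": {"cpu": "500m", "memory": "512Mi"},
                        },
                    },
                    {"name": "sidecar", "resources": {"requests": {"cpu": "50m"}}},
                ]
            },
        }
    ]
}

if __name__ == "__main__":
    for pod_name, container_name, field in containers_missing_resources(SAMPLE_POD_LIST):
        print(f"{pod_name}/{container_name}: missing {field}")
```

In practice you would feed it real cluster output, for example by piping `kubectl get pods --all-namespaces -o json` into a small wrapper that calls `json.load(sys.stdin)`. Whether a missing CPU limit should count as a failure is a policy choice; adjust the checked fields to match yours.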
Post-Cut-over Checklist
- Review exception logs
- Monitor customer support tickets
- Monitor non-200 status codes for anomalies
- Spikes in 401 and 403 could indicate authorization problems
- Spikes in 404 could indicate missing assets
- Spikes in 30x could indicate redirect problems
- Spikes in 500 could indicate server misconfigurations
- Spikes in 503 could indicate problems with the platform (insufficient resources, or low limits)
- Ensure robots.txt is permitting indexing
- Ensure all mainstream crawlers are happy
- Ensure sitemap is reachable
- Check Real End User Data
- Verify that end-user experience has not degraded
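The status-code item above comes down to comparing response-code rates before and after the cut-over. Here is a minimal sketch, assuming you can export two lists of integer status codes (one from the baseline period, one from after the cut-over) from your access logs or monitoring tool. The names `status_bucket`, `bucket_rates` and `find_spikes` are hypothetical, and the `factor` and `min_rate` thresholds are arbitrary starting points, not recommendations.

```python
from collections import Counter

# Hints taken from the checklist above, keyed by status bucket.
HINTS = {
    "401/403": "possible authorization problems",
    "404": "possible missing assets",
    "30x": "possible redirect problems",
    "500": "possible server misconfigurations",
    "503": "possible platform problems (insufficient resources, or low limits)",
}


def status_bucket(code):
    """Map an HTTP status code to one of the buckets named in the checklist."""
    if code in (401, 403):
        return "401/403"
    if code == 404:
        return "404"
    if 300 <= code < 400:
        return "30x"
    if code == 500:
        return "500"
    if code == 503:
        return "503"
    if 200 <= code < 300:
        return "2xx"
    return "other"


def bucket_rates(codes):
    """Return the share of responses per bucket (empty dict for no data)."""
    total = len(codes)
    if total == 0:
        return {}
    counts = Counter(status_bucket(code) for code in codes)
    return {bucket: count / total for bucket, count in counts.items()}


def find_spikes(baseline_codes, current_codes, factor=2.0, min_rate=0.01):
    """Flag buckets whose current rate is at least `min_rate` and more than
    `factor` times the baseline rate. Both thresholds are assumptions."""
    baseline = bucket_rates(baseline_codes)
    current = bucket_rates(current_codes)
    spikes = {}
    for bucket, rate in current.items():
        if bucket == "2xx":
            continue
        if rate >= min_rate and rate > factor * baseline.get(bucket, 0.0):
            spikes[bucket] = HINTS.get(bucket, "unclassified anomaly")
    return spikes


# Hypothetical data: 404s stay flat, 503s appear after the cut-over.
BASELINE = [200] * 98 + [404] * 2
CURRENT = [200] * 90 + [404] * 2 + [503] * 8

if __name__ == "__main__":
    for bucket, hint in find_spikes(BASELINE, CURRENT).items():
        print(f"Spike in {bucket}: {hint}")
```

Comparing rates rather than raw counts keeps the check meaningful when traffic volume differs between the two periods.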