The 12-Factor Pattern Applied to Cloud Architecture

Erik Osterman

Heroku has probably deployed more services in cloud environments than any other company. They operate a massive “Platform-as-a-Service” that lets developers deploy most kinds of apps with a simple git push. Along the way, they developed a pattern for writing applications so that they can be deployed easily and consistently in cloud environments. Their platform abides by this pattern, but it can be implemented in many ways.

The 12-factor pattern can be summed up like this:

Treat all micro-services as disposable services that receive their configuration via environment variables and rely on backing services to provide durability. Any time you need to make a change, it should be scripted. Treat all environments (dev, prod, qa, etc.) as identical.

Of course, this only works if the cloud architecture plays along with the methodology. For a cloud architecture to be “12-factor app” compliant, here are some recommended criteria.

1. Codebase

  1. Applications can be pinned to a specific version or branch
  2. All deployments are versioned
  3. Multiple versions can be deployed concurrently (e.g. prod, dev, qa)

2. Dependencies

  1. Service dependencies are explicitly declared and loosely coupled
  2. Dependencies can be isolated between services
  3. Services can be logically grouped together

3. Config

  1. All configuration is passed via environment variables and not hardcoded.
  2. Services can announce availability and discover other services
  3. Services can be dynamically reconfigured (e.g. using feature flags or changing environment variables)
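
As a concrete illustration, every environment-specific value is read from the environment at startup, never hardcoded or checked into the codebase. Here is a minimal Python sketch; the variable names (DATABASE_URL, FEATURE_NEW_CHECKOUT, etc.) are illustrative, not prescribed:

    import os

    # Required settings: fail fast at startup if they are missing.
    DATABASE_URL = os.environ["DATABASE_URL"]

    # Optional settings with sensible defaults.
    PORT = int(os.environ.get("PORT", "8080"))
    LOG_LEVEL = os.environ.get("LOG_LEVEL", "info")

    # A simple feature flag, toggled per environment without a code change.
    FEATURE_NEW_CHECKOUT = os.environ.get("FEATURE_NEW_CHECKOUT", "false").lower() == "true"

Reconfiguring the service is then just a matter of changing its environment and redeploying, which is exactly what keeps dev, qa, and prod interchangeable.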

4. Backing Services

  1. Services depend on object stores to store assets (if applicable)
  2. Services use environment variables to find backing services
  3. Platform supports backing services like MySQL, Redis, or Memcached
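
In practice, a backing service is just an attached resource the app locates through its environment. A minimal Python sketch, assuming a DATABASE_URL variable in the common scheme://user:pass@host:port/name form:

    import os
    from urllib.parse import urlparse

    # The platform injects the location of the backing service; the code
    # never hardcodes hostnames or credentials.
    url = urlparse(os.environ["DATABASE_URL"])  # e.g. mysql://user:pass@db:3306/app

    db_settings = {
        "host": url.hostname,
        "port": url.port or 3306,
        "user": url.username,
        "password": url.password,
        "dbname": url.path.lstrip("/"),
    }

    # Swapping a local container for a managed database (or Redis for
    # Memcached) then only requires changing the URL, not the code.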

5. Build, release, run (PaaS)

  1. Automation of deployments (build, release, run)
  2. All builds produce immutable images
  3. Deployments should result in zero downtime

6. Processes

  1. Micro-services should consist of a single process
  2. Processes are stateless and share-nothing
  3. Ephemeral filesystem can be used for temporary storage
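
To make the “stateless, share-nothing” rule concrete: the local filesystem is only scratch space for a single request, and anything durable belongs in a backing service. A hypothetical Python sketch:

    import tempfile

    def transform(data: bytes) -> bytes:
        """Stand-in for real per-request work (resizing, transcoding, ...)."""
        return data.upper()

    def handle_request(data: bytes) -> bytes:
        # The ephemeral filesystem is fine as scratch space within one request;
        # it can vanish whenever the process is restarted or rescheduled.
        with tempfile.NamedTemporaryFile() as tmp:
            tmp.write(data)
            tmp.seek(0)
            result = transform(tmp.read())
        # Anything that must survive the request belongs in a backing service
        # (database, object store), never on local disk or in process memory.
        return result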

7. Port binding

  1. Services should be able to run on any port defined by the platform
  2. Service discovery should incorporate ports
  3. Some sort of routing layer handles requests and distributes them to port-bound processes
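
The canonical example is a self-contained service that exports HTTP by binding to whatever port the platform assigns, rather than relying on a web server injected at runtime. A minimal sketch using only the Python standard library:

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"hello\n")

    # The platform chooses the port and passes it in via the environment;
    # the routing layer maps public traffic to this port-bound process.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()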

8. Concurrency

  1. Concurrency is easily achieved by replicating the micro-service
  2. Services scale automatically without human intervention
  3. The platform only routes traffic to healthy instances
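
Because each replica is an identical, stateless process, scaling out is simply running more copies behind the routing layer. The router then needs a cheap way to decide which replicas should receive traffic; a common convention (not mandated by the 12-factor document) is a health endpoint like the hypothetical /healthz below, extending the port-binding sketch above:

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                # The load balancer probes this path and only routes
                # requests to replicas that answer 200.
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ok\n")
            else:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"hello\n")

    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()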

9. Disposability

  1. Services are entirely disposable (they maintain no local state)
  2. They can be easily created or destroyed
  3. They are not upgraded or patched (just redeploy!)
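
Disposability in practice means fast startup and graceful shutdown: when the platform wants to replace an instance it sends SIGTERM, and the process stops taking new work, finishes what it has, and exits. A minimal Python sketch (the worker loop is hypothetical):

    import signal
    import sys
    import time

    shutting_down = False

    def handle_sigterm(signum, frame):
        # The platform sends SIGTERM when this instance is being replaced;
        # flag the loop to wind down instead of dying mid-request.
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    while not shutting_down:
        # do_one_unit_of_work()  # hypothetical: pull a job, serve a request, ...
        time.sleep(1)

    # Release backing-service connections here, then exit cleanly.
    sys.exit(0)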

10. Dev/prod parity

  1. All environments function the same way
  2. Guarantees that all services in an environment are identical
  3. Code runs the same way in all places

11. Logs

  1. Logs are treated as structured event streams (e.g. JSON) that can be subscribed to by multiple consumers
  2. Logs are collected from all nodes in the cluster and centrally aggregated for auditing all activity
  3. Alerts can be generated from log events
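
Concretely, the app never manages log files or routes them itself; it writes one structured event per line to stdout and lets the platform collect and fan out the stream. A minimal sketch emitting JSON lines (the field names are illustrative):

    import json
    import sys
    import time

    def log(level, event, **fields):
        # One JSON object per line on stdout; the platform captures the
        # stream, aggregates it across nodes, and lets multiple consumers
        # (alerting, archival, analytics) subscribe to it.
        record = {"ts": time.time(), "level": level, "event": event, **fields}
        sys.stdout.write(json.dumps(record) + "\n")
        sys.stdout.flush()

    log("info", "request.handled", path="/checkout", status=200, duration_ms=42)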

12. Admin processes

  1. It should never be necessary to log in to servers to manually make changes
  2. APIs exist so that everything can be scripted
  3. Regular tasks can be scheduled to run automatically
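
Admin and maintenance tasks run as one-off processes from the same codebase and the same environment as the app, never as commands typed by hand on a server. A hypothetical management script (tasks.py and the cleanup task are made-up names):

    # tasks.py -- one-off admin process; ships with the app and reads the
    # same environment variables as the long-running service.
    import argparse
    import os

    def cleanup_expired_sessions(database_url: str) -> None:
        # Placeholder for the real task; it would connect to the backing
        # service named in the environment, never to a hardcoded host.
        print(f"cleaning expired sessions in {database_url}")

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="One-off admin tasks")
        parser.add_argument("task", choices=["cleanup-sessions"])
        args = parser.parse_args()
        if args.task == "cleanup-sessions":
            cleanup_expired_sessions(os.environ["DATABASE_URL"])

A scheduler (cron or the platform’s scheduled-job facility) can then run the same command automatically, and because it is scriptable it can also be driven through an API.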

Building a DevOps Culture

Erik Osterman

A question that often comes from well-established organizations with “mature” infrastructures is the following:

How can an organization instill a new engineering culture where developers and operations are working with each other and not against each other?

We affectionately call this the “DevOps” movement; it's a culture where developers and ops work together and not against each other. Often their roles become indistinguishable. Developers are as confident on the command line as ops are in the IDE.

The key to succeeding with DevOps is demonstrating to developers that it will actually make their jobs easier, for example when they need to debug production issues. Likewise, ops needs to see developers as a resource that can reduce the number of sleepless nights caused by failed deployments and buggy code. The role of ops in a DevOps culture is to enable devs to operate more efficiently. The role of devs in the organization is to build applications that are easily deployable.

Here’s what needs to happen. Ops takes the first step by standardizing the way software gets deployed: survey the open source tools currently available and choose one. Then take those recipes and build a local development environment using something like Docker Compose or Vagrant so that developers can start getting familiar with it. Next, developers need to embrace the local dev environment over “native” environments (e.g. those that take a day of configuration and package downloads). Through this process, they build up operational competency by debugging issues in environments that mirror production. After several months of operating like this, developers should shadow ops in certain roles, such as deploying software.

Once the above system is in place, the next step is to increase monitoring coverage and make it transparent to everyone in the org. In addition to your standard checks (like Nagios), you’re going to need to deploy something like DataDog or NewRelic. This gives your developers insight into how their apps are functioning in production and the information needed to diagnose bugs. Tasks get created for every warning or error in production that is not handled properly. Once these alerts are calibrated down to just a few critical incidents per week, they should get wired up to PagerDuty. This is where the most resistance is usually encountered, but it is what keeps everything going. Without skin in the game, developers have no incentive to make sure their code is highly reliable, and ops’ hands are tied when it comes to fixing problems. But with only a few critical incidents, devs will be quickly motivated to fix the bugs and silence the alarms.

The last triumph in this conversion is to achieve “Continuous Integration” (CI) coupled with “Continuous Deployment”. Before this can even be remotely considered, considerable unit test coverage is needed. Stay tuned for more posts on these topics.