devops Archives – Cloud Posse

A question that often comes from well-established organizations with “mature” infrastructures is the following:

How can an organization instill a new engineering culture where developers and operations are working with each other and not against each other?

We affectionally call this “DevOps” movement; it's a culture where developers and ops work together and not against each other. Often their distinct roles are indistinguishable. Developers are confident on the command-line just as much as ops are confident in the IDE.

The key to succeeding with Devops is demonstrating to developers that it will actually make their job easier, such as when they can debug issues. Likewise, Ops needs to see the developers as a resource who can reduce the number of sleepless nights they experience as a result of failed deployments and buggy code. The role of ops in a DevOps culture is to enable devs to operate more efficiently. The role of devs in the organization is to build applications which are easily deployable.

Here’s what needs to happen. Ops needs to take a first step in standardizing the way software gets deployed. Start with taking a look at the current open source tools available. Choose one. Then they take those recipes and build a local development environment using something like Docker Compose or Vagrant so that developers can start getting familiar with it. Next, developers need to embrace the local dev environment over “native” environments (e.g. those that take a day of configuration and downloading packages). Through this process, they build up operational competency in debugging issues in environments that mirror production. After several months of operating like this, developers should shadow ops in certain roles, such as deploying software.

Once the above system is in place, the next step is to increase monitoring coverage and make this transparent to everyone in the org. In addition to your standard checks (like Nagios), you’re going to need to deploy something like DataDog or NewRelic. This gives your developers insights into how their apps are function in production and the necessary information to diagnose bugs. Tasks get created for every warning or error in production that is not handled in the proper manner. Once these alerts are calibrated to less than a few critical incidents per week, they should get wired up with PagerDuty. This is where the most resistance is usually encountered, but this is what keeps everything going. Without skin in the game, developers have no incentive make sure their code is highly reliable and ops are tied in what they can do to fix problems. But with only a few critical incidents, devs will be quickly motivated to fix the bugs and silence the alarms.

The last triumph in this conversion is to achieve “Continuous Integration” (CI) coupled with “Continuous Deployment”. Before this can even be remotely considered, considerable unit test coverage is needed. Stay tuned for more posts on these topics.