
If you’ve worked in software for any significant amount of time, you’ve probably (at the very least) heard the term DevOps. If you’re reading this post, I’m guessing it’s because you’re trying to figure out if it’s just a buzzword or something that can help your business. I can promise that it does have real meaning with significant implications for how you organize your teams and coordinate your software development efforts.
I’m confident that once you internalize these concepts, not only will you appreciate them, you’ll become a DevOps evangelist like me.
What is DevOps and what are its benefits?
DevOps, a blend of “development” and “operations,” is a methodology that encourages collaboration between software development and IT operations on the same team to increase efficiency, reduce software lead times, and improve code quality. The goal of this methodology is to build and deploy applications faster with fewer errors. This is accomplished by establishing a collaborative and mutually beneficial relationship between what would traditionally be two separate teams: software development and IT operations.
The benefits of implementing DevOps in your organization are numerous. First, it can help you achieve a shorter time-to-market for your applications. DevOps heavily leverages automation, eliminating many sources of human error while also improving efficiency. And with development and IT operations working together closely, DevOps engineers can identify and fix errors earlier in the cycle using metrics and visibility tools, with fewer handoffs and less finger-pointing.
DevOps can also help improve morale and team cohesion. By bringing software development and IT operations together under one roof, you create an environment where people are working towards a common goal and can share knowledge and best practices. This leads to more productive, collaborative software development and operations teams.
The history of DevOps and how it has evolved
You’re probably familiar with the workflows of traditionally siloed teams. They go something like this:
The development side has a set of initiatives. Things like adding features and fixing bugs. They put these things on a roadmap, they estimate when they think they can deliver it, and they schedule a release for that date.
IT operations have another set of initiatives: things like infrastructure upgrades to servers, databases, and switches. They also handle one-off requests that come in, like DNS and firewall changes. And lastly, the most fun and exciting of all: they put out fires.
The software development team is always late with features. After pushing the release date out for multiple quarters, begging customers for forgiveness each time, they finally get a release working with the magic set of dependencies that isn’t complete trash. So they start their journey through merge conflict hell. When they get their final build, they throw it over the fence to ops.
Ops has issues getting the release running in production, and with lots of finger-pointing and proclamations of “works on my machine”, folks start to table flip and rage quit. Good times.
I’m here to tell you, people: there’s a better way.
Let’s take a quick look at the history of the DevOps model to understand how it will help you gain operational efficiencies by using agile principles borrowed from the manufacturing industry.
The Toyota Production System and Lean Thinking
The roots of the DevOps model can be traced back to the Toyota Production System (TPS) and its focus on continuous improvement and small batches. The strategies of the TPS were eventually abstracted into Lean Thinking, with a few of those values being applicable to software development.
- Customer satisfaction needs to be built into the enterprise’s process, with quality built in at every step.
- Work needs to be completed on a line, making it easy to identify bottlenecks and non-value-add tasks.
- Batch sizes should be small to facilitate flow.
- Work should be prioritized and pulled from an upstream authority.
Lean thinking blazed right through the manufacturing industry, and beyond.
When athletic shoe company New Balance implemented Lean, they were able to reduce the time to make a pair of shoes from 9 days to 4 hours.
When the Mayo Clinic implemented Lean, cancellations and no-shows dropped from 30% to 10%, and physician fill rates went from 70% to 92%.
When Oral-B implemented Lean as a last-ditch effort to save a failing Iowa City plant, it reduced costs by 18%, reduced headcount by 38%, increased productivity by 55%, and reduced lead times to one week, transforming it into one of the best-performing plants in the company.
You probably see where I’m going with this. By repurposing these agile practices from Lean thinking through DevOps, software companies can gain a similar competitive advantage.
I’ll dive deeper into exactly how the DevOps model accomplishes this in the next section.
How does DevOps work?
The DevOps model is known for helping companies achieve rapid delivery to stakeholders, rapid deployment to production environments, and continuous improvement of code quality.
There are three main ways DevOps achieves this.
- It moves operations teams from an active role to a passive role in the release process.
- It helps operations teams eliminate wasted effort.
- It provides a quick feedback loop that plugs directly into the agile software development lifecycle.
Moving Operations Teams from an Active Role to a Passive Role
With traditional IT operations, system administrators follow an imperative procedure to build, test, and deploy software. Instructions are given in a specific order, and the system must be configured step-by-step in that order. This can lead to errors if a step is missed or misinterpreted. It’s a very labor-intensive process. It’s also difficult to track changes or revert to an earlier version.
DevOps teams take a more passive role in the build, test, and deploy phases. Instead of executing steps by hand, they maintain automated software pipelines. Testing and production environments are defined declaratively, often as infrastructure as code on cloud computing platforms: the desired state of the system is described in a configuration management tool, and that tool is responsible for making whatever changes are necessary to reach the desired state. This eliminates errors caused by misinterpretation or missed steps, and it makes it easy to track changes or revert to an earlier version.
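To make the declarative model concrete, here’s a toy sketch in Python. The resource definitions, names, and the reconciler are simplified stand-ins for what a real configuration management tool does; the point is that you describe the desired state, and the system computes and applies the difference:

```python
# Toy illustration of the declarative model: describe the desired state,
# and let a reconciler figure out what to create, update, or delete.
# Resource names and specs here are hypothetical.

desired_state = {
    "web-server": {"image": "nginx:1.25", "replicas": 3},
    "app-db": {"image": "postgres:16", "replicas": 1},
}

current_state = {
    "web-server": {"image": "nginx:1.24", "replicas": 3},
    "old-cache": {"image": "redis:7", "replicas": 1},
}

def reconcile(current: dict, desired: dict) -> list[tuple[str, str]]:
    """Compute the changes needed to move `current` to `desired`."""
    changes = []
    for name, spec in desired.items():
        if name not in current:
            changes.append(("create", name))
        elif current[name] != spec:
            changes.append(("update", name))
    for name in current:
        if name not in desired:
            changes.append(("delete", name))
    return changes

for action, resource in reconcile(current_state, desired_state):
    print(f"{action}: {resource}")  # a real tool would apply each change
# -> update: web-server / create: app-db / delete: old-cache
```

Because the desired state is just a versioned file, change tracking and rollback come along for free: revert the file, and the reconciler walks the system back.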
Eliminate Wasted Effort
When development and operations teams are siloed, they have traditionally taken direction from different authorities. Development teams pull work from the product owners. Operations teams, meanwhile, often juggle three types of work: project work, maintenance, and fires.
Project work is often aligned with the development team’s goals, but not always. It could be internally initiated projects, walk-up requests, or random pet projects from various business units.
Agreeing to take on any conceivable work related to IT is a recipe for burning through time and resources.
DevOps treats software development initiatives as the single source of truth for work, which ensures that operations work always has a strong justification in well-defined business value.
This doesn’t mean DevOps engineers stop doing maintenance. It means that maintenance is scoped to what’s needed to keep the production environment running smoothly, and optimizations are performed only at the bottleneck, which is a DevOps practice taken directly from Lean thinking.
Fast Feedback Loops
Agile practices emphasize breaking down software projects into small scopes of work and iterating over the development process. The DevOps model supplements agile development with data obtained from metrics collection and monitoring tools. This allows DevOps teams to provide feedback to the other stages of the software development lifecycle, and drive a system of continuous improvement with pragmatic, data-driven decisions. In turn, the development teams use this data to improve software quality.
CI/CD automation allows the appropriate data to be collected at the appropriate time, ensuring that DevOps teams always have a good baseline and visibility into the system.
The keys to an effective DevOps culture
From what I’ve seen, there are a few DevOps principles that are key to building a highly effective DevOps culture.
- Strong collaboration between software development and IT operations (hence the name)
- Automation of all the things
- Monitoring and metrics
In this section, I’ll dig into each of these and show how they work and why they’re important.
Communication and Collaboration between Software Development and Operations Teams
Both software development and IT operations teams need to be working towards the same goal. Both teams need to share ideas and solve problems together.
Both teams should be pulling in work from the same upstream authority. For software engineers, this won’t change much. For systems engineers, this may require adopting a DevOps mindset.
For example, traditionally if your storage vendor released a firmware update, your storage engineer would schedule some downtime and apply the update. The business justification for this work would be something along the lines of improving performance or hardening a security vulnerability.
In a DevOps culture, that firmware update would most likely be deprioritized. From Lean thinking: any optimization outside of the bottleneck is an illusion. So, unless storage performance is a bottleneck for the application, that firmware update is wasted effort.
By contrast, with a team organized around DevOps, the engineer would be pulling work from the Kanban board to allocate a new storage partition for a new microservice that’s ready to be deployed to production. Or better yet, they’d set up a storage API that the microservice could use to provision its own storage through a declarative configuration. Either way, the software development and IT operations goals are aligned.
CI/CD Automation and Infrastructure Provisioning
One of the most powerful components of the DevOps model is the embracing of automation. Automating the deployment of software and provisioning of infrastructure can help to speed up the software development process and reduce human error.
Continuous Integration and Continuous Delivery
Continuous integration and continuous delivery (CI/CD) refers to a set of automation practices and tools that help your organization ship higher-quality software with shorter lead times by enforcing code quality standards and eliminating error-prone manual processes.
Let’s dig deeper into each of these concepts.
Continuous Integration
To achieve the goal of continuous delivery, it’s important to have a Continuous Integration (CI) process in place. This is where code is checked in multiple times a day and automated builds and tests are run to ensure that the code is not breaking any existing functionality.
Continuous integration complements agile development in that the commits should be small and frequent. To leverage the full power of continuous integration, DevOps teams should put in the time to include static code analysis, linting, and fully comprehensive unit tests, integration tests, and end-to-end tests in the continuous integration pipeline.
Continuous integration uses the concept of a release pipeline: a series of tasks run against the new code as it works its way through testing, staging, and production. The tasks in a release pipeline usually include compiling code, testing code, packaging code, deploying code, and monitoring the system as the new version runs through testing to production.
The release pipeline typically kicks off in response to some kind of source code management event; like pushing or tagging a code commit. Next, a series of tests and checks are run against the new code. I try to organize these tasks so that the fastest ones run first, and the longest-running tasks run last. Typically I start with linters, static code analysis, and unit tests. Next comes integration and end-to-end tests, followed by some level of fuzz testing if applicable to the software project.
The DevOps team is responsible for building the release pipeline and configuring all of the tasks. I recommend building highly comprehensive checking and testing. You only have to set it up once and it will be run each time a developer commits code. Because of the automated nature of continuous integration, this effort provides outstanding value.
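To illustrate the fastest-tasks-first ordering, here’s a minimal sketch of a fail-fast pipeline runner in Python. The stage names and commands are illustrative (your project may use different linters and test suites), and in practice this ordering would live in your CI tool’s configuration rather than a script:

```python
# Minimal sketch of a fail-fast release pipeline: cheap checks run first so
# a bad commit is rejected in seconds, not hours. The commands below
# (ruff, mypy, pytest) are illustrative choices, not requirements.
import subprocess
import sys

STAGES = [
    ("lint", ["ruff", "check", "."]),
    ("static analysis", ["mypy", "src"]),
    ("unit tests", ["pytest", "tests/unit"]),
    ("integration tests", ["pytest", "tests/integration"]),
    ("end-to-end tests", ["pytest", "tests/e2e"]),
]

def run_pipeline() -> None:
    for name, command in STAGES:
        print(f"--- {name} ---")
        result = subprocess.run(command)
        if result.returncode != 0:
            # Stop at the first failure; later (slower) stages never run.
            sys.exit(f"pipeline failed at stage: {name}")
    print("all stages passed; artifact is ready to package")

if __name__ == "__main__":
    run_pipeline()
```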
Continuous Delivery
Once you have a continuous integration process in place, you’re ready for continuous delivery. Continuous delivery is about deploying those build artifacts to the production environment. Small code changes are pushed into production continuously, rather than in large-scale deployments where the entire code base is updated at once.
Continuous delivery has many benefits, including helping development teams get a faster ROI from software development efforts, getting real-time feedback from users on new features, and reducing the frequency and blast radius of bugs. Most compelling of all are its robust strategies for handling deployment problems.
Blue-Green Deployments
The simplest of these deployment strategies is the blue-green deployment. In a blue-green deployment, two versions of your application are live at the same time. The blue version is the current working version, while the green version is the new version that you are testing.
If everything goes according to plan, the DevOps team can switch all of the traffic over to the green version without any downtime. This is possible because the blue version is still running in parallel, and can take over if there are any problems with the green version.
If they encounter any problems with the green version, the DevOps team can quickly switch back to the blue version without losing any data or impacting the customer experience. This makes it easy to test new features or changes without putting your website or application at risk.
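Here’s a toy sketch of the core blue-green mechanic: a router holds a pointer to the live environment, and cutover (or rollback) is a single pointer flip. The environment URLs and the health check endpoint are hypothetical:

```python
# Toy blue-green switch: the router points at the live environment, and
# cutover or rollback is one atomic pointer flip. URLs are hypothetical.
import urllib.request

ENVIRONMENTS = {
    "blue": "https://blue.example.internal",
    "green": "https://green.example.internal",
}

class Router:
    def __init__(self, live: str = "blue") -> None:
        self.live = live

    def switch_to(self, target: str) -> None:
        if self._healthy(ENVIRONMENTS[target]):
            self.live = target  # all new traffic now goes to `target`
        else:
            print(f"{target} failed its health check; staying on {self.live}")

    @staticmethod
    def _healthy(base_url: str) -> bool:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

router = Router(live="blue")
router.switch_to("green")   # cut over after deploying the new version
# router.switch_to("blue")  # instant rollback if problems appear
```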
Canary Deployments
Taking the idea of blue-green deployments a step further, the next continuous delivery technique is the “canary release” strategy. In a canary release, new code is first released to a small percentage of users (usually 1-5%), and then monitored closely for any issues. If there are no problems found, the code is then released to the rest of the users. This allows DevOps engineers to roll out new code in a controlled and safe manner, minimizing the risk of any negative impacts on the system.
If there are no issues with the initial canary group, the DevOps pipeline can resume updating the remaining users with the new version. This is handled by various DevOps tools, usually referred to as “schedulers”.
If there are problems, similar to how this is handled in a blue-green deployment, the version is rolled back. Note the difference between a blue-green deployment and a canary deployment isn’t just in the size of the initial test group, but how the migration is performed. Blue-green deployments publish code by updating a traffic control mechanism, like a load balancer. Canary deployments publish code at the server level.
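Here’s a hedged sketch of the canary logic. The `deploy` and `error_rate` functions are hypothetical stand-ins for your deployment tooling and monitoring queries:

```python
# Sketch of a canary rollout: deploy to a small slice of servers, watch the
# error rate, and only then roll out to everyone.
import random

def deploy(servers: list[str], version: str) -> None:
    print(f"deploying {version} to {len(servers)} server(s)")

def error_rate(servers: list[str]) -> float:
    return random.uniform(0.0, 0.02)  # stand-in for a real metrics query

def canary_release(servers: list[str], version: str,
                   canary_fraction: float = 0.05,
                   max_error_rate: float = 0.01) -> bool:
    cutoff = max(1, int(len(servers) * canary_fraction))
    canary, rest = servers[:cutoff], servers[cutoff:]

    deploy(canary, version)
    if error_rate(canary) > max_error_rate:
        deploy(canary, "previous-version")  # roll the canary group back
        return False

    deploy(rest, version)  # canary looked healthy; release to everyone
    return True

fleet = [f"server-{i}" for i in range(100)]
canary_release(fleet, "v2.0.0")
```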
Rolling Deployments
Canary deployments are a great way to test updates before rolling them out to all of your nodes. But what if you have a large application with lots of nodes? Rolling deployments are a better option in this case because they address the challenges that come with scaling software.
Once your software gets to a certain size, some baseline level of errors becomes normal. Your DevOps team will move away from alerting on individual errors and instead start tracking trends. Site Reliability Engineering will determine what types of errors are acceptable and how frequently they can occur before the deployment is considered to have a problem.
Rolling deployments are similar to canary deployments, but instead of testing a small group and deploying the remaining nodes on success, rolling updates choose a batch size and gradually deploy updates over time.
For example, with a fleet of 1000 nodes, you could choose a batch size of 10. When the appropriate code management event fires off a deployment, only 10 nodes will be updated. As long as the monitoring system doesn’t detect any changes in the trends of the type and frequency of errors, the scheduler will continue deploying the updates, 10 nodes at a time until all nodes are updated.
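A minimal sketch of that batching logic, with the hypothetical `deploy_batch` and `trends_look_normal` hooks standing in for your scheduler and monitoring system:

```python
# Sketch of a rolling deployment: update the fleet in fixed-size batches,
# pausing after each batch to check that error trends stay within baseline.
from typing import Iterator

def batches(nodes: list[str], size: int) -> Iterator[list[str]]:
    for i in range(0, len(nodes), size):
        yield nodes[i:i + size]

def deploy_batch(nodes: list[str], version: str) -> None:
    print(f"updating {nodes[0]}..{nodes[-1]} to {version}")

def trends_look_normal() -> bool:
    return True  # stand-in for a real check against the monitoring baseline

def rolling_deploy(nodes: list[str], version: str, batch_size: int = 10) -> None:
    for batch in batches(nodes, batch_size):
        deploy_batch(batch, version)
        if not trends_look_normal():
            print("error trends shifted; halting rollout for investigation")
            return
    print("all nodes updated")

fleet = [f"node-{i:04d}" for i in range(1000)]
rolling_deploy(fleet, "v2.0.0", batch_size=10)
```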
I don’t recommend rolling updates for smaller applications because of the added complexity, and the fact that it’s harder to partition a small number of nodes into batches. You end up with configurations like 5 batches of size 1. In this case, you might as well just simplify and use blue-green, or canary deployments.
Monitoring and Metrics
The decisions these DevOps automation systems make are driven by metrics produced in the testing, staging, and production environments, fed back through the DevOps lifecycle.
Metrics are also important because they help your DevOps team understand how the system is performing and identify areas for improvement. They should be collected at a variety of levels, from individual services and environments up to the enterprise level.
The goal is to have visibility into the entire system so that problems can be identified and resolved quickly. DevOps metrics can be divided into two categories:
- End-user experience
- Computational resources
Let’s dig into the importance of each one of these.
End-user experience metrics
End-user experience metrics are important because they define the performance of the application from the end user’s perspective. This includes response time, throughput, and error rate.
If you aren’t monitoring your end users’ experience, then you’re relying on them to let you know there is a problem through Twitter or a customer service email. This is one of the worst possible mistakes you can make. Not only does this put a burden on your users to troubleshoot the problem on their end, but it puts a burden on your support team to determine if the issue is in your production environment.
Consider the user experience of two different failure scenarios.
- A user visits your site. Your back-end throws an error. There are no end-user experience metrics, so your system engineers aren’t aware of the issue. Worst case, the customer is turned off of your product, fires off an angry and disparaging tweet, and churns. Best case, they’re able to determine that the problem is on your end, and they email your customer service. Because high-traffic web applications often throw some kind of error some of the time, it’s hard to tell whether there’s a real problem. It may get resolved quickly, or it may keep intermittently disrupting users for months.
- A user visits your site. Your back-end throws an error. This happens from time to time, but this particular error is outside of the normal distribution, so your alert system fires off a notification. A DevOps engineer sees the uptick in the frequency of this particular error and correlates it back to some recent code changes. The DevOps engineer activates a feature flag that displays a banner letting users know that the back-end may be experiencing issues, and to try back in a few minutes. Then the DevOps engineer rolls back the deployment. After the errors stabilize, they remove the notification banner and open a bug ticket identifying the error and the deployment that caused it. The bug ticket gets prioritized and pulled into the next sprint. End users retry their operation and resume work, satisfied with your application.
It’s easy to see how the DevOps methods in the second scenario lead to much better business outcomes.
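The alerting step in that second scenario (an error rate “outside of the normal distribution”) can be as simple as comparing the current rate against a rolling baseline. A minimal sketch, with made-up numbers:

```python
# Trend-based alerting: compare the current error rate for an endpoint
# against a rolling baseline, and alert only on anomalies. The metrics
# source and notification channel are hypothetical.
from statistics import mean, stdev

def should_alert(history: list[float], current: float,
                 sigma_threshold: float = 3.0) -> bool:
    """Alert when `current` is far above the baseline's normal spread."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline, spread = mean(history), stdev(history)
    return current > baseline + sigma_threshold * spread

# Hourly error rates (errors per 1,000 requests) for one endpoint.
recent_error_rates = [1.1, 0.9, 1.3, 1.0, 1.2, 0.8, 1.1]

if should_alert(recent_error_rates, current=4.7):
    print("error rate anomaly: page on-call and consider rolling back")
```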
Computational resource metrics
Computational resource metrics are important because they help to identify areas where the system is under stress and may need more resources to support the load. This includes CPU utilization, memory usage, and disk I/O.
Unless you’re completely new to IT, computational resource metrics don’t need much explanation. Server runs out of memory, server no worky, app go boom.
Collecting both sets of metrics is important to have a complete view of the system’s performance. It helps DevOps teams establish an empirical performance baseline for the application. This baseline can then be used to determine when problems are introduced, and to make decisions about when to roll back a change.
This satisfies the Lean thinking tenet regarding customer satisfaction and built-in quality.
Case studies of companies that have successfully implemented DevOps
If you’re a pragmatic, data-driven decision-maker like I am, you might be wondering if there is any hard data to back up all of this rhetoric. So here I’ve compiled some noteworthy case studies of companies who’ve tracked the results of their DevOps initiatives.
Goldman Sachs used GitLab with DevOps practices to improve from one build every two weeks to over a thousand builds per day, with 52,000 test cases across 11,000 test classes. The new platform took their engineering team by storm, with over 1,500 adopters in the first two weeks.
REI used New Relic Application Performance Management to track down a code change that was generating excess database calls during their yearly anniversary sale and improve their site’s performance by 20%.
CapitalOne leveraged infrastructure as code, metrics, and monitoring on AWS to reduce the time needed to build new application infrastructure by 99%.
It’s hard to find case studies for a methodology because there is no specific product attached to it. Nobody can sell you “DevOps”. But consider the data points above and imagine getting all of these results under one roof; the benefits compound.
According to the State of DevOps 2021, which surveyed over 32,000 professionals, high-performing organizations achieved the following benefits:
- 973x more frequent deployments
- 6570x faster recovery from failures
- 3x lower change failure rate
- 6570x shorter lead times (not a typo, but coincidentally the same improvement as recovery time)
These high-performer metrics have grown since the State of DevOps 2016, which reported similar measures, along with:
- 22% less time spent on unplanned work and rework
By now, I’m sure you see the benefits of this new operations model, and the gears are probably turning as to how it can give your business a competitive advantage. So, how do you transition from traditionally structured teams to DevOps?
How to Get Started with DevOps
The shift to DevOps can be a bit overwhelming if you’ve never worked under this model. Here are some low-hanging-fruit strategies that are fairly easy to implement and won’t cost much.
Embed DevOps engineers with the software development team
One of the quickest and easiest ways to start building a bridge between development and operations is to embed an ops engineer in one or more development teams. They should be in the scrum meetings and be available to pull in stories around operational dependencies. Choose a senior-level operations engineer, ideally one familiar with one or two programming languages, which will help with some of the automation.
If the DevOps people feel like they’re getting too much exposure to code development issues, you can create a separate scrum for just DevOps issues.
Get DevOps to pull work from Kanban
Your software engineers are probably familiar with the development processes that agile teams follow. But people with a systems background may need to adjust.
These people are used to being heroes, constantly helping people in need. They need to get comfortable telling people “no”, so that work can be prioritized appropriately.
Prioritize pipeline automation
Choose a CI tool and start with small automation tasks. Linters and code analysis are a good start because they’re easy to set up and present no risk.
Automate test coverage reports, and export the metrics to a monitoring tool. You can use these metrics to show progress as you build out a more robust test suite and prepare for continuous delivery.
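As one possible starting point, here’s a sketch of exporting a coverage percentage after running coverage.py in CI. It assumes `coverage run -m pytest && coverage json` has already produced a `coverage.json`; the metrics endpoint is hypothetical, so point it at whatever your monitoring tool ingests:

```python
# Sketch: read coverage.py's JSON report and push the percentage to a
# Prometheus-style push gateway. The gateway URL is hypothetical.
import json
import urllib.request

with open("coverage.json") as f:
    percent = json.load(f)["totals"]["percent_covered"]

payload = f"test_coverage_percent {percent:.1f}\n".encode()
req = urllib.request.Request(
    "http://metrics.example.internal:9091/metrics/job/ci",  # hypothetical
    data=payload,
    method="POST",
)
with urllib.request.urlopen(req, timeout=5) as resp:
    print(f"pushed coverage {percent:.1f}% (status {resp.status})")
```

Charted over time, this one number gives you a visible trend line to show progress as the test suite matures.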
The future of DevOps
So what does the future hold for DevOps? What I’m seeing now is the move toward specialized variations that solve problems for specific niches of information technology.
- Increased focus on security. As more and more businesses move their applications and data to the cloud, security becomes a top priority. Those with established security teams will move to DevSecOps.
- Increased focus on data. Companies with specific data requirements will further specialize their operations teams. Teams that collect and store big data will move to DataOps. Teams working on machine learning projects will move to MLOps.
- Increased focus on large-scale applications. Once your application grows to a certain size, you’ll start running into some particularly challenging engineering problems. These shops will stand up Site Reliability Engineering teams to work alongside DevOps.
How we help companies transition to DevOps
Companies often face several challenges making the transition to DevOps, whether skill gaps or cultural issues. The transition takes considerable effort in itself, and attempting it while juggling your team’s existing responsibilities can be a real challenge. Compounding the problem, when your existing processes demand that you cut corners to do more with less, it often results in a negative feedback loop.
BitLightning can help you make this transition. Our team has years of experience with DevOps technologies and with proving the business case by enabling your teams to deliver higher-quality software in less time. We provide consultation, proofs of concept, training, and support so you can make the most of readily available open-source tools and automation techniques. We’re owned and operated by engineers, so when we promise you something, the buck stops with us.
To get started, schedule a risk-free consultation with a principal engineer today.
Start your DevOps transformation today!
Book a free consultation and learn how to increase velocity while improving code quality by aligning development and operations teams.
If you’d like to be notified when I publish more articles like this, sign up for our newsletter below.