How to Safely Manage Change in a CI/CD World
Change management exists to put many eyes and much care on a change before it reaches production systems, in the hope of creating some reliability. Change management processes are disappearing from modern systems because those systems change so rapidly that no review panel could possibly keep up. This article proposes a way to keep that reliability alive while moving to a newer, better methodology.
What do Continuous Integration, Continuous Delivery, and Continuous Deployment Mean?
In software engineering and development, continuous integration (CI) means that code is checked out from the central repository frequently, small changes are made, and those changes are quickly checked back in. This keeps code merges small, which makes them easier to integrate, easier to test, and easier to roll back if a problem is discovered.
Most development teams, especially in the DevOps and site reliability engineering (SRE) realms, are pairing continuous integration with continuous delivery (CD). Continuous delivery creates an automated pipeline that builds and tests software frequently, with far greater speed and accuracy than manual processes allowed. In this version of CI/CD, releases are still done manually, after human eyes check the automated builds.
Some teams are moving even further with a different CD, continuous deployment. Here even the deployment of releases to production is part of the automated pipeline: deployments are gated by a requirement that all tests pass before a release goes out automatically. Automation ultimately makes the release cadence even faster and more predictable. It also means that we need to be sure our test suite is complete and covers what actually matters. We may also want to add some new testing options to our current plans, like Chaos Engineering, for more complete assurance.
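As a rough illustration of the idea (not the API of any particular CI system; run_checks and deploy_to_production here are hypothetical helpers), a continuous deployment gate boils down to something like this:

```python
# Minimal sketch of a continuous deployment gate. The helper names and the
# pytest invocation are assumptions for illustration, not a specific CI tool.
import subprocess
import sys


def run_checks() -> bool:
    """Run the automated test suite; a non-zero exit code means failure."""
    result = subprocess.run(["pytest", "--maxfail=1"])
    return result.returncode == 0


def deploy_to_production() -> None:
    """Placeholder for whatever actually promotes the build (assumed here)."""
    print("All checks passed -- promoting build to production.")


if __name__ == "__main__":
    if run_checks():
        deploy_to_production()
    else:
        sys.exit("Checks failed -- the build is not promoted.")
```

The point is that the gate is the test suite itself, not a meeting: if the checks fail, nothing ships.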
Sometimes people stumble over the use of the word continuous in these terms. What may ease some minds is that what these terms refer to is just another software development methodology. This is not something to fear, but rather to thoughtfully investigate and consider for your use case.
Much of what this article covers is part of a software development and deployment model called DevOps, which is exploding in popularity. DevOps takes traditional development and operations, two groups that have historically worked in silos, interacting only when necessary, and merges them. It combines a set of cultural norms, perspectives, tools, and practices in a way that enhances an organization's velocity in delivering applications, security updates, and new features, along with its ability to compete with rivals. DevOps teams work across an application's lifecycle, automating traditionally tedious and slow practices to enable faster, yet still safe, change.
How is CI/CD Different from What We Did in the Past?
What do we all want? What have we always wanted? Working code in production that is trustworthy, stable, and makes customers happy. We want to avoid downtime.
Consider the IT Infrastructure Library (ITIL), a widely accepted set of detailed practices for managing change, specifically related to information technology (IT). This subset of IT service management (ITSM) prescribes procedures, checklists, processes, tasks, and so on for recording strategy and delivering a baseline for planning and measuring progress. This framework of “best practices” manages risk and promotes stability. Some fields of operation require it, or something similar, to be used.
While change management is especially useful for organizational change, helping enterprises adapt to necessary changes and remove obstacles to them, it is too tedious a method for the pace of software development in IT today.
What did a Typical Change Management Process Include?
In the past, our change management method for ensuring that software was production-worthy worked something like this (with variations depending on team needs, and ignoring the separate approval process for emergency changes):
- Users submit a request for a change or bug fix using a defined change request process.
- Software teams determine whether the change is possible and desirable.
- Teams plan how to write, when to code, and how to implement the change.
- Developers and engineers write code.
- The code is reviewed using a detailed, careful change management evaluation process.
- That code is saved up until a time designated by the change advisory board for new merges into the main repository, generally after representatives from every business unit that might be affected by the change have signed off.
- During the appropriate window, the change is implemented by merging it into the mainline code. Often, many changes are merged nearly simultaneously, requiring engineering time and effort to get all of the changes to merge and deploy successfully.
- The merged changes are pushed to a test or staging environment, where both manual and automated tests are performed. This takes time, and any problems found send the process back to a previous step (which step depends on how the process was implemented).
- When everything is working well in the testing or staging environment, the code is deployed to production. This is typically a big event, sometimes taking a full day or longer, especially when the operations team hits deployment problems that the development team did not anticipate because the two worked in isolation.
Even some software that is created using a service-oriented architecture (SOA) is still being delivered using this traditional method. Testing and deploying services individually is not permitted. Changes take a long time to implement in production. Problems take a long time to fix, even when the solutions are simple.
What is the Advantage of Moving Away From Change Management?
Simply put, a loosely coupled architecture enables scaling.
One frequent problem when applying traditional change management practices to service-oriented architectures is that the services are not loosely coupled and therefore cannot be properly tested individually: a change in one service can impact another service.
This must be fixed before you can effectively use CI/CD and before you can safely benefit from modern cloud-based microservice designs. The UNIX Philosophy still matters. Use it.
When your services are each designed to run independently:
- A service can be replaced by better code that does the same thing.
- Services can be updated with fixes more easily.
- Services can be scaled by adding load balancing and multiple instances when needed. You can even deploy a small number of instances of a new version as a canary release (canary deployment) while the current version carries the main load, as sketched below.
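Here is a minimal sketch of the canary idea (the version labels and weights are hypothetical, and in practice a load balancer or service mesh would handle this rather than application code): route a small, configurable fraction of requests to the new release while the stable release carries the rest.

```python
# Illustrative weighted canary routing (hypothetical versions and weights;
# a real load balancer or service mesh does this, not your service code).
import random

SERVICE_POOL = [
    {"version": "v1.4", "weight": 95},  # current, stable release
    {"version": "v1.5", "weight": 5},   # canary instances of the new release
]


def pick_instance(pool=SERVICE_POOL) -> str:
    """Choose a backend version in proportion to its configured weight."""
    versions = [entry["version"] for entry in pool]
    weights = [entry["weight"] for entry in pool]
    return random.choices(versions, weights=weights, k=1)[0]


if __name__ == "__main__":
    sample = [pick_instance() for _ in range(10_000)]
    print("canary share:", sample.count("v1.5") / len(sample))  # roughly 0.05
```

If the canary misbehaves, you shrink its weight back to zero; if it holds up, you shift more traffic to it until the old version can be retired.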
With traditional change management practices there is a constant struggle between trying to do things safely and trying to do things quickly. CI/CD promises to ease that pain by allowing us to do both, when implemented properly.
Where is Change Management in CI/CD?
The real question that we should be asking is whether we can write and deploy software to production using CI/CD that is just as safe and just as stable as the software released using a traditional change management methodology.
CI/CD promises much faster velocity using small changes that are easy to roll back. If we pair that with microservices that are properly written as isolated and encapsulated functions with clearly defined inputs and outputs, we can. Here’s how:
- Start by testing your services individually. Make sure they are working as designed and are decoupled from the rest of your code. Automate the tests.
- Use unit tests to determine whether individual component services work. Automate the tests.
- Use integration tests to determine whether service components fit together. Automate the tests.
- Use chaos experiments to determine whether the system works when components fail (this is covered in greater depth later in the article). Automate the tests.
- Enforce the small-changes rule. Small changes to stable code are easy to merge, test, and deploy. Everyone starts a change by pulling the latest code from main/master. They make only one change, whether it is a feature addition, a bug fix, or cleanup to pay down technical debt. They then check in that change as an individual merge request. This does not mean that larger additions are forbidden, nor that you can’t change multiple files. It means that any and all changes in a merge request are tied to only one purpose.
- All merge requests get tested immediately. Automatically. By your pipeline. Write automated tests for each service that exercise its inputs and outputs for proper functioning and error handling (a minimal sketch follows this list). Write tests that cover everything you can think of (everything that is actually important…learn the difference as it applies to each service). This is where some types of Chaos Engineering experiments could be extremely useful.
- Employ peer review. Before a merge request is accepted, it must also be reviewed and approved by another developer or engineer on the team who is trusted to know the code base and to be able to accurately evaluate the submission.
- Deploy code that you know works. It passed all the tests. It performed appropriately. It was stable and reliable even while being pressured by chaos and reliability testing schemes. Whether you use continuous delivery and a human presses a virtual button to deploy to production or you use continuous deployment and your pipeline automatically deploys to production, you already feel good about the quality of your code, that it works as designed, and that it will work with the rest of your architecture.
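As a minimal sketch of the kinds of automated tests described above (get_price and the two client classes are hypothetical, and the tests assume pytest as the runner), here is a unit test for the happy path plus a fault-injection test that verifies the service degrades gracefully when a dependency fails:

```python
# Hypothetical service function and tests, runnable with pytest.
def get_price(sku, inventory_client):
    """Look up a price; degrade gracefully if the dependency is unreachable."""
    try:
        return inventory_client.lookup(sku)
    except ConnectionError:
        return None  # fall back instead of crashing the caller


class HealthyClient:
    def lookup(self, sku):
        return {"sku": sku, "price": 9.99}


class FailingClient:
    def lookup(self, sku):
        raise ConnectionError("simulated dependency outage")


def test_returns_price_for_valid_sku():
    # Unit test: the happy path works as designed.
    result = get_price("ABC-123", HealthyClient())
    assert result["price"] == 9.99


def test_degrades_gracefully_when_dependency_fails():
    # Fault-injection ("chaos-style") test: the service survives a failing dependency.
    assert get_price("ABC-123", FailingClient()) is None
```

The same pattern scales up: the pipeline runs these checks on every merge request, and a failure blocks the merge rather than reaching production.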
Follow these guidelines and enable development teams to reach higher performance than you dreamed possible.
We Should Accelerate Deployment of Quality Code
In their book, Accelerate, Nicole Forsgren, PhD; Jez Humble; and Gene Kim present research about what works and what doesn’t in software development. They have some strong advice on this topic. Here are some of their research findings:
- Companies that are higher performers in deploying software are twice as likely to exceed objectives in quantity of goods and services, operating efficiency, customer satisfaction, quality of products or services, and achievement of organizational or mission goals.
- High-performing companies have, on average:
  - 46 times more frequent code deployments
  - 440 times faster lead time from commit to deploy
  - 170 times faster mean time to recover from downtime
  - 5 times lower change failure rate
- Enabling services to be independently tested and deployed is the biggest contributor to continuous delivery.
- Teams that reported no approval process or used peer review achieved higher software performance.
The overall recommendation from the book is that companies and teams wanting to release the best-quality software should implement, or transition to, a lightweight change approval process based on peer review, combined with a deployment pipeline designed to detect and reject bad changes.
What we need is less process overhead and more direct, automated testing of the code.
Based on research data, it seems that this methodology is quantitatively producing safer and better code!
How Do We Know that the Code We Deploy Will Work?
The first priority is to hire software engineers and developers you can trust and give them an atmosphere in which to work successfully together. Junior engineers working in teams should be paired with senior engineers for training and support.
Create a culture where asking questions and working together is valued and encouraged. Trust them to come up with good solutions and to properly review one another’s work, so that mistakes are treated as problems to fix together rather than occasions for blame or shame.
Write good test definitions for the automated pipeline. Those working on the code know what it should do and what it should not do. Test and make sure that what you think works actually does. Do not disable testing just to get a build out the door.
Start implementing simple rules around frequently checking out code, writing small (atomic) changes, and checking in merge requests frequently.
Gremlin can help with this. Our focus is on helping create reliable, resilient software through Chaos Engineering. To do this, we frequently help enterprise customers accelerate the change to CI/CD processes. We also help build confidence and foster a good DevOps culture while lowering deployment pain, as customers implement good Chaos Engineering practices in their pipelines, for example alongside the Spinnaker CI/CD tool or the Puppet host management tool.
Introducing GameDay scenarios into some of these Web-scale companies has initiated a difficult cultural shift from a steadfast belief that systems should never fail - and if they do, focusing on who's to blame - to actually forcing systems to fail. Rather than expending resources on building systems that don't fail, the emphasis has started to shift to how to deal with systems swiftly and expertly once they do fail - because fail they will.
From a moderator comment in acmqueue, Volume 10, issue 9, September 13, 2012 in the article Resilience Engineering: Learning to Embrace Failure: A discussion with Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli: GameDay Exercises Case Study.
(Author’s note: I added the GameDay link to the quote, which is otherwise unchanged.)
Gremlin helps with continuous delivery in three main areas:
- By quickly validating the impact that engineering work has on dependencies as they are decoupled, preventing unplanned work or rework.
- By giving engineers confidence in their deployments by allowing them to control and observe the cause and effect relationship that failure has on their service and on their environment as a whole.
- By giving engineers the opportunity to proactively find and fix failure, creating a culture that appreciates failure as a means of learning so that we can ultimately create and enhance reliability.
We recognize that there are places where change management is a hard requirement. We help these customers streamline their processes while making their software more reliable by implementing Chaos Engineering practices in that setting.
Making the Decision to Use Chaos Engineering with CI/CD
One of the hardest decisions we must make in managing software development and its life cycle is whether to change how we work. It is easy to keep doing what we are doing, even when we know that our current process is inefficient or problematic.
Combined with that, learning to distinguish which software is strategic and which isn’t is of enormous importance. This helps us prioritize our efforts on making change where it is most useful first.
Moving from waterfall methods to agile methods was a big change. It didn’t necessarily transform our production deployment schedules, but it did change how quickly we could respond to customer input on code we are creating and testing.
Moving from a monolithic architecture to a service-oriented architecture, and even to microservices, is a big change. Those who have made it report that it solves some big problems, especially around isolating functions and, ultimately, improving our code’s reliability and responsiveness under load.
Moving from a change management process to a CI/CD pipeline is also a big change. The idea of monthly or weekly builds and deployments scares some. Others are moving even faster with nightly builds or even more frequent builds throughout a single day.
Ultimately, implementation matters most. Doing things well and testing what we do, including testing for things that we might not have anticipated by injecting a little Chaos Engineering into our build pipeline, will help us succeed.