Change Failure Rate is a crucial metric in software development that gauges how often implemented changes fail. In this blog post, we explore why understanding this rate is vital for ensuring the stability and reliability of your software releases.
In today's dynamic environment, technology companies struggle to evaluate the performance of their development teams. Various studies have aimed to define objective criteria that make engineering teams more agile and efficient in the software development process.
Over the past 10 years, more than 30,000 IT professionals worldwide have participated in the global DevOps acceleration survey run by the DORA (DevOps Research and Assessment) group, founded in 2015 and later acquired by Google. In 2018, the group published its study Accelerate, introducing the DORA metrics: four key metrics that distinguish high-performance development teams from average ones. The metrics are listed below, and in this article we focus on one of them, Change Failure Rate (CFR), which measures the percentage of changes to production that result in degraded service (or incidents). We will clarify the role of CFR in tech companies and the need to identify and manage it properly.
A Quick Introduction to DORA Metrics
The development team makes changes that are deployed to production and released to users. Some of these changes will cause incidents (e.g., a service outage or the need for a hotfix), so the Change Failure Rate is the percentage of changes that result in service degradation.
Along with CFR, the DORA group defines three more metrics (together known as the DORA metrics; a minimal calculation sketch follows the list):
- Deployment Frequency (DF) – measures the frequency at which code is deployed successfully to a production environment. Engineering teams want to deliver new features to clients as soon as possible, so this metric is useful to understand how frequently that happens.
- Lead Time For Change (LTFC) – this metric shows how long it takes to bring a new change to a production environment by looking at the elapsed time between the first commit and when that code is successfully running in production.
- Failed Deployment Recovery Time (formerly Mean Time to Recovery, MTTR) – measures the time needed to restore service when an incident or a defect that impacts users occurs.
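To make these definitions concrete, here is a minimal Python sketch that computes deployment frequency and lead time for change from a toy deployment log. The log structure and timestamps are invented for illustration and not tied to any particular tool:

```python
from datetime import datetime, timedelta

# Hypothetical deployment log: (first commit time, production deploy time).
deployments = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 2, 15, 0)),
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 3, 11, 30)),
    (datetime(2024, 5, 3, 8, 0), datetime(2024, 5, 3, 17, 0)),
]

# Deployment Frequency: deployments per day over the observed window.
window_days = (deployments[-1][1] - deployments[0][1]).days or 1
print(f"DF: {len(deployments) / window_days:.1f} deploys/day")  # DF: 3.0 deploys/day

# Lead Time for Change: average time from first commit to running in production.
lead_times = [deployed - committed for committed, deployed in deployments]
print("LTFC:", sum(lead_times, timedelta()) / len(lead_times))  # LTFC: 21:30:00
```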
DORA metrics give team leaders reliable insights: they can analyze the metrics to assess team performance and then improve their development processes. By looking at CFR and MTTR together, you can ensure that your code is robust and stable while reducing failures.
You need to work on speed and stability at the same time; you can't improve only one of them. Ship faster without guardrails and incidents will likely increase; take more time to code for quality and you sacrifice speed, losing the pace your business needs.
What is Change Failure Rate - A Definition
Change failure rate measures the percentage of deployments that cause a failure in production and need to be fixed or rolled back after they are deployed. In other words, it tells you how many of the changes deployed to production resulted in an incident. A high change failure rate indicates that your team didn't pay enough attention to testing, or that its deployment process is not efficient enough. CFR therefore measures the stability and quality of your team's deployments.
What is a Failure?
There is no universal definition of a failure; different organizations have different criteria for what failure means in their operations. Each organization therefore needs to define the term at a high level, so that all teams recognize and record failures consistently when they occur. Here are some cases that engineering teams commonly treat as failures:
- Incidents captured by an incident management tool (e.g., PagerDuty, Zenduty). Teams usually need high uptime for their product or service, so such a tool can be a good way to track failures.
- System incidents captured in monitoring tools such as Datadog, AWS CloudWatch, or New Relic; essentially, your system is degraded or down.
- Errors produced by the application, captured in an error tracker such as Sentry.
- Bugs that affect users, reported in an issue tracker (like Jira).
- Incident severity: some teams have incident classification by level, and only react to deployments that caused major outages, such as when the application is not accessible to the clients.
- Need for rollback: a simple and commonly used way to define failures, although not always the most complete. In this case, teams count as failures any deployments that needed a rollback.
Examples of failures include bugs introduced into production, system downtime, or any event that requires a 'hotfix' or rollback.
Not all issues that occur post-change are classified as failures. For instance, minor bugs that do not affect the system's overall performance or impact the user (e.g., fixing a label) are not considered failures.
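Because every organization draws this line differently, it helps to encode the policy explicitly so it is applied consistently. Below is a minimal Python sketch of such a failure-classification policy; the event fields (source, severity, required_rollback, affected_users) and the thresholds are hypothetical and should be adapted to your own incident and issue tracking data:

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    source: str            # e.g. "pagerduty", "sentry", "jira" (hypothetical labels)
    severity: str          # e.g. "critical", "major", "minor"
    required_rollback: bool
    affected_users: bool

def is_failure(event: ChangeEvent) -> bool:
    """Return True if this event counts as a failure for CFR purposes."""
    if event.required_rollback:                   # rollbacks always count
        return True
    if event.severity in ("critical", "major"):   # major incidents count
        return True
    return event.affected_users                   # minor issues count only if users felt them

events = [
    ChangeEvent("pagerduty", "critical", False, True),
    ChangeEvent("jira", "minor", False, False),   # e.g. a cosmetic label fix
]
print([is_failure(e) for e in events])  # [True, False]
```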
Why is Change Failure Rate Important?
Identify Weak Points in Your Deployment Process
Change failure rate is a great way to determine how often releases lead to problems. A high value indicates that your development or testing process has room for improvement. For example, if 30% of your deployments require rollbacks, it may be time to rethink your testing environments to catch issues earlier.
Reduce Downtime And Improve Reliability
Frequent failures mean more disruptions for users, which can damage customer trust and cost you money. You can ensure smoother releases and more reliable services by tracking and taking steps to reduce change failure rate.
Improve Team Accountability And Feedback Loops
Change failure rate provides concrete data on the quality of deployments. This can motivate teams to write more robust code, improve testing, and communicate better across development and operations.
Make Continuous Improvements
Measuring change failure rate is critical to DevOps practices like continuous improvement. It gives teams a clear metric to work on reducing, which helps them get better with each release. For example, an e-commerce site that gradually reduces its CFR may see fewer cart abandonment issues caused by release-related bugs.
Make Informed Decisions on Future Changes
Monitoring the metric over time can help your team decide when it’s safe to release more often or when to hold back and address issues. For example, a sudden spike in change failure rate to 15% after an architectural revamp may indicate that the new architecture is introducing unforeseen risks or incompatibilities. In this case, it would be prudent to pause deployments, investigate the root causes of the failures, and make necessary adjustments before resuming releases.
How to Calculate Change Failure Rate
If you track the right data, computing the change failure rate is easy: divide the number of deployments that caused failures by the total number of deployments. For example, if your team made 6 deployments today and 2 of them caused issues requiring fixes, the team's CFR is 33%. Only count changes and failures deployed to production, not failures caught in the testing phase, since those were never deployed.
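As a minimal sketch in Python, using the numbers from the example above:

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """CFR = failed production deployments / total production deployments."""
    if total_deployments == 0:
        return 0.0  # no deployments means nothing to attribute failures to
    return failed_deployments / total_deployments

# The example from the text: 6 deployments today, 2 of which caused issues.
print(f"{change_failure_rate(2, 6):.0%}")  # 33%
```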
According to the DORA group's recommendations, you should also observe the three other metrics: deployment frequency, lead time for changes, and time to recover. To achieve better performance and optimize the right things, establish a procedure that defines what a failure is alongside measuring the change failure rate.
What is a Good Change Failure Rate
When the team deploys to production, the work has been integrated into trunk and all tests and CI checks have passed, so the change should work as expected. In reality, many unplanned events can occur, and some changes will result in failures or degraded service. A certain rate of failure is expected, but you need to know what level indicates low team performance.
According to the DevOps Acceleration Report (2022), there are three classifications of CFR: high performers fall between 0-15%, medium performers between 16-30%, and low performers between 46-60%. This classification changes every year. Across the surveyed companies, 11% of teams are high performers, 69% are medium performers, and 19% are low performers. Measuring CFR gives you insight into your engineering teams' development quality and helps you see where changes are needed before problems become serious.
*Classification of Change Failure Rate, according to the DevOps Acceleration Report (2022)*
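If you want to map a measured CFR onto these performance tiers programmatically, a small helper like the one below works. Note that the report's published bands leave a gap between 31% and 45%, so treat the boundaries as approximate rather than official cut-offs:

```python
def performance_tier(cfr: float) -> str:
    """Map a CFR (as a fraction, 0.0-1.0) to the 2022 report's tiers."""
    if cfr <= 0.15:
        return "high performer"
    if cfr <= 0.30:
        return "medium performer"
    return "low performer"  # the report's bands leave a 31-45% gap

print(performance_tier(2 / 6))  # low performer (33% sits above the medium band)
```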
4 Tips to Follow for an Accurate Change Failure Rate
Even though the change failure rate formula is straightforward, there are some caveats to consider to avoid inaccurate or misleading values. Below, we will discuss these caveats, along with some advice on how to calculate accurate values each time:
1- Define What Failure Means and Establish a Formal Process to Track Failures
As discussed above, different teams and organizations can define failure differently. Be clear on what counts as a failure (e.g., bugs, downtime, rollbacks) and ensure everyone on your team is aligned. Use automated tools like Axify to track and log these failures consistently so they don’t slip through the cracks.
2- Exclude “Failures to Deploy” From the Number of Failed Deployments
Not all deployment failures are related to the code changes being released. For example, a deployment job may fail due to infrastructure, network, or configuration errors. These should be tracked separately and not counted as part of your Change Failure Rate.
3- Disregard Failures Due to External Factors
If a failure is caused by something outside your control—such as a third-party service outage or network issue—it does not reflect the quality of your code changes. Exclude these from the CFR, too, to keep the metric focused on your team’s performance.
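As a rough sketch of how tips 2 and 3 might look in practice, the snippet below filters out deploy-pipeline and external failures before computing CFR. The cause labels are hypothetical and depend on how your pipeline annotates failed deployments:

```python
# Hypothetical deployment records; each failure is annotated with a cause.
deployments = [
    {"id": 1, "failed": False, "cause": None},
    {"id": 2, "failed": True, "cause": "code_change"},        # counts toward CFR
    {"id": 3, "failed": True, "cause": "infrastructure"},     # failure to deploy: excluded
    {"id": 4, "failed": True, "cause": "third_party_outage"}, # external factor: excluded
]

EXCLUDED_CAUSES = {"infrastructure", "network", "configuration", "third_party_outage"}

# Policy choice: excluded failures stay in the denominator here, since the
# deployments did happen; some teams drop them from the count entirely.
total = len(deployments)
failures = sum(1 for d in deployments
               if d["failed"] and d["cause"] not in EXCLUDED_CAUSES)
print(f"CFR: {failures / total:.0%}")  # CFR: 25%
```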
4- Regularly Review And Adjust Your Definition of Failure
As your system evolves, so might your definition of what constitutes a failure. Establish a policy to periodically revisit your criteria and keep the change failure rate metric relevant and accurate.
Relation Between Change Failure Rate and MTTR
Failed deployment recovery time (MTTR) is defined as the time the system takes to restore service when an incident occurs. Together with the change failure rate, MTTR measures the quality and stability of your team's delivery process. If the change failure rate is high, MTTR usually tends to be high as well, meaning a longer-than-desirable time is required to restore services to their normal functioning state, which hampers overall software reliability and business productivity.
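To see how MTTR can be computed alongside CFR, here is a minimal sketch; the incident log (pairs of degradation and restoration timestamps) is invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (service degraded at, service restored at).
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 45)),
    (datetime(2024, 5, 2, 14, 0), datetime(2024, 5, 2, 16, 0)),
]

def mean_time_to_recovery(incidents: list) -> timedelta:
    """Average time between service degradation and restoration."""
    downtimes = [restored - degraded for degraded, restored in incidents]
    return sum(downtimes, timedelta()) / len(downtimes)

print(mean_time_to_recovery(incidents))  # 1:22:30
```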
Improving your engineering culture and DevOps practices leads to smaller but more frequent changes. Smaller changes make incidents easier to spot and manage, which reduces the time it takes to fix them and makes it safer to deploy more often. It is less risky for a business to have a team that deploys more changes with fewer failures and can fix a mistake quickly (e.g., in under 30 minutes). A high CFR combined with a high MTTR also means more time spent fixing bugs and service degradation, which is money spent not creating business value.
Cost / Impact of Failure on Businesses
To understand where the deployment process has flaws, CTOs usually keep track of the team's change failure rate. A high CFR can have financial, operational, and product consequences, so it is crucial to measure and manage it. More bugs mean losing customers due to the impact on trust and reputation. Here are the main issues that can arise from a high CFR:
- Decreased productivity – frequent interruptions to fix change failures take a toll on your team's workflow. Frequent context switching delays feature development (and lengthens time to market), and repeated setbacks can demotivate developers and reduce efficiency.
- Increased maintenance costs – rolling back failed changes wastes your team's time and money. Time spent on bugs is an opportunity cost, because that time could have gone into developing features that generate more revenue.
- Lower competitiveness – system downtime and constant fixing of features can frustrate end users and make you a less competitive player in the market.
- Security risks – when new features aren't sufficiently tested (as you try to move faster), your product becomes vulnerable to cyber-attacks and security breaches.
In financial terms, the cost of change failures varies widely depending on the severity and type of the issues, the size and complexity of the system, the size of your organization, the number of customers affected, and the specific industry or sector. That variability makes it difficult to define an acceptable monetary cost for the change failure rate.
The ROI of DevOps Transformation report estimates downtime costs based on different factors: for Fortune 1000 companies, for example, the average hourly cost of an infrastructure failure is $100,000, and the average hourly cost of a critical application failure is around $500,000.
To reduce the change failure rate and its associated financial costs, it is good practice to conduct regular assessments of CFR costs and compare them with similar organizations in your industry. This helps you understand the relative costs and set realistic goals and objectives.
How to Minimize Change Failure Rate
Reducing the change failure rate is crucial to improving your team's performance. To minimize it, teams must embrace best practices such as rigorous automated testing, continuous integration and continuous deployment (CI/CD), diligent code reviews, and efficient system monitoring. The best overall approach is to implement Continuous Delivery, including the following practices:
- Test automation - automation tools to conduct and analyze the tests (finding real failures).
- Deployment automation - fully automated deployments that don’t need manual intervention.
- Trunk-based development - having fewer than three active branches in a code repository, with branches and forks having very short lifetimes.
- Continuous integration - creating canonical builds and packages that are ultimately deployed and released.
- Continuous testing - testing throughout the software delivery lifecycle rather than as a separate phase after the development cycle.
Other best practices to minimize CFR include:
- Implement automated monitoring and testing for bug detection.
- Make small deployments at a frequent pace; you will track failures better and fix them more quickly.
- Identify and address the causes of failed deployments rather than lowering the number of deployments just to reduce failures.
Along with CFR, track associated details such as the duration of the outage or service degradation caused by each failure and the steps needed to restore service. Tracking outage duration helps the team prioritize its efforts and improve its processes, while tracking restoration steps helps the team understand the root causes of failures.
Simple Way to Measure Change Failure Rate
Manual calculations can provide useful insights, but an automated tool like Axify.io can accurately model system behaviour and track change failure rate effortlessly. Axify is a single platform for observing all the key performance indicators that will help you improve your development and delivery processes. It offers rich dashboards and real-time tracking of DORA metrics, simplifying the whole process and empowering teams to concentrate on making improvements. With no manual calculation, there is less room for human error. Note that with Axify, each team calculates its own CFR; values are not normalized across teams.
Some of the key Axify features regarding the DORA metrics include:
- Deployment frequency – measures how often an organization successfully deploys to production. Deploying frequently makes it easier to test, gather feedback, and roll back problems, in addition to increasing perceived value for your customers.
- Lead time for changes – the time it takes from the first commit to successfully running code in production. This metric lets you assess the efficiency of software development cycles and initiatives, and improving it tends to drive organizational, human, and technical changes.
- Reliability – this newer metric was introduced to address the importance of operational excellence in a high-performing software organization. It tells you how well you meet your users' expectations, such as availability and performance. It has no defined high, medium, or low clustering, as the way teams measure reliability can vary widely depending on their service-level indicators and objectives (SLIs/SLOs). Instead, teams are asked to rate their ability to meet their own reliability targets.
- Custom metrics, such as issue tracking, ongoing pull requests, service-level expectations (workflow predictability), throughput of issues per sprint, Git repository activity, team morale, etc.
Keeping your finger on the pulse of the Change Failure Rate can yield revealing insights about overall system health. Diligent efforts to lower this rate dramatically improve stability, fostering a streamlined operation and a satisfied team: a win-win for everyone involved.
To find out more, read our article: Software development metrics: to rely on your projections with confidence