Change Failure Rate Explained (DORA metric)

 

In today’s dynamic environment, technology companies struggle to evaluate the performance of their development teams. Various studies have sought objective criteria that help engineering teams become more agile and efficient in the software development process.

Over the past 10 years, more than 30,000 IT professionals worldwide have taken part in the global State of DevOps surveys run by the DORA (DevOps Research and Assessment) group, which is now part of Google. In 2018, members of the group published the book Accelerate, which presents the DORA metrics: four key metrics that distinguish high-performing development teams from average ones. The metrics are listed below. This article explains one of them, Change Failure Rate (CFR), which measures the percentage of changes to production that result in degraded service (or incidents). We clarify the role of CFR in tech companies and why it needs to be identified and managed properly.

Quick introduction of DORA Metrics

Development teams ship changes that are deployed to production or released to users. Some of these changes cause incidents (e.g. a service outage or the need for a hotfix), and the Change Failure Rate is the percentage of changes that result in such service degradation.

Along with CFR, the DORA group defines three more metrics (together called the DORA metrics):

  • Deployment Frequency (DF) – measures the frequency at which code is deployed successfully to a production environment. Engineering teams want to deliver new features to clients as soon as possible, so this metric is useful to understand how frequently that happens. 
  • Lead Time For Change (LTFC) – this metric shows how long it takes to bring a new change to a production environment by looking at the elapsed time between the first commit and when that code is successfully running in production. 
  • Mean Time to Recovery (MTTR) – measures the time needed for the system to restore service when a service incident or a defect that impacts users occurs. 

DORA metrics give team leaders objective insight into team performance, which they can use to improve their development processes. CFR and MTTR in particular show whether your delivery is robust and stable, and where failures can be reduced.

You need to work on speed AND stability at the same time. Improving only one of them is not enough: going faster on its own will likely increase incidents, while taking more time to raise quality sacrifices speed and leaves you without a fast enough pace.

To learn more, check our DORA metrics complete guide here.

What is Change Failure Rate - definition

Change failure rate measures the percentage of deployments that cause a failure in production and need to be fixed or rolled back after they are deployed. In other words, it tells you how many of the changes deployed to production resulted in an incident. A high change failure rate can indicate, for example, that your team did not pay enough attention to testing or that its deployment process is not efficient enough. CFR is therefore a measure of the stability and quality of your team's deployments.

What is a Failure?

Different organizations have different criteria for what counts as a failure in their operations, since there is no universal definition. Each organization therefore needs to define the term at a high level, so that all teams recognize and handle failures consistently when they occur. Here are some cases that engineering teams treat as failures:

  • Incidents captured by an incident management tool (e.g. PagerDuty, Zenduty). Teams usually need high uptime for their product or service, so such a tool can be a good way to track failures.
  • System incidents (captured in Datadog, AWS CloudWatch, New Relic, etc.): basically, your system is degraded or down.
  • Errors produced by the application (captured in an error tracker such as Sentry).
  • Bugs that affect users, reported in an issue tracker (like Jira).
  • Incident severity: some teams classify incidents by level and only count deployments that caused major outages, such as when the application is not accessible to clients.
  • Need for rollback: this is a simple and commonly used way to define failures, although not always the most complete. In this case, teams count as failures the deployments that needed a rollback.

Examples of failures include bugs introduced into production, system downtime, or any event that requires a 'hotfix' or rollback.

Not all issues that occur after a change are classified as failures. For instance, minor bugs that do not affect the system's overall performance or impact the user (e.g. a label that needs fixing) are not considered failures.
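One way to make such a definition explicit is to write it down as a small rule that every team applies in the same way. The sketch below is only an illustration: the `Deployment` record, its field names and the severity threshold are hypothetical, and your own criteria may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment (illustrative fields)."""
    deploy_id: str
    rolled_back: bool = False
    needed_hotfix: bool = False
    # Assumed scale: 1 is the most severe incident, None means no incident.
    incident_severity: Optional[int] = None


def is_failure(deployment: Deployment, max_severity: int = 2) -> bool:
    """Return True if the deployment counts as a failure under this example rule.

    Rollbacks and hotfixes always count. Incidents count only when they are
    at least as severe as max_severity, so minor issues are ignored.
    """
    if deployment.rolled_back or deployment.needed_hotfix:
        return True
    severity = deployment.incident_severity
    return severity is not None and severity <= max_severity
```

With this rule, a deployment that was rolled back counts as a failure, while one that only triggered a low-severity incident (for example severity 4) does not.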

How to calculate Change Failure Rate

If you track the right data, computing the change failure rate is easy: divide the number of deployments that caused failures by the total number of deployments. For example, if your team made 6 deployments today and 2 of them caused issues that required fixes, the team’s CFR is 33%. The calculation only considers changes deployed to production; failures that appeared in the testing phase are excluded, as they were never deployed.
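The calculation described above can be sketched in a few lines of Python. The function name and its argument checks are our own illustrative choices, not part of any DORA tooling.

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Return the change failure rate as a percentage.

    Only production deployments should be counted in either argument.
    """
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return 100.0 * failed_deployments / total_deployments


# The example from the text: 6 deployments, 2 of which caused issues.
print(f"{change_failure_rate(2, 6):.1f}%")  # prints "33.3%"
```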

The DORA group recommends tracking the three other metrics as well: deployment frequency, lead time for changes, and time to recover. To optimize the right things, pair your CFR measurement with an agreed procedure that defines what a failure is.

Figure: CFR formula. CFR (%) = (number of deployments that caused a failure / total number of deployments) × 100

What is a good Change Failure Rate

When a team deploys to production, the work has been integrated into trunk, all tests and CI checks have passed, and the change should work as expected. In reality, many unplanned things can happen, and some changes will result in failure or degraded service. A certain rate of failure is expected, but you need to know what rate indicates low team performance.

According to the Accelerate State of DevOps Report (2022), there are three CFR classifications: high performers are at 0-15%, medium performers at 16-30%, and low performers at 46-60%. These bands change from year to year. Across companies, 11% of teams are high performers, 69% are medium performers, and 19% are low performers. Measuring CFR gives you insight into your engineering teams’ development quality and helps you see where changes are needed before problems become serious.

Figure: Classification of Change Failure Rate, according to the Accelerate State of DevOps Report (2022)
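As a rough illustration, the 2022 bands quoted above can be turned into a small lookup. Note that those bands leave the range between 30% and 46% unassigned; the sketch below returns "unclassified" for any value outside the published bands rather than guessing, and that label, like the function name, is our own assumption.

```python
def classify_cfr(cfr_percent: float) -> str:
    """Map a CFR percentage to the 2022 performance bands quoted in the article.

    Values that fall outside the published bands (for example 40%, or a
    fractional value such as 15.5%) are reported as "unclassified".
    """
    if not 0 <= cfr_percent <= 100:
        raise ValueError("cfr_percent must be between 0 and 100")
    if cfr_percent <= 15:
        return "high"
    if 16 <= cfr_percent <= 30:
        return "medium"
    if 46 <= cfr_percent <= 60:
        return "low"
    return "unclassified"
```

Because the bands change every year, it is worth keeping the thresholds in one place like this so they are easy to update.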

Relation between Change Failure Rate and MTTR

Mean time to recovery (MTTR) is the time the system takes to restore service when an incident occurs. Together with the change failure rate, MTTR measures the quality and stability of your team’s delivery process. When the change failure rate is high, MTTR often tends to be high as well: restoring services to normal takes longer than desirable, which hurts overall software reliability and business productivity.

An improved engineering culture and DevOps practices lead to smaller, more frequent changes. With smaller changes, incidents are easier to spot and manage, which reduces the time it takes to fix them and makes it safer to deploy more often. A team that can deploy more changes with fewer failures and fix a mistake quickly (e.g. in less than 30 minutes) is less risky for the business. Conversely, a high CFR and a high MTTR mean more time spent fixing bugs and service degradation, which is money not spent on creating business value.

Figure: More value cycle

Cost / impact of Failure on Businesses

To understand where the deployment process has flaws, CTOs usually keep track of the team's change failure rate. A high CFR can have financial, operational and product consequences, so it is crucial to identify and manage it. More bugs also mean lost customers, because trust and reputation suffer. Here are the main issues that can arise from a high CFR:

  • Decreased productivity – frequent interruptions to fix change failures take a toll on your team’s workflow. Frequent context switching delays feature development (and lengthens time to market). Repeated setbacks can also demotivate developers and reduce efficiency.
  • Increased maintenance costs – rolling back failed changes wastes your team's time and money. Time spent on bugs is also an opportunity cost, because it would otherwise have gone into developing features that generate more revenue.
  • Lower competitiveness – system downtime and constant fixing of features can frustrate end users and make you a less competitive player in the market.
  • Security risks – when new features are not tested sufficiently (because you are trying to go faster), your product becomes more vulnerable to cyber-attacks and security breaches.

In financial terms, the cost of change failures varies widely with the severity and type of the failures, the size and complexity of the system, the size of your organization, the number of customers affected, and the industry or sector. This makes it difficult to define an acceptable monetary cost for a given change failure rate.

The ROI of DevOps Transformation gives downtime costs based on different factors: for Fortune 1000 companies, for example, the average hourly cost of an infrastructure failure is $100,000, and the average hourly cost of a critical application failure is around $500,000.

To reduce the change failure rate and its associated financial costs, it is good practice to assess the costs of change failures regularly and compare them with those of similar organizations in your industry. This shows you the relative costs and lets you set realistic goals and objectives.

How to minimize Change Failure Rate

To reduce CFR, teams should embrace best practices such as rigorous automated testing, continuous integration and continuous deployment (CI/CD), diligent code reviews, and efficient system monitoring. Reducing the change failure rate is crucial to improving your team's performance, and the best approach is to implement Continuous Delivery, including the following practices:

  • Test automation - using automation tools to run the tests and analyze the results (so that real failures are found).
  • Deployment automation - fully automated deployments that don’t need manual intervention.
  • Trunk-based development - having fewer than three active branches in a code repository; with branches and forks having very short lifetimes.
  • Continuous integration - creating canonical builds and packages that are ultimately deployed and released.
  • Continuous testing - testing throughout the software delivery lifecycle rather than as a separate phase after the development cycle.

Other best practices to minimize CFR include:

  • For bug detection, implement automated monitoring and testing.
  • Make small deployments at a frequent pace: you will track failures better and fix them more quickly.
  • It is wiser to identify and address the causes of the failed deployments, rather than lowering the number of deployments to reduce the failures.

Along with CFR, you need to track associated details such as the duration of the outage or service degradation caused by a failure, and the steps needed to restore the service. Tracking outage duration helps the team prioritize its efforts and improve its processes, while tracking the restoration steps helps the team understand the root cause of the failures.
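If you record when each incident started and when service was restored, the outage durations and the resulting mean time to recovery can be derived directly. The sketch below assumes incidents are stored as simple (started, restored) timestamp pairs; that structure, the sample data and the function name are illustrative only.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return the average outage duration as a timedelta.

    incidents: list of (started_at, restored_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [restored - started for started, restored in incidents]
    if any(duration < timedelta(0) for duration in durations):
        raise ValueError("an incident cannot be restored before it started")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data: one 20-minute and one 40-minute outage.
incidents = [
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 20)),
    (datetime(2024, 1, 9, 15, 0), datetime(2024, 1, 9, 15, 40)),
]
print(mean_time_to_recovery(incidents))  # prints "0:30:00"
```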

Simple way to measure Change Failure Rate

Manual calculations can provide useful insights, but an automated tool like Axify.io can model system behaviour and track change failure rate for you. Axify is a single platform for observing the key performance indicators that help you improve your development and delivery processes. Its dashboards track DORA metrics continuously in real time, which simplifies the whole process and lets teams concentrate on making improvements. Because nothing is calculated by hand, there is less room for human error. With Axify, each team's CFR is calculated separately and is not normalized across teams.

Some of the key Axify features regarding the DORA metrics include:

  • Deployment frequency - measures how often an organization successfully deploys to production. This important metric makes it easier to test, deploy, provide feedback, and roll back problems, in addition to increasing perceived value for your customers.
  • Lead Time for changes – time it takes from the first commit to successfully executed code in production. This metric allows us to assess the efficiency of software development cycles and initiatives. It tends to lead to organizational, human and technical changes.
  • Reliability - this new metric was introduced to address the importance of operational excellence to a high-performing software organization. This metric tells you how well you meet your user’s expectations, such as availability and performance. It doesn’t have a defined high, medium, or low clustering, as the way teams measure reliability can vary widely depending on the service-level indicators or service-level objectives (SLI/SLO). Instead, teams are asked to rate their ability to meet their own reliability targets.
  • Custom metrics, like: issue tracking, ongoing pull requests, service-level expectations (workflow predictability), throughput of issues per sprint, Git repository, team morale etc.

Keeping your finger on the pulse of the Change Failure Rate can yield revealing insights about overall system health. Diligent efforts to lower this rate improve stability and foster a streamlined operation and a satisfied team, a win-win for everyone involved.
