DORA Metrics
15 minutes reading time

Failed Deployment Recovery Time (Time to Restore Service or MTTR)

In today’s fast-moving environment, it is crucial to detect failures early and restore the system as quickly as possible. Engineering teams have adopted a range of practices to make recovery faster and more predictable. For software teams, time spent fixing bugs and outages carries an opportunity cost: it is time not spent delivering new value.

In this article, we elaborate on the Failed Deployment Recovery Time metric, one of the key DORA metrics.

A quick note

In software development, the terms Mean Time to Repair (MTTR), Mean Time to Restore, and Mean Time to Recovery are often used interchangeably with Time to Restore Service. However, they are calculated differently, and their interpretation may vary from one team to another. Their meaning can also change depending on the industry, the type of system, or the service being measured, so define each term explicitly to avoid confusion within a team or organization.

In the context of DORA metrics, you might have seen Time to Restore Service or MTTR used instead of Failed Deployment Recovery Time. With each new State of DevOps report, DORA refines the definition of these metrics, and their names change accordingly. The latest name, Failed Deployment Recovery Time, focuses specifically on deployments that fail in production. It aims to balance the speed of deployments with their quality.

DORA Metrics - Quick Introduction

DORA (DevOps Research and Assessment) is a research program, now part of Google, that builds on years of State of DevOps survey data. In 2018, its researchers published the book Accelerate, which presents the DORA metrics: four key metrics that distinguish high-performing development teams from average ones:

  • Deployment Frequency (DF) – measures how often code is deployed successfully to a production environment. Engineering teams aim to deliver new features to clients as soon as possible, so this metric shows how frequently that happens.
  • Lead Time for Changes (LTFC) – shows how long it takes for a change to reach production, measured as the average time between the first commit in the dev environment and the moment that change is running successfully in production.
  • Failed Deployment Recovery Time – measures the time the system needs to recover from a deployment that fails in production. To improve Failed Deployment Recovery Time (also known as Time to Restore Service), DevOps teams should continuously monitor the production environment.
  • Change Failure Rate (CFR) – measures the percentage of deployments that cause a failure in production, calculated by dividing the number of failed deployments by the total number of deployments.
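
As a rough illustration of the Change Failure Rate calculation described in the list above, here is a minimal Python sketch. The function name and the sample figures (3 failed deployments out of 40) are made up for the example and do not come from any real dataset.

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Return the percentage of deployments that caused a production failure."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    return 100.0 * failed_deployments / total_deployments


# Hypothetical figures: 3 failed deployments out of 40 in the observed period.
cfr = change_failure_rate(3, 40)
print(f"CFR: {cfr:.1f}%")  # CFR: 7.5%
```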

DORA metrics give team leaders reliable insight into team performance, which they can then use to improve internal processes. By looking at CFR and Failed Deployment Recovery Time, leaders can check that their code is robust and stable and that failures are decreasing. Monitoring DF and LTFC, on the other hand, shows whether the team is working at a good pace. Combined, DORA metrics provide crucial information about both the quality and the speed of a team's delivery.

What is Failed Deployment Recovery Time - Definition

As previously stated, Failed Deployment Recovery Time is a DORA metric that measures the time required to fix the software after a deployment fails in production. In other words, it assesses how quickly your team can identify and resolve issues that affect your software systems. It is an important metric to track because it helps you find weak points in your software delivery pipeline and make the pipeline more efficient.

Failed Deployment Recovery Time should be measured and reported regularly (for example daily, weekly, or monthly) to track how incident response times change over time. It covers the time needed to notify engineers, diagnose the issue, fix the problem, set up and test the system, and deploy the fix to production. An increase in Failed Deployment Recovery Time may indicate problems with the incident response process, with system reliability, or with code review processes.
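
To illustrate regular reporting, the sketch below groups recovery times by ISO week so that a trend becomes visible. It is only one possible approach: the incident log, its timestamps, and the function name weekly_recovery_time are hypothetical, and a real setup would read these values from an incident tracker or deployment tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical incident log: (failure started at, service restored at).
incidents = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 9, 40)),      # 40 min
    (datetime(2024, 3, 6, 14, 0), datetime(2024, 3, 6, 14, 20)),    # 20 min
    (datetime(2024, 3, 12, 11, 0), datetime(2024, 3, 12, 12, 30)),  # 90 min
]


def weekly_recovery_time(incidents):
    """Average recovery time per ISO week, keyed by (year, week number)."""
    by_week = defaultdict(list)
    for started, restored in incidents:
        iso = started.isocalendar()
        by_week[(iso[0], iso[1])].append(restored - started)
    return {
        week: sum(durations, timedelta()) / len(durations)
        for week, durations in sorted(by_week.items())
    }


for (year, week), average in weekly_recovery_time(incidents).items():
    print(f"{year}-W{week:02d}: {average}")
```

In this made-up data the weekly average rises from 30 to 90 minutes, which is the kind of change that would prompt a closer look at the incident response process.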

How to Calculate Failed Deployment Recovery Time?

The formula to calculate Failed Deployment Recovery Time is:

Failed Deployment Recovery Time = Total Recovery Time ÷ Number of Failed Deployments

So, if your system was down for a total of 1 hour across two separate incidents in a 24-hour period, 60 minutes divided by two incidents gives a Failed Deployment Recovery Time of 30 minutes.

To calculate Failed Deployment Recovery Time, follow these steps:

  • Record the time when the failed deployment occurred, i.e. the time when the system became unavailable.
  • Record the time of incident resolution, i.e. the time the system was restored to its normal operating state.
  • Calculate the downtime by subtracting the incident start time from the time of recovery.
  • Count the number of failed deployments that occurred during the observed period.
  • Add up the downtime of all incidents and divide the total by the number of incidents to get the average downtime per incident.
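
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name and the two sample incidents (45 and 15 minutes, one hour of downtime in total) are assumptions chosen to mirror the worked example.

```python
from datetime import datetime, timedelta


def failed_deployment_recovery_time(incidents):
    """Average recovery time over (failed_at, restored_at) timestamp pairs."""
    if not incidents:
        raise ValueError("no failed deployments in the observed period")
    # Downtime per incident = time of recovery minus incident start time.
    downtimes = [restored_at - failed_at for failed_at, restored_at in incidents]
    if any(d < timedelta(0) for d in downtimes):
        raise ValueError("an incident was restored before it started")
    # Total recovery time divided by the number of failed deployments.
    return sum(downtimes, timedelta()) / len(downtimes)


# Hypothetical data: two incidents in one day, 60 minutes of downtime in total.
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 45)),  # 45 min
    (datetime(2024, 5, 1, 18, 0), datetime(2024, 5, 1, 18, 15)),  # 15 min
]
print(failed_deployment_recovery_time(incidents))  # 0:30:00
```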

Failed Deployment Recovery Time Performance Benchmarks

According to DORA's benchmarks, Failed Deployment Recovery Time is less than one day for high performers and between one week and one month for low performers. Several factors contribute to this large gap, including the size of the ticket backlog, the number of users (which affects ticket handling time), and the complexity of tickets. More complex tickets take longer to resolve and therefore raise Failed Deployment Recovery Time.

Several supporting indicators make it easier to track Failed Deployment Recovery Time:

  • Average time to acknowledge an incident event – shows how promptly the organization can detect an incident.
  • Average time to resolve the incident – how fast the incident can be resolved and the company can proceed with normal operation.
  • Incident resolution rate – percentage of incidents that are successfully resolved in a given time period.
  • Incident escalation rate – how often incidents need to be escalated to higher-level management or assistance.
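
The four indicators above could be computed from a simple incident log along the following lines. This is a sketch under stated assumptions: the Incident record, its field names, and the sample data are hypothetical, and both averages are measured from the moment the incident occurred.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    occurred_at: datetime
    acknowledged_at: datetime
    resolved_at: Optional[datetime]  # None while the incident is still open
    escalated: bool


def incident_indicators(incidents):
    """Compute the four supporting indicators for a list of incidents."""
    if not incidents:
        raise ValueError("no incidents in the observed period")
    total = len(incidents)
    resolved = [i for i in incidents if i.resolved_at is not None]
    ack_total = sum((i.acknowledged_at - i.occurred_at for i in incidents), timedelta())
    res_total = sum((i.resolved_at - i.occurred_at for i in resolved), timedelta())
    return {
        "avg_time_to_acknowledge": ack_total / total,
        "avg_time_to_resolve": res_total / len(resolved) if resolved else None,
        "resolution_rate_pct": 100.0 * len(resolved) / total,
        "escalation_rate_pct": 100.0 * sum(i.escalated for i in incidents) / total,
    }


# Hypothetical log: four incidents, one still open, one escalated.
t0 = datetime(2024, 6, 3, 9, 0)
log = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=30), False),
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=60), True),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=90), False),
    Incident(t0, t0 + timedelta(minutes=10), None, False),
]
for name, value in incident_indicators(log).items():
    print(f"{name}: {value}")
```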

The Limitations of Failed Deployment Recovery Time

Failed Deployment Recovery Time is a useful metric for assessing incident response times and finding ways to improve the process, but it has some limitations:

  • Depending on how it is recorded, it may not capture the downtime before repair, i.e. the time between incident occurrence and incident detection. If the clock only starts when the incident is detected, this gap goes unmeasured, and it can be significant in some cases.
  • Failed Deployment Recovery Time can be heavily skewed by outliers, e.g. rare, complex incidents that take a long time to resolve. Such incidents inflate the average and make it difficult to assess typical incident response performance.
  • It doesn’t consider the severity of incidents. As a result, it may not give you a complete picture of how effective the incident response process is, or of the impact incidents have on users.
  • It doesn’t measure prevention (proactive maintenance) activities that stop incidents from occurring.
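
To see how much a single outlier can distort the average, compare the mean with the median in the short sketch below. The recovery times are invented for the example, and reporting the median alongside the mean is a suggestion for illustration only, not part of the DORA definition.

```python
from statistics import mean, median

# Hypothetical recovery times in minutes; 720 is a rare 12-hour outage.
recovery_minutes = [20, 25, 30, 35, 40, 720]


def summarize(durations):
    """Return the mean and the median of a list of recovery times."""
    return {"mean": mean(durations), "median": median(durations)}


summary = summarize(recovery_minutes)
print(f"mean:   {summary['mean']:.1f} min")    # mean:   145.0 min
print(f"median: {summary['median']:.1f} min")  # median: 32.5 min
```

Five of the six incidents were resolved in 40 minutes or less, yet the mean suggests a typical recovery of well over two hours.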

The Causes of a Poor Failed Deployment Recovery Time

Your code may be robust and well tested, and your change failure rate may be good, yet you can still end up with low operational performance. If your application breaks and your team doesn’t have a good process to detect the issue, fix it, and deploy the fix quickly, your Failed Deployment Recovery Time (operational performance) will be poor. A poor Failed Deployment Recovery Time can have several causes:

  • No proper tools for monitoring and observability – The clock for Failed Deployment Recovery Time starts the moment your system becomes unavailable, not when you notice it. Slow problem detection results in slow recovery, and detection is difficult for organizations that don’t use monitoring and observability tools. To avoid end-user dissatisfaction and do its job well, a DevOps team needs uptime monitoring, helpdesk tools, testing and alerting tools, etc.
  • Clumsy and slow deployment processes – Even if your team quickly detects an incident and creates a fix for it, your Failed Deployment Recovery Time will suffer if deployment is manual or poorly automated. A deployment process with steps that require human intervention hurts both deployment frequency and Failed Deployment Recovery Time, so aim for a smooth deployment process that is automated wherever possible.
  • No plan for incident management – When an incident occurs, the DevOps team feels the combined stress of a system outage, frustrated end-users, and disappointed business owners. Once an issue is detected and acknowledged, there should already be a designated owner and a procedure for resolving it. If your team has no plan for handling an incident, the time needed to draft one adds to your Failed Deployment Recovery Time.

How to Improve Failed Deployment Recovery Time?

To maintain a low Failed Deployment Recovery Time, consider the following best practices:

  • Adopt an effective incident handling procedure that includes clear roles, responsibilities, and escalation steps. Ensure that all team members are trained on the process and that it is regularly reviewed and updated.
  • Set up effective monitoring and alerting for your software systems. This will help you detect issues quickly and proactively before they impact users.
  • Use automated testing. A test suite that covers your code helps you spot incidents sooner and narrows down where the problem is when you diagnose a bug.
  • Make smaller changes. With smaller changes, it is easier to pinpoint which change since the last working deployment caused the incident.
  • Regularly review and analyze your incident data to identify areas for improvement. Implement changes to optimize the incident response process and reduce Failed Deployment Recovery Time over time.
  • Perform deep root cause analysis for all incidents to identify the underlying causes and address them to prevent future incidents.
  • Apply automation to streamline incident response processes, such as automated alerts, diagnostics, and fixes. This reduces the time required to resolve incidents.
  • Conduct regular tests to identify and resolve issues before they impact users. This helps prevent incidents and reduces Failed Deployment Recovery Time.
  • Enable effective communication among your team members during incident response. It ensures that issues are resolved quickly and prevents delays due to miscommunication.

A Simple Way to Analyze Failed Deployment Recovery Time

Manual calculations can provide useful insights, but an automated tool like Axify.io can model system behaviour and track Failed Deployment Recovery Time with far less effort. Axify is a single platform for observing the key performance indicators that help you improve your development and delivery processes. Its dashboards track DORA metrics in real time, which simplifies the whole process and lets teams concentrate on making improvements.

Axify implements all four DORA metrics:

  • Failed Deployment Recovery Time — measures the time needed for the system to recover from a failed deployment in production.
  • Deployment Frequency — measures how often an organization successfully deploys to production. This important metric makes it easier to test, deploy, provide feedback, and roll back problems, in addition to increasing perceived value for your customers.
  • Lead Time for Changes — time it takes from the first commit to successfully executed code in production. This metric allows us to assess the efficiency of software development cycles and initiatives. It tends to lead to organizational, human and technical changes.
  • Change Failure Rate — measures the percent of deployments causing a failure in production, and is calculated by dividing the number of failures by the total number of deployments.

Keeping an eye on Failed Deployment Recovery Time can reveal a lot about overall system health. By working diligently to lower this time, you will improve system stability, which makes for smoother operations and a more satisfied team.

To find out more, read our article: Understanding DORA metrics: your complete guide