In today’s dynamic environment, it is crucial to detect failures early and to recover the system as quickly as possible. Various practices have been introduced to make engineering teams more efficient and agile in the system recovery process. For software teams, time spent on bugs and outages is an opportunity cost against delivering new value.
In this article, we elaborate on the Time to Restore Service metric, one of the key DORA metrics.
A quick note
In software development, the terms Mean Time to Repair (MTTR), Mean Time to Restore, and Mean Time to Recovery are often used interchangeably with Time to Restore Service. However, their calculations differ, and their interpretation may vary from one team to another. The meaning of these terms can also change depending on the industry, type of system, or service being measured. Defining each term explicitly is crucial to avoid confusion within a team or organization.
DORA Metrics - Quick Introduction
DORA (DevOps Research and Assessment) is a research group founded in 2015 and later acquired by Google, known for its annual State of DevOps surveys. In 2018, its founders published the book Accelerate, introducing the DORA metrics: four key metrics that distinguish high-performing development teams from average ones:
- Deployment Frequency (DF) – measures the frequency at which code is deployed successfully to a production environment. Engineering teams tend to deliver new features to clients as soon as possible, so this metric is useful to understand how frequently that happens.
- Lead Time for Changes (LTFC) – this metric shows how long it takes for a change to appear in the production environment by looking at the average time between the first commit made in the dev environment and when that feature is successfully running in production.
- Time to Restore Service – measures the time needed for the system to recover from an incident in production. Improving this metric requires continuous monitoring of the production environment.
- Change Failure Rate (CFR) – measures the percent of deployments causing a failure in production, and is calculated by dividing the number of failures by the total number of deployments.
DORA metrics give team leaders reliable insights: they can analyze the metrics to assess team performance and then improve their internal processes. By looking at CFR and Time to Restore Service, leaders can verify that their code is robust and stable while reducing failures. On the other hand, monitoring DF and LTFC ensures that the team is working at a good pace. Combined, DORA metrics provide crucial information about both team quality and speed.
What is Time to Restore Service - Definition
As previously stated, Time to Restore Service is a DORA metric that measures the time required to fix software after an incident in production. In other words, it assesses how quickly your team can identify and resolve issues that affect your software systems. It is an important metric to track, since it helps you identify flaws in your software delivery pipeline and improve its efficiency.
Time to Restore Service should be measured and reported regularly, e.g. daily, weekly, or monthly, to track changes in incident response times over time. It covers the time to notify engineers, diagnose the issue, fix the problem, set up and test the system, and deploy the fix to production. An increasing Time to Restore Service may indicate problems with the incident response process or with system reliability.
How to Calculate Time to Restore Service?
The formula to calculate Time to Restore Service is:
Time to Restore Service = Total downtime for all incidents / Number of incidents
So, if your system was down for a total of 1 hour across two separate incidents in a 24-hour period, 60 minutes divided by two is 30, so your Time to Restore Service is 30 minutes.
To calculate Time to Restore Service, you should undertake these steps:
- Record the time when the incident occurred, i.e. the time when the system became unavailable.
- Record the time of incident resolution, i.e. the time the system was restored to its normal operating state.
- Calculate the downtime by subtracting the incident start time from the time of recovery.
- Count the number of incidents that occurred during the observed period.
- To get the average downtime per incident - compute the total downtime for all incidents and divide by the number of incidents.
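The steps above can be sketched in a few lines of Python. The incident timestamps here are made-up illustrative data, chosen to reproduce the worked example of two incidents totaling one hour of downtime:

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (became unavailable, restored) timestamps.
incidents = [
    (datetime(2024, 5, 1, 9, 0),  datetime(2024, 5, 1, 9, 40)),   # 40 min outage
    (datetime(2024, 5, 1, 18, 0), datetime(2024, 5, 1, 18, 20)),  # 20 min outage
]

# Step 3: downtime per incident = recovery time minus start time.
downtimes = [resolved - start for start, resolved in incidents]

# Steps 4-5: total downtime divided by the number of incidents.
time_to_restore = sum(downtimes, timedelta()) / len(incidents)
print(time_to_restore)  # 0:30:00
```

In practice these timestamps would come from your incident tracker rather than being hard-coded.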
Time to Restore Service Performance Benchmarks
DORA, as a global benchmarking group, estimated that the average Time to Restore Service is less than one day for high performers and between one week and one month for low performers. This large gap is caused by several factors, including the ticket backlog, the number of users (which impacts ticket handling time), and the complexity of the tickets. More complex tickets result in longer resolution times and a higher Time to Restore Service.
Different performance indicators are defined to facilitate the tracking of Time to Restore Service values:
- Average time to acknowledge an incident event – shows how promptly the organization can detect an incident.
- Average time to resolve the incident – how fast the incident can be resolved and the company can proceed with normal operation.
- Incident resolution rate – percentage of incidents that are successfully resolved in a given time period.
- Incident escalation rate - how often the incidents need to be escalated to higher-level management or assistance.
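These four indicators can all be derived from a basic incident log. A minimal sketch, assuming hypothetical record fields (`occurred`, `acknowledged`, `resolved`, `escalated`) that your own tracker may name differently:

```python
from datetime import datetime, timedelta

# Hypothetical incident log; field names are illustrative, not a specific tool's schema.
incidents = [
    {"occurred": datetime(2024, 5, 1, 9, 0),
     "acknowledged": datetime(2024, 5, 1, 9, 5),
     "resolved": datetime(2024, 5, 1, 9, 40),
     "escalated": False},
    {"occurred": datetime(2024, 5, 2, 14, 0),
     "acknowledged": datetime(2024, 5, 2, 14, 10),
     "resolved": None,          # still open at the end of the period
     "escalated": True},
]

n = len(incidents)
resolved = [i for i in incidents if i["resolved"] is not None]

# Average time to acknowledge: how promptly incidents are detected.
avg_ack = sum((i["acknowledged"] - i["occurred"] for i in incidents), timedelta()) / n

# Average time to resolve, computed over resolved incidents only.
avg_resolve = sum((i["resolved"] - i["occurred"] for i in resolved), timedelta()) / len(resolved)

# Incident resolution rate: share of incidents resolved in the period.
resolution_rate = len(resolved) / n

# Incident escalation rate: share of incidents that needed escalation.
escalation_rate = sum(i["escalated"] for i in incidents) / n
```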
The Limitations of Time to Restore Service
Time to Restore Service is a useful metric to determine incident response times and identify ways for process improvement, but it still has some limitations:
- It doesn’t consider the downtime before repair – i.e. the time between incident occurrence and incident detection, which can be significant in some cases.
- Time to Restore Service can be heavily skewed by outliers, e.g. rare, complex incidents that require significant time to resolve. Such incidents inflate Time to Restore Service, making it difficult to assess typical incident response.
- It doesn’t consider the severity of incidents. As a result, it may not provide you with a complete picture of the effectiveness of the incident response process, or the impact of incidents on users.
- It doesn’t measure proactive maintenance activities that prevent incidents from occurring.
The Causes of a Poor Time to Restore Service
Your code may be robust and well tested, and you may have a good change failure rate, but you can still end up with poor operational performance. If your application breaks and your team doesn’t have a good process to detect the issue, fix it, and deploy the fix quickly, then your Time to Restore Service (operational performance) will be poor. A poor Time to Restore Service can have multiple causes:
- No proper tools for monitoring and observability – The Time to Restore Service clock starts the moment your system becomes unavailable, not when you notice it. Slow problem detection results in slow recovery, and detection is difficult for organizations that don’t use monitoring and observability tools. To avoid end-user dissatisfaction and complete their tasks successfully, the DevOps team needs an uptime monitor, helpdesk tools, testing and alerting tools, and so on.
- Clumsy and slow deployment processes – Even if your team quickly detects an incident and creates a fix, poor deployment automation or a manual deployment process will hurt your Time to Restore Service. A manual deployment process with steps that require human intervention negatively impacts both deployment frequency and Time to Restore Service. So aim for a smooth and, ideally, fully automated deployment process.
- No plan for incident management – When an incident occurs, the DevOps team faces the combined stress of a system outage, frustrated end users, and disappointed business owners. Once an issue is detected and acknowledged, you must have a responsible person allocated and a procedure to resolve the situation. If your team doesn’t have a plan for handling an incident, the time needed to draft one adds directly to your Time to Restore Service.
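The first cause can be made concrete: monitoring means the downtime clock starts at the first failed health check, not at human discovery. A minimal illustration (the sample format and function name are assumptions, not any particular tool's API):

```python
def detect_outages(samples):
    """Given chronological (timestamp, is_up) health-check samples,
    return (down_since, recovered_at) pairs for each detected outage.
    The outage clock starts at the first failed check, not at discovery."""
    outages = []
    down_since = None
    for ts, up in samples:
        if not up and down_since is None:
            down_since = ts                    # downtime starts here
        elif up and down_since is not None:
            outages.append((down_since, ts))   # system recovered
            down_since = None
    return outages

# One check every 60 seconds; the system fails at t=60 and recovers by t=180.
samples = [(0, True), (60, False), (120, False), (180, True)]
print(detect_outages(samples))  # [(60, 180)]
```

Real monitoring tools work on the same principle, with alerting hooked into the transition to the down state.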
How to improve Time to Restore Service?
To maintain a low Time to Restore Service, consider the following best practices:
- Adopt an effective incident handling procedure that includes clear roles, responsibilities, and escalation steps. Ensure that all team members are trained on the process and that it is regularly reviewed and updated.
- Set up effective monitoring and alerting for your software systems. This will help you detect issues quickly and proactively before they impact users.
- Use automated testing to spot incidents faster: a test suite that covers the code helps you locate the problem more quickly when diagnosing a bug.
- Make smaller changes: smaller changes make it easier to pinpoint which change caused an incident.
- Regularly review and analyze your incident data to identify areas for improvement. Implement changes to optimize the incident response process and reduce Time to Restore Service over time.
- Perform deep root cause analysis for all incidents to identify the underlying causes and address them to prevent future incidents.
- Apply automation to streamline incident response processes, such as automated alerts, diagnostics, and fixes. It will reduce the time required to resolve incidents.
- Conduct regular tests to identify and resolve issues before they impact users. It will prevent incidents and reduce Time to Restore Service.
- Enable effective communication among your team members during incident response. It ensures that issues are resolved quickly and prevents delays due to miscommunication.
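The regular-review practice above can be sketched as grouping incident downtime by week to see whether the metric is trending in the right direction. The incident data here is made up for illustration:

```python
from datetime import datetime, timedelta
from collections import defaultdict

# Hypothetical incident records: (start, resolved) timestamps.
incidents = [
    (datetime(2024, 5, 6, 9, 0),  datetime(2024, 5, 6, 10, 0)),   # 60 min
    (datetime(2024, 5, 7, 9, 0),  datetime(2024, 5, 7, 9, 30)),   # 30 min
    (datetime(2024, 5, 14, 9, 0), datetime(2024, 5, 14, 9, 15)),  # 15 min
]

# Group downtime by ISO (year, week) to track average restore time per week.
by_week = defaultdict(list)
for start, resolved in incidents:
    by_week[tuple(start.isocalendar()[:2])].append(resolved - start)

weekly = {week: sum(ds, timedelta()) / len(ds) for week, ds in by_week.items()}
for week, restore_time in sorted(weekly.items()):
    print(week, restore_time)
```

A falling weekly average suggests the response process is improving; a rising one is a prompt to dig into the recent incidents.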
Simple way to analyze Time to Restore Service
Manual calculations can provide useful insights, but an automated tool like Axify.io can track Time to Restore Service effortlessly. Axify is a single platform for observing all the key performance indicators that will help you improve your development and delivery processes. It provides clear dashboards and real-time tracking of the DORA metrics, simplifying the whole process and letting teams concentrate on making improvements.
Axify implements all four DORA metrics:
- Time to Restore Service — measures the time needed for the system to recover from an incident in production.
- Deployment Frequency — measures how often an organization successfully deploys to production. This important metric makes it easier to test, deploy, provide feedback, and roll back problems, in addition to increasing perceived value for your customers.
- Lead Time for Changes — the time it takes from the first commit until the code runs successfully in production. This metric lets you assess the efficiency of software development cycles and tends to drive organizational, human, and technical changes.
- Change Failure Rate — measures the percent of deployments causing a failure in production, and is calculated by dividing the number of failures by the total number of deployments.
Keeping your eye on Time to Restore Service can yield revealing insights about overall system health. By making diligent efforts to reduce it, you will improve system stability, fostering a streamlined operation and a satisfied team.
To find out more, read our article: Understanding DORA metrics: your complete guide